
Configurations for some of our SOLR indices, and some scripts for indexing, transforming and formatting TEI stuff. See also: https://github.com/kb-dk/text-service


Solrs and Snippets

All the software consists of variations on a single theme. The theme can be described as follows:

Assume that we have (semi)structured (meta)data in a database or repository of some kind. Then the snippet server is used for performing various operations upon those data.

Examples of such operations are returning:

  • documents prepared for indexing
  • pieces of text for reading
  • image URIs for viewing pages
  • pieces of metadata for a detailed presentation

We refer to the store as the Snippet Server, the data inside it as data, and the results returned from the operations as Snippets. The word snippet reflects the fact that what is returned is usually just a part of a document.

Currently

  • all the data is in XML format
  • the snippet server functionalities are written in XSLT, XQuery, or both
  • the snippets are returned in JSON, HTML or XML

The Snippet Server has to support basic CRUD functionality. Indexing is currently done with SOLR, and the snippet CRUD operations are handled by eXist.
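As an illustration of that division of labour, the sketch below shows the two kinds of request side by side. The host names, ports and the adl collection name are taken from the examples later in this README; the SOLR query field text and the resource name some-document.xml are assumptions, not something defined here.

    # Full-text search goes to SOLR (the query field "text" is hypothetical)
    curl "http://localhost:8983/solr/adl/select?q=text:snippet&wt=json"

    # The snippet itself is read from (or written to) eXist over its REST API
    # ("some-document.xml" is a placeholder)
    curl "http://localhost:8080/exist/rest/db/adl/some-document.xml"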

Granularity, Identifiers and Indexers

The data used are stored on GitHub. For example, the Archive for Danish Literature corpus is at

https://github.com/kb-dk/public-adl-text-sources

Many of the corpora used are in private repositories.

All of them are in Text Encoding Initiative (TEI) markup. This means that they are basically ordered hierarchies of overlapping content objects.

(Figure: a document as a tree of content objects)

On volumes, works, trunks and leaves

An object in the content hierarchy is a work if it is annotated with metadata. Work units are the ones returned by the search engine in the result set. The granularity is an editorial issue: the higher the density of metadata annotations, the more work nodes there are in a volume, the less text there is in each work, and the higher the granularity.

A leaf is the smallest unit of the tree that can be identified, and it is therefore retrievable and indexable. For each work in a result set, the user interface gives a list of leaf nodes relevant to the search. Leaf nodes can be quoted, but they do not usually appear in tables of contents.

Trunk nodes are contained in work nodes. They may contain other trunk, work or leaf nodes. A trunk can be addressed, so it is possible to send a URI to someone and say: read chapter 5, it is so good! Trunks are indexed and searchable in principle; however, the user interface only supports them in tables of contents and quotation services.

A volume is what comes closest to a physical book. It contains one or more work nodes. If a volume contains one and only one work, we refer to it as a monograph.

All text is indexed down to the leaf nodes, basically at the paragraph level, which implies:

  • Paragraph in prose: <p> ... </p>
  • Speech in drama: <sp> ... </sp>
  • Strophe in poetry: <lg> ... </lg>

Hence, even if a work is five hundred pages long, we can find the paragraphs, speeches or strophes relevant to a user's search and provide a means to address them.

Identifying a node

We need a method to identify text segments so that we can reference and quote them. Both we and our users need that. Here is how we achieve it:

Documents are indexed in a SOLR search engine. The Indexer, our software for loading the search engine, traverses each TEI document tree, creating SOLR documents as it goes.

Before we do that, though, we make sure that every node is identifiable by an ID, i.e., we ensure that each element has an xml:id. The Indexer must check whether a node has metadata annotations, i.e., whether it is a work, in which case it has to pick those up. Those data are stored in the TEI header; the convention is that each work carries a reference to its metadata.

Hence, we have a three-dimensional space:

  • document ID
  • node ID
  • metadata ID

Anything that should be findable by a user in the frontend must have a SOLR document; everything that should be referenceable must have an ID. However, for most practical tasks you only need to take the first two dimensions into account.
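Purely as a sketch of that three-dimensional space, a single leaf-level SOLR document could be posted over SOLR's JSON update API roughly as below. All field names (id, document_id, node_id, metadata_id, text) are invented for this illustration; the actual fields are defined by the index configurations in this repository.

    # Hypothetical leaf document; field names are illustrative only
    curl -X POST -H "Content-Type: application/json" \
         "http://localhost:8983/solr/adl/update?commit=true" \
         --data-binary '[{
           "id": "adl-texts-munp1-shoot-workid57881",
           "document_id": "texts/munp1.xml",
           "node_id": "workid57881",
           "metadata_id": "workid57881",
           "text": "..."
         }]'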

The document with the following URI as source

https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/munp1.xml

will have its user interface at

https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid57881

(1) adl represents the collection, (2) texts-munp1 is short for the file path, that is, directory and file name; i.e., the file is adl/texts/munp1.xml. (3) Finally, shoot-workid57881 identifies (contains the node ID of) the part of the document containing Gustaf Munch-Petersen's poem søvnen, which is annotated as being a work.

The same poem can also be referred to as a point within a collection of poems, ”det nøgne menneske”:

https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid57312#workid57881

in which case we lose the connection to the metadata annotation of the work søvnen.

Since file paths can be long and hyphens are permitted in an xml:id, I separate the file path from the node ID with -shoot-; the volume node ends with -root. I.e.,

https://tekster.kb.dk/text/adl-texts-munp1-root
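As a sketch of how such an address decomposes, assuming the slug always has the form collection-filepath-shoot-nodeID and that the hyphens in the file-path part stand for directory separators, it can be picked apart in the shell like this:

    # Hypothetical decomposition of a text-service address; not a script from this repository
    slug="adl-texts-munp1-shoot-workid57881"
    collection="${slug%%-*}"        # adl
    rest="${slug#*-}"               # texts-munp1-shoot-workid57881
    filepath="${rest%-shoot-*}"     # texts-munp1
    nodeid="${rest#*-shoot-}"       # workid57881
    echo "file: ${collection}/${filepath//-//}.xml  node: ${nodeid}"
    # prints: file: adl/texts/munp1.xml  node: workid57881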

Document capabilities

see corrs-and-ids.md

How to install the Snippet Server and its Data

The installation is more or less automatic. It uses the eXist server's REST API, so all data are sent to the server using PUT requests.
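For orientation, this is roughly what each such PUT looks like when done by hand with curl; the file and collection names below are only examples, and the ant upload target described further down performs this for the whole build tree:

    # Store a single XML file in eXist over its REST API; curl prompts for the admin password
    curl -u admin -X PUT \
         -H "Content-Type: application/xml" \
         --data-binary @build/adl/texts/munp1.xml \
         "http://just.an.example.org:8080/exist/rest/db/adl/texts/munp1.xml"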

The installation works by copying the data into a build directory in the source tree and then uploading it.

ant -p

shows you the targets. The current ones are listed below.

Build and data preparation targets

The ant targets, what they do, and what they depend on:

  • ant clean: Delete ./build
  • ant service (depends on clean): Creates ./build/system and ./build/text-retriever. Copies the text-service index definition to system, and all scripts and transforms common to adl, gv and sks into the file system (basically the content of exporters/common)
  • ant base_service (depends on service): Adds functions specific to adl, letters, tfs, gv, and sks
  • ant other_services (depends on service): For installing the pmm and lhv services
  • ant add_letters: Adds scripts for Danmarks Breve (deprecated)
  • ant add_letter_data: Adds data for Danmarks Breve (deprecated)
  • ant add_letters_ng: Adds Danmarks Breve to text-service, i.e., ng as in next generation
  • ant add_grundtvig (depends on base_service): Copies all gv data into the build area. A complicated task, since it creates an entirely new directory structure and forks an external script
  • ant add_base_data (depends on base_service): Copies adl, tfs and sks
  • ant add_other_data (depends on other_services): Copies data for pmm and holberg
  • ant upload -Dhostport=just.an.example.org:8080: Installs the text-service backend on http://just.an.example.org:8080. Requires the password for the user "admin" on that server

The upload function is implemented as a Perl script executed by ant. It requires the Perl library libwww-perl, available as a standard package on Linux or from CPAN.
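On Debian-style systems the library is typically available from the package manager; otherwise CPAN works. The package and module names below are the standard ones, not something defined by this repository:

    # Either of these should provide the LWP modules the upload script needs
    sudo apt-get install libwww-perl
    # or
    sudo cpan LWP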

Example

To install a snippet server on a host with hostname and port number just.an.example.org:8080, use the following to build and install into the database:

 ant service
 ant base_service
 ant add_base_data
 ant upload -Dhostport=just.an.example.org:8080

Your new snippet server will contain adl, tfs and sks. To set the permissions of all scripts in one go, "retrieve" the following URI

 http://admin@just.an.example.org:8080/exist/rest/db/text-retriever/xchmod.xq

which obviously requires the password for the "admin" user on just.an.example.org:8080.
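One way to do that retrieval from the command line is shown below; curl will prompt for the admin password:

    curl -u admin "http://just.an.example.org:8080/exist/rest/db/text-retriever/xchmod.xq"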

It sets (at least on some eXist installations) the execute permissions on all *.xq files. It does not always work, and as of this writing it is not known when and where it works. If it fails, you have to set the permissions manually according to the eXist manual; see your server's dashboard at

http://just.an.example.org:8080/exist/apps/dashboard/index.html

The Snippet Server and its arguments

The scripts and transforms are in the directory exports. For ADL the following are available

  • open-seadragon-config.xq (web service providing JSON for OSD)
  • present.xq (general purpose presentation script)
  • present-text.xq (a detagger, it extracts raw text from the file)

Most Snippet Server scripts support the following arguments

Some more examples

Ingest and Indexing utilities

These utilities require the presence of a local file system with the material to be loaded.

Storing to eXist

./indexing/exist_loader.pl <options>
where options are
    --load <directory>
        directory from which to read files for loading
    --get <directory>
        directory to which retrieved files are written
    --delete <directory with a backup>
        files found in that directory will be deleted from the database
        if files with the same name exist there
    --suffix <suffix>
        file suffix to look for in the directory, for example xml
    --target <target name>
        basically the database name. Default is
    --context <context>
        root for the REST services. Default is /exist/rest/db/
    --user <user name>
    --password <password of user>
    --host-port <host and port for the server>
        Default is localhost:8080

For example

exist_loader.pl --file-list files_to_be_indexed.text \
		--user admin \
		--password secret \
		--host-port localhost:8080  \
		--context /exist/rest/db/adl/

will load the XML files named in files_to_be_indexed.text into a database with the base URI

http://localhost:8080/exist/rest/db/adl/
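As a quick sanity check, a stored document can be fetched back over the same REST interface. The resource name below is only an example; it depends on what was actually loaded:

    # Hypothetical example: retrieve one of the loaded XML files from the adl database
    curl -u admin "http://localhost:8080/exist/rest/db/adl/munp1.xml"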
 

Data sources

To instantiate a text service, you need not only the software but also the data.

Published collections

To generate the data sources for lh and tfs you need https://github.com/kb-dk/alto-to-tei-tools

Unpublished demo collections

Adding lhv (a.k.a. Ludvig Holbergs Skrifter) would be fairly easy.

Copying GV

The structure of the GV corpus is too complicated to copy with the ant copy functions, so the copying is implemented as an external Perl script,

utilities/copy_grundtvig.pl

which does two essential things: (1) it ingests only the files published according to the GV filter, and (2) it preprocesses the GV TEI files so that they work with the text-service's indexing and retrieval practices.

The GV filter is just a set of (currently) 442 wildcards we use for copying the files.

...

1808/1808GV/1808_99a/1808_99a_*.xml
1809/1809GV/1809_105/1809_105_*.xml
1809/1809GV/1809_106/1809_106_*.xml
1809/1809GV/1809_107/1809_107_*.xml
1809/1809GV/1809_108/1809_108_*.xml
1809/1809GV/1809_109/1809_109_*.xml
1809/1809GV/1809_111/1809_111_*.xml
1809/1809GV/1809_113/1809_113_*.xml
1809/1809GV/1809_115/1809_115_*.xml
1809/1809GV/1809_116/1809_116_*.xml

...

It is maintained by the Grundtvig project.

Running solrizr and loading SOLR docs into the cloud

indexing/solr_updater.pl \
    --file-list=files_to_be_indexed.text \
    --param exist_host=localhost \
    --param exist_port=8080 \
    --param service=adl \
    --param op=solrize \
    --param solr_host=localhost \
    --param solr_port=8983 \
    --param collection=adl

This software depends on the Perl module URI::Template, which can be installed with, e.g.:

sudo cpan URI::Template

Minor utilities

  • XSLT-transform all files with --suffix xml in the --directory ./periods/ using the style sheet given by --sheet preprocess.xsl:

    indexing/transform-all.pl --sheet exporters/common/preprocess.xsl --directory ./periods/ --suffix xml

  • Create or validate BagIt manifests:
    • create-bag.rb
    • validate-bag.rb