
Configurations for some of our SOLR indices, and some scripts for indexing, transforming and formatting TEI stuff. See also: https://github.com/kb-dk/text-service


Solrs and Snippets

All the software consists of variations on a single theme. The theme can be described as follows:

Assume that we have (semi)structured (meta)data in a database or repository of some kind. Then the snippet server is used for performing various operations upon those data.

Examples of such operations are returning:

  • documents prepared for indexing
  • pieces of text for reading
  • image URIs for viewing pages
  • pieces of metadata for a detailed presentation

We refer to the store as the Snippet Server, the data inside it as data, and the results returned from the operations as Snippets. The word snippet reflects the fact that what is returned is usually just a part of a document.

Currently

  • all the data is in XML format
  • the snippet server functionalities are written in XSLT, XQuery, or both
  • the snippets are returned in JSON, HTML or XML

The Snippet Server has to support basic CRUD functionality. Indexing is currently done with SOLR, and the snippet CRUD operations are handled by eXist.
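As an illustration of that division of labour, the sketch below shows the two kinds of request side by side. The host names, ports and the adl collection name are taken from the examples later in this README; the SOLR query field text and the resource name some-document.xml are assumptions, not something defined here.

    # Full-text search goes to SOLR (the query field "text" is hypothetical)
    curl "http://localhost:8983/solr/adl/select?q=text:snippet&wt=json"

    # The snippet itself is read from (or written to) eXist over its REST API
    # ("some-document.xml" is a placeholder)
    curl "http://localhost:8080/exist/rest/db/adl/some-document.xml"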

Granularity, Identifiers and Indexers

The data used are stored on GitHub. For example, the Archive for Danish Literature corpus is at

https://github.com/kb-dk/public-adl-text-sources

Many of the corpora used are in private repositories.

All of them are in Text Encoding Initiative (TEI) markup. This means that they are basically ordered hierarchies of overlapping content objects.

(Figure: a document as a tree of content objects)

On volumes, works, trunks and leaves

An object in the content hierarchy is a work if it is annotated with metadata. Work units are the ones returned by the search engine in the result set. The granularity is an editorial issue: the higher the density of metadata annotations, the more work nodes there are in a volume, the less text there is in each work, and the higher the granularity.

A leaf is the smallest unit of the tree that can be identified, and it is therefore retrievable and indexable. For each work in a result set, the user interface gives a list of leaf nodes relevant to the search. Leaf nodes can be quoted, but they do not usually appear in tables of contents.

Trunk nodes are contained in work nodes. They may contain other trunk, work or leaf nodes. A trunk can be addressed, so it is possible to send a URI to someone and say: read chapter 5, it is so good! Trunks are indexed and searchable in principle; however, the user interface only supports them in tables of contents and quotation services.

A volume is what comes closest to a physical book. It contains one or more work nodes. If a volume contains one and only one work, we refer to it as a monograph.

All text is indexed down to the leaf nodes, basically at the paragraph level, which implies:

  • Paragraph in prose: <p> ... </p>
  • Speech in drama: <sp> ... </sp>
  • Strophe in poetry: <lg> ... </lg>

Hence, even if a work is five hundred pages long, we can find the paragraphs, speeches or strophes relevant to a user's search and provide a means to address them.

Identifying a node

We need a method to identify text segments so that we can reference and quote them. Both we and our users need that. Here is how we achieve it:

Documents are indexed in a SOLR search engine. The Indexer, our software for loading the search engine, traverses each TEI document tree, creating SOLR documents as it goes.

Before we do that, though, we make sure that every node is identifiable by an ID, i.e., we ensure that each element has an xml:id. The Indexer must check whether a node has metadata annotations, i.e., whether it is a work, in which case it has to pick those up. Those data are stored in the TEI header; the convention is that each work carries a reference to its metadata.

Hence, we have a three-dimensional space:

  • document ID
  • node ID
  • metadata ID

Anything that should be findable by a user in the frontend must have a SOLR document; everything that should be referenceable must have an ID. However, for most practical tasks you only need to take the first two dimensions into account.
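Purely as a sketch of that three-dimensional space, a single leaf-level SOLR document could be posted over SOLR's JSON update API roughly as below. All field names (id, document_id, node_id, metadata_id, text) are invented for this illustration; the actual fields are defined by the index configurations in this repository.

    # Hypothetical leaf document; field names are illustrative only
    curl -X POST -H "Content-Type: application/json" \
         "http://localhost:8983/solr/adl/update?commit=true" \
         --data-binary '[{
           "id": "adl-texts-munp1-shoot-workid57881",
           "document_id": "texts/munp1.xml",
           "node_id": "workid57881",
           "metadata_id": "workid57881",
           "text": "..."
         }]'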

The document with the following URI as source

https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/munp1.xml

will have its user interface at

https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid57881

(1) adl represents the collection, (2) texts-munp1 is short for the file path, that is, directory and file name; i.e., the file is adl/texts/munp1.xml. (3) Finally, shoot-workid57881 identifies (contains the node ID of) the part of the document containing Gustaf Munch-Petersen's poem søvnen, which is annotated as being a work.

The same poem can also be referred to as a point within a collection of poems, ”det nøgne menneske”:

https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid57312#workid57881

in which case we lose the connection to the metadata annotation of the work søvnen.

Since file paths can be long and hyphens are permitted in an xml:id, I separate the file path from the node ID with -shoot-; the volume node ends with -root. I.e.,

https://tekster.kb.dk/text/adl-texts-munp1-root
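As a sketch of how such an address decomposes, assuming the slug always has the form collection-filepath-shoot-nodeID and that the hyphens in the file-path part stand for directory separators, it can be picked apart in the shell like this:

    # Hypothetical decomposition of a text-service address; not a script from this repository
    slug="adl-texts-munp1-shoot-workid57881"
    collection="${slug%%-*}"        # adl
    rest="${slug#*-}"               # texts-munp1-shoot-workid57881
    filepath="${rest%-shoot-*}"     # texts-munp1
    nodeid="${rest#*-shoot-}"       # workid57881
    echo "file: ${collection}/${filepath//-//}.xml  node: ${nodeid}"
    # prints: file: adl/texts/munp1.xml  node: workid57881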

Document capabilities

see corrs-and-ids.md

How to install the Snippet Server and its Data

The installation is more or less automatic. It uses the eXist server's REST API, so all data are sent to the server using PUT requests.
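For orientation, this is roughly what each such PUT looks like when done by hand with curl; the file and collection names below are only examples, and the ant upload target described further down performs this for the whole build tree:

    # Store a single XML file in eXist over its REST API; curl prompts for the admin password
    curl -u admin -X PUT \
         -H "Content-Type: application/xml" \
         --data-binary @build/adl/texts/munp1.xml \
         "http://just.an.example.org:8080/exist/rest/db/adl/texts/munp1.xml"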

The installation works by copying the data into a build directory in the source tree and then uploading it.

ant -p

shows you the targets. The current ones are listed below.

Build and data preparation targets

The ant targets, what they do, and what they depend on:

  • ant clean: Delete ./build
  • ant service (depends on clean): Creates ./build/system and ./build/text-retriever. Copies the text-service index definition to system, and all scripts and transforms common to adl, gv and sks into the file system (basically the content of exporters/common)
  • ant base_service (depends on service): Adds functions specific to adl, letters, tfs, gv, and sks
  • ant other_services (depends on service): For installing the pmm and lhv services
  • ant add_letters: Adds scripts for Danmarks Breve (deprecated)
  • ant add_letter_data: Adds data for Danmarks Breve (deprecated)
  • ant add_letters_ng: Adds Danmarks Breve to text-service, i.e., ng as in next generation
  • ant add_grundtvig (depends on base_service): Copies all gv data into the build area. A complicated task, since it creates an entirely new directory structure and forks an external script
  • ant add_base_data (depends on base_service): Copies adl, tfs and sks
  • ant add_other_data (depends on other_services): Copies data for pmm and holberg
  • ant upload -Dhostport=just.an.example.org:8080: Installs the text-service backend on http://just.an.example.org:8080. Requires the password for the user "admin" on that server

The upload function is implemented as a Perl script executed by ant. It requires the Perl library libwww-perl, available as a standard package on Linux or from CPAN.
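On Debian-style systems the library is typically available from the package manager; otherwise CPAN works. The package and module names below are the standard ones, not something defined by this repository:

    # Either of these should provide the LWP modules the upload script needs
    sudo apt-get install libwww-perl
    # or
    sudo cpan LWP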

Example

To install a snippet server on a host with hostname and port number just.an.example.org:8080, use the following to build and install into the database:

 ant service
 ant base_service
 ant add_base_data
 ant upload -Dhostport=just.an.example.org:8080

Your new snippet server will contain adl, tfs and sks. To set the permissions of all scripts in one go, "retrieve" the following URI

 http://admin@just.an.example.org:8080/exist/rest/db/text-retriever/xchmod.xq

which obviously requires the password for the "admin" user on just.an.example.org:8080.
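One way to do that retrieval from the command line is shown below; curl will prompt for the admin password:

    curl -u admin "http://just.an.example.org:8080/exist/rest/db/text-retriever/xchmod.xq"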

It sets (at least on some eXist installations) the execute permissions on all *.xq files. It does not always work, and as of this writing it is not known when and where it works. If it fails, you have to set the permissions manually according to the eXist manual; see your server's dashboard at

http://just.an.example.org:8080/exist/apps/dashboard/index.html

The Snippet Server and its arguments

The scripts and transforms are in the directory exports. For ADL the following are available

  • open-seadragon-config.xq (web service providing JSON for OSD)
  • present.xq (general purpose presentation script)
  • present-text.xq (a detagger, it extracts raw text from the file)

Most Snippet Server scripts support the following arguments

Some more examples

Ingest and Indexing utilities

These utilities require the presence of a local file system with the material to be loaded.

Storing to eXist

./indexing/exist_loader.pl <options>
where options are
    --load <directory>
        directory from which to read files for loading
    --get <directory>
        directory to which retrieved files are written
    --delete <directory with a backup>
        files found in that directory will be deleted from the database
        if files with the same name exist there
    --suffix <suffix>
        file suffix to look for in the directory, for example xml
    --target <target name>
        basically the database name. Default is
    --context <context>
        root for the REST services. Default is /exist/rest/db/
    --user <user name>
    --password <password of user>
    --host-port <host and port for the server>
        Default is localhost:8080

For example

exist_loader.pl --file-list files_to_be_indexed.text \
		--user admin \
		--password secret \
		--host-port localhost:8080  \
		--context /exist/rest/db/adl/

will load the XML files named in files_to_be_indexed.text into a database with the base URI

http://localhost:8080/exist/rest/db/adl/
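As a quick sanity check, a stored document can be fetched back over the same REST interface. The resource name below is only an example; it depends on what was actually loaded:

    # Hypothetical example: retrieve one of the loaded XML files from the adl database
    curl -u admin "http://localhost:8080/exist/rest/db/adl/munp1.xml"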
 

Data sources

To instantiate a text service, you need not only the software but also the data.

Published collections

To generate the data sources for lh and tfs you need https://github.com/kb-dk/alto-to-tei-tools

Unpublished demo collections

Adding lhv (a.k.a. Ludvig Holbergs Skrifter) would be fairly easy.

Copying GV

The structure of the GV corpus is too complicated to copy with the ant copy functions, so the copying is implemented as an external Perl script,

utilities/copy_grundtvig.pl

which does two essential things: (1) it ingests only the files published according to the GV filter, and (2) it preprocesses the GV TEI files so that they work with the text-service's indexing and retrieval practices.

The GV filter is just a set of (currently) 442 wildcards we use for copying the files.

...

1808/1808GV/1808_99a/1808_99a_*.xml
1809/1809GV/1809_105/1809_105_*.xml
1809/1809GV/1809_106/1809_106_*.xml
1809/1809GV/1809_107/1809_107_*.xml
1809/1809GV/1809_108/1809_108_*.xml
1809/1809GV/1809_109/1809_109_*.xml
1809/1809GV/1809_111/1809_111_*.xml
1809/1809GV/1809_113/1809_113_*.xml
1809/1809GV/1809_115/1809_115_*.xml
1809/1809GV/1809_116/1809_116_*.xml

...

It is maintained by the Grundtvig project.

Running solrizr and loading SOLR docs into the cloud

indexing/solr_updater.pl \
    --file-list=files_to_be_indexed.text \
    --param exist_host=localhost \
    --param exist_port=8080 \
    --param service=adl \
    --param op=solrize \
    --param solr_host=localhost \
    --param solr_port=8983 \
    --param collection=adl

This software depends on the Perl module URI::Template, which can be installed with, e.g.:

sudo cpan URI::Template

Minor utilities

  • XSLT-transform all files with --suffix xml in the --directory ./periods/ using the style sheet given by --sheet preprocess.xsl:

    indexing/transform-all.pl --sheet exporters/common/preprocess.xsl --directory ./periods/ --suffix xml

  • Create or validate BagIt manifests:
    • create-bag.rb
    • validate-bag.rb