Prerequisites: Maven 3 with Java 8 and UTF-8 encoding; verify with
Create and change into a folder where you want to store the projects:
mkdir ~/git ; cd ~/git
Build the hbz metafacture-core fork:
git clone https://github.com/hbz/metafacture-core.git
mvn clean install -DskipTests
Get and change into the mabxml-elasticsearch repo:
git clone https://github.com/hbz/mabxml-elasticsearch.git
See the .travis.yml file for details on the CI config used by Travis.
Index server setup
See also: Elasticsearch installation steps.
Download the latest 2.3.x Elasticsearch release, e.g. on Linux:
Unzip it and change into the new directory:
unzip elasticsearch-2.3.3.zip ; cd elasticsearch-2.3.3
Start the elasticsearch application in the bin/ folder in daemon mode (output is logged to logs/elasticsearch.log), and record the process id:
bin/elasticsearch -d -p pid
Access your local Elasticsearch server:
curl -X GET http://localhost:9200/
To shut down the Elasticsearch server, kill the process recorded in the pid file on startup:
kill `cat pid`
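Before killing the process, it can help to confirm that the pid file still points at a live process. The following is a minimal sketch, assuming the pid file written by the `-p pid` option above; the function name `is_running` is hypothetical and not part of Elasticsearch.

```shell
# Hypothetical helper: succeeds if the pid file exists, is non-empty,
# and names a process that is currently alive.
is_running() {
    pidfile="$1"
    # No pid file (or an empty one): nothing to check.
    [ -s "$pidfile" ] || return 1
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Example: only attempt the shutdown if the server is still running.
# is_running pid && kill "$(cat pid)"
```

The same check should work for any other pid file, since it only relies on the file containing a process id.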
To continue with the setup and usage below, leave the server running or restart it, and change back to the project root directory:
Web server setup
Download the minimal activator application to run the Play server (an offline version is also available, see the Play Framework downloads documentation):
Start the Play server from the project root in background production mode (output is logged to the console and logs/application.log; for development mode, replace start with run):
The web application's index page can now be accessed at http://localhost:9000/hbz01.
Press Ctrl+D to return to the shell (since we called start, the server remains running in the background).
To transform and index the data, POST to the transform/ route and pass arguments as query parameters.
Pass the directory containing the data to transform (full local path; adjust the sample below for your system), the file suffix, your Elasticsearch cluster name, the node IP address, and the index name, e.g.:
curl -XPOST "http://localhost:9000/hbz01/transform?dir=/home/fsteeg/git/mabxml-elasticsearch/test/&suffix=bz2&cluster=elasticsearch&hostname=127.0.0.1&index=hbz01"
This will index the data from the specified location to the cluster ‘elasticsearch’, using node ‘127.0.0.1’, into an index called ‘hbz01’.
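Since the transform URL carries five query parameters, it may be easier to assemble it from separate values than to edit the long string by hand. The sketch below assumes the route and parameter names shown above; the function name `transform_url` is hypothetical and not part of the project.

```shell
# Hypothetical helper: assemble the transform URL from its five arguments
# (data directory, file suffix, cluster name, node IP, index name).
# Values are inserted as-is, so they must not contain characters
# that would need URL encoding.
transform_url() {
    dir="$1"; suffix="$2"; cluster="$3"; hostname="$4"; index="$5"
    printf 'http://localhost:9000/hbz01/transform?dir=%s&suffix=%s&cluster=%s&hostname=%s&index=%s\n' \
        "$dir" "$suffix" "$cluster" "$hostname" "$index"
}

# Example with the sample values from above (requires the running Play server):
# curl -XPOST "$(transform_url /home/fsteeg/git/mabxml-elasticsearch/test/ bz2 elasticsearch 127.0.0.1 hbz01)"
```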
Index server data access
You can then GET a specific record in the index by hbz ID:
curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619'; echo
You can also exclude the Elasticsearch metadata:
curl -XGET 'http://127.0.0.1:9200/hbz01/mabxml/HT012786619/_source'; echo
For details on the various options see the GET API documentation.
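The two request forms above differ only in the trailing `_source` segment. As a minimal sketch, assuming the index name `hbz01` and type `mabxml` used in the samples, a small helper can build either URL from an hbz ID; the function name `record_url` is hypothetical.

```shell
# Hypothetical helper: build the Elasticsearch document URL for an hbz ID.
# Pass "source" as the second argument to request the record without
# the Elasticsearch metadata.
record_url() {
    id="$1"; mode="${2:-full}"
    url="http://127.0.0.1:9200/hbz01/mabxml/$id"
    if [ "$mode" = "source" ]; then
        url="$url/_source"
    fi
    printf '%s\n' "$url"
}

# Example (requires the running index from the steps above):
# curl -XGET "$(record_url HT012786619 source)"; echo
```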
Web server data access
You can also GET data by ID using the Play server:
Unlike the Elasticsearch index queries above (which serve JSON), this serves XML:
curl http://localhost:9000/hbz01/HT017665866 | xmllint --format -
To shut down the server, kill the process recorded in the RUNNING_PID file:
kill `cat target/universal/stage/RUNNING_PID`
When running in foreground development mode (activator run), pressing Ctrl+D stops the server.
We run this transformation daily using a cron job that calls the cron.sh script. Internal documentation: to fully understand what is done when, trace the entries in the crontab of hduser@weywot1.
The final index data is served at http://lobid.org/hbz01, with individual resource URLs like http://lobid.org/hbz01/HT012786619. Internal documentation: the application is deployed at sol@quaoar1:~/git/mabxml-elasticsearch, an Apache proxy is set up at emphytos:/etc/apache2/vhosts.d/lobid.org.conf.
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html