Command line OAI-PMH client and incremental harvester.


Command line OAI-PMH incremental harvester. Data is harvested in monthly chunks.

$ metha-sync

All downloaded files are written to a directory below a base directory. The base directory is ~/.metha by default and can be adjusted with the METHA_DIR environment variable.

$ METHA_DIR=/tmp/harvest metha-sync -dir

To show the harvesting directory, you can use the -dir flag:

$ metha-sync -dir

Harvesting can be interrupted at any time. The data is currently harvested up to the last full day, so the newest records can lag behind the repository by up to a day.

Example: If the current date is Thu Apr 21 14:28:10 CEST 2016, the harvester requests all data between the repository's earliest date and 2016-04-20 23:59:59.

The HTTP client is resilient: a failed request is retried up to 8 times with an exponential backoff. You can stream records to stdout:

$ metha-cat

This will stream all harvested records to stdout. You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01

This will only stream records with a datestamp on or after 2016-01-01.

To just stream all data really fast, run find over the harvesting directory and decompress the files with zcat or, as here, unpigz.

$ find $(metha-sync -dir) -name "*gz" | xargs unpigz -c

To display basic repository information:

$ metha-id

To list all harvested endpoints:

$ metha-ls


Use a release or

$ go get


Currently the set, the format and the endpoint URL are joined with # and base64 encoded to form the name of the target directory. For example, the following decodes to Sounds#oai_dc#http://copac.jisc.ac.uk/oai-pmh:

$ echo "U291bmRzI29haV9kYyNodHRwOi8vY29wYWMuamlzYy5hYy51ay9vYWktcG1o" | base64 -d

If you have very long set names or a very long URL and the name of the target directory exceeds the file system limit (e.g. 255 bytes on ext4), the harvest won't work.

Harvesting Roulette

$ URL=$(sort -R <(curl -Lsf) | head -1); metha-sync $URL && metha-cat $URL

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting (use the -no-intervals flag)
  • limited repositories: metha will try up to 8 times with an exponential backoff
  • repositories that throw occasional HTTP errors although most of the responses look good (use the -ignore-http-errors flag)
  • funny XML entities (non-strict XML)