Scrape documents with associated metadata from
Switch branches/tags
Nothing to show
Clone or download


Scrape documents with associated metadata from


Get the code

git clone

Get ruby (>= 2.0.0)

gpg --keyserver hkp:// --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
\curl -sSL | bash -s stable
rvm install 2.2.3
rvm --default use 2.2.3

Install bundler

gem install bundler

Install Dependencies

cd ratsinfo
bundle install


To start scraping use on other console:


This will extract all documents to the path of the environment variable DOWNLOAD_PATH (defaults to "data") and convert it to xml files, containing metadata and full text of the pdfs

To scrape an individual session for example:

rake testmonth

To just download a tiny set of Data, only session data. Just for testing.

To display all tasks use:

rake -T

The download directory will have the following scheme:

  • each session have a directory, where the id is the directory name

  • every document belonging to this session will be extracted to this directory

  • additionally a JSON file is created, with the session id in its name. This is a machine-readable version of the index.htm file, which is contained in the document archives

We do now follow the OParl specification!

Deviations from the OParl spec:

  • Numerical id everywhere, because we don't yet serve the data on HTTP URIs