Machine Translation Archive - Statistics

This web application

  1. runs a daily cron job that scrapes data in several categories from the Machine Translation Archive,
  2. crunches this data into statistics on the number of machine translation publications per category,
  3. stores these statistics as JSON in a CouchDB database, one document per category,
  4. and finally has a frontend iterate over the documents of this CouchDB database and present each document as a stacked bar graph.
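
Step 2 can be pictured as a simple tally. The following is only a minimal sketch, not the app's actual code: the method name and the input shape (one [label, year] pair per scraped publication) are assumptions made for illustration.

```ruby
# Hypothetical helper: counts publications per label and year.
# `records` is assumed to hold one [label, year] pair per publication.
def tally_by_label_and_year(records)
  # Nested hash: unknown labels get a fresh inner hash whose counts start at 0.
  counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
  records.each { |label, year| counts[label][year] += 1 }
  counts
end

records = [['English', 1990], ['English', 1990], ['German', 1991]]
counts = tally_by_label_and_year(records)
puts counts['English'][1990]
```

A nested hash keeps the per-year counts grouped under each label, which maps naturally onto one data series per label later on.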

Please note that this is only a quickly hacked-together web app, written mainly to try things out in Ruby. The numbers and statistics it provides should be taken with care, since data scraping and related tasks such as string matching are error-prone.

Or, in John Hutchins' words:

"...alert users to possible misinterpretations. Firstly, it covers English-language publications only, and most are papers from conferences (there are hardly any journal papers). Secondly, the coverage is satisfactory only since 1990. I am adding from time to time older materials, but there is a great deal to be done - as yet only some of the 1950s has been covered - there is a huge amount from the 1960s and 1970s to be included. Thirdly there are vagaries in the way indexes are constructed - as a simple example, the papers from a workshop or conference devoted entirely to a single topic (e.g. statistical MT, or morphology, or syntax) are entered only once under the conference title and not individually - hence, the totals for such topics are under-represented. Fourthly, there are 'inconsistencies' in the language indexes; language-pairs treated in 'depth' in a paper are of course entered, but language-pairs treated more marginally are sometimes included and sometimes omitted..."

Data and Categories

The source data is scraped from the machine translation archive, an electronic repository and bibliography of articles, books and papers on topics in machine translation, computer translation systems, and computer-based translation tools, compiled by John Hutchins for the European Association for Machine Translation on behalf of the International Association for Machine Translation.

Data for the following categories is scraped and derived from the archive:

  • number of publications related to a certain language per year (link)
  • number of publications related to a certain language pair per year (link)
  • number of publications per country of origin per year (link)
  • number of publications focusing on a certain method per year (link)
  • number of publications focusing on a specific MT-application use per year (link)

For plotting, the output of each category is limited to its top 15 data series overall.
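
A cut-off like this could be implemented along the following lines. This is a sketch under assumptions: it ranks series by their summed counts over all years, and it assumes each series is a hash with a 'label' and a 'data' array of [year, count] pairs. Neither the ranking criterion nor the pair layout is confirmed by this README.

```ruby
# Hypothetical helper: keeps the `limit` series with the largest overall totals.
# Each series is assumed to look like
#   { 'label' => 'English', 'data' => [[1990, 12], [1991, 7]] }
def top_series(series, limit = 15)
  series.max_by(limit) { |s| s['data'].sum { |_year, count| count } }
end

series = [
  { 'label' => 'English', 'data' => [[1990, 12], [1991, 7]] },
  { 'label' => 'German',  'data' => [[1990, 3]] },
  { 'label' => 'French',  'data' => [[1991, 5]] }
]
top_series(series, 2).each { |s| puts s['label'] }
```

`max_by` with a count argument returns the selected series in descending order of their totals, so the result can be handed to the plot as is.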

Technologies used

Technologies used in this app include:

  • the require_all gem in the Rakefile, which is responsible for the daily data-scraping cron job, to allow easy extensibility
  • Cloudant's CouchDB service for storing the documents, one per category
  • Sinatra together with jQuery and Flot for plotting the documents


The app can easily be reused for similar purposes, or extended simply by placing another class into the stats subdirectory. The Rakefile task responsible for the cron job looks into the stats subdirectory, where each file contains a class that scrapes data for one or more specific categories. The classes need to have the following structure:

class UniqueClassName < Stats
  register_stat 'unique_class_name'

  # does the actual scraping, taking a URL and data as arguments
  def do_the_scrap url, data
    # ... scrape the URL and fill in the data here ...
    return 'unique_name_of_couchDB_document_to_store_data_to'
  end
end
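
How such classes could be picked up is sketched below. The Stats base class is not shown in this README, so everything here is an assumption for illustration: the REGISTRY constant, the body of register_stat, the LanguageStats class and the example URL are all hypothetical. The sketch also assumes the files in the stats subdirectory have already been loaded (the Rakefile uses the require_all gem for that).

```ruby
# Hypothetical base class; the app's real Stats class may differ.
class Stats
  # Maps each registered name to the class that registered it.
  REGISTRY = {}

  # Called in the body of a subclass, so `self` is that subclass.
  def self.register_stat(name)
    REGISTRY[name] = self
  end
end

# Hypothetical scraper class following the structure described above.
class LanguageStats < Stats
  register_stat 'language_stats'

  def do_the_scrap(url, data)
    # A real implementation would fetch `url` and fill `data` here.
    'languages'
  end
end

# What a Rake task might do once all classes are loaded:
Stats::REGISTRY.each do |name, klass|
  document_name = klass.new.do_the_scrap('http://example.invalid/', {})
  puts "#{name} -> #{document_name}"
end
```

With a registry like this, adding a category only requires a new file with a new class; the task itself does not need to change.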

The JSON data used has the following format:

    {
      "label": "Some label for data series, e.g. English",
      "data": [ ... ]
    },
    {
      "label": "another label",
      "data": ...
    }
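
Producing JSON of this shape from per-year counts might look like the sketch below. The to_series helper is hypothetical, and the [year, count] layout of each data point is an assumption borrowed from Flot's usual series format rather than something this README specifies.

```ruby
require 'json'

# Hypothetical helper: turns { label => { year => count } } into an array of
# labelled series, with data points sorted by year.
def to_series(counts)
  counts.map do |label, per_year|
    # Hash#sort yields [year, count] pairs ordered by year.
    { 'label' => label, 'data' => per_year.sort }
  end
end

counts = { 'English' => { 1991 => 7, 1990 => 12 } }
puts JSON.pretty_generate(to_series(counts))
```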

Furthermore, the following URLs can be used to retrieve a JSON representation of the data already available: