Data and code used in the 2014 WMT paper, "Efficient Elicitation of Annotations for Human Evaluation of Machine Translation"

TrueSkill for WMT

Source code used in the 2014 WMT paper, "Efficient Elicitation of Annotations for Human Evaluation of Machine Translation"

  • Keisuke Sakaguchi (keisuke[at]
  • Matt Post
  • Benjamin Van Durme

Last updated: June 12th, 2015

This document describes the method proposed in the following paper:

  @inproceedings{sakaguchi2014efficient,
    author    = {Sakaguchi, Keisuke and Post, Matt and Van Durme, Benjamin},
    title     = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},
    booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
    month     = {June},
    year      = {2014},
    address   = {Baltimore, Maryland, USA},
    publisher = {Association for Computational Linguistics},
    pages     = {1--11},
    url       = {}
  }

Required (and optional) Python modules:

Example Procedure:

    1. Preprocessing: convert an XML file (from Appraise) into a CSV file.
    • Create the result directory (mkdir result) if it does not exist.
    • cd data
    • python ABC.xml
    • The XML/CSV file must contain a single language pair.
    2. Training: run the TrueSkill training script in the src directory.
    • cd ../src
    • cat ../data/ABC.csv | python ../result/ABC -n 2 -d 0 -s 2
    • For more details: python --help
    • Other parameters can be changed in the training script if needed.
    • For clustering (i.e. grouped ranking), execute multiple training runs (100+ recommended) for each language pair (e.g. for fr-en, runs fr-en0 through fr-en99).
    • The result, named OUT_ID_mu_sigma.json, is written to the result directory.
    • To use Expected Win instead, run python -s 2 ../result/ABC
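The training step above infers a Gaussian skill estimate (mu, sigma) per system with TrueSkill. As background, here is a minimal sketch of the standard two-player TrueSkill update after a single "A beats B" judgment. This is illustrative only, not the repository's code; the draw case and all variable names are assumptions, and beta defaults to the conventional 25/6:

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def trueskill_update(mu_w, sig_w, mu_l, sig_l, beta=25.0 / 6.0):
    """One TrueSkill update after 'winner beats loser' (no-draw case).

    Returns updated (mu_w, sig_w, mu_l, sig_l). The winner's mean rises,
    the loser's falls, and both variances shrink.
    """
    c = math.sqrt(2 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)      # mean-shift factor
    w = v * (v + t)          # variance-shrink factor (0 < w < 1)
    mu_w_new = mu_w + (sig_w ** 2 / c) * v
    mu_l_new = mu_l - (sig_l ** 2 / c) * v
    sig_w_new = sig_w * math.sqrt(max(1 - (sig_w ** 2 / c ** 2) * w, 1e-9))
    sig_l_new = sig_l * math.sqrt(max(1 - (sig_l ** 2 / c ** 2) * w, 1e-9))
    return mu_w_new, sig_w_new, mu_l_new, sig_l_new
```

Starting from the usual prior (mu = 25, sigma = 25/3) for both systems, a single observed win moves the winner above 25, the loser below, and tightens both sigmas.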
    3. Grouped ranking: run the evaluation script in the eval directory.
    • cd ../eval
    • python fr-en ../result/fr-en -n 100 -by-rank -pdf
    • For more details: python --help
    • The -pdf option may raise a RuntimeError; check whether the PDF file was still generated successfully.
    4. (optional) Tune the decision radius (for accuracy) with
    • e.g. cat data/sample-fr-en-{dev|test}.csv | python src/ -d 0.1 -i result/fr-en0_mu_sigma.json
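The decision radius d controls when two systems are called tied. A hedged sketch of the idea (the function names and data layout here are assumptions for illustration, not's actual interface): predict a tie whenever the two inferred means lie within d of each other, and score the predictions against gold pairwise judgments:

```python
def predict(mu_a, mu_b, d):
    """Predict 'A', 'B', or 'tie' from inferred means and decision radius d."""
    if abs(mu_a - mu_b) <= d:
        return 'tie'
    return 'A' if mu_a > mu_b else 'B'

def accuracy(pairs, mus, d):
    """Fraction of gold judgments reproduced at decision radius d.

    pairs: list of (sys_a, sys_b, gold) with gold in {'A', 'B', 'tie'}.
    mus:   {system_name: inferred mean} from a *_mu_sigma.json file.
    """
    correct = sum(predict(mus[a], mus[b], d) == gold for a, b, gold in pairs)
    return correct / len(pairs)
```

Tuning then amounts to sweeping d on a dev set and keeping the radius with the highest accuracy.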
    5. (optional) To see the next systems to be compared, run python src/scripts/ *_mu_sigma.json N
    • This outputs the next comparisons, given the current mu and sigma estimates (.json), for N free-for-all matches.
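One plausible way to schedule the next matches is to prefer systems whose skill estimates are still uncertain, i.e. those with the largest sigma. The sketch below is an illustrative heuristic under an assumed data layout, not necessarily what implements:

```python
def next_match(systems, n):
    """Pick the n systems with the largest sigma (most uncertain skill).

    systems: {name: (mu, sigma)}, as might be read from a *_mu_sigma.json
    file with json.load. Returns the n system names to compare next.
    """
    ranked = sorted(systems, key=lambda name: systems[name][1], reverse=True)
    return ranked[:n]
```

Calling this repeatedly (re-reading the updated JSON between rounds) would concentrate annotation effort where the ranking is least settled.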

Questions and comments:

  • Please e-mail to Keisuke Sakaguchi (keisuke[at]