TrueSkill for WMT

Source code used in the 2014 WMT paper "Efficient Elicitation of Annotations for Human Evaluation of Machine Translation"

  • Keisuke Sakaguchi
  • Matt Post
  • Benjamin Van Durme

Last updated: June 11th, 2020

This document describes the method proposed in the following paper:

  @InProceedings{sakaguchi-etal-2014-efficient,
    author    = {Sakaguchi, Keisuke  and  Post, Matt  and  Van Durme, Benjamin},
    title     = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation},
    booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
    month     = {June},
    year      = {2014},
    address   = {Baltimore, Maryland, USA},
    publisher = {Association for Computational Linguistics},
    pages     = {1--11},
    url       = {}
  }

Prerequisites:

  • Python 2.7
  • the Python modules listed in requirements.txt, installable with:

pip install -r requirements.txt

Example procedure:

    1. Preprocessing: convert an XML file (from Appraise) into a CSV file.
    • mkdir result if it does not exist.
    • cd data
    • python ABC.xml
    • The XML/CSV file must contain a single language pair.
    2. Training: run the Python (TrueSkill) training script in the src directory.
    • cd ../src
    • cat ../data/ABC.csv |python ../result/ABC -n 2 -d 0 -s 2
    • For more details: python --help
    • You can change other parameters if needed.
    • For clustering (i.e. grouped ranking), execute multiple runs (100+ is recommended) for each language pair (e.g. for fr-en, runs fr-en0 through fr-en99).
    • The result, named OUT_ID_mu_sigma.json, will appear in the result directory.
    • To use Expected Wins instead, run python -s 2 ../result/ABC
    3. To see the grouped ranking, run the evaluation script in the eval directory.
    • cd ../eval
    • python fr-en ../result/fr-en -n 100 -by-rank -pdf
    • For more details: python --help
    • The pdf option may raise a RuntimeError; check whether a PDF file was still generated successfully.
    4. (optional) To tune the decision radius (for accuracy), run e.g.
    • cat data/sample-fr-en-{dev|test}.csv |python src/ -d 0.1 -i result/fr-en0_mu_sigma.json
    5. (optional) To see the next systems to be compared, run python src/scripts/ *_mu_sigma.json N
    • This outputs the next comparisons, given the current mu and sigma results (.json), for N free-for-all matches.
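The training step above infers a Gaussian skill estimate (mu, sigma) per MT system from pairwise human judgments via TrueSkill, which is what the *_mu_sigma.json files store. As a rough illustration only (not the repository's code), here is a minimal sketch of the standard one-vs-one TrueSkill update with common defaults (mu0 = 25, sigma0 = 25/3, beta = sigma0/2) and no draws; the system names sysA/sysB/sysC are placeholders:

```python
import math

# Common TrueSkill defaults; the repository's code may use different settings.
MU0 = 25.0
SIGMA0 = MU0 / 3.0
BETA = SIGMA0 / 2.0

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def update_1vs1(winner, loser):
    """winner/loser are (mu, sigma) pairs; returns the updated pairs."""
    mu_w, s_w = winner
    mu_l, s_l = loser
    c = math.sqrt(2.0 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)   # additive correction to the means
    w = v * (v + t)         # multiplicative correction to the variances
    mu_w2 = mu_w + (s_w ** 2 / c) * v
    mu_l2 = mu_l - (s_l ** 2 / c) * v
    s_w2 = s_w * math.sqrt(max(1.0 - (s_w ** 2 / c ** 2) * w, 1e-9))
    s_l2 = s_l * math.sqrt(max(1.0 - (s_l ** 2 / c ** 2) * w, 1e-9))
    return (mu_w2, s_w2), (mu_l2, s_l2)

# Feed a stream of pairwise judgments: "winner beat loser".
ratings = {name: (MU0, SIGMA0) for name in ("sysA", "sysB", "sysC")}
for winner, loser in [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC")]:
    ratings[winner], ratings[loser] = update_1vs1(ratings[winner], ratings[loser])

# Rank systems by inferred mean skill; sigma shrinks as evidence accumulates.
for name, (mu, sigma) in sorted(ratings.items(), key=lambda kv: -kv[1][0]):
    print("%s: mu=%.2f sigma=%.2f" % (name, mu, sigma))
```

Each judgment pulls the winner's mu up and the loser's mu down, while both sigmas shrink; grouped ranking then clusters systems whose skill intervals overlap.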

Questions and comments:

  • Please e-mail Keisuke Sakaguchi (keisuke[at]
