Code for On the Stability of System Rankings at WMT

This repository contains code which can be used to replicate or extend experiments described in the paper "On the Stability of System Rankings at WMT" by Rebecca Knowles, published at the Sixth Conference on Machine Translation (WMT21).

Requirements

This code relies on SciPy (https://www.scipy.org) and NumPy (https://numpy.org/).

It has been tested with Python versions 3.6.9 and 3.5.6, with SciPy versions 1.6.2, 1.5.2, and 1.1, and with NumPy versions 1.21.2 and 1.15.2.

Note that running with SciPy version 1.7.1 produces significance clusters that differ from those reported in the Findings papers and in this paper.
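To check which versions are installed in your environment, a snippet like the following prints them:

    # Print the installed SciPy and NumPy versions so they can be
    # checked against the tested versions listed above.
    import scipy
    import numpy

    print("SciPy:", scipy.__version__)
    print("NumPy:", numpy.__version__)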

Replicate Tables from Paper

To replicate the tables from the paper, do the following:

First run scripts/get_data.sh to download and extract data (to the data/ directory).

Next run scripts/run_all_rankings.sh to generate all rankings required to replicate tables in the paper (see rankings/ for output; see scripts/run_ranking.sh to understand filenames). Note that you may wish to edit scripts/run_all_rankings.sh to run scripts/run_ranking.sh jobs in parallel and/or submit them to a compute cluster; if you run it as-is, it will take quite some time to generate all rankings.

Finally, run scripts/run_compare_rankings.sh to generate values from Tables 2 and 3 and Figure 1 (these should match scripts/reference_tables.txt).
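If you prefer to drive these three steps from Python rather than invoking the shell scripts by hand, a minimal sketch (equivalent to running the scripts above in order) looks like this:

    # Minimal sketch: run the three replication steps in order.
    # Each step must finish successfully before the next begins
    # (check=True raises if a script exits with a nonzero status).
    import subprocess

    for script in [
        "scripts/get_data.sh",              # download and extract data into data/
        "scripts/run_all_rankings.sh",      # generate all rankings (slow; see note above)
        "scripts/run_compare_rankings.sh",  # produce the Table 2/3 and Figure 1 values
    ]:
        subprocess.run(["bash", script], check=True)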

Python Scripts

scripts/get_ranking.py produces an output file containing a WMT-style system ranking for a single language pair (in one year, with data collected through one interface). It provides options for removing arbitrary sets of systems from all computations or merely from the computation of significance clusters and the final ranking. It also provides the option to degrade the scores of human/reference translations in the data.

scripts/compare_rankings.py produces comparisons between a given pair of ranking variations (each generated by scripts/get_ranking.py and, if generated by the bash scripts provided, containing the same set of systems). In the example code, it uses the file scripts/pairs.txt to do this over all language pairs used in the paper.
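As a simple illustration of one common way to compare two rank orderings (not necessarily one of the comparisons that scripts/compare_rankings.py actually reports; the system names below are hypothetical), Kendall's tau can be computed with SciPy:

    # Illustrative only: Kendall's tau between two hypothetical rankings
    # of the same four systems. The comparisons reported in the paper are
    # implemented in scripts/compare_rankings.py.
    from scipy.stats import kendalltau

    systems = ["sysA", "sysB", "sysC", "sysD"]  # hypothetical system names
    ranking_1 = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
    ranking_2 = {"sysA": 2, "sysB": 1, "sysC": 3, "sysD": 4}

    tau, p_value = kendalltau([ranking_1[s] for s in systems],
                              [ranking_2[s] for s in systems])
    print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3f})")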

What does this code do (and not do)?

This code uses human annotation data released by WMT organizers to replicate and experiment with modifications to the system ranking process. It relies on the existing processed files for removal of annotators who did not pass quality assurance; it does not compute those values itself. It averages duplicates, computes z-scores, averages raw scores, averages z-scores, computes rankings and significance clusters, and outputs rankings. In most cases, it exactly replicates the system rankings as described in the WMT News Task Findings papers (2018-2020). Appendix A of the paper provides more detail on the instances where that is not the case.
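For background, the sketch below shows per-annotator z-scoring and a pairwise rank-sum significance test on toy data. It follows the general recipe described in the Findings papers, but it is not the repository's implementation (see scripts/get_ranking.py), and it omits details such as duplicate averaging and cluster construction:

    # Minimal sketch (toy data, not the repository's implementation):
    # standardize each annotator's raw scores by that annotator's mean
    # and standard deviation, then test two systems' z-scores for a
    # significant difference with a Wilcoxon rank-sum test.
    import numpy as np
    from scipy.stats import ranksums

    # Hypothetical raw direct-assessment scores: annotator -> (system, score).
    raw = {
        "annotator1": [("sysA", 80.0), ("sysB", 60.0), ("sysA", 90.0)],
        "annotator2": [("sysA", 70.0), ("sysB", 40.0), ("sysB", 55.0)],
    }

    z_scores = {"sysA": [], "sysB": []}
    for judgments in raw.values():
        scores = np.array([score for _, score in judgments])
        mean, std = scores.mean(), scores.std()
        for system, score in judgments:
            z_scores[system].append((score - mean) / std)

    # A system's overall score is its mean z; significance clusters are
    # built from pairwise tests between systems.
    print("sysA mean z:", np.mean(z_scores["sysA"]))
    print("sysB mean z:", np.mean(z_scores["sysB"]))
    stat, p = ranksums(z_scores["sysA"], z_scores["sysB"])
    print("rank-sum p-value:", p)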

Code is also provided to compare two sets of rankings; it expects rankings in the format produced by scripts/get_ranking.py. Note that if you intend to use this code beyond the provided bash wrapper, you should check whether the two rankings you are comparing contain the same set of systems (a mismatch should never occur with the scripts provided to replicate the paper's rankings). The -v/--verbose flag in scripts/compare_rankings.py outputs that information.
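For example, a simple guard against comparing mismatched system sets (assuming the system names have already been read from the two ranking files; the names below are hypothetical) could look like:

    # Hypothetical: system names read from two ranking files produced by
    # scripts/get_ranking.py. Refuse to compare mismatched system sets.
    systems_1 = ["sysA", "sysB", "sysC"]
    systems_2 = ["sysA", "sysB", "sysD"]

    if set(systems_1) != set(systems_2):
        mismatch = sorted(set(systems_1) ^ set(systems_2))
        raise ValueError(f"Rankings cover different systems: {mismatch}")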

Related Code

This code was written following the description in the Findings papers regarding how rankings are produced. In writing it, we also referenced code from the following repositories related to WMT ranking production:

https://github.com/ygraham/da-wmt16

https://github.com/ygraham/direct-assessment

https://github.com/ygraham/crowd-alone

Our code represents both an incomplete reimplementation (we do not perform quality assurance or work directly with the raw data) and an extension (we provide ways of modifying the rankings to test hypotheses about task composition) of these and the official WMT rankings.

Copyright

Multilingual Text Processing / Traitement multilingue de textes

Digital Technologies Research Centre / Centre de recherche en technologies numériques

National Research Council Canada / Conseil national de recherches Canada

Copyright 2021, Her Majesty the Queen in Right of Canada / Sa Majesté la Reine du Chef du Canada

Published under the GPL v3.0 License (see LICENSE).

Cite

If you use this code, you may wish to cite:

@inproceedings{knowles-2021-stability,
    title = "On the Stability of System Rankings at {WMT}",
    author = "Knowles, Rebecca",
    booktitle = "Proceedings of the Sixth Conference on Machine Translation",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.statmt.org/wmt21/pdf/2021.wmt-1.56.pdf",
}

You may also wish to cite the WMT Findings papers for the data used: https://aclanthology.org/W18-6401.bib, https://aclanthology.org/W19-5301.bib, https://aclanthology.org/2020.wmt-1.1.bib
