This is the repo for the paper "Exploiting Curriculum Learning in Unsupervised Neural Machine Translation" (to appear in Findings of EMNLP 2021).
This paper exploits curriculum learning (CL) in unsupervised neural machine translation (UNMT). Specifically, we design methods to estimate the quality of pseudo bi-text and apply a CL framework to improve UNMT. Please refer to the paper for more details.
- Python 3
- NumPy
- PyTorch
- fastBPE (generate and apply BPE codes)
- Moses (scripts to clean and tokenize text only - no installation required)
- Apex (for fp16 training)
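A possible setup sketch (the commands and sources below are illustrative; follow each project's own instructions, especially for compiling fastBPE and Apex):

```bash
# Illustrative dependency setup; see each project's README for details.
pip install numpy torch
git clone https://github.com/glample/fastBPE         # generate/apply BPE codes
git clone https://github.com/moses-smt/mosesdecoder  # tokenization/cleaning scripts only
git clone https://github.com/NVIDIA/apex             # fp16 training
```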
Difficulty computation requires cross-lingual word embeddings, which are obtained with the unsupervised training method MUSE. Alternatively, you can use the cross-lingual distances of word pairs that we extracted (stored in the CL_diff/data directory). Then you can run the following command to compute the difficulty file for your training data:
python CL_diff/compute_tfidf_wordtrans_diff.py <DISTANCE_FILE> <TRAINING_DATA_FILE> <OUTPUT_FILE>
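For intuition, here is a minimal sketch of the idea behind this step: score each training sentence by the TF-IDF-weighted average cross-lingual distance of its words, so that higher scores mean harder sentences. The file formats, the smoothed IDF, and the default distance for unseen words are assumptions for illustration, not the actual implementation of CL_diff/compute_tfidf_wordtrans_diff.py.

```python
import math
import sys
from collections import Counter

def load_distances(path):
    """Assumed format: one 'word ... distance' entry per line."""
    dist = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                dist[parts[0]] = float(parts[-1])
    return dist

def sentence_difficulties(dist_file, train_file, out_file):
    dist = load_distances(dist_file)
    with open(train_file, encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    # Document frequencies over the training corpus, for smoothed IDF.
    df = Counter()
    for toks in sentences:
        df.update(set(toks))
    n = len(sentences)

    with open(out_file, "w", encoding="utf-8") as out:
        for toks in sentences:
            tf = Counter(toks)
            num = den = 0.0
            for w in toks:
                # TF-IDF weight of the word within this sentence.
                weight = (tf[w] / len(toks)) * math.log(1 + n / (1 + df[w]))
                num += weight * dist.get(w, 1.0)  # unseen words: max distance 1.0
                den += weight
            out.write(f"{num / den if den else 1.0}\n")

if __name__ == "__main__":
    sentence_difficulties(sys.argv[1], sys.argv[2], sys.argv[3])
```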
This repo is built on the XLM toolkit and MASS. You can run the model with the following commands.
For XLM:
bash CL_XLM/run_unmt_ende.sh
For MASS:
bash CL_MASS/run_unmt_enro.sh
If you have multiple GPUs, please modify the scripts according to the XLM README; a sketch is shown below.
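For example, the XLM README launches multi-GPU training through `torch.distributed.launch`; a minimal sketch (the GPU count and training arguments are illustrative):

```bash
# Run the trainer on 8 GPUs (adapt NGPU and pass the same train.py
# arguments that the run_unmt_*.sh scripts use).
export NGPU=8
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py
```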
For en-de, en-fr, and en-ro, please download the pretrained models from the XLM README and MASS README.
For en-zh, our model can be downloaded through the following link.
Link | Password
---|---
https://pan.baidu.com/s/1vTQDjWF119EITVIHew-leA | tkvn
@inproceedings{lu2021,
  title={Exploiting Curriculum Learning in Unsupervised Neural Machine Translation},
  author={Lu, Jinliang and Zhang, Jiajun},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
  year={2021}
}