PidginBaseline : Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin
This repository contains the code for the paper "Towards Supervised and Unsupervised Neural Machine Translation Baselines for Nigerian Pidgin", presented at the AfricaNLP workshop at the International Conference on Learning Representations (ICLR) 2020, April 2020, Addis Ababa, Ethiopia.
Link to paper - https://arxiv.org/abs/2003.12660
git clone https://github.com/orevaoghene/pidgin-baseline
cd pidgin-baseline
pip install -r requirements.txt
./get_data.sh
The above commands will:
- Clone the repository
- Change your present working directory to the cloned repository
- Install all requirements
- Download and preprocess the train, dev and test sets.
Now that you have the data, you can specify your required training configuration in the config.yaml file. For more information about the configuration options, please refer to the JoeyNMT configuration documentation. The configuration files used in our experiments are available in the experiments folder.
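To give a sense of the file's shape, here is a minimal JoeyNMT-style configuration sketch. The section and field names follow JoeyNMT's documented schema, but the specific paths, language codes, and hyperparameter values below are illustrative assumptions, not the settings used in the paper; see the experiments folder for the actual configurations.

```yaml
name: "pidgin_baseline_sketch"    # hypothetical experiment name

data:
    src: "pcm"                    # assumed source language code
    trg: "en"                     # assumed target language code
    train: "data/train"           # assumed data path prefixes
    dev: "data/dev"
    test: "data/test"
    level: "bpe"                  # or "word" for the word-level models
    lowercase: False

training:
    random_seed: 42
    optimizer: "adam"
    learning_rate: 0.001
    batch_size: 64
    epochs: 30
    model_dir: "models/pidgin_baseline_sketch"

model:
    encoder:
        type: "transformer"
        num_layers: 4
        hidden_size: 256
    decoder:
        type: "transformer"
        num_layers: 4
        hidden_size: 256
```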
If you plan to train with byte pair encodings, you will need to run the learn_bpe shell script before training, as it learns the byte pair encodings needed.
./learn_bpe.sh
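For intuition on what this step does: byte pair encoding repeatedly merges the most frequent pair of adjacent symbols in the training vocabulary, so frequent words stay whole while rare words split into subword units. Below is a toy, self-contained sketch of the learning loop in the style of Sennrich et al. (2016); the repository's learn_bpe.sh presumably wraps a full subword library rather than code like this.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict (toy version)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# "lo" and then "low" are merged first because they occur most often.
print(learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2))
```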
Once you have specified the necessary configurations and learned the byte pair encodings (if needed), you can start training by running the train_model shell script.
./train_model.sh
After you run the train_model shell script, you will be prompted to specify an experiment name.
To run the unsupervised baselines, follow the instructions in the PidginUNMT repository.
English to Pidgin Translation (BLEU):
- Unsupervised Model (word-level) - 5.18
- Supervised Model (word-level) - 17.73
- Supervised Model (BPE) - 24.29
Pidgin to English Translation (BLEU):
- Unsupervised Model (word-level) - 7.93
- Supervised Model (word-level) - 24.67
- Supervised Model (BPE) - 13.00
Please refer to the experiments folder to see the translations produced by the different models, as well as to access the trained model weights.
Special thanks to the Masakhane group (website and GitHub) for catalysing this work.