## Setup

Provided:

- Pre-trained de-en models (100k, 500k, 1M) + SPM dict/model
- Pre-trained en-de models (100k, 500k, 1M) + SPM dict/model
- Pre-processed en:de train, dev, test data (100k, 500k, 1M)
- Raw+preprocessed mono en and mono de train data (50k)
- Raw+preprocessed parallel en:de FT train data (25k)
- Pre-processed en:de dev + test data (2k, 2k)
- Basic pre-processing script


You can train your own models and prepare your own data; you are not obliged to use ours.


Make sure you use a GPU runtime.
Go to `Runtime->Change runtime type` to change the runtime resources.
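
A quick way to confirm that the runtime actually exposes a GPU is sketched below. It assumes an NVIDIA GPU whose driver ships the `nvidia-smi` tool; the helper name `gpu_visible` is made up for this example.

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        # No NVIDIA tooling found, so this is most likely a CPU-only runtime.
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

If this prints `False` on Colab, the runtime type has probably not been switched yet.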

First, mount your Google Drive.

Note to self: `hyp` (hypothesis) is the translation produced by the model; `ref` (reference) is the correct translation; `src` (source) is the original sentence.

In [60]:
# Define the variables used in the templates below as Python variables as well: when a %env line contains $NAME,
# IPython __substitutes the PYTHON variable NAME__ rather than a bash environment variable
SRC="de"
TGT="en"
%env SRC=$SRC
%env TGT=$TGT

env: l1=de
env: l2=en
env: SRC=de
env: TGT=en
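
As an alternative to relying on `$` substitution inside `%env`, the same variables can be set directly through `os.environ`. This is only a sketch of that option; the path is the data directory used later in this notebook, and `os.path.expandvars` is used here just to show what the shell will see.

```python
import os

SRC = "de"
TGT = "en"

# Export the language pair so that shell commands started with `!` can read it.
os.environ["SRC"] = SRC
os.environ["TGT"] = TGT

# expandvars resolves $SRC and $TGT from os.environ, much like the shell does.
path = os.path.expandvars("Data/train-euro-news-big.$SRC-$TGT/bin")
print(path)
```

Setting the variables this way keeps a Python copy (`SRC`, `TGT`) and an environment copy in sync, so both IPython substitution and plain bash expansion give the same value.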


In [None]:
from google.colab import drive
import os, sys
drive.mount('/content/drive/')

Mounted at /content/drive/


Install `torch` and `fairseq`. You can store the binaries in your Google Drive so that you don't need to reinstall them every session; do this however you like.

In [13]:
# NOTE: HIGHLY Recommended to use Python@3.9  
# %python3 -m venv .venv --system-site-packages
# !source .venv/bin/activate

# %env CWD=/content/drive/MyDrive/project-files
CWD="."
# Make sure to use an older torch version; fairseq doesn't work well with torch 2
# %pip install fairseq sacremoses subword_nmt
# %pip install --upgrade torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

# !git clone https://github.com/VarunGumma/fairseq
# each `!` line runs in its own subshell, so the cd and the install must share one line
# !cd fairseq && pip install -e ./
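
Since the note above warns against torch 2, a small check of the installed version can save a failed run. The sketch below reads package metadata only (it does not import torch); the helper names are invented for this example.

```python
from importlib import metadata
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def major_version(version: str) -> int:
    """Extract the major version number, e.g. '1.12.1+cu113' -> 1."""
    return int(version.split(".")[0])

torch_version = installed_version("torch")
if torch_version is None:
    print("torch is not installed")
elif major_version(torch_version) >= 2:
    print(f"torch {torch_version} found; consider pinning a 1.x release")
else:
    print(f"torch {torch_version} found")
```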

## Training & Fine-tuning

Training example: here we use the command line, but you can also use Hydra config files if you prefer.



In [23]:
# Path to binary data
# %env path_2_data=./Data/train-euro-news-big.$SRC-$TGT/bin
%env path_2_data=Data/train-euro-news-big.$SRC-$TGT/bin
# check you can train a model from scratch; --max-update 2 means training stops after two updates, i.e. almost immediately
# !fairseq-train \
#     "$path_2_data" \
#     --arch transformer_wmt_en_de \
#     --task translation \
#     --share-decoder-input-output-embed \
#     --optimizer adam \
#     --adam-betas '(0.9, 0.98)' \
#     --clip-norm 0.1 \
#     --lr 0.0006 \
#     --lr-scheduler inverse_sqrt \
#     --warmup-updates 2500 \
#     --warmup-init-lr 1e-07 \
#     --stop-min-lr 1e-09 \
#     --dropout 0.3 \
#     --weight-decay 0.0001 \
#     --criterion label_smoothed_cross_entropy \
#     --label-smoothing 0.1 \
#     --max-tokens 8192 \
#     --max-update 2 \
#     --update-freq 8 \
#     --patience 10 \
#     --scoring sacrebleu \
#     --eval-bleu \
#     --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
#     --eval-bleu-detok moses \
#     --eval-bleu-remove-bpe \
#     --eval-bleu-print-samples \
#     --best-checkpoint-metric bleu \
#     --maximize-best-checkpoint-metric \
#     --save-interval-updates 2000 \
#     --validate-interval-updates 2000 \
#     --no-epoch-checkpoints \
#     --keep-best-checkpoints 1 \
#     --encoder-learned-pos \
#     --save-dir Models/test-de-en \
#     --bpe sentencepiece

env: path_2_data=Data/train-euro-news-big.de-en/bin


Fine-tuning: this is the same as training, but you load a trained model with `--finetune-from-model checkpoint_best.pt`. Also consider modifying the learning rate warmup and batch size.
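
To get a feel for what changing the warmup or batch size does, the sketch below reproduces the shape of an inverse square root schedule with linear warmup, using the values from the training command above (`--lr 0.0006`, `--warmup-updates 2500`, `--warmup-init-lr 1e-07`, `--max-tokens 8192`, `--update-freq 8`). It is written from my understanding of fairseq's `inverse_sqrt` scheduler, not copied from it, so treat it as an approximation.

```python
def inverse_sqrt_lr(step, lr=0.0006, warmup_updates=2500, warmup_init_lr=1e-07):
    """Learning rate at a given update: linear warmup, then decay with 1/sqrt(step)."""
    if step < warmup_updates:
        # Linear interpolation from warmup_init_lr up to the peak lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup the rate decays proportionally to the inverse square root of the step.
    return lr * (warmup_updates / step) ** 0.5

def tokens_per_update(max_tokens=8192, update_freq=8, num_gpus=1):
    """Upper bound on tokens contributing to one optimizer step."""
    return max_tokens * update_freq * num_gpus

for step in (0, 1250, 2500, 10000):
    print(step, inverse_sqrt_lr(step))
print("tokens per update:", tokens_per_update())
```

With a small fine-tuning set, a 2500-update warmup may never finish, which is one reason to shorten it or lower the peak rate.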


## Preprocessing

Example preprocessing + subword training script.
You may want to store the intermediate files in a tmp directory to avoid multiple copies. If you are using the provided models, use the provided SPM model + dictionary for any further preprocessing.

In [53]:
# Preprocess monolingual data
%env src_train=./Data/it-mono/train.mono.$SRC
%env tgt_train=./Data/it-mono/train.mono.$TGT

%env databin=./Data/it-mono/bin/
%env train_file=./Data/it-mono/train.mono
%env dev_file=./Data/it-mono/dev
%env test_file=./Data/it-mono/test

# apply SPM
%env spm=./Models/spm.model

!bash scripts/preprocess.sh
!bash scripts/binarize.sh --only-source

env: src_train=./Data/it-mono/train.mono.de
env: tgt_train=./Data/it-mono/train.mono.en
env: databin=./Data/it-mono/bin/
env: train_file=./Data/it-mono/train.mono
env: dev_file=./Data/it-mono/dev
env: test_file=./Data/it-mono/test
env: spm=./Models/spm.model
Preprocessing data for de-en
Sentencepiece model: ./Models/spm.model
Source training data: ./Data/it-mono/train.mono.de
Target training data: ./Data/it-mono/train.mono.en
Train file: ./Data/it-mono/train.mono
Dev file: ./Data/it-mono/dev
Test file: ./Data/it-mono/test


100%|████████████████████████████████| 100000/100000 [00:03<00:00, 25646.34it/s]
100%|████████████████████████████████| 100000/100000 [00:03<00:00, 26946.38it/s]
100%|████████████████████████████████| 100000/100000 [00:03<00:00, 28946.77it/s]
100%|████████████████████████████████| 100000/100000 [00:04<00:00, 23013.41it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2169.53it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 33129.05it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2209.80it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 30850.00it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2040.00it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 31621.00it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2042.95it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 31205.65it/s]
processed 10000 lines
proces

In [54]:
# repeat preprocessing for parallel FT data
%env src_train=./Data/it-parallel/train.ft.$SRC
%env tgt_train=./Data/it-parallel/train.ft.$TGT
%env databin=./Data/it-parallel/bin/

%env train_file=./Data/it-parallel/train.parallel
%env dev_file=./Data/it-parallel/dev
%env test_file=./Data/it-parallel/test

# apply SPM
%env spm=./Models/spm.model

!bash scripts/preprocess.sh
!bash scripts/binarize.sh
# etc

env: src_train=./Data/it-parallel/train.ft.de
env: tgt_train=./Data/it-parallel/train.ft.en
env: databin=./Data/it-parallel/bin/
env: train_file=./Data/it-parallel/train.parallel
env: dev_file=./Data/it-parallel/dev
env: test_file=./Data/it-parallel/test
env: spm=./Models/spm.model
Preprocessing data for de-en
Sentencepiece model: ./Models/spm.model
Source training data: ./Data/it-parallel/train.ft.de
Target training data: ./Data/it-parallel/train.ft.en
Train file: ./Data/it-parallel/train.parallel
Dev file: ./Data/it-parallel/dev
Test file: ./Data/it-parallel/test


100%|██████████████████████████████████| 20000/20000 [00:01<00:00, 14630.34it/s]
100%|██████████████████████████████████| 20000/20000 [00:00<00:00, 39292.03it/s]
100%|██████████████████████████████████| 20000/20000 [00:01<00:00, 15582.79it/s]
100%|██████████████████████████████████| 20000/20000 [00:00<00:00, 31766.86it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2017.79it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 30292.97it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2051.97it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 31881.54it/s]
100%|█████████████████████████████████████| 2000/2000 [00:01<00:00, 1990.74it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 31292.14it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2046.00it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 27626.18it/s]
processed 10000 lines
proces

If you want to train models from scratch and experiment with different subword segmentation settings, you can train BPE using `subword-nmt learn-joint-bpe-and-vocab` followed by `subword-nmt apply-bpe`, or train a SentencePiece model with the provided `spm_train.py` file. Below are examples of training BPE or SPM vocabularies.
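
For the BPE route, the two `subword-nmt` calls could be assembled roughly as follows. The flags follow the usage shown in the subword-nmt README as far as I recall it; the file names, the 32000 merge operations and the threshold of 50 are placeholder assumptions, and the commands are only printed here, not executed.

```python
import shlex

def learn_bpe_cmd(train_prefix, src, tgt, codes_file, vocab_prefix, merges=32000):
    """Command that learns joint BPE codes and per-language vocabularies."""
    return [
        "subword-nmt", "learn-joint-bpe-and-vocab",
        "--input", f"{train_prefix}.{src}", f"{train_prefix}.{tgt}",
        "-s", str(merges),
        "-o", codes_file,
        "--write-vocabulary", f"{vocab_prefix}.{src}", f"{vocab_prefix}.{tgt}",
    ]

def apply_bpe_cmd(codes_file, vocab_file, threshold=50):
    """Command that segments text read from stdin and writes the result to stdout."""
    return [
        "subword-nmt", "apply-bpe",
        "-c", codes_file,
        "--vocabulary", vocab_file,
        "--vocabulary-threshold", str(threshold),
    ]

# Hypothetical file names, chosen to resemble the commented-out cell in this notebook.
learn = learn_bpe_cmd("train.mono.tok", "de", "en", "train.mono.codes", "train.mono.vocab")
apply_de = apply_bpe_cmd("train.mono.codes", "train.mono.vocab.de")
print(shlex.join(learn))
print(shlex.join(apply_de) + " < train.mono.tok.de > train.mono.bpe.de")
```

The same codes file should then be applied to the dev and test sets so that all splits share one segmentation.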

In [None]:
# TODO: Pray I don't need to do this.

# # preprocessing - example for BPE training
# %env src_train=/content/drive/MyDrive/data-bin/it/train.mono.en
# %env tgt_train=/content/drive/MyDrive/data-bin/it/train.mono.de
# %env databin=/content/drive/MyDrive/data-bin/it/

# %env bpe_train_file=$train_file.bpe

# %env train_file=/content/drive/MyDrive/data-bin/it/train.mono.tok
# %env codes_file=/content/drive/MyDrive/data-bin/it/train.mono.codes
# %env vocab_file=/content/drive/MyDrive/data-bin/it/train.mono.vocab

# %env dev_file=/content/drive/MyDrive/data-bin/it/dev.tok
# %env test_file=/content/drive/MyDrive/data-bin/it/test.tok

# ######## train SPM example #########
# !python ./spm_train.py --input="$train_file.$SRC,$train_file.$TGT" \
#     --vocab_size=32000 \
#     --character_coverage=1.0 \
#     --num_threads=8 \
#     --split_digits \
#     --model_prefix="$train_file.spm" \
#     --model_type=unigram \
#     --bos_id=0 --pad_id=1 --eos_id=2 --unk_id=3


## Generation & Evaluation

Generation typically involves running inference with a trained model on a given test set. This test set must be segmented and binarised using the same vocabulary as the model (this matters if you want to test on other test sets or in different languages).
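
One cheap sanity check for a vocabulary mismatch is to measure how many tokens of a segmented test file are missing from the model dictionary. The sketch below assumes the usual fairseq dictionary layout of one `<symbol> <count>` pair per line; the toy dictionary and sentences are invented for illustration.

```python
def load_fairseq_dict(lines):
    """Collect the symbols from dictionary lines of the form '<symbol> <count>'."""
    vocab = set()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # Split on the last space so the symbol itself is kept intact.
            vocab.add(line.rsplit(" ", 1)[0])
    return vocab

def oov_rate(segmented_lines, vocab):
    """Fraction of whitespace-separated tokens that are not in the vocabulary."""
    total = unknown = 0
    for line in segmented_lines:
        for token in line.split():
            total += 1
            unknown += token not in vocab
    return unknown / total if total else 0.0

# Toy data: in practice, pass open file handles for the dictionary and the segmented test set.
dict_lines = ["▁the 120", "▁cat 7", "s 5"]
test_lines = ["▁the ▁cat s", "▁the ▁dog"]
print(oov_rate(test_lines, load_fairseq_dict(dict_lines)))
```

A rate well above zero on subword-segmented text suggests the test set was segmented with a different model than the one the dictionary belongs to.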

For evaluation, we show below how to get BLEU scores (a standard, if uninformative, MT metric).
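
BLEU combines clipped n-gram precisions with a brevity penalty. To make the first ingredient concrete, here is a minimal sketch of clipped n-gram precision for a single hypothesis and reference. It is an illustration only and not a substitute for sacrebleu; the function names are made up.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """Share of hypothesis n-grams found in the reference, with counts clipped to the reference."""
    hyp_counts = ngram_counts(hyp.split(), n)
    ref_counts = ngram_counts(ref.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

# Repeating a correct word does not help: only one "the" is credited.
print(clipped_precision("the the the", "the cat", 1))
print(clipped_precision("the cat sat", "the cat sat", 2))
```

The clipping is what stops a hypothesis from scoring well by repeating a frequent reference word.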

In [55]:
# set experiment variables
# careful: test set must match model - because of different spm dictionaries
# %env TEST=news-euro-half
%env TEST=it-mono
%env MODEL=big-$SRC-$TGT
%env OUTPUT_DIR=./tests

!bash scripts/generate.sh
!bash scripts/evaluate.sh

# other metrics: comet, beer, your own ensemble, etc.
# TODO: I'd like to add COMET at some point.

env: TEST=it-mono
env: MODEL=big-de-en
env: OUTPUT_DIR=./tests
Generating translations for de-en on it-mono
Model: big-de-en
Output directory: ./tests


2024-04-16 20:13:57 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name'

## Project Code

When generating for backtranslation, you'll first want to evaluate the quality on a parallel test set (as above). Then backtranslate the monolingual training set using `--gen-subset train`, extract the source and hypotheses, apply some data filtering/selection methods, and preprocess into a training dataset for forward translation.

We recommend splitting these up into different steps, but present them here together for clarity.
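
The extraction step (pulling source and hypothesis lines out of the generation log) can be sketched as below. It assumes the standard `fairseq-generate` log format, where `S-<id>` lines carry the source and `H-<id>` lines carry a score followed by the hypothesis; the provided scripts may already do this for you, and the function name is invented here.

```python
def extract_pairs(lines):
    """Return (source, hypothesis) pairs from a fairseq-generate log, ordered by sentence id."""
    sources, hyps = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("S-"):
            idx, text = line.split("\t", 1)
            sources[int(idx[2:])] = text
        elif line.startswith("H-"):
            parts = line.split("\t", 2)  # id, score, hypothesis text
            hyps[int(parts[0][2:])] = parts[2] if len(parts) > 2 else ""
    # Generation output is sorted by length, so restore the original order via the ids.
    return [(sources[i], hyps[i]) for i in sorted(sources) if i in hyps]

log = [
    "S-1\tDas ist ein Test .",
    "T-1\tThis is a test .",
    "H-1\t-0.31\tThis is a test .",
    "D-1\t-0.31\tThis is a test.",
    "S-0\tHallo Welt",
    "H-0\t-0.12\tHello world",
]
for src, hyp in extract_pairs(log):
    print(src, "|||", hyp)
```

Keeping the sentence ids lets you line the hypotheses back up with the original monolingual file.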

You can choose your desired hyperparameters, including decoding strategy, data ratios, BT model, etc.
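
As one example of a data selection method, a length-ratio filter drops pairs whose source and backtranslation differ too much in length. This is just one common heuristic, not the method prescribed by the project; the thresholds below are arbitrary assumptions.

```python
def length_ratio_ok(src, hyp, max_ratio=2.0, min_len=1, max_len=250):
    """Keep a pair only if both sides have a sane length and a similar number of tokens."""
    n_src, n_hyp = len(src.split()), len(hyp.split())
    if min(n_src, n_hyp) == 0:
        return False  # an empty side is never useful and would break the ratio
    if not (min_len <= n_src <= max_len and min_len <= n_hyp <= max_len):
        return False
    return max(n_src, n_hyp) / min(n_src, n_hyp) <= max_ratio

pairs = [
    ("Hallo Welt", "Hello world"),
    ("Ja", "Yes , of course , that is exactly what I meant"),
    ("Ein Satz", ""),
]
kept = [pair for pair in pairs if length_ratio_ok(*pair)]
print(kept)
```

Whatever filter you choose, log how many pairs it removes so the data ratio in your experiments stays traceable.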

Always save and label your outputs clearly! `backtranslate1-en-de-test21.eval` is not going to be very helpful when you're performing your analyses later on. But `iwslt300k.selection=len.dec=greedy.ft=20k.test=it` will be more useful.
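
A tiny helper can keep such labels consistent across runs. The sketch below builds names in the same `key=value` style as the example above; the helper name `run_label` is made up.

```python
def run_label(base, **settings):
    """Join a base name and key=value settings into one dot-separated label."""
    parts = [base] + [f"{key}={value}" for key, value in settings.items()]
    return ".".join(parts)

label = run_label("iwslt300k", selection="len", dec="greedy", ft="20k", test="it")
print(label)
```

Keyword arguments keep their order, so the label always lists the settings in the order you pass them.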

In [62]:
# evaluate reverse model on test set(s)

# data selection before backtranslation
# ....

# preprocessing - apply trained spm.model to selected train.mono subset


# then binarise with fixed dictionaries

# backtranslation generation with reverse model + extract source + hypotheses
%env DATA=it-mono
!source scripts/generate_train.sh Data/

# MK: We don't do that vvv
# data selection after backtranslation


# combine fine-tuning data (or some subset) with bt data, preprocess with fixed dictionaries

# train forward translation model

# evaluate


env: DATA=it-mono
Generating translations for de-en on it-mono
Model: big-de-en
Output directory: Data//generation-big-de-en


2024-04-16 21:14:26 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name'

In [66]:
# src is the original data
%env src_train=./Data/generation-big-de-en/it-mono-de-en.src
# tgt is the backtranslated data
%env tgt_train=./Data/generation-big-de-en/it-mono-de-en.hyp

!cp ./Data/it-mono/test* ./Data/generation-big-de-en/
!cp ./Data/it-mono/dev* ./Data/generation-big-de-en/

%env train_file=./Data/generation-big-de-en/train
%env dev_file=./Data/generation-big-de-en/dev
%env test_file=./Data/generation-big-de-en/test
!bash scripts/preprocess.sh
!bash scripts/binarize.sh --only-source

# no quotes here: %env would store them as part of the value
%env model_save_folder=./Models/it-mono-de-en
!fairseq-train \
    "./Data/generation-big-de-en/bin" \
    --finetune-from-model "./Models/big-$SRC-$TGT/checkpoint_best.pt" \
    --arch transformer_wmt_en_de \
    --task translation \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.1 \
    --lr 0.0003 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 2500 \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --dropout 0.3 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 8192 \
    --max-update 2 \
    --update-freq 8 \
    --patience 10 \
    --scoring sacrebleu \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --save-interval-updates 2000 \
    --validate-interval-updates 2000 \
    --no-epoch-checkpoints \
    --keep-best-checkpoints 1 \
    --encoder-learned-pos \
    --save-dir $model_save_folder \
    --bpe sentencepiece


env: src_train=./Data/generation-big-de-en/it-mono-de-en.src
env: tgt_train=./Data/generation-big-de-en/it-mono-de-en.hyp


Python(28779) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Python(28783) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


env: train_file=./Data/generation-big-de-en/train
env: dev_file=./Data/generation-big-de-en/dev
env: test_file=./Data/generation-big-de-en/test
Preprocessing data for de-en
Sentencepiece model: ./Models/spm.model
Source training data: ./Data/generation-big-de-en/it-mono-de-en.src
Target training data: ./Data/generation-big-de-en/it-mono-de-en.hyp
Train file: ./Data/generation-big-de-en/train
Dev file: ./Data/generation-big-de-en/dev
Test file: ./Data/generation-big-de-en/test


Python(28787) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


100%|██████████████████████████████████| 99989/99989 [00:02<00:00, 40386.87it/s]
100%|██████████████████████████████████| 99989/99989 [00:02<00:00, 44359.89it/s]
100%|██████████████████████████████████| 99989/99989 [00:02<00:00, 40291.63it/s]
100%|██████████████████████████████████| 99989/99989 [00:02<00:00, 42493.58it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 3064.12it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 45411.08it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 3078.55it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 42488.16it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 2943.79it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 40846.52it/s]
100%|█████████████████████████████████████| 2000/2000 [00:00<00:00, 3091.45it/s]
100%|████████████████████████████████████| 2000/2000 [00:00<00:00, 43989.66it/s]
processed 10000 lines
proces

Python(28862) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


2024-04-17 10:47:06 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, aim_repo=None, aim_run_hash=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='de', target_lang='en', tr

Python(29041) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


2024-04-17 10:47:49 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': N

If you prefer to write a bash script in colab, e.g. to combine repetitive preprocessing steps, here's an example of how you can do that:
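
A minimal sketch of one way to do this from a Python cell: write the script to disk, then run it with `bash`. The script body, its file name and its single argument are placeholders for this illustration; in a notebook, the `%%writefile` cell magic is another way to create the file.

```python
import subprocess
import tempfile
from pathlib import Path

# Placeholder script: replace the echo with your own preprocessing commands.
SCRIPT = """#!/usr/bin/env bash
set -euo pipefail
# $1: language pair, e.g. de-en (hypothetical argument for this sketch)
echo "Preprocessing data for $1"
"""

def run_script(script_text, *args):
    """Write the script to a temporary file, run it with bash and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "my_preprocess.sh"
        path.write_text(script_text)
        result = subprocess.run(
            ["bash", str(path), *args],
            capture_output=True, text=True, check=True,
        )
    return result.stdout

print(run_script(SCRIPT, "de-en"), end="")
```

For a script you want to keep, write it next to the other files in `scripts/` instead of a temporary directory and call it with `!bash` as in the cells above.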


### Advanced
If you're going to modify fairseq, there are two ways of doing this: (1) install fairseq and clone the extension repo, or (2) clone fairseq and install it as editable.
For 1, fork the project repo https://github.com/afeena/fairseq_easy_extend.git and clone your fork. It has files for RL and non-autoregressive Transformers (which you're welcome to try out but which aren't relevant to the current project). However, if you're looking to implement (dynamic) curriculum learning, we recommend creating a new task (remember to declare and init the task). This is a fairly involved modification.
For 2, uncomment the code below.

In [None]:
# ! git clone https://github.com/facebookresearch/fairseq
# ! cd fairseq && pip install -e .
# import os
# os.chdir('/content')
# os.environ['PYTHONPATH'] += ":/content/fairseq/"
# ! echo $PYTHONPATH

import os
!git clone https://github.com/afeena/fairseq_easy_extend.git #here change to your own repo
os.chdir("fairseq_easy_extend")

The example config is for baseline CMLM training. For fine-tuning, add `checkpoint.restore_file=<path to checkpoint>` and `checkpoint.reset_optimizer=True`; you will also need to change the hyperparameters!
Also, set `checkpoint.save_dir=<path>`.

In [21]:
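# Use Python variables: an assignment on a `!` line is lost as soon as that subshell exits,
# whereas IPython substitutes $exp from the Python namespace
path_2_data = "/content/drive/MyDrive/data-bin/iwslt14.tokenized.de-en"
exp = "test-de-en"
!mkdir -p "/content/drive/MyDrive/data-bin/$exp"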
import fairseq
!python train.py --config-dir "/content/fairseq_easy_extend/fairseq_easy_extend/models/nat/" --config-name "cmlm_config.yaml" \
task.data=/content/drive/MyDrive/data-bin/test-de-en
!ls /content/drive/MyDrive/data-bin/test-de-en

mkdir: /content/drive/MyDrive: No such file or directory


2024-04-13 12:35:56 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX


/Users/Matey/project/nlp2/.venv/bin/python: can't open file '/Users/Matey/project/nlp2/train.py': [Errno 2] No such file or directory
ls: /content/drive/MyDrive/data-bin/test-de-en: No such file or directory


Training the model. You can change parameters in your config file or override them directly on the command line.

In [None]:
!python train.py --config-dir "/content/fairseq_easy_extend/fairseq_easy_extend/models/nat/" --config-name "cmlm_config.yaml" \
task.data=/content/drive/MyDrive/NLP2-2023-ET/iwslt14.tokenized.de-en

Fine-tuning example

In [None]:
!python train.py --config-dir "/content/fairseq_easy_extend/fairseq_easy_extend/models/nat/" --config-name "cmlm_config.yaml" \
task.data=/content/drive/MyDrive/NLP2-2023-ET/iwslt14.tokenized.de-en \
checkpoint.restore_file=/content/drive/MyDrive/NLP2-2023-ET/checkpoint_best.pt \
checkpoint.reset_optimizer=True