## Dependencies
Set up `fairseq` and install all required dependencies.

In [1]:
import os
!git clone https://github.com/iakirca/fairseq
os.chdir('fairseq')
%pip install -e ./ 

# -e (--editable) installs the package in development mode, so edits to the source take effect without reinstalling
# alternatively, you can create your own fork and clone it again every time you change the code

Cloning into 'fairseq'...
remote: Enumerating objects: 31869, done.
remote: Counting objects: 100% (354/354), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 31869 (delta 280), reused 321 (delta 248), pack-reused 31515
Receiving objects: 100% (31869/31869), 21.76 MiB | 25.26 MiB/s, done.
Resolving deltas: 100% (23466/23466), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting hydra-core<1.1,>=1.0.7
  Downloading hydra_core-1.0.7-py3-none-any.whl (123 kB)
     |████████████████████████████████| 123 kB 8.9 MB/s
Collecting bitarray
  Downloading bitarray-2.5.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (236 kB)

In [2]:
# with autoreload, edited modules are reloaded automatically, so you don't need to restart the kernel after code changes
%load_ext autoreload
%autoreload 2

In [3]:
%pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
     |████████████████████████████████| 880 kB 7.8 MB/s
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895260 sha256=77d2b6376ed95ebe09013d2fd91839c21b59dca456896e611615674af28fb49f
  Stored in directory: /root/.cache/pip/wheels/87/39/dd/a83eeef36d0bf98e7a4d1933a4ad2d660295a40613079bafc9
Successfully built sacremoses
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.53


## Data 
Data typically has to be preprocessed. The pipeline includes tokenization, truecasing, and BPE splitting. Then `fairseq-preprocess` is used to convert the data into binary format.
Since the data has already been preprocessed for you, you just need to access it. To do so, mount your Google Drive. The fairseq-compatible data is stored in the `ro-en-fairseq-bin` folder.
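
As a rough illustration of the BPE splitting step, the sketch below applies an ordered list of merge rules to a single word and marks every non-final subword with `@@`, assuming subword-nmt-style continuation markers. The merge list and the function names `apply_bpe` and `remove_bpe` are hypothetical; a real pipeline would use merges learned from the training corpus, and this toy version ignores end-of-word markers.

```python
def apply_bpe(word, merges):
    """Split a non-empty `word` into subwords by applying merge rules in order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:  # earlier merges have higher priority
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # every subword except the last one is marked as "continues"
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


def remove_bpe(line, marker="@@ "):
    """Undo the split by deleting the continuation marker."""
    return (line + " ").replace(marker, "").rstrip()


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge table
pieces = apply_bpe("lower", toy_merges)
print(" ".join(pieces))              # low@@ er
print(remove_bpe(" ".join(pieces)))  # lower
```

The second function mirrors what happens to the model output when the `@@ ` markers are stripped before scoring.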

In [4]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [5]:
%pip install torchmetrics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchmetrics
  Downloading torchmetrics-0.9.0-py3-none-any.whl (418 kB)
     |████████████████████████████████| 418 kB 8.6 MB/s
Installing collected packages: torchmetrics
Successfully installed torchmetrics-0.9.0


## Model training
Train the CMLM model ([Ghazvininejad et al., 2019](https://aclanthology.org/D19-1633/)) using the `fairseq-hydra-train` command. You need to specify a config file (an example YAML config can be found in `model/train_config.yaml`). A trained baseline is available under `model/model.pt`. You can load it as shown below.

For details, refer to https://github.com/pytorch/fairseq/blob/main/examples/nonautoregressive_translation/scripts.md#mask-predict-cmlm-ghazvininejad-et-al-2019

In [6]:
%reload_ext autoreload
# %autoreload 0
%run -i train.py /content/drive/MyDrive/mini-project-A/ro-en-fairseq-bin \
    --save-dir /content/drive/MyDrive/model_nltk_bleu \
    --restore-file /content/drive/MyDrive/mini-project-A/model/model.pt \
    --task translation_lev \
    --criterion rl_loss_nltk_bleu \
    --arch cmlm_transformer \
    --noise random_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 4000 \
    --update-freq 2 \
    --save-interval-updates 10000 \
    --max-update 300000 \
    --max-epoch 124 \
    --reset-optimizer

2022-06-05 10:45:17 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2022-06-05 10:45:22 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging':

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
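
The warning above comes from NLTK's BLEU implementation: when a hypothesis shares no 2-grams or 3-grams with the reference, the precision for that order is zero and the geometric mean collapses. The stdlib-only sketch below reproduces that situation and applies a simple epsilon smoothing in the spirit of NLTK's `SmoothingFunction().method1`. The function names, the `epsilon` value, the example sentences, and the omission of the brevity penalty are assumptions made for illustration; the custom `rl_loss_nltk_bleu` criterion may compute its reward differently.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Return (clipped overlap count, total hypothesis n-grams) for order n."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap, max(sum(hyp.values()), 1)


def smoothed_geo_mean(reference, hypothesis, max_n=4, epsilon=0.1):
    """Geometric mean of n-gram precisions; zero overlaps are replaced by epsilon."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        overlap, total = modified_precision(reference, hypothesis, n)
        log_sum += math.log((overlap if overlap > 0 else epsilon) / total)
    return math.exp(log_sum / max_n)


reference = "the cat sat on the mat".split()
hypothesis = "mat the sat cat".split()  # right words, wrong order
for n in range(1, 5):
    print(n, modified_precision(reference, hypothesis, n))
print(smoothed_geo_mean(reference, hypothesis))
```

Without the epsilon, the zero overlap at orders 2 to 4 would make the logarithm undefined, which is the case the warning points at.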


2022-06-05 10:47:17 | INFO | train_inner | epoch 123:    824 / 2507 loss=1.635, nll_loss=None, word_ins=None, length=None, ppl=3.11, wps=7587.6, ups=1.07, wpb=7065.8, bsz=240.2, num_updates=100, lr=5.099e-06, gnorm=1.125, train_wall=86, gb_free=11.9, wall=0
2022-06-05 10:48:45 | INFO | train_inner | epoch 123:    924 / 2507 loss=1.651, nll_loss=None, word_ins=None, length=None, ppl=3.14, wps=8034.9, ups=1.13, wpb=7123.1, bsz=231, num_updates=200, lr=1.0098e-05, gnorm=1.125, train_wall=88, gb_free=12, wall=0
2022-06-05 10:50:13 | INFO | train_inner | epoch 123:   1024 / 2507 loss=1.674, nll_loss=None, word_ins=None, length=None, ppl=3.19, wps=8187.7, ups=1.14, wpb=7200.1, bsz=246.4, num_updates=300, lr=1.5097e-05, gnorm=1.1, train_wall=88, gb_free=11.9, wall=0
2022-06-05 10:51:40 | INFO | train_inner | epoch 123:   1124 / 2507 loss=1.656, nll_loss=None, word_ins=None, length=None, ppl=3.15, wps=8359.2, ups=1.16, wpb=7217.8, bsz=269.1, num_updates=400, lr=2.0096e-05, gnorm=1.087, train_w
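
The `lr` values in the log lines above can be checked against the flags passed to `train.py`. The sketch below assumes the usual `inverse_sqrt` behaviour: a linear ramp from `--warmup-init-lr` to `--lr` over `--warmup-updates` steps, followed by decay proportional to the inverse square root of the update number. The function name `inverse_sqrt_lr` is hypothetical and not part of fairseq's API.

```python
def inverse_sqrt_lr(num_updates, lr=0.0005, warmup_updates=10000, warmup_init_lr=1e-07):
    """Learning rate after `num_updates` steps (defaults copied from the training cell)."""
    if num_updates < warmup_updates:
        # linear warmup from warmup_init_lr up to lr
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # after warmup: lr * sqrt(warmup_updates / num_updates)
    return lr * (warmup_updates / num_updates) ** 0.5


for updates in (100, 200, 300, 400, 10000, 40000):
    print(updates, f"{inverse_sqrt_lr(updates):.4e}")
```

Under these assumptions the first four values agree with the logged `lr=5.099e-06`, `1.0098e-05`, `1.5097e-05`, and `2.0096e-05`, which suggests the run is still early in the 10000-step warmup.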

In [7]:
%pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
     |████████████████████████████████| 4.2 MB 7.5 MB/s
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 37.4 MB/s
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 5.2 MB/s
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.7.0 tokenizers-0.12.1 transformers-4.19.2


In [8]:
%pip install datasets unbabel-comet


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
     |████████████████████████████████| 346 kB 3.0 MB/s
Collecting unbabel-comet
  Downloading unbabel_comet-1.1.1-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 2.4 MB/s
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 18.3 MB/s
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 3.8 MB/s
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 8.5 MB/s
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloadi

## Generation
Generate output for the test data using `fairseq-generate` and the stored checkpoint.

See https://github.com/pytorch/fairseq/blob/main/examples/nonautoregressive_translation/README.md#translate for more details
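
For intuition about `--iter-decode-max-iter`, the following sketch shows the linear re-masking schedule described for mask-predict in Ghazvininejad et al. (2019): after iteration `t` of `T`, the `n = N * (T - t) / T` least confident of the `N` target tokens are masked again and re-predicted. The helper names and the toy confidence scores are hypothetical, and fairseq's own implementation may differ in details such as how it counts iterations or treats special tokens.

```python
def num_to_mask(length, iteration, total_iterations):
    """Number of tokens to re-mask after `iteration` (1-based) of `total_iterations`."""
    return length * (total_iterations - iteration) // total_iterations


def remask(tokens, scores, n, mask="<mask>"):
    """Replace the `n` lowest-scoring tokens with `mask`."""
    lowest = sorted(range(len(tokens)), key=lambda i: scores[i])[:n]
    return [mask if i in lowest else tok for i, tok in enumerate(tokens)]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
scores = [0.9, 0.4, 0.8, 0.3, 0.95, 0.6]  # toy per-token confidences
for t in range(1, 4):
    n = num_to_mask(len(tokens), t, total_iterations=3)
    print(t, n, remask(tokens, scores, n))
```

The number of masked positions shrinks linearly to zero, so the final iteration leaves the full hypothesis in place.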

In [15]:
%run -i fairseq_cli/generate.py /content/drive/MyDrive/mini-project-A/ro-en-fairseq-bin \
--path /content/drive/MyDrive/model_nltk_bleu/NLTK_bleu_checkpoint_best.pt \
--batch-size 16 --beam 1 --task translation_lev --iter-decode-max-iter 9 \
--gen-subset test --remove-bpe --scoring bleu --tokenizer moses \
--source-lang ro --target-lang en --quiet 

2022-06-05 12:53:28 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name'