<div align=right>
LAP 3 / EMLCT Computational Morphology<br>
Hulden<br>
Fall 2022
</div>

<h1 align=center>Training a transformer model</h1>

(You can access this notebook on Colab directly, [here](https://colab.research.google.com/drive/1GU9HUZ0lnTNcuym-tvXY19FGDQ_9QSLO?usp=sharing))

# Seq2seq training with Fairseq

This notebook illustrates basic training of a seq2seq model using a GPU and the Transformer model implemented in the Fairseq package. The task is to learn to inflect Spanish verbs from citation forms and grammatical information. 

Download the file `fairseqexample.tar.gz` and extract its files into your Google Drive. You will then need to mount Google Drive in the notebook and change into that directory (see the cells below), so that you can run Fairseq on the training/dev/test files.

## The data

The training, dev, and test data are organized into the following files: `train.esp.input`, `train.esp.output`, `dev.esp.input`, `dev.esp.output`, and `tst.esp.input`. There is no gold output for the test set, since we are only training an example model and generating outputs, not evaluating them.

For example, the past participle masculine singular (V.PTCP PST MASC SG) of the verb "manducar" is "manducado". Accordingly, the following line in the file `train.esp.input`:

`m a n d u c a r # V.PTCP PST MASC SG`

corresponds to the line

`m a n d u c a d o`

in the file `train.esp.output`. The dev set is organized the same way.
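The format shown above can be produced and undone with a few lines of Python. This is only a minimal sketch: the helper names `encode_input`, `encode_output`, and `decode_output` are hypothetical, and it assumes the layout of the example line (space-separated characters, then ` # `, then the tags). The files shipped in `fairseqexample.tar.gz` may use different boundary symbols, so check a few lines before relying on it.

```python
def encode_input(lemma: str, tags: str) -> str:
    """Space-separate the characters of the lemma and append the tags after ' # '."""
    return " ".join(lemma) + " # " + tags


def encode_output(form: str) -> str:
    """Space-separate the characters of the inflected form."""
    return " ".join(form)


def decode_output(line: str) -> str:
    """Join space-separated characters back into a single word."""
    return "".join(line.split())


print(encode_input("manducar", "V.PTCP PST MASC SG"))
print(encode_output("manducado"))
print(decode_output("m a n d u c a d o"))
```

Splitting words into characters means the model treats each character and each tag as one symbol, which is what lets it generalize to verbs it has not seen.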

## GPU 
To train on a GPU, you need to activate the GPU in the Colab notebook by going to Edit > Notebook Settings and selecting GPU as the "Hardware Accelerator".
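One way to confirm that the runtime actually has a GPU is to ask the NVIDIA driver tool `nvidia-smi` to list devices. The sketch below uses only the standard library; the function name `gpu_visible` is hypothetical, and it assumes that a working GPU runtime has `nvidia-smi` on the path.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not installed, so most likely no GPU runtime
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # "nvidia-smi -L" prints one line per device, each starting with "GPU"
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, recheck the notebook settings before starting training, since training on the CPU will be much slower.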



In [None]:
# Install fairseq if the Colab runtime does not already have it (needed once per notebook)
!pip install fairseq

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fairseq
  Downloading fairseq-0.12.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.0 MB)
Collecting omegaconf<2.1
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Collecting bitarray
  Downloading bitarray-2.6.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (235 kB)
Collecting hydra-core<1.1,>=1.0.7
  Downloading hydra_core-1.0.7-py3-none-any.whl (123 kB)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting antlr4-python3-runtime==4.8
  Downloading antlr4-python3-runtime-4.8.tar.gz (112 kB)
Coll

In [None]:
!pip install tensorboardX

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorboardX
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.5.1


In [None]:
# Mount Google Drive so that Colab can access the files stored there.
from google.colab import drive
drive.mount('/content/drive/')
!ls -al

Mounted at /content/drive/
total 20
drwxr-xr-x 1 root root 4096 Nov 25 09:28 .
drwxr-xr-x 1 root root 4096 Nov 25 09:27 ..
drwxr-xr-x 4 root root 4096 Nov 22 00:13 .config
drwx------ 6 root root 4096 Nov 25 09:28 drive
drwxr-xr-x 1 root root 4096 Nov 22 00:14 sample_data


In [None]:
# Change into the directory that holds the training/dev/test files and the
# preprocessing and training scripts. The path may differ for you, depending
# on where you placed the files from fairseqexample.tar.gz.
%cd /content/drive/MyDrive/fairseqexample
!ls -al

/content/drive/MyDrive/fairseqexample
total 747
drwx------ 2 root root   4096 Apr 19  2022 checkpoints
drwx------ 2 root root   4096 Apr 19  2022 data-bin
-rw------- 1 root root  37637 Oct 30  2020 dev.esp.input
-rw------- 1 root root  24947 Oct 30  2020 dev.esp.output
-rw------- 1 root root    552 Oct 30  2020 preprocess.sh
-rw------- 1 root root 375069 Oct 30  2020 train.esp.input
-rw------- 1 root root 248914 Oct 30  2020 train.esp.output
-rw------- 1 root root   1723 Nov 26  2021 train.sh
-rw------- 1 root root  37364 Oct 30  2020 tst.esp.input
-rw------- 1 root root  24620 Apr 27  2022 tst.esp.output
-rw------- 1 root root   3253 Apr 20  2022 tst.esp.output2
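Because line *i* of each `.input` file corresponds to line *i* of the matching `.output` file, it can be worth checking that the paired files have the same number of lines before preprocessing. The following is a small sketch of such a check; the helper names `count_lines` and `line_counts` are hypothetical, and it assumes the file naming shown in the listing above and UTF-8 encoding.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def line_counts(directory, split, lang="esp"):
    """Return (input lines, output lines) for one split, e.g. 'train' or 'dev'."""
    base = Path(directory)
    return (
        count_lines(base / f"{split}.{lang}.input"),
        count_lines(base / f"{split}.{lang}.output"),
    )


for split in ("train", "dev"):
    if Path(f"{split}.esp.input").exists() and Path(f"{split}.esp.output").exists():
        n_in, n_out = line_counts(".", split)
        status = "OK" if n_in == n_out else "MISMATCH"
        print(f"{split}: {n_in} inputs, {n_out} outputs -> {status}")
    else:
        print(f"{split}: files not found in the current directory")
```

A mismatch usually means a file was truncated during upload or edited by hand, and the input/output pairs would then be misaligned during training.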


In [None]:
# Preprocess the data: fairseq builds the symbol dictionaries and binarizes the files into data-bin/esp
!bash ./preprocess.sh esp

2022-11-25 09:30:06 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/esp', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='esp.input', srcdict
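Conceptually, preprocessing collects every whitespace-separated symbol into a dictionary and then stores each line as a sequence of integer indices. The sketch below illustrates that idea in plain Python. It is not fairseq's implementation: the names `build_vocab` and `binarize` are hypothetical, the frequency-based ordering is an assumption, and real fairseq dictionaries also reserve special symbols (padding, end of sentence, unknown) that are ignored here. The log above shows `joined_dictionary=False`, which suggests that the input and output sides get separate dictionaries.

```python
from collections import Counter


def build_vocab(lines):
    """Map each whitespace-separated symbol to an integer index.

    Symbols are ordered by descending frequency, with ties broken alphabetically.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    symbols = sorted(counts, key=lambda s: (-counts[s], s))
    return {sym: idx for idx, sym in enumerate(symbols)}


def binarize(line, vocab):
    """Replace each symbol in a line with its dictionary index."""
    return [vocab[sym] for sym in line.split()]


sample = ["m a n d u c a r # V.PTCP PST MASC SG"]
vocab = build_vocab(sample)
print(vocab)
print(binarize("m a n d u c a r", vocab))
```

Note that a tag such as `V.PTCP` becomes a single dictionary entry, just like an individual character.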

In [None]:
# Train with default parameters, roughly the baseline in the SIGMORPHON 2020 shared task.
# Let this run until the loss on the validation (dev) set no longer improves (maybe 10 minutes with a GPU).
!bash ./train.sh esp

2022-11-25 09:31:12 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 212, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name':
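The rule "run until the validation loss no longer improves" can be made precise with a patience counter: stop once a given number of validation runs have passed without a new best loss. The sketch below only illustrates that rule; the function names are hypothetical, the loss values are made up for the example, and whether `train.sh` configures any automatic stopping is not shown in this notebook.

```python
def runs_since_best(valid_losses):
    """Number of validation runs since the lowest loss seen so far."""
    if not valid_losses:
        return 0
    best_index = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return len(valid_losses) - 1 - best_index


def should_stop(valid_losses, patience=5):
    """True once `patience` validation runs have passed without improvement."""
    return runs_since_best(valid_losses) >= patience


# Made-up validation losses, one per epoch, for illustration only
losses = [2.10, 1.40, 0.90, 0.85, 0.86, 0.87, 0.88]
print(runs_since_best(losses))
print(should_stop(losses, patience=3))
```

When watching the training log by hand, the same idea applies: note the best validation loss so far and interrupt the cell once several epochs have gone by without beating it.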

In [None]:
# Generate predictions on test data - read in all the inputs from tst.esp.input 
# and generate outputs to the file tst.esp.output (this is slow and takes about a minute)
!fairseq-interactive data-bin/esp/ --source-lang=esp.input --target-lang=esp.output --path=checkpoints/esp-models/checkpoint_best.pt --input=tst.esp.input | grep -P "D-[0-9]+" | cut -f3 > tst.esp.output

2022-11-25 09:45:06 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_na
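The `grep` and `cut` stages of the command above keep only the lines containing `D-<number>` and then take the third tab-separated field, which holds the predicted symbols. The same filtering can be sketched in Python as below. The function name `extract_hypotheses` is hypothetical, and the sample lines are illustrative (the scores are made up); the layout of id, score, and hypothesis is an assumption inferred from the `cut -f3` in the command. Unlike the `grep` pattern, this version only matches `D-` at the start of a line.

```python
import re

# A hypothesis line is assumed to look like: D-<id> TAB <score> TAB <symbols>
D_LINE = re.compile(r"^D-(\d+)\t")


def extract_hypotheses(lines):
    """Return (sentence id, hypothesis) pairs from the D- lines, in order of appearance."""
    hypotheses = []
    for line in lines:
        match = D_LINE.match(line)
        if match is None:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            hypotheses.append((int(match.group(1)), fields[2]))
    return hypotheses


sample = [
    "S-1\t< t r a p e a r > V NFIN",
    "H-1\t-0.01\t< t r a p e a r >",
    "D-1\t-0.01\t< t r a p e a r >",
    "P-1\t-0.01 -0.01",
]
print(extract_hypotheses(sample))
```

Keeping the sentence id alongside each hypothesis makes it possible to verify that the predictions line up with the inputs in `tst.esp.input`.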

In [None]:
# Read in the inputs and the generated outputs and display the first 20 side by side
with open("tst.esp.input", encoding="utf-8") as f:
    linesinput = [l.strip() for l in f]
with open("tst.esp.output", encoding="utf-8") as f:
    linesoutput = [l.strip() for l in f]
tuple(zip(linesinput, linesoutput))[:20]  # First 20 test inputs with their predicted outputs

(('< m e r c a d e a r > V NEG IMP 3 SG', '< n o # m e r c a d e e >'),
 ('< t r a p e a r > V NFIN', '< t r a p e a r >'),
 ('< a s i l a r > V SBJV PRS 1 SG', '< a s i l e >'),
 ('< n a d a r > V NEG IMP 3 PL', '< n o # n a d e n >'),
 ('< e n m a r a ñ a r > V POS IMP 2 SG', '< e n m a r a ñ a >'),
 ('< u b i c a r > V SBJV PST 3 SG LGSPEC1', '< u b i c a r a >'),
 ('< b u r l a r > V IND FUT 2 SG', '< b u r l a r á s >'),
 ('< c a r e c e r > V SBJV PST 2 SG LGSPEC1', '< c a r e c i e r a s >'),
 ('< e n t r a ñ a r > V SBJV FUT 3 SG', '< e n t r a ñ a r e >'),
 ('< a d e n t r a r > V SBJV FUT 2 SG', '< a d e n t r a r e s >'),
 ('< e n c a b e z a r > V IND PST 2 SG IPFV', '< e n c a b e z a b a s >'),
 ('< a r g u m e n t a r > V SBJV PST 2 SG', '< a r g u m e n t a s e s >'),
 ('< d e s a s i r > V IND PST 3 PL PFV', '< d e s a s i e r o n >'),
 ('< e n t e r a r > V SBJV FUT 1 PL', '< e n t e r á r e m o s >'),
 ('< v o l c a r > V SBJV FUT 1 SG', '< v o l c a r e >'),
 ('< c 