# FAIRSEQ

from https://fairseq.readthedocs.io/en/latest/getting_started.html

"Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks." It provides reference implementations of various sequence-to-sequence models, which makes our lives much easier!

## Installation

In [None]:
! pip3 install fairseq

## Downloading some data and required scripts

In [None]:
! bash data/prepare-wmt14en2fr.sh

## Pretrained Model Evaluation

Let's first see how to evaluate a pretrained model in fairseq. We'll download a pretrained model along with its vocabulary.

In [None]:
! curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

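Before translating, the input sentence has to be preprocessed the same way as the training data: punctuation normalization, tokenization and byte pair encoding (BPE). We have written a script to do this, but as an example let's do it directly in the Jupyter Notebook.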

In [None]:
sentence = 'Why is it rare to discover new marine mammal species ?'

In [None]:
%%bash -s "$sentence"
SCRIPTS=data/mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=data/subword-nmt
BPE_TOKENS=40000
src=en
tgt=fr
# Normalize punctuation, strip non-printing characters, then tokenize
echo "$1" | \
            perl $NORM_PUNC $src | \
            perl $REM_NON_PRINT_CHAR | \
            perl $TOKENIZER -threads 8 -a -l $src > temp_tokenized.out
prep=wmt14.en-fr.fconv-py
BPE_CODE=$prep/bpecodes
python $BPEROOT/apply_bpe.py -c $BPE_CODE < temp_tokenized.out > final_result.out
rm temp_tokenized.out
cat final_result.out
rm final_result.out

Let's now look at fairseq's interactive translation feature. Open a shell, cd to this directory and type or copy the following command:

In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Why is it rare to discover new marine mam@@ mal species ?" | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --beam 1 --source-lang en --target-lang fr

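This generation script produces three types of outputs: a line prefixed with O is a copy of the original source sentence; H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker which is omitted from the text. The text still contains BPE continuation markers (@@), so the last step is to merge the sub-word units back into words. Let's do this in bash, using the segmented source sentence as an example: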

In [None]:
!  echo "Why is it rare to discover new marine mam@@ mal species ?" | sed -r 's/(@@ )|(@@ ?$)//g' 
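
The same clean-up can be done from Python. The sketch below mirrors the sed expression above with `re.sub`; the helper name `remove_bpe` is our own and is not part of fairseq.

```python
import re

def remove_bpe(text, marker="@@"):
    """Join BPE sub-word units by deleting the continuation marker."""
    # Same pattern as the sed call: a marker followed by a space,
    # or a marker (optionally followed by a space) at the end of the line.
    pattern = r"({m} )|({m} ?$)".format(m=re.escape(marker))
    return re.sub(pattern, "", text)

segmented = "Why is it rare to discover new marine mam@@ mal species ?"
restored = remove_bpe(segmented)
print(restored)
```

This is handy when the generated text is post-processed inside the notebook rather than in a shell pipeline.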

All good! Now let's train a new model.

## Training

### Data Preprocessing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German). We will work with a part of WMT 2014 like we did in the previous section

To pre-process the WMT 2014 dataset, run <code>bash data/prepare-wmt14en2fr.sh</code> as we did in the previous section. This will download the data, tokenize it, apply byte pair encoding and split the data into training, validation and test sets.

To binarize the data, we do the following:

In [None]:
%%bash
TEXT=data/wmt14_en_fr
fairseq-preprocess --source-lang en --target-lang fr \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_fr --thresholdtgt 5 --thresholdsrc 5 \
  --workers 1

Of course, we cannot read the binary files directly, but let's check which files were created and what is in the dictionaries.

In [None]:
! ls data-bin/wmt14_en_fr/

In [None]:
! head -5 data-bin/wmt14_en_fr/dict.en.txt

In [None]:
! head -5 data-bin/wmt14_en_fr/dict.fr.txt
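
To make the role of `--thresholdsrc 5` and `--thresholdtgt 5` more concrete, here is a minimal sketch of count-based vocabulary building: tokens seen fewer than `threshold` times are left out of the dictionary and would therefore be treated as unknown words. It is a simplified illustration in plain Python, not fairseq's implementation; `build_vocab` and the toy corpus are made up for this example, and the `token count` output format is assumed to resemble the dictionary files inspected above.

```python
from collections import Counter

def build_vocab(sentences, threshold):
    """Count whitespace-separated tokens and keep those seen at least `threshold` times."""
    counts = Counter(tok for line in sentences for tok in line.split())
    kept = [(tok, n) for tok, n in counts.items() if n >= threshold]
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
vocab = build_vocab(toy_corpus, threshold=2)
for token, count in vocab:
    print(token, count)
```

With a threshold of 2, the words that occur only once in the toy corpus (mat, log, and) do not make it into the vocabulary.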

## Model

Fairseq provides many predefined architectures to choose from. For English-French, we will choose an architecture known to work well for the problem. Later, we will see how to define custom models in Fairseq.

In [None]:
! mkdir -p fairseq_models/checkpoints/fconv_wmt_en_fr

In [None]:
! fairseq-train data-bin/wmt14_en_fr \
  --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 50 \
  --arch fconv_wmt_en_fr --save-dir fairseq_models/checkpoints/fconv_wmt_en_fr

In [None]:
! ls data-bin

## Generating and Checking BLEU for our model

In [None]:
! pip3 install sacrebleu

In [None]:
! mkdir -p fairseq_models/logs

In [None]:
%%bash
fairseq-generate data-bin/wmt14_en_fr  \
  --path fairseq_models/checkpoints/fconv_wmt_en_fr/checkpoint_best.pt \
  --beam 1 --batch-size 128 --remove-bpe --sacrebleu  >> fairseq_models/logs/our_model.out

In [None]:
! head -10 fairseq_models/logs/our_model.out

In [None]:
! tail -2 fairseq_models/logs/our_model.out

### Generating and Checking BLEU for the large Pretrained Model

In [None]:
! curl https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin

In [None]:
%%bash
fairseq-generate data-bin/wmt14.en-fr.newstest2014  \
  --path wmt14.en-fr.fconv-py/model.pt \
  --beam 1 --batch-size 128 --remove-bpe --sacrebleu >> fairseq_models/logs/pretrained_model.out

In [None]:
! head -10 fairseq_models/logs/pretrained_model.out

In [None]:
! tail -2 fairseq_models/logs/pretrained_model.out

## Writing A Custom Model in FAIRSEQ

We will extend fairseq by adding a new FairseqModel that encodes a source sentence with an LSTM and then passes the final hidden state to a second LSTM that decodes the target sentence (without attention).

### Building an Encoder and Decoder

In this section we’ll define a simple LSTM Encoder and Decoder. All Encoders should implement the FairseqEncoder interface and Decoders should implement the FairseqDecoder interface. These interfaces themselves extend torch.nn.Module, so FairseqEncoders and FairseqDecoders can be written and used in the same ways as ordinary PyTorch Modules.

### Encoder

Our Encoder will embed the tokens in the source sentence, feed them to a torch.nn.LSTM and return the final hidden state.

### Decoder

Our Decoder will predict the next word, conditioned on the Encoder’s final hidden state and an embedded representation of the previous target word – which is sometimes called input feeding or teacher forcing. More specifically, we’ll use a torch.nn.LSTM to produce a sequence of hidden states that we’ll project to the size of the output vocabulary to predict each target word
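
Teacher forcing is easiest to see on a toy example: during training, the decoder input at each step is the reference word from the previous step, so the decoder input is simply the target shifted right by one position. The sketch below uses plain Python lists; the `<eos>` start symbol and the helper name `make_decoder_input` are assumptions made for illustration, and in practice fairseq prepares this shifted sequence itself when it builds a batch.

```python
def make_decoder_input(target_tokens, eos="<eos>"):
    """Shift the target right by one: the end marker moves to the front."""
    if not target_tokens or target_tokens[-1] != eos:
        raise ValueError("target must end with the end-of-sentence marker")
    return [eos] + target_tokens[:-1]

target = ["Pourquoi", "est-il", "rare", "?", "<eos>"]
decoder_input = make_decoder_input(target)

# At each step the decoder sees the previous reference word and must predict the next one.
for seen, expected in zip(decoder_input, target):
    print(f"{seen!r:>12} -> {expected!r}")
```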

## Registering the Model

Now that we’ve defined our Encoder and Decoder we must register our model with fairseq using the register_model() function decorator. Once the model is registered we’ll be able to use it with the existing Command-line Tools.

All registered models must implement the BaseFairseqModel interface. For sequence-to-sequence models (i.e., any model with a single Encoder and Decoder), we can instead implement the FairseqModel interface.

Create a small wrapper class in the same file as the Encoder and Decoder (here fairseq_models/custom_models/simple_lstm.py) and register it in fairseq with the name 'simple_lstm'.

Finally let’s define a named architecture with the configuration for our model. This is done with the register_model_architecture() function decorator. Thereafter this named architecture can be used with the --arch command-line argument, e.g., --arch tutorial_simple_lstm.
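
If decorator-based registration is unfamiliar, the toy registry below shows the general pattern in plain Python. It is only a sketch of the idea and not fairseq's code: `register_model_sketch`, `register_architecture_sketch`, `ToyModel` and the dictionary-based `args` are hypothetical stand-ins, and the default of 256 for the hidden dimensions is taken from the training log further down.

```python
MODEL_REGISTRY = {}   # model name -> model class
ARCH_REGISTRY = {}    # architecture name -> (model name, config function)

def register_model_sketch(name):
    """Class decorator that records a model class under `name`."""
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"duplicate model name: {name}")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

def register_architecture_sketch(model_name, arch_name):
    """Function decorator that ties a named configuration to a registered model."""
    def wrapper(fn):
        if model_name not in MODEL_REGISTRY:
            raise ValueError(f"unknown model: {model_name}")
        ARCH_REGISTRY[arch_name] = (model_name, fn)
        return fn
    return wrapper

@register_model_sketch("simple_lstm")
class ToyModel:
    pass

@register_architecture_sketch("simple_lstm", "tutorial_simple_lstm")
def tutorial_simple_lstm(args):
    # Fill in defaults only for options the user did not set.
    args.setdefault("encoder_hidden_dim", 256)
    args.setdefault("decoder_hidden_dim", 256)
    return args

# Looking up an architecture by name, as a command-line tool might do for --arch.
model_name, configure = ARCH_REGISTRY["tutorial_simple_lstm"]
config = configure({"encoder_hidden_dim": 128})
print(model_name, MODEL_REGISTRY[model_name].__name__, config)
```

The point of the pattern is that importing the file is enough to make the model and its architecture discoverable by name.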

In [4]:
import fairseq
import os
fairseq_path = os.path.dirname(fairseq.__file__)
fairseq_path = os.path.join(fairseq_path, 'models')
print(fairseq_path)

/scratch/sm7582/condaenvs/denoising/lib/python3.7/site-packages/fairseq/models


In [5]:
%%bash -s "$fairseq_path"
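# Copy our model file into fairseq's models package, where it is imported and registered automatically
cp fairseq_models/custom_models/simple_lstm.py "$1"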

In [None]:
%%bash -s "$fairseq_path"
ls $1 | grep lstm

## Training Our Custom Model

In [1]:
! mkdir -p fairseq_models/checkpoints/tutorial_simple_lstm

In [23]:
%%bash
fairseq-train data-bin/wmt14_en_fr \
  --arch tutorial_simple_lstm \
  --encoder-dropout 0.2 --decoder-dropout 0.2 \
  --optimizer adam --lr 0.005 --lr-shrink 0.5 \
  --max-epoch 50 \
  --max-tokens 12000 --save-dir fairseq_models/checkpoints/tutorial_simple_lstm

Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, arch='tutorial_simple_lstm', bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='cross_entropy', data=['data-bin/wmt14_en_fr'], ddp_backend='c10d', decoder_dropout=0.2, decoder_embed_dim=256, decoder_hidden_dim=256, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, encoder_dropout=0.2, encoder_embed_dim=256, encoder_hidden_dim=256, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.005], lr_scheduler='reduce_lr_on_plateau', lr_shrink=0.5, max_epoch=50, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=12000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, mi



In [24]:
%%bash
fairseq-generate data-bin/wmt14_en_fr  \
  --path fairseq_models/checkpoints/tutorial_simple_lstm/checkpoint_best.pt \
  --beam 1 --batch-size 128 --remove-bpe --sacrebleu  >> fairseq_models/logs/custom_model.out



In [25]:
!head -10 fairseq_models/logs/custom_model.out

Namespace(beam=1, cpu=False, data=['data-bin/wmt14_en_fr'], diverse_beam_groups=-1, diverse_beam_strength=0.5, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, model_overrides='{}', nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=0, path='fairseq_models/checkpoints/tutorial_simple_lstm/checkpoint_best.pt', prefix_size=0, print_alignment=False, quiet=False, raw_text=False, remove_bpe='@@ ', replace_unk=None, sacrebleu=True, sampling=False, sampling_temperature=1, sampling_topk=-1, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, sourc

In [26]:
!tail -2 fairseq_models/logs/custom_model.out

| Translated 3003 sentences (95826 tokens) in 17.3s (173.62 sentences/s, 5540.33 tokens/s)
| Generate test with beam=1: BLEU(score=2.920992351091757, counts=[20940, 3724, 1489, 594], totals=[97757, 94754, 91752, 88750], precisions=[21.4204609388586, 3.9301770901492286, 1.6228529078385212, 0.6692957746478874], bp=0.9445955091061184, sys_len=97757, ref_len=103329)
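
To see how the fields in this summary line fit together, the sketch below recomputes the score from the reported counts, totals and lengths using the standard corpus-level BLEU formula (brevity penalty times the geometric mean of the n-gram precisions). It is a minimal illustration that ignores the smoothing options sacrebleu offers, which do not matter here because every count is non-zero; the function name `bleu_from_stats` is our own.

```python
import math

def bleu_from_stats(counts, totals, sys_len, ref_len):
    """Corpus BLEU (in percent) from n-gram match counts and totals."""
    precisions = [100.0 * c / t for c, t in zip(counts, totals)]
    # Brevity penalty: penalise output that is shorter than the reference.
    bp = math.exp(1.0 - ref_len / sys_len) if sys_len < ref_len else 1.0
    # Geometric mean of the n-gram precisions (all counts must be non-zero here).
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean), precisions, bp

score, precisions, bp = bleu_from_stats(
    counts=[20940, 3724, 1489, 594],
    totals=[97757, 94754, 91752, 88750],
    sys_len=97757,
    ref_len=103329,
)
print(round(score, 2), [round(p, 2) for p in precisions], round(bp, 4))
```

The low score is driven mostly by the higher-order n-grams: about 21% of unigrams match the reference, but fewer than 1% of 4-grams do.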
