# Model Building Pipeline
In this notebook, we'll cover the following topics:
1. Environment setup
2. Data preparation
3. Configuration

Author: Kriti Singh
Last update: 6. August, 2022

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [None]:
!nvidia-smi

Sat Aug  6 11:59:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Make sure that you have a compatible PyTorch version.

In [None]:
import torch
torch.__version__

'1.12.0+cu113'
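
The version string above has two parts: the release number and, after the `+`, a local build tag naming the CUDA toolkit the wheel was built against. As a minimal sketch, the helper below (hypothetical, not part of PyTorch) splits the string so both parts can be checked separately. The minimum release used in the example is an assumption for illustration only, not a documented Joey NMT requirement.

```python
def parse_torch_version(version: str):
    """Split a PyTorch version string into (release tuple, build tag)."""
    release, _, build = version.partition("+")
    # Keep only the purely numeric components, e.g. "1.12.0" -> (1, 12, 0).
    numbers = tuple(int(part) for part in release.split(".") if part.isdigit())
    return numbers, build


release, build = parse_torch_version("1.12.0+cu113")
print(release, build)

# Hypothetical minimum release, chosen only for illustration.
assumed_minimum = (1, 9, 0)
print("new enough:", release >= assumed_minimum)
```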

Mount your Google Drive

Install the Python 3.7 branch of joeynmt

In [None]:
!pip install git+https://github.com/joeynmt/joeynmt.git@py3.7

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/joeynmt/joeynmt.git@py3.7
  Cloning https://github.com/joeynmt/joeynmt.git (to revision py3.7) to /tmp/pip-req-build-084x4deg
  Running command git clone -q https://github.com/joeynmt/joeynmt.git /tmp/pip-req-build-084x4deg
  Running command git checkout -b py3.7 --track origin/py3.7
  Switched to a new branch 'py3.7'
  Branch 'py3.7' set up to track remote branch 'py3.7' from 'origin'.
Collecting protobuf==3.20.1
  Downloading protobuf-3.20.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
Collecting sacrebleu>=2.0.0
  Downloading sacrebleu-2.2.0-py3-none-any.whl (116 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

## Data Preparation

### Download
We'll use English - Hindi translations from the [Tatoeba](https://tatoeba.org/) collection ([CC-BY 2.0 FR](https://creativecommons.org/licenses/by/2.0/fr/)).

The [Tatoeba](https://huggingface.co/datasets/tatoeba) corpus, a collection of sentences and their translations, is available in the Hugging Face `datasets` library. To load a language pair that isn't part of the predefined configs, specify the two language codes as a pair. The valid pairs are listed in the Homepage section of the Dataset Description: http://opus.nlpl.eu/Tatoeba.php


The Tatoeba dataset on the Hugging Face Hub has only a train split, with no dev or test split, so let's split the data manually and save it locally.

> 📝 Note that most of the dataset loading scripts on Hugging Face have predefined train-dev-test splits, e.g. [wmt17](https://huggingface.co/datasets/wmt17). In that case, you can skip this step and go straight to the Vocabulary section.

In [None]:
from datasets import load_dataset

lang1 = "en"
lang2 = "hi"
lang = lang1 + lang2

tatoeba_kwargs = {
  "path": "tatoeba",
  "lang1": lang1,
  "lang2": lang2,
  "date" : "v2021-07-22",
  "ignore_verifications": True,
  # "cache_dir": "/content/drive/MyDrive/.cache/huggingface"
}

tatoeba_dev = load_dataset(split="train[:1000]", **tatoeba_kwargs)
tatoeba_test = load_dataset(split="train[1000:2000]", **tatoeba_kwargs)
tatoeba_train = load_dataset(split="train[2000:]", **tatoeba_kwargs)

tatoeba_dev, tatoeba_test, tatoeba_train

Downloading builder script:   0%|          | 0.00/1.95k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.45k [00:00<?, ?B/s]

Using custom data configuration en-hi-d689d0b456ea3e42


Downloading and preparing dataset tatoeba/en-hi to /root/.cache/huggingface/datasets/tatoeba/en-hi-d689d0b456ea3e42/0.0.0/b3ea9c6bb2af47699c5fc0a155643f5a0da287c7095ea14824ee0a8afd74daf6...


Downloading data:   0%|          | 0.00/344k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset tatoeba downloaded and prepared to /root/.cache/huggingface/datasets/tatoeba/en-hi-d689d0b456ea3e42/0.0.0/b3ea9c6bb2af47699c5fc0a155643f5a0da287c7095ea14824ee0a8afd74daf6. Subsequent calls will reuse this data.


Using custom data configuration en-hi-d689d0b456ea3e42
Reusing dataset tatoeba (/root/.cache/huggingface/datasets/tatoeba/en-hi-d689d0b456ea3e42/0.0.0/b3ea9c6bb2af47699c5fc0a155643f5a0da287c7095ea14824ee0a8afd74daf6)
Using custom data configuration en-hi-d689d0b456ea3e42
Reusing dataset tatoeba (/root/.cache/huggingface/datasets/tatoeba/en-hi-d689d0b456ea3e42/0.0.0/b3ea9c6bb2af47699c5fc0a155643f5a0da287c7095ea14824ee0a8afd74daf6)


(Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }), Dataset({
     features: ['id', 'translation'],
     num_rows: 1000
 }), Dataset({
     features: ['id', 'translation'],
     num_rows: 8949
 }))
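
The three slice expressions above carve the single `train` split into disjoint, contiguous pieces. The sketch below reproduces that arithmetic in plain Python. `make_split_slices` and `split_sizes` are hypothetical helpers, and the total of 10,949 rows is simply the sum of the three `num_rows` values shown in the output.

```python
def make_split_slices(n_dev: int, n_test: int):
    """Return split strings: dev first, then test, then the rest as train."""
    return {
        "validation": f"train[:{n_dev}]",
        "test": f"train[{n_dev}:{n_dev + n_test}]",
        "train": f"train[{n_dev + n_test}:]",
    }


def split_sizes(n_rows: int, n_dev: int, n_test: int):
    """Row counts each slice yields for a dataset with n_rows rows."""
    if n_dev + n_test > n_rows:
        raise ValueError("dev + test must not exceed the dataset size")
    return {
        "validation": n_dev,
        "test": n_test,
        "train": n_rows - n_dev - n_test,
    }


slices = make_split_slices(1000, 1000)
sizes = split_sizes(10949, 1000, 1000)
print(slices)
print(sizes)
```

Because the slices are taken in corpus order without shuffling, the dev and test sets are the first 2,000 rows of the corpus.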

Inspect the data

In [None]:
tatoeba_dev['translation'][:3]

[{'en': 'I have to go to sleep.', 'hi': 'मुझे सोना है।'},
 {'en': 'Muiriel is 20 now.', 'hi': 'म्यूरियल अब बीस साल की हो गई है।'},
 {'en': 'Muiriel is 20 now.', 'hi': 'म्यूरियल अब बीस साल की है।'}]

In [None]:
tatoeba_test['translation'][:3]

[{'en': 'The problem is being discussed now.', 'hi': 'समस्या पर बहस जारी है।'},
 {'en': 'I have nothing to say with regard to that problem.',
  'hi': 'मुझे इस विषय पर कुछ नहीं कहना है।'},
 {'en': 'I have nothing to say with regard to that problem.',
  'hi': 'मुझे इस समस्या के बारे में कुछ नहीं कहना है।'}]

In [None]:
tatoeba_train['translation'][:3]

[{'en': "Today's minimum temperature was 3 °C.",
  'hi': 'आज का न्यूनतम तापमान तीन डिग्री सेल्सियस था।'},
 {'en': "I've not read today's paper yet.",
  'hi': 'मैंने अभी तक आज का अखबार नहीं पढ़ा है।'},
 {'en': "Bring me today's paper, please.",
  'hi': 'कृपया मेरे लिए आज का अख़बार लाना।'}]
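
The samples above show the same English sentence paired with more than one Hindi translation. One way to see how common this is would be to group translations by source sentence. The sketch below does that on a plain list of dicts copied from the dev output above. `group_by_source` is a hypothetical helper, and running it on a full split would assume that `tatoeba_dev['translation']` yields dicts of the same shape.

```python
from collections import defaultdict


def group_by_source(pairs, src="en", trg="hi"):
    """Map each source sentence to the list of its translations."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair[src]].append(pair[trg])
    return dict(grouped)


# The three dev examples printed above.
sample = [
    {"en": "I have to go to sleep.", "hi": "मुझे सोना है।"},
    {"en": "Muiriel is 20 now.", "hi": "म्यूरियल अब बीस साल की हो गई है।"},
    {"en": "Muiriel is 20 now.", "hi": "म्यूरियल अब बीस साल की है।"},
]

grouped = group_by_source(sample)
multi = {src: trgs for src, trgs in grouped.items() if len(trgs) > 1}
print(len(grouped), "unique source sentences")
print(len(multi), "with more than one translation")
```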

In [8]:
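import os
import pathlib

# Choose where to store the data based on the machine's hostname.
val = os.uname().nodename
if val == "nitt":
  # On the "nitt" host, use ../test/data/ relative to the working directory.
  data_dir_prefix = str(pathlib.Path().resolve().parent / "test" / "data") + "/"
else:
  # Everywhere else, write relative to the current working directory.
  data_dir_prefix = ""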


In [None]:
print(data_dir_prefix)

Save the train-dev-test splits to a local directory

In [None]:
from datasets.dataset_dict import DatasetDict

dataset_dict = DatasetDict({ 
  "train": tatoeba_train,
  "validation": tatoeba_dev,
  "test": tatoeba_test
})

data_dir = data_dir_prefix+"tatoeba_{lang}".format(lang=lang)
dataset_dict.save_to_disk(data_dir)
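
The config in the Vocabulary section points at `{data_dir}/train`, `{data_dir}/validation` and `{data_dir}/test`, so each split is expected to end up in its own subdirectory. A minimal sketch for checking that before moving on, using only the standard library, is shown below. `missing_split_dirs` is a hypothetical helper, and the demo runs on a temporary directory rather than the real `data_dir`.

```python
import tempfile
from pathlib import Path


def missing_split_dirs(data_dir, splits=("train", "validation", "test")):
    """Return the split names that have no subdirectory under data_dir."""
    root = Path(data_dir)
    return [name for name in splits if not (root / name).is_dir()]


# Demo on a throwaway directory that only has two of the three splits.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train", "validation"):
        (Path(tmp) / name).mkdir()
    missing = missing_split_dirs(tmp)
    print(missing)
```

In the notebook, calling `missing_split_dirs(data_dir)` after the save step should return an empty list if all three splits were written.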

### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords according to their frequency in the training corpus. The config calls this level `bpe`, although the training log below shows sentencepiece running with `model_type=unigram`.

The `build_vocab.py` script trains the subword model and creates a joint vocabulary. It takes the same config file as Joey NMT itself.

In [None]:
from pathlib import Path

# Create the config
config = """
name: "tatoeba_{lang}_sp"
joeynmt_version: "2.0.0"

data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    test: "{data_dir}/test"
    dataset_type: "huggingface"
    #dataset_cfg:           # not necessary for a manually saved pyarrow dataset
    #    name: "{lang1}-{lang2}"
    sample_dev_subset: 200
    src:
        lang: "{lang1}"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 8240
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

    trg:
        lang: "{lang2}"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 8240
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"

""".format(data_dir=data_dir,lang=lang,lang1=lang1,lang2=lang2)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)
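
The config is a plain Python string filled in with `str.format`, so every `{name}` placeholder must be supplied, and any literal brace would have to be doubled. The sketch below uses a shortened, hypothetical template to show both points. The `note` key is invented purely to demonstrate brace escaping and is not a Joey NMT option.

```python
from string import Formatter

template = """
name: "tatoeba_{lang}_sp"
data:
    train: "{data_dir}/train"
    note: "{{kept literally}}"
"""

# List the placeholder names the template expects before filling it in.
fields = {name for _, name, _, _ in Formatter().parse(template) if name}
print(sorted(fields))

rendered = template.format(lang="enhi", data_dir="tatoeba_enhi")
print(rendered)
```

Listing the fields first makes a missing keyword argument visible before `format` raises a `KeyError`.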

Download the script and call it with the `--joint` flag

In [None]:
! wget https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py

--2022-08-06 12:02:05--  https://raw.githubusercontent.com/joeynmt/joeynmt/main/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10277 (10K) [text/plain]
Saving to: ‘build_vocab.py’


2022-08-06 12:02:05 (95.6 MB/s) - ‘build_vocab.py’ saved [10277/10277]



In [None]:
!python build_vocab.py {data_dir}/config.yaml --joint

Dropping NaN...: 100% 9/9 [00:00<00:00, 131.95ba/s]
Preprocessing...: 100% 8944/8944 [00:00<00:00, 12440.65ex/s]
### Training sentencepiece...
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=/tmp/sentencepiece__4xj3a_z.txt --model_prefix=tatoeba_enhi/sp --model_type=unigram --vocab_size=8240 --character_coverage=1.0 --accept_language=en,hi --unk_piece=<unk> --bos_piece=<s> --eos_piece=</s> --pad_piece=<pad> --unk_id=0 --bos_id=2 --eos_id=3 --pad_id=1 --vocabulary_output_piece_score=false
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/sentencepiece__4xj3a_z.txt
  input_format: 
  model_prefix: tatoeba_enhi/sp
  model_type: UNIGRAM
  vocab_size: 8240
  accept_language: en
  accept_language: hi
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  m

The generated vocabulary looks like this:

In [None]:
!head -20 {data_dir}/vocab.txt

<unk>
<pad>
<s>
</s>
.
।
▁है
?
'
▁I
ी
े
s
ा
▁is
▁मैं
▁
▁the
▁to
▁नहीं
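
In the listing above, the first four entries are the special tokens, in the same order as the ids in the training log (`unk_id=0`, `pad_id=1`, `bos_id=2`, `eos_id=3`), and `▁` (U+2581) marks a piece that starts a new word. The sketch below assumes that a token's id is its line index in `vocab.txt` and shows how pieces are joined back into text. The segmentation in `pieces` is made up for illustration and is not actual model output.

```python
# First ten lines of the vocabulary shown above.
vocab_lines = ["<unk>", "<pad>", "<s>", "</s>", ".", "।", "▁है", "?", "'", "▁I"]

# Assumption: the id of a token is its zero-based line index in vocab.txt.
token_to_id = {token: index for index, token in enumerate(vocab_lines)}


def detokenize(pieces):
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


# Hypothetical segmentation, for illustration only.
pieces = ["▁the", "▁paper", "s", "."]
print(token_to_id["<pad>"], token_to_id["</s>"])
print(detokenize(pieces))
```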


✍ **Future Improvements**
1. Try [subword-nmt](https://github.com/rsennrich/subword-nmt) style BPE and enable BPE dropout.
2. Preprocessing options:
   - lowercase
   - normalize (which normalization will be applied?)
   - vocab size
   - pretokenize with Moses

## Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We generate it here so that the data and model paths defined above are filled in at the right places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the joint subword vocabulary created above.

In [None]:
model_dir = "models/tatoeba_{lang}".format(lang=lang)
config += """
testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    #load_model: "{model_dir}/latest.ckpt"
    #reset_best_ckpt: False
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 2000
    learning_rate: 0.0002
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 10
    updates: 20000
    validation_freq: 1000
    logging_freq: 100
    model_dir: "{model_dir}"
    overwrite: True
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)
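
With `scheduling: "warmupinversesquareroot"`, the learning rate is expected to rise during the first `learning_rate_warmup` updates and then decay with the inverse square root of the step. The sketch below uses a common formulation of that schedule with the values from the config. It is an assumption about the shape of the curve, and Joey NMT's own scheduler may differ in detail. `warmup_inv_sqrt_lr` is a hypothetical name.

```python
import math


def warmup_inv_sqrt_lr(step, peak=0.0002, warmup=2000, min_rate=1e-8):
    """Linear warmup to `peak`, then inverse-square-root decay."""
    if step < warmup:
        rate = peak * step / warmup
    else:
        rate = peak * math.sqrt(warmup / step)
    # Never drop below the configured minimum learning rate.
    return max(rate, min_rate)


for step in (100, 1000, 2000, 8000, 20000):
    print(step, f"{warmup_inv_sqrt_lr(step):.6f}")
```

Under this formulation the rate peaks at update 2,000 and has halved again by update 8,000, which is worth keeping in mind when `updates` is capped at 20,000.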

✍ **Hyper-parameter Tuning options**
We can tune the following hyperparameters:
1. dropout
2. label smoothing
3. layer normalization
4. tied embeddings
5. tied softmax
6. weight decay
7. learning rate warmup
8. batch multiplier (gradient accumulation)
9. early stopping with patience
10. fp16 (half-precision)
11. beam alpha (length penalty)
12. repetition penalty
13. ngram blocker
14. n best outputs
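
Two of the options above interact with values already set in the config. A rough sketch of each, under stated assumptions: with `batch_type: "token"`, `batch_size: 512` and `batch_multiplier: 4`, gradients from four batches are accumulated, so each update sees up to roughly 2,048 tokens. `beam_alpha` is assumed here to follow the GNMT-style length penalty `((5 + length) / 6) ** alpha`, which may not match Joey NMT's implementation exactly. The function names and example scores are hypothetical.

```python
def tokens_per_update(batch_size, batch_multiplier):
    """Upper bound on tokens contributing to one optimizer step."""
    return batch_size * batch_multiplier


def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form)."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha):
    """Hypothesis log-probability divided by its length penalty."""
    return log_prob / length_penalty(length, alpha)


print(tokens_per_update(512, 4))

# Made-up hypotheses: a short one and a longer, slightly less probable one.
for alpha in (0.0, 1.0):
    short_score = normalized_score(-4.0, length=4, alpha=alpha)
    long_score = normalized_score(-5.0, length=10, alpha=alpha)
    print(alpha, round(short_score, 3), round(long_score, 3))
```

With `alpha = 0` the shorter hypothesis wins on raw log-probability, while `alpha = 1.0` (the value in the config) favours the longer one in this example.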