# Lab 09

In this lab, we will use the OpenNMT-py library to train a neural machine translation (NMT) model on the toy English-German dataset.

This notebook originally comes from the OpenNMT-py quickstart:
https://github.com/OpenNMT/OpenNMT-py#quickstart

In [None]:
# Install OpenNMT-py 2.x
# NOTE: At the end of the installation, Colab might ask you to restart the runtime.
# In this case, just click the "RESTART RUNTIME" button.

!pip3 install git+https://github.com/OpenNMT/OpenNMT-py.git

In [None]:
# On Google Colab ONLY
# Reinstall Torch to avoid incompatibility with CUDA 10.1

# NOTE: At the end of the installation, Colab might ask you to restart the runtime.
# In this case, just click the "RESTART RUNTIME" button.

!pip3 install --ignore-installed torch==1.6.0 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
# Download the files of the QuickStart

!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz

In [None]:
# Optional: List the extracted files

!cd toy-ende/ && ls

In [None]:
# Optional: Print the first 3 lines of the source file

!head -n 3 toy-ende/src-train.txt

In [None]:
# Optional: Check the number of lines in the source file

!echo "Number of lines:" && wc -l toy-ende/src-train.txt
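
The source and target files are aligned line by line, so a useful extra check is that both files have the same number of lines. A minimal sketch, assuming the archive was extracted into `toy-ende/`; the helper names `count_lines` and `check_parallel` are illustrative and not part of OpenNMT-py:

```python
import os

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"Line count mismatch: {n_src} source vs {n_tgt} target")
    return n_src

# Paths assume the archive was extracted into toy-ende/ (an assumption of this sketch).
src, tgt = "toy-ende/src-train.txt", "toy-ende/tgt-train.txt"
if os.path.exists(src) and os.path.exists(tgt):
    print("Aligned sentence pairs:", check_parallel(src, tgt))
```

The same check can be repeated for the validation and test files before training on a new corpus.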

In [None]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano

config = '''# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example

## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

## Where the model will be saved
save_model: model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

world_size: 1
gpu_ranks: [0]

# Remove or modify these lines for bigger files
train_steps: 1000
valid_steps: 200
'''

with open("toy_en_de.yaml", "w") as config_yaml:
    config_yaml.write(config)

!cat toy_en_de.yaml
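
When moving to a bigger corpus, it can be convenient to change values such as `train_steps` and `valid_steps` programmatically instead of editing the string by hand. A minimal sketch, assuming the flat `key: value` layout used above; `override_top_level` is a hypothetical helper (not part of OpenNMT-py), it leaves nested keys such as those under `data:` untouched, and the step values are placeholders rather than recommendations:

```python
import re

def override_top_level(config_text, overrides):
    """Replace top-level `key: value` lines; append keys that are not present yet."""
    remaining = dict(overrides)
    out = []
    for line in config_text.splitlines():
        # Match only unindented `key: value` lines (comments and nested keys are skipped).
        match = re.match(r"^([A-Za-z_]\w*):\s*\S", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out.append(f"{key}: {remaining.pop(key)}")
        else:
            out.append(line)
    for key, value in remaining.items():
        out.append(f"{key}: {value}")
    return "\n".join(out) + "\n"

# Small stand-in for the configuration string defined above.
base = "save_model: model/model\ntrain_steps: 1000\nvalid_steps: 200\n"
print(override_top_level(base, {"train_steps": 5000, "valid_steps": 500}))
```

Writing each variant to its own file name makes it easier to report exactly which settings produced which score.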

In [None]:
# Build Vocabulary

!onmt_build_vocab -config toy_en_de.yaml -n_sample -1

In [None]:
# Check if GPU is active
# If not, go to "Runtime" menu > "Change runtime type" > "GPU"

!nvidia-smi -L

In [None]:
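# Make sure the GPU is visible to PyTorch

import torch

# Query the device only if CUDA is available; otherwise current_device() raises an error
print(torch.cuda.is_available())
if torch.cuda.is_available():
    gpu_id = torch.cuda.current_device()
    print(torch.cuda.get_device_name(gpu_id))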

In [None]:
# Train the NMT model

!onmt_train -config toy_en_de.yaml

In [None]:
# Translate

!onmt_translate -model model/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

Install SacreBLEU to evaluate the model

In [None]:
!pip install sacrebleu

In [None]:
!sacrebleu toy-ende/tgt-test.txt < toy-ende/pred_1000.txt

## Assignment

#### A1

  - Note down the BLEU score obtained above in the cell below.


* (note down results here)

#### A2 

 - Your assignment is to train a model with the OpenNMT library as shown above, but on a larger dataset.

 - You can use any parallel corpus available from [Samanantar](https://indicnlp.ai4bharat.org/samanantar/)

 - Train a model on a single language pair and evaluate it with the BLEU score, as shown above.

 - Also note down the hyperparameters used to train the model.

 - As a class, you can discuss among yourselves and collectively try different hyperparameters.

 - If the parallel corpus is hard to fit in GPU memory, you can use a smaller dataset. However, if you are collectively trying different hyperparameters, all of you should experiment with the same dataset.

 - (Optional) You can additionally apply byte-pair encoding (BPE) to the corpus and retrain the model. [The byte-pair encoding code is available in this notebook](https://github.com/cfiltnlp/IITB-English-Hindi-PC/blob/main/IITB_En_Hi_Get_Data.ipynb), which applies it to the [IITB English-Hindi Parallel Corpus](https://huggingface.co/datasets/cfilt/iitb-english-hindi).
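
If the full corpus is too large, one way to keep everyone in the class on the same smaller dataset is to draw the subset with a fixed random seed. A minimal sketch, assuming a line-aligned corpus already loaded as two lists of lines; the function name `sample_parallel`, the seed, and the subset size are hypothetical choices for illustration:

```python
import random

def sample_parallel(src_lines, tgt_lines, n_pairs, seed=13):
    """Pick n_pairs aligned (source, target) pairs repeatably, preserving corpus order."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("Source and target must have the same number of lines")
    n_pairs = min(n_pairs, len(src_lines))
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    keep = sorted(rng.sample(range(len(src_lines)), n_pairs))
    return [src_lines[i] for i in keep], [tgt_lines[i] for i in keep]

# Tiny in-memory demonstration; real use would read the corpus files instead.
src = [f"source sentence {i}" for i in range(10)]
tgt = [f"target sentence {i}" for i in range(10)]
sub_src, sub_tgt = sample_parallel(src, tgt, n_pairs=4)
for s, t in zip(sub_src, sub_tgt):
    print(s, "|||", t)
```

Because the same indices are applied to both sides, the sentence pairs stay aligned. Sharing the sampled files themselves is the safest way to guarantee that everyone trains on identical data.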