<a href="https://colab.research.google.com/github/nshah-waripari/nlp_transformers/blob/main/nmt_nepali_english_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## English to Nepali Translation with T5 Transformer Model

In [1]:
# Install the required libraries
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements(is_chapter2=True)

Cloning into 'notebooks'...
remote: Enumerating objects: 422, done.
remote: Counting objects: 100% (422/422), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 422 (delta 197), reused 411 (delta 191), pack-reused 0
Receiving objects: 100% (422/422), 24.99 MiB | 27.54 MiB/s, done.
Resolving deltas: 100% (197/197), done.
/content/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


### Load the dataset
1. We use the Nepali-English language pair from the Tatoeba Translation Challenge [https://github.com/Helsinki-NLP/Tatoeba-Challenge].
2. A SentencePiece model [https://github.com/google/sentencepiece] trained on the raw data is loaded into a Hugging Face tokenizer. SentencePiece works directly on raw text, which keeps the tokenization language agnostic.


In [2]:
# first mount the drive
from google.colab import drive
drive.mount('content')

# URLs of the training and evaluation data, the pre-trained SentencePiece model, and its vocab file (all hosted on Dropbox)
training_dataset_url = "https://www.dropbox.com/s/oey08zbsdsqjhqb/train.tsv.zip?dl=0" # zipped train.tsv; must be unzipped before use
eval_dataset_url = "https://www.dropbox.com/s/wbxha2sm4hitn6a/eval.tsv?dl=0" # eval.tsv
spm_model = "https://www.dropbox.com/s/divi5qh1atzpu5p/spm.model?dl=0" # spm.model
vocab = "https://www.dropbox.com/s/k129w9ylkosrxnv/spm.vocab?dl=0" # spm.vocab
# download train.tsv.zip (wget keeps the "?dl=0" query string in the saved file names)
!wget -P data/ {training_dataset_url}
# download eval.tsv file
!wget -P data/ {eval_dataset_url}
# download spm.model file
!wget -P data/ {spm_model}
# download vocab file
!wget -P data/ {vocab}

Mounted at content
--2022-08-13 02:12:02--  https://www.dropbox.com/s/oey08zbsdsqjhqb/train.tsv.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/oey08zbsdsqjhqb/train.tsv.zip [following]
--2022-08-13 02:12:02--  https://www.dropbox.com/s/raw/oey08zbsdsqjhqb/train.tsv.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc588a19dd37ac4b7c9af8c9b61b.dl.dropboxusercontent.com/cd/0/inline/Bq5g3QRp62ryVxQksTyE68E0EdGg6PAQSOhmfmorPsYDwW5HqnXhUW0W2QZoOXhOfJE7Bb4OHx6OoAuaXZkmMlJJv82HcFvaN_pzBxv_f2BLnvnYEfhhRSQ_viWtRmpoFkhf55JOn2Rx7OXx4CgT5MRyOw-CFD5Bsj0iKM1p1Aoawg/file# [following]
--2022-08-13 02:12:02--  https://uc588a19dd37ac4b7c9af8c9b61b.dl.dropboxusercontent.com/cd/0/inline/Bq5g3QRp62ryVxQksTyE68E0EdGg6PAQSOhmfmorP

### Prepare Dataset

In [3]:
!unzip "data/train.tsv.zip?dl=0" -d "data"
!head -n 5 data/train.tsv
!ls data



Archive:  data/train.tsv.zip?dl=0
  inflating: data/train.tsv          
	prefix	input_text	target_text
0	translate nepali to english	'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।	The 'seahorse-agent' program exited unsuccessfully.
1	translate english to nepali	The 'seahorse-agent' program exited unsuccessfully.	'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।
2	translate nepali to english	अन्य कोठा	Other Rooms
3	translate english to nepali	Other Rooms	अन्य कोठा
'eval.tsv?dl=0'			   'spm.model?dl=0'   train.tsv
 github-issues-transformers.jsonl  'spm.vocab?dl=0'  'train.tsv.zip?dl=0'
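
The first rows of `train.tsv` suggest that every sentence pair appears twice, once per translation direction, each with its own task prefix. A minimal sketch of how such rows could be built from one parallel pair is shown here. The helper name `make_rows` is hypothetical and not part of the original pipeline; the `prefix + ": " + input_text` join mirrors what the dataset class in this notebook does when it encodes the input.

```python
from typing import Dict, List


def make_rows(nepali: str, english: str) -> List[Dict[str, str]]:
    """Hypothetical helper: emit one row per translation direction."""
    return [
        {"prefix": "translate nepali to english",
         "input_text": nepali, "target_text": english},
        {"prefix": "translate english to nepali",
         "input_text": english, "target_text": nepali},
    ]


rows = make_rows("अन्य कोठा", "Other Rooms")
for row in rows:
    # The model input is the task prefix, a colon, and the source sentence.
    model_input = row["prefix"] + ": " + row["input_text"]
    print(model_input, "->", row["target_text"])
```

Keeping the direction in a text prefix lets a single model serve both directions, at the cost of doubling the number of rows.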


In [4]:
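from datasets import load_dataset
# train.tsv already contains a header line and an unnamed index column. Because `names`
# lists only three columns, the header line appears to be loaded as row 0 and the
# index column shows up as `__index_level_0__` (see the outputs below).
nep_eng_ds = load_dataset("csv", data_files="data/train.tsv", sep="\t", names=["prefix", "input_text", "target_text"])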




Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-9597c4c534fbc676/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-9597c4c534fbc676/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [5]:
nep_eng_ds["train"][1]

{'__index_level_0__': 0.0,
 'input_text': "'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।",
 'prefix': 'translate nepali to english',
 'target_text': "The 'seahorse-agent' program exited unsuccessfully."}

In [6]:
print(nep_eng_ds.column_names)

{'train': ['prefix', 'input_text', 'target_text', '__index_level_0__']}


Convert the dataset into a DataFrame

In [5]:
import pandas as pd
nep_eng_ds.set_format(type="pandas")
# skip row 0, which holds the header line of train.tsv rather than a sentence pair
df = nep_eng_ds["train"][1:]
df.tail()

| | prefix | input_text | target_text | `__index_level_0__` |
|---|---|---|---|---|
| 6713985 | translate english to nepali | All three of these are laboured way past the p... | उनका अनुसार दैनिक तीन(चारजनाको सफल शल्यक्रियास... | 6713985.0 |
| 6713986 | translate nepali to english | प्रभु येशू ख्रीष्टलाई हामीले प्रेम गर्नुका कार... | 8 Our brothers around the globe persevere in o... | 6713986.0 |
| 6713987 | translate english to nepali | 8 Our brothers around the globe persevere in o... | प्रभु येशू ख्रीष्टलाई हामीले प्रेम गर्नुका कार... | 6713987.0 |
| 6713988 | translate nepali to english | गुप्तिकरण गर्न असफल भयो: कुनै वैध प्रापक निर्द... | Can not encrypt this message: no recipients sp... | 6713988.0 |
| 6713989 | translate english to nepali | Can not encrypt this message: no recipients sp... | गुप्तिकरण गर्न असफल भयो: कुनै वैध प्रापक निर्द... | 6713989.0 |


In [6]:
df.shape

(6713990, 4)

**Data Clean-up**

In [7]:
# check for NaN and empty values in dataframe
missing_cols, missing_rows = (
    (df.isnull().sum(x) | df.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
df.loc[missing_rows, missing_cols]

| | input_text | target_text |
|---|---|---|
| 3092 | | CLD2: problems with line The Prophet (sallalla... |
| 3093 | CLD2: problems with line The Prophet (sallalla... | |
| 32916 | | CLD2: problems with line  Excellent through v... |
| 32917 | CLD2: problems with line  Excellent through v... | |
| 38222 | | CLD2: problems with line 3 And the sons of the... |
| ... | ... | ... |
| 6683955 | CLD2: problems with line  And turning, he reb... | |
| 6704310 | | CLD2: problems with line He said: It was just... |
| 6704311 | CLD2: problems with line He said: It was just... | |
| 6708616 | | CLD2: problems with line Theirs will be the cu... |


In [8]:
# drop rows that contain NaN values (note: dropna() does not remove empty strings)
df = df.dropna()
print(df.shape)

(6712770, 4)


Load T5 Tokenizer

In [10]:
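from transformers import T5Tokenizer
spm_model_name = "data/spm.model?dl=0"
# do_lower_case, do_basic_tokenize and padding are BERT-style options that the
# SentencePiece-based T5Tokenizer does not define, so they most likely have no effect here.
# "<s>" does not seem to be a piece in this vocabulary: the ids printed further down
# show bos_token resolving to the same id as <unk>.
tokenizer = T5Tokenizer(
    spm_model_name,
    do_lower_case=True, do_basic_tokenize=True,
    padding=True, bos_token="<s>",
    eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>",
)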

# let's examine how the tokenizer works
# text = "Tokenizing text is a core task of NLP."
text = "काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता"
encoded_text = tokenizer(text)
print(encoded_text)



{'input_ids': [2821, 23448, 3169, 14296, 1], 'attention_mask': [1, 1, 1, 1, 1]}


In [16]:
# # Tokenizer from pre-trained T5 model does not recognize nepali tokens at all
# tokenizer = T5Tokenizer.from_pretrained("t5-base")
# text = "काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता"
# encoded_text = tokenizer(text)
# print(encoded_text)

In [11]:
# convert the token ids back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))
# Tokenizer class has some important attributes like
# vocab size
print(tokenizer.vocab_size)
# max length based on model's context
print(tokenizer.model_max_length)
# tokenizer's field names
print(tokenizer.model_input_names)
# print special token ids
print(f"{tokenizer.eos_token}:{tokenizer.eos_token_id}")
print(f"{tokenizer.bos_token}:{tokenizer.bos_token_id}")
print(f"{tokenizer.unk_token}:{tokenizer.unk_token_id}")
print(f"{tokenizer.pad_token}:{tokenizer.pad_token_id}")

['▁काठमाडौं', '▁उपत्यकामा', '▁प्राकृतिक', '▁सुन्दरता', '</s>']
काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता</s>
32100
1000000000000000019884624838656
['input_ids', 'attention_mask']
</s>:1
<s>:2
<unk>:2
<pad>:0


Prepare input dataset for the model

In [12]:
import torch
from torch.utils.data import DataLoader, Dataset

class LanguageDataset(Dataset):
  def __init__(self,
      data: pd.DataFrame = df,
      tokenizer: T5Tokenizer = tokenizer,
      input_max_len: int = 100,
      target_max_len: int = 100
  ):
    self.data = data
    self.tokenizer = tokenizer
    self.input_max_len = input_max_len
    self.target_max_len = target_max_len

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx: int):
    data_row = self.data.iloc[idx]
    prefix_text = data_row["prefix"]
    input_text = data_row["input_text"]
    target_text = data_row["target_text"]
    # encode the input text
    input_text_encoded = self.tokenizer(
        prefix_text + ": " + input_text,
        max_length = self.input_max_len,
        padding = 'max_length',
        truncation = True,
        return_attention_mask = True,
        add_special_tokens = True,
        return_tensors = 'pt'
    )
    # input_ids, attention_mask = input_text_encoded.input_ids, input_text_encoded.attention_mask

    # encode the target text
    target_text_encoded = self.tokenizer(
        target_text,
        max_length = self.target_max_len,
        padding = 'max_length',
        truncation = True,
        add_special_tokens = True,
        return_attention_mask = True,
        return_tensors = 'pt'
    )
    # target_input_ids, target_attention_mask = target_text_encoded.input_ids, target_text_encoded.attention_mask

    # replace the padding token ids in the labels with -100, the value the
    # cross-entropy loss ignores, so padding does not contribute to the training loss
    target_input_ids = target_text_encoded['input_ids']
    target_input_ids[target_input_ids == self.tokenizer.pad_token_id] = -100

    return dict(
        input_ids = input_text_encoded['input_ids'].flatten(),
        attention_mask = input_text_encoded['attention_mask'].flatten(),
        labels = target_input_ids.flatten(),
        decoder_attention_mask = target_text_encoded['attention_mask'].flatten()
    )
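
What `__getitem__` does to the target ids can be seen without a tokenizer. The sketch below is plain Python under stated assumptions: `pad_and_mask` is a hypothetical helper, the ids are the ones printed earlier for the Nepali example sentence, and the pad id of 0 matches the `<pad>:0` output above. It truncates, pads to a fixed length, builds the attention mask, and swaps pad positions in the labels for -100.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(ids: List[int], max_len: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Hypothetical helper mirroring padding='max_length' plus label masking."""
    # Simplified truncation: the real tokenizer keeps the closing </s> when it truncates.
    ids = ids[:max_len]
    n_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Same rule as the dataset class: every pad id in the labels becomes -100.
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = pad_and_mask([2821, 23448, 3169, 14296, 1], max_len=8)
print(example["input_ids"])       # [2821, 23448, 3169, 14296, 1, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(example["labels"])          # [2821, 23448, 3169, 14296, 1, -100, -100, -100]
```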





In [13]:
# train test split
from sklearn.model_selection import train_test_split
df_train, df_eval = train_test_split(df, test_size=0.2)
print(df_train.shape)
print(df_eval.shape)
dataset = LanguageDataset(df_train, tokenizer)
dataloader = DataLoader(dataset, shuffle=True, batch_size=32)

(5370216, 4)
(1342554, 4)


In [14]:
print(dataset[0]['input_ids'])
print(dataset[0]['attention_mask'])
print(dataset[0]['labels'])
print(dataset[0]['decoder_attention_mask'])


tensor([15540,     7, 22593,     9,     7,  5990, 16745,    10,  1033,     1,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [15]:
len(dataset)

5370216

### Training T5 Model from Scratch

In [16]:
from transformers import T5ForConditionalGeneration, T5Config
import torch

config = T5Config(
    vocab_size = tokenizer.vocab_size,
    pad_token_id = tokenizer.pad_token_id,
    eos_token_id = tokenizer.eos_token_id,
    decoder_start_token_id = tokenizer.pad_token_id,
    d_model = 300
)
model = T5ForConditionalGeneration(config)

# count the number of parameters of the given model
def model_size(model):
  return sum(t.numel() for t in model.parameters())

print(f"T5 size: {model_size(model)/1000**2:.1f}M parameters")




T5 size: 35.4M parameters
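
The reported size can be cross-checked by hand. The sketch below assumes the `T5Config` defaults that the cell above leaves untouched (6 encoder and 6 decoder layers, 8 heads with `d_kv = 64`, `d_ff = 2048`, 32 relative-attention buckets, bias-free linear layers, input and output embeddings tied) and the vocabulary size of 32100 printed earlier. If any of those defaults differ in your `transformers` version, the total will differ too.

```python
vocab_size, d_model = 32100, 300
num_layers, num_heads, d_kv, d_ff, num_buckets = 6, 8, 64, 2048, 32
inner_dim = num_heads * d_kv                 # attention width, independent of d_model

embedding = vocab_size * d_model             # one matrix shared by encoder, decoder and LM head
attention = 4 * d_model * inner_dim          # q, k, v and o projections, no biases
feed_forward = 2 * d_model * d_ff            # input and output projections, no biases
layer_norm = d_model                         # scale vector only
rel_bias = num_buckets * num_heads           # relative position bias, first layer of each stack

# Encoder layer: self-attention + feed-forward, each with its own layer norm.
encoder = num_layers * (attention + feed_forward + 2 * layer_norm) + rel_bias + layer_norm
# Decoder layer: self-attention + cross-attention + feed-forward, three layer norms.
decoder = num_layers * (2 * attention + feed_forward + 3 * layer_norm) + rel_bias + layer_norm

total = embedding + encoder + decoder
print(f"{total:,} parameters = {total / 1000**2:.1f}M")
```

Under these assumptions the embedding matrix alone accounts for about 9.6M of the parameters, so the vocabulary size is a major lever on model size at this width.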


In [None]:
# Train the model

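from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import get_scheduler
from tqdm.auto import tqdm

num_epochs = 5
num_training_steps = num_epochs * len(dataloader)
progress_bar = tqdm(range(num_training_steps))

# lr and weight_decay are spelled out to match the defaults of the former transformers.AdamW
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)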
lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

model.train()
for epoch in range(num_epochs):
  for batch in dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    outputs = model(**batch)

    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()

    optimizer.zero_grad()
    progress_bar.update()

  # save a checkpoint at the end of every epoch (the same file is overwritten each time)
  torch.save({
      'epoch': epoch,
      'model_state_dict': model.state_dict(),
      'optimizer_state_dict': optimizer.state_dict(),
      'loss': loss.item(),  # loss of the last batch, stored as a plain float
      }, 'data/t5_nep_to_eng.pth')
  print(f"epoch: {epoch + 1} -- loss: {loss.item():.4f}")




  0%|          | 0/839100 [00:00<?, ?it/s]
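
The progress bar total of 839100 follows from the training set size and the batch size, and the `"linear"` schedule with zero warmup scales the learning rate down to zero over those steps. A small sketch in plain Python, where `linear_factor` is a hypothetical stand-in for the multiplier that the scheduler applies to the base learning rate:

```python
import math

num_examples, batch_size, num_epochs = 5_370_216, 32, 5
# DataLoader keeps the last, partially filled batch unless drop_last=True.
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(num_training_steps)  # 839100


def linear_factor(step: int, total: int, warmup: int = 0) -> float:
    """Hypothetical stand-in: linear warmup followed by linear decay to zero."""
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))


for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_factor(step, num_training_steps))
```

With zero warmup the very first update already uses the full base learning rate, which is worth keeping in mind when training from randomly initialized weights.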