<a href="https://colab.research.google.com/github/nshah-waripari/nlp_transformers/blob/main/nmt_nepali_english_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## English to Nepali Translation with T5 Transformer Model

In [1]:
# Install the required libraries
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements(is_chapter2=True)

Cloning into 'notebooks'...
remote: Enumerating objects: 422, done.
remote: Counting objects: 100% (422/422), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 422 (delta 197), reused 411 (delta 191), pack-reused 0
Receiving objects: 100% (422/422), 24.99 MiB | 27.31 MiB/s, done.
Resolving deltas: 100% (197/197), done.
/content/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


### Load the dataset
1. We use the Nepali-English language pair from the Tatoeba Translation Challenge [https://github.com/Helsinki-NLP/Tatoeba-Challenge].
2. Train a SentencePiece model [https://github.com/google/sentencepiece] on the raw data, then load it with a Hugging Face tokenizer so that the text is tokenized in a language-agnostic way.


In [2]:
# first mount the drive
from google.colab import drive
drive.mount('content')

# URLs of the training and evaluation data, the pre-trained SentencePiece model, and its vocab file, all hosted on Dropbox
training_dataset_url = "https://www.dropbox.com/s/oey08zbsdsqjhqb/train.tsv.zip?dl=0" # train.tsv.zip: a zip archive that must be unzipped before use
eval_dataset_url = "https://www.dropbox.com/s/wbxha2sm4hitn6a/eval.tsv?dl=0" # eval.tsv
spm_model = "https://www.dropbox.com/s/divi5qh1atzpu5p/spm.model?dl=0" # spm.model
vocab = "https://www.dropbox.com/s/k129w9ylkosrxnv/spm.vocab?dl=0" # spm.vocab
# download train.tsv.zip
!wget -P data/ {training_dataset_url}
# download eval.tsv file
!wget -P data/ {eval_dataset_url}
# download spm.model file
!wget -P data/ {spm_model}
# download vocab file
!wget -P data/ {vocab}

Mounted at content
--2022-08-11 14:23:19--  https://www.dropbox.com/s/oey08zbsdsqjhqb/train.tsv.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/oey08zbsdsqjhqb/train.tsv.zip [following]
--2022-08-11 14:23:19--  https://www.dropbox.com/s/raw/oey08zbsdsqjhqb/train.tsv.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucfe7d023de7d98542e17fc261a1.dl.dropboxusercontent.com/cd/0/inline/Bq2k2UYN8FA4DJZCQFSd3XaWbpe1r8dkWEH4IwzWURZIASaMoVcH8pMxQDMFlPi8CRJdfZI7IED8WAeVGF35x4FPkXyslohXoc6B0pCeRi0E9W39KyPzVPHiG3H2oMF77zyb6hVk8oly7uEiJ6hZ69cNnaVvIBtOnuqs9s5oB5Jq2A/file# [following]
--2022-08-11 14:23:20--  https://ucfe7d023de7d98542e17fc261a1.dl.dropboxusercontent.com/cd/0/inline/Bq2k2UYN8FA4DJZCQFSd3XaWbpe1r8dkWEH4IwzWU

### Prepare Dataset

In [3]:
# quote the path so the shell does not treat "?" as a glob character
!unzip "data/train.tsv.zip?dl=0" -d "data"
!head -n 5 data/train.tsv
!ls data



Archive:  data/train.tsv.zip?dl=0
  inflating: data/train.tsv          
	prefix	input_text	target_text
0	translate nepali to english	'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।	The 'seahorse-agent' program exited unsuccessfully.
1	translate english to nepali	The 'seahorse-agent' program exited unsuccessfully.	'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।
2	translate nepali to english	अन्य कोठा	Other Rooms
3	translate english to nepali	Other Rooms	अन्य कोठा
'eval.tsv?dl=0'			   'spm.model?dl=0'   train.tsv
 github-issues-transformers.jsonl  'spm.vocab?dl=0'  'train.tsv.zip?dl=0'
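
The listing shows that `wget` kept the `?dl=0` query string in every saved filename, which is why later cells refer to paths such as `data/spm.model?dl=0`. A minimal sketch of one way to avoid this, using only the standard library: derive a clean local name from the URL path and, as an assumption about Dropbox share links, switch `dl=0` to `dl=1` to request the file directly. The helper names `clean_filename` and `direct_link` are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_filename(url: str) -> str:
    """Return the last path component of the URL, without the query string."""
    return PurePosixPath(urlsplit(url).path).name


def direct_link(url: str) -> str:
    """Set dl=1 in the query string (assumed to request a direct download)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


url = "https://www.dropbox.com/s/divi5qh1atzpu5p/spm.model?dl=0"
print(clean_filename(url))
print(direct_link(url))
```

The clean name could then be passed to `wget -O`, so that downstream paths need no `?dl=0` suffix.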


In [4]:
from datasets import load_dataset
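# names= overrides the column names, so the TSV's own header line is read as data row 0;
# the file's unnamed first column (the row index) shows up as __index_level_0__
nep_eng_ds = load_dataset("csv", data_files="data/train.tsv", sep="\t", names=["prefix", "input_text", "target_text"])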




Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-9597c4c534fbc676/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-9597c4c534fbc676/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a. Subsequent calls will reuse this data.



In [5]:
nep_eng_ds["train"][1]

{'__index_level_0__': 0.0,
 'input_text': "'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य भयो ।",
 'prefix': 'translate nepali to english',
 'target_text': "The 'seahorse-agent' program exited unsuccessfully."}

In [6]:
print(nep_eng_ds.column_names)

{'train': ['prefix', 'input_text', 'target_text', '__index_level_0__']}
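
The extra `__index_level_0__` column appears to come from the unnamed first column of `train.tsv`, which holds the row index. As a rough illustration of the file layout, the sketch below parses a two-row sample in the same format with the standard `csv` module, taking column names from the header line and dropping the index column. The sample string and the `read_rows` helper are made up for this example, and this is not how `load_dataset` works internally.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as train.tsv:
# an unnamed index column followed by prefix, input_text, target_text.
sample = (
    "\tprefix\tinput_text\ttarget_text\n"
    "0\ttranslate nepali to english\tअन्य कोठा\tOther Rooms\n"
    "1\ttranslate english to nepali\tOther Rooms\tअन्य कोठा\n"
)


def read_rows(text: str) -> list[dict]:
    """Parse TSV text, using the header line for names and dropping the index column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row.pop("", None)  # the unnamed first column is keyed by the empty string
        rows.append(row)
    return rows


rows = read_rows(sample)
print(rows[0])
```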


Convert the dataset into a DataFrame

In [7]:
import pandas as pd
nep_eng_ds.set_format(type="pandas")
# skip row 0, which holds the TSV header line rather than a sentence pair
df = nep_eng_ds["train"][1:]
df.head()

Unnamed: 0,prefix,input_text,target_text,__index_level_0__
0,translate nepali to english,'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य...,The 'seahorse-agent' program exited unsuccessf...,0.0
1,translate english to nepali,The 'seahorse-agent' program exited unsuccessf...,'सिहर्स-अभिकर्ता' कार्यक्रम असफलतापूर्ण अन्त्य...,1.0
2,translate nepali to english,अन्य कोठा,Other Rooms,2.0
3,translate english to nepali,Other Rooms,अन्य कोठा,3.0
4,translate nepali to english,कठीन स्तर सेट गर्नुहोस्,Set the difficulty level,4.0
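
Each row carries a `prefix` column naming the translation direction. T5-style models are usually given the task as a text prefix on the input, so one option is to join `prefix` and `input_text` before tokenizing. The sketch below shows that idea on a plain dictionary; the `build_model_input` helper and the `": "` separator are assumptions for illustration, not something this notebook does.

```python
def build_model_input(row: dict, separator: str = ": ") -> str:
    """Prepend the task prefix to the source sentence."""
    return f"{row['prefix']}{separator}{row['input_text']}"


row = {
    "prefix": "translate english to nepali",
    "input_text": "Other Rooms",
    "target_text": "अन्य कोठा",
}
print(build_model_input(row))
```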


Load T5 Tokenizer

In [8]:
from transformers import T5Tokenizer
spm_model_name = "data/spm.model?dl=0"
# T5 does not use a BOS token; "<s>" is not in this vocabulary, so it resolves to the <unk> id
tokenizer = T5Tokenizer(
    spm_model_name,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)

# let's examine how the tokenizer works
# text = "Tokenizing text is a core task of NLP."
text = "काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता"
encoded_text = tokenizer(text)
print(encoded_text)



{'input_ids': [2821, 23448, 3169, 14296, 1], 'attention_mask': [1, 1, 1, 1, 1]}


In [16]:
# # Tokenizer from pre-trained T5 model does not recognize nepali tokens at all
# tokenizer = T5Tokenizer.from_pretrained("t5-base")
# text = "काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता"
# encoded_text = tokenizer(text)
# print(encoded_text)

In [9]:
# convert the token ids back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))
# Tokenizer class has some important attributes like
# vocab size
print(tokenizer.vocab_size)
# max length based on model's context
print(tokenizer.model_max_length)
# tokenizer's field names
print(tokenizer.model_input_names)
# print special token ids
print(f"{tokenizer.eos_token}:{tokenizer.eos_token_id}")
print(f"{tokenizer.bos_token}:{tokenizer.bos_token_id}")
print(f"{tokenizer.unk_token}:{tokenizer.unk_token_id}")
print(f"{tokenizer.pad_token}:{tokenizer.pad_token_id}")

['▁काठमाडौं', '▁उपत्यकामा', '▁प्राकृतिक', '▁सुन्दरता', '</s>']
काठमाडौं उपत्यकामा प्राकृतिक सुन्दरता</s>
32100
1000000000000000019884624838656
['input_ids', 'attention_mask']
</s>:1
<s>:2
<unk>:2
<pad>:0
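
The `▁` character in the printed tokens is the SentencePiece marker for a preceding space. A minimal sketch of what detokenization amounts to for this example, in plain Python; the `pieces_to_text` helper is hypothetical and skips the special-token handling that the real tokenizer performs.

```python
def pieces_to_text(pieces: list[str]) -> str:
    """Join SentencePiece pieces and turn the word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip()


tokens = ['▁काठमाडौं', '▁उपत्यकामा', '▁प्राकृतिक', '▁सुन्दरता', '</s>']
print(pieces_to_text(tokens))
```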


Prepare input dataset for the model

In [27]:
import torch
from torch.utils.data import DataLoader, Dataset

class LanguageDataset(Dataset):
  def __init__(self,
      data: pd.DataFrame = df,
      tokenizer: T5Tokenizer = tokenizer,
      input_max_len: int = 200,
      target_max_len: int = 200
  ):
    self.data = data
    self.tokenizer = tokenizer
    self.input_max_len = input_max_len
    self.target_max_len = target_max_len

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx: int):
    data_row = self.data.iloc[idx]
    input_text = data_row["input_text"]
    target_text = data_row["target_text"]
    # encode the input text
    input_text_encoded = self.tokenizer(
        input_text,
        max_length = self.input_max_len,
        padding = 'longest',
        truncation = True,
        return_attention_mask = True,
        add_special_tokens = True,
        return_tensors = 'pt'
    )
    input_ids, attention_mask = input_text_encoded.input_ids, input_text_encoded.attention_mask

    # encode the target text
    target_text_encoded = self.tokenizer(
        target_text,
        max_length = self.target_max_len,
        padding = 'longest',
        truncation = True,
        add_special_tokens = True,
        return_attention_mask = True,
        return_tensors = 'pt'
    )
    target_input_ids, target_attention_mask = target_text_encoded.input_ids, target_text_encoded.attention_mask

    # set the padding token ids of the target text to -100
    # so that they are ignored by the loss during training
    target_input_ids = target_input_ids.clone()
    target_input_ids[target_input_ids == self.tokenizer.pad_token_id] = -100

    return dict(
        input_ids = input_ids.flatten(),
        attention_mask = attention_mask.flatten(),
        target_input_ids = target_input_ids.flatten(),
        target_attention_mask = target_attention_mask.flatten()

    )





In [28]:
dataset = LanguageDataset(df, tokenizer)
dataloader = DataLoader(dataset, shuffle=True, batch_size=8)
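
Because `__getitem__` encodes one sentence at a time, `padding='longest'` leaves each example at its own length, and a batch of 8 can only be stacked once all examples share a common length. One common approach is a custom collate function that pads per batch. The sketch below shows the padding logic on plain Python lists standing in for tensors; `pad_batch` and `collate` are hypothetical helpers, the pad id 0 matches the `<pad>` id printed earlier, and -100 is the label value this notebook uses for ignored positions.

```python
def pad_batch(sequences: list[list[int]], pad_value: int) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


def collate(examples: list[dict], pad_token_id: int = 0, ignore_index: int = -100) -> dict:
    """Pad inputs with the pad id, masks with 0, and labels with the ignore index."""
    input_ids = pad_batch([ex["input_ids"] for ex in examples], pad_token_id)
    attention_mask = pad_batch([[1] * len(ex["input_ids"]) for ex in examples], 0)
    labels = pad_batch([ex["labels"] for ex in examples], ignore_index)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids, only to exercise the padding logic.
batch = collate([
    {"input_ids": [167, 30301, 1], "labels": [2821, 1]},
    {"input_ids": [41, 1], "labels": [3169, 14296, 1]},
])
print(batch)
```

With real tensors, a function of this shape could be passed to `DataLoader` through its `collate_fn` argument after converting the padded lists with `torch.tensor`.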

In [33]:
dataset[0]['input_ids']



tensor([  167, 30301,    17, 17225,    41,   541, 10923,  2058,  1184,   158,
           11,     1])

In [37]:
len(dataset)

6713990

From Datasets to DataFrame

In [None]:
import pandas as pd
train_ds.set_format(type="pandas")
df = train_ds[:]
df.head()


Unnamed: 0,id,translation
0,0,"{'en': 'Add Feed to Akregator', 'ne': 'एक्रिगे..."
1,1,"{'en': 'Add Feeds to Akregator', 'ne': 'एक्रिग..."
2,2,"{'en': 'Add All Found Feeds to Akregator', 'ne..."
3,3,{'en': 'Subscribe to site updates (using news ...
4,4,"{'en': 'Imported Feeds', 'ne': 'आयातित फिड'}"


In [None]:
from transformers import pipeline
model_checkpoint = "Helsinki-NLP/opus-mt-en-ne"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

RepositoryNotFoundError: ignored

In [None]:
from transformers import T5ForConditionalGeneration

ImportError: ignored

In [None]:
import sentencepiece as spm

ModuleNotFoundError: ignored
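
The last cells fail at import time, which may point to a package missing from the runtime. A small sketch for checking this up front with the standard library; the `missing_packages` helper is hypothetical, and the package list is an assumption based on the imports used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level packages that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


required = ["sentencepiece", "transformers", "datasets", "torch"]
print(missing_packages(required))
```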