<a href="https://colab.research.google.com/github/johncoder-30/NLP-translation-model/blob/main/transformer_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Transformer-based model to translate Tamil sentences to English
A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the field of natural language processing (NLP) and in computer vision (CV).
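
To make "differentially weighting the significance of each part of the input" concrete, here is a minimal pure-Python sketch of single-head scaled dot-product self-attention. It is an illustration only: it assumes identity projections (queries, keys and values are the token vectors themselves), whereas a real transformer layer learns separate projections and uses several heads. The names `softmax`, `self_attention` and `tokens` are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x: list of token vectors (lists of floats of equal length).
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(x[0])
    weights = []
    for query in x:
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors
    outputs = [
        [sum(w * value[c] for w, value in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d "embeddings"
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the input vectors.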

#Downloading the dataset

In [2]:
!wget https://storage.googleapis.com/samanantar-public/V0.2/data/en2indic/en-ta.zip
!unzip "/content/en-ta.zip" -d "/content/data/"
with open('/content/data/en-ta/train.en', 'r', encoding='utf8') as f:
    english_raw = f.read().split('\n')
with open('/content/data/en-ta/train.ta', 'r', encoding='utf8') as f:
    tamil_raw = f.read().split('\n')

print(len(english_raw), len(tamil_raw))

--2022-02-03 09:15:47--  https://storage.googleapis.com/samanantar-public/V0.2/data/en2indic/en-ta.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.216.128, 173.194.217.128, 173.194.218.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.216.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1377236241 (1.3G) [application/zip]
Saving to: ‘en-ta.zip’


2022-02-03 09:15:57 (129 MB/s) - ‘en-ta.zip’ saved [1377236241/1377236241]

Archive:  /content/en-ta.zip
   creating: /content/data/en-ta/
 extracting: /content/data/en-ta/train.ta  
 extracting: /content/data/en-ta/train.en  
5095764 5095764


##Importing libraries

In [3]:
# torchtext.legacy is only available in torchtext 0.9 to 0.11; it was removed in 0.12
from torchtext.legacy.data import Field, BucketIterator, TabularDataset
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import torch
import torch.nn as nn
import torch.optim as optim
import random

##Preprocessing Dataset

In [4]:
# print(english_raw[10],tamil_raw[10])
# Use only the first 20,000 sentence pairs
raw_data = {'English': english_raw[:20000],
            'Tamil': tamil_raw[:20000]}
df = pd.DataFrame(raw_data, columns=['English', 'Tamil'])
# Drop pairs where either side has 100 or more space-separated tokens
df = df[df['English'].str.split(' ').map(len) < 100]
df = df[df['Tamil'].str.split(' ').map(len) < 100]
train, test = train_test_split(df, test_size=0.05,random_state=1234)
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
print(df.shape)
df

(19991, 2)


Unnamed: 0,English,Tamil
0,That's what I am saying.,என்றுதான் நான் சொல்ல வருகிறேன்.
1,Every tournament is difficult.,ஒவ்வொரு சுற்றுப்பயணமும் கடினமானது.
2,"One of the first questions Flavio posed was, D...",பல வருடங்களாக அவர் அந்த நித்திய எரிநரக தண்டனைய...
3,He gave full credit to the Union Finance Minis...,அவர் நிதி அமைச்சர் அருண்ஜேட்லியின் முயற்சியை த...
4,Some art historians have suggested that he onl...,சில கலை வரலாற்றாசிரியர்கள் அவர் ஒரு வருடத்திற்...
...,...,...
19995,Events in the Balkans led to the outbreak of W...,ஏனெனில் 1914 ஜூலையில்சேர்பியாவுக்கு ஆஸ்திரியா ...
19996,That is very important and should not be omitted.,"இது மிக முக்கியமான ஒன்றாகும், மேலும் இதை தாமதப..."
19997,Food and water,தண்ணீரும் பாலும்
19998,For many decades the extreme wealth in India w...,பல தசாப்தங்களாக இந்தியாவில் வளம் டாடாக்கள் மற்...


In [5]:
def tokenize_eng(sentence):
    # Drop newlines, lowercase, strip ASCII punctuation, then split on whitespace
    sentence = re.sub(r'\n', '', sentence)
    # sentence = re.sub(r'[^\w\s\']', '', sentence.lower())
    sentence = re.sub(r'[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]', '', sentence.lower())
    return sentence.split()

print(tokenize_eng('She says she knows what is going on, but can do nothing about it.'))

def tokenize_tam(sentence):
    # Drop newlines and any parenthesised text, strip ASCII punctuation, then split on whitespace
    sentence = re.sub(r'\n', '', sentence)
    sentence = re.sub(r'\([^)]*\)', '', sentence)
    sentence = re.sub(r'[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]', '', sentence)
    return sentence.split()

print(tokenize_tam('என்ன நடக்கிறது என்பது தமக்கு தெரியும் என்றும் ஆனால், தம்மால் எதுவும் செய்யமுடியாது என்றும் கடிதம் எழுதியிருந்தார்.'))

['she', 'says', 'she', 'knows', 'what', 'is', 'going', 'on', 'but', 'can', 'do', 'nothing', 'about', 'it']
['என்ன', 'நடக்கிறது', 'என்பது', 'தமக்கு', 'தெரியும்', 'என்றும்', 'ஆனால்', 'தம்மால்', 'எதுவும்', 'செய்யமுடியாது', 'என்றும்', 'கடிதம்', 'எழுதியிருந்தார்']
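
The character class used in both tokenizers lists the same 32 characters as Python's `string.punctuation`, so the punctuation step can also be written with a translation table. The sketch below is an alternative for illustration, not the notebook's code; `simple_tokenize` is a hypothetical name, and it covers only the punctuation and lowercasing steps (not the removal of parenthesised text done for Tamil).

```python
import string

# Assumption: the regex character class above is equivalent to string.punctuation,
# so a single translation table can delete the same characters.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def simple_tokenize(sentence, lowercase=True):
    """Hypothetical helper: drop ASCII punctuation, then split on whitespace."""
    if lowercase:
        sentence = sentence.lower()
    return sentence.translate(_STRIP_PUNCT).split()

print(simple_tokenize('She says she knows what is going on, but can do nothing about it.'))
print(simple_tokenize('<sos> hello <eos>'))  # angle brackets are stripped too
```

The second call shows a side effect worth keeping in mind: because `<` and `>` count as punctuation, special tokens such as `<sos>` typed into a raw string come out as plain `sos`. Special tokens are therefore safer to add as tokens after tokenizing.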


##Using the torchtext library to
>1. tokenize sentences,
>2. build the vocabulary,
>3. split the data into batches for training on the GPU

In [6]:
english = Field(init_token='<sos>', eos_token='<eos>', tokenize=tokenize_eng, lower=True, batch_first=False)
tamil = Field(init_token='<sos>', eos_token='<eos>', tokenize=tokenize_tam, lower=False, batch_first=False)
fields = {'English': ('eng', english), 'Tamil': ('tam', tamil)}
train_data, test_data = TabularDataset.splits(path='', train='train.csv', test='test.csv', format='csv', fields=fields)
english.build_vocab(train_data, max_size=10000, min_freq=2)
tamil.build_vocab(train_data, max_size=10000, min_freq=2)
train_iterator, test_iterator = BucketIterator.splits((train_data, test_data),
                                                      batch_size=128, device='cuda', sort_key=lambda x: len(x.tam),
                                                      sort_within_batch=True)
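
As a rough picture of what `build_vocab(max_size=..., min_freq=...)` does, the following pure-Python sketch builds a vocabulary from tokenized sentences and encodes a sentence with it. This is an approximation written for illustration, not torchtext's implementation: the helper names and the order of the special tokens are assumptions. It is at least consistent with the Tamil vocabulary size of 10004 printed later in the notebook, which is `max_size=10000` plus four special tokens.

```python
from collections import Counter

def build_vocab(token_lists, max_size=10000, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Hypothetical stand-in for Field.build_vocab: returns (itos, stoi)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Words below min_freq are dropped; max_size does not count the specials
    kept = [tok for tok, freq in ranked if freq >= min_freq][:max_size]
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi):
    """Map tokens to ids, wrapping with <sos>/<eos> and falling back to <unk>."""
    unk = stoi['<unk>']
    body = [stoi.get(tok, unk) for tok in tokens]
    return [stoi['<sos>']] + body + [stoi['<eos>']]

corpus = [['food', 'and', 'water'], ['food', 'and', 'shelter'], ['water', 'is', 'life']]
itos, stoi = build_vocab(corpus, max_size=10, min_freq=2)
print(itos)
print(encode(['food', 'and', 'music'], stoi))
```

Words seen only once (`shelter`, `is`, `life`) and unseen words (`music`) all map to `<unk>`, which is why `<unk>` shows up in some of the translations printed during training.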

##Transformer Model

In [13]:
class Transformer_model(nn.Module):
    def __init__(self, embedding_size, src_vocab_size, trg_vocab_size, src_pad_idx, num_heads, num_encoder_layers,
                 num_decoder_layers, feed_forward, dropout_p, max_len, device):
        super(Transformer_model, self).__init__()
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embedding_size)
        self.trg_position_embedding = nn.Embedding(max_len, embedding_size)

        self.device = device
        self.transformer = nn.Transformer(embedding_size, num_heads, num_encoder_layers,num_decoder_layers, feed_forward, dropout_p)
        self.fc_out = nn.Linear(embedding_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout_p)
        self.src_pad_idx = src_pad_idx

    def make_src_mask(self, src):
        # src_shape=(src_len,N)
        src_mask = src.transpose(0, 1) == self.src_pad_idx
        # src_shape=(N,src_len)
        return src_mask

    def forward(self, src, trg):
        src_seq_length, N = src.shape
        trg_seq_length, N = trg.shape

        src_positions = (torch.arange(0, src_seq_length).unsqueeze(1).expand(src_seq_length, N).to(self.device))
        trg_positions = (torch.arange(0, trg_seq_length).unsqueeze(1).expand(trg_seq_length, N).to(self.device))

        embed_src = self.dropout(self.src_word_embedding(src) + self.src_position_embedding(src_positions))
        embed_trg = self.dropout(self.trg_word_embedding(trg) + self.trg_position_embedding(trg_positions))

        src_padding_mask = self.make_src_mask(src).to(self.device)
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_seq_length).to(self.device)

        out = self.transformer(embed_src, embed_trg, src_key_padding_mask=src_padding_mask, tgt_mask=trg_mask)
        out = self.fc_out(out)
        return out
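
The two masks passed to `nn.Transformer` in `forward` can be pictured with plain lists. The sketch below is a pure-Python stand-in, not PyTorch code: `causal_mask` mirrors the shape and values of the square subsequent mask (0.0 where attention is allowed, `-inf` for future positions), and `padding_mask` mirrors `make_src_mask` (transpose to `(N, src_len)` and mark padding with `True`). The function names and the pad index used here are illustrative assumptions.

```python
NEG_INF = float('-inf')

def causal_mask(size):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

def padding_mask(batch, pad_idx):
    """Boolean mask of shape (N, seq_len): True marks padding to be ignored.

    batch is laid out as (seq_len, N), matching batch_first=False.
    """
    seq_len, n = len(batch), len(batch[0])
    return [[batch[t][b] == pad_idx for t in range(seq_len)] for b in range(n)]

for row in causal_mask(3):
    print(row)

# Two sentences of length 3 stored column-wise; 1 is a hypothetical pad index
batch = [[2, 2],
         [7, 9],
         [3, 1]]
print(padding_mask(batch, pad_idx=1))
```

Row `i` of the causal mask lets target position `i` attend only to positions `0..i`, which is what allows the whole target sequence to be trained in a single forward pass without leaking future tokens.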

##Hyperparameters for Model

In [14]:
device = torch.device('cuda')

num_epoch = 100
learning_rate = 3e-4
batch_size = 128

src_vocab_size = len(tamil.vocab)
trg_vocab_size = len(english.vocab)
print(src_vocab_size,trg_vocab_size)
embedding_size = 512
num_heads = 4
num_encoder_layers = 3
num_decoder_layers = 3
dropout = 0.10
max_len = 100
feed_forward = 2048
src_pad_idx = tamil.vocab.stoi['<pad>']

model = Transformer_model(embedding_size, src_vocab_size, trg_vocab_size, src_pad_idx, num_heads, num_encoder_layers, num_decoder_layers, feed_forward,dropout, max_len, device).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
pad_idx = english.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# print(model)

10004 9862


##Loading a pre-trained model

In [15]:
# # load model from g_drive
# model = Transformer_model(embedding_size, src_vocab_size, trg_vocab_size, src_pad_idx, num_heads, num_encoder_layers, num_decoder_layers, feed_forward,dropout, max_len, device).to(device)
# optimizer = optim.Adam(model.parameters(), lr=learning_rate)
q=input('do you want to load model:')
if q=='yes':
    from google.colab import drive

    drive.mount('/content/gdrive')
    !ls '/content/gdrive/My Drive/pytorch_models'

do you want to load model:yes
Mounted at /content/gdrive
seq2seq_attention.pt  seq2seq_transformer_200.pt


In [16]:
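q=input('Do u want to continue :')
if q=='yes':
    model_save_name = 'seq2seq_transformer_200.pt'
    path = F"/content/gdrive/My Drive/pytorch_models/{model_save_name}"
    # map_location keeps the load working if the checkpoint was saved on a different device
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
else:
    # No checkpoint loaded: start from epoch 0 so the training loop's range(epoch, ...) is defined
    epoch = 0

# Use model.eval() instead if the model is only needed for inference
model.train()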

Do u want to continue :yes


##Test function to check model output while training
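
The test function in this section decodes greedily: starting from `<sos>`, it repeatedly feeds the tokens generated so far back into the model, appends the highest-scoring next token, and stops at `<eos>` or after a fixed number of steps. The sketch below isolates that loop in pure Python. The `toy_scores` function is a made-up stand-in for the model, and the token ids are hypothetical.

```python
def greedy_decode(next_token_scores, sos, eos, max_len=100):
    """Repeatedly append the highest-scoring next token until <eos> or max_len.

    next_token_scores(prefix) must return a list of scores, one per vocabulary id.
    """
    outputs = [sos]
    for _ in range(max_len):
        scores = next_token_scores(outputs)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        outputs.append(best)
        if best == eos:
            break
    return outputs

# Hypothetical toy "model": ids 0=<sos>, 1=<eos>, 2..4 are words.
# It prefers id len(prefix)+1 until that leaves the vocabulary, then <eos>.
def toy_scores(prefix):
    scores = [0.0] * 5
    nxt = len(prefix) + 1
    scores[nxt if nxt < 5 else 1] = 1.0
    return scores

print(greedy_decode(toy_scores, sos=0, eos=1))
```

Greedy decoding is simple and fast, but it commits to one token at a time; beam search is a common alternative when translation quality matters more than speed.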

In [17]:
def test_model():
    model.eval()
    # Raw sentence only: tokenize_tam strips '<' and '>', so the special tokens are
    # added as tokens after tokenizing instead of being typed into the string.
    tam_sen = 'திருவிழாவைக் காண அருகில் இருக்கும் கிராமங் களைச் சேர்ந்தவர்கள் மறவப்பட் டிக்கு படையெடுத்துவந்தனர்.'
    # tam_sen = 'கடந்த 5 ஆண்டுகளில் பயனடைந்தோர் மற்றும் செலவின விவரம் பின்வருமாறு'
    # tam_sen = 'இது காட்டிற்கு செல்லும் வழி'
    # tam_sen = 'சில கலை வரலாற்றாசிரியர்கள் அவர் ஒரு வருடத்திற்கு இரண்டு அல்லது மூன்று ஓவியங்களை மட்டுமே தயாரித்துள்ளதாக தெரிவித்திருக்கிறார்கள்.'
    tokens = [tamil.init_token] + tokenize_tam(tam_sen) + [tamil.eos_token]
    tam_encoded = [tamil.vocab.stoi[tok] for tok in tokens]
    # Shape (src_len, 1): a batch containing a single sentence
    tam_sen = torch.tensor(tam_encoded, dtype=torch.long, device=device).reshape(-1, 1)
    
    
    outputs = [english.vocab.stoi["<sos>"]]
    for i in range(100):
        trg_tensor = torch.LongTensor(outputs).unsqueeze(1).to(device)

        with torch.no_grad():
            output = model(tam_sen, trg_tensor)

        best_guess = output.argmax(2)[-1, :].item()
        outputs.append(best_guess)

        if best_guess == english.vocab.stoi["<eos>"]:
            break

    translated_sentence = [english.vocab.itos[idx] for idx in outputs]

    for a in translated_sentence:
        print(a, end=' ')
    print('\n')
test_model()

def eng_decoder(sen):
    for a in sen:
        for b in a:
            print(english.vocab.itos[int(b)], end=' ')
        print()

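def save_model():
    # Save a full checkpoint (not just the weights) so training can be resumed.
    # Relies on the globals _epoch and loss set by the training loop.
    model_save_name = 'seq2seq_transformer.pt'
    path = F"/content/{model_save_name}"
    # torch.save(model.state_dict(), path)
    torch.save({
        'epoch': _epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss.item(),
    }, path)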

<sos> people from nearby villages also came to watch the celebrations <eos> 



##Training Transformer Model
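
The training loop uses teacher forcing: the decoder is fed the reference translation shifted right (`target[:-1]`) and is trained to predict the same sequence shifted left (`target[1:]`). The small pure-Python sketch below shows the shift on one sequence; the token ids and the helper name `teacher_forcing_pair` are hypothetical.

```python
def teacher_forcing_pair(target_ids):
    """Split one target sequence into decoder input and prediction labels."""
    decoder_input = target_ids[:-1]   # everything except the final token
    labels = target_ids[1:]           # everything except the first token
    return decoder_input, labels

# Hypothetical ids: 2=<sos>, 3=<eos>, the rest are words
target = [2, 15, 8, 42, 3]
decoder_input, labels = teacher_forcing_pair(target)
for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f'step {step}: input ends with {seen}, should predict {expected}')
```

At every step the label is simply the next token of the reference, so `<sos>` is never a label and `<eos>` is never an input. Combined with the causal mask, this lets all positions be trained in parallel.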

In [None]:
for _epoch in range(epoch,epoch+100):
    for batch_idx, batch in enumerate(train_iterator):
        inp_data = batch.tam.to(device)
        target = batch.eng.to(device)
        # eng_decoder(target)
        # print(target.shape,inp_data.shape)

        output = model(inp_data, target[:-1,:])
        output = output.reshape(-1, output.shape[2])
        target = target[1:].reshape(-1)
        optimizer.zero_grad()
        loss = criterion(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
        optimizer.step()
        
    if _epoch%5==0:
        print('epoch:',_epoch,' loss=',loss.item())
        test_model()
        save_model()
        model.train()

epoch: 100  loss= 0.33221444487571716
<sos> the number of 65 students have participated in the programme <eos> 

epoch: 105  loss= 0.709135115146637
<sos> besides youths in large numbers participated in <unk> <eos> 

epoch: 110  loss= 0.5519265532493591
<sos> the phone has a <unk> <unk> front camera <eos> 

epoch: 115  loss= 0.3861539959907532
<sos> besides parents and students were also present <eos> 

epoch: 120  loss= 0.391206294298172
<sos> details of sanctioned students admission strength for the year 201415 at tanuvas are furnished below <eos> 

epoch: 125  loss= 0.34915032982826233
<sos> details of beneficiaries and expenditure incurred during the last 5 years are <eos> 

epoch: 130  loss= 0.3442474603652954
<sos> details of beneficiaries and expenditure incurred during the last 5 years are <eos> 

epoch: 135  loss= 0.2777898907661438
<sos> details of beneficiaries and expenditure incurred during the last 5 years are <eos> 

epoch: 140  loss= 0.6602615714073181
<sos> details of 

##Save Model

In [None]:
# save model
model_save_name = 'seq2seq_transformer_200.pt'
path = F"/content/gdrive/My Drive/pytorch_models/{model_save_name}"
# torch.save(model.state_dict(), path)
torch.save({
    'epoch': _epoch,  # last epoch actually trained, not the epoch the run started from
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),
}, path)