# **Hindi to English Seq2Seq Neural Machine Translation**

## Specifications:
1. Architecture: Seq2seq with RNN:
    - Encoder: Bidirectional single layer GRU
    - Decoder: Unidirectional single layer GRU with attention mechanism
2. Vocabulary size: 37208 for Hindi, 28157 for English
3. Tokenizer: NLTK TweetTokenizer for English, IndicNLP trivial tokenizer for Hindi
4. Encoder: Input size: 37208, Embedding size: 256, Dropout: 0.5
5. Decoder: Input size: 28157, Embedding size: 256, Dropout: 0.5
6. Output size: 28157, Hidden size: 512
7. Layers: 1
8. Epochs: 50 (best model at epoch 45), Batch size: 64
9. Learning rate: 0.001
10. Optimizer: AdamW, Loss: Cross Entropy Loss

## How to run?
train.csv is optional, but cleaned-randomized-train.csv (provided in the folder) must be mounted. It is the shuffled version of the train set, and if this file is generated again, the vocabulary will no longer match the one the trained model was built with. The notebook installs the required packages and generates the other files automatically. testhindistatements.csv must also be mounted for testing.

## Installing packages

Apart from the Python standard library, the packages we use are indic_nlp, torch and nltk. indic_nlp is used from source, so we clone its library and resources repositories.

In [None]:
!git clone "https://github.com/anoopkunchukuttan/indic_nlp_library"
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

Cloning into 'indic_nlp_library'...
remote: Enumerating objects: 1271, done.
remote: Counting objects: 100% (93/93), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 1271 (delta 50), reused 54 (delta 25), pack-reused 1178
Receiving objects: 100% (1271/1271), 9.56 MiB | 16.12 MiB/s, done.
Resolving deltas: 100% (654/654), done.
Cloning into 'indic_nlp_resources'...
remote: Enumerating objects: 133, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 133 (delta 0), reused 2 (delta 0), pack-reused 126
Receiving objects: 100% (133/133), 149.77 MiB | 41.31 MiB/s, done.
Resolving deltas: 100% (51/51), done.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Setting path for INDIC_NLP_LIBRARY
INDIC_NLP_LIB_HOME=r"/content/indic_nlp_library"
INDIC_NLP_RESOURCES="/content/indic_nlp_resources"

In [None]:
import sys
sys.path.append(r'{}'.format(INDIC_NLP_LIB_HOME))
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp import loader
loader.load()
import csv
import string
import re
import random
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import numpy as np
import nltk
from nltk.tokenize import TweetTokenizer
from indicnlp.tokenize import indic_tokenize
import torch.nn.functional as func

**Sentence tokenization functions:** Using NLTK TweetTokenizer for tokenizing English and indic_nlp for tokenizing Hindi sentences. TweetTokenizer is chosen because it does not split on apostrophe, letting us preserve words like don't, can't etc.

In [None]:
def tokenize_hindi(text_in_hindi): # Tokenize hindi sentences
    hindi_tokens=[]
    for token in indic_tokenize.trivial_tokenize(text_in_hindi):
        hindi_tokens.append(token)
    return hindi_tokens 

tokenizer = TweetTokenizer()
def tokenize_english(eng_text): # Tokenize english sentences
     return [token for token in tokenizer.tokenize(eng_text)]
print(tokenize_english("i  can't help you in doing that."))

['i', "can't", 'help', 'you', 'in', 'doing', 'that', '.']


## Data Cleaning and preprocessing

1. English text is converted to lower case.
2. Extra spaces are removed.
3. All special characters except . ? ! are removed. Multiple quotations, e.g. """, are converted to a single quotation ". Multiple occurrences of ., e.g. '...', are converted to a single occurrence '.'
4. The character '.' is removed if found at any position except the end of an English sentence.
5. Hindi digits are converted to their English versions.
6. Ending punctuation is kept consistent. E.g. if the Hindi sentence ends with ।, the English sentence must end with ., and if the Hindi sentence ends with ?, the English sentence must also end with ?, and so on.
7. We take only the sentences containing <=50 words from train.csv. This preserves 78k out of 80k sentences in the dataset, but gives less <pad> overhead on the GPU and a faster runtime. The gain outweighs the loss of 2k sentences.
8. Hindi sentences containing English words are dropped.
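
As a rough illustration of the digit conversion, space cleanup and ending-punctuation rules above, here is a minimal sketch on a toy sentence pair. The helper names `normalize_hindi` and `match_ending` are hypothetical and are not used by the notebook; they cover only a subset of the cleaning steps.

```python
import re

# Map Devanagari digits to ASCII digits in a single pass.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

def normalize_hindi(text):
    """Hypothetical helper: digit conversion and extra-space removal only."""
    text = text.translate(DEVANAGARI_DIGITS)
    return re.sub(r" +", " ", text).strip()

def match_ending(hindi, english):
    """Hypothetical helper: copy the Hindi ending mark onto the English side."""
    endings = {"।": ".", "?": "?", "!": "!"}
    if hindi and english and english[-1] not in ".?!" and hindi[-1] in endings:
        english += endings[hindi[-1]]
    return english

hindi = normalize_hindi("मेरे पास   २ किताबें हैं।")
english = match_ending(hindi, "i have 2 books")
print(hindi)    # मेरे पास 2 किताबें हैं।
print(english)  # i have 2 books.
```

Using `str.translate` replaces the ten separate `replace` calls with one table lookup; the result is the same for these digits.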

In [None]:
dic={}
with open('cleaned_train.csv', 'w', newline='\n',encoding='utf-8') as file:
    writer = csv.writer(file)
    csv_file = open("train.csv",encoding='utf-8')
    rows = csv.reader(csv_file)
    for row in rows:
        hindi = str(row[1]).lower().strip()
        english = str(row[2]).lower().strip() # Convert to lower case

        english = english.replace('’','\'')
        hindi = hindi.replace('’','\'')
        hindi=re.sub(r'[/\-:;%()♪♫<>,~¶#&=]+','',hindi)
        english=re.sub(r"[^a-z0-9!.?'\[\] ]+",'',english)
        english=re.sub('"+','', english)
        hindi=re.sub('"+','', hindi)
        english=re.sub(' +',' ', english)
        hindi=re.sub(' +',' ', hindi)
        # Removed punctuations and extra space

        hindi = hindi.replace("०", "0")
        hindi = hindi.replace("१", "1")
        hindi = hindi.replace("२", "2")
        hindi = hindi.replace("३", "3")
        hindi = hindi.replace("४", "4")
        hindi = hindi.replace("५", "5")
        hindi = hindi.replace("६", "6")
        hindi = hindi.replace("७", "7")
        hindi = hindi.replace("८", "8")
        hindi = hindi.replace("९", "9")
        # Replace hindi digits with english digits

        if len(english)>0 and len(hindi)>0 and english[-1]=='.' and hindi[-1]!='।' and hindi[-1]!='.' and hindi[-1]!='!' and hindi[-1]!='?' and hindi[-1]!='"' and hindi[-1]!='|':
            hindi+='।'
        elif len(english)>0 and len(hindi)>0 and english[-1]=='?' and hindi[-1]!='।' and hindi[-1]!='.' and hindi[-1]!='!' and hindi[-1]!='?' and hindi[-1]!='"' and hindi[-1]!='|':
            hindi+='?'
        elif len(english)>0 and len(hindi)>0 and english[-1]=='!' and hindi[-1]!='।' and hindi[-1]!='.' and hindi[-1]!='!' and hindi[-1]!='?' and hindi[-1]!='"' and hindi[-1]!='|':
            hindi+='!'
        if len(english)>0 and len(hindi)>0 and english[-1]!='.' and english[-1]!='!' and english[-1]!='?' and hindi[-1]=='।':
            english+='.'
        elif len(english)>0 and len(hindi)>0 and english[-1]!='.' and english[-1]!='!' and english[-1]!='?' and hindi[-1]=='?':
            english+='?'
        elif len(english)>0 and len(hindi)>0 and english[-1]!='.' and english[-1]!='!' and english[-1]!='?' and hindi[-1]=='!':
            english+='!'
        # Append ending punctuation so that both sentences of a pair end consistently

        if("..." in hindi and hindi.index("...")!=len(hindi)-3):
            hindi = hindi.replace("..."," ")
        if("..." in english and english.index("...")!=len(english)-3):
            english = english.replace("..."," ")
        hindi = hindi.replace("....",".")
        hindi = hindi.replace("...",".")
        hindi = hindi.replace("..",".")
        hindi = re.sub(r'""','"', str(hindi))
        english = english.replace("....",".")
        english = english.replace("...",".")
        english = english.replace("..",".")
        # Collapse repeated punctuation into a single occurrence
        if("." in english and english.index(".")!=len(english)-1):
            english = english.replace(".","")
        # Remove . if found in the middle of a sentence

        if(len(english)==0 or len(hindi)==0): # Skip if sentence is blank
            continue
        flag=0
        for i in range(65,123):# Find if there are english letters in the hindi sentence
            if chr(i) in hindi:
                flag=1
                break
        if flag==0:
            dic[hindi.strip().lower()] = english.strip().lower() 
    for key, value in dic.items():
        hin_len = 0
        eng_len = 0
        for token in tokenize_hindi(key):
            hin_len+=1
        for token in tokenize_english(value):
            eng_len+=1
        if hin_len<=50 and eng_len<=50:  # Accept the sentences if both their lengths<=50
            writer.writerow([key, value])

The cleaned dataset is then shuffled to maintain randomness in the training dataset. 

In [None]:
# ip = open('cleaned_train.csv','r',encoding='utf-8')
# ipdata = ip.readlines()
# random.shuffle(ipdata)
# with open('cleaned-randomized-train.csv','w',encoding='utf-8') as f:
#     rows = '\n'.join([row.strip() for row in ipdata])
#     f.write(rows)
train_data = []
with open("cleaned-randomized-train.csv",encoding='utf-8') as csv_file: # Closes the file once all rows are read
    for row in csv.reader(csv_file):
        train_data.append(row)

In [None]:
# Splitting into train set and validation set in 85:15 ratio
train_data_size=int(len(train_data)*0.85)
train, validation = train_data[:train_data_size],train_data[train_data_size:]

In [None]:
while(len(train)%64>0): # Making sure that the length of train set is divisible by our batch size (64)
    del train[-1]
print(len(train))

78208
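
The split and trimming steps above can be sketched on toy data as follows. `split_and_trim` is a hypothetical helper, not part of the notebook; it slices instead of deleting rows one at a time, which should give the same result.

```python
def split_and_trim(rows, train_fraction=0.85, batch_size=64):
    """Hypothetical helper: split rows, then trim train to a multiple of batch_size."""
    cut = int(len(rows) * train_fraction)
    train, validation = rows[:cut], rows[cut:]
    usable = len(train) - len(train) % batch_size  # Largest multiple of batch_size
    return train[:usable], validation

rows = list(range(1000))  # Stand-in for 1000 sentence pairs
train_rows, validation_rows = split_and_trim(rows)
print(len(train_rows), len(validation_rows))  # 832 150
```

For the real dataset, 78208 = 1222 * 64, so the train set forms exactly 1222 full batches.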


## Creating vocabularies

Here, we create vocabularies for hindi and english words using 2 dictionaries for each - one for word-to-index conversion and second for index-to-word conversion. The indexes will be used for tensor representation of the sentences. We also keep track of the vocabulary size which will be used later for input and output sizes for encoder/decoder.

In [None]:
sos = "<sos>" # All sentences will have <sos> at the start
eos = "<eos>" # All sentences will be appended with <eos> at the end
unk = "" # unk token is kept blank to prevent occurrences of <unk> in the predictions
pad = "<pad>"
Eng_cnt = {} # Frequency of each distinct english token
Hin_cnt = {} # Frequency of each distinct hindi token

Eng_ind_word={0:sos, 
             1:eos,
             2:unk, 
             3:pad
} # Index to word translation for english tokens
Eng_word_ind = {sos:0,
                eos:1,
                unk:2,
                pad:3
} # Word to index translation of english tokens
Hin_ind_word = {0:sos, 
                1:eos,
                2:unk, 
                3:pad
} # Index to word translation of hindi tokens
Hin_word_ind = {sos:0,
                eos:1,
                unk:2,
                pad:3
}# Word to index translation of hindi tokens
eng_vocab_size = 4
hin_vocab_size = 4
# Above two variables keep track of our current vocab size

eng_words = set()
hin_words = set()
# Sets of english and hindi words seen so far

for row in train: # Each row contains a hindi, english sentence pair
    english_tokens= tokenize_english(row[1])
    for token in english_tokens:
        if token not in eng_words: # New token found
            Eng_word_ind[token]= eng_vocab_size
            Eng_ind_word[eng_vocab_size]=token
            Eng_cnt[token]=1
            eng_vocab_size+=1
            eng_words.add(token)
        else: # Token already exists
            Eng_cnt[token]+=1

    hindi_tokens= tokenize_hindi(row[0]) # Do the same for hindi tokens
    for token in hindi_tokens:
        if token not in hin_words: # New token found
            Hin_word_ind[token]= hin_vocab_size
            Hin_ind_word[hin_vocab_size]=token
            Hin_cnt[token]=1
            hin_vocab_size+=1
            hin_words.add(token)
        else: # Token already exists
            Hin_cnt[token]+=1


In [None]:
print(hin_vocab_size, eng_vocab_size) # Vocabulary sizes

37208 28157
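
To make the role of the two dictionaries concrete, here is a minimal sketch on a toy corpus. `build_vocab` and `encode` are hypothetical helpers written for this illustration; the special-token indices follow the ones used above.

```python
SOS, EOS, UNK, PAD = "<sos>", "<eos>", "", "<pad>"

def build_vocab(tokenized_sentences):
    """Hypothetical helper: assign the next free index to every new token."""
    word_to_index = {SOS: 0, EOS: 1, UNK: 2, PAD: 3}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in word_to_index:
                word_to_index[token] = len(word_to_index)
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word

def encode(tokens, word_to_index):
    """Hypothetical helper: map tokens to indices, falling back to the unk index."""
    body = [word_to_index.get(token, word_to_index[UNK]) for token in tokens]
    return [word_to_index[SOS]] + body + [word_to_index[EOS]]

toy_corpus = [["i", "like", "tea"], ["i", "like", "rain"]]
w2i, i2w = build_vocab(toy_corpus)
print(len(w2i))                               # 8
print(encode(["i", "like", "coffee"], w2i))   # [0, 4, 5, 2, 1]
```

The unseen token "coffee" maps to index 2, the blank unk token, which is why unknown words show up as empty strings rather than `<unk>` in predictions.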


## Defining our architecture:

### Encoder: Single layer bidirectional GRU

Our encoder is a bidirectional RNN (biRNN), which processes each Hindi sequence in both the forward and the backward direction. This gives us two context vectors, one capturing past information and another capturing future information. Because our decoder is unidirectional, we merge these into a single context vector: the top layer forward RNN hidden state after the final time-step and the top layer backward RNN hidden state after the final time-step are concatenated, and the resulting vector is passed through a linear layer followed by a tanh activation.
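
In symbols, the merge step can be written as below. The notation is introduced here only for illustration and does not appear in the code: $\overrightarrow{h}$ and $\overleftarrow{h}$ are the final forward and backward hidden states (each of size `hidden_size`), $W_{fc}$ and $b_{fc}$ are the weight and bias of `fc_layer`, and $s_0$ is the vector handed to the decoder as its initial hidden state.

$$
s_0 = \tanh\left(W_{fc}\,[\overrightarrow{h};\ \overleftarrow{h}] + b_{fc}\right), \qquad W_{fc} \in \mathbb{R}^{H \times 2H}
$$

Here $H$ stands for `hidden_size`, so the concatenation has size $2H$ and the output has size $H$.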

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, layers, dropout):
        super(Encoder, self).__init__()   
        self.embedding = nn.Embedding(input_size, embedding_size) # Defines our embedding
        self.layers = layers
        self.gru = nn.GRU(embedding_size, hidden_size, bidirectional = True)
        # Bidirectional GRU used: A forward and a backward RNN
        self.dropout = nn.Dropout(dropout)
        self.fc_layer = nn.Linear(hidden_size * 2, hidden_size) # Defining a fully connected linear layer
        
    def forward(self, source_tensor): # source_tensor = hindi_tensors (indexes of hindi words from the vocab)
        # let x = source tensor length. batch_size = 64, embedding_size = 256, hidden_size = 512
        # shape of source_tensor = [x, 64]
        source_embedding = self.dropout(self.embedding(source_tensor)) # 3D embedded input tensor
        
        outputs, hidden_state = self.gru(source_embedding) # Get the outputs from the RNN. No cell state in GRU.
        
        # [0, :, :] is the top layer forward RNN hidden state after the final time-step
        # [1, :, :] gives the top layer backward RNN hidden state after the final time-step
        hidden_cat = torch.cat((hidden_state[0,:,:], hidden_state[1,:,:]), dim = 1)
        # Concatenating hidden layers of forward direction and backward direction because the decoder is not bidirectional
        hidden_cat_fc = self.fc_layer(hidden_cat) # Passing through the fully connected layer
        hidden_state = torch.tanh(hidden_cat_fc) #  Applying tanh activation
        
        return outputs, hidden_state

### Decoder: Unidirectional single layer GRU with attention mechanism

Our decoder uses an attention layer. Unlike a simple Seq2Seq-RNN model, here we use attention-weighted source vectors at every time step while decoding.
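
The attention computation can be summarized as follows, in the style of the additive attention of reference [2]. The symbols are chosen here for illustration: $s_{t-1}$ is the previous decoder hidden state, $h_i$ is the encoder output at source position $i$, $W_a$ and $b_a$ correspond to `attention_layer`, and $v$ corresponds to `weighted_energy_sum`.

$$
\begin{aligned}
e_{t,i} &= v^\top \tanh\left(W_a\,[s_{t-1};\ h_i] + b_a\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})} \\
c_t &= \sum_{i} \alpha_{t,i}\, h_i
\end{aligned}
$$

The context vector $c_t$ is what the code calls `attn_weighted_encoder_outputs`. It is concatenated with the embedded previous word to form the GRU input, and used again together with the GRU output and the embedding in the final linear layer.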

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_size, embedding_size, hidden_size, dropout):
        super(Decoder, self).__init__()

        self.output_size = output_size  
        self.embedding = nn.Embedding(output_size, embedding_size) # Defines our embedding
        self.gru = nn.GRU((hidden_size * 2) + embedding_size, hidden_size) # Unidirectional GRU
        self.fc_layer_out = nn.Linear((hidden_size * 3) + embedding_size, output_size) # Defining a fully connected linear layer
        self.dropout = nn.Dropout(dropout)
        
        self.attention_layer = nn.Linear((hidden_size * 3), hidden_size)
        self.weighted_energy_sum = nn.Linear(hidden_size, 1, bias = False) # Represents weighted sum of the energy across all encoder hidden states. 

    def forward(self, decoder_input, hidden_state, encoder_outputs):
        # let x = source tensor length. batch_size = 64, embedding_size = 256, hidden_size = 512

        enc_op_copy = encoder_outputs
        hid_state_copy = hidden_state # Making copies of these for the attention mechanism
        
        decoder_input = decoder_input.unsqueeze(0) # Add 1 dimension
        
        # Before unsqueeze: decoder_input = [64]. After unsqueeze: [1, 64]
        
        source_embedding = self.dropout(self.embedding(decoder_input)) # Apply dropout on the source embedding

        source_length =  enc_op_copy.shape[0]
        batch_size =  enc_op_copy.shape[1]
        
        hid_state_copy = hid_state_copy.unsqueeze(1).repeat(1, source_length, 1) # Repeat the previous decoder hidden state source_length times: [64, 512] -> [64, x, 512]

        enc_op_copy =  enc_op_copy.permute(1, 0, 2) # Switch dimensions 

        concat = self.attention_layer(torch.cat((hid_state_copy, enc_op_copy), dim = 2)) # Concatenating encoder outputs and hidden state
        energy = torch.tanh(concat) # Represents the energy between the previous decoder hidden state and the encoder hidden states.
        
        # Now we compute our attention vector
        attention_vector = self.weighted_energy_sum(energy).squeeze(2) # energy gets multiplied by a [1, hidden_size] tensor weighted_energy_sum.
        # weighted_energy_sum represents the weighted sum of the energy across all encoder hidden states        
        attention_vector = func.softmax(attention_vector, dim=1).unsqueeze(1) # Softmax ensures that the elements of the vector are between 0 and 1
        # This is our final attention vector over the hindi sentences
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # Swap dimensions
        
        attn_weighted_encoder_outputs = torch.bmm(attention_vector, encoder_outputs) # Performs a batch matrix-matrix product of matrices stored in attention_vector and encoder_outputs
        attn_weighted_encoder_outputs = attn_weighted_encoder_outputs.permute(1,0,2)
        # Created a weighted sum of the encoder outputs
        
        gru_input = torch.cat((source_embedding, attn_weighted_encoder_outputs), dim = 2) # [1, batch size, (hidden_size*2) + embedding_size]
        decoder_output, hidden_state = self.gru(gru_input, hidden_state.unsqueeze(0))
        # Get the prediction from the RNN. decoder_output shape = [1, 64, 512] where 1 = a single decoding step, 64 = batch size and 512 = hidden size
        # hidden_state shape = [1, 64, 512] where 1 = number of layers * number of directions (the decoder is unidirectional)
        
        # Pass the embedded input, weighted source tensor and decoder output from the RNN over a linear layer to predict the next word of the english sentence
        decoder_prediction = self.fc_layer_out(torch.cat((decoder_output.squeeze(0), attn_weighted_encoder_outputs.squeeze(0), source_embedding.squeeze(0)), dim = 1))
        hidden_state = hidden_state.squeeze(0) # Remove extra dimension
        return decoder_prediction, hidden_state

### Combining the encoder and decoder in our Seq2seq class:

In [None]:
class Seq2seq(nn.Module): # Defines our model. Combines the encoder and the decoder
    def __init__(self, encoder, decoder):
        super(Seq2seq, self).__init__()      
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, source_tensor, target_tensor, teacher_force_ratio = 0.5):
        # Teacher forcing: at each step we feed the correct word with probability teacher_force_ratio (50% of the time by default) and the predicted word otherwise. If the ratio is 1, it might cause overfitting. If it is 0 (only predicted words taken), it might underfit because a wrong prediction can ruin the prediction of a lot of subsequent words
   
        batch_size = source_tensor.shape[1]
        target_length = target_tensor.shape[0]
        
        decoder_outputs = torch.zeros(target_length, batch_size, eng_vocab_size).to(device)
        # We're going to predict 1 word at a time for an entire batch, and each prediction is going to be a vector of entire english vocab size

        encoder_outputs, hidden_state = self.encoder(source_tensor) # Get the encoder outputs for the source tensor 
        decoder_input = target_tensor[0,:] # The first decoder input is the row of <sos> tokens
        
        for i in range(1, target_length):
            
            decoder_output, hidden_state = self.decoder(decoder_input, hidden_state, encoder_outputs) # inputs given are from the encoder
            # The output from this is going to be the decoder output and the next hidden_state which will be reused in the loop
            # Size of decoder_output = (batch_size, target_vocab_size) ie (N, eng_vocab_size)
            decoder_outputs[i] = decoder_output
            
            if random.random() < teacher_force_ratio: # If this condition is satisfied, use the actual next word (ground truth) as the next input to decoder
                decoder_input = target_tensor[i]
            else: # else use the actual prediction as the next input to decoder
                decoder_input = decoder_output.argmax(1) # Taking argmax of the 1st dimension to get the highest predicted token

        return decoder_outputs
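
The teacher forcing decision inside the loop can be isolated in a few lines. This is a minimal sketch with a hypothetical helper `next_decoder_input`; as in the class above, one random draw is made per time step, so the choice applies to the whole batch at that step.

```python
import random

def next_decoder_input(ground_truth, predicted, teacher_force_ratio, rng=random):
    """Hypothetical helper: pick the next decoder input for one time step."""
    if rng.random() < teacher_force_ratio:
        return ground_truth  # Teacher forcing: feed the correct token
    return predicted         # Otherwise feed the model's own prediction

rng = random.Random(0)  # Seeded generator so the run is repeatable
picks = [next_decoder_input("truth", "guess", 0.5, rng) for _ in range(1000)]
print(picks.count("truth") / len(picks))  # Close to 0.5
```

With a ratio of 1.0 the ground truth is always chosen, and with 0.0 the prediction is always chosen, because `random()` returns values in [0, 1).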

## Preparing tensors

The idea is to create tensors of train data and iterate over them in steps of batch size. We use the DataLoader module of pytorch to do this. Now, DataLoader needs all the tensors in a batch to be of equal length, so we need to pad the tensors to make them of equal size. Instead of having a global maximum tensor length, we use a dictionary to keep track of the maximum length of tensors in each batch. This saves computation time.

In [None]:
batch_maxlen={} # Dictionary to store max length of tensors in each batch
batch_size = 64 # Batch size 64 found to be optimal
batch_no=1
for i in range(0,len(train),batch_size): # Iterate over the train set in steps of batch size.
    max_sentence_len = 0 
    for row in train[i:i+batch_size]: # This represents rows in our current batch
        hindi = row[0]
        english = row[1]
        eng_maxlen, hin_maxlen = 0,0 
        for token in tokenize_hindi(hindi):
            hin_maxlen+=1 # Count hindi tokens
        for token in tokenize_english(english):
            eng_maxlen+=1 # Count english tokens
        max_sentence_len = max(max_sentence_len,max(hin_maxlen, eng_maxlen)) # Max sentence length is the max over hindi and english sentences in the current batch
    batch_maxlen[batch_no] = max_sentence_len+2 # +2 done to account for <sos> and <eos> tokens
    batch_no+=1 # Go to the next batch

In [None]:
train_tensors = [] # Our cumulative train tensors. A list of list containing index vectors for all the sentences in the train corpus
batch_no=1
for i in range(0,len(train),batch_size): # Iterate over the train set in steps of batch size.
    max_sentence_len=batch_maxlen[batch_no] # Find the max sentence length in this batch
    for row in train[i:i+batch_size]: # This represents rows in our current batch
        hindi = row[0]
        english = row[1]
        hindi_indexes = [] # To store indices for hindi tokens
        eng_indexes = [] # To store indices for english tokens
        for token in tokenize_hindi(hindi):
            if Hin_word_ind.get(token)==None: # Token not found in vocabulary
                hindi_indexes.append(Hin_word_ind[unk]) # 2 is for unk token
            else:
                hindi_indexes.append(Hin_word_ind.get(token)) # Append the index of the current token
        hindi_indexes.insert(0,Hin_word_ind[sos]) # Insert the <sos> token's index at start
        hindi_indexes.append(Hin_word_ind[eos]) # Append the <eos> token's index at end
        while(len(hindi_indexes) < max_sentence_len):
            hindi_indexes.append(Hin_word_ind[pad]) # Padding
        for token in tokenize_english(english):
            if Eng_word_ind.get(token)==None: # Token not found in vocabulary
                eng_indexes.append(Eng_word_ind[unk]) # 2 is for unk token
            else:
                eng_indexes.append(Eng_word_ind.get(token)) # Append the index of the current token
        eng_indexes.insert(0,Eng_word_ind[sos])# Insert the <sos> token's index at start
        eng_indexes.append(Eng_word_ind[eos])# Append the <eos> token's index at end
        while(len(eng_indexes) < max_sentence_len):
            eng_indexes.append(Eng_word_ind[pad]) # Padding

        train_tensors+=[[torch.IntTensor(hindi_indexes).detach().clone(), torch.IntTensor(eng_indexes).detach().clone()]]
    batch_no+=1
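
The per-batch padding idea can be shown on toy index lists. This is a simplified sketch with a hypothetical helper `pad_batches`: it pads a single list of sequences, whereas the cells above take the maximum over both the Hindi and the English side of each batch. The pad index 3 matches the vocabularies defined earlier.

```python
def pad_batches(sequences, batch_size, pad_index=3):
    """Hypothetical helper: pad every sequence to the longest one in its own batch."""
    padded = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)  # Max length within this batch only
        padded.extend(seq + [pad_index] * (width - len(seq)) for seq in batch)
    return padded

toy = [[0, 5, 1], [0, 6, 7, 8, 1], [0, 9, 1], [0, 4, 1]]
padded = pad_batches(toy, batch_size=2)
for row in padded:
    print(row)
# [0, 5, 1, 3, 3]
# [0, 6, 7, 8, 1]
# [0, 9, 1]
# [0, 4, 1]
```

The second batch needs no padding at all, which is the saving compared with padding everything to one global maximum length.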

## Some prerequisites before we start training:

In [None]:
# Setting our hyperparameters
epochs = 50
learning_rate = 0.001

layers = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_size_encoder = hin_vocab_size
output_size = eng_vocab_size
hidden_size = 512 # Hidden size taken the same for both encoder and decoder
encoder_embedding_size = 256
decoder_embedding_size = 256
encoder_dropout = 0.5
decoder_dropout = 0.5

In [None]:
encoder_net = Encoder(input_size_encoder, encoder_embedding_size, hidden_size, layers, encoder_dropout)
decoder_net = Decoder(output_size, decoder_embedding_size, hidden_size, decoder_dropout)
# Defined our encoder and decoder net

model = Seq2seq(encoder_net, decoder_net).to(device) # Initialize our model and send to cuda
pad_index = Eng_word_ind['<pad>'] 
criterion = nn.CrossEntropyLoss(ignore_index = pad_index) # We don't want to pay loss for <pad>, so ignore pad indexes
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)  # AdamW optimizer used instead of plain old Adam.
def init_weights(model): # Initializing weights
    for name, parameter in model.named_parameters():
        if 'weight' in name:
            nn.init.normal_(parameter.data, mean=0, std=0.01)
        else:
            nn.init.constant_(parameter.data, 0)
            
model.apply(init_weights)

Seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(37208, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (fc_layer): Linear(in_features=1024, out_features=512, bias=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(28157, 256)
    (gru): GRU(1280, 512)
    (fc_layer_out): Linear(in_features=1792, out_features=28157, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (attention_layer): Linear(in_features=1536, out_features=512, bias=True)
    (weighted_energy_sum): Linear(in_features=512, out_features=1, bias=False)
  )
)

In [None]:
iterator = DataLoader(train_tensors, batch_size=batch_size,shuffle=False) # Train iterator mimicking torchtext's train iterator. shuffle=False keeps the per-batch padding intact

In [None]:
#torch.save(model.state_dict(), 'final-phase-45.pt')
model.load_state_dict(torch.load('final-model.pt')) #loading the saved model
model.eval()

Seq2seq(
  (encoder): Encoder(
    (embedding): Embedding(37208, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (fc_layer): Linear(in_features=1024, out_features=512, bias=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(28157, 256)
    (gru): GRU(1280, 512)
    (fc_layer_out): Linear(in_features=1792, out_features=28157, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (attention_layer): Linear(in_features=1536, out_features=512, bias=True)
    (weighted_energy_sum): Linear(in_features=512, out_features=1, bias=False)
  )
)

## Commence training!

In [None]:
total_loss = 0 # To keep track of the training loss in each epoch
for epoch in range(31,epochs):
    print("[Epoch {} / {}]".format(epoch, epochs))
    model.train() # Setting model in train mode
    for id, batch in enumerate(iterator): # Iterating over the training set in batches of 64

        hindi_input=torch.transpose(batch[0].long(), 0, 1).to(device)
        english_target=torch.transpose(batch[1].long(), 0, 1).to(device)
        # hindi and english tensors have shape = (batch_size, max_batchlen) but we need shape to be (max_batchlen, batch_size ) so transpose these

        output = model(hindi_input, english_target) # Forward propagation
        output = output[1:].reshape(-1, output.shape[2]) # Discard the first output (the <sos> position), keep the vocabulary dimension and flatten everything else into one dimension
      
        english_target = english_target[1:].reshape(-1) # Remove the <sos> token from the ground truth target
        optimizer.zero_grad() 
        loss = criterion(output, english_target) # Calculate the CrossEntropy loss 
        loss.backward() # Back propagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1) # Clipping the gradients to prevent them from exploding
        optimizer.step() # Gradient descent
        total_loss+=loss.item() # Add the current batch loss to the total epoch loss

    if epoch in (35, 40, 45, 50): # Save a checkpoint every 5 epochs. Note: epoch 50 is never reached with range(31, epochs)
        torch.save(model.state_dict(), 'attn-gru4-{}.pt'.format(epoch))
    print("Training loss :", total_loss/len(iterator))
    total_loss = 0

[Epoch 31 / 50]
Training loss : 1.3845185347742808
[Epoch 32 / 50]
Training loss : 1.3792187408612715
[Epoch 33 / 50]
Training loss : 1.360443650407994
[Epoch 34 / 50]
Training loss : 1.3579538188472318
[Epoch 35 / 50]
Training loss : 1.3380614352792062
[Epoch 36 / 50]
Training loss : 1.3252727541517704
[Epoch 37 / 50]
Training loss : 1.3163236871288568
[Epoch 38 / 50]
Training loss : 1.299136302962826
[Epoch 39 / 50]
Training loss : 1.2950468673058306
[Epoch 40 / 50]
Training loss : 1.2800884007334514
[Epoch 41 / 50]
Training loss : 1.2639049985209776
[Epoch 42 / 50]
Training loss : 1.257107134341413
[Epoch 43 / 50]
Training loss : 1.251821501165287
[Epoch 44 / 50]
Training loss : 1.2385139078048168
[Epoch 45 / 50]
Training loss : 1.2211567567335213
[Epoch 46 / 50]
Training loss : 1.2208030289206684
[Epoch 47 / 50]
Training loss : 1.2092267169987512
[Epoch 48 / 50]
Training loss : 1.1998458169583603
[Epoch 49 / 50]
Training loss : 1.1923794133280772
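
To illustrate what ignoring `<pad>` in the loss means, here is a plain-Python sketch of a mean cross-entropy that skips targets equal to the pad index. `masked_cross_entropy` is a hypothetical stand-in written for illustration; training itself uses `nn.CrossEntropyLoss(ignore_index = pad_index)`.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Hypothetical helper: mean cross-entropy over targets that are not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # Padded positions contribute nothing to the loss
        peak = max(row)  # Subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        counted += 1
    # Assumption of this sketch: return 0.0 when every target is ignored
    return total / counted if counted else 0.0

logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]
targets = [0, 1, 3]  # The last target is <pad> (index 3)
loss = masked_cross_entropy(logits, targets, ignore_index=3)
print(loss)
```

Changing the scores in the last row does not change the result, because that position is padding and is left out of the average.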


## **Evaluation on validation set:**

In [None]:
def translate(model, sentence, max_length): # Function for translating hindi to english using our trained model

    tokens = [t for t in tokenize_hindi(sentence)] # Extract tokens from the current hindi sentence

    tokens.insert(0, "<sos>")
    tokens.append("<eos>")
    # Add <sos> and <eos> tokens in beginning and end respectively

    text_to_indices = [] # Stores the indices of hindi tokens
    for token in tokens: # Go through each hindi token and convert to its index in the vocab
        if Hin_word_ind.get(token)==None:
            text_to_indices.append(2) # 2 is for unk token
        else:
            text_to_indices.append(Hin_word_ind[token])
    sentence_tensor = torch.IntTensor(text_to_indices)
    sentence_tensor = sentence_tensor.unsqueeze(1).to(device)
     # Convert to Tensor and send to cuda
     
    with torch.no_grad():
        encoder_states, hidden = model.encoder(sentence_tensor) # Fetch encoder outputs

    decoder_outputs = [Eng_word_ind["<sos>"]] # decoder_outputs stores our english translations
    for _ in range(max_length):
        previous_word = torch.IntTensor([decoder_outputs[-1]]).to(device)

        with torch.no_grad():
            output, hidden = model.decoder(previous_word, hidden, encoder_states)
            best_guess = output.argmax(1).item()

        decoder_outputs.append(best_guess)

        if output.argmax(1).item() == Eng_word_ind["<eos>"]:
            break
    translated_sentence = [] # Stores the word conversions of target sentence
    for idx in decoder_outputs:
        translated_sentence.append(Eng_ind_word[int(idx)]) # Convert indices to english words 

    if translated_sentence[-1] == "<eos>":
        del translated_sentence[-1] # Remove <eos> token if found at end
    return " ".join(translated_sentence[1:]) # Remove <sos> token and return the english prediction
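
The decoding loop in `translate` is a greedy search: at each step it keeps only the highest scoring token and stops at `<eos>` or after `max_length` steps. The sketch below shows the same control flow without PyTorch. `greedy_decode` and `toy_step` are hypothetical; `toy_step` stands in for the call to `model.decoder`.

```python
def greedy_decode(step, sos_index, eos_index, max_length):
    """Hypothetical helper. `step(prev_index, state)` returns (scores, new_state)."""
    outputs, state = [sos_index], None
    for _ in range(max_length):
        scores, state = step(outputs[-1], state)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        outputs.append(best)
        if best == eos_index:
            break
    body = outputs[1:]  # Drop <sos>
    if body and body[-1] == eos_index:
        body = body[:-1]  # Drop <eos> if decoding reached it
    return body

def toy_step(prev_index, state):
    # Toy "decoder": a fixed chain 0 -> 2 -> 3 -> 4 -> 1 (<eos>)
    successor = {0: 2, 2: 3, 3: 4, 4: 1}
    scores = [0.0] * 5
    scores[successor[prev_index]] = 1.0
    return scores, state

print(greedy_decode(toy_step, sos_index=0, eos_index=1, max_length=10))  # [2, 3, 4]
```

If `max_length` is reached before `<eos>` is produced, the sequence is simply cut off at that point, which is also what happens in `translate`.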

Generating ground truth english translations and predicted english translations

In [None]:
file = open("ground_truth.txt","w") # This txt will store the ground truth english translations
for row in validation:
    file.write(row[1]+"\n")
file.close()

Generating model predictions:

In [None]:
model.eval()
file1 = open("model-prediction.txt","w") # Stores our model predictions
for row in validation:
    sentence = row[0]
    translated_sentence = translate( # Translated english sentence
        model, sentence, max_length=50
    )
    file1.write(translated_sentence + "\n")
file1.close()

## **Generating predictions on test set:**

We preprocess the Hindi statements in the test dataset in the same way we preprocessed the Hindi statements in the train dataset.

In [None]:
with open('testhindistatements-cleaned.csv', 'w', newline='\n',encoding='utf-8') as file:
    writer = csv.writer(file)
    with open('testhindistatements.csv',encoding='utf-8') as hfile:
        rows = csv.reader(hfile)
        for row in rows:
            hindi = row[2]
            hindi = hindi.replace('’','\'')
            hindi=re.sub(r'[/\-:;%()♪♫<>,~¶#&=]+','',hindi)
            hindi=re.sub(' +',' ', hindi)
            if("..." in hindi and hindi.index("...")!=len(hindi)-3):
                hindi = hindi.replace("..."," ")
            hindi = hindi.replace("....",".")
            hindi = hindi.replace("...",".")
            hindi = hindi.replace("..",".")
            hindi = hindi.replace("०", "0")
            hindi = hindi.replace("१", "1")
            hindi = hindi.replace("२", "2")
            hindi = hindi.replace("३", "3")
            hindi = hindi.replace("४", "4")
            hindi = hindi.replace("५", "5")
            hindi = hindi.replace("६", "6")
            hindi = hindi.replace("७", "7")
            hindi = hindi.replace("८", "8")
            hindi = hindi.replace("९", "9")
            hindi=re.sub('"+','', hindi)
            hindi=re.sub(' +',' ', hindi)
            writer.writerow([row[0],row[1],hindi])

In [None]:
model.eval()
file2 = open("testhindistatements.txt","w")
file3 = open("answer.txt","w")
csv_file2 = open('testhindistatements-cleaned.csv',encoding='utf-8')
rows = csv.reader(csv_file2)
cnt=0
for row in rows:
    if cnt==0: # Skip the 0th row
        cnt+=1
        continue
    sentence = row[2]
    translated_sentence = translate(
        model, sentence, max_length=80
    )
    file2.write(sentence + "\n")
    file3.write(translated_sentence + "\n")
file2.close()
file3.close()

## References

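[1] PyTorch seq2seq translation tutorial: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

[2] Bahdanau, Cho and Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate": https://arxiv.org/pdf/1409.0473.pdf

[3] Text preprocessing tutorial: https://colab.research.google.com/drive/1p3oGPcNdORw5_MDcufTDYWJhJt3XVPuC?usp=sharing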