# 1 - Sequence to Sequence Learning with Neural Networks
<font size = 5>**Hindi to English Translation**</font>

<font size = 4> References:</font> 
- <font size = 4> https://github.com/cfiltnlp/IITB-English-Hindi-PC
- <font size = 4> https://github.com/bentrevett/pytorch-seq2seq





## Import Libraries



In [None]:
#!pip install datasets

In [19]:
# Import Libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

import torchtext 
from torchtext.vocab import vocab

import spacy
#import stanza

from datasets import load_dataset_builder, load_dataset 

import numpy as np
from collections import Counter, OrderedDict
import random
import math
import time
import pandas as pd
from pathlib import Path
import joblib
import pickle
import swifter

In [2]:
#from google.colab import drive
#drive.mount('/content/drive')

In [3]:
folder = Path('/home/harpreet/Insync/google_drive_harpreet/Research/NLP/pytorch-seq2seq')

In [4]:
#!pip install -U spacy

In [5]:
torchtext.__version__, torch.__version__, torch.cuda.is_available(), spacy.__version__

('0.11.0', '1.10.0', True, '3.2.4')

# Set Seeds

In [6]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Load Data and Tokenize

Next, we download and load the train, validation and test data. 

The dataset we'll be using is the [IIT Bombay English-Hindi Corpus](https://www.cfilt.iitb.ac.in/iitb_parallel/). 
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

The datset can be  downloaded from huggingface library as well

### Load Data and create DataFrame

In [7]:
# check the datsset without downloading it
# need to run this cell only once
'''
dataset_builder = load_dataset_builder('cfilt/iitb-english-hindi')
print(dataset_builder.cache_dir)
print(f'\n{dataset_builder.info.features}')
print(f'\n{dataset_builder.info.splits}')
'''

"\ndataset_builder = load_dataset_builder('cfilt/iitb-english-hindi')\nprint(dataset_builder.cache_dir)\nprint(f'\n{dataset_builder.info.features}')\nprint(f'\n{dataset_builder.info.splits}')\n"

In [8]:
# need to run this cell only once
#dataset = load_dataset("cfilt/iitb-english-hindi")

In [9]:
# need to run this cell only once
#dataset

In [10]:
# need to run this cell only once
#dataset['train'][0:2]

In [11]:
# need to run this cell only once
# pd.DataFrame(dataset['train'][0:2]['translation'])

In [12]:
## need to run this cell only once
# Note : There is a better and easier way of converting HUgging Face Datsets to Pandas and viceversa
'''
df={}
for split in ['train', 'validation', 'test']:
    df[split] = pd.DataFrame(dataset[split]['translation'])
'''

"\ndf={}\nfor split in ['train', 'validation', 'test']:\n    df[split] = pd.DataFrame(dataset[split]['translation'])\n"

In [13]:
#df['train'].head()

In [14]:
# need to run this cell only once
#df['train'].info()

In [15]:
# need to run this cell only once
#df['test'].head(2)

In [16]:
# need to run this cell only once
#df['validation'].head(2)

### Tokenization

In [17]:
# need to run this cell only once
#nlp_en = stanza.Pipeline(lang='en', processors = 'tokenize', tokenize_no_split = True)

In [18]:
# need to run this cell only once
#nlp_hi = stanza.Pipeline(lang='hi', processors = 'tokenize', tokenize_no_split = True)

Next, we create the tokenizer functions. These can be passed to torchtext and will take in the sentence as a string and return the sentence as a list of tokens.

<font color = 'red'>**In the paper we are implementing, they find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier".We copy this by reversing the German sentence after it has been transformed into a list of tokens.**</font>

In [19]:
# need to run this cell only once
'''
def my_tokenizer(stanza_pipeline, data, reverse =False, ):
    token_list =[]
    for text in data:
        doc = stanza_pipeline(text)
        tokens =[]
        for sent in doc.sentences:            
            for token in sent.tokens:
                tokens.append(token.text)
        if reverse:
            tokens.reverse()
        token_list.append(tokens) 
    return token_list
'''

'\ndef my_tokenizer(stanza_pipeline, data, reverse =False, ):\n    token_list =[]\n    for text in data:\n        doc = stanza_pipeline(text)\n        tokens =[]\n        for sent in doc.sentences:            \n            for token in sent.tokens:\n                tokens.append(token.text)\n        if reverse:\n            tokens.reverse()\n        token_list.append(tokens) \n    return token_list\n'

In [20]:
# need to run this cell only once
#df['test']['hi'][0:2].values

In [21]:
# need to run this cell only once
#df['test']['en'][0:2].values

In [22]:
# need to run this cell only once
#my_tokenizer(nlp_hi, df['test']['hi'][0:2].values, reverse=True)

In [23]:
# need to run this cell only once
#my_tokenizer(nlp_en, df['test']['en'][0:2].values, reverse=False)

In [24]:
# need to run this cell only once
'''
for split in ['train', 'validation', 'test']:
    df[split]['source_tokens'] = my_tokenizer(nlp_hi, df[split]['hi'].values, reverse = True)
    df[split]['target_tokens'] = my_tokenizer(nlp_en, df[split]['en'].values)
    df[split] = df[split][['source_tokens', 'target_tokens']]
'''

"\nfor split in ['train', 'validation', 'test']:\n    df[split]['source_tokens'] = my_tokenizer(nlp_hi, df[split]['hi'].values, reverse = True)\n    df[split]['target_tokens'] = my_tokenizer(nlp_en, df[split]['en'].values)\n    df[split] = df[split][['source_tokens', 'target_tokens']]\n"

In [None]:
#df_train = df['train']
#df_valid = df['validation']
#df_test = df['test']

In [15]:
#df['train']  = df_train.rename(columns = {'source_tokens': 'source_tokens_reversed'})

In [16]:
#df_test = df_test.rename(columns = {'source_tokens: 'source_tokens_reversed'})
#df_valid = df_valid.rename(columns = {'source_tokens': 'source_tokens_reversed'})

In [30]:
#df_train['source_tokens'] = df_train['source_tokens_reversed'].swifter.apply(
#                                                           lambda x: x[::-1])

Pandas Apply:   0%|          | 0/1659083 [00:00<?, ?it/s]

In [33]:
#df_test['source_tokens'] = df_test['source_tokens_reversed'].swifter.apply(
#                                                           lambda x: x[::-1])

Pandas Apply:   0%|          | 0/2507 [00:00<?, ?it/s]

In [34]:
#df_valid['source_tokens'] = df_valid['source_tokens_reversed'].swifter.apply(
#                                                           lambda x: x[::-1])

Pandas Apply:   0%|          | 0/520 [00:00<?, ?it/s]

#### Save Tokenized data

In [37]:
# need to run this cell only once
#df_train.to_pickle(path=folder/'df_train_hi_en')
#df_test.to_pickle(path=folder/'df_test_hi_en')
#df_test.to_pickle(path=folder/'df_valid_hi_en')

## Load Tokenized Data

In [38]:
df_train = pd.read_pickle(folder/'df_train_hi_en')
df_test = pd.read_pickle(folder/'df_test_hi_en')
df_valid = pd.read_pickle(folder/'df_valid_hi_en')

In [39]:
df_train.head()

Unnamed: 0,source_tokens_reversed,target_tokens,source_tokens
0,"[दें, लाभ, का, व्यायाम, पहुंचनीयता, को, अनुप्र...","[Give, your, application, an, accessibility, w...","[अपने, अनुप्रयोग, को, पहुंचनीयता, व्यायाम, का,..."
1,"[अन्वेषक, पहुंचनीयता, एक्सेर्साइसर]","[Accerciser, Accessibility, Explorer]","[एक्सेर्साइसर, पहुंचनीयता, अन्वेषक]"
2,"[खाका, प्लग-इन, डिफोल्ट, लिए, के, पटल, निचले]","[The, default, plugin, layout, for, the, botto...","[निचले, पटल, के, लिए, डिफोल्ट, प्लग-इन, खाका]"
3,"[खाका, प्लग-इन, डिफोल्ट, लिए, के, पटल, ऊपरी]","[The, default, plugin, layout, for, the, top, ...","[ऊपरी, पटल, के, लिए, डिफोल्ट, प्लग-इन, खाका]"
4,"[है, गया, किया, निष्क्रिय, से, रूप, डिफोल्ट, ज...","[A, list, of, plugins, that, are, disabled, by...","[उन, प्लग-इनों, की, सूची, जिन्हें, डिफोल्ट, रू..."


In [40]:
df_valid.head()

Unnamed: 0,source_tokens_reversed,target_tokens,source_tokens
0,"[?, बॉक्स, ब्लैक, में, कार, आपकी]","[A, black, box, in, your, car, ?]","[आपकी, कार, में, ब्लैक, बॉक्स, ?]"
1,"[।, है, जाता, हो, फिट, से, सफ़ाई, पर, डैशबोर्ड...","[As, America, 's, road, planners, struggle, to...","[जबकि, अमेरिका, के, सड़क, योजनाकार, ,, ध्वस्त,..."
2,"[।, है, चुका, बन, मुद्दा, का, प्रयास, विवादास्...","[The, devices, ,, which, track, every, mile, a...","[यह, डिवाइस, ,, जो, मोटर-चालक, द्वारा, वाहन, च..."
3,"[।, है, गया, बन, मुद्दा, का, गठबंधनों, जीवंत, ...","[The, usually, dull, arena, of, highway, plann...","[आम, तौर, पर, हाईवे, नियोजन, जैसा, उबाऊ, काम, ..."
4,"[।, हैं, गए, मिल, साथ, के, समूहों, पर्यावरणीय,...","[Libertarians, have, joined, environmental, gr...","[आपने, द्वारा, ड्राइव, किए, गए, मील, ,, तथा, स..."


In [41]:
df_test.head()

Unnamed: 0,source_tokens_reversed,target_tokens,source_tokens
0,"[?, बॉक्स, ब्लैक, में, कार, आपकी]","[A, black, box, in, your, car, ?]","[आपकी, कार, में, ब्लैक, बॉक्स, ?]"
1,"[।, है, जाता, हो, फिट, से, सफ़ाई, पर, डैशबोर्ड...","[As, America, 's, road, planners, struggle, to...","[जबकि, अमेरिका, के, सड़क, योजनाकार, ,, ध्वस्त,..."
2,"[।, है, चुका, बन, मुद्दा, का, प्रयास, विवादास्...","[The, devices, ,, which, track, every, mile, a...","[यह, डिवाइस, ,, जो, मोटर-चालक, द्वारा, वाहन, च..."
3,"[।, है, गया, बन, मुद्दा, का, गठबंधनों, जीवंत, ...","[The, usually, dull, arena, of, highway, plann...","[आम, तौर, पर, हाईवे, नियोजन, जैसा, उबाऊ, काम, ..."
4,"[।, हैं, गए, मिल, साथ, के, समूहों, पर्यावरणीय,...","[Libertarians, have, joined, environmental, gr...","[आपने, द्वारा, ड्राइव, किए, गए, मील, ,, तथा, स..."


In [42]:
df_train['source_len']= df_train['source_tokens'].swifter.apply(lambda x : len(x))

Pandas Apply:   0%|          | 0/1659083 [00:00<?, ?it/s]

In [43]:
df_train['source_len'].max()

1463

In [44]:
df_train['source_len'].min()

0

In [45]:
df_train['source_len'].mean()

15.419901234597667

## Build Vocab

In [37]:
# need to run this cell only once

# Function to create vocab and insert special tokens
# the function should take text, min_freq and specials as input
# also set the index for default words to 0.

'''
def create_vocab(text, min_freq, specials):
    my_counter = Counter()
    for sent in text:
      my_counter.update(sent)  
    my_vocab = vocab(my_counter, min_freq = min_freq)
    for i, special in enumerate(specials):
        my_vocab.insert_token(special, i)
    my_vocab.set_default_index(0)
        
    return my_vocab
'''

'\ndef create_vocab(text, min_freq, specials):\n    my_counter = Counter()\n    for sent in text:\n      my_counter.update(sent)  \n    my_vocab = vocab(my_counter, min_freq = min_freq)\n    for i, special in enumerate(specials):\n        my_vocab.insert_token(special, i)\n    my_vocab.set_default_index(0)\n        \n    return my_vocab\n'

In [38]:
# need to run this cell only once
#source_vocab = create_vocab(text= df_train.source_tokens.values, min_freq=300,specials= ['<UNK>', '<BOS>', '<EOS>','<PAD>'])

In [39]:
# need to run this cell only once
#target_vocab = create_vocab(text= df_train.target_tokens.values, min_freq=300,specials= ['<UNK>', '<BOS>', '<EOS>','<PAD>'])

In [40]:
#len(source_vocab), len(target_vocab)

In [41]:
# need to run this cell only once
#pickle.dump(source_vocab, open(folder/'source_vocab_hi_en.pkl', 'wb'))
#pickle.dump(target_vocab, open(folder/'target_vocab_hi_en.pkl', 'wb'))

## Load Vocab

In [46]:
source_vocab = pickle.load(open(folder/'source_vocab_hi_en.pkl','rb'))
target_vocab = pickle.load(open(folder/'target_vocab_hi_en.pkl','rb'))

In [43]:
#pd.DataFrame(source_vocab.get_stoi().items(), columns=['tokens', 'index']).sort_values(by = ['index'])

In [44]:
# check index of unknown word - it should be zero
#source_vocab['abracdabra']

In [47]:
target_vocab['from']

55

In [48]:
len(source_vocab), len(target_vocab)

(6115, 6537)

# Create Dataset and Dataloader

In [50]:
class EngHindi(Dataset):
    
    '''
    Takes input as (X1, X2)
    X1 : pandas series for  source language
    X2 : pndas series for target language
    '''
    def __init__(self, X1, X2):
        self.X1 = X1
        self.X2 = X2
        
        
    def __len__(self):
        return len(self.X1)
    
    def __getitem__(self, indices):
        source_examples = self.X1.iloc[indices]  
        target_examples = self.X2.iloc[indices]
        return source_examples, target_examples    

In [51]:
trainset = EngHindi(df_train['source_tokens_reversed'], df_train['target_tokens'])
testset =  EngHindi(df_test['source_tokens_reversed'], df_test['target_tokens'])
validset = EngHindi(df_valid['source_tokens_reversed'], df_valid['target_tokens'])

In [52]:
trainset.__getitem__(0)

(['दें', 'लाभ', 'का', 'व्यायाम', 'पहुंचनीयता', 'को', 'अनुप्रयोग', 'अपने'],
 ['Give', 'your', 'application', 'an', 'accessibility', 'workout'])

In [53]:
len(trainset), len(testset), len(validset)

(1659083, 2507, 2507)

In [54]:
len(trainset)*0.02

33181.66

In [55]:
# get subset of data
# We will be using only 100 images for nboth train and validation datasets
train_sample_size = int(len(trainset)*0.02)

# Getting n random indices
train_subset_indices = random.sample(range(0, len(trainset)), train_sample_size)

# Getting subset of dataset
train_subset = torch.utils.data.Subset(trainset, train_subset_indices)

In [56]:
print(train_subset.__getitem__(11))

(['।', 'होगा', 'सुनिश्चित', 'भविष्य', 'का', 'राष्ट्र', 'हमारे', 'और', 'बढ़ेगा', 'आधार', 'का', 'पीरामिड', 'नवाचार', 'हमारे', 'से', 'बोने', 'बीज', 'के', 'शक्ति', 'की', 'नवाचारों', 'और', 'विचारों', 'नए', 'में', 'बच्चों', 'स्कूली'], ['Seeding', 'the', 'power', 'of', 'ideas', 'and', 'innovation', 'in', 'schoolchildren', 'will', 'broaden', 'the', 'base', 'of', 'our', 'innovation', 'pyramid', 'and', 'secure', 'the', 'future', 'of', 'our', 'nation', '.'])


In [57]:
# transform text to indexes and append eos and bos
# finally convert to tensors
def text_transform(my_vocab, text):
    text_num = [my_vocab['<BOS>']] + [my_vocab[word] for word in text] + [my_vocab['<EOS>']]
    return torch.tensor(text_num)   

In [58]:
text = train_subset.__getitem__(13)[1]
print(text)

['Bad', 'current', 'tag', 'value', '.']


In [59]:
text_transform(target_vocab, text)

tensor([   1, 1032,   37,  672,  104,   33,    2])

In [60]:
def collate_batch(batch):
    source_list, target_list = [], []
    for source, target in batch:
        source_tensor = text_transform(source_vocab, source)
        target_tensor = text_transform(target_vocab, target)
        source_list.append(source_tensor)
        target_list.append(target_tensor)
        
    source_pad = pad_sequence(source_list, batch_first=False, padding_value= source_vocab['<PAD>'])
    target_pad = pad_sequence(target_list, batch_first=False, padding_value= target_vocab['<PAD>'])
    
    return source_pad, target_pad     

In [61]:
#?DataLoader

In [62]:
batch_size = 2
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle= True,collate_fn = collate_batch )

In [63]:
for source, target in train_loader:
  print(source)
  break

tensor([[   1,    1],
        [2448,   29],
        [4585,  174],
        [ 146,  175],
        [ 355,   31],
        [1390,  802],
        [ 528, 1386],
        [1700,    2],
        [   6,    3],
        [ 168,    3],
        [2294,    3],
        [   8,    3],
        [5179,    3],
        [  25,    3],
        [   0,    3],
        [   0,    3],
        [  34,    3],
        [   0,    3],
        [  41,    3],
        [5556,    3],
        [  37,    3],
        [2499,    3],
        [  34,    3],
        [   6,    3],
        [   0,    3],
        [  37,    3],
        [2597,    3],
        [2225,    3],
        [  10,    3],
        [3058,    3],
        [  25,    3],
        [5928,    3],
        [3405,    3],
        [   8,    3],
        [   0,    3],
        [ 176,    3],
        [1504,    3],
        [  21,    3],
        [4014,    3],
        [   2,    3]])


In [64]:
BATCH_SIZE = 128

train_loader = DataLoader(train_subset, batch_size=BATCH_SIZE, shuffle = True, collate_fn = collate_batch )
valid_loader = DataLoader(validset, batch_size=BATCH_SIZE, shuffle = False, collate_fn = collate_batch )
test_loader = DataLoader(testset, batch_size=BATCH_SIZE, shuffle = False, collate_fn = collate_batch )

## Building the Seq2Seq Model

We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

### Encoder (LSTM)

First, the encoder, a 2 layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers. 


So our encoder looks something like this: 

![](assets/seq2seq2.png)



In [65]:
#?nn.Embedding

In [66]:
#?nn.LSTM

### Encoder

In [67]:
class Encoder(nn.Module):
    
    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 pad_idx, emb_drop_prob, num_layers, 
                 rnn_drop_prob):
      
      super().__init__()

      self.vocab_size = vocab_size
      self.emb_dim = emb_dim
      self.pad_idx= pad_idx
      self.emb_drop_prob=emb_drop_prob
      self.hidden_dim = hidden_dim
      self.num_layers = num_layers
      self.rnn_drop_prob = rnn_drop_prob
      
      self.embedding = nn.Embedding(num_embeddings= self.vocab_size,
                                        embedding_dim=self.emb_dim,
                                        padding_idx=self.pad_idx)

      self.dropout = nn.Dropout(p = self.emb_drop_prob)

      self.lstm_layer = nn.LSTM(input_size=self.emb_dim,
                               hidden_size=self.hidden_dim,
                               num_layers=self.num_layers,
                               batch_first=False,
                               bidirectional=False,
                               dropout = self.rnn_drop_prob
                                 )

      
        
    def forward(self, source_indices):
      # source_indices: [seq_len, batch_size]
      emb = self.embedding(source_indices) # shape : [seq_len, batch_size, emb_dim]
      emb_drop = self.dropout(emb) # shape : [seq_len, batch_size, emb_dim]
      
      
      output, (hidden, cell) = self.lstm_layer(emb_drop ) # h0, c0 are optional
      # output : [seq_len, batch_size, directions * hidden_dim]
      # hidden : [directions* num_layers, batch_size, hidden_dim]

      return hidden, cell       


### Decoder

Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](assets/seq2seq3.png)


In [68]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, 
                 pad_idx, emb_drop_prob, num_layers, 
                 rnn_drop_prob):
      
     
      super().__init__()

      self.vocab_size = vocab_size
      self.emb_dim = emb_dim
      self.pad_idx= pad_idx
      self.emb_drop_prob=emb_drop_prob
      self.hidden_dim = hidden_dim
      self.num_layers = num_layers
      self.rnn_drop_prob = rnn_drop_prob
      

      self.embedding = nn.Embedding(num_embeddings= self.vocab_size,
                                        embedding_dim=self.emb_dim,
                                        padding_idx=self.pad_idx)

      self.dropout = nn.Dropout(p = self.emb_drop_prob)

      self.lstm_layer = nn.LSTM(input_size=self.emb_dim,
                               hidden_size=self.hidden_dim,
                               num_layers=self.num_layers,
                               batch_first=False,
                               bidirectional=False,
                               dropout = self.rnn_drop_prob
                                 )
      
      self.linear= nn.Linear(in_features=self.hidden_dim,
                                 out_features=self.vocab_size)

      
        
    def forward(self, input_token, hidden, cell):

      # in decoder we pass one token at a time- seq_len is 1
      # shape of input_token : [batch_size]

      #input_token = input_token.unsqueeze(0) # [1, batch_size, emb_dim]
      #print(input_token.shape) 
            
      emb = self.dropout(self.embedding(input_token)) 
      # emb - [batch_size, emb_dim]
      
      # lstm layer needs input in the shape: [seq_len, batch_size, emb_dim]
      # we will add a redundant dimension to change  the shape of emb
      emb = emb.unsqueeze(0)
      
      #print(emb.shape)
      output, (hidden, cell) = self.lstm_layer(emb,(hidden, cell) )

      # output - [seq_len, batch_size, directions * hidden_dim]
      # sequence length is always one for decoder as we pass one token at a time
      # we never use bidirectional for decoder as decoder cannot look ahead
      # hence shapes 
      # output: [1, batch_size, hidden_dim) 
      # hidden - [num_layers, batch_size, hidden_dim]

      output = output.squeeze(0) # squeeze out the redundant dim to make it work for linear layer
      # output: [1, batch_size, hidden_dim) 
      prediction = self.linear(output)
      # prediction: [batch_size, decoder_vocab_size] 
      # each word is projected to vocab size
      


      return prediction, hidden, cell


### Seq2Seq

For the final part of the implemenetation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](assets/seq2seq4.png)



In [69]:
#?nn.LSTM

In [70]:
#?torch.argmax

In [71]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device ):

      super().__init__()

      self.decoder = decoder
      self.encoder = encoder
      self.device = device
      

      assert self.decoder.hidden_dim == self.encoder.hidden_dim, \
      'The hidden_dim for encoder and decoder should be equal'

      assert self.decoder.num_layers == self.encoder.num_layers, \
      ' The number of layers in encoder and decoder should be same'


        
    def forward(self, source_indices, target_indices, teacher_enf_ratio ):
      # source_indices, target_indices # [seq_len, batch_size]
      
      

      seq_len = target_indices.shape[0]
      batch_size =  target_indices.shape[1]
      vocab_size = self.decoder.vocab_size

      predictions = torch.zeros((seq_len, batch_size, vocab_size)).to(self.device) # [tar_len, batch_size, target_vocab_size]

      hidden, cell = self.encoder(source_indices) # [directions * num_layer, batch_size, hidden_dim]

      input_token = target_indices[0,:] # [batch_size]

      # we will not update the predictions corresponding to first token -<BOS> 

      for i in range(1, len(target_indices)):
        prediction, hidden, cell = self.decoder(input_token, hidden, cell) # prediction: [batch_size, decoder_vocab_size] 

        # update predictions
        predictions[i] = prediction

        teacher_force = torch.rand(1) < teacher_enf_ratio

        if teacher_force:
          input_token = target_indices[i,:]
        else:
          input_token = torch.argmax(prediction, dim =1) # batch_size

      return predictions # [tar_len, batch_size, target_vocab_size]     


In [72]:
torch.rand(1)

tensor([0.0583])

# Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimesions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our Seq2Seq model, which we place on the `device`.

In [73]:
# list of hyperparameters for encoder, decoder, model
ENC_VOCAB_SIZE = len(source_vocab)
ENC_EMB = 128
HID_DIM = 256
ENC_PAD_IDX = source_vocab['<PAD>']
ENC_EMB_DROP_PROB = 0.5
NUM_LAYERS = 2
ENC_RNN_DROP_PROB = 0.5

DEC_VOCAB_SIZE = len(target_vocab)
DEC_EMB = 256
DEC_PAD_IDX = target_vocab['<PAD>']
DEC_EMB_DROP_PROB = 0.5
DEC_RNN_DROP_PROB = 0.5

LEARNING_RATE = 0.001

enc = Encoder(vocab_size=ENC_VOCAB_SIZE , emb_dim =ENC_EMB, hidden_dim=HID_DIM,
              pad_idx=ENC_PAD_IDX, emb_drop_prob=ENC_EMB_DROP_PROB, num_layers=NUM_LAYERS, 
              rnn_drop_prob=ENC_RNN_DROP_PROB)

dec = Decoder(vocab_size=DEC_VOCAB_SIZE, emb_dim=DEC_EMB, hidden_dim=HID_DIM, 
              pad_idx=DEC_PAD_IDX, emb_drop_prob=DEC_EMB_DROP_PROB, num_layers=NUM_LAYERS, 
              rnn_drop_prob=DEC_RNN_DROP_PROB)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = Seq2Seq(encoder=enc, decoder=dec, device=device )
model.to(device)

cuda


Seq2Seq(
  (decoder): Decoder(
    (embedding): Embedding(6537, 256, padding_idx=3)
    (dropout): Dropout(p=0.5, inplace=False)
    (lstm_layer): LSTM(256, 256, num_layers=2, dropout=0.5)
    (linear): Linear(in_features=256, out_features=6537, bias=True)
  )
  (encoder): Encoder(
    (embedding): Embedding(6115, 128, padding_idx=3)
    (dropout): Dropout(p=0.5, inplace=False)
    (lstm_layer): LSTM(128, 256, num_layers=2, dropout=0.5)
  )
)

Next up is initializing the weights of our model. In the paper they state they initialize all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

We initialize weights in PyTorch by creating a function which we `apply` to our model. When using `apply`, the `init_weights` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with `nn.init.uniform_`.

In [74]:
#?model.named_parameters

In [75]:
def init_weights(m):
    for param in model.parameters():
      nn.init.uniform_(param.data,-0.08, 0.08)
# initilaize weights of  the model
# model.apply(init_weights)

We also define a function that will calculate the number of trainable parameters in the model.

In [76]:
#?model.parameters

In [77]:
def count_parameters(model):
    return sum([param.numel() for param in model.parameters() if param.requires_grad == True])

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 6,110,473 trainable parameters


We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use Adam.

In [78]:
optimizer = torch.optim.Adam(model.parameters(), lr= LEARNING_RATE)

Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions. 

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token. 

In [79]:
#?nn.CrossEntropyLoss

In [80]:
criterion = nn.CrossEntropyLoss(ignore_index=DEC_PAD_IDX)

Next, we'll define our training loop. 

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Here, when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

At each iteration:
- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
    - we slice off the first column of the output and target tensors as mentioned above
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding (a common issue in RNNs)
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total

Finally, we return the loss that is averaged over all batches.

In [81]:
#?clip_grad_norm_

In [82]:
#scaler = torch.cuda.amp.GradScaler()
def train(model, iterator, optimizer, criterion, clip, teacher_enf_ratio):

  model.train()
  epoch_loss = 0

  for src , tgt in iterator:
    src = src.to(device)
    tgt = tgt.to(device)
    

    # get predictions
    logits = model(src, tgt, teacher_enf_ratio)

    tgt = tgt[1:,:] # first seq corresponds to token '<BOS>'
    logits = logits[1:, :]

    num_classes  = logits.shape[-1]
    logits = logits.view(-1, num_classes) # [(trg_seq_length-1) * batch_size, num_classes]
    tgt = tgt.view(-1) # [(trg_seq_length-1) * batch_size]

    # set gradients to zero to avoid gradient accumulation from previous iterations
    optimizer.zero_grad

    # calculate loss
    #with torch.cuda.amp.autocast():
    loss = criterion(logits, tgt)

    # calculate gradients
    loss.backward()
    #scaler.scale(loss).backward()

    # clip gradients
    clip_grad_norm_(model.parameters(), clip)

    # update parameters
    optimizer.step()
    
    #scaler.step(optimizer)
    #scaler.update()
    
    epoch_loss+= loss.item()

  return epoch_loss/len(iterator)   

Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an optimizer or a clip value.

We must remember to set the model to evaluation mode with `model.eval()`. This will turn off dropout (and batch normalization, if used).

We use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up. 

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use it's own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [83]:
#?model.state_dict

In [84]:
'''
m=torch.arange(10).view(5,2)
print(m)
m = m[1:]
print(m)
'''

'\nm=torch.arange(10).view(5,2)\nprint(m)\nm = m[1:]\nprint(m)\n'

In [85]:
def evaluate(model, iterator, criterion, teacher_enf_ratio):
    model.eval()
    epoch_loss = 0
  

    with torch.no_grad():
      for src , tgt in iterator:

        src = src.to(device) # [src_seq_len, batch_size]

        tgt = tgt.to(device) # [tgt_seq_len, batch_size]

        # get predictions
        logits = model(src, tgt, teacher_enf_ratio) # [trg_seq_length, batch_size, tgt_vocab_size]

        tgt = tgt[1:,:] # first seq corresponds to token '<BOS>'
        logits = logits[1:, :]

        num_classes  = logits.shape[-1]
        logits = logits.view(-1, num_classes) # [(trg_seq_length-1) * batch_size, num_classes]
        tgt = tgt.view(-1) # [(trg_seq_length-1) * batch_size]

        # calculate loss
        loss = criterion(logits, tgt)

        epoch_loss+= loss.item()

    return epoch_loss/len(iterator)  

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [86]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [87]:
# understanding the above function
'''
start_time = time.time()
print('start time', start_time)
time.sleep(63)
end_time = time.time()
print('end time', end_time)
elapsed_time = end_time-start_time
print('elapsed time', elapsed_time)
elapsed_mins= int(elapsed_time/60)
print('elapsed mins',elapsed_mins)
elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
print('elapsed secs', elapsed_secs)
'''

"\nstart_time = time.time()\nprint('start time', start_time)\ntime.sleep(63)\nend_time = time.time()\nprint('end time', end_time)\nelapsed_time = end_time-start_time\nprint('elapsed time', elapsed_time)\nelapsed_mins= int(elapsed_time/60)\nprint('elapsed mins',elapsed_mins)\nelapsed_secs = int(elapsed_time - (elapsed_mins * 60))\nprint('elapsed secs', elapsed_secs)\n"

We can finally start training our model!

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we'll use the saved parameters used to achieve the best validation loss. 

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.

In [88]:
import  gc
gc.collect()
torch.cuda.empty_cache()

In [89]:
N_EPOCHS = 10
CLIP = 1
TRAIN_TEACHER_ENF_RATIO = 0.5
VALID_TEACHER_ENF_RATIO = 0 # turn off teacher enforcing for evaluation

min_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_loader, optimizer, criterion, CLIP, TRAIN_TEACHER_ENF_RATIO)
    valid_loss = evaluate(model, valid_loader, criterion, VALID_TEACHER_ENF_RATIO)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # save the model corresponding to min vlaid loss
    if valid_loss<min_valid_loss:
      min_valid_loss = valid_loss
      torch.save(model.state_dict(), folder/'1c_en_hi.pt')
    
   
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 41s
	Train Loss: 5.989 | Train PPL: 398.859
	 Val. Loss: 5.773 |  Val. PPL: 321.345
Epoch: 02 | Time: 1m 45s
	Train Loss: 5.684 | Train PPL: 294.098
	 Val. Loss: 5.739 |  Val. PPL: 310.634
Epoch: 03 | Time: 1m 42s
	Train Loss: 5.528 | Train PPL: 251.549
	 Val. Loss: 5.749 |  Val. PPL: 313.863
Epoch: 04 | Time: 1m 43s
	Train Loss: 5.412 | Train PPL: 224.110
	 Val. Loss: 5.678 |  Val. PPL: 292.425
Epoch: 05 | Time: 2m 29s
	Train Loss: 5.290 | Train PPL: 198.379
	 Val. Loss: 5.697 |  Val. PPL: 297.979
Epoch: 06 | Time: 2m 44s
	Train Loss: 5.193 | Train PPL: 179.974
	 Val. Loss: 5.673 |  Val. PPL: 290.974
Epoch: 07 | Time: 3m 41s
	Train Loss: 5.124 | Train PPL: 168.079
	 Val. Loss: 5.593 |  Val. PPL: 268.519
Epoch: 08 | Time: 4m 12s
	Train Loss: 5.076 | Train PPL: 160.099
	 Val. Loss: 5.594 |  Val. PPL: 268.719
Epoch: 09 | Time: 2m 25s
	Train Loss: 5.050 | Train PPL: 156.020
	 Val. Loss: 5.630 |  Val. PPL: 278.551
Epoch: 10 | Time: 1m 59s
	Train Loss: 4.995 | Train PPL

We'll load the parameters (`state_dict`) that gave our model the best validation loss and run it the model on the test set.

In [91]:
# load the saved model and get test loss
model.load_state_dict(torch.load(folder/'1c_en_hi.pt'))
test_loss = evaluate(model, test_loader, criterion, 0)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 5.561 | Test PPL: 260.001 |


In the following notebook we'll implement a model that achieves improved test perplexity, but only uses a single layer in the encoder and the decoder.