<a href="https://colab.research.google.com/github/nanmaharaj/CVIT-AI-Summer-School/blob/main/Day__01_Seq2Seq_MachineTranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Seq2Seq Translation

This exercise involves constructing a deep learning model using PyTorch and TorchText for sequence-to-sequence tasks. Specifically, we will focus on German to English neural machine translation. However, it is important to note that the underlying concept can be applied to other tasks like Named Entity Recognition (NER), Text Summarization, and more.

# Introduction


The sequence-to-sequence (seq2seq) model discussed in this exercise adopts an encoder-decoder architecture, employing LSTM (Long Short Term Memory) as a type of recurrent neural network. In this architecture, the encoder neural network takes the input German sequence and encodes it into a single vector known as the Context Vector. This vector serves as an abstract representation of the input German sequence.

Next, the Context Vector is fed into the decoder neural network, which generates the English translation one word at a time, conditioning each word on the information encoded in the Context Vector.

# Necessary Imports

In [None]:
!pip install torchtext==0.6.0 --quiet
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import numpy as np
import pandas as pd
import spacy
import random
from torchtext.data.metrics import bleu_score
from pprint import pprint
from torch.utils.tensorboard import SummaryWriter
from torchsummary import summary

# Seeding for reproducible results every time
SEED = 777

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


# Data Preparation & Pre-processing

We'll use spaCy's pretrained German and English pipelines to tokenize the text.



In [None]:
!python -m spacy download en_core_web_sm --quiet
!python -m spacy download de_core_news_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


In [None]:
spacy_german = spacy.load("de_core_news_sm")
spacy_english = spacy.load("en_core_web_sm")

## Tokenization

We now define a tokenization function for each language. Tokenization splits a sentence into a list of individual tokens (words and punctuation marks).

spaCy performs the tokenization for both German and English, and TorchText handles the rest of the pre-processing: lowercasing, adding start and end tokens, and building the vocabulary.

In [None]:
def tokenize_german(text):
  return [token.text for token in spacy_german.tokenizer(text)]

def tokenize_english(text):
  return [token.text for token in spacy_english.tokenizer(text)]

### Sample Run ###

sample_text = "I love machine learning"
print(tokenize_english(sample_text))

['I', 'love', 'machine', 'learning']



TorchText provides tools for preparing text data for natural language processing (NLP) tasks. Here, a Field describes how each language is processed: which tokenizer to use, whether to lowercase the text, and which start and end tokens to add to every sentence.



In [None]:
german = Field(tokenize=tokenize_german,
               lower=True,
               init_token="<sos>",
               eos_token="<eos>")

english = Field(tokenize=tokenize_english,
               lower=True,
               init_token="<sos>",
               eos_token="<eos>")

train_data, valid_data, test_data = Multi30k.splits(exts = (".de", ".en"),
                                                    fields=(german, english))

german.build_vocab(train_data, max_size=10000, min_freq=3)
english.build_vocab(train_data, max_size=10000, min_freq=3)

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:01<00:00, 716kB/s] 


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 232kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 220kB/s]


In [None]:
print(f"Unique tokens in source (de) vocabulary: {len(german.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(english.vocab)}")

Unique tokens in source (de) vocabulary: 5374
Unique tokens in target (en) vocabulary: 4556
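To make the effect of min_freq and max_size concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. It is an illustration, not TorchText's implementation: the helper names build_vocab and numericalize are hypothetical, and the special-token order (<unk>, <pad>, <sos>, <eos>) and the alphabetical tie-break are assumptions.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, max_size=10000,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi) keeping tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties broken alphabetically (an assumption)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, n in ranked if n >= min_freq][:max_size]
    itos = list(specials) + kept              # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending out-of-vocabulary words to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["a", "man", "walks"], ["a", "man", "runs"], ["a", "dog", "runs"]]
itos, stoi = build_vocab(corpus, min_freq=2)
indices = numericalize(["<sos>", "a", "cat", "runs", "<eos>"], stoi)
print(itos)
print(indices)
```

Words below the frequency threshold ("walks", "dog") and unseen words ("cat") all map to the <unk> index, which is why <unk> appears in the decoded batches later in this notebook.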


In [None]:
print(english.vocab.__dict__.keys())
print(list(english.vocab.__dict__.values()))
e = list(english.vocab.__dict__.values())
for i in e:
  print(i)

dict_keys(['freqs', 'itos', 'unk_index', 'stoi', 'vectors'])
0
None


In [None]:
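# Look up the string-to-index mapping by name rather than by its position in __dict__
word_2_idx = dict(english.vocab.stoi)
# Invert it to map indices back to words
idx_2_word = {idx: word for word, idx in word_2_idx.items()}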

In [None]:
print(idx_2_word)



# Dataset

In [None]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

print(train_data[5].__dict__.keys())
pprint(train_data[5].__dict__.values())

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
dict_keys(['src', 'trg'])
dict_values([['ein', 'mann', 'in', 'grün', 'hält', 'eine', 'gitarre', ',', 'während', 'der', 'andere', 'mann', 'sein', 'hemd', 'ansieht', '.'], ['a', 'man', 'in', 'green', 'holds', 'a', 'guitar', 'while', 'the', 'other', 'man', 'observes', 'his', 'shirt', '.']])


The next step is to generate batches of training, validation, and test data using iterators. Creating batches by hand is tedious, so we use the iterators provided by TorchText.

In this case, we are utilizing the BucketIterator, which offers efficient padding for both the source (German) and target (English) sentences. By accessing the .src attribute, we can retrieve the batch of German data, and similarly, the .trg attribute provides access to the corresponding batch of English data.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),
                                                                      batch_size = BATCH_SIZE,
                                                                      sort_within_batch=True,
                                                                      sort_key=lambda x: len(x.src),
                                                                      device = device)
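The idea behind bucketing can be sketched in a few lines of plain Python: group sentences of similar length so that little padding is needed, pad each group to its longest member, and lay the batch out with time steps as rows and sentences as columns. This is a simplified illustration, not BucketIterator itself (which also shuffles); make_batches is a hypothetical helper and the pad index of 1 is taken from this notebook's vocabulary.

```python
PAD = 1  # index of the <pad> token (assumed, matching this notebook's vocabulary)

def make_batches(sentences, batch_size):
    """Group sentences of similar length and pad each group to its longest member."""
    ordered = sorted(sentences, key=len)
    for start in range(0, len(ordered), batch_size):
        group = ordered[start:start + batch_size]
        longest = max(len(s) for s in group)
        padded = [s + [PAD] * (longest - len(s)) for s in group]
        # Transpose to time-major layout: rows are time steps, columns are sentences
        yield [list(step) for step in zip(*padded)]

sentences = [[2, 5, 9, 3], [2, 7, 3], [2, 5, 8, 6, 4, 3], [2, 4, 6, 8, 3]]
batches = list(make_batches(sentences, batch_size=2))
for b in batches:
    print(len(b), "time steps x", len(b[0]), "sentences:", b)
```

Because neighbouring lengths end up in the same batch, each batch needs at most one pad token per sentence here, whereas padding everything to the global maximum would add several.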

In [None]:
count = 0
max_len_eng = []
max_len_ger = []
for data in train_data:
  max_len_ger.append(len(data.src))
  max_len_eng.append(len(data.trg))
  if count < 10 :
    print("German - ",*data.src, " Length - ", len(data.src))
    print("English - ",*data.trg, " Length - ", len(data.trg))
    print()
  count += 1

print("Maximum Length of English sentence {} and German sentence {} in the dataset".format(max(max_len_eng),max(max_len_ger)))
print("Minimum Length of English sentence {} and German sentence {} in the dataset".format(min(max_len_eng),min(max_len_ger)))

German -  zwei junge weiße männer sind im freien in der nähe vieler büsche .  Length -  13
English -  two young , white males are outside near many bushes .  Length -  11

German -  mehrere männer mit schutzhelmen bedienen ein antriebsradsystem .  Length -  8
English -  several men in hard hats are operating a giant pulley system .  Length -  12

German -  ein kleines mädchen klettert in ein spielhaus aus holz .  Length -  10
English -  a little girl climbing into a wooden playhouse .  Length -  9

German -  ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster .  Length -  15
English -  a man in a blue shirt is standing on a ladder cleaning a window .  Length -  15

German -  zwei männer stehen am herd und bereiten essen zu .  Length -  10
English -  two men are at the stove preparing food .  Length -  9

German -  ein mann in grün hält eine gitarre , während der andere mann sein hemd ansieht .  Length -  16
English -  a man in green holds a guitar while the other

In [None]:
count = 0
for data in train_iterator:
  if count < 1 :
    print("Shapes", data.src.shape, data.trg.shape)
    print()
    print("German - ",*data.src, " Length - ", len(data.src))
    print()
    print("English - ",*data.trg, " Length - ", len(data.trg))
    temp_ger = data.src
    temp_eng = data.trg
    count += 1

Shapes torch.Size([13, 32]) torch.Size([17, 32])

German -  tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2], device='cuda:0') tensor([   5,   15,    5,    5,    8,    5,   43,    5,   15,    8,    5,   18,
           8, 3935,    8,    8,    8,   47,    5,    5,    8,    5,   18,    5,
           8,    8,    8,    8,   18,   35,    5,    8], device='cuda:0') tensor([  70,   26,   13,   13,   36,  171,   45, 1144,  103,  168,   70,   80,
         423,   11, 2797,   16,   16,    6,   13,   70,   26,   96,   45,   13,
         274,   36, 2655,  168,   65,   44,  654,   16], device='cuda:0') tensor([  26,  550,   12,   29,   22,   32,    7,  272,   32, 2249,   26,  466,
           0,  605,  323,   31,   37,  348,    7,  820,   16,   13,    7,   20,
          16,  371,    0,   36, 4041, 3613,   13,   62], device='cuda:0') tensor([   7, 1057,    6,   12,   45,   29,  648,   13,    0,   22,   12,    9,
          30, 5031,  466,   

In [None]:
temp_eng_idx = (temp_eng).cpu().detach().numpy()
temp_ger_idx = (temp_ger).cpu().detach().numpy()


For our experiment, we will utilize a batch size of 32. A sample target batch is provided below for reference. The sentences have been tokenized into lists of words and indexed based on the vocabulary. Specifically, the "pad" token is assigned an index of 1.

In the given target batch, each column represents a sentence that has been indexed with numerical values. The batch consists of 32 such sentences, and the number of rows corresponds to the maximum length among those sentences. To maintain consistent dimensions, shorter sentences are padded with 1.

The table contains the numerical indices representing the words. These indices are later fed to the word embedding layer, which converts them into dense vectors suitable for processing by the Seq2Seq model.

In [None]:
df_eng_idx = pd.DataFrame(data = temp_eng_idx, columns = [str("S_")+str(x) for x in np.arange(1, 33)])
df_eng_idx.index.name = 'Time Steps'
df_eng_idx.index = df_eng_idx.index + 1
df_eng_idx

Time Steps,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,...,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
1,2,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
2,4,7,4,9,4,7,176,145,7,4,...,16,4,4,4,4,4,16,226,4,4
3,53,24,9,10,38,61,17,683,70,59,...,50,9,122,38,166,381,196,4,26,14
4,34,224,8,36,12,35,48,9,35,448,...,6,6,14,12,200,12,17,188,9,10
5,6,105,4,8,50,89,50,13,0,12,...,204,4,22,123,137,905,1041,135,13,78
6,4,1128,264,25,617,6,6,4,7,233,...,2207,197,4,215,8,17,95,4,4,44
7,2167,69,10,211,6,7,90,441,59,19,...,533,3159,178,254,4,312,11,0,31,99
8,10,7,92,11,755,47,533,32,35,17,...,17,8,29,54,0,54,46,105,630,11
9,377,43,6,45,1411,11,36,40,6,32,...,239,4,314,7,149,4,10,10,10,304
10,13,493,7,4,5,1657,51,37,7,8,...,6,144,109,135,28,164,32,0,151,7


In [None]:
# Replace every index in the table with its word from the English vocabulary
df_eng_word = df_eng_idx.replace(idx_2_word)
df_eng_word

Time Steps,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,S_10,...,S_23,S_24,S_25,S_26,S_27,S_28,S_29,S_30,S_31,S_32
1,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,...,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>,<sos>
2,a,the,a,man,a,the,there,old,the,a,...,two,a,a,a,a,a,two,during,a,a
3,little,young,man,is,group,brown,are,bald,small,large,...,women,man,blond,group,rock,lot,kids,a,black,woman
4,boy,basketball,on,standing,of,dog,three,man,dog,number,...,in,in,woman,of,band,of,are,baseball,man,is
5,in,player,a,on,women,stands,women,with,<unk>,of,...,purple,a,wearing,soccer,plays,bicyclists,shoveling,game,with,riding
6,a,moves,skateboard,white,competing,in,in,a,the,elderly,...,shiny,suit,a,players,on,are,snow,a,a,her
7,kilt,into,is,sand,in,the,pink,beard,large,people,...,dresses,reclines,dark,waiting,a,ready,and,<unk>,red,bike
8,is,the,jumping,and,roller,water,dresses,sitting,dog,are,...,are,on,blue,for,<unk>,for,one,player,mask,and
9,fishing,front,in,holding,derby,and,standing,down,in,sitting,...,dancing,a,tank,the,stage,a,is,is,is,enjoying
10,with,court,the,a,.,shakes,up,playing,the,on,...,in,bench,top,game,while,race,sitting,<unk>,carrying,the


# Long Short Term Memory (LSTM)

<img src="https://www.researchgate.net/profile/Savvas-Varsamopoulos/publication/329362532/figure/fig5/AS:699592479870977@1543807253596/Structure-of-the-LSTM-cell-and-equations-that-describe-the-gates-of-an-LSTM-cell.jpg">

The diagram above illustrates the components within a single LSTM cell. Plain (vanilla) RNNs struggle to capture long-term dependencies because they are particularly susceptible to the vanishing gradient problem: as gradients are propagated back through many time steps they can become extremely small, so the updates tied to early time steps become negligible and the network fails to learn long-range relationships.

LSTM addresses this with gating units: the forget gate, the input gate, and the output gate. The forget gate decides how much of the previous cell state to keep, the input gate decides how much new candidate information to write into the cell state, and the output gate decides how much of the cell state to expose as the hidden state. Gated Recurrent Units (GRUs) are a related gated design with fewer gates.

Within the LSTM cell, we find a collection of small neural network layers with sigmoid and tanh activations, combined through vector concatenation, addition, and element-wise multiplication.

With these components, LSTM models capture long-term dependencies better than plain RNNs and mitigate the vanishing gradient problem.
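A single LSTM time step can be written out in plain Python to show how the gates interact. This is a minimal sketch using the standard LSTM equations, not the code PyTorch runs; lstm_step and the params layout (one weight matrix and bias per gate, acting on the concatenation of the previous hidden state and the input) are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W, b)."""
    z = h_prev + x  # list concatenation of previous hidden state and current input

    def gate(name, activation):
        W, b = params[name]
        return [activation(v) for v in affine(W, z, b)]

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # the new candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hidden size 1, input size 1. Biases chosen so the forget gate is almost fully
# open and the input gate almost fully closed: the cell state is carried over.
params = {
    "forget": ([[0.0, 0.0]], [20.0]),
    "input": ([[0.0, 0.0]], [-20.0]),
    "output": ([[0.0, 0.0]], [0.0]),
    "candidate": ([[0.0, 0.0]], [0.0]),
}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[0.7], params=params)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged. This additive path is what lets information, and gradients, survive across many time steps.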



# Encoder Architecture (Seq2Seq)

Before proceeding to the implementation of the seq2seq model, several components need to be created, including the Encoder and Decoder, along with establishing an interface between them within the seq2seq framework.

To illustrate the functionality of the model, let's consider the example of translating the German input sequence "Ich liebe tief lernen," which translates to "I love deep learning" in English.




<img src="https://cdn-images-1.medium.com/max/1200/1*aNcybCTdPlrXsCwIo1OfTg.png">

To provide a comprehensive explanation of the process depicted in the image:

The Encoder component of the Seq2Seq model takes one input at a time. In our example, the input German word sequence is "ich Liebe Tief Lernen."

To mark the boundaries of the input sentence, we add special tokens at its beginning and end, namely the "SOS" (start of sequence) token and the "EOS" (end of sequence) token.

At each time step, the Encoder processes the tokens sequentially. At time step 0, the "SOS" token is sent. At time step 1, the token "ich" is sent. This continues until all the tokens in the input sequence, including "Liebe," "Tief," and "Lernen," have been processed. Finally, at time step 5, the "EOS" token is sent.

The first block within the Encoder architecture is the word embedding layer, depicted as the green block in the image. This layer converts the indexed input words into dense vector representations known as word embeddings. The size of these embeddings is typically set to 100, 200, or 300.

The word embedding vectors are then passed to the LSTM (Long Short-Term Memory) cell. In the LSTM cell, the word embeddings are combined with the hidden state (hs) and the cell state (cs) from the previous time step. The Encoder block produces new hidden state (hs) and cell state (cs) values, which are then passed to the next LSTM cell in the sequence. Over time, the hidden state (hs) and cell state (cs) capture a vector representation of the sentence up to that point.



***Refer to the slides for more details***










## Encoder LSTM

In [None]:
class EncoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
    super(EncoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the encoder
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Regularization parameter
    self.dropout = nn.Dropout(p)
    self.tag = True

    # Embedding table shape --> (5374, 300) [input size (German vocab), embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)

    # LSTM arguments --> (300, 1024, 2) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

  # Shape of x (26, 32) [Sequence_length, batch_size]
  def forward(self, x):

    # Shape -----------> (26, 32, 300) [Sequence_length , batch_size , embedding dims]
    embedding = self.dropout(self.embedding(x))

    # Shape --> outputs (26, 32, 1024) [Sequence_length , batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size, hidden_size]
    outputs, (hidden_state, cell_state) = self.LSTM(embedding)

    return hidden_state, cell_state

input_size_encoder = len(german.vocab)
encoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
encoder_dropout = 0.5

encoder_lstm = EncoderLSTM(input_size_encoder, encoder_embedding_size,
                           hidden_size, num_layers, encoder_dropout).to(device)
print(encoder_lstm)

EncoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(5374, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
)
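The shape comments in the encoder, and the size of the printed model, can be checked with a little arithmetic. The sketch below assumes the usual LSTM parameterization (four gates per layer, each with input weights, recurrent weights, and two bias vectors); encoder_shapes and lstm_param_count are hypothetical helpers, and the sequence length of 26 is only the example value from the code comments.

```python
def encoder_shapes(seq_len, batch_size, embedding_size, hidden_size, num_layers):
    """Tensor shapes at each stage of the encoder's forward pass."""
    return {
        "x": (seq_len, batch_size),
        "embedding": (seq_len, batch_size, embedding_size),
        "outputs": (seq_len, batch_size, hidden_size),
        "hidden_state": (num_layers, batch_size, hidden_size),
        "cell_state": (num_layers, batch_size, hidden_size),
    }

def lstm_param_count(input_size, hidden_size, num_layers):
    """Count weights and biases of a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # 4 gates: input weights + recurrent weights + two bias vectors each
        total += 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
    return total

vocab_size, embedding_size, hidden_size, num_layers = 5374, 300, 1024, 2
shapes = encoder_shapes(26, 32, embedding_size, hidden_size, num_layers)
embedding_params = vocab_size * embedding_size
lstm_params = lstm_param_count(embedding_size, hidden_size, num_layers)

for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")
print("embedding parameters:", embedding_params)
print("LSTM parameters:", lstm_params)
```

Note that the hidden and cell states do not depend on the sequence length: however long the German sentence is, the decoder receives the same fixed-size context.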


# Decoder Architecture (Seq2Seq)

<img src="https://cdn-images-1.medium.com/max/800/1*FtDDCniBMb8HXYEM6PRohQ.png">

The decoder also operates one step at a time, similar to the encoder.

The Context Vector obtained from the Encoder block serves as the initial hidden state (hs) and cell state (cs) for the first LSTM block in the decoder.

The "SOS" token, indicating the start of the sentence, is passed through the embedding layer and then fed into the first LSTM cell of the decoder. The LSTM output passes through a linear layer (depicted in pink) that produces one score for each of the 4556 tokens in the English vocabulary. The hidden state (hs) and cell state (cs) are updated at the same time.

The token with the highest score is selected as the predicted word. This word, together with the updated hidden state (hs) and cell state (cs), becomes the input to the next decoding step. The process repeats until the model predicts the "EOS" token, indicating the end of the sentence.

At each subsequent time step, the decoder therefore relies on the states carried over from the previous step, so the English translation is generated progressively from the encoder's context and the decoder's own earlier predictions.

***Refer to the slides for more details***
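The decoding procedure described above is a greedy loop: feed the previous prediction back in, take the highest-scoring token, and stop at "EOS" or after a maximum length. The sketch below shows only that control flow. toy_step is a hypothetical stand-in for one decoder step, and the toy vocabulary with <sos> at index 2 and <eos> at index 3 is an assumption.

```python
VOCAB = ["<unk>", "<pad>", "<sos>", "<eos>", "i", "love", "deep", "learning"]
SOS, EOS = 2, 3

# Hypothetical stand-in for the trained decoder: each token deterministically
# "predicts" a fixed successor. The state just counts steps.
TRANSITIONS = {2: 4, 4: 5, 5: 6, 6: 7, 7: 3}

def toy_step(prev_token, state):
    scores = [0.0] * len(VOCAB)
    scores[TRANSITIONS[prev_token]] = 1.0
    return scores, state + 1

def greedy_decode(step_fn, state, sos, eos, max_length=50):
    """Feed each prediction back in until <eos> is produced or max_length is hit."""
    tokens = [sos]
    for _ in range(max_length):
        scores, state = step_fn(tokens[-1], state)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]  # drop the leading <sos>

decoded = greedy_decode(toy_step, state=0, sos=SOS, eos=EOS)
print([VOCAB[i] for i in decoded])
```

The max_length cap matters in practice: an undertrained model may never emit "EOS", and without the cap the loop would not terminate.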


## Teacher Forcing


<img src="https://cdn-images-1.medium.com/max/600/1*YJpyqouvpmu4_Ej9ockl4A.png">

During model training, both the input (German sequence) and the target (English sequence) are provided. Once the context vector is obtained from the Encoder, it is passed along with the target to the Decoder for translation.

During inference, however, no target is available. The words predicted by the Decoder are fed back as the input for generating subsequent words until the <EOS> token is predicted, indicating the completion of the translated sentence.

To control which words are fed to the Decoder during training, we use a teacher forcing ratio (TFR). At each time step, the Decoder receives either the actual target word (depicted in green) or its own predicted word (depicted in red). Here the choice is made with a probability of 50%. Feeding the actual target words helps training converge faster, while occasionally feeding predictions exposes the model to the conditions it will face at inference time.
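The per-step choice can be sketched without any neural network. In the illustration below the predictions are fixed in advance (in a real model they depend on the inputs chosen so far), choose_inputs is a hypothetical helper, and the token indices are made up for the example.

```python
import random

def choose_inputs(target, predicted, tfr, rng):
    """Return the token fed to the decoder at each step.

    The first input is always target[0] (<sos>). Afterwards the decoder gets
    the ground-truth token with probability tfr, else its own prediction.
    """
    inputs = [target[0]]
    for i in range(1, len(target) - 1):
        use_truth = rng.random() < tfr
        inputs.append(target[i] if use_truth else predicted[i])
    return inputs

target = [2, 4, 5, 6, 3]        # <sos> w1 w2 w3 <eos>
predicted = [None, 4, 9, 6, 3]  # predicted[i] is the model's guess for position i

always_truth = choose_inputs(target, predicted, tfr=1.0, rng=random.Random(0))
never_truth = choose_inputs(target, predicted, tfr=0.0, rng=random.Random(0))
mixed = choose_inputs(target, predicted, tfr=0.5, rng=random.Random(0))
print(always_truth, never_truth, mixed)
```

A ratio of 1.0 reproduces pure teacher forcing and 0.0 reproduces inference-style decoding. The value 0.5 used in this notebook mixes the two.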




## Decoder LSTM

In [None]:
class DecoderLSTM(nn.Module):
  def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
    super(DecoderLSTM, self).__init__()

    # Size of the one hot vectors that will be the input to the decoder (English vocab size)
    #self.input_size = input_size

    # Output size of the word embedding NN
    #self.embedding_size = embedding_size

    # Dimension of the NN's inside the lstm cell/ (hs,cs)'s dimension.
    self.hidden_size = hidden_size

    # Number of layers in the lstm
    self.num_layers = num_layers

    # Number of scores the decoder outputs per step (English vocab size)
    self.output_size = output_size

    # Regularization parameter
    self.dropout = nn.Dropout(p)

    # Embedding table shape --> (4556, 300) [input size (English vocab), embedding dims]
    self.embedding = nn.Embedding(input_size, embedding_size)

    # LSTM arguments --> (300, 1024, 2) [embedding dims, hidden size, num layers]
    self.LSTM = nn.LSTM(embedding_size, hidden_size, num_layers, dropout = p)

    # Linear layer --> (1024, 4556) [hidden size, output size]
    self.fc = nn.Linear(hidden_size, output_size)

  # Shape of x (32) [batch_size]
  def forward(self, x, hidden_state, cell_state):

    # Shape of x (1, 32) [1, batch_size]
    x = x.unsqueeze(0)

    # Shape -----------> (1, 32, 300) [1, batch_size, embedding dims]
    embedding = self.dropout(self.embedding(x))

    # Shape --> outputs (1, 32, 1024) [1, batch_size , hidden_size]
    # Shape --> (hs, cs) (2, 32, 1024) , (2, 32, 1024) [num_layers, batch_size, hidden_size] (passing encoder's hs, cs - context vectors)
    outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))

    # Shape --> predictions (1, 32, 4556) [ 1, batch_size , output_size]
    predictions = self.fc(outputs)

    # Shape --> predictions (32, 4556) [batch_size , output_size]
    predictions = predictions.squeeze(0)

    return predictions, hidden_state, cell_state

input_size_decoder = len(english.vocab)
decoder_embedding_size = 300
hidden_size = 1024
num_layers = 2
decoder_dropout = 0.5
output_size = len(english.vocab)

decoder_lstm = DecoderLSTM(input_size_decoder, decoder_embedding_size,
                           hidden_size, num_layers, decoder_dropout, output_size).to(device)
print(decoder_lstm)

DecoderLSTM(
  (dropout): Dropout(p=0.5, inplace=False)
  (embedding): Embedding(4556, 300)
  (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  (fc): Linear(in_features=1024, out_features=4556, bias=True)
)


# Combining Encoder & Decoder

<img src="https://cdn-images-1.medium.com/max/1200/1*d9kP4XoWGnIcmyhX-g4Xvw.png">

Here's a step-by-step overview of the process:

1. Provide both the input (German) and output (English) sentences as input to the model.

2. The input sequence is passed to the Encoder, which processes the sequence and extracts context vectors. These context vectors capture the essential information from the input sequence.

3. The output sequence, along with the context vectors obtained from the Encoder, is passed to the Decoder. The Decoder utilizes the context vectors and employs them in generating the predicted output sequence in English. This prediction process is done step-by-step, generating one word at a time until the entire output sequence is produced.

*The seq2seq model effectively captures the relationship between the input and output sequences, enabling the translation of the input sequence into the desired output sequence. This implementation showcases the power of seq2seq models in handling sequence-to-sequence tasks, such as machine translation.*

In [None]:
for batch in train_iterator:
  print(batch.src.shape)
  print(batch.trg.shape)
  break

x = batch.trg[1]
print(x)

torch.Size([10, 32])
torch.Size([14, 32])
tensor([ 4,  4,  4,  4,  4,  4, 19,  4, 16,  4,  4,  4,  4,  4,  4, 16,  4,  4,
         4,  4,  0, 16,  7, 19,  4,  4,  4,  7,  4,  4, 16,  4],
       device='cuda:0')


In [None]:
class Seq2Seq(nn.Module):
  def __init__(self, Encoder_LSTM, Decoder_LSTM):
    super(Seq2Seq, self).__init__()
    self.Encoder_LSTM = Encoder_LSTM
    self.Decoder_LSTM = Decoder_LSTM

  def forward(self, source, target, tfr=0.5):
    # Shape - Source : (10, 32) [(Sentence length German + some padding), Number of Sentences]
    batch_size = source.shape[1]

    # Shape - Target : (14, 32) [(Sentence length English + some padding), Number of Sentences]
    target_len = target.shape[0]
    target_vocab_size = len(english.vocab)

    # Shape --> outputs (14, 32, 4556) [target length, batch size, English vocab size]
    outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)

    # Shape --> (hs, cs) (2, 32, 1024) ,(2, 32, 1024) [num_layers, batch_size, hidden_size] (contains encoder's hs, cs - context vectors)
    hidden_state, cell_state = self.Encoder_LSTM(source)

    # Shape of x (32 elements)
    x = target[0] # Trigger token <SOS>

    for i in range(1, target_len):
      # Shape --> output (32, 4556)
      output, hidden_state, cell_state = self.Decoder_LSTM(x, hidden_state, cell_state)
      outputs[i] = output
      best_guess = output.argmax(1) # 0th dimension is batch size, 1st dimension is the vocabulary scores
      x = target[i] if random.random() < tfr else best_guess # With probability tfr feed the ground-truth token, otherwise the model's own prediction

    # Shape --> outputs (14, 32, 4556)
    return outputs


In [None]:
# Hyperparameters

learning_rate = 0.001
writer = SummaryWriter(f"runs/loss_plot")
step = 0

model = Seq2Seq(encoder_lstm, decoder_lstm).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

pad_idx = english.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
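The ignore_index argument keeps padded positions from contributing to the loss. A minimal pure-Python sketch of that behaviour follows. It illustrates the idea rather than PyTorch's implementation; cross_entropy is a hypothetical helper, and the logits are made-up numbers.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]
        count += 1
    return total / count if count else float("nan")

PAD_IDX = 1
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, 2, PAD_IDX]  # the last position is padding

loss = cross_entropy(logits, targets, ignore_index=PAD_IDX)
# Changing the scores at the padded position leaves the loss unchanged
loss_other = cross_entropy(logits[:2] + [[9.0, -3.0, 1.0]], targets, ignore_index=PAD_IDX)
print(loss, loss_other)
```

Without this masking, short sentences in a batch would be rewarded or penalized for what the model predicts after their "EOS" token, which carries no information.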

In [None]:
model

Seq2Seq(
  (Encoder_LSTM): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(5374, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
  )
  (Decoder_LSTM): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(4556, 300)
    (LSTM): LSTM(300, 1024, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=1024, out_features=4556, bias=True)
  )
)

In [None]:
def translate_sentence(model, sentence, german, english, device, max_length=50):
    # Reuse the spaCy pipeline loaded earlier instead of reloading it on every call
    if isinstance(sentence, str):
        tokens = [token.text.lower() for token in spacy_german(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    tokens.insert(0, german.init_token)
    tokens.append(german.eos_token)
    text_to_indices = [german.vocab.stoi[token] for token in tokens]
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    # Build encoder hidden, cell state
    with torch.no_grad():
        hidden, cell = model.Encoder_LSTM(sentence_tensor)

    outputs = [english.vocab.stoi["<sos>"]]

    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.Decoder_LSTM(previous_word, hidden, cell)
            best_guess = output.argmax(1).item()

        outputs.append(best_guess)

        # Model predicts it's the end of the sentence
        if best_guess == english.vocab.stoi["<eos>"]:
            break

    translated_sentence = [english.vocab.itos[idx] for idx in outputs]
    return translated_sentence[1:]

def bleu(data, model, german, english, device):
    targets = []
    outputs = []

    for example in data:
        src = vars(example)["src"]
        trg = vars(example)["trg"]

        prediction = translate_sentence(model, src, german, english, device)
        prediction = prediction[:-1]  # remove <eos> token

        targets.append([trg])
        outputs.append(prediction)

    return bleu_score(outputs, targets)

def checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss):
    print('saving')
    print()
    state = {'model': model,'best_loss': best_loss,'epoch': epoch,'rng_state': torch.get_rng_state(), 'optimizer': optimizer.state_dict(),}
    torch.save(state, '/content/checkpoint-NMT')
    torch.save(model.state_dict(),'/content/checkpoint-NMT-SD')
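BLEU is built on clipped (modified) n-gram precision: each n-gram in the prediction counts only as many times as it appears in a reference. The sketch below computes just that ingredient. The full score that bleu_score returns also combines several n-gram orders and applies a brevity penalty, which are omitted here. clipped_precision is a hypothetical helper; the example sentences are taken from the training data and the training log in this notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    matched = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

reference = "a man in a blue shirt is standing on a ladder cleaning a window .".split()
candidate = "a man in a blue shirt is standing on a wall .".split()

p1 = clipped_precision(candidate, [reference], n=1)
p2 = clipped_precision(candidate, [reference], n=2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")

# Clipping stops a degenerate output from scoring well by repeating a common word
repeated = clipped_precision(["the"] * 7, ["the cat is on the mat".split()], n=1)
print(f"repeated-word precision: {repeated:.3f}")
```

Higher-order n-grams penalize wrong word order and wrong content more sharply, which is why the bigram precision here is lower than the unigram precision.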

# Model Training

In [None]:
num_epochs = 10 #TODO: Change this if required !
best_loss = float("inf")
best_epoch = -1
sentence1 = "ein mann in einem blauen hemd steht auf einer leiter und putzt ein fenster"
ts1  = []

for epoch in range(num_epochs):
  print("Epoch - {} / {}".format(epoch+1, num_epochs))
  model.eval()
  translated_sentence1 = translate_sentence(model, sentence1, german, english, device, max_length=50)
  print(f"Translated example sentence 1: \n {translated_sentence1}")
  ts1.append(translated_sentence1)

  model.train(True)
  # Reset the running loss so that epochs can be compared with each other
  epoch_loss = 0.0
  for batch_idx, batch in enumerate(train_iterator):
    source = batch.src.to(device)
    target = batch.trg.to(device)

    # Pass the source and target to the model's forward method
    output = model(source, target)
    # Drop the <sos> position and flatten to (N, vocab) and (N,) for the loss
    output = output[1:].reshape(-1, output.shape[2])
    target = target[1:].reshape(-1)

    # Clear the accumulating gradients
    optimizer.zero_grad()

    # Calculate the loss value for this batch
    loss = criterion(output, target)

    # Calculate the gradients for weights & biases using back-propagation
    loss.backward()

    # Rescale the gradients if their overall norm exceeds 1
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

    # Update the weights values using the gradients we calculated using bp
    optimizer.step()
    step += 1
    epoch_loss += loss.item()
    writer.add_scalar("Training loss", loss.item(), global_step=step)

  if epoch_loss < best_loss:
    best_loss = epoch_loss
    best_epoch = epoch
    checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss)
  elif (epoch - best_epoch) >= 5:
    # Early stopping: checked only when the current epoch did not improve
    print("no improvement in 5 epochs, break")
    break
  # Note: this prints the loss of the last batch, not the epoch average
  print("Epoch_Loss - {}".format(loss.item()))
  print()

print(epoch_loss / len(train_iterator))

score = bleu(test_data[1:100], model, german, english, device)
print(f"Bleu score {score*100:.2f}")

Epoch - 1 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'on', 'a', 'a', 'a', 'a', '.', '<eos>']
saving

Epoch_Loss - 2.600125312805176

Epoch - 2 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'on', 'a', 'wall', '.', '<eos>']
Epoch_Loss - 2.6908657550811768

Epoch - 3 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'a', 'a', 'of', 'a', '.', '<eos>']
Epoch_Loss - 3.12144136428833

Epoch - 4 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'on', 'a', 'ladder', 'a', 'a', '.', '<eos>']
Epoch_Loss - 1.6821285486221313

Epoch - 5 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'shirt', 'is', 'standing', 'on', 'a', 'ladder', 'painting', 'a', 'window', '.', '<eos>']
Epoch_Loss - 1.9400554895401

Epoch - 6 / 10
Translated example sentence 1: 
 ['a', 'man', 'in', 'a', 'blue', 'sh
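The training loop clips gradients by their overall norm before each optimizer step. The sketch below shows the arithmetic in plain Python: compute one norm over all gradients, then scale everything down by the same factor if that norm exceeds the limit. clip_by_global_norm is a hypothetical helper, and the small eps in the denominator is an assumption made to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

large = [[3.0], [4.0]]  # two "parameters", joint norm 5
small = [[0.3], [0.4]]  # joint norm 0.5, already within the limit

clipped_large, norm_large = clip_by_global_norm(large, max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm(small, max_norm=1.0)
print(norm_large, clipped_large)
print(norm_small, clipped_small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited. This guards recurrent networks against occasional exploding gradients.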

In [None]:
#%load_ext tensorboard (OPTIONAL)
# %tensorboard --logdir runs/

UsageError: Line magic function `%tensorboard` not found.


# Model Inference

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

model.eval()
test_sentences  = ["Zwei Männer gehen die Straße entlang", "Kinder spielen im Park.", "Diese Stadt verdient eine bessere Klasse von Verbrechern. Der Spaßvogel"]
actual_sentences  = ["Two men are walking down the street", "Children play in the park", "This city deserves a better class of criminals. The joker"]
pred_sentences = []

for idx, i in enumerate(test_sentences):
  translated_sentence = translate_sentence(model, i, german, english, device, max_length=50)
  # Join the predicted tokens back into a readable sentence
  pred_sentences.append(TreebankWordDetokenizer().detokenize(translated_sentence))
  print("German : {}".format(i))
  print("Actual Sentence in English : {}".format(actual_sentences[idx]))
  print("Predicted Sentence in English : {}".format(pred_sentences[-1]))
  print()


German : Zwei Männer gehen die Straße entlang
Actual Sentence in English : Two men are walking down the street
Predicted Sentence in English : two men are walking down the street . <eos>

German : Kinder spielen im Park.
Actual Sentence in English : Children play in the park
Predicted Sentence in English : children are playing in the park . <eos>

German : Diese Stadt verdient eine bessere Klasse von Verbrechern. Der Spaßvogel
Actual Sentence in English : This city deserves a better class of criminals. The joker
Predicted Sentence in English : this female is a <unk> <unk> <unk> <unk> <unk> <unk>. <eos>



# Assignment (optional)

Q. How does the performance of the network change if we use a *GRU cell* instead of an LSTM cell? Modify the network and report the results.