# Assignment 2

Student Name: **Jash Prakash Rana**

Student ID: **22222806**

In [123]:
import numpy as np
import pandas as pd
import nltk, re, string, warnings, random
from nltk.tokenize import word_tokenize
from typing import List
from tensorflow.keras.utils import pad_sequences
import torch 
import torch.nn as nn 
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

warnings.filterwarnings('ignore')

np.random.seed(2023)

# Overview
**Assignment 2** focuses on the training on a Neural Machine Translation (NMT) system for English-Irish translation where English is the source language and Irish is the target language. 

**Grading Policy** 
Assignment 2 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below.  Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. Feel free to reach out to the TAs and instructors if you have any questions.

## Task 1 - Data Collection and Preprocessing (10 points)
## Task 1a. Data Loading (5 pts)
Dataset: https://www.dropbox.com/s/zkgclwc9hrx7y93/DGT-en-ga.txt.zip?dl=0 
*  Download a English-Irish dataset and decompress it. The `DGT.en-ga.en` file contains a list english sentences and `DGT.en-ga.ga` contains the paralell Irish sentences. Read both files into the Jupyter environment and load them into a pandas dataframe. 
* Randomly sample 12,000 rows.
* Split the sampled data into train (10k), development (1k) and test set (1k)

In [124]:
# Your Code Here
'''
To load the model faster, I have 
taken 12000 random samples from both
the dataset and made a new csv file
to load the data faster.
'''
# df1 = pd.read_csv("./data/DGT.en-ga.en", sep = '.', header = None, error_bad_lines = False)
# df2 = pd.read_csv("./data/DGT.en-ga.ga", sep = '.', header = None, error_bad_lines = False)

# df1.to_csv('./data/dataframe1.csv')
# df2.to_csv('./data/dataframe2.csv')

In [126]:
df1 = pd.read_csv('./data/dataframe1.csv')
df2 = pd.read_csv('./data/dataframe2.csv')

df1 = df1.drop('Unnamed: 0', axis = 1)
df2 = df2.drop('Unnamed: 0', axis = 1)

df1 = df1.rename(columns = {"0": "EngText"})
df2 = df2.rename(columns = {"0": "IrText"})

In [127]:
df1 = df1[df1.index.isin(df2.index)]

df1_sp = df1.sample(n = 12000, random_state = 2023)
df2_sp = df2[df2.index.isin(df1_sp.index)]

df1_sp = df1_sp.sort_index()
df2_sp = df2_sp.sort_index()

assert df1_sp.shape[0] == df2_sp.shape[0]

In [128]:
df1_train = df1_sp.sample(n = 10000, random_state = 2023)
df1_sp = df1_sp[df1_sp.index.isin(df1_train.index) == False]
df1_dev = df1_sp.sample(n = 1000, random_state = 2023)
df1_sp = df1_sp[df1_sp.index.isin(df1_dev.index) == False]
df1_test = df1_sp

df1_train = df1_train.sort_index()
df1_dev = df1_dev.sort_index()
df1_test = df1_test.sort_index()

In [129]:
df2_train = df2_sp[df2_sp.index.isin(df1_train.index)]
df2_dev = df2_sp[df2_sp.index.isin(df1_dev.index)]
df2_test = df2_sp[df2_sp.index.isin(df1_test.index)]

df2_train = df2_train.sort_index()
df2_dev = df2_dev.sort_index()
df2_test = df2_test.sort_index()

In [130]:
assert df1_train.shape[0] == df2_train.shape[0]
assert df1_dev.shape[0] == df2_dev.shape[0]
assert df1_test.shape[0] == df2_test.shape[0]

In [131]:
df_train = pd.concat([df1_train, df2_train], axis=1).reset_index()
df_test = pd.concat([df1_test, df2_test], axis=1).reset_index()
df_dev = pd.concat([df1_dev, df2_dev], axis=1).reset_index()

df_train = df_train.drop("index", axis = 1)
df_test = df_test.drop("index", axis = 1)
df_dev = df_dev.drop("index", axis = 1)
df_test1 = df_test

In [132]:
df_train.head()

Unnamed: 0,EngText,IrText
0,"in Scotland, the Court of Session, or in the c...","in Albain, an Court of Session nó, i gcás brei..."
1,HAVE in this spirit DECIDED to conclude this C...,TAR ÉIS COMHAONTÚ MAR A LEANAS:
2,TITLE I,RAON FEIDHME
3,TITLE II,DLÍNSE
4,SECTION 1,Forálacha Ginearálta


## Task 1b. Preprocessing (5 pts)
* Add '<bof\>' to denote beginning of sentence and '<eos\>' to denote the end of the sentence to each target line.
* Perform the following pre-processing steps:
  * Lowercase the text
  * Remove all punctuation
  * tokenize the text 
*  Build seperate vocabularies for each language. 
  * Assign each unique word an id value 
*Print statistics on the selected dataset:
  * Number of samples
  * Number of unique source language tokens
  * Number of unique target language tokens
  * Max sequence length of source language
  * Max sequence length of target language

In [133]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"

'''
The Language class 'Lang' is used for
preprocessing, encoding sentences,
decoding IDs and storing words as a
unique ID and counting no. of words for
each language
'''
class Lang:
    def __init__(self, language: str):
        self.language = language
        self.word2index = {"PAD": 0, "<bof>": 1, "<eof>": 2}
        self.index2word = {0: "PAD", 1: "<bof>", 2: "<eof>"}
        self.word2count = {"<bof>": 0, "<eof>": 0}
        self.n_words = len(self.index2word)
        self.longestSent = ""
        self.longestIndex = 0
    
    def longest_sentence(self, sent: str,index):
        if len(sent) > len(self.longestSent):
            self.longestSent = sent
            self.index = index

    def preprocessing(self, data,index):
        data = data.lower()
        data = re.sub(r'[^\w\s]', '', data).strip()
        # data = data.translate(string.punctuation)
        data = data.replace("\n", " ")
        self.longest_sentence(data,index)
        data = word_tokenize(data)
        data = ["<bof>"] + data + ["<eof>"]
        for word in data:
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
            else:
                self.word2count[word] += 1
    
    def encodeSent(self, sent: str) -> List[int]:
        sent = sent.lower()
        sent = re.sub(r'[^\w\s]', '', sent).strip()
        sent = sent.replace("\n", " ")
        sent = word_tokenize(sent)
        sent = ["<bof>"] + sent + ["<eof>"]
        return [self.word2index[word] for word in sent if word in self.word2index]


    def decodeIds(self, ids: list) -> List[str]:
        return " ".join([self.index2word[tok] for tok in ids])

In [134]:
from tqdm.notebook import tqdm

english = Lang("english")
irish = Lang("irish")

for index, row in tqdm(df_train.iterrows(), total=len(df_train)):
  english.preprocessing(row["EngText"],index)
  irish.preprocessing(row["IrText"],index)


  0%|          | 0/10000 [00:00<?, ?it/s]

In [135]:
print(f"Number of samples per language: {len(df_train)}")

Number of samples per language: 10000


In [136]:
print(f"Size of English vocab: {english.n_words}")
print(f"Size of Irish vocab: {irish.n_words}")

Size of English vocab: 9039
Size of Irish vocab: 12173


In [137]:
print(f"Longest sentence in English: {len(english.longestSent)}")
print(f"Longest sentence in Irish: {len(irish.longestSent)}")

Longest sentence in English: 777
Longest sentence in Irish: 1415


In [138]:
print(english.longestSent)

article 1 this regulation lays down rules concerning the applicability of articles 101 to 106 and of article 1081 and 3 of the treaty on the functioning of the european union tfeu in relation to production of or trade in the products listed in annex i to the tfeu with the exception of the products covered by council regulation ec no 12342007 and regulation eu no 13792013 of the european parliament and of the councilarticle 45 amendments to regulation ec no 12242009 regulation ec no 12242009 is hereby amended as follows in article 571 the following sentences are added article 585 is amended as follows point g is replaced by the following g the information to consumers provided for in article 35 of regulation eu no 13792013 of the european parliament and of the council


In [139]:
print(irish.longestSent)

airteagal 24b déanfaidh an coimisiún faisnéis a áireamh maidir le cur chun feidhme an rialacháin seo ina thuarascáil bhliantúil maidir le bearta cosanta trádála a chur i bhfeidhm agus a chur chun feidhme chuig parlaimint na heorpa agus chuig an gcomhairle de bhun airteagal 22a de rialachán ce uimh 12252009 ón gcomhairle15 rialachán ce uimh 1402008 maidir le rialachán ce uimh 1402008 is gá coinníollacha aonfhoirmeacha chun bearta cosanta agus bearta eile a ghlacadh i ndáil le clásail dhéthaobhacha chosanta an chomhaontaithe eatramhaigh agus an chomhaontaithe um chobhsaíocht agus um chomhlachas a chur chun feidhme ba cheart don choimisiún gníomhartha cur chun feidhme atá infheidhme láithreach a ghlacadh más rud é i gcásanna a bhfuil údar cuí leo a bhaineann le himthosca eisceachtúla agus géibheannacha a thagann chun cinn de réir bhrí airteagal 265b agus airteagal 274 den chomhaontú eatramhach agus dá éis airteagal 415b agus airteagal 424 den chomhaontú cobhsaíochta agus comhlachais go né

## Task 2. Model Implementation and Training (30 pts)



## Task 2a. Encoder-Decoder Model Implementation (10 pts)
Implement an Encoder-Decoder model in Pytorch with the following components
* A single layer RNN based encoder. 
* A single layer RNN based decoder
* A Encoder-Decoder model based on the above components that support sequence-to-sequence modelling. For the encoder/decoder you can use RNN, LSTMs or GRU. Use a hidden dimension of 256 or less depending on your compute constraints. 

In [140]:
# Your Code Here
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"

'''
This function is used to take sentences
from both the dataset and converting it
into embeddings to load it into the torch
model and padding sequences into a list,
and then converting source and target
embeddings to train, development and test
sets.
'''
def encode_features(
    df: pd.DataFrame, 
    english: Lang,
    irish: Lang,
    pad_token: int = 0,
    max_seq_length = 10
  ):

  source = []
  target = []

  for _, row in df.iterrows():
    source.append(english.encodeSent(row["EngText"]))
    target.append(irish.encodeSent(row["IrText"]))

  source = pad_sequences(
      source,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )
  
  target = pad_sequences(
      target,
      maxlen=max_seq_length,
      padding="post",
      truncating = "post",
      value=pad_token
    )
  
  return source, target

train_source, train_target = encode_features(df_train, english, irish)
dev_source, dev_target = encode_features(df_dev, english, irish)
test_source, test_target = encode_features(df_test1, english, irish)

print(f"Shapes of train source {train_source.shape}, and target {train_target.shape}")

Shapes of train source (10000, 10), and target (10000, 10)


In [141]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"
'''
Converting embeddings done above into
tensor data so that it can be fed to 
training the model, with keeping batch
size as 32 for each set.
'''
train_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(train_source),
        torch.LongTensor(train_target)
    ),
    shuffle = True,
    batch_size = 32
)

dev_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(dev_source),
        torch.LongTensor(dev_target)
    ),
    shuffle = False,
    batch_size = 32
)

test_dl = DataLoader(
    TensorDataset(
        torch.LongTensor(test_source),
        torch.LongTensor(test_target)
    ),
    shuffle = False,
    batch_size = 32
)

In [142]:
#Ref: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
'''
Creating an encoder where RNN is used to convert
embeddings and creating an encoding output, also
using dropout to drop 50% data randomly.
'''
class Encoder(nn.Module):
    def __init__(self, 
                 input_vocab_size, 
                 encoder_hid_dim, 
                 hid_dim, 
                 n_layers, 
                 dropout_prob = .5):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_vocab_size, encoder_hid_dim)
        
        self.rnn = nn.LSTM(encoder_hid_dim, hid_dim, n_layers, dropout = dropout_prob)
        
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [143]:
#Ref: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
'''
Creating an decoder where LSTM is used to convert
encoder output to find IDs for each embeddings so
that it converts the IDs to real words, and also
using dropout to drop 50% data randomly.
'''
class Decoder(nn.Module):
    def __init__(self, 
                 target_vocab_size, 
                 dec_hid_dim, 
                 hid_dim, 
                 n_layers, 
                 dropout_prob = .5):
        super().__init__()
        
        self.output_dim = target_vocab_size
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(target_vocab_size, dec_hid_dim)
        
        self.rnn = nn.LSTM(dec_hid_dim, hid_dim, n_layers, dropout = dropout_prob)
        
        self.fc_out = nn.Linear(hid_dim, target_vocab_size)
        
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will both always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #context = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [144]:
#Ref: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
'''
Merging Encoder and Decoder into new model where
forward pass to grab encoder outputs and decoding
the values found while translating using decoder.
'''
class EncoderDecoderLSTM(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
        
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs

In [145]:
INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model_lstm = EncoderDecoderLSTM(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model_lstm.apply(init_weights)

EncoderDecoderLSTM(
  (encoder): Encoder(
    (embedding): Embedding(9039, 256)
    (rnn): LSTM(256, 128, num_layers=128, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(12173, 256)
    (rnn): LSTM(256, 128, num_layers=128, dropout=0.5)
    (fc_out): Linear(in_features=128, out_features=12173, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

## Task 2b. Training (10 pts)
Implement the code to train the Encoder-Decoder model on the Irish-English data. You will write code for the following:
* Training, validation and test dataloaders 
* A training loop which trains the model for 5 epoch. Evaluate the loop at the end of each Epoch. Print out the train perplexity and validation perplexity after each epoch.

In [146]:
optimizer = torch.optim.Adam(model_lstm.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_lstm.to(device)

EPOCHS = 5
best_val_loss = float('inf')

In [147]:
#Ref: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

for epoch in range(EPOCHS):

  model_lstm.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model_lstm(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model_lstm.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model_lstm.eval()
  for batch in tqdm(dev_dl, total=len(dev_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model_lstm(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  dev_loss = round(eval_loss / len(dev_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(dev_loss)}")


  if dev_loss < best_val_loss:
    best_val_loss = dev_loss
    torch.save(model_lstm.state_dict(), 'best-model-lstm.pt')  

  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 0 | train loss 5.748 | train ppl 313.56290692773194 | val ppl 150.05471586255422


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 1 | train loss 5.144 | train ppl 171.39999894354537 | val ppl 139.770249560003


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 2 | train loss 5.051 | train ppl 156.1785649881239 | val ppl 138.3795123399606


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 3 | train loss 4.999 | train ppl 148.26482012532418 | val ppl 138.2412019943194


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 4 | train loss 4.963 | train ppl 143.02221959891523 | val ppl 139.49098841511483


In [148]:
model_lstm.load_state_dict(torch.load("best-model-lstm.pt"))

<All keys matched successfully>

# Task 2c. Evaluation on the Test Set (10 pts)
Use the trained model to translate the text from the source language into the target language on the test set. Evaluate the performance of the model on the test set using the BLEU metric and print out the average the BLEU score.

In [172]:
'''
This function takes the english sentence 
and returns a translated output back to us 
which was done via the neural machine 
translator
'''

def translate_sentence(
    text: str, 
    model: EncoderDecoderLSTM, 
    english: Lang,
    irish: Lang,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSent(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs, hidden = model.encoder(input_tensor)

  # Build target holder list
  trg_indexes = [irish.word2index["<bof>"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden, cell = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == irish.word2index["<eof>"]:
      break

  return "".join(irish.decodeIds(trg_indexes))

In [173]:
'''
This three functions work to translate
english sentences each row, does the 
preprocessing on the passed sentence
and returns the output to the bleu score
function which then calculates bleu score
for each row in the test dataframe.
'''

def auto_translate_lstm(sentence: str):
    output =  translate_sentence(sentence, model_lstm, english, irish, device)
    return output

def blue_score(reference_sentence, condidate_sentence):
  return len([word for word in condidate_sentence if word in reference_sentence])/len(reference_sentence)

def test_preprocessing(data: str):
        data = data.lower()
        data = re.sub(r'[^\w\s]', '', data).strip()
        data = data.replace("\n", " ")
        data = "<bof>" + data + "<eof>"
        return data

In [174]:
for index, row in tqdm(df_test.iterrows(), total=len(df_test)):
    df_test.loc[index, "TranslatedText"] = auto_translate_lstm(row["EngText"])
    df_test.loc[index, "BLEU Score"] = blue_score(test_preprocessing(row["IrText"]), auto_translate_lstm(row["EngText"]))

  0%|          | 0/1000 [00:00<?, ?it/s]

In [175]:
df_test.head()

Unnamed: 0,EngText,IrText,TranslatedText,BLEU Score
0,Procès-verbal of rectification to the Conventi...,Miontuairisc cheartaitheach maidir le Coinbhin...,<bof> PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD,0.073529
1,(Official Journal of the European Union L 147 ...,(Iris Oifigiúil an Aontais Eorpaigh L 147 an 1...,<bof> PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD,0.11194
2,"in Switzerland the higher cantonal court,’;","san Eilvéis ardchúirt an chantúin,”;",<bof> an PAD PAD PAD PAD PAD PAD PAD PAD PAD,0.395349
3,PREAMBLE,BROLLACH,<bof> PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD,0.277778
4,"However, in proceedings which have as their ob...",a bheith i scríbhinn nó arna fhianú i scríbhin...,<bof> PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD,0.25


In [176]:
mean = df_test["BLEU Score"].mean()
print(f"Average BLEU score of `LSTM` Model: {mean}")

Average BLEU score of `LSTM` Model: 0.3142213818144513


## Task 3. Improving NMT using Attention (10 pts) 
Extend the Encoder-Decoder model from Task 2 with the attention mechanism. Retrain the model and evaluate on test set. Print the updated average BLEU score on the test set. In a few sentences explains which model is the best for translation. 

In [177]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"

class EncoderGRU(nn.Module):
    def __init__(
        self, 
        input_vocab_size,  # size of source vocabulary  
        hidden_dim,        # hidden dimension of embeddings
        encoder_hid_dim,   # gru hidden dim
        decoder_hid_dim,   # decoder hidden dim 
        dropout_prob = .5
      ):
      
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, encoder_hid_dim, bidirectional = True)
        self.fc = nn.Linear(encoder_hid_dim * 2, decoder_hid_dim)
        self.dropout = nn.Dropout(dropout_prob)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards GRU
        #hidden [-1, :, : ] is the last of the backwards GRU
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))        
        return outputs, hidden

In [178]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"
'''
The attention layer is responsible to
generate a number for each word in the
training data so that the importance of
each word can be generated and used to 
output the next word in the seequence,
and also makes the encoder-decoder model
work faster than the previous.
'''
class Attention(nn.Module):
    def __init__(
        self, 
        enc_hid_dim,      # Encoder hidden dimension
        dec_hid_dim       # Decoder hidden dimension 
      ):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        
        #attention output: [batch size, src len]
        return F.softmax(attention, dim=1)

In [179]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"

class DecoderGRU(nn.Module):
    def __init__(
        self, 
        target_vocab_size,    # Size of target vocab 
        hidden_dim,           # hidden size of embedding  
        enc_hid_dim, 
        dec_hid_dim, 
        dropout
      ):
        super().__init__()

        self.output_dim = target_vocab_size
        self.attention = Attention(enc_hid_dim, dec_hid_dim)
        
        self.embedding = nn.Embedding(target_vocab_size, hidden_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + hidden_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear(
            (enc_hid_dim * 2) + dec_hid_dim + hidden_dim, 
            target_vocab_size
          )
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)  # [1, batch size]
        
        embedded = self.dropout(self.embedding(input))  # [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)     # [batch size, src len]
        a = a.unsqueeze(1)                              # [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)           # [batch size, 1, enc hid dim * 2]
        weighted = weighted.permute(1, 0, 2)               # [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2) # [1, batch size, (enc hid dim * 2) + emb dim]

        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # [batch size, output dim]
        return prediction, hidden.squeeze(0)

In [180]:
#Ref: Dr Paul Buitelaar/Dr Omnia Zayed - Lab 08 "Neural NMT"

class EncoderDecoderAtt(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time     
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):     
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

In [181]:
INPUT_DIM = english.n_words
OUTPUT_DIM = irish.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 128
DEC_HID_DIM = 128
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT)

model_att = EncoderDecoderAtt(enc, dec)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model_att.apply(init_weights)

EncoderDecoderAtt(
  (encoder): EncoderGRU(
    (embedding): Embedding(9039, 256)
    (rnn): GRU(256, 128, bidirectional=True)
    (fc): Linear(in_features=256, out_features=128, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): DecoderGRU(
    (attention): Attention(
      (attn): Linear(in_features=384, out_features=128, bias=True)
      (v): Linear(in_features=128, out_features=1, bias=False)
    )
    (embedding): Embedding(12173, 256)
    (rnn): GRU(512, 128)
    (fc_out): Linear(in_features=640, out_features=12173, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

## Task 3a. Training

In [182]:
# Your Code Here

optimizer = torch.optim.Adam(model_att.parameters())

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_att.to(device)

EPOCHS = 5
best_val_loss = float('inf')

In [183]:
'''
As we have saved the weights below, we wont
train the model again and so commenting out
the code and using load_state.dict(torch.load())
below on the model class to call back the best
weights saved to save time
'''

for epoch in range(EPOCHS):

  model_att.train()
  epoch_loss = 0
  for batch in tqdm(train_dl, total=len(train_dl)):

     src = batch[0].transpose(1, 0).to(device)
     trg = batch[1].transpose(1, 0).to(device)

     optimizer.zero_grad()

     output = model_att(src, trg)

     output_dim = output.shape[-1]
     output = output[1:].view(-1, output_dim).to(device)
     trg = trg[1:].reshape(-1)
     
     loss = F.cross_entropy(output, trg)
     loss.backward()

     torch.nn.utils.clip_grad_norm_(model_att.parameters(), 1)
     optimizer.step()
     epoch_loss += loss.item()

  train_loss = round(epoch_loss / len(train_dl), 3)
  
  eval_loss = 0
  model_att.eval()
  for batch in tqdm(dev_dl, total=len(dev_dl)):
    src = batch[0].transpose(1, 0).to(device)
    trg = batch[1].transpose(1, 0).to(device)

    with torch.no_grad():
      output = model_att(src, trg)
      
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim).to(device)
      trg = trg[1:].reshape(-1)
      
      loss = F.cross_entropy(output, trg)
      
      eval_loss += loss.item()
  
  dev_loss = round(eval_loss / len(dev_dl), 3)
  print(f"Epoch {epoch} | train loss {train_loss} | train ppl {np.exp(train_loss)} | val ppl {np.exp(dev_loss)}")


  if dev_loss < best_val_loss:
    best_val_loss = dev_loss
    torch.save(model_att.state_dict(), 'best-model-att.pt')  

  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 0 | train loss 5.44 | train ppl 230.44218346064218 | val ppl 104.37602463655413


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 1 | train loss 4.672 | train ppl 106.91135145513537 | val ppl 90.64946179433973


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 2 | train loss 4.463 | train ppl 86.74736120689518 | val ppl 75.1134772455169


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 3 | train loss 4.322 | train ppl 75.33915602616537 | val ppl 73.4055833378014


  0%|          | 0/313 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

Epoch 4 | train loss 4.151 | train ppl 63.49746602599656 | val ppl 71.37873529593749


In [184]:
model_att.load_state_dict(torch.load("best-model-att.pt"))

<All keys matched successfully>

# Task 3b. Evaluation on the Test Set

In [185]:
# Your code here
'''
This function takes the english sentence 
and returns a translated output back to us 
which was done via the neural machine 
translator
'''

def translate_sentence(
    text: str, 
    model: EncoderDecoderAtt, 
    english: Lang,
    irish: Lang,
    device: str,
    max_len: int = 10,
  ) -> str:

  # Encode english sentence and convert to tensor
  input_ids = english.encodeSent(text)
  input_tensor = torch.LongTensor(input_ids).unsqueeze(1).to(device)

  # Get encooder hidden states
  with torch.no_grad():
    encoder_outputs, hidden = model.encoder(input_tensor)

  # Build target holder list
  trg_indexes = [irish.word2index["<bof>"]]

  # Loop over sequence length of target sentence
  for i in range(max_len):
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
    
    # Decode the encoder outputs with respect to current target word
    with torch.no_grad():
      output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
    
    # Retrieve most likely word over target distribution
    pred_token = torch.argmax(output).item()
    trg_indexes.append(pred_token)

    if pred_token == irish.word2index["<eof>"]:
      break

  return "".join(irish.decodeIds(trg_indexes))

In [186]:
'''
This three functions work to translate
english sentences each row, does the 
preprocessing (above) on the passed sentence
and returns the output to the bleu score
function (above) which then calculates bleu score
for each row in the test dataframe.
'''
def auto_translate_att(sentence: str):
    output =  translate_sentence(sentence, model_att, english, irish, device)
    return output

In [187]:
for index, row in tqdm(df_test1.iterrows(), total=len(df_test1)):
    df_test1.loc[index, "TranslatedText"] = auto_translate_att(row["EngText"])
    df_test1.loc[index, "BLEU Score"] = blue_score(test_preprocessing(row["IrText"]), auto_translate_att(row["EngText"]))

  0%|          | 0/1000 [00:00<?, ?it/s]

In [188]:
df_test1.head()

Unnamed: 0,EngText,IrText,TranslatedText,BLEU Score
0,Procès-verbal of rectification to the Conventi...,Miontuairisc cheartaitheach maidir le Coinbhin...,<bof> airteagal <eof>,0.102941
1,(Official Journal of the European Union L 147 ...,(Iris Oifigiúil an Aontais Eorpaigh L 147 an 1...,<bof> airteagal <eof>,0.156716
2,"in Switzerland the higher cantonal court,’;","san Eilvéis ardchúirt an chantúin,”;",<bof> airteagal <eof>,0.465116
3,PREAMBLE,BROLLACH,<bof> ciallaíonn an <eof>,0.944444
4,"However, in proceedings which have as their ob...",a bheith i scríbhinn nó arna fhianú i scríbhin...,<bof> airteagal <eof>,0.316667


In [189]:
mean = df_test1["BLEU Score"].mean()
print(f"Average BLEU score of `Attention` Model: {mean}")

Average BLEU score of `Attention` Model: 0.4730971935081272
