# NLP From Scratch: Translation with a Seq2Seq RNN model

In this project we will be teaching a neural network to translate from German to English.


This is made possible by the simple but powerful idea of the [sequence
to sequence network](https://arxiv.org/abs/1409.3215), in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.


## Dependencies

In [None]:
# using Python 3.9
%pip install pandas torch matplotlib numpy ipython

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import re
import random
import time
import pandas as pd
from IPython.display import Image
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from utils import *
%load_ext autoreload
%autoreload 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Loading & preprocessing text data

Implement the function `read_txt` to read the input file and transform it into a DataFrame, which is then passed to the `parse_data` function to be normalized. If the parameter `reverse` is True, the order of the languages in each pair is swapped, so that the model translates in the opposite direction. Since there are a *lot* of example sentences and we want to train
something quickly, we'll only select a subset of pairs: implement the `filter_pairs` function to drop all pairs in which no word ends with one of a predefined list of suffixes.

In [None]:
def read_txt(path:str)-> pd.DataFrame:
  """
  #TODO: Task 1 (5 points)
  Parse the data from the file and return a DataFrame with columns ['ENG','GER'].
  """
  raise NotImplementedError("Task 1 not implemented")
  return pairs



def parse_data(pairs:pd.DataFrame, reverse=False)-> pd.DataFrame:
  pairs['GER'] = pairs['GER'].apply(normalize_string)
  pairs['ENG'] = pairs['ENG'].apply(normalize_string)

  if reverse:
    pairs = pairs.iloc[:, [1,0]]

  return pairs

In [None]:
suffixes = ["hood", "ness", "ment", "ship", "ance", "ise", "ize", "ly", "etion", "ity"]

"""
#TODO: Task 2 (10 pt)

Implement the filter_pairs function so that it takes a pd.DataFrame of sentence pairs in 2 languages
and keeps only the rows in which the sentence in the selected language contains at least one word ending
with one of the suffixes. E.g.:

"Sisterhood is very important" --> keep
"We use Mentimeter in lectures" --> drop
"Kindness and perseverance are virtues" --> keep


- those where one of both exceeds the maximal sentence length (MAX_WORDS)
- those where the pair contains any of the words defined in interrogative_words or a question mark.

Your method should work by pattern detection rather than explicit iteration.

"""

def filter_pairs(pairs,
                 suffixes,
                 language="ENG"):
    raise NotImplementedError("Task 2 not implemented")
    return pairs



Each word in a language will be represented as a one-hot
vector, or giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are many many more words, so the encoding vector is much
larger. We will however cheat a bit and trim the data to only use a few
thousand words per language.

We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Language`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` which will be used to replace rare words later.
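
As a minimal, self-contained sketch of this idea (plain Python, independent of the `Language` class; the helper names `build_vocab` and `one_hot` are hypothetical and used only for illustration), a word-to-index mapping and the matching one-hot vector could look like this:

```python
def build_vocab(sentences):
    """Assign each distinct word a unique index, in order of first appearance."""
    word2index = {}
    word2count = {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                word2index[word] = len(word2index)
            word2count[word] = word2count.get(word, 0) + 1
    index2word = {i: w for w, i in word2index.items()}
    return word2index, index2word, word2count


def one_hot(index, size):
    """Return a list of zeros with a single one at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


word2index, index2word, word2count = build_vocab(["i am happy", "i am here"])
print(word2index)                                     # {'i': 0, 'am': 1, 'happy': 2, 'here': 3}
print(one_hot(word2index["happy"], len(word2index)))  # [0, 0, 1, 0]
```

In practice the networks in this notebook consume the integer index directly through `nn.Embedding`, which is equivalent to multiplying the one-hot vector with the embedding matrix.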




In [None]:
SOS_token = 0
EOS_token = 1

"""
TODO: Task 3 (10 pt)

Implement the method stem, which takes a list of suffixes as input and maps all the words
that consist of a common stem plus one of the suffixes in the list to that common stem.

The function should also take care of:
- removing the original words from all counters and class dictionaries
- updating word2count to the total counts for all the stemmed words
- updating the index in the dictionary so that its values run from 0 to n_words

You can create additional functions for the task.

"""


class Language:
    def __init__(self):
        self.word2index = {} # maps word to integer index
        self.word2count = {} # maps word to its frequency
        self.index2word = {0: "SOS", 1: "EOS"} # maps index to a word
        self.n_words = 2  # Count SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


    def remove_word(self, word):
        if word in self.word2index:
            del self.word2count[word]
            index = self.word2index[word]
            del self.word2index[word]
            del self.index2word[index]

    def stem(self, suffix_list):
        raise NotImplementedError("Task 3 not implemented")
    

The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter by length and content
-  Make word lists from sentences in pairs




In [None]:
def prepare_data(pairs, suffixes, stem=None):

    print(f"Read {len(pairs)} sentence pairs")
    pairs = filter_pairs(pairs, suffixes=suffixes)
    print(f"Filtered {len(pairs)} sentence pairs")
    pairs = pairs.to_numpy()

    input_lang = Language()
    output_lang = Language()

    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])
    
    if stem:
        # Stem the input vocabulary if stem == "input", otherwise the output vocabulary
        lang_to_stem = input_lang if stem == "input" else output_lang
        lang_to_stem.stem(suffixes)

    print(f"Input language: {input_lang.n_words} words")
    print(f"Output language: {output_lang.n_words} words")
    return input_lang, output_lang, pairs

path = "data/pairs.txt"
suffixes = ["hood", "ness", "ment", "ship", "ance", "ise", "ize", "ly", "etion", "ity"]
pairs = parse_data(read_txt(path), reverse=True)
input_lang, output_lang, pairs = prepare_data(pairs, suffixes, stem="output")
### SHOW NOTEBOOK OUTPUT ###




## Preparing Training Data

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and a target tensor (indexes of the words in
the target sentence). While creating these tensors we will add the SOS token at the beginning and the
EOS token at the end of both sequences.
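
The wrapping step can be sketched without PyTorch as follows. This is only an illustration: `toy_word2index` and `wrap_with_markers` are hypothetical names, and the real indices come from the `Language` objects built earlier.

```python
SOS_token = 0  # start-of-sequence marker, same value as in the notebook
EOS_token = 1  # end-of-sequence marker, same value as in the notebook

# Hypothetical toy vocabulary; indices 0 and 1 are reserved for SOS and EOS
toy_word2index = {"ich": 2, "bin": 3, "hier": 4}


def wrap_with_markers(sentence, word2index):
    """Map words to indices and add SOS at the start and EOS at the end."""
    indexes = [word2index[word] for word in sentence.split(' ')]
    return [SOS_token] + indexes + [EOS_token]


print(wrap_with_markers("ich bin hier", toy_word2index))  # [0, 2, 3, 4, 1]
```

Note that a word missing from the mapping raises a `KeyError`, so every sentence must only contain words that are present in the vocabulary.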




In [17]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.insert(0, SOS_token)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).unsqueeze(0)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

## Seq2Seq RNN Model

To train, we run the input sentence through the encoder and keep its
final hidden state. The decoder is then given
the `<SOS>` token as its first input, and the last hidden state of the
encoder as its first hidden state. At each subsequent step, the decoder makes a prediction
based on the most likely token it predicted in the previous step. If
"teacher forcing" is used, the real target token is used as
the next input instead of the decoder's guess.
Using teacher forcing causes the model to converge faster, but [when the trained
network is exploited, it may exhibit
instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf).
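
To make the choice between the two input sources concrete, here is a toy sketch on plain strings rather than tensors. The function `choose_next_input` and the example tokens are hypothetical; this illustrates the decision rule only and is not a drop-in implementation for the model.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the decoder's next input.

    With probability `teacher_forcing_ratio` the ground-truth token is used
    (teacher forcing); otherwise the decoder's own prediction is fed back.
    """
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher_forcing else predicted_token


rng = random.Random(0)  # fixed seed so the run is reproducible
target = ["i", "am", "here"]
predicted = ["i", "was", "there"]  # hypothetical decoder guesses

# A ratio of 1.0 always uses the target, a ratio of 0.0 always uses the prediction
print([choose_next_input(t, p, 1.0, rng) for t, p in zip(target, predicted)])
print([choose_next_input(t, p, 0.0, rng) for t, p in zip(target, predicted)])
```

With a ratio strictly between 0 and 1, the two sources are mixed at random within a single sequence.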






In [None]:
""" Task 4 (10 pt)
Implement the decoder's iterative step so that teacher forcing is applied probabilistically according to `teacher_forcing_ratio`.
At each step, the model should choose, according to the ratio, whether to use its last prediction or the target token as the input for the next prediction.

"""
class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, n_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        
    def forward(self, src):
        # src: (batch_size, src_len)
        embedded = self.embedding(src)  # (batch_size, src_len, embed_dim)
        outputs, hidden = self.gru(embedded)  # outputs: (batch_size, src_len, hidden_dim), hidden: (n_layers, batch_size, hidden_dim)
        return hidden  # Only hidden state is returned for decoder

# Decoder Model
class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, n_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, trg, hidden):
        trg = trg.unsqueeze(1)  
        embedded = self.embedding(trg)  
        output, hidden = self.gru(embedded, hidden)  
        prediction = self.fc_out(output.squeeze(1))  
        return prediction, hidden

# Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: (1, src_len), trg: (1, trg_len)
        trg_len = trg.size(1)
        output_dim = self.decoder.fc_out.out_features

        outputs = torch.zeros(1, trg_len, output_dim).to(device)
        
        # Encode the source sequence
        hidden = self.encoder(src)

        # First input to the decoder is the <sos> token
        input = trg[:, 0]

        for t in range(1, trg_len):
            # Decode one token at a time
            output, hidden = self.decoder(input, hidden)
            outputs[:, t, :] = output

            raise NotImplementedError("Task 4 not implemented")

        return outputs

In [None]:
cfg = {"n_iters": 10**3,
       "print_every":10, 
       "plot_every":100,
       "learning_rate":0.01, 
       "teacher_forcing_ratio":.5}

teacher_forcing_ratio = .5

def train(training_pair, 
          model,
          optimizer, 
          criterion, 
          teacher_forcing_ratio=0.5):
    
    optimizer.zero_grad()
    input_tensor, target_tensor = training_pair
    pred_tensor = model(input_tensor, target_tensor, teacher_forcing_ratio)
    loss = criterion(pred_tensor.squeeze(0), target_tensor.squeeze(0))
    
    loss.backward()
    optimizer.step()
    return loss.item() 


def trainIters(model, cfg):

    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    optimizer = optim.SGD(model.parameters(), 
                          lr=cfg["learning_rate"])
    
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(cfg["n_iters"])]
    
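    # The decoder returns raw logits (no log-softmax), so use cross-entropy;
    # nn.NLLLoss would expect log-probabilities instead
    criterion = nn.CrossEntropyLoss()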

    for iter in range(1, cfg["n_iters"] + 1):
        training_pair = training_pairs[iter - 1]
        loss = train(training_pair, 
                     model,
                     optimizer, 
                     criterion,
                     cfg["teacher_forcing_ratio"]
                     )
        print_loss_total += loss
        plot_loss_total += loss

        if iter % cfg["print_every"] == 0:
            print_loss_avg = print_loss_total / cfg["print_every"]
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / cfg["n_iters"]),
                                         iter, iter / cfg["n_iters"] * 100, print_loss_avg))

        if iter % cfg["plot_every"] == 0:
            plot_loss_avg = plot_loss_total / cfg["plot_every"]
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

    return plot_losses



## Evaluation

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there.
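
The stopping rule can be illustrated with a toy loop that is independent of the model. Here `step_fn` is a hypothetical stand-in for one decoder step, and `max_len` is an assumed safety limit for the case where EOS is never predicted.

```python
SOS_token = 0
EOS_token = 1


def greedy_decode(step_fn, max_len=10):
    """Feed each prediction back as the next input until EOS or max_len.

    `step_fn(token, state)` is a hypothetical stand-in for one decoder step;
    it returns (next_token, new_state).
    """
    token, state = SOS_token, None
    decoded = []
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == EOS_token:
            break  # stop as soon as the end-of-sequence token is predicted
        decoded.append(token)
    return decoded


def toy_step(token, state):
    """Toy 'decoder' that emits a fixed script of tokens: 5, 7, then EOS."""
    script = [5, 7, EOS_token]
    position = 0 if state is None else state
    return script[position], position + 1


print(greedy_decode(toy_step))  # [5, 7]
```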




In [121]:
def evaluate(model, sentence):
    input_tensor = tensorFromSentence(input_lang, sentence)
    with torch.no_grad():
        pred_tensor = model(input_tensor, input_tensor, teacher_forcing_ratio=0)
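        # Position 0 is never predicted (the decoder starts from SOS), so skip it
        pred_indices = pred_tensor.squeeze(0).argmax(1)[1:].tolist()

    # Stop at the first predicted EOS token
    if EOS_token in pred_indices:
        pred_indices = pred_indices[:pred_indices.index(EOS_token)]
    pred_sentence = ' '.join([output_lang.index2word[i] for i in pred_indices])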
        
    return pred_sentence

# We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements
def evaluateRandomly(model, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_sentence = evaluate(model, pair[0])
        print('<', output_sentence)
        print('')

## Training and Evaluating

With all these helper functions in place (it looks like extra work, but
it makes it easier to run multiple experiments) we can actually
initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small
dataset we can use a relatively small network with an embedding size of 256, a hidden size of 512 and a
single GRU layer. When training from scratch, after about 40 minutes on a MacBook CPU we'll get some
reasonable results.



In [None]:

decoder = Decoder(output_lang.n_words, 256, 512)
encoder = Encoder(input_lang.n_words, 256, 512)
model = Seq2Seq(encoder, decoder).to(device)
losses = trainIters(model, cfg)
### TODO: SHOW NOTEBOOK OUTPUT ###

In [None]:
#TODO: print outcome of trained model 
 
evaluateRandomly(model, 5)

# Transfer learning
## Fine-tune an out-of-the-box encoder-decoder model

In [None]:
"""
TODO: Task 5 (15 pt)

Load a transformer (seq2seq model using attention layers) and fine-tune it to your problem.
Show the training progress of the model and the training metrics. Briefly
explain (3-4 sentences) the model choice and comment on the outcomes.

Note: you don't have to use the above-defined methods for training.
Splitting the dataset into train and test sets is welcome, but not required for the task.

"""

raise NotImplementedError("Task 5 not implemented")

# Credits

This problem set is based upon an official PyTorch [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html). Many thanks to PyTorch, [Sean Robertson](https://github.com/spro/practical-pytorch) and  [Florian Nachtigall](https://github.com/FlorianNachtigall).

Be cautious about looking at the original notebook for answers. Many details have been changed and you won't be able to copy and paste solutions.
