# NLP. Lesson 11. RNN with attention for translation


## RNN for Seq2Seq

A Seq2Seq (Sequence-to-Sequence) model is a machine learning architecture designed for tasks involving sequential data. It takes an input **sequence**, processes it, and generates an output **sequence**. Today we will look at Neural Machine Translation with seq2seq.


## Architecture. Encoder decoder for translation

An Encoder-Decoder network is a model consisting of two RNNs (LSTMs or GRUs) called the encoder and the decoder. The encoder reads an input sequence and outputs a single vector, the `context vector`, and the decoder reads that vector to produce an output sequence.

With a seq2seq model, the encoder creates a single vector which, in the ideal case, encodes the “meaning” of the input sequence: a single point in some N-dimensional space of sentences. Unlike sequence prediction with a single RNN, where every input corresponds to an output, the seq2seq model frees us from constraints on sequence length and order, which makes it ideal for **translation** between two languages.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/DecEncHighLevel.png" alt="Decoder and Encoder" width="800"/>

Consider the sentence "nice to meet you", which must be translated into French: "ravi de vous rencontrer".

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/DecEncArchitecture.png" alt="Decoder and Encoder Construction" width="800"/>

### Encoder

The encoder is an LSTM or GRU cell.
The input sequence is fed in one token per time-step, and the encoder tries to encapsulate all of its information in its final internal states $h_t$ (hidden state) and $c_t$ (cell state; LSTM only). For every input word the encoder produces an output vector and a hidden state, and it uses the hidden state when processing the next input word. The final internal states are then passed to the decoder, which uses them to produce the target sequence. In this basic setup, the encoder outputs at each time-step are discarded.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/Encoder.png" alt="Encoder Construction" width="300"/>

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/Encoder2.png" alt="Encoder Construction 2" width="600"/>

### Decoder

The decoder is another RNN that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

In the simplest seq2seq decoder we use only the last output of the encoder, the context vector. This context vector is used as the **initial** hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string `<SOS>` or `<START>` token, and the first hidden state is the context vector (the encoder’s last hidden state).

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/Decoder.png" alt="Decoder" width="800"/>

Decoder's output at any time-step t is supposed to be the $t^{th}$ word in the target-sequence (“ravi de vous rencontrer”). To explain this, let's see what happens at each time-step:
1. The input fed to the decoder is a special symbol "\<START\>" or "\<SOS\>", which signifies the start of the output sequence. The decoder uses this input and the internal states ($h_t$, $c_t$) to produce the output of the 1st time-step, which is supposed to be the 1st word/token in the target sequence, i.e. "ravi".
2. The output from the 1st time-step, "ravi", is fed as input to the 2nd time-step. The output of the 2nd time-step is supposed to be the 2nd word in the target sequence, i.e. "de".
3. Continue until we get the "\<END\>" or "\<EOS\>" symbol, another special symbol that marks the end of the output sequence. The final internal states of the decoder are discarded.
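
The three steps above amount to a greedy decoding loop. The sketch below is a toy illustration only: `toy_decoder_step` and `NEXT_TOKEN` are hypothetical stand-ins for a trained decoder (a lookup table instead of an RNN), and the hidden state is left out for brevity.

```python
# Toy stand-in for a trained decoder: maps the previous token to the next one.
# A real decoder would also take and return a hidden state.
NEXT_TOKEN = {
    "<SOS>": "ravi",
    "ravi": "de",
    "de": "vous",
    "vous": "rencontrer",
    "rencontrer": "<EOS>",
}


def toy_decoder_step(prev_token: str) -> str:
    """Hypothetical single decoding step: predict the next token."""
    return NEXT_TOKEN.get(prev_token, "<EOS>")


def greedy_decode(max_length: int = 10) -> list[str]:
    """Feed each prediction back as the next input until <EOS> appears."""
    output = []
    token = "<SOS>"  # step 1: decoding starts from the start symbol
    for _ in range(max_length):
        token = toy_decoder_step(token)  # step 2: previous output becomes the input
        if token == "<EOS>":  # step 3: stop at the end symbol
            break
        output.append(token)
    return output


translation = greedy_decode()
print(" ".join(translation))  # ravi de vous rencontrer
```

The `max_length` cap plays the same role as a maximum output length in a real decoder: it guarantees termination even if the end symbol is never predicted.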

### Teacher Forcing

During the training phase, especially in the first epochs, the predictions are close to random. The word '\<START\>' in vector form is fed as the input vector. The model will output something random; suppose we got the word 'de'. Should we now use this predicted word as the input at time-step 2? We can, but in practice this was seen to cause problems such as slow convergence, model instability, and poor final quality.

Teacher forcing was introduced to rectify this. We feed the **true** output/token (and not the predicted output) from the previous time-step as input to the current time-step. That means the input to time-step 2 will be the true value 'ravi', not the predicted 'de'. This procedure continues throughout training.

Testing does not use teacher forcing, since we do not know the real outputs. Only the words predicted at the previous step are used as input to the next step.
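
To make the difference concrete, the sketch below lists which token the decoder receives at each time-step with and without teacher forcing. It is an illustration under assumptions: `decoder_inputs` is a hypothetical helper, and the `predicted` sequence is made up to mimic early-training output.

```python
def decoder_inputs(target: list[str], predicted: list[str], teacher_forcing: bool) -> list[str]:
    """Return the token fed to the decoder at each time-step.

    target: the reference translation, ending with <EOS>.
    predicted: what the model produced at each step (possibly wrong).
    """
    previous = target if teacher_forcing else predicted
    # The input at step t is the token from step t-1; step 1 always gets <SOS>.
    return ["<SOS>"] + previous[:-1]


target = ["ravi", "de", "vous", "rencontrer", "<EOS>"]
predicted = ["de", "de", "vous", "vous", "<EOS>"]  # made-up early-training output

with_tf = decoder_inputs(target, predicted, teacher_forcing=True)
without_tf = decoder_inputs(target, predicted, teacher_forcing=False)
print(with_tf)     # ['<SOS>', 'ravi', 'de', 'vous', 'rencontrer']
print(without_tf)  # ['<SOS>', 'de', 'de', 'vous', 'vous']
```

With teacher forcing, an early mistake does not propagate: step 2 still sees the correct word even though step 1 predicted the wrong one.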

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/DecoderTestPhase.png" alt="Decoder in the Test Phase" width="800"/>

### Problem 
A fixed source representation is suboptimal for two reasons: (i) for the encoder, it is hard to compress the whole sentence into a single vector; (ii) for the decoder, different information may be relevant at different steps. The decoder sees only one representation of the source, yet at each generation step different parts of the source can be more useful than others. In the current setting, the decoder has to extract the relevant information from the same fixed representation, which is hardly an easy thing to do.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/bottleneck-min.png" alt="Bottleneck" width="600"/>

## Attention

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs at every step of the decoder’s own outputs. The rest of the decoder works as before:
- Initialization: the final states from the encoder are assigned as the initial states of the decoder.
- Operation: at each time step, the decoder produces an output as well as its own hidden state, using the hidden state from the previous step.
- Output generation: the output $y_t$ at each time step is computed using a softmax function. This function generates a probability distribution over the output vocabulary, which determines the final output (such as a word in the translation).


First we calculate a set of attention weights/attention scores (weight distribution).

These will be multiplied by the encoder output vectors to create a weighted combination. The result should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.
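
As a small numeric illustration of this weighted combination, the sketch below turns raw scores into weights with a softmax and mixes the encoder outputs accordingly. The vectors and scores are made up, and the helper names `softmax` and `weighted_combination` are introduced here only for the example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_combination(weights: list[float], vectors: list[list[float]]) -> list[float]:
    """Sum the encoder output vectors, each scaled by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


# Made-up encoder outputs for a 3-word source sentence (2 dimensions each)
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Made-up attention scores for one decoder step: the second word matters most
scores = [0.0, 2.0, 0.0]

weights = softmax(scores)
context = weighted_combination(weights, encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because the second score is the largest, the resulting context vector is pulled mostly toward the second encoder output.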

Bahdanau attention, also known as additive attention, is a commonly used attention mechanism in sequence-to-sequence models, particularly in neural machine translation. It was introduced by Bahdanau et al. in the paper *Neural Machine Translation by Jointly Learning to Align and Translate*. This mechanism uses a learned alignment model, a small feed-forward neural network, to compute attention scores between the encoder and decoder hidden states.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/bahdanau_model-min.png" alt="Bahdanau model" width="900"/>
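
In formula form, additive attention can be written as follows. The notation is an assumption chosen for this note: $s_{t-1}$ is the previous decoder hidden state, $h_i$ is the encoder output at source position $i$, and $W_a$, $U_a$, $v_a$ are learned parameters.

$$
e_{t,i} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_i\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
c_t = \sum_i \alpha_{t,i} h_i
$$

Here $e_{t,i}$ is the alignment score, $\alpha_{t,i}$ is the attention weight, and $c_t$ is the context vector passed to the decoder at step $t$.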

There are alternative attention mechanisms, such as Luong attention, which in its simplest form computes attention scores as the dot product between the decoder hidden state and the encoder hidden states. This variant does not involve the non-linear transformation used in Bahdanau attention.
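
A minimal sketch of dot-product scoring, with made-up hidden states and a helper name (`dot_score`) introduced only for illustration. The resulting scores would then be normalized with a softmax to obtain attention weights.

```python
def dot_score(decoder_state: list[float], encoder_state: list[float]) -> float:
    """Dot-product score: large when the two vectors point the same way."""
    return sum(d * e for d, e in zip(decoder_state, encoder_state))


# Made-up hidden states: one decoder state and three encoder states
decoder_state = [1.0, 0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]

scores = [dot_score(decoder_state, h) for h in encoder_states]
print(scores)  # [1.0, 0.5, 2.5]
```

Note that this scoring has no learned parameters, which is one reason it is cheaper than the additive variant.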

### Query, Key, and Value in Self-Attention
Formally, this intuition is implemented with a query-key-value attention. Each input token in self-attention receives three representations corresponding to the roles it can play:
- query - asking for information;
- key - saying that it has some information;
- value - giving the information.

The query is used when a token looks at others - it's seeking the information to understand itself better. The key is responding to a query's request: it is used to compute attention weights. The value is used to compute attention output: it gives information to the tokens which "say" they need it (i.e. assigned large weights to this token).
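
One common way to write query-key-value attention is the scaled dot-product form used in the Transformer. It is given here as an assumption about which scoring function is meant, since the text above does not fix one; $Q$, $K$, and $V$ stack the queries, keys, and values as rows, and $d_k$ is the key dimension.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

The product $QK^\top$ compares every query with every key, the softmax turns each row into attention weights, and multiplying by $V$ gives the weighted sum of values.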

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab11/qkv_explained-min.png" alt="Query, key, and value explained" width="900"/>

----------

[THE DEEPER LOOK on the Seq2Seq models and attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)

[CODE EXAMPLE](https://medium.com/@mervebdurna/exploring-seq2seq-encoder-decoder-and-attention-mechanisms-in-nlp-theory-and-practice-9b1022cf50b4)

-------

Pros:

- **Ability to handle variable-length input and output.** The encoder-decoder model can handle input and output sequences of varying lengths, making it suitable for tasks like machine translation and text summarization.
-  **Flexibility in model architecture.** The encoder and decoder can be implemented using different architectures, such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or transformers, allowing for flexibility in model design.
-  **Effective for sequential data.** The encoder-decoder model is particularly effective for sequential data, such as text, speech, or time series data, where the order of the input data matters.
- **Can learn complex patterns.** The encoder-decoder model can learn complex patterns in the input data, including long-range dependencies and contextual relationships.
- **Can be used for both supervised and unsupervised learning.** The encoder-decoder model can be used for both supervised learning tasks, such as machine translation, and unsupervised learning tasks, such as text generation.

Cons:
- **Training can be computationally expensive.** Training an encoder-decoder model can be computationally expensive, especially for large datasets or complex models.
- **Requires large amounts of data.** The encoder-decoder model requires large amounts of data to train effectively, which can be a challenge for low-resource languages or tasks.
- **Can be prone to overfitting.** The encoder-decoder model can be prone to overfitting, especially when the model is complex or the training data is limited.
- **Difficult to interpret.** The encoder-decoder model can be difficult to interpret, as the learned representations and transformations are not always transparent.
- **May not perform well on out-of-domain data.** The encoder-decoder model may not perform well on out-of-domain data, which can be a challenge for tasks like machine translation or text summarization.
- **Requires careful tuning of hyperparameters.** The encoder-decoder model requires careful tuning of hyperparameters, such as the number of layers, batch size, hidden size, and dropout rate, to achieve good performance.

### Loading the data


In [1]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/por.txt

--2024-07-12 19:04:58--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/por.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24609459 (23M) [text/plain]
Saving to: ‘por.txt.1’


2024-07-12 19:04:58 (255 MB/s) - ‘por.txt.1’ saved [24609459/24609459]



In [2]:
import pandas as pd

train_df = pd.read_csv(
    "por.txt",
    sep="\t",
    usecols=[0, 1],
    names=["EN", "PR"],
)
train_df.head()

|   | EN | PR |
|---|---|---|
| 0 | Go. | Vai. |
| 1 | Go. | Vá. |
| 2 | Hi. | Oi. |
| 3 | Run! | Corre! |
| 4 | Run! | Corra! |


### Preprocess


In [3]:
import unicodedata
import re


def unicode2ascii(s: str) -> str:
    """ Converts Unicode characters to their closest ASCII representation.
    Example: Vá -> Va
    Args:
        s (str): string to be converted

    Returns:
        str: converted string
    """
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )


def normalize_string(s: str) -> str:
    """ The function is similar to the preprocess_text function from the previous lesson. """
    s = unicode2ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    return s.strip()
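
A quick check of what the normalization does. The two functions are repeated from the cell above so that the snippet runs on its own; the sample strings are chosen for illustration. Note that periods and digits are removed, while `!` and `?` are kept as separate tokens.

```python
import re
import unicodedata


def unicode2ascii(s: str) -> str:
    # Decompose accented characters and drop the combining marks
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )


def normalize_string(s: str) -> str:
    s = unicode2ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)     # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)  # replace everything except letters, ! and ?
    return s.strip()


print(normalize_string("Vá."))           # va
print(normalize_string("Corra!"))        # corra !
print(normalize_string("How are you?"))  # how are you ?
```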

In [4]:
from typing import Iterable, Iterator

from tqdm import tqdm


def preprocess(texts: Iterable[str]) -> Iterator[list[str]]:
    """ Function for splitting the texts into tokens. Behaves as a generator. """

    for text in tqdm(texts, desc="Building vocab"):
        tokens = normalize_string(str(text)).split()
        yield tokens

### Create vocabs


In [5]:
from torchtext.vocab import build_vocab_from_iterator

# Vocabulary building for both languages
# Uses the preprocess function to tokenize texts
# Special tokens are passed via `specials` and placed first

SPECIAL_TOKENS = ["<SOS>", "<EOS>", "<UNK>", "<PAD>"]
english_vocab = build_vocab_from_iterator(
    preprocess(train_df["EN"].values), special_first=True, specials=SPECIAL_TOKENS
)

port_vocab = build_vocab_from_iterator(
    preprocess(train_df["PR"].values), special_first=True, specials=SPECIAL_TOKENS
)

Building vocab: 100%|██████████| 168903/168903 [00:02<00:00, 57337.61it/s]
Building vocab: 100%|██████████| 168903/168903 [00:02<00:00, 57086.90it/s]


In [6]:
print("Length of english vocab:", len(english_vocab))
print("Length of portuguese vocab:", len(port_vocab))

Length of english vocab: 12221
Length of portuguese vocab: 20582
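
Conceptually, a vocabulary is a mapping from tokens to integer indexes with the special tokens placed first. The sketch below is a simplified stand-in written for illustration, not torchtext's implementation: `build_toy_vocab` is a hypothetical helper, and its ordering rule (by frequency, ties broken alphabetically) is an assumption.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials):
    """Simplified illustration: specials first, then tokens by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by frequency (highest first), breaking ties alphabetically
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}


toy_vocab = build_toy_vocab(
    [["go"], ["run", "!"], ["go", "now"]],
    specials=["<SOS>", "<EOS>", "<UNK>", "<PAD>"],
)
print(toy_vocab)
```

Because the same list of special tokens is placed first in both vocabularies, `<SOS>`, `<EOS>`, `<UNK>`, and `<PAD>` get the same indexes in the source and target languages.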


In [7]:
# More functions for text processing
# tensor and DataLoader preparation

import torch
import numpy as np
import torchtext.vocab
from torch.utils.data import TensorDataset, DataLoader


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
SOS_TOKEN_IDX = english_vocab["<SOS>"]
EOS_TOKEN_IDX = english_vocab["<EOS>"]

# maximum number of tokens per sentence (including <EOS>);
# shorter sentences are padded with <PAD> up to this length
MAX_LENGTH = 100


def sentence2idxs(vocab: torchtext.vocab.Vocab, sentence: str) -> list[int]:
    """ Token indexing function.
    Converts each word to its corresponding index in the vocabulary.
    Args:
        vocab (torchtext.vocab.Vocab): built vocabulary {word: index}
        sentence (str): string to be indexed

    Returns:
        list[int]: list of indexes
    """
    # split() without arguments matches preprocess() and yields [] for an empty string
    tokens = normalize_string(str(sentence)).split()
    return [vocab[word] if word in vocab else vocab["<UNK>"] for word in tokens]


def sentence2tensor(vocab: torchtext.vocab.Vocab, sentence: str) -> torch.Tensor:
    """ Converts a sentence to a tensor ready to be fed into the NN.
    Turns a sentence into a list of indexes from vocab, appends the <EOS> index, and converts it to a tensor.
    Args:
        vocab (torchtext.vocab.Vocab): built vocabulary
        sentence (str): text to be converted

    Returns:
        torch.Tensor: tensor of shape (1, number of tokens + 1)
    """
    indexes = sentence2idxs(vocab, sentence)
    indexes.append(EOS_TOKEN_IDX)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)


def process_sentence(vocab: torchtext.vocab.Vocab, sentence: str) -> list[int]:
    """ Same as sentence2tensor, but returns a plain list of indices instead of a tensor. """

    idxs = sentence2idxs(vocab, sentence)
    idxs.append(EOS_TOKEN_IDX)
    return idxs


def get_dataloader(
    df: pd.DataFrame,
    input_vocab: torchtext.vocab.Vocab,
    output_vocab: torchtext.vocab.Vocab,
    batch_size=64, in_column="EN", out_column="PR"
) -> DataLoader:
    """ Builds a DataLoader of padded index tensors from a dataframe of sentence pairs.

    Args:
        df (pd.DataFrame): dataframe containing the text data
        input_vocab (torchtext.vocab.Vocab): vocabulary for the input (English) lang.
        output_vocab (torchtext.vocab.Vocab): vocab for the output (Portuguese) lang.
        batch_size (int, optional): number of samples per batch. Defaults to 64.
        in_column (str, optional): column which must be translated (contains
                                language from the input_vocab). Defaults to "EN".
        out_column (str, optional): column with translated sentences (contains
                                language from the output_vocab). Defaults to "PR".

    Returns:
        DataLoader: DataLoader with 'features' and 'target' variables
    """

    n = len(df)

    # arrays for filling them with processed data: indexes of words from vocabs
    # shape: (n, MAX_LENGTH), all values are "<PAD>"
    input_idxs = np.ones((n, MAX_LENGTH), dtype=np.int32) * input_vocab["<PAD>"]
    target_idxs = np.ones((n, MAX_LENGTH), dtype=np.int32) * output_vocab["<PAD>"]

    # iterate over the dataframe; idx is the row label, so df must be indexed 0..n-1
    for idx, row in tqdm(df.iterrows(), total=n):
        # truncate to MAX_LENGTH so that overlong sentences still fit the padded arrays
        in_lang_idxs = process_sentence(input_vocab, row[in_column])[:MAX_LENGTH]
        out_lang_idxs = process_sentence(output_vocab, row[out_column])[:MAX_LENGTH]

        input_idxs[idx, :len(in_lang_idxs)] = in_lang_idxs
        target_idxs[idx, :len(out_lang_idxs)] = out_lang_idxs

    # Convert input_idxs and target_idxs to PyTorch LongTensors and combine
    # them into a TensorDataset, which pairs input tensors with their
    # corresponding target tensors
    data = TensorDataset(
        torch.LongTensor(input_idxs).to(device),
        torch.LongTensor(target_idxs).to(device),
    )

    dataloader = DataLoader(data, batch_size=batch_size)
    return dataloader


In [8]:
sentence2idxs(english_vocab, 'hello world')    # Output: [1440, 455]
sentence2tensor(english_vocab, 'hello world')  # Output: tensor([[1440, 455, 1]]) where 1 is the symbol <EOS>
process_sentence(english_vocab, 'hello world') # Output: [1440, 455, 1]

[1440, 455, 1]
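
The padding step inside `get_dataloader` can be illustrated on a single sentence. This is a stdlib-only sketch: `pad_to_length` is a hypothetical helper, and the index values for `<EOS>` and `<PAD>` assume the special tokens come first in the order given by `SPECIAL_TOKENS`.

```python
def pad_to_length(indexes: list[int], max_length: int, pad_idx: int) -> list[int]:
    """Truncate or right-pad a list of token indexes to exactly max_length."""
    clipped = indexes[:max_length]
    return clipped + [pad_idx] * (max_length - len(clipped))


EOS_IDX = 1  # assumed index of <EOS>
PAD_IDX = 3  # assumed index of <PAD>

sentence = [1440, 455, EOS_IDX]  # 'hello world' plus <EOS>, as in the cell above
padded = pad_to_length(sentence, max_length=6, pad_idx=PAD_IDX)
print(padded)  # [1440, 455, 1, 3, 3, 3]
```

Padding every sentence to the same length is what allows the sentences to be stacked into one rectangular tensor per batch.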

### Encoder Decoder RNN Model


In [8]:
# Models building

import torch.nn as nn


class EncoderRNN(nn.Module):
    """ Already observed architecture of an Encoder:
        embedding layer, GRU layer, and return states. """
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden

In [9]:
import torch.nn.functional as F


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)                  # gets the batch size from encoder_outputs (the same as for encoder)

        decoder_input = torch.empty(                          # creates an initial input tensor filled with
            batch_size, 1, dtype=torch.long, device=device    # the Start-of-Sequence token index (SOS_TOKEN_IDX).
        ).fill_(SOS_TOKEN_IDX)

        decoder_hidden = encoder_hidden                       # decoder's initial state = encoder's final state
        decoder_outputs = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden = self.forward_step(  # forward_step - for getting the output and hidden
                decoder_input, decoder_hidden                    # state for the current time step
            )
            decoder_outputs.append(decoder_output)

            if target_tensor is not None:
                # Teacher forcing: feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                # Without teacher forcing: use its own predictions as the next input: the top predicted token
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()        # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)      # list concatenation along the time dimension to form a tensor.
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return (
            decoder_outputs,
            decoder_hidden,
            None,                                                # return None for consistency in the training loop
        )

    def forward_step(self, input, hidden):
        output = self.embedding(input)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output)
        return output, hidden

### Train


In [24]:
def train_epoch(
    dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion
) -> float:
    """ Runs one training epoch and returns the mean loss per batch. """

    total_loss = 0
    for data in tqdm(dataloader):
        input_tensor, target_tensor = data

        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        loss = criterion(decoder_outputs.view(-1, decoder_outputs.size(-1)), target_tensor.view(-1))
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

In [23]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001):
    """ Training loop. """
    encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    for epoch in range(n_epochs):
        loss = train_epoch(
            train_dataloader,
            encoder,
            decoder,
            encoder_optimizer,
            decoder_optimizer,
            criterion,
        )
        print("Epoch:", epoch, "loss:", loss)
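
One thing to keep in mind when reading the loss values: every target is padded to `MAX_LENGTH`, and the criterion above averages over all positions, including `<PAD>`. Since most positions of a short sentence are padding, the reported loss can look lower than the loss on real tokens. A common remedy is to skip padded positions; `nn.NLLLoss` accepts an `ignore_index` argument for this purpose. The stdlib-only sketch below shows the idea with a hypothetical helper `masked_nll` and made-up log-probabilities.

```python
import math


def masked_nll(log_probs: list[list[float]], targets: list[int], pad_idx: int) -> float:
    """Mean negative log-likelihood over non-padding target positions only.

    log_probs[t][k] is the log-probability of token k at position t.
    """
    losses = [
        -log_probs[t][target]
        for t, target in enumerate(targets)
        if target != pad_idx
    ]
    return sum(losses) / len(losses) if losses else 0.0


PAD_IDX = 3  # assumed index of <PAD>
# Made-up log-probabilities over a 4-token vocabulary for 3 positions
log_probs = [
    [math.log(0.7), math.log(0.1), math.log(0.1), math.log(0.1)],
    [math.log(0.2), math.log(0.6), math.log(0.1), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.1), math.log(0.7)],
]
targets = [0, 1, PAD_IDX]  # the last position is padding and is skipped

loss = masked_nll(log_probs, targets, PAD_IDX)
print(round(loss, 4))
```

In this example the confidently predicted padding position does not contribute, so the loss reflects only the two real tokens.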

In [13]:
hidden_size = 128
batch_size = 64

train_dataloader = get_dataloader(
    train_df[:10000],
    batch_size=batch_size,
    input_vocab=english_vocab,
    output_vocab=port_vocab,
)

encoder = EncoderRNN(len(english_vocab), hidden_size).to(device)
decoder = DecoderRNN(hidden_size, len(port_vocab)).to(device)

100%|██████████| 10000/10000 [00:01<00:00, 9331.16it/s]


In [14]:
train(train_dataloader, encoder, decoder, n_epochs=5)

100%|██████████| 157/157 [00:22<00:00,  7.03it/s]


Epoch: 0 loss: 1.0296695083379745


100%|██████████| 157/157 [00:20<00:00,  7.50it/s]


Epoch: 1 loss: 0.2116721787839938


100%|██████████| 157/157 [00:20<00:00,  7.58it/s]


Epoch: 2 loss: 0.19829069875228178


100%|██████████| 157/157 [00:21<00:00,  7.32it/s]


Epoch: 3 loss: 0.18981548821090893


100%|██████████| 157/157 [00:21<00:00,  7.42it/s]

Epoch: 4 loss: 0.18284030913547344





## Task Description


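> The goal of the following tasks is to translate sentences from English to Dutch with an Encoder-Decoder RNN with an attention mechanism. You are given an example implementation of Bahdanau attention, and you need to modify the Decoder class. Note that the `NL` column of the provided file appears to contain German rather than Dutch sentences, as the sample rows below show; the `NL` and `dutch_vocab` names are kept as they are.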


### Data Loading

In [10]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/English_Dutch_sentences.zip
!unzip English_Dutch_sentences.zip
!rm English_Dutch_sentences.zip

--2024-07-12 19:05:24--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/English_Dutch_sentences.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17653435 (17M) [application/zip]
Saving to: ‘English_Dutch_sentences.zip’


2024-07-12 19:05:24 (405 MB/s) - ‘English_Dutch_sentences.zip’ saved [17653435/17653435]

Archive:  English_Dutch_sentences.zip
replace English_Dutch_sentences.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: English_Dutch_sentences.csv  


In [26]:
df = pd.read_csv("English_Dutch_sentences.csv", lineterminator='\n')
df.drop('Unnamed: 0', inplace=True, axis=1)
print(df.shape)
df.head()

(397546, 2)


|   | EN | NL |
|---|---|---|
| 0 | In this you have confounded the sceptics. | Damit haben Sie die Skeptiker überzeugt. |
| 1 | That is all I have to say. | Dies wollte ich feststellen. |
| 2 | How do I collect a prize if I win? | Wie kann ich meinen Gewinn kassieren? |
| 3 | How long can a human being actually bear all t... | Wie lange kann ein Mensch das überhaupt verkra... |
| 4 | Join the VN TerraTop professional league! | Steigen Sie ein in die VN TerraTop Profiliga! |


In [27]:
df = df.iloc[:80000, :].copy()  # copy so that later column assignments do not act on a slice

### Task 1. Data Preprocessing
1. Split the dataframe into train (0.75) and test (0.25) parts.
2. Reindex both DataFrames so that their indexes start from 0; do not use `.reindex`, which would produce a lot of missing values.
3. Create vocabularies for the 2 languages. Add special tokens for **unknown words, padding, start and end of strings**, and put them at the beginning of the vocab.

In [28]:
# Truncate target sentences to MAX_LENGTH characters, which in practice keeps
# the number of tokens small enough to fit into the padded arrays
df["NL"] = df["NL"].str[:MAX_LENGTH]

In [29]:
# Data Splitting
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.25)

In [30]:
train_df.index = range(len(train_df))
test_df.index = range(len(test_df))
train_df.head()

|   | EN | NL |
|---|---|---|
| 0 | So there is a problem here. | Das ist doch der springende Punkt. |
| 1 | The overnight stays vary between 66 and 76€. | Die Preise für eine Übernachtung mit Abendesse... |
| 2 | Our work is badly organised. | Unsere Arbeit ist schlecht organisiert. |
| 3 | Last but not least: the old lift is very cute! | Wir haben uns sehr, sehr wohl gefühlt. |
| 4 | View into the new oil condensing unit WTC-OW. | Ein Blick in das neue Weishaupt Öl-Brennwertge... |


In [31]:
from torchtext.vocab import build_vocab_from_iterator

SPECIAL_TOKENS = ["<SOS>", "<EOS>", "<UNK>", "<PAD>"]
english_vocab = build_vocab_from_iterator(
    preprocess(train_df["EN"].values), special_first=True, specials=SPECIAL_TOKENS
)

dutch_vocab = build_vocab_from_iterator(
    preprocess(train_df["NL"].values), special_first=True, specials=SPECIAL_TOKENS
)

Building vocab: 100%|██████████| 60000/60000 [00:01<00:00, 54492.71it/s]
Building vocab: 100%|██████████| 60000/60000 [00:01<00:00, 44182.22it/s]


In [32]:
print("Length of english vocab:", len(english_vocab))
print("Length of dutch vocab:", len(dutch_vocab))

Length of english vocab: 26614
Length of dutch vocab: 48467


In [33]:
train_df

|   | EN | NL |
|---|---|---|
| 0 | So there is a problem here. | Das ist doch der springende Punkt. |
| 1 | The overnight stays vary between 66 and 76€. | Die Preise für eine Übernachtung mit Abendesse... |
| 2 | Our work is badly organised. | Unsere Arbeit ist schlecht organisiert. |
| 3 | Last but not least: the old lift is very cute! | Wir haben uns sehr, sehr wohl gefühlt. |
| 4 | View into the new oil condensing unit WTC-OW. | Ein Blick in das neue Weishaupt Öl-Brennwertge... |
| ... | ... | ... |
| 59995 | I come now to the more long-term aspects. | Nun zu den eher langfristigen Aspekten. |
| 59996 | The restaurant is open from 11:30 to 22:00. | Das Restaurant ist von 11:30 bis 22:00 Uhr geö... |
| 59997 | • Safe deposit box in your room or at reception. | • Safe im Zimmer und an der Rezeption. |
| 59998 | Lost Horizon | Der Horizont geht verloren |


In [34]:
assert len(train_df) == 60000
assert train_df.index[0] == 0 and train_df.index[1] == 1
assert 0 <= english_vocab['<UNK>'] <= 3
assert 0 <= english_vocab['<SOS>'] <= 3

### Task 2. Model
Fill the gaps in the AttnDecoderRNN class. Complete the forward and forward_step functions.

- Architecture: Embedding layer, BahdanauAttention layer, GRU layer, and Linear layer. **Important note:** use `2 * hidden_size` as the `input_size` of the GRU, since its input is the embedded token concatenated with the attention context vector.
- Add dropout.
- As shown in this lesson, apply `log_softmax` to `decoder_outputs`.

In [35]:
import torch.nn.functional as F


class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        scores = scores.squeeze(2).unsqueeze(1)

        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)

        return context, weights


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)

        # create a tensor (size = batch_size) filled by '<SOS>' tokens
        decoder_input = torch.empty(
            batch_size, 1, dtype=torch.long, device=device
        ).fill_(SOS_TOKEN_IDX)

        # decoder's initial hidden state = encoder's final hidden state
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )

            decoder_outputs.append(decoder_output)

            # add weights to the 'attention' list
            attentions.append(attn_weights)

            if target_tensor is not None:
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions

    def forward_step(self, input, hidden, encoder_outputs):
        embedded = self.dropout(self.embedding(input))

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights

In [36]:
hidden_size = 128
batch_size = 64

# Build a dataloader
train_dataloader = get_dataloader(
    train_df,
    batch_size=batch_size,
    input_vocab=english_vocab,
    output_vocab=dutch_vocab,
    in_column="EN",
    out_column="NL",
)

encoder = EncoderRNN(len(english_vocab), hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, len(dutch_vocab)).to(device)

100%|██████████| 60000/60000 [00:10<00:00, 5925.75it/s]


### Task 3. Training

In [37]:
# Train the models. The run below uses 5 epochs; try 10 if time allows

train(train_dataloader, encoder, decoder, n_epochs=5)

100%|██████████| 938/938 [05:07<00:00,  3.05it/s]


Epoch: 0 loss: 0.7304370665092712


100%|██████████| 938/938 [05:07<00:00,  3.05it/s]


Epoch: 1 loss: 0.5120910929718505


100%|██████████| 938/938 [05:06<00:00,  3.06it/s]


Epoch: 2 loss: 0.4609920818414261


100%|██████████| 938/938 [05:06<00:00,  3.06it/s]


Epoch: 3 loss: 0.424088426299695


100%|██████████| 938/938 [05:06<00:00,  3.06it/s]

Epoch: 4 loss: 0.3939473174973083





In [None]:
torch.save(encoder.state_dict(), "encoder.pt")
torch.save(decoder.state_dict(), "decoder.pt")

### Task 4. Prediction
Build the `predict` function based on the example considered above. Then build a test dataloader and predict the translations.

In [38]:
def predict(
    encoder, decoder, dataloader, input_vocab=english_vocab, output_vocab=dutch_vocab
):
    # Look up the index-to-token list once instead of on every decoded token
    itos = output_vocab.get_itos()
    predictions = []
    with torch.no_grad():
        for data in tqdm(dataloader):
            input_tensor, _ = data
            input_tensor = input_tensor.to(device)

            encoder_outputs, encoder_hidden = encoder(input_tensor)
            decoder_outputs, decoder_hidden, decoder_attn = decoder(
                encoder_outputs, encoder_hidden
            )

            # Greedy decoding: take the most likely token at every step
            _, topi = decoder_outputs.topk(1)
            # Squeeze only the last dim so a batch of size 1 keeps its batch axis
            decoded_ids = topi.squeeze(-1)

            for sentence in decoded_ids:
                decoded_words = []
                for idx in sentence:
                    # Stop at the end-of-sentence token
                    if idx.item() == EOS_TOKEN_IDX:
                        break
                    decoded_words.append(itos[idx.item()])
                predictions.append(" ".join(decoded_words))
    return predictions

In [39]:
test_dataloader = get_dataloader(
    test_df,
    batch_size=batch_size,
    input_vocab=english_vocab,
    output_vocab=dutch_vocab,
    in_column="EN",
    out_column="NL",
)

100%|██████████| 20000/20000 [00:02<00:00, 7031.03it/s]


In [40]:
encoder.eval()
decoder.eval()

predictions = predict(encoder, decoder, test_dataloader)

100%|██████████| 313/313 [16:26<00:00,  3.15s/it]


In [41]:
test_df["prediction"] = predictions
# test_df[["id", "prediction"]].to_csv("submission.csv", index=False)

In [42]:
test_df

| Unnamed: 0 | EN | NL | prediction |
|---|---|---|---|
| 0 | It's functioning with 1, 2 or maximum 3 coins. | (funktioniert mit 1, 2 oder maximal 3 gleichen... | es ist ein neues jahr und die zimmer |
| 1 | Get the latest version of the Tunngle software. | Hier erhältst du die aktuelle Tunngle Software... | sie konnen die liedtexte von der datei mainwin |
| 2 | The debate is closed. | Die Aussprache wird beendet. | die aussprache ist geschlossen |
| 3 | Vicar at the Immaculate Conception Parish. | Ernennung zum Pfarrvikar in der Pfarrei der Un... | `<UNK>` |
| 4 | Commissioner, I have stressed the timetable. | Frau Kommissarin, die Festlegung der Fristen w... | herr kommissar ich habe die berichterstatterin |
| ... | ... | ... | ... |
| 19995 | Such models are very costly and hard to control. | Solche Modelle sind außerordentlich kostspieli... | wir sind sehr gut und nicht mehr als auch in d... |
| 19996 | That concludes the vote. | Die Abstimmung ist geschlossen. | die abstimmung findet morgen statt |
| 19997 | It is in our interest. | Dies liegt in unserem Interesse. | es ist uns in der tat |
| 19998 | You can do it. | Und sie sind in der Lage, dies zu tun. | sie konnen sie nicht |


Most probably the results are not very impressive. The target language (German in this dataset, although the column is labelled `NL`) has a large and complicated vocabulary, and models like this need to be trained on a much bigger corpus and for much longer. However, it is good practice for putting the knowledge into action.

### Evaluation
There are special metrics for evaluating NMT systems. One of them is ROUGE-1, which compares the predicted and reference translations word by word: it measures how many unigrams (single words) the two have in common.
 
You can find more [here](https://medium.com/trusted-data-science-haleon/metrics-for-evaluation-of-translation-accuracy-5d0bacd647ca).
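
As a rough illustration of the word-level comparison described above, here is a minimal sketch of ROUGE-1 for a single sentence pair. The function name `rouge_1` is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions; established ROUGE implementations may tokenize and normalize text differently, so treat the numbers as indicative only.

```python
from collections import Counter


def rouge_1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 for one sentence pair.

    Assumes lowercased, whitespace-separated tokens (a simplification).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: a word counts at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Prediction and (normalized) reference taken from row 2 of the table above
scores = rouge_1("die aussprache ist geschlossen", "die aussprache wird beendet")
print(scores)
```

Here two of the four predicted words ("die", "aussprache") also appear in the four-word reference, so precision, recall and F1 all equal 0.5. To score the whole test set, one option is to average these values over all rows, after applying the same normalization to the references as to the predictions.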

## Conclusion

In this NLP lesson, we covered sequence-to-sequence (Seq2Seq) models and their application to neural machine translation (NMT). We explored the foundational concepts and architectures behind modern translation systems, focusing on the following key themes:
- We began with an introduction to Seq2Seq models, which are designed to convert sequences from one domain (e.g., English sentences) to another domain (e.g., Portuguese sentences). These models are fundamental to tasks where the output is a sequence of variable length, such as translation, summarization, and dialogue generation.
- We examined the encoder-decoder architecture, where the encoder processes the input sequence and converts it into a context vector (a fixed-size representation). The decoder then generates the output sequence from this context vector. This architecture forms the backbone of many NMT systems.
- We discussed the concept of teacher forcing, a technique used during training to improve convergence. By providing the actual target output to the decoder at each step, we help the model learn more effectively.
- A significant enhancement to the basic Seq2Seq model is the attention mechanism. We learned how attention allows the decoder to focus on different parts of the input sequence at each step of the output generation, addressing the limitations of the fixed-size context vector and improving translation quality, especially for longer sentences.
- We implemented a practical example of translating English sentences to Portuguese using a Seq2Seq model with an attention mechanism. We covered the entire pipeline, including data preprocessing, building vocabulary, designing encoder and decoder classes, and training the model.
- We created custom classes for the encoder and decoder. The encoder class processes the input sequence and outputs the context vector, while the decoder class uses this context vector to generate the output sequence. We also incorporated the attention mechanism within the decoder to enhance performance.
- Finally, we applied our knowledge to a practical task: translating English sentences to German. This task included implementing attention within the decoder, training the model, and generating predictions on a test set.

By mastering these techniques, you are now equipped to develop more sophisticated and accurate translation models.