This notebook was inspired by neural network & machine learning labs led by [GMUM](https://gmum.net/).

See also the PyTorch tutorial [NLP From Scratch: Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) (which we'll mostly be following this week) and Lilian Weng's [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html).

# Seq2seq and attention
Today we'll be teaching a neural network to translate from Polish to English. We'll be using [sequence to sequence learning](https://arxiv.org/abs/1409.3215), in which two recurrent neural networks (an *encoder* and a *decoder*) work together to transform one sequence into another.
![layer based](figures/seq2seq.png)

<center>Source: <a href="https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html">NLP From Scratch: Translation with a Sequence to Sequence Network and Attention</a>.</center>

Later we'll also use an [attention mechanism](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) to improve upon our results.

In [None]:
from io import open
import unicodedata
import string
import re
import random

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

First we'll need to download the data for today.

In [None]:
!wget https://www.manythings.org/anki/pol-eng.zip
!unzip pol-eng.zip

We'll define the class `Lang` to help us manage our data. Each word in a language will have its own separate index, as well as a count of how often it shows up (this will later help us replace rare words). Additionally we'll define three special indices:
* 0 for the start-of-sequence token (SOS),
* 1 for the end-of-sequence token (EOS),
* 2 for padding (PAD), used to bring all sequences in a batch to the same length, which enables GPU parallelization.

In [None]:
SOS_token = 0
EOS_token = 1
PAD_token = 2

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS", 2: "PAD"}
        self.n_words = 3 # Count SOS, EOS and PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
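
To make the indexing scheme concrete, here is a minimal standalone sketch of the same bookkeeping. The helper `build_vocab` and the `SPECIAL_TOKENS` list are hypothetical names introduced only for this illustration; the notebook itself uses the `Lang` class.

```python
from collections import Counter

SPECIAL_TOKENS = ["SOS", "EOS", "PAD"]  # occupy indices 0, 1, 2, as in the notebook


def build_vocab(sentences):
    """Hypothetical helper: give every new word the next free index."""
    word2index = {}
    word2count = Counter()
    for sentence in sentences:
        for word in sentence.split(' '):
            if word not in word2index:
                # Regular words start right after the special tokens
                word2index[word] = len(SPECIAL_TOKENS) + len(word2index)
            word2count[word] += 1
    return word2index, word2count


word2index, word2count = build_vocab(["i am here", "i am not"])
print(word2index)       # {'i': 3, 'am': 4, 'here': 5, 'not': 6}
print(word2count["i"])  # 2
```

Words are numbered in order of first appearance, so the first regular word always gets index 3.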

To simplify, we will convert Unicode characters to ASCII, make everything lowercase, and trim most punctuation.

In [None]:
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = s.replace("ł", "l")
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
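
The explicit replacement of `ł` may look redundant, so here is a small sketch of why it is there. NFD decomposition splits most Polish letters into a base letter plus a combining mark, but `ł` has no such decomposition. The function name `strip_marks` is an assumption made for this example; it mirrors what `unicodeToAscii` does.

```python
import unicodedata


def strip_marks(s):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


for ch in "ąćęńóśźżł":
    # Every letter except 'ł' loses its diacritic
    print(ch, '->', strip_marks(ch))
```

Because `ł` survives this step unchanged, it would otherwise be replaced by a space in the final regular expression, splitting words in two.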

To read the data file, we split it into lines and then split each line into a pair of sentences.

In [None]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open(f'{lang1}.txt', encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

To simplify the problem, we'll remove sentence pairs in which either sentence is longer than 20 words.

In [None]:
MAX_LENGTH = 20

def filterPair(p):
    return len(p[0].split(' ')) <= MAX_LENGTH and \
        len(p[1].split(' ')) <= MAX_LENGTH


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

The full process for preparing the data is:

- read text file and split into lines, split lines into pairs,
- normalize text, filter by length,
- make word lists from sentences in pairs.

In [None]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print(f"Read {len(pairs)} sentence pairs")
    pairs = filterPairs(pairs)
    print(f"Trimmed to {len(pairs)} sentence pairs")
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepareData('pol', 'eng', True)
for _ in range(3):
    print(random.choice(pairs))

We also need some additional functions to prepare the data.

In [None]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)


def pad_sequences(data_batch):
    pl_batch, en_batch = [], []
    for pl_sentence, en_sentence in data_batch:
        pl_batch += [pl_sentence]
        en_batch += [en_sentence]
    pl_batch = pad_sequence(pl_batch, padding_value=PAD_token, batch_first=True)
    en_batch = pad_sequence(en_batch, padding_value=PAD_token, batch_first=True)
    return pl_batch, en_batch


def prepare_dataset(batch_size):
    rng = np.random.RandomState(567)
    indices = np.arange(len(pairs))
    rng.shuffle(indices)
    train_indices = indices[:int(len(pairs) * 0.8)]
    test_indices = indices[int(len(pairs) * 0.8):]
    train_pairs = list(pairs[idx] for idx in train_indices)
    test_pairs = list(pairs[idx] for idx in test_indices)
    tensor_train_pairs = [tensorsFromPair(pairs[idx]) for idx in train_indices]
    tensor_test_pairs = [tensorsFromPair(pairs[idx]) for idx in test_indices]

    train_loader = DataLoader(tensor_train_pairs, batch_size=batch_size, shuffle=True, collate_fn=pad_sequences)
    test_loader = DataLoader(tensor_test_pairs, batch_size=batch_size, shuffle=True, collate_fn=pad_sequences)
    return train_pairs, test_pairs, train_loader, test_loader
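
As a quick illustration of what the collate function produces, the sketch below pads two index sequences of different lengths. The token indices are made up for the example; only `PAD = 2` and `EOS = 1` follow the notebook's conventions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 2  # same value as PAD_token in the notebook
EOS = 1  # same value as EOS_token in the notebook

short = torch.tensor([5, 6, EOS])
long = torch.tensor([7, 8, 9, 10, EOS])

# batch_first=True gives a tensor of shape (batch_size, max_seq_len)
batch = pad_sequence([short, long], padding_value=PAD, batch_first=True)
print(batch.shape)  # torch.Size([2, 5])
print(batch)
# tensor([[ 5,  6,  1,  2,  2],
#         [ 7,  8,  9, 10,  1]])
```

The shorter sentence is filled with `PAD` after its `EOS` token, so every batch is a rectangular tensor.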

## Seq2seq

The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence. Unlike sequence prediction with a single RNN, where every input corresponds to an output, the seq2seq model frees us from sequence length and order, which makes it ideal for translation between two languages.

![layer based](figures/seq2seq_chainer.png)<center>Source: <a href="https://docs.chainer.org/en/stable/examples/seq2seq.html">Write a Sequence to Sequence (seq2seq) Model</a>.</center>

### Teacher forcing

*Teacher forcing* is the concept of using the real target outputs as each next input, instead of using the decoder’s guess as the next input. This can help the network converge faster, but the trained network may be unstable once it has to rely on its own predictions. Thanks to PyTorch's autograd engine we'll be able to randomly choose whether to use teacher forcing or not with a simple if-statement (`teacher_forcing_ratio` is the probability that a given batch is trained with teacher forcing).
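
The difference between the two modes comes down to which token is fed to the decoder at the next step. The sketch below isolates just that choice; `next_decoder_input` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import torch


def next_decoder_input(step_logits, targets, t):
    """Hypothetical helper: pick the token fed to the decoder at step t + 1.

    step_logits: (batch_size, vocab_size) scores produced at step t.
    targets: (batch_size, seq_len) ground-truth indices, or None.
    """
    if targets is not None:
        # Teacher forcing: feed the ground-truth token
        return targets[:, t]
    # No teacher forcing: feed the decoder's own most likely token
    return step_logits.argmax(dim=-1)


logits = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1]])
targets = torch.tensor([[2, 1],
                        [0, 1]])
print(next_decoder_input(logits, targets, 0))  # tensor([2, 0])
print(next_decoder_input(logits, None, 0))     # tensor([1, 0])
```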

Some additional helper functions.

In [None]:
def predict(encoder, decoder, inputs, targets=None, max_len=MAX_LENGTH):
    batch_size = inputs.size(0)

    encoder_outputs, encoder_hidden = encoder(inputs)

    decoder_input = torch.tensor([[SOS_token]] * batch_size, device=device)
    decoder_hidden = encoder_hidden
    decoder_output, decoder_attention = decoder(
        decoder_input,
        decoder_hidden,
        targets=targets,
        max_len=max_len,
        encoder_outputs=encoder_outputs)
    return decoder_output, decoder_attention

def translate(encoder, decoder, sentence):
    inputs = tensorFromSentence(input_lang, sentence).unsqueeze(0).to(device)
    decoder_output, decoder_attention = predict(encoder, decoder, inputs)

    decoded_words = []
    for word in decoder_output[0]:
        top_word = word.argmax(-1).item()
        decoded_words.append(output_lang.index2word[top_word])
        if top_word == EOS_token:
            break

    if decoder_attention is not None:
        # [out_words, in_words]
        att = decoder_attention.cpu().detach().numpy()
        att = att[0, :len(decoded_words), :]
        fig, ax = plt.subplots()

        ax.imshow(att)
        ax.set_xticklabels([''] + sentence.split(' ') + ['EOS'], rotation=90)
        ax.set_yticklabels([''] + decoded_words)


        ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
        ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

        plt.show()

    return decoded_words
        
def translate_randomly(encoder, decoder, pairs, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words = translate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Training loop.

In [None]:
def train(encoder, decoder, lr=0.01, batch_size=256, teacher_forcing_ratio=0.5, n_epochs=100):

    train_pairs, test_pairs, train_loader, test_loader = prepare_dataset(batch_size)

    criterion = nn.CrossEntropyLoss(ignore_index=PAD_token)

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=lr)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=lr)

    encoder.to(device)
    decoder.to(device)

    for epoch in range(n_epochs + 1):

        epoch_train_loss = 0.
        for in_batch, out_batch in train_loader:
            in_batch, out_batch = in_batch.to(device), out_batch.to(device)

            encoder_optimizer.zero_grad()
            decoder_optimizer.zero_grad()
        
            teacher_inputs = out_batch if random.random() < teacher_forcing_ratio else None
        
            decoder_output, decoded_attention = predict(
                encoder, decoder, in_batch,
                targets=teacher_inputs,
                max_len=out_batch.size(1)
            )
        
            loss = criterion(decoder_output.transpose(1, 2), out_batch)
            loss.backward()
        
            encoder_optimizer.step()
            decoder_optimizer.step()

            epoch_train_loss += loss.item()

        if epoch % 25 == 0:
            with torch.no_grad():
                print("=" * 25, "Translation test", "=" * 25)
                translate_randomly(encoder, decoder, test_pairs, n=5)

        mean_train_loss = epoch_train_loss / len(train_loader)
        print(f"Epoch: {epoch+1}. Train loss: {mean_train_loss}")
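
Two details of the loss computation are easy to miss: the `transpose(1, 2)` call and the `ignore_index` argument. The sketch below uses random logits and made-up target indices to show that `nn.CrossEntropyLoss` expects the class dimension second, and that padded positions do not contribute to the mean.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD = 2  # same value as PAD_token in the notebook
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

torch.manual_seed(0)
logits = torch.randn(2, 4, 6)             # (batch_size, seq_len, vocab_size)
targets = torch.tensor([[3, 1, PAD, PAD],
                        [4, 5, 3, 1]])    # (batch_size, seq_len)

# CrossEntropyLoss wants (batch_size, vocab_size, seq_len), hence the transpose
loss = criterion(logits.transpose(1, 2), targets)

# The same value computed by hand over the non-padding positions only
mask = targets != PAD
manual = F.cross_entropy(logits[mask], targets[mask])
print(torch.allclose(loss, manual))  # True
```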

Below is a simple seq2seq encoder. For now we only use its last hidden state (often called a *context vector*, as it encodes the context of the entire sentence), which becomes the initial hidden state of the decoder.

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, input):
        embedded = self.embedding(input)
        output, hidden = self.lstm(embedded)
        return output, hidden
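
To see what the encoder returns, the sketch below runs a standalone `nn.LSTM` on random data. The sizes are arbitrary values chosen for the example. It shows the shapes of the per-step outputs and of the final state, and checks that for a single-layer, unidirectional LSTM the output at the last time step equals the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embedding_size, hidden_size = 3, 5, 8, 4
lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)

embedded = torch.randn(batch_size, seq_len, embedding_size)
# The second return value is a tuple: final hidden state and final cell state
output, (h_n, c_n) = lstm(embedded)

print(output.shape)  # torch.Size([3, 5, 4])
print(h_n.shape)     # torch.Size([1, 3, 4])
print(c_n.shape)     # torch.Size([1, 3, 4])
print(torch.allclose(output[:, -1], h_n[0]))  # True
```

Note that `hidden` in `EncoderRNN.forward` is this `(h_n, c_n)` tuple, so a decoder built on `nn.LSTM` can take it as its initial state directly.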

## Task 1 (1p)
Implement the decoder network for the seq2seq model. 

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-sequence (SOS) token, and the first hidden state is the context vector (the encoder’s last hidden state).

The decoder is supposed to return two variables:
- `output`: a tensor of shape `(batch_size, seq_len, vocab_size)` representing the logits, which after applying a softmax (done outside of the decoder) will represent the probabilities of different words predicted by the decoder,
- `attention_weights`: in this task always `None`.

Some remarks:
* Use `batch_first=True` when defining the RNN.
* In the encoder we could call the LSTM class once, as we already had access to all of the words we wanted to translate. This is not the case for the decoder, as the input for time $t+1$ is the output for time $t$ (hence you'll probably need a for-loop).

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(DecoderRNN, self).__init__()
        ???

    def forward(self, input, hidden, targets=None, max_len=None, encoder_outputs=None):
        if targets is not None:  # teacher forcing
            ???
        else:
            ???
        return output, None

The cell below will train a model with your implementation.

In [None]:
hidden_size = 128
embedding_size = 256
encoder = EncoderRNN(input_lang.n_words, embedding_size, hidden_size).to(device)
decoder = DecoderRNN(output_lang.n_words, embedding_size, hidden_size).to(device)

train(encoder, decoder, lr=0.005, n_epochs=100)

## Attention

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence. Attention is a mechanism that allows the network to focus on a different part of the encoder's outputs for every step of the decoder's outputs. 

Rather than building a single context vector out of the encoder's last hidden state, the main idea of attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.

![](figures/An-attention-based-seq2seq-model.ppm)

In the simpler decoder, at timestep $t$ the input was the embedded representation $\mathbf{\bar{y}_t}$. In a decoder with attention the input will be a concatenation of that vector and a vector $\mathbf{z_t}$ created from the outputs of the encoder: $\mathbf{\tilde{h}_t} = [\mathbf{\bar{y}_t}, \mathbf{z_t}]$. 

The vector $\mathbf{z_t}$ is produced by the attention mechanism. Intuitively, we would like it to contain the information from the encoder which will be the most important for decoding a given word. 
Assume we have access to a *score function* $\mathtt{score}(\mathbf{h}, \mathbf{e})$, which can tell us how similar the hidden state of the decoder $\mathbf{h}$ and the word representation $\mathbf{e}$ are.

Then $w_i = \frac{ \exp(\mathtt{score}(\mathbf{h}, \mathbf{e_i})) }{\sum_{j} \exp(\mathtt{score}(\mathbf{h}, \mathbf{e_j}))}$ and 
$\mathbf{z_t} = \sum_i w_i \cdot \mathbf{e_i}$.
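
The two formulas above can be sketched for a whole batch with a few tensor operations. As an assumption made only for this illustration, the score function here is a plain dot product between $\mathbf{h}$ and each $\mathbf{e_i}$; any other score function plugs into the same softmax and weighted sum.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, enc_len, hidden_size = 2, 5, 4
h = torch.randn(batch_size, hidden_size)           # decoder hidden state
e = torch.randn(batch_size, enc_len, hidden_size)  # encoder outputs e_i

# Assumed score function: dot product between h and every e_i
scores = torch.bmm(e, h.unsqueeze(2)).squeeze(2)   # (batch_size, enc_len)
w = F.softmax(scores, dim=-1)                      # weights w_i
z = torch.bmm(w.unsqueeze(1), e).squeeze(1)        # (batch_size, hidden_size)

print(w.sum(dim=-1))  # every row sums to 1, up to rounding
print(z.shape)        # torch.Size([2, 4])
```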

## Task 2 (1p)
Implement the decoder network for the seq2seq model utilizing the attention mechanism.

The decoder:
- receives an additional tensor `encoder_outputs` of shape `(batch_size, encoder_seq_len, hidden_size)` (these are the representations $\mathbf{e_i}$, which you need to use in the attention mechanism).
- outputs an additional tensor `attention_weights` of shape `(batch_size, decoder_seq_len, encoder_seq_len)` containing the weights $w_i$ (the values of this tensor should sum to one along the last dimension).

The scoring function $\mathtt{score}(\mathbf{h}, \mathbf{e})$ is going to be a neural network with two linear layers: the first maps the concatenation of $\mathbf{h}$ and $\mathbf{e}$ (of size `hidden_size+hidden_size`) to `hidden_size` and is followed by the `tanh` activation function, and the second maps `hidden_size` to `1`.

In [None]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(AttnDecoderRNN, self).__init__()
        ???

    def forward(self, input, hidden, targets=None, max_len=None, encoder_outputs=None):

        ???

        return output, seq_att_weights

The cell below will train a model with your implementation.

In [None]:
hidden_size = 128
embedding_size = 256
encoder = EncoderRNN(input_lang.n_words, embedding_size, hidden_size).to(device)
decoder = AttnDecoderRNN(output_lang.n_words, embedding_size, hidden_size).to(device)

train(encoder, decoder, lr=0.005, n_epochs=100)