# Exercise 5. Recurrent neural networks

## Part 1. Training a translation model one sequence at a time

## Learning goals of part 1

* to get familiar with recurrent neural networks used for sequential data processing
* to get familiar with the sequence-to-sequence model for machine translation

You may find it useful to look at this tutorial:
* [Translation with a Sequence to Sequence Network and Attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

In [1]:
skip_training = True  # Set this flag to True before validation and submission

In [2]:
# During evaluation, this cell sets skip_training to True
# skip_training = True

In [3]:
# Select data directory
import os
if os.path.isdir('/coursedata'):
    course_data_dir = '/coursedata'
elif os.path.isdir('../data'):
    course_data_dir = '../data'
else:
    # Specify course_data_dir on your machine
    # course_data_dir = ...
    # YOUR CODE HERE
    raise NotImplementedError()

print('The data directory is %s' % course_data_dir)

The data directory is ../data


In [4]:
import os
import random
import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

In [5]:
# Select the device for training (use GPU if you have one)
device = torch.device('cuda:0')
# device = torch.device('cpu')

In [6]:
if skip_training:
    # The models are always evaluated on CPU
    device = torch.device("cpu")

## Data

The dataset that we are going to use consists of pairs of sentences in French and English.

In [7]:
from data import TranslationDataset, MAX_LENGTH, SOS_token, EOS_token

In [8]:
data_dir = os.path.join(course_data_dir, 'translation_data')
trainset = TranslationDataset(path=data_dir, train=True)

* `TranslationDataset` supports indexing as required by `torch.utils.data.Dataset`
* Sentences are tensors of maximum length `MAX_LENGTH`
* Words in a (sentence) tensor are represented as an index (integer) in a language vocabulary
* The string representation of a word from the input language can be obtained from index `i` with `dataset.input_lang.index2word[i]`
* Similarly for the output language `dataset.output_lang.index2word[j]`

 Let us look at samples from that dataset.

In [9]:
input_sentence, output_sentence = trainset[np.random.choice(len(trainset))]
print('Input sentence: "%s"' % ' '.join(trainset.input_lang.index2word[i.item()] for i in input_sentence))
print('Sentence as tensor of word indices:')
print(input_sentence)

print('\nOutput sentence: "%s"' % ' '.join(trainset.output_lang.index2word[i.item()] for i in output_sentence))
print('Sentence as tensor of word indices:')
print(output_sentence)

Input sentence: "je suis sure que tout ira bien . EOS"
Sentence as tensor of word indices:
tensor([[   6],
        [  11],
        [  52],
        [ 914],
        [  35],
        [3927],
        [   8],
        [   5],
        [   1]])

Output sentence: "i m sure everything will be fine . EOS"
Sentence as tensor of word indices:
tensor([[   2],
        [   3],
        [  33],
        [1783],
        [1681],
        [1226],
        [  23],
        [   4],
        [   1]])


In [10]:
print('Number of input-output pairs in the training set: ', len(trainset))

Number of input-output pairs in the training set:  8682


## Sequence-to-sequence model for machine translation

In this exercise, we are going to build a machine translation system which transforms a sentence in one language into a sentence in another one. The computational graph of the translation model is shown below:

<img src="seq2seq.png" width=900 style="float: left;">

We are going to use a simplified model without the dotted connections.

## Encoder

The encoder encodes an input sequence $(x_1, x_2, ..., x_T)$ into a single vector $h_T$ using the following recursion:
$$
  h_{t} = f(h_{t-1}, x_t) \qquad t = 1, \ldots, T
$$
where:
* intial state $h_0$ is often chosen arbitrarily (we choose it to be zero)
* function $f$ is defined by the type of the RNN cell (in our experiments, we will use [GRU](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU))
* $x_t$ is a vector that represents the $t$-th word in the input sentence.

A common practice in natural language processing is to _learn_ the word representations $x_t$ (instead of, for example, using one-hot coded vectors). In PyTorch, this is supported by class [Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) which we are going to use.

The computational graph of the encoder is shown below:

<img src="seq2seq_encoder.png" width=500 style="float: left;">

Let us implement the encoder whose `forward` function can processes _one input sequence at a time_.

Your task is to implement the `forward` function which should:
* embed the words in the input sequence (convert words' indexes into vectors using `self.embedding`)
* perform GRU computations by feeding the embedded words and the given state of the GRU cell (`hidden`)

In [11]:
class Encoder(nn.Module):
    def __init__(self, dictionary_size, hidden_size):
        """
        Args:
          dictionary_size (int): Size of dictionary in the source language.
          hidden_size (int): Size of the hidden state.
        """
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(dictionary_size, hidden_size)
        self.gru = nn.GRU(input_size=hidden_size, hidden_size=hidden_size)

    def forward(self, input_seq, hidden):
        """
        Args:
          input_seq (tensor):  Tensor of words (word indices) of the input sentence. The shape is
                               [seq_length, batch_size] with batch_size = 1.
          hidden (tensor):    The state of the GRU (shape [1, batch_size, hidden_size] with batch_size=1).

        Returns:
          output (tensor): Output of the GRU (shape [seq_length, 1, hidden_size]).
          hidden (tensor): New state of the GRU (shape [1, batch_size, hidden_size] with batch_size=1).
        """
        batch_size = input_seq.size(1)
        assert batch_size == 1, "Encoder can process only one sequence at a time."
        # YOUR CODE HERE
        embedded = self.embedding(input_seq)
        outputs = embedded
        outputs, hidden = self.gru(outputs, hidden)
        return outputs, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [12]:
# Let's test your code
hidden_size = 20
test_encoder = Encoder(dictionary_size=10, hidden_size=hidden_size).to(device)

hidden = test_encoder.init_hidden()
input_seq = torch.tensor([1, 2, 3, 4], device=device).view(4, 1)  # reshape to (seq_length, 1)
outputs, hidden = test_encoder.forward(input_seq, hidden)
assert outputs.shape == torch.Size([4, 1, hidden_size]), \
    "Bad shape of outputs: outputs.shape={}, expected={}".format(outputs.shape, torch.Size([4, 1, hidden_size]))
assert hidden.shape == torch.Size([1, 1, hidden_size]), \
    "Bad shape of outputs: hidden.shape={}, expected={}".format(hidden.shape, torch.Size([1, 1, hidden_size]))

print('The shapes seem to be ok.')

The shapes seem to be ok.


## Decoder

The decoder takes as input the representation computed by the encoder and transforms it into a sentence in the target language. The computational graph of the decoder is shown below:

<img src="seq2seq_decoder.png" width=500 align="top">

Notes:
* $z_0$ is the output of the encoder, that is $z_0 = h_5$, thus `hidden_size` of the decoder should be the same as `hidden_size` of the encoder.
* $y_{i}$ are the log-probabilities of the words in the output language, the dimensionality of $y_{i}$ is the size of the output (target) dictionary.
* $z_{i}$ is mapped to $y_{i}$ using a linear layer followed by `F.log_softmax` (because we use `nn.NLLLoss` loss for training).
* Each cell of the decoder is a GRU, it receives as inputs the previous state $z_{i-1}$ and relu of the **embedding** of the previous word. Thus, you need to embed the words of the output language as well. The previous word is taken as the word with the maximum log-probability.

Note that the decoder outputs a word at every step and the same word is used as the input to the recurrent unit at the next step. At the beginning of decoding, the previous word input is fed with a special word SOS which stands for "start of a sentence". During training, we know the target sentence for decoding, therefore we can feed the correct words $y_i$ as inputs to the recurrent unit.

There is one thing that it is wise to take care of. When the target sentence is fed to the decoder during training, the decoder learns to generate only the next word (this scenario is called "teacher forcing" in the literature). In test time, the decoder works differently: It generates the whole sequence using its own predictions as inputs at each step. Therefore, it makes sense to train the decoder to produce full sentences. In order to do that, we will alternate between two modes during training:
* "teacher forcing": the decoder is fed with the words in the target sequence
* no "teacher forcing": the decoder generates the output sequence using its own predictions. We will limit the maximum length of generated sequences to `MAX_LENGTH`.

In the code below, your task is to implement the decoder which has the structure shown in the figure above.

In [13]:
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_dictionary_size):
        """
        Args:
          hidden_size (int): Size of the hidden state.
          output_dictionary_size (int): Size of dictionary in the target language.
        """
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size

        # YOUR CODE HERE
        self.embedding = nn.Embedding(output_dictionary_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_dictionary_size)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, hidden, target_seq=None, teacher_forcing=False):
        """
        Args:
          hidden (tensor):        The state of the GRU (shape [1, batch_size, hidden_size] with batch_size=1).
          target_seq (tensor):    Tensor of words (word indices) of the target sentence. The shape is
                                   [target_seq_length, batch_size] with batch_size=1. If None, the output sequence
                                   is generated by feeding the decoder's outputs (teacher_forcing has to be False).
          teacher_forcing (bool): Whether to use teacher forcing or not.

        Returns:
          outputs (tensor): Tensor of log-probabilities of words in the output language
                             (shape [output_seq_length, batch_size, output_dictionary_size] with batch_size=1).
          hidden (tensor):  New state of the GRU (shape [1, batch_size, hidden_size] with batch_size=1).
        """
        if target_seq is None:
            assert not teacher_forcing, 'Cannot use teacher forcing without a target sequence.'

        prev_word = torch.tensor([SOS_token], device=device, dtype=torch.int64)
        out_length = target_seq.size(0) if target_seq is not None else MAX_LENGTH
        outputs = []  # Collect decoder outputs at different processing steps in this list
        for t in range(out_length):
            # YOUR CODE HERE
            output = self.embedding(prev_word).view(1, 1, -1)
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
            output = self.softmax(self.out(output))
            outputs.append(output)

            if teacher_forcing:
                # Feed the target as the next input
                prev_word = target_seq[t]  # Teacher forcing
            else:
                # Use its own predictions as the next input
                topv, topi = output[0, :].topk(1)
                prev_word = topi.squeeze().detach()  # detach from history as input

                if prev_word.item() == EOS_token:
                    break

        outputs = torch.cat(outputs, dim=0)  # [max_length, batch_size, output_dictionary_size]

        return outputs, hidden 

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [14]:
# Let's test the shapes
hidden_size = 20
output_dictionary_size = 10
test_decoder = Decoder(hidden_size, output_dictionary_size).to(device)

hidden = test_decoder.init_hidden()
target_seq = torch.tensor([1, 2, 3, 4], device=device).view(4, 1)  # reshape to (seq_length, 1)

outputs, hidden = test_decoder.forward(hidden, target_seq, teacher_forcing=False)
assert outputs.size(0) <= 4, "Too long output sequence: outputs.size(0)={}".format(outputs.size(0))
assert outputs.shape[1:] == torch.Size([1, output_dictionary_size]), \
    "Bad shape of outputs: outputs.shape[1:]={}, expected={}".format(outputs.shape[1:], torch.Size([1, output_dictionary_size]))
assert hidden.shape == torch.Size([1, 1, hidden_size]), \
    "Bad shape of hidden: hidden.shape={}, expected={}".format(hidden.shape, torch.Size([1, 1, hidden_size]))

outputs, hidden = test_decoder.forward(hidden, target_seq, teacher_forcing=True)
assert outputs.shape == torch.Size([4, 1, output_dictionary_size]), \
    "Bad shape of outputs: outputs.shape={}, expected={}".format(outputs.shape, torch.Size([4, 1, output_dictionary_size]))
assert hidden.shape == torch.Size([1, 1, hidden_size]), \
    "Bad shape of hidden: hidden.shape={}, expected={}".format(hidden.shape, torch.Size([1, 1, hidden_size]))

# Generation mode
outputs, hidden = test_decoder.forward(hidden, target_seq=None, teacher_forcing=False)
assert outputs.shape[1:] == torch.Size([1, output_dictionary_size]), \
    "Bad shape of outputs: outputs.shape[1:]={}, expected={}".format(outputs.shape[1:], torch.Size([1, output_dictionary_size]))
assert hidden.shape == torch.Size([1, 1, hidden_size]), \
    "Bad shape of hidden: hidden.shape={}, expected={}".format(hidden.shape, torch.Size([1, 1, hidden_size]))

print('The shapes seem to be ok.')

The shapes seem to be ok.


## Training a sequence-to-sequence model

Now we are going to train the sequence-to-sequence model on the toy translation dataset.

In [15]:
# Create encoder and decoder
hidden_size = 256
encoder = Encoder(trainset.input_lang.n_words, hidden_size).to(device)
decoder = Decoder(hidden_size, trainset.output_lang.n_words).to(device)

In [16]:
teacher_forcing_ratio = 0.5

encoder_optimizer = optim.SGD(encoder.parameters(), lr=0.01)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=0.01)
criterion = nn.NLLLoss(reduction='sum')

In [17]:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=1, shuffle=True)

In [18]:
n_epochs = 8

In the training loop below, we are going to process one pair of sequences at a time. Your task is to implement the input sequence encoding and decoding. Toggle `teacher_forcing` on and off during decoding according to the `teacher_forcing_ratio` specified above.

The loss computations are implemented already.

In [19]:
for epoch in range(n_epochs):
    running_loss = 0.0
    print_every = 100  # pairs
    for i, (input_seq, target_seq) in enumerate(trainloader):
        # We process one sequence at a time
        input_seq, target_seq = input_seq[0], target_seq[0]
        input_seq, target_seq = input_seq.to(device), target_seq.to(device)
        
        encoder_hidden = encoder.init_hidden()
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        #input_length = input_seq.size(0)
        target_length = target_seq.size(0)

        # YOUR CODE HERE
        encoder_output, encoder_hidden = encoder(input_seq, encoder_hidden)
        teacher_forcing = True if random.random() < teacher_forcing_ratio else False 
        decoder_outputs, decoder_hidden = decoder(encoder_hidden, target_seq, teacher_forcing)
       
        # Compute the loss
        # In case of no teacher forcing, the output sequence can be shorter than the target sequence
        # We need to take care of that
        output_length, _, output_dictionary_size = decoder_outputs.size()
        assert (output_length == target_length) or not teacher_forcing, \
            "In case of teacher forcing, output_length ({}) should be equal to target_length ({}).".format(
            output_length, target_length)
        loss = criterion(decoder_outputs.view(output_length, output_dictionary_size),
                         target_seq[:output_length].view(output_length))

        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        # print statistics
        running_loss += loss.item() / target_length
        if (i % print_every) == (print_every-1):
            print('[%d, %5d] loss: %.4f' % (epoch+1, i+1, running_loss/print_every))
            running_loss = 0.0
        
        if skip_training:
            break
    if skip_training:
        break

print('Finished Training')

[1,   100] loss: 4.1993
[1,   200] loss: 3.4586
[1,   300] loss: 3.6254
[1,   400] loss: 3.3925
[1,   500] loss: 3.4402
[1,   600] loss: 3.4454
[1,   700] loss: 3.3348
[1,   800] loss: 3.4003
[1,   900] loss: 3.4788
[1,  1000] loss: 3.2630
[1,  1100] loss: 3.2900
[1,  1200] loss: 3.1816
[1,  1300] loss: 3.2388
[1,  1400] loss: 3.0982
[1,  1500] loss: 3.0590
[1,  1600] loss: 3.2239
[1,  1700] loss: 3.0776
[1,  1800] loss: 3.0815
[1,  1900] loss: 2.9286
[1,  2000] loss: 3.0110
[1,  2100] loss: 2.8290
[1,  2200] loss: 2.9903
[1,  2300] loss: 2.8386
[1,  2400] loss: 2.9828
[1,  2500] loss: 2.9665
[1,  2600] loss: 2.9467
[1,  2700] loss: 3.0005
[1,  2800] loss: 2.7151
[1,  2900] loss: 3.0112
[1,  3000] loss: 2.6622
[1,  3100] loss: 2.9964
[1,  3200] loss: 2.8447
[1,  3300] loss: 2.9438
[1,  3400] loss: 2.7748
[1,  3500] loss: 2.7385
[1,  3600] loss: 2.6998
[1,  3700] loss: 2.7588
[1,  3800] loss: 2.6960
[1,  3900] loss: 2.7739
[1,  4000] loss: 2.8013
[1,  4100] loss: 2.7406
[1,  4200] loss:

[4,  8500] loss: 1.4612
[4,  8600] loss: 1.3029
[5,   100] loss: 1.1003
[5,   200] loss: 1.0368
[5,   300] loss: 0.9070
[5,   400] loss: 1.0057
[5,   500] loss: 0.9350
[5,   600] loss: 1.0687
[5,   700] loss: 1.0217
[5,   800] loss: 1.1872
[5,   900] loss: 1.1218
[5,  1000] loss: 1.0248
[5,  1100] loss: 1.1095
[5,  1200] loss: 1.0538
[5,  1300] loss: 1.0384
[5,  1400] loss: 0.9359
[5,  1500] loss: 1.1296
[5,  1600] loss: 1.1310
[5,  1700] loss: 1.1263
[5,  1800] loss: 1.1790
[5,  1900] loss: 0.9624
[5,  2000] loss: 1.1470
[5,  2100] loss: 1.0558
[5,  2200] loss: 0.9932
[5,  2300] loss: 1.1164
[5,  2400] loss: 1.0201
[5,  2500] loss: 1.1370
[5,  2600] loss: 1.0504
[5,  2700] loss: 1.2182
[5,  2800] loss: 1.1163
[5,  2900] loss: 1.1398
[5,  3000] loss: 1.0103
[5,  3100] loss: 1.1549
[5,  3200] loss: 1.2066
[5,  3300] loss: 1.1102
[5,  3400] loss: 1.0755
[5,  3500] loss: 1.1301
[5,  3600] loss: 1.2686
[5,  3700] loss: 1.2682
[5,  3800] loss: 1.2059
[5,  3900] loss: 1.1218
[5,  4000] loss:

[8,  8300] loss: 0.6157
[8,  8400] loss: 0.5523
[8,  8500] loss: 0.6452
[8,  8600] loss: 0.4667
Finished Training


If you do well, the running loss should reach 0.5-0.6.

Hint: The training procedure may take an hour on your laptop. You may first train the model for a few epochs, proceed to the following task and train the model longer later again.

In [20]:
# Save the model to disk, submit these files together with your notebook
encoder_filename = '5_encoder.pth'
decoder_filename = '5_decoder.pth'
if not skip_training:
    try:
        do_save = input('Do you want to save the model (type yes to confirm)? ').lower()
        if do_save == 'yes':
            torch.save(encoder.state_dict(), encoder_filename)
            torch.save(decoder.state_dict(), decoder_filename)
            print('Model saved to %s, %s.' % (encoder_filename, decoder_filename))
        else:
            print('Model not saved.')
    except:
        raise Exception('The notebook should be run or validated with skip_training=True.')
else:
    hidden_size = 256
    encoder = Encoder(trainset.input_lang.n_words, hidden_size)
    encoder.load_state_dict(torch.load(encoder_filename, map_location=lambda storage, loc: storage))
    print('Encoder loaded from %s.' % encoder_filename)
    encoder = encoder.to(device)
    encoder.eval()

    decoder = Decoder(hidden_size, trainset.output_lang.n_words)
    decoder.load_state_dict(torch.load(decoder_filename, map_location=lambda storage, loc: storage))
    print('Decoder loaded from %s.' % decoder_filename)
    decoder = decoder.to(device)
    decoder.eval()

Do you want to save the model (type yes to confirm)? yes
Model saved to 5_encoder.pth, 5_decoder.pth.


## Evaluate the trained model

Let us now test the trained model.

In [21]:
# Load the test set
testset = TranslationDataset(path=data_dir, train=False)

Your task is to write a function that takes an input sequence (which is a tensor of word indexes) and produces a sequence of outputs using the trained encoder and decoder.

In [22]:
def evaluate(encoder, decoder, input_seq):
    """Translate given sentence input_seq using trained encoder and decoder.
    
    Args:
      encoder (Encoder): Trained encoder.
      decoder (Decoder): Trained decoder.
      input_seq (tensor): Tensor of words (word indices) of the input sentence (shape [input_seq_length, 1]).
    
    Returns:
      output_seq (tensor): Tensor of words (word indices) of the output sentence (shape [output_seq_length, 1]).
    """
    # YOUR CODE HERE
    encoder_hidden = encoder.init_hidden()
    encoder_output, encoder_hidden = encoder(input_seq, encoder_hidden)
    
    decoder_out, decoder_hidden = decoder(encoder_hidden, encoder_output)
    val, indices = decoder_out.max(2)
    return indices


In [23]:
input_seq = torch.tensor([1, 2, 3, 4], device=device).view(4, 1)  # reshape to (seq_length, 1)
output_seq = evaluate(encoder, decoder, input_seq)
assert output_seq.shape[0] <= MAX_LENGTH, \
    "Too long output sequence: output_seq.shape[0]={}".format(output_seq.shape[0])
assert output_seq.shape[1:] == torch.Size([1]), \
    "Bad shape of output_seq: output_seq.shape[1:]={}, expected={}".format(output_seq.shape[1:], torch.Size([1]))
print('The shapes seem to be ok.')

The shapes seem to be ok.


Let us now evaluate random sentences from the training set and print the input, target, and output.

In [24]:
for i in range(5):
    input_sentence, target_sentence = trainset[np.random.choice(len(trainset))]
    print('>', ' '.join(trainset.input_lang.index2word[i.item()] for i in input_sentence))
    print('=', ' '.join(trainset.output_lang.index2word[i.item()] for i in target_sentence))
    output_sentence = evaluate(encoder, decoder, input_sentence.to(device)).view(-1).cpu().data.numpy()
    print('<', ' '.join(trainset.output_lang.index2word[i] for i in output_sentence))
    print('')

> j attends impatiemment ta lettre . EOS
= i am looking forward to your letter . EOS
< i am looking forward to your letter

> j ai super faim . EOS
= i m super hungry . EOS
< i m taking hungry . .

> je suis bien plus jeune que vous . EOS
= i m much younger than you . EOS
< i m much younger than you . EOS

> nous n allons pas sortir . EOS
= we re not going out . EOS
< we re not going out . EOS

> vous devenez rouge . EOS
= you re turning red . EOS
< you re turning thirty .



If you trained the model well enough, the model should memorize the training data well.

In [27]:
# Evaluate random sentences from the test set and print the input, target, and output
for i in range(5):
    input_sentence, target_sentence = testset[np.random.choice(len(testset))]
    print('>', ' '.join(testset.input_lang.index2word[i.item()] for i in input_sentence))
    print('=', ' '.join(testset.output_lang.index2word[i.item()] for i in target_sentence))
    output_sentence = evaluate(encoder, decoder, input_sentence.to(device)).view(-1).cpu().data.numpy()
    print('<', ' '.join(testset.output_lang.index2word[i] for i in output_sentence))
    print('')

> j ai la flemme de faire mes devoirs . EOS
= i m too lazy to do my homework . EOS
< i m afraid of my my my . . EOS

> je quitte la ville pour quelques jours . EOS
= i m leaving town for a few days . EOS
< i am leaving for for few days off .

> je ne lui suis pas lie . EOS
= i am not acquainted with him . EOS
< i m not good . EOS

> nous ne sommes pas des barbares . EOS
= we re not barbarians . EOS
< we re not not . EOS

> vous etes vraiment embetante . EOS
= you re really annoying . EOS
< you re really annoying . EOS



A well-trained model should output sentences that look similar to the target ones. The mistakes are usually done for words that were rare in the training set.