# Exercise 7: Recurrent Neural Networks (RNNs)

**Note**: Please insert the names of all participating students:
1. 
2. 
3. 

## Preamble
The following code downloads and imports all necessary files and modules into the virtual machine of Colab. Please make sure to execute it before solving this exercise. This mandatory preamble will be found on all exercise sheets.

In [None]:
import sys, os
if 'google.colab' in sys.modules:
  if os.getcwd() == '/content':
    !git clone 'https://github.com/inb-uni-luebeck/cs4405.git'
    os.chdir('cs4405')

#check whether the data is already unzipped and unzip it if necessary
if not os.path.isfile('data/ptb.train.txt'):
    !unzip data/data_7.zip -d data/

import numpy as np
from matplotlib import pyplot as plt
from utils import utils_7 as utils

## Language modelling
A _language model_ takes an unfinished sequence of characters (in character-level language models) or words (in word-level language models) as input and predicts the next character or word token in the sequence. This approach is called self-supervised because the labels (the upcoming word, given the string up to that point) are part of the training samples, so no extra labeling is required.
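
To make the relationship between inputs and labels concrete, here is a minimal sketch that lists the (context, next word) pairs contained in a single tokenized sentence. The helper name `next_word_pairs` and the example sentence are illustrative assumptions and are not part of the exercise code.

```python
def next_word_pairs(tokens):
    """Return (context, next_token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# hypothetical example sentence, already split into word tokens
sentence = ["the", "market", "fell", "<eos>"]
for context, target in next_word_pairs(sentence):
    print(context, "->", target)
```

Every sentence of length T therefore yields T-1 training examples without any manual labeling.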

In this exercise we will train a word-level language model on the PTB dataset using RNNs.

## RNNs in PyTorch
PyTorch provides efficient implementations of a number of recurrent architectures, including the most prominent ones: Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs).

While it is possible to manually build recurrent neural network architectures from basic tensor operations, it is usually much slower due to additional overhead and the sequential nature of RNNs. 

### The Penn Treebank (PTB) dataset
The PTB dataset consists of sentences from newspaper articles. Numerals are replaced by the capital letter "N", and all words outside the 10,000 most frequent ones are replaced with the "`<unk>`" token. This way all words are guaranteed to be frequent enough to allow a language model to generalize.
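
The data files used here are already preprocessed, so nothing of this kind has to be implemented. Purely as an illustration of the idea, the following sketch restricts a toy corpus to its most frequent words; the function `restrict_vocabulary` and the corpus are hypothetical and do not reflect how PTB was actually produced.

```python
from collections import Counter

def restrict_vocabulary(sentences, max_words):
    """Keep the max_words most frequent tokens and map all others to '<unk>'."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {token for token, _ in counts.most_common(max_words)}
    return [[token if token in kept else "<unk>" for token in sentence]
            for sentence in sentences]

# hypothetical toy corpus in which numerals were already replaced by "N"
corpus = [["stocks", "fell", "N", "points"], ["stocks", "rose", "N", "points"]]
print(restrict_vocabulary(corpus, max_words=3))
```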

In [None]:
#Loading the PTB dataset
print('Loading dataset...')
    
ptb_train = utils.PTB("data/ptb.train.txt")#training set
ptb_valid = utils.PTB("data/ptb.valid.txt")#validation set
ptb_test = utils.PTB("data/ptb.test.txt")#test set

#determine the full set of words
word_set = ptb_train.word_set.union(ptb_valid.word_set, ptb_test.word_set)

dictionary = {"<padding>": 0}#dictionary to form network inputs from words
inv_dictionary = {0: "<padding>"}#inverse dictionary to retrieve the actual words from network outputs
#assign a dictionary index to every word
for i, word in enumerate(word_set):
    dictionary[word] = i+1
    inv_dictionary[i+1] = word

ptb_train.encode_sentences(dictionary)
ptb_valid.encode_sentences(dictionary)
ptb_test.encode_sentences(dictionary)

print("done")

## Exercise 7.1: Padding of sequences
RNNs are often used for sequence data like audio waves, texts in the form of character sequences or word sequences, or any type of time series. 

For good computational performance it's important to process a batch of samples at once rather than each sample individually. However, since each sample represents a sequence and sequences can vary in length, it's not possible to just stack them to form a batch tensor, so we need to apply padding to the sequences to make them the same length. 

The `PTB` class instantiated above is a `Dataset` object. It handles loading of single samples from the dataset but is not responsible for padding, batching or shuffling; that is the job of the `DataLoader`.

The `DataLoader` selects sample indices for a batch, retrieves the corresponding samples from the dataset and calls the collate function (passed as the `collate_fn` argument) to combine the list of samples into a batch.

__Task__:
 - Implement the collate function "`pad_and_batch_sequences`" for the dataloaders below.

__Programming Hints__:
 - Use [`torch.nn.utils.rnn.pad_sequence`](https://pytorch.org/docs/stable/nn.html#pad-sequence) for padding and use a padding value of `0` as defined in the dictionary above.
 - Depending on your choice, `pad_sequence` uses either the first dimension as batch dimension and the second dimension as position index within the sequences, or vice versa. For convenience we recommend sticking to the default `batch_first=False`, but keep in mind that the batch index is then the __second__ index of the `batch` tensor, not the first.
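
The effect of `batch_first` on the shape of the padded tensor can be checked on toy data. The following is a minimal sketch with made-up index values; it only demonstrates `pad_sequence` itself and leaves sorting and length bookkeeping to the task above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# two toy sequences of word indices with different lengths (made-up values)
seq_a = torch.tensor([7, 3, 9])
seq_b = torch.tensor([4, 8])

padded = pad_sequence([seq_a, seq_b], padding_value=0)  # default: batch_first=False
print(padded.shape)   # (longest sequence length, batch size)
print(padded[:, 1])   # second sequence, filled up with the padding value

padded_bf = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=0)
print(padded_bf.shape)  # (batch size, longest sequence length)
```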

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

batch_size = 32

# our custom collate function
# input: list of sequences, expected output: batch of padded sequences
def pad_and_batch_sequences(sequences):
    #the "sequences" parameter is a list of 1-dimensional tensors (vectors) of different sizes
    #each tensor represents a sequence, or more specifically, a sentence
    #each tensor element is a number: the index of a word in the dictionary
    
    # TODO: sort the list of sequences by DESCENDING length (required by pack_padded_sequence in PyTorch 1.0; later versions accept enforce_sorted=False, but the sanity check below expects sorted order)
    sequences
    
    # TODO: pad the sequences and combine them into a batch
    batch = 
    
    # TODO: calculate the original lengths of the sequences and store them in a tensor
    lengths = 
    
    return batch, lengths

train_loader = torch.utils.data.DataLoader(
                 dataset=ptb_train,
                 batch_size=batch_size,
                 shuffle=True,
                 collate_fn=pad_and_batch_sequences)

valid_loader = torch.utils.data.DataLoader(
                dataset=ptb_valid,
                batch_size=batch_size,
                shuffle=False,
                collate_fn=pad_and_batch_sequences)

test_loader = torch.utils.data.DataLoader(
                dataset=ptb_test,
                batch_size=batch_size,
                shuffle=False,
                collate_fn=pad_and_batch_sequences)

#sanity check to test your implementation for errors
sanity_check_sample1 = torch.tensor([1, 5])
sanity_check_sample2 = torch.tensor([2, 4, 0, 3])
sanity_check_batch = torch.tensor([[2, 1], [4, 5], [0, 0], [3, 0]])
sanity_check_lengths = torch.tensor([4, 2])

if (pad_and_batch_sequences([sanity_check_sample1, sanity_check_sample2])[0] != sanity_check_batch).any():
  print('Sanity check failed for output "batch".')
  print('Expected:\n', sanity_check_batch)
  print('Got:\n', pad_and_batch_sequences([sanity_check_sample1, sanity_check_sample2])[0])
elif (pad_and_batch_sequences([sanity_check_sample1, sanity_check_sample2])[1] != sanity_check_lengths).any():
  print('Sanity check failed for output "lengths".')
  print('Expected:\n', sanity_check_lengths)
  print('Got:\n', pad_and_batch_sequences([sanity_check_sample1, sanity_check_sample2])[1])
else:
  print('Sanity checks passed.')

## Exercise 7.2: Building the model
Let's start with a simple model with just a recurrent module and a linear layer to generate an output score for each dictionary entry. 

For the input of the network it has proven advantageous to represent every word by a vector rather than a single number. The values of the vectors can start out randomly and are then learned and optimized through backpropagation during training. Such a representation is called an embedding. The embedding vectors for all words in the dictionary are stored in an embedding matrix, which is part of the model. PyTorch provides an embedding module with a convenient lookup implementation for this purpose.

`Packing` allows you to pack the padded batch tensor tightly and to let the recurrent module know how long each of the sequences really is, so it doesn't need to calculate the outputs for the padding as well. PyTorch provides methods to create a `PackedSequence` from a padded tensor and vice versa.
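
What packing does to a padded tensor can be inspected in isolation. The sketch below uses a hand-written toy batch (assumed values, one feature per time step) and only performs the round trip from padded tensor to `PackedSequence` and back, without any embedding or recurrent layer.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# padded toy batch of shape (T=3, N=2, features=1); the second sequence has length 2
padded = torch.tensor([[[1.0], [4.0]],
                       [[2.0], [5.0]],
                       [[3.0], [0.0]]])
lengths = torch.tensor([3, 2])  # sorted by descending length

packed = pack_padded_sequence(padded, lengths)
print(packed.data.shape)    # only the real (non-padding) time steps are stored
print(packed.batch_sizes)   # number of still-active sequences per time step

unpacked, unpacked_lengths = pad_packed_sequence(packed)
print(torch.equal(unpacked, padded), unpacked_lengths)
```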

__Tasks__:
 - Add an embedding module, a recurrent layer and a linear layer to the model.
 - For each batch replace the word indices with the corresponding word embeddings.
 - Transform the batch tensor into a PackedSequence, run it through the recurrent layer and unpack its output, then apply the linear layer.

__Preparation__:
 - Choose a type of recurrent layer from [the PyTorch documentation](https://pytorch.org/docs/stable/nn.html#recurrent-layers). We recommend `LSTM` or `GRU`.
 - Get familiar with [`torch.nn.utils.rnn.pack_padded_sequence`](https://pytorch.org/docs/stable/nn.html#pack-padded-sequence) to create `PackedSequence` objects. Let the recurrent module work on `PackedSequence` objects rather than Tensors.
 
__Programming Hints__:
 - Use the PyTorch module [`torch.nn.Embedding`](https://pytorch.org/docs/stable/nn.html#embedding) for the word embeddings. An instance of `torch.nn.Embedding` takes a tensor of indices and returns a tensor where every entry is replaced with the corresponding embedding-vector (so the result tensor also has one additional dimension of size `embedding_dim`). 
 - Let the recurrent module have `2` layers.
 - The recurrent module can either take a non-packed `Tensor` and return its outputs and final hidden state as `Tensor` objects, or take a `PackedSequence` in which case the outputs object will also be a `PackedSequence`. Use the [`torch.nn.utils.rnn.pack_padded_sequence`](https://pytorch.org/docs/stable/nn.html#pack-padded-sequence) method to pack the padded batch before giving it to the recurrent layer and use [`torch.nn.utils.rnn.pad_packed_sequence`](https://pytorch.org/docs/stable/nn.html#pad-packed-sequence) to unpack the outputs of the recurrent layer.
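
The shape behaviour of `torch.nn.Embedding` described in the hints can be verified on a small example. This is a minimal sketch with toy sizes chosen for illustration only; they are much smaller than the values used in the exercise.

```python
import torch

# hypothetical toy sizes: 10 dictionary entries, 4-dimensional embedding vectors
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

indices = torch.tensor([[2, 1], [4, 5], [3, 0]])  # word indices, shape (T=3, N=2)
vectors = embedding(indices)
print(vectors.shape)  # same leading dimensions plus one of size embedding_dim

# each index selects the corresponding row of the embedding matrix
print(torch.equal(vectors[0, 0], embedding.weight[2]))
```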

In [None]:
import torch.nn.functional as F

use_gpu = torch.cuda.is_available()

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        
        embedding_dim = 1000
        hidden_size = 1000
        dict_size = len(dictionary)
        
        # TODO: instantiate an Embedding module
        self.embedding = 
        
        # TODO: give the model a recurrent module with 2 layers
        self.rnn = 
        
        # TODO: create a linear layer to generate the output that scores each entry of the dictionary
        self.fc = 
        
    def forward(self, batch, lengths):
        # TODO: generate the tensor of embedding vectors from the tensor of word indices by applying the embedding module
        batch_embedded = 
        
        # TODO: pack the sequences tightly
        batch_embedded_packed = 
        
        # TODO: apply the RNN, note that it returns not just the outputs but also the hidden states
        outputs, _ = 
        
        # TODO: unpack the results by transforming them into a padded tensor again (pad_packed_sequence)
        outputs_padded, _ = 
        
        # TODO: apply the linear layer
        predictions = 
        
        return predictions
###
# Hyperparameters:
###
    
#The starting learning rate
lr=0.007

#The factor by which the learning rate is decreased after each epoch
lr_decay=0.6

#The smallest value the learning rate can decay to
lr_min=5e-4

#The strength of the L2 regularization term. The Adam implementation adds it to the gradient, so it is not actual weight decay; the name is kept because both are equivalent for the vanilla SGD optimizer
weight_decay = 8e-6

#sanity check to test your implementation for errors
sanity_check_model = MyModel()
sanity_check_sample1 = torch.tensor([12, 8])
sanity_check_sample2 = torch.tensor([2, 411, 90, 31])
sanity_check_batch, sanity_check_lengths = pad_and_batch_sequences([sanity_check_sample1, sanity_check_sample2])
sanity_check_output_size = (max(sanity_check_sample1.shape[0], sanity_check_sample2.shape[0]), 2, len(dictionary))
with torch.no_grad():
  if tuple(sanity_check_model(sanity_check_batch, sanity_check_lengths).size()) != sanity_check_output_size:
    print('Sanity check failed for model output size.')
    print('Expected:\n', sanity_check_output_size)
    print('Got:\n', tuple(sanity_check_model(sanity_check_batch, sanity_check_lengths).size()))
  else:
    print('Sanity check passed.')

## Exercise 7.3: Training the model
For each sample sequence we can use the first word to predict the second, use the first two words to predict the third, use the first three words to predict the fourth, and so on. Each prediction is a classification problem where the number of classes is the dictionary size. The elements of the input sequence are the labels (except for the first element, since there is no input to predict it) and the outputs are the predicted classes (except for the last output, which goes beyond the last label). 

For each predicted next word we can calculate the cross entropy loss, using the actual next word as the classification label. The total loss for a batch is then the average loss over all the predictions of a sequence and over all the sequences of a batch. 

__Task__:
 - Implement the `train` and `evaluate` functions. Use the `model` to get the predictions. Extract the tensor of classification labels (`targets`) from the batch. Use the targets, the predictions and the `loss_func` to calculate the batch loss.

__Preparation__:
 - Think carefully about which part of the input sequences and the network outputs can be used for the loss calculation and how they align.

__Programming Hints__:
 - Remember that the batch index is the second index and the sequence index is the first index (unless you chose to use `batch_first=True` above). So `batch` and `predictions` both have shape (T, N, C) by default, where T is the length of the longest sequence (sentence) in the batch, N is the number of samples (sequences) in the batch (=batch_size) and C is the number of classes (number of words in the dictionary).
 - The loss function is set up to ignore predictions where the label is the padding value (see `ignore_index` further down), so the loss function can take padded tensors.
 - The loss function expects predictions of shape (N, C), where N is the number of predictions and C is the number of classes, and targets of shape (N). So the dimension for the sequence index and the dimension for the batch index need to be combined into N. You can use `Tensor.flatten(start_dim, end_dim)` or `Tensor.reshape(*shape)`.
 - `Perplexity` is a commonly used measure of a language model's performance. It is the exponential of the average cross entropy loss and, roughly speaking, tells us how many words the model considers as candidates per prediction (on average). So for a completely untrained model that guesses roughly uniformly over a dictionary of 10,000 words we would expect a perplexity of about 10,000, and for a perfect model we would expect a perplexity of 1.
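
The two extreme cases of perplexity mentioned above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming perplexity is computed as the exponential of the mean negative log-probability assigned to the true next words (which is what applying `exp` to the cross entropy loss amounts to); the helper `perplexity` is hypothetical.

```python
import math

def perplexity(target_probabilities):
    """exp of the mean negative log-probability assigned to the true next words."""
    nll = [-math.log(p) for p in target_probabilities]
    return math.exp(sum(nll) / len(nll))

dict_size = 10000
uniform = perplexity([1.0 / dict_size] * 5)  # model that guesses uniformly
perfect = perplexity([1.0] * 5)              # model that is always certain and correct
print(uniform, perfect)
```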

In [None]:
def train(model, dataloader, use_gpu, optimizer, loss_func):
    model.train()
    for i, (batch, lengths) in enumerate(dataloader):
        if use_gpu:
            #move the batch to gpu memory
            batch = batch.cuda()
            
        optimizer.zero_grad()
        
        # TODO: get the predictions
        out = 
        
        # TODO: get the part of the batch that should be used as labels
        targets = 

        #sanity check
        assert targets.numel() == max(0, batch.shape[0] - 1) * batch.shape[1], "Sanity check failed for size of 'targets'. (remove check/switch indices if you use batch_first=True)"
        
        # TODO: calculate the loss using loss_func (from the aligned predictions and targets)
        loss = 

        #back propagation
        loss.backward()
        
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)

        #learning step
        optimizer.step()
        
        if (i % 100 == 0 and use_gpu) or (i % 5 == 0 and not use_gpu):
            perplexity = loss.exp()
            print("training perplexity:", perplexity)
            
def evaluate(model, dataloader, use_gpu, optimizer, loss_func):
    model.eval()
    losses = []
    with torch.no_grad():
        for i, (batch, lengths) in enumerate(dataloader):
            if use_gpu:
                #move the batch to gpu memory
                batch = batch.cuda()
            
            # TODO: get the predictions
            out = 
            
            # TODO: get the part of the batch that should be used as labels
            targets = 
            
            # TODO: calculate the loss
            loss = 

            losses.append(loss)

        perplexity = torch.stack(losses, dim=0).mean().exp()

        print('evaluation perplexity:', perplexity)

Let's train the model. If everything is set up correctly, the perplexity should go well below 1000 within the first epoch.

__Note__: The model starts overfitting after a few epochs. 

In [None]:
model = MyModel()
if use_gpu:
    model = model.cuda()

optimizer = torch.optim.Adam([
    {"params": model.parameters(), "weight_decay": weight_decay},
    ], lr=lr)

# 0 is the <padding> word index as defined in dictionary["<padding>"] = 0 and conversely inv_dictionary[0] = "<padding>" above
loss_func = torch.nn.CrossEntropyLoss(ignore_index=0)#ignore targets in the padding section (label=0)

epochs = 4
for epoch in range(epochs):
    print("epoch " + str(epoch))
    
    train(model, train_loader, use_gpu, optimizer, loss_func)
    
    evaluate(model, valid_loader, use_gpu, optimizer, loss_func)
    
    lr *= lr_decay
    if lr < lr_min:
        lr = lr_min

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        
    torch.save((model.state_dict(), dictionary), "model" + str(epoch) + ".pt")
        
print("Test set perplexity:")
evaluate(model, test_loader, use_gpu, optimizer, loss_func)

### Trying the language model
You can use the textbox below to get predictions from your model. Use `backspace` while the textbox is empty to remove word tokens.

In [None]:
load_weights = False

if not load_weights:
    #to test the model from this notebook:
    trained_model, original_dictionary, original_inv_dictionary = model, dictionary, inv_dictionary
else:
    #to test a model loaded from stored weights (the training loop above saves "model<epoch>.pt", e.g. "model3.pt" after 4 epochs):
    model_state_dict, original_dictionary = torch.load("model3.pt", map_location="cpu")
    original_inv_dictionary = {v: k for k, v in original_dictionary.items()}
    model.load_state_dict(model_state_dict)
    trained_model = model
    
batch_first = False #False unless you explicitly specified batch_first=True in the packing, padding and RNN functions and implemented the loss calculation accordingly

if use_gpu:
    trained_model = trained_model.cuda()

def predict_func(sentence):#sentence as list of word strings
    #max number of additional words to predict
    max_len = 50
    for _ in range(max_len):
        #create a tensor from the words' dictionary indices
        input_sentence = torch.tensor([original_dictionary[word] for word in sentence])
        lengths = torch.tensor([input_sentence.size()[0]])
        
        #introduce the singular batch dimension
        if batch_first:
            input_sentence = input_sentence.unsqueeze(dim=0)
        else:
            input_sentence = input_sentence.unsqueeze(dim=1)
            
        if use_gpu:
            input_sentence = input_sentence.cuda()
        
        #use the language model to predict the most likely next word
        trained_model.eval()
        with torch.no_grad():
            out = trained_model(input_sentence, lengths)
        
        #ignore predictions of the placeholder for rare words '<unk>' (dictionary limited to the 10k most common words in the dataset)
        #the score is set to -inf because a score of 0 could still win the argmax if all other scores are negative
        if batch_first:
            out[0, -1, original_dictionary['<unk>']] = float('-inf')
            out = out.argmax(-1)[0,-1]
        else:
            out[-1, 0, original_dictionary['<unk>']] = float('-inf')
            out = out.argmax(-1)[-1,0]
        
        #get the string representation of the predicted word
        out_word = original_inv_dictionary[out.item()]
        
        #append word to sentence word list
        sentence.append(out_word)
        
        #stop predictions if end-of-sentence was predicted
        if out_word == '<eos>': break
            
    #return the sentence as a single string
    return ' '.join(sentence)

#create the input field with autocomplete for the available dictionary (autocomplete only shows suggestions if the number of matching suggestions is not too large)
if 'google.colab' in sys.modules:
  from google.colab import output
  output.register_callback('predict_func', predict_func)
utils.make_prediction_field(original_dictionary, "predict_func")