# Exercise 7: Recurrent Neural Networks (RNNs)

## RNNs in PyTorch
PyTorch provides efficient implementations of several recurrent architectures, including the two most prominent ones: Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs).

While it is possible to build recurrent neural network architectures from basic tensor operations, doing so is usually much slower because of the additional per-step overhead and the sequential nature of RNNs.

### Language modelling
Language modelling is a self-supervised task: the labels come from the text itself. The model receives a partial sequence of characters (in character-level language models) or words (in word-level language models) as input and is expected to predict the next character or word token in the sequence.
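
To make the prediction task concrete, here is a minimal sketch in plain Python that lists the (context, next word) pairs a word-level language model would be trained on for one sentence. The helper name `next_word_pairs` and the toy sentence are made up for illustration and are not part of the exercise code.

```python
# Hypothetical helper: enumerate the (context, next word) pairs that a
# word-level language model sees for a single tokenized sentence.
def next_word_pairs(tokens):
    """Return a list of (context, target) pairs for one sentence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Toy sentence; "<eos>" marks the end of the sentence.
sentence = "the cat sat on the mat <eos>".split()
for context, target in next_word_pairs(sentence):
    print(" ".join(context), "->", target)
```

Note that the first word never appears as a target, because there is no context to predict it from.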

In this exercise we will train a word-level language model on the PTB dataset using RNNs.

### The Penn Tree Bank (PTB) dataset
The PTB dataset consists of sentences from newspaper articles. Numerals are replaced by the capital letter "N", and any word outside the 10,000 most frequent ones is replaced with the `<unk>` token. This way all words are guaranteed to be frequent enough to allow a language model to generalize.
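
The following sketch illustrates the idea behind this kind of vocabulary restriction on toy data. It is not the actual PTB preprocessing pipeline: the function names, the numeral test, and the toy sentences are assumptions made for this example.

```python
from collections import Counter

def is_number(token):
    """Assumed numeral test: accepts tokens such as '3', '2.5' or '1,200'."""
    return token.replace(",", "").replace(".", "").isdigit()

def restrict_vocabulary(sentences, vocab_size):
    """Replace numerals with 'N' and words outside the top vocab_size with '<unk>'."""
    normalized = [["N" if is_number(t) else t for t in s] for s in sentences]
    counts = Counter(t for s in normalized for t in s)
    keep = {word for word, _ in counts.most_common(vocab_size)}
    return [[t if t in keep else "<unk>" for t in s] for s in normalized]

toy = [["the", "price", "rose", "3.5", "percent"],
       ["the", "price", "fell", "12", "percent"]]
print(restrict_vocabulary(toy, vocab_size=4))
```

With a vocabulary of four entries, the words that occur only once ("rose" and "fell") are mapped to `<unk>`, while both numerals become "N".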

In [None]:
#Loading the PTB dataset
from utils.utils_7 import PTB

print('Loading dataset...')
    
ptb_train = PTB("data/ptb.train.txt")#training set
ptb_valid = PTB("data/ptb.valid.txt")#validation set
ptb_test = PTB("data/ptb.test.txt")#test set

#determine the full set of words
word_set = ptb_train.word_set.union(ptb_valid.word_set, ptb_test.word_set)

dictionary = {"<padding>": 0}#dictionary to form network inputs from words
inv_dictionary = {0: "<padding>"}#inverse dictionary to retrieve the actual words from network outputs
#assign a dictionary index to every word
for i, word in enumerate(word_set):
    dictionary[word] = i+1
    inv_dictionary[i+1] = word

ptb_train.encode_sentences(dictionary)
ptb_valid.encode_sentences(dictionary)
ptb_test.encode_sentences(dictionary)

print("done")

## Exercise 7.1: Padding of sequences
RNNs are often used for sequence data like audio waves, texts in the form of character sequences or word sequences, or any type of time series. 

Invoking the same operations multiple times (once per sequence rather than once per batch of sequences) introduces a lot of overhead. But since sequences can vary in length, it's not possible to just stack them to form a batch tensor, so we need to pad the sequences to make them the same length. 

The PTB class instantiated above is a Dataset object. It handles loading of single samples from the dataset but is not responsible for padding, batching or shuffling - that's the job of the `DataLoader`.

The `DataLoader` selects sample indices for a batch, retrieves the corresponding samples from the dataset and calls the `collate` function to combine the list of samples into a batch. Implement the collate function for the dataloaders below.

__Programming Hints__:
 - Use `torch.nn.utils.rnn.pad_sequence` for padding and use a padding value of `0` as defined in the dictionary above.
 - Depending on its `batch_first` argument, `pad_sequence` returns a tensor whose first dimension is the sequence position and whose second dimension is the batch index (the default), or vice versa. This will be important to remember further down.
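
As a small illustration of the two layouts mentioned in the hints, the sketch below pads three toy sequences with `pad_sequence`. The tensors are made-up examples, not samples from the PTB dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy "sentences" of word indices with lengths 4, 2 and 1.
sequences = [torch.tensor([4, 7, 2, 3]), torch.tensor([5, 1]), torch.tensor([9])]

# Default layout: (longest sequence length, batch size).
time_major = pad_sequence(sequences, padding_value=0)
# Alternative layout: (batch size, longest sequence length).
batch_major = pad_sequence(sequences, batch_first=True, padding_value=0)

# Original lengths, needed later to tell real tokens from padding.
lengths = torch.tensor([len(s) for s in sequences])

print(time_major.shape)   # torch.Size([4, 3])
print(batch_major.shape)  # torch.Size([3, 4])
print(batch_major[1])     # tensor([5, 1, 0, 0])
```

Whichever layout you pick, the same choice has to be made consistently for packing, the recurrent module, and the loss calculation.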

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

batch_size = 32

def pad_and_batch_sequences(sequences):
    #sequences is a list of 1-dimensional tensors of different size
    #each tensor represents a sentence
    #each element of a tensor is the index of a word in the dictionary
    
    # TODO: sort the list of sequences by DESCENDING length (required by pack_padded_sequence in PyTorch 1.0; from PyTorch 1.1 on, passing enforce_sorted=False makes this unnecessary)
    sequences...
    
    # TODO: pad the sequences and combine them into a batch
    batch = 
    
    # TODO: calculate the original lengths of the sequences and store them in a tensor
    lengths = 
    
    return batch, lengths

train_loader = torch.utils.data.DataLoader(
                 dataset=ptb_train,
                 batch_size=batch_size,
                 shuffle=True,
                 collate_fn=pad_and_batch_sequences)

valid_loader = torch.utils.data.DataLoader(
                dataset=ptb_valid,
                batch_size=batch_size,
                shuffle=False,
                collate_fn=pad_and_batch_sequences)

test_loader = torch.utils.data.DataLoader(
                dataset=ptb_test,
                batch_size=batch_size,
                shuffle=False,
                collate_fn=pad_and_batch_sequences)

## Exercise 7.2: Building the model
Let's start with a simple model with just a recurrent module and a linear layer to generate the output scores for the dictionary entries. 

It has proven advantageous to represent every entry of the dictionary by a vector, called an embedding. The values of the vectors can start out random and be optimized through backpropagation during training.

__Preparation__:
 - Choose a type of recurrent layer.
 - Get familiar with `torch.nn.utils.rnn.pack_padded_sequence` to create `PackedSequence` objects. Let the recurrent module work on `PackedSequence` objects rather than Tensors.
 
__Programming Hints__:
 - Use the PyTorch module `torch.nn.Embedding` for the word embeddings. An instance of `torch.nn.Embedding` takes a tensor of indices and returns a tensor where every entry is replaced with the corresponding embedding vector (so the result tensor has one additional dimension of size `embedding_dim`).
 - Let the recurrent module have `2` layers.
 - `Packing` allows you to pack the padded batch tensor tightly and to let the recurrent module know how long each of the sequences really is, so it doesn't need to calculate the outputs for the padding as well. The recurrent module can either take a `Tensor` and return its outputs and final hidden state as `Tensor` objects, or take a `PackedSequence`, in which case the outputs object will also be a `PackedSequence`.
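
The round trip from padded tensor to `PackedSequence` and back can be sketched on toy data as follows. The sizes, the choice of a GRU, and the toy batch are assumptions for illustration only; the exercise leaves the architecture up to you.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_size = 10, 8, 16

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
rnn = torch.nn.GRU(embedding_dim, hidden_size, num_layers=2)

# Padded batch in the default (sequence position, batch index) layout,
# sorted by descending length; 0 is the padding index.
batch = torch.tensor([[4, 5, 9],
                      [7, 1, 0],
                      [2, 0, 0],
                      [3, 0, 0]])
lengths = torch.tensor([4, 2, 1])  # must live on the CPU

embedded = embedding(batch)                       # shape (4, 3, 8)
packed = pack_padded_sequence(embedded, lengths)  # padding is dropped here
packed_out, hidden = rnn(packed)                  # output is a PackedSequence
outputs, out_lengths = pad_packed_sequence(packed_out)

print(outputs.shape)  # torch.Size([4, 3, 16])
print(hidden.shape)   # torch.Size([2, 3, 16])
```

After unpacking, the positions that were padding in the input are filled with zeros rather than with RNN outputs.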

In [None]:
import torch.nn.functional as F

use_gpu = torch.cuda.is_available()

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        
        embedding_dim = 1000
        hidden_size = 1000
        
        # TODO: instantiate an Embedding module
        self.embedding = 
        
        # TODO: give the model a recurrent module with 2 layers
        self.rnn = 
        
        # TODO: create a linear layer to generate the output that scores each entry of the dictionary
        self.fc = 
        
    def forward(self, batch, lengths):
        # TODO: generate the tensor of embedding vectors from the tensor of word indices
        batch_embedded = 
        
        # TODO: pack the sequences tightly
        batch_embedded_packed = 
        
        # TODO: apply the RNN
        outputs, _ = 
        
        # TODO: unpack the results by transforming them into a padded tensor again (pad_packed_sequence)
        outputs_padded, _ = 
        
        # TODO: apply the linear layer
        predictions = 
        
        return predictions
###
# Hyperparameters:
###
    
#The starting learning rate
lr=0.007

#The factor by which the learning rate is decreased after each epoch
lr_decay=0.6

#The smallest value the learning rate can decay to
lr_min=5e-4

#The strength of the L2-regularization term. In the Adam optimizer implementation this is not actual weight decay,
#but we keep the name since the two are equivalent in the vanilla SGD optimizer
weight_decay = 8e-6

## Exercise 7.3: Training the model
For each sample sequence we can use the first word to predict the second, use the first two words to predict the third, use the first three words to predict the fourth, and so on. Each prediction is a classification problem where the number of classes is the dictionary size. The elements of the input sequence are the labels (except for the first element, since there is no input to predict it) and the outputs are the predicted classes (except for the last output, which goes beyond the last label). 

For each predicted word we can calculate the cross entropy loss and average it over all the predictions of a sequence and over all the sequences of a batch. Implement the training and evaluation functions.
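
One possible way to align outputs and labels is sketched below on toy data, assuming the (sequence position, batch index) layout. The random `scores` tensor is a stand-in for real model outputs, and the toy batch is made up for this example.

```python
import torch

torch.manual_seed(0)
vocab_size = 6

# Padded batch of word indices, layout (sequence position, batch index).
# The second sequence has length 2 and is padded with 0.
batch = torch.tensor([[3, 4],
                      [5, 2],
                      [1, 0]])

# Stand-in for model outputs: one score per dictionary entry and position.
scores = torch.randn(batch.size(0), batch.size(1), vocab_size)

# The output at position t is compared with the input word at position t + 1.
predictions = scores[:-1]  # drop the last output, it has no label
targets = batch[1:]        # drop the first word, nothing predicts it

# CrossEntropyLoss expects (N, classes) scores and (N,) labels,
# so the sequence and batch dimensions are flattened together.
loss_func = torch.nn.CrossEntropyLoss(ignore_index=0)
loss = loss_func(predictions.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)
```

Because of `ignore_index=0`, the padded target in the second sequence does not contribute to the average.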

__Preparation__:
 - Think carefully about which part of the input sequences and the network outputs can be used for the loss calculation and how they align.

__Programming Hints__:
 - The loss function is set up to ignore predictions where the label is the padding value (see `ignore_index` further down), so the loss function can take padded tensors.
 - `Perplexity` is commonly used to evaluate a language model's performance. Roughly speaking, it tells us how many words the model considers as candidates per prediction (on average). So for a completely untrained model that scores all entries of a dictionary of 10,000 words about equally we would expect a perplexity of about 10,000, and for a perfect model we would expect a perplexity of 1.
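
The two reference values for perplexity can be checked with a few lines of plain Python. Perplexity is the exponential of the average cross entropy, i.e. of the mean negative log-probability assigned to the correct words. The helper name `perplexity` is made up for this sketch.

```python
import math

def perplexity(target_probabilities):
    """exp of the mean negative log-probability of the correct words."""
    nll = [-math.log(p) for p in target_probabilities]
    return math.exp(sum(nll) / len(nll))

vocab_size = 10_000

# A model that spreads its probability uniformly over the dictionary.
uniform = perplexity([1.0 / vocab_size] * 5)
# A model that always puts probability 1 on the correct word.
perfect = perplexity([1.0] * 5)

print(round(uniform), round(perfect))
```

This is the same relation used in the code below, where the perplexity is obtained by exponentiating the cross entropy loss.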

In [None]:
def train(model, dataloader, use_gpu, optimizer, loss_func):
    model.train()
    for i, (batch, lengths) in enumerate(dataloader):
        if use_gpu:
            #move the batch to gpu memory
            batch = batch.cuda()
            
        optimizer.zero_grad()
        
        # TODO: get the predictions
        out = 
        
        # TODO: get the part of the batch that should be used as labels
        targets = 
        
        # TODO: calculate the loss
        loss = 

        loss.backward()
        
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)

        optimizer.step()
        
        if (i % 100 == 0 and use_gpu) or (i % 5 == 0 and not use_gpu):
            perplexity = loss.exp()
            print(perplexity)
            
def evaluate(model, dataloader, use_gpu, optimizer, loss_func):
    model.eval()
    losses = []
    with torch.no_grad():
        for i, (batch, lengths) in enumerate(dataloader):
            if use_gpu:
                #move the batch to gpu memory
                batch = batch.cuda()
            
            # TODO: get the predictions
            out = 
            
            # TODO: get the part of the batch that should be used as labels
            targets = 
            
            # TODO: calculate the loss
            loss = 

            losses.append(loss)

        perplexity = torch.stack(losses, dim=0).mean().exp()

        print('evaluation perplexity:', perplexity)

Let's train the model. If everything is set up correctly, the perplexity should go well below 1000 within the first epoch.

__Note__: The model starts overfitting after a few epochs. 

In [None]:
model = MyModel()
if use_gpu:
    model = model.cuda()

optimizer = torch.optim.Adam([
    {"params": model.parameters(), "weight_decay": weight_decay},
    ], lr=lr)

loss_func = torch.nn.CrossEntropyLoss(ignore_index=0)#ignore targets in the padding section (label=0)

epochs = 4
for epoch in range(epochs):
    print("epoch " + str(epoch))
    
    train(model, train_loader, use_gpu, optimizer, loss_func)
    
    evaluate(model, valid_loader, use_gpu, optimizer, loss_func)
    
    lr *= lr_decay
    if lr < lr_min:
        lr = lr_min

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        
    torch.save((model.state_dict(), dictionary), "model" + str(epoch) + ".pt")
        
print("Test set perplexity:")
evaluate(model, test_loader, use_gpu, optimizer, loss_func)
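
The learning rate schedule used in the loop above (multiply by `lr_decay` after every epoch, but never go below `lr_min`) can be inspected in isolation. The helper `decayed_learning_rates` is a hypothetical function written for this sketch; it uses the hyperparameter values from this notebook but runs for 8 epochs instead of 4 so that the floor becomes visible.

```python
def decayed_learning_rates(lr, lr_decay, lr_min, epochs):
    """Learning rate used in each epoch under exponential decay with a floor."""
    rates = []
    for _ in range(epochs):
        rates.append(lr)
        lr = max(lr * lr_decay, lr_min)
    return rates

rates = decayed_learning_rates(lr=0.007, lr_decay=0.6, lr_min=5e-4, epochs=8)
for epoch, rate in enumerate(rates):
    print(epoch, f"{rate:.6f}")
```

With these values the floor of `5e-4` is only reached from the seventh epoch on, so it never takes effect in the 4 epochs trained above.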

### Trying the language model
Since this model takes a long time to train on a CPU, we have also provided a pretrained model.

You can use the textbox below to get predictions from either the pretrained model or your own. Press `backspace` while the textbox is empty to remove word tokens.

In [None]:
from utils.utils_7 import get_trained_model, make_prediction_field

use_pretrained = True

if use_pretrained:
    #to test the pretrained model:
    trained_model, original_dictionary, original_inv_dictionary = get_trained_model()
    batch_first = True #we used batch_first=True out of habit
else:
    load_weights = True
    
    if not load_weights:
        #to test the model from this notebook:
        trained_model, original_dictionary, original_inv_dictionary = model, dictionary, inv_dictionary
    else:
        #to test the model from this notebook from stored weights:
        model_state_dict, original_dictionary = torch.load("model3.pt", map_location="cpu")
        original_inv_dictionary = {v: k for k, v in original_dictionary.items()}
        model.load_state_dict(model_state_dict)
        trained_model = model
        
    batch_first = False #False unless you explicitly specified batch_first=True in the packing, padding and RNN functions and implemented the loss calculation accordingly

if use_gpu:
    trained_model = trained_model.cuda()

def predict_func(sentence):#sentence as list of word strings
    #max number of additional words to predict
    max_len = 50
    for _ in range(max_len):
        #create a tensor from the words' dictionary indices
        input_sentence = torch.tensor([original_dictionary[word] for word in sentence])
        
        #introduce the singular batch dimension
        if batch_first:
            input_sentence = input_sentence.unsqueeze(dim=0)
            lengths = torch.tensor([input_sentence.size()[1]])
        else:
            input_sentence = input_sentence.unsqueeze(dim=1)
            lengths = torch.tensor([input_sentence.size()[0]])
            
        if use_gpu:
            input_sentence = input_sentence.cuda()
        
        #use the language model to predict the most likely next word
        trained_model.eval()
        with torch.no_grad():
            out = trained_model(input_sentence, lengths)
        
        #exclude the placeholder for rare words from the prediction (dictionary limited to the 10k most common words in the dataset)
        #its score is set to -inf rather than 0, since 0 could still be the largest score if all others are negative
        if batch_first:
            out[0, -1, original_dictionary['<unk>']] = float('-inf')
            out = out.argmax(-1)[0,-1]
        else:
            out[-1, 0, original_dictionary['<unk>']] = float('-inf')
            out = out.argmax(-1)[-1,0]
        
        #get the string representation of the predicted word
        out_word = original_inv_dictionary[out.item()]
        
        #append word to sentence word list
        sentence.append(out_word)
        
        #stop predictions if end-of-sentence was predicted
        if out_word == '<eos>': break
            
    #return the sentence as a single string
    return ' '.join(sentence)

#create the input field with autocomplete for the available dictionary (autocomplete only shows suggestions if the number of matching suggestions is not too large)
make_prediction_field(original_dictionary, "predict_func")