# Seminar on recurrent neural networks
During this seminar, we will train an LSTM to solve a sentiment analysis task, i.e. to predict the sentiment label of a text.

Recurrent neural networks can process inputs of arbitrary length. In practice, however, it is usually much simpler to fix the sequence length (even in PyTorch with its dynamic graphs), so we will crop and pad the sequences to a fixed length.
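As a small illustration of fixing the sequence length, here is a minimal sketch that crops long sequences and left-pads short ones with zeros, which mirrors what load_matrix_imdb does further below. The helper name crop_and_left_pad and the toy token ids are assumptions made for this example only.

```python
import numpy as np

def crop_and_left_pad(seq, maxlen, pad_value=0):
    """Keep the first `maxlen` tokens and left-pad shorter sequences."""
    seq = list(seq)[:maxlen]
    padded = np.full(maxlen, pad_value, dtype=np.int64)
    if seq:  # guard: padded[-0:] would address the whole array
        padded[-len(seq):] = seq
    return padded

short = crop_and_left_pad([1, 14, 22], maxlen=5)
long = crop_and_left_pad([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short.tolist())  # [0, 0, 1, 14, 22]
print(long.tolist())   # [1, 14, 22, 16, 43]
```

Padding on the left keeps the real tokens at the end of the sequence, so the last hidden state of the RNN is computed right after the last real word.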

While completing this task, you will train an LSTM at different levels of "black box" (from using torch.nn.LSTM to implementing the layer yourself). You will also learn different ways of applying dropout to gated RNNs (RNNs offer more places to insert a binary dropout mask than feed-forward networks).

The task can be completed on a CPU, but you will feel more comfortable with a GPU. You may try using [https://colab.research.google.com](https://colab.research.google.com).

### Hyperparameters

In [1]:
vocab_size = 20000 
index_from = 3
n_hidden = 32 # 128
n_emb = 32 # 128
seq_len = 32 # 200
# small network on small data for seminar purposes;
# the values after '#' are the normal sizes

batch_size = 128
learning_rate = 0.001
num_epochs = 30

### Data loading
The function load_matrix_imdb downloads the data, preprocesses it and returns NumPy arrays.

If you don't have wget, please download the [imdb.npz archive](https://s3.amazonaws.com/text-datasets/imdb.npz) manually and put it into the current folder.

In [4]:
#from rnn_utils import load_matrix_imdb
import numpy as np
import torch
import torch.utils.data
import os
import re
from collections import defaultdict
import operator

In [11]:
def load_matrix_imdb(path='imdb.npz', num_words=None, skip_top=0,
              maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """
    Modified code from Keras
    Loads data matrices from an npz file, crops and pads the sequences and returns
    shuffled (x_train, y_train), (x_test, y_test)
    """
    if not os.path.exists(path):
        print("Downloading matrix data into current folder")
        os.system("wget https://s3.amazonaws.com/text-datasets/imdb.npz")
        
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    np.random.seed(seed)
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if not num_words:
        num_words = max([max(x) for x in xs])
    if not maxlen:
        maxlen = max([len(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    xs_new = []
    for x in xs:
        x = x[:maxlen] # crop long sequences
        if oov_char is not None: # replace rare or frequent symbols 
            x = [w if (skip_top <= w < num_words) else oov_char for w in x]
        else: # or filter rare and frequent symbols
            x = [w for w in x if skip_top <= w < num_words]
        x_padded = np.zeros(maxlen)#, dtype = 'int32')
        x_padded[-len(x):] = x
        xs_new.append(x_padded)    
            
    idx = len(x_train)
    x_train, y_train = np.array(xs_new[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs_new[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)

In [21]:
np.random.seed(0)
(X_train, y_train), (X_test, y_test) = load_matrix_imdb(num_words=vocab_size,
                                                        maxlen=seq_len)

In [13]:
set(y_train) # binary classification

{0, 1}

In [14]:
X_train.shape, X_test.shape

((25000, 32), (25000, 32))

In [15]:
X_train[0] # sequence of coded words

array([1.000e+00, 1.400e+01, 2.200e+01, 1.600e+01, 4.300e+01, 5.300e+02,
       9.730e+02, 1.622e+03, 1.385e+03, 6.500e+01, 4.580e+02, 4.468e+03,
       6.600e+01, 3.941e+03, 4.000e+00, 1.730e+02, 3.600e+01, 2.560e+02,
       5.000e+00, 2.500e+01, 1.000e+02, 4.300e+01, 8.380e+02, 1.120e+02,
       5.000e+01, 6.700e+02, 2.000e+00, 9.000e+00, 3.500e+01, 4.800e+02,
       2.840e+02, 5.000e+00])

In [16]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [17]:
train_dset = torch.utils.data.TensorDataset(torch.tensor(X_train, dtype=torch.long), 
                               torch.tensor(y_train, dtype=torch.long))

In [18]:
test_dset = torch.utils.data.TensorDataset(torch.tensor(X_test, dtype=torch.long), 
                               torch.tensor(y_test, dtype=torch.long))

In [19]:
train_loader = torch.utils.data.DataLoader(train_dset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=4
                         )

In [20]:
test_loader = torch.utils.data.DataLoader(test_dset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=4
                         )

### Defining and training RNN in pytorch

In [12]:
import os
import torch.optim as optim
import torch.nn as nn

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Our RNN will process the input sequence word by word (word level). We will use a simple architecture consisting of an embedding layer, one LSTM layer and a fully-connected layer on top of the last hidden state.

The code below defines and trains the network. __Pay attention__ to "### pay attention here" marks: they point to RNN specifics.

Run this code so that you can compare training time with different models and implementations later.

In [14]:
class RNNClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, \
                 batch_size, rec_layer=nn.LSTM, embedding=nn.Embedding, \
                 dropout=None):
        super(RNNClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size

        self.word_embeddings = embedding(vocab_size, embedding_dim)
        if dropout:
            self.rnn = rec_layer(embedding_dim, hidden_dim, dropout=dropout)
        else:
            self.rnn = rec_layer(embedding_dim, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, label_size)
    
    def forward(self, sentences):
        embedding = self.word_embeddings(sentences)
        out, hidden = self.rnn(embedding) # pay attention here!
        res = self.hidden2label(out[-1])
        return torch.sigmoid(res)
    

[LSTM source code](http://pytorch.org/docs/master/_modules/torch/nn/modules/rnn.html#LSTM)
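The lines marked "pay attention here" rely on the default tensor layout of nn.LSTM: with batch_first=False it expects input of shape (seq_len, batch, features). That is why the batch is transposed with .t() before the forward pass and why out[-1] selects the last time step. The sketch below checks these shapes on toy sizes; all sizes and variable names are chosen for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, vocab, n_emb, n_hidden = 4, 7, 50, 8, 16  # toy sizes

tokens = torch.randint(0, vocab, (batch, seq_len))  # layout a DataLoader yields
emb = nn.Embedding(vocab, n_emb)
lstm = nn.LSTM(n_emb, n_hidden)  # batch_first=False by default

x = emb(tokens.t())          # (seq_len, batch, n_emb)
out, (h_n, c_n) = lstm(x)    # out holds the hidden state of every time step

print(out.shape)             # torch.Size([7, 4, 16])
print(h_n.shape)             # torch.Size([1, 4, 16])
# For a single-layer unidirectional LSTM the last output is the final hidden state
print(torch.allclose(out[-1], h_n[0]))  # True
```

A custom recurrent layer passed as rec_layer should return a tuple of the same form, since RNNClassifier.forward unpacks it into out and hidden.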

In [15]:
model = RNNClassifier(embedding_dim=n_emb,
                       hidden_dim=n_hidden,
                       vocab_size=vocab_size,
                       label_size=1,
                       batch_size=batch_size, 
                       rec_layer=nn.LSTM,
                       dropout=None).to(device)

In [16]:
optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)
lossfun = nn.BCELoss(reduction='sum')

In [17]:
def train_epoch(train_loader, model, lossfun, optimizer, device):
    model.train()
    for it, traindata in enumerate(train_loader):
        train_inputs, train_labels = traindata
        train_inputs = train_inputs.to(device) 
        train_labels = train_labels.to(device)
        train_labels = torch.squeeze(train_labels)

        model.zero_grad()        
        output = model(train_inputs.t()) # pay attention here!

        loss = lossfun(output.view(-1), train_labels.float())
        loss.backward()
        optimizer.step()

def evaluate(loader, model, lossfun, device):
    model.eval()
    total_acc = 0.0
    total_loss = 0.0
    num_batches = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for inputs, labels in loader:
            inputs = inputs.to(device)
            labels = torch.squeeze(labels.to(device))

            output = model(inputs.t()) # pay attention here!
            loss = lossfun(output.view(-1), labels.float())
            total_loss += loss.item()

            # accuracy on the current batch
            pred = output.view(-1) > 0.5
            correct = pred == labels.bool()
            total_acc += torch.sum(correct).item() / len(correct)
            num_batches += 1

    # both values are averaged over batches (the loss is summed within a batch)
    return total_loss / num_batches, total_acc / num_batches
    

def train(train_loader, test_loader, model, lossfun, optimizer, \
          device, num_epochs):
    train_loss_ = []
    test_loss_ = []
    train_acc_ = []
    test_acc_ = []
    for epoch in range(num_epochs):
        train_epoch(train_loader, model, lossfun, optimizer, device)
        train_loss, train_acc = evaluate(train_loader, model, lossfun, device)
        train_loss_.append(train_loss)
        train_acc_.append(train_acc)
        test_loss, test_acc = evaluate(test_loader, model, lossfun, device)
        test_loss_.append(test_loss)
        test_acc_.append(test_acc)

        print(f'Epoch: {epoch+1:3d}/{num_epochs:3d} '
              f'Training Loss: {train_loss_[epoch]:.3f}, Testing Loss: {test_loss_[epoch]:.3f}, '
              f'Training Acc: {train_acc_[epoch]:.3f}, Testing Acc: {test_acc_[epoch]:.3f}')

    return train_loss_, train_acc_, test_loss_, test_acc_

In [18]:
%time a, b, c, d = train(train_loader, test_loader, model, lossfun, \
                   optimizer, device, num_epochs)

Epoch:   1/ 30 Training Loss: 84.081, Testing Loss: 85.471, Training Acc: 0.610, Testing Acc: 0.591
Epoch:   2/ 30 Training Loss: 75.500, Testing Loss: 80.116, Training Acc: 0.689, Testing Acc: 0.647
Epoch:   3/ 30 Training Loss: 69.199, Testing Loss: 77.381, Training Acc: 0.727, Testing Acc: 0.672
Epoch:   4/ 30 Training Loss: 66.386, Testing Loss: 76.913, Training Acc: 0.742, Testing Acc: 0.681
Epoch:   5/ 30 Training Loss: 62.392, Testing Loss: 76.142, Training Acc: 0.761, Testing Acc: 0.682
Epoch:   6/ 30 Training Loss: 56.279, Testing Loss: 74.291, Training Acc: 0.797, Testing Acc: 0.699
Epoch:   7/ 30 Training Loss: 53.556, Testing Loss: 74.636, Training Acc: 0.808, Testing Acc: 0.706
Epoch:   8/ 30 Training Loss: 50.282, Testing Loss: 73.452, Training Acc: 0.828, Testing Acc: 0.705
Epoch:   9/ 30 Training Loss: 46.989, Testing Loss: 78.924, Training Acc: 0.837, Testing Acc: 0.704
Epoch:  10/ 30 Training Loss: 43.098, Testing Loss: 78.249, Training Acc: 0.858, Testing Acc: 0.711


An unregularized LSTM often overfits (in the log above, the test accuracy stops improving and the test loss grows while the training metrics keep improving). To overcome this, L2 regularization and dropout are usually applied. However, there are several ways to apply dropout to gated RNNs, and not all of them work well. Please refer to this [blog post](https://medium.com/@bingobee01/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b) for a good review.

We will implement three dropouts for LSTM. While doing so, we will see that different methods require "opening up" the layer to a different "depth" (from a black box to implementing the layer ourselves).

### Dropout by (Gal and Ghahramani)

Let's start with a dropout proposed by [Gal and Ghahramani](https://arxiv.org/abs/1512.05287).

To implement it, we have to use nn.LSTMCell (processes 1 time step) instead of nn.LSTM (processes the whole input sequence). 

Complete the class RNNLayer. With dropout=0 it has to work as a usual LSTM, and with dropout > 0 it has to multiply the input and hidden vectors by random binary masks, and these masks should be __the same for all time steps__.

Formulas for this dropout ($m_x$ and $m_h$ are the dropout masks):
$$
h_{t-1} = h_{t-1} \odot m_h, \, x_t = x_t \odot m_x
$$
(then the usual LSTM step follows)
$$
i = \sigma(h_{t-1}W^i + x_t U^i+b_i) \quad
o = \sigma(h_{t-1}W^o + x_t U^o+b_o) 
$$
$$
f = \sigma(h_{t-1}W^f + x_t U^f+b_f) \quad 
g = \tanh(h_{t-1} W^g + x_t U^g+b_g) 
$$
$$
c_t = f \odot c_{t-1} +  i \odot  g \quad
h_t =  o \odot \tanh(c_t)
$$
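To make the "same mask for all time steps" requirement concrete, here is a NumPy sketch on a toy tensor that contrasts a mask sampled once per sequence with a mask resampled at every step. It only illustrates the masking itself, not the LSTM recurrence; the sizes, the variable names and the choice of p as the drop probability are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, batch, n_features, p = 5, 2, 6, 0.5   # toy sizes; p = drop probability

x = rng.normal(size=(seq_len, batch, n_features))

# Gal and Ghahramani: sample the mask once per sequence ...
shared_mask = rng.binomial(1, 1 - p, size=n_features)
x_shared = x * shared_mask            # ... and reuse it at every time step

# For contrast: a fresh mask at every time step
per_step_mask = rng.binomial(1, 1 - p, size=(seq_len, 1, n_features))
x_per_step = x * per_step_mask

# With the shared mask, a dropped feature is zero along the whole sequence
dropped = shared_mask == 0
print(np.all(x_shared[:, :, dropped] == 0))   # True

# At test time no units are dropped; activations are scaled by (1 - p) instead
x_eval = x * (1 - p)
```

The same idea applies to $m_h$: it is sampled once before the time loop and multiplied into $h_{t-1}$ at every step.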

In [None]:
def init_h0_c0(num_objects, hidden_size, some_existing_tensor):
    """
    return h0 and c0, use some_existing_tensor.new_zeros() to gen them
    h0 shape: num_objects x hidden_size
    c0 shape: num_objects x hidden_size
    """
    ### your code here
    

In [None]:
def gen_dropout_mask(input_size, hidden_size, is_training, p, some_existing_tensor):
    """
    is_training: if True, gen masks from Bernoulli
                 if False, gen masks consisting of (1-p)
    
    return two dropout masks of sizes (input_size, ), (hidden_size, )
    if p is not None
    return two masks of ones of the same sizes if p is None
    """
    ### your code here
    ### hint: some_existing_tensor.new_zeros(...).bernoulli(...)

In [None]:
class RNNLayer(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=None):
        super(RNNLayer, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.dropout = dropout
        self.rnn_cell = nn.LSTMCell(input_size, hidden_size)
        
    def forward(self, inp):
        # initialize h_0, c_0
        h_0, c_0 = init_h0_c0(inp.shape[1], self.hidden_size, inp)
        
        # gen masks
        input_mask, hidden_mask = gen_dropout_mask(self.input_size, \
                                                   self.hidden_size, \
                                                   self.training, \
                                                   self.dropout, \
                                                   inp)
        
        
        ### your code here
        ### implement recurrent logic and return what nn.LSTM returns
        ### do not forget to apply generated dropout masks!

Test your implementation with dropout turned off (pass RNNLayer to RNNClassifier as rec_layer). Measure the training time (%time). Does it differ from the training time of nn.LSTM?

Test your implementation with dropout=0.5, again measuring the training time. Does the model still overfit? Does training take more time than without dropout? (Extra time is spent generating the masks.)

### Dropout by (Gal and Ghahramani). Second try

< start hacking pytorch >

Unrolling the time loop in Python slows training down. However, the dropout by (Gal and Ghahramani) can be implemented without modifying the computational graph, by modifying only the weights of the network. This allows using nn.LSTM instead of nn.LSTMCell. Before calling nn.LSTM, you should replace its weights with copies in which some rows are multiplied by 0. Of course, in this case you have to store the trainable weights separately. This is how this dropout is implemented in the FastAI library, whose code is used in the cell below.
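The reason the weight trick works is a simple identity: masking the hidden vector and then multiplying by the weight matrix gives the same result as multiplying the unmasked vector by a matrix whose corresponding rows are zeroed. A quick NumPy check, with toy sizes and a hand-picked mask chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_hidden = 3, 4
h = rng.normal(size=(batch, n_hidden))
W = rng.normal(size=(n_hidden, 4 * n_hidden))   # row-vector convention: h @ W
m = np.array([1.0, 0.0, 1.0, 0.0])              # hypothetical dropout mask

masked_activations = (h * m) @ W
masked_weights = h @ (W * m[:, None])           # zero the rows of W instead

print(np.allclose(masked_activations, masked_weights))  # True

# nn.LSTM stores the transposed matrix of shape (4 * n_hidden, n_hidden),
# so the same mask zeroes columns there and plain broadcasting is enough.
W_torch_layout = W.T
print(np.allclose(h @ (W_torch_layout * m).T, masked_activations))  # True
```

This is why multiplying raw_w by a mask of shape (hidden_size, ) or (input_size, ) in _setweights reproduces the activation masks of the previous section.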

Complete the class:

In [None]:
import warnings

In [None]:
class FastRNNLayer(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0):
        super(FastRNNLayer, self).__init__()
        self.module = nn.LSTM(input_size, hidden_size)
        self.dropout = dropout
        self.layer_names = ['weight_hh_l0', 'weight_ih_l0']
        for layer in self.layer_names:
            # Makes a copy of the weights of the selected layers.
            w = getattr(self.module, layer)
            self.register_parameter(f'{layer}_raw', nn.Parameter(w.data))
            
    def _setweights(self):
        "Apply dropout to the raw weights."
        ### your code here
        ### generate input_mask and hidden_mask (use function gen_dropout_mask)
        
        for layer, mask in zip(self.layer_names, (hidden_mask, input_mask)):
            raw_w = getattr(self, f'{layer}_raw')
            self.module._parameters[layer] = raw_w * mask

    def forward(self, *args):
        with warnings.catch_warnings():
            # To avoid the warning that comes because the weights aren't flattened.
            warnings.simplefilter("ignore")
            
            ### your code here
            ### set new weights of self.module and call its forward

    def reset(self):
        if hasattr(self.module, 'reset'): self.module.reset()

Test your implementation (again passing FastRNNLayer as rec_layer) with dropout=0.5. Compare the training time with the previous models. The test accuracy and other training metrics should be close to those of the previous implementation (up to randomness).

< end hacking pytorch >

### Dropout by (Semeniuta et al)
Now let's turn to implementing dropout proposed by [Semeniuta et al](http://www.aclweb.org/anthology/C16-1165). 

This method is even more popular than the previous one. It was developed specifically for _gated_ recurrent architectures. For LSTM, the dropout is applied only to the update of the cell state, i.e. to the information flow ($m_h$ is a dropout mask):
$$
i = \sigma(h_{t-1}W^i + x_t U^i+b_i) \quad
o = \sigma(h_{t-1}W^o + x_t U^o+b_o) 
$$
$$
f = \sigma(h_{t-1}W^f + x_t U^f+b_f) \quad 
g = \tanh(h_{t-1} W^g + x_t U^g+b_g) 
$$
$$
c_t = f \odot c_{t-1} +  i \odot g \odot {\bf m_h} \quad
h_t =  o \odot \tanh(c_t)
$$
For $x_t$, the mask is applied in the same way as in (Gal and Ghahramani). By the way, you can apply this mask before passing the tensor to the LSTM layer.

According to the paper, the mask can be the same for all time steps but may also differ between them. We will use the same mask.

To implement this dropout, you have to implement the LSTM yourself (the interface of LSTMCell is not enough, as you need access to the internal LSTM logic). Complete the class:
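The formulas above can be traced on a single time step in plain NumPy. This is only a sketch of the arithmetic, not the HandmadeLSTM class itself: the helper lstm_step, the stacking of the four gate matrices into one matrix and the gate order (i, f, g, o) are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b, m_h=None):
    """One LSTM step; W: (n_hidden, 4*n_hidden), U: (n_in, 4*n_hidden)."""
    gates = h_prev @ W + x_t @ U + b
    i, f, g, o = np.split(gates, 4, axis=-1)   # assumed gate order
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    update = i * g
    if m_h is not None:
        update = update * m_h                  # dropout on the update only
    c_t = f * c_prev + update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden, batch = 3, 4, 2
W = rng.normal(size=(n_hidden, 4 * n_hidden))
U = rng.normal(size=(n_in, 4 * n_hidden))
b = np.zeros(4 * n_hidden)
x_t = rng.normal(size=(batch, n_in))
h_prev = rng.normal(size=(batch, n_hidden))
c_prev = rng.normal(size=(batch, n_hidden))

m_h = np.array([1.0, 0.0, 1.0, 0.0])           # hypothetical mask
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b, m_h)
_, c_plain = lstm_step(x_t, h_prev, c_prev, W, U, b)

# Units with m_h = 1 are updated as usual; the others only keep f * c_prev
print(np.allclose(c_t[:, m_h == 1], c_plain[:, m_h == 1]))  # True
```

Note that the memory $c_{t-1}$ itself is never zeroed, only the new candidate information is dropped.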

In [None]:
class HandmadeLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.):
        super(HandmadeLSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.dropout = dropout
        self.input_weights = nn.Linear(input_size, 4 * hidden_size)
        self.hidden_weights = nn.Linear(hidden_size, 4 * hidden_size)
        
        self.reset_params()


    def reset_params(self):
        """
        initialization as in Pytorch
        do not forget to call this method!
        """
        stdv = 1.0 / np.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
            

    def forward(self, inp, hidden=None):
        ### your code here
        # use functions init_h0_c0 and gen_dropout_mask defined above

Test your implementation without dropout (check the quality and compare the training time with nn.LSTM) and with dropout=0.5. Compare the quality with the model trained with the previous dropout.

### Zoneout
In Zoneout, at each time step each unit of the hidden state is updated with probability p and keeps its previous value with probability 1-p. Formulas for Zoneout:
 
(first, the usual time step is performed, e.g. an LSTM step)
$$
i = \sigma(h_{t-1}W^i + x_t U^i+b_i) \quad
o = \sigma(h_{t-1}W^o + x_t U^o+b_o) 
$$
$$
f = \sigma(h_{t-1}W^f + x_t U^f+b_f) \quad 
g = \tanh(h_{t-1} W^g + x_t U^g+b_g) 
$$
$$
c_t = f \odot c_{t-1} +  i \odot  g \quad
h_t =  o \odot \tanh(c_t)
$$
Then apply Zoneout:
$$
h_t = h_t \odot m_h^t + h_{t-1}\odot(1-m_h^t)
$$
In this method, the mask should be different at different time steps (otherwise the method simplifies to the dropout by (Gal and Ghahramani)). For $x_t$, you can apply dropout before passing it to the LSTM layer.

If you have time left, you may implement this method. Choose one of our three implementations as a base.
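The Zoneout update itself is a per-unit choice between the new and the previous hidden state. A NumPy sketch on toy values follows; the helper name zoneout, the per-object mask shape and the constant stand-in states are assumptions of this example.

```python
import numpy as np

def zoneout(h_new, h_prev, m):
    """Take h_new where m == 1 and keep h_prev where m == 0."""
    return h_new * m + h_prev * (1 - m)

rng = np.random.default_rng(0)
batch, n_hidden, p = 2, 5, 0.7          # p = probability of updating a unit

h_prev = np.zeros((batch, n_hidden))    # stand-in for h_{t-1}
h_new = np.ones((batch, n_hidden))      # stand-in for the LSTM step output

# A fresh mask is drawn at every time step
m = rng.binomial(1, p, size=(batch, n_hidden))
h_t = zoneout(h_new, h_prev, m)

# With these stand-ins, h_t is 1 exactly where the unit was updated
print(np.array_equal(h_t, m))  # True
```

Inside a time loop, the mask would be resampled at every step before calling the update.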