# Recurrent neural networks
[Recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNNs) are used for sequence data, that is, data in which each item depends on one or more of the previous items. Examples of this type of data are time series, text, and DNA sequences. We can look at a simple RNN with a single hidden layer to understand how it works. The input at time t is sent to the hidden layer, whose output goes to the output layer and is also fed back to the hidden layer, to be used in combination with the next input. The loop works like a memory and allows the network to learn the dependency between elements in the sequence.

![RNN - Wikipedia, By fdeloche - Own work, CC BY-SA 4.0](images/recurrent_neural_network.svg)
The image shows an RNN with one hidden layer (Credit: fdeloche - Own work, CC BY-SA 4.0, Wikipedia)

In [1]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
import torch
import torch.nn as nn
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("PyTorch version: %s"%torch.__version__)

NumPy version: 1.25.0
Pandas version: 1.5.3
PyTorch version: 2.0.1


We can compute the preactivation of the hidden layer by two matrix multiplications, one to weight the input data and another to weight the result of the previous input data.

$$z_h^{(t)} = W_{xh}x^{(t)} + W_{hh}h^{(t-1)} + b_h$$

The output of the hidden layer is then computed by applying an activation function $\sigma_h$ to the result of the preactivation

$$h^{(t)} = \sigma_h (z_h^{(t)}) = \sigma_h (W_{xh}x^{(t)} + W_{hh}h^{(t-1)} + b_h)$$

## PyTorch RNN implementation
Now we implement a small RNN with one hidden layer. Let's say each input is an array of size 5, so that the size of the input layer is 5. We set the size of the hidden layer to 2. With these settings the shape of the $W_{xh}$ matrix is 2x5 and the shape of the $W_{hh}$ matrix is 2x2. The [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) implementation in PyTorch builds the matrices from the same parameters. PyTorch keeps two bias vectors for the hidden layer, `bias_ih_l0` and `bias_hh_l0`, one for each matrix product; their sum plays the role of $b_h$ in the formula above.

In [2]:
torch.manual_seed(1)

rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True) 

w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0

print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)

W_xh shape: torch.Size([2, 5])
W_hh shape: torch.Size([2, 2])
b_xh shape: torch.Size([2])
b_hh shape: torch.Size([2])


We can compute the output of the RNN instance for a sequence of three inputs and compare the result with that computed using the formula we have described above. 

In [3]:
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()
x_seq

tensor([[1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2.],
        [3., 3., 3., 3., 3.]])

In [4]:
## output of the simple RNN:
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))

## manually computing the output:
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print('   Input           :', xt.numpy())
    
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh    
    print('   Hidden          :', ht.detach().numpy())
    
    if t>0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))

    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print('   Output (manual) :', ot.detach().numpy())
    print('   RNN output      :', output[:, t].detach().numpy())
    print()

Time step 0 =>
   Input           : [[1. 1. 1. 1. 1.]]
   Hidden          : [[-0.4701929  0.5863904]]
   Output (manual) : [[-0.3519801   0.52525216]]
   RNN output      : [[-0.3519801   0.52525216]]

Time step 1 =>
   Input           : [[2. 2. 2. 2. 2.]]
   Hidden          : [[-0.88883156  1.2364397 ]]
   Output (manual) : [[-0.68424344  0.76074266]]
   RNN output      : [[-0.68424344  0.76074266]]

Time step 2 =>
   Input           : [[3. 3. 3. 3. 3.]]
   Hidden          : [[-1.3074701  1.886489 ]]
   Output (manual) : [[-0.8649416   0.90466356]]
   RNN output      : [[-0.8649416   0.90466356]]



Like the other neural networks that we have seen so far, the weights in an RNN are learnt through backpropagation. Because of the loop, the network is unrolled over the time steps of the sequence, and the same matrix $W_{hh}$ enters the gradient once for every time step between an input and the output that depends on it. For long sequences this may result in one of two opposite problems: exploding gradients or vanishing gradients. The problem is discussed in a [paper](https://arxiv.org/abs/1211.5063) by Pascanu, Mikolov, and Bengio. The two outcomes depend on the values of the $W_{hh}$ matrix, which is multiplied by itself a number of times that depends on the length of the sequence we consider to be relevant for the output. If $|W_{hh}| > 1$ (more precisely, if its largest singular value is greater than 1) we may face the problem of exploding gradients; on the contrary, if $|W_{hh}| < 1$ we may face the problem of vanishing gradients. These problems can be addressed by limiting the length of the sequence we want to take into account for the output. Another approach is to use Long Short-Term Memory cells.
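The effect of the repeated multiplication can be illustrated with a small sketch. It is a simplification, not the full backpropagation through time: it ignores the derivative of the activation function, and the recurrent matrix is assumed to be a scaled random orthogonal matrix (the helper name `norm_after_steps` is ours) so that all its singular values are equal to the chosen scale.

```python
import torch

torch.manual_seed(0)

def norm_after_steps(scale, steps=30, size=8):
    """Norm of a gradient-like vector after `steps` products with W_hh^T."""
    # Assumption: W_hh = scale * Q with Q orthogonal, so every singular value is `scale`
    q, _ = torch.linalg.qr(torch.randn(size, size))
    w_hh = scale * q
    grad = torch.ones(size)
    for _ in range(steps):
        grad = w_hh.T @ grad
    return grad.norm().item()

for scale in (0.5, 1.0, 1.5):
    print(f'scale {scale}: norm after 30 steps = {norm_after_steps(scale):.3e}')
```

With a scale below 1 the norm shrinks geometrically with the number of steps, with a scale above 1 it grows geometrically, and with a scale of 1 it is preserved.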

## Long Short-Term Memory network
The [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) cell is the equivalent of a layer and addresses the problem of exploding or vanishing gradients by keeping the weight of the recurrent edge close to 1. The cell state $C_t$ depends on the previous cell state $C_{t-1}$, the previous output of the hidden units $h_{t-1}$, and on the current input in the sequence. The symbol $\oplus$ in the diagram represents the element-wise summation, and the $\otimes$ symbol represents the element-wise product. The boxes are called gates and carry out matrix-vector multiplications between the weights and the input or the recurrent edge units to compute the preactivations. The preactivation is then passed to the activation function defined in the gate. The three gates, the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$, use a sigmoid activation function ($\sigma$).

![LSTM](images/long_short-term_memory.svg)
(Credit: fdeloche - Own work, Wikipedia, CC BY-SA 4.0)

$$f_t = \sigma(W_{xf}x^{(t)} + W_{hf}h^{(t-1)} + b_f)$$
$$i_t = \sigma(W_{xi}x^{(t)} + W_{hi}h^{(t-1)} + b_i)$$
$$\tilde{C}_t = \tanh(W_{xc}x^{(t)} + W_{hc}h^{(t-1)} + b_c)$$
$$C^{(t)} = (C^{(t-1)} \otimes f_t) \oplus (i_t \otimes \tilde{C}_t)$$
$$o_t = \sigma(W_{xo}x^{(t)} + W_{ho}h^{(t-1)} + b_o)$$
$$h^{(t)} = o_t \otimes \tanh(C^{(t)})$$
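As a check of the equations above, we can compute one LSTM step by hand and compare it with PyTorch's `nn.LSTMCell`. This is a minimal sketch: the sizes and the random inputs are arbitrary choices of ours. PyTorch stacks the weights of the four gates in one matrix, in the order input, forget, candidate, output, and keeps two bias vectors whose sum corresponds to the single bias of each equation.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
input_size, hidden_size = 5, 2  # arbitrary sizes chosen for the illustration
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

# Preactivations of all four gates at once: shape (1, 4 * hidden_size)
pre = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

i_t = torch.sigmoid(i_pre)    # input gate
f_t = torch.sigmoid(f_pre)    # forget gate
c_tilde = torch.tanh(g_pre)   # candidate cell state
o_t = torch.sigmoid(o_pre)    # output gate

c_t = c_prev * f_t + i_t * c_tilde   # new cell state
h_t = o_t * torch.tanh(c_t)          # new hidden state

h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print('hidden state matches:', torch.allclose(h_t, h_ref, atol=1e-6))
print('cell state matches  :', torch.allclose(c_t, c_ref, atol=1e-6))
```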

## Sentiment analysis
We use the PyTorch implementation of the [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) to develop a tool that determines the sentiment about movies from a set of reviews left by the public. For this problem we will use a sequence of words to infer the sentiment of the authors. We will use the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) dataset. The dataset must be downloaded and unpacked: each review is stored in its own text file, inside a `pos` or `neg` folder that gives its sentiment. We read all the files into a single DataFrame.

In [10]:
basepath = 'data/aclImdb'

labels = {'pos': 1, 'neg': 0}
rows = []  # collect the [review, sentiment] pairs, then build the DataFrame once
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])

movie_df = pd.DataFrame(rows, columns=['review', 'sentiment'])

In [105]:
len(movie_df)

50000

We might want to save the data as a csv file

In [59]:
movie_df = movie_df.sample(frac=1, random_state=1).reset_index(drop=True) # returns a randomized dataframe
movie_df.to_csv(basepath + '/movie_data.csv', index=False, encoding='utf-8')

and then open the file for reading

In [5]:
basepath = 'data/aclImdb'
movie_df = pd.read_csv(basepath + '/movie_data.csv', encoding='utf-8')
movie_df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | PROM NIGHT (2008)<br /><br />directed by: Nels... | 0 |
| 1 | Let me tell you something...this movie exceeds... | 0 |
| 2 | Private Practice is supposed to be a medical d... | 0 |


We create a custom IMDB dataset from the pandas DataFrame. A PyTorch dataset gives access to the examples used to train a model, in our case pairs of a review and its sentiment.

In [6]:
from torch.utils.data import Dataset

class ImdbDataset(Dataset):
    def __init__(self, df, transform=None, target_transform=None):
        self.df = df
        self.transform = transform
        self.target_transform = target_transform
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        item = self.df.iloc[idx]
        review = item['review']
        sentiment = item['sentiment']
        if self.transform:
            review = self.transform(review)
        if self.target_transform:
            sentiment = self.target_transform(sentiment)
        return review, sentiment
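A short usage sketch may help to see how such a dataset behaves. It uses a hypothetical two-row DataFrame and a reduced copy of the class (named `TinyReviewDataset` here, our name) so that the example runs on its own; `str.lower` stands in for a real text transform.

```python
import pandas as pd
from torch.utils.data import Dataset

class TinyReviewDataset(Dataset):
    """Reduced, hypothetical copy of the dataset class, for illustration only."""
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        review = row['review']
        if self.transform:
            review = self.transform(review)
        return review, row['sentiment']

toy_df = pd.DataFrame({'review': ['Great movie', 'Boring plot'],
                       'sentiment': [1, 0]})
toy_ds = TinyReviewDataset(toy_df, transform=str.lower)

first_review, first_label = toy_ds[0]   # indexing calls __getitem__
print(len(toy_ds), first_review, first_label)
```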

We split the IMDB dataset into a training and test dataset of the same size

In [7]:
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

movie_dataset = ImdbDataset(df=movie_df)
#movie_loader = DataLoader(dataset=review_data, batch_size=4,shuffle=True)

train_dataset, test_dataset = random_split(list(movie_dataset), [25000, 25000])
print('Train dataset: {0:d}\nTest dataset: {1:d}'.format(len(train_dataset), len(test_dataset)))

Train dataset: 25000
Test dataset: 25000


We split the train dataset into a training set and a validation set

In [8]:
train_dataset, valid_dataset = random_split(list(train_dataset), [20000, 5000])
print('Train dataset: {0:d}\nValidation dataset: {1:d}'.format(len(train_dataset), len(valid_dataset)))

Train dataset: 20000
Validation dataset: 5000
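The splits above are random, so they change from run to run unless the random number generator is seeded. A minimal sketch, assuming a toy list of ten items in place of the reviews: passing a seeded `torch.Generator` to `random_split` makes the split repeatable.

```python
import torch
from torch.utils.data import random_split

data = list(range(10))  # toy stand-in for the list of examples

# The same seed gives the same split
first_train, first_valid = random_split(
    data, [8, 2], generator=torch.Generator().manual_seed(1))
second_train, second_valid = random_split(
    data, [8, 2], generator=torch.Generator().manual_seed(1))

train_items = [first_train[i] for i in range(len(first_train))]
valid_items = [first_valid[i] for i in range(len(first_valid))]
print('train:', train_items)
print('valid:', valid_items)
print('repeatable:', first_train.indices == second_train.indices)
```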


We extract the unique words (tokens) from each review in the training set

In [9]:
import re
from collections import Counter, OrderedDict

token_counts = Counter()

def tokenizer(text):
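    text = re.sub(r'<[^>]*>', '', text)  # strip the HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters with a space, then append the emoticons without the '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')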
    tokenized = text.split()
    return tokenized
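To see what the tokenizer does, we can apply it to a short made-up review. The sketch repeats the function under a different name (`tokenize_review`, our name) only to be runnable on its own; the sample sentence is an assumption of ours.

```python
import re

def tokenize_review(text):
    """Stand-alone copy of the tokenizer, for illustration only."""
    text = re.sub(r'<[^>]*>', '', text)  # strip the HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # lowercase, replace non-word characters with a space, append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

sample_tokens = tokenize_review('This movie is <b>great</b> :-) !')
print(sample_tokens)  # ['this', 'movie', 'is', 'great', ':)']
```

The HTML tags and the punctuation are removed, the words are lowercased, and the emoticon is kept as a token of its own.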

In [33]:
for review, sentiment in train_dataset:
    tokens = tokenizer(review)
    token_counts.update(tokens)

The number of unique words in the training set

In [11]:
print('Vocab-size:', len(token_counts))

Vocab-size: 71086


We create a dictionary (the vocabulary) using the unique words from the reviews as keys and an integer as value, assigned in order of decreasing frequency. Since the token counts come from the training set only, some words in the validation or test sets will be missing from the vocabulary. The indices 0 and 1 are reserved for the padding token (`<pad>`) and for such unknown words (`<unk>`).

In [12]:
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

vocab = {}

count = 2
for item in ordered_dict.items():
    key = item[0]
    value = item[1]
    vocab[key] = count
    count = count + 1

In [13]:
vocab['<pad>'] = 0
vocab['<unk>'] = 1

print([vocab[token] for token in ['this', 'is', 'an', 'example']])

[11, 7, 35, 472]
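The construction of the vocabulary can be sketched on a toy sentence. The sentence, the names `toy_vocab` and `encode`, and the use of `dict.get` to fall back on `<unk>` are our assumptions for the illustration.

```python
from collections import Counter

counts = Counter('the movie was good the plot was thin'.split())

# Indices 0 and 1 are reserved for padding and for unknown words
toy_vocab = {'<pad>': 0, '<unk>': 1}
# The most frequent words get the smallest indices, starting from 2
for idx, (token, _) in enumerate(counts.most_common(), start=2):
    toy_vocab[token] = idx

def encode(tokens, vocab):
    """Map each token to its index, or to the index of '<unk>' if it is missing."""
    return [vocab.get(token, vocab['<unk>']) for token in tokens]

encoded = encode('the plot was brilliant'.split(), toy_vocab)
print(encoded)  # [2, 6, 3, 1]
```

The word "brilliant" does not appear in the toy sentence, so it is encoded as 1, the index of `<unk>`.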


In [14]:
device = 'cpu'
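# words that are not in the vocabulary are mapped to the index of '<unk>'
text_pipeline = lambda x: [vocab.get(token, vocab['<unk>']) for token in tokenizer(str(x))]
# the sentiment column already holds 1 (positive) or 0 (negative)
label_pipeline = lambda x: float(x)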

In [23]:
text_pipeline('this is an example and this is an example')

[11, 7, 35, 472, 3, 11, 7, 35, 472]

In [22]:
my_text_list = []
processed_text = torch.tensor(text_pipeline('this is an example and this is an example'), dtype=torch.int64)
my_text_list.append(processed_text)
nn.utils.rnn.pad_sequence(my_text_list, batch_first=True)

tensor([[ 11,   7,  35, 472,   3,  11,   7,  35, 472]])

We implement a collate_fn function to build a batch from the dataset

In [52]:
## Step 3-B: wrap the encode and transformation function
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    for _text, _label in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    return padded_text_list, label_list, lengths

We print the first batch of the training set. The reviews in the batch have different lengths (see the third tensor). Nonetheless, each row of the batch tensor has the same number of tokens, because the shorter reviews have been padded with zeros up to the length of the longest one.

In [56]:
from torch.utils.data import DataLoader
dataloader = DataLoader(train_dataset, batch_size=4, shuffle=False, collate_fn=collate_batch)
text_batch, label_batch, length_batch = next(iter(dataloader)) # takes the first batch
print(text_batch)
print(label_batch)
print(length_batch)
print(text_batch.shape)

tensor([[17949,  7190,     7,  ...,     0,     0,     0],
        [   11,     3,  1658,  ...,     0,     0,     0],
        [   44,    21,   102,  ...,     0,     0,     0],
        [    3,   101,   317,  ...,  2107,    61,   114]])
tensor([0., 0., 0., 0.])
tensor([182, 231, 228, 491])
torch.Size([4, 491])
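The padding is easier to follow on a toy batch. A minimal sketch, with three made-up sequences of token indices: `pad_sequence` fills the shorter sequences with zeros (its default `padding_value`) up to the length of the longest one, while the original lengths are kept in a separate tensor.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three encoded "reviews" of different lengths (made-up indices)
seqs = [torch.tensor([5, 8, 2]), torch.tensor([7]), torch.tensor([4, 9])]

lengths = torch.tensor([s.size(0) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)

print(padded)
# tensor([[5, 8, 2],
#         [7, 0, 0],
#         [4, 9, 0]])
print(lengths)  # tensor([3, 1, 2])
```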


In [64]:
## Step 4: batching the datasets
batch_size = 32  

train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)

## Embeddings
We will represent each word as a vector of n real numbers, where the dimension n of the embedding space is much smaller than the number of words in the vocabulary. PyTorch provides the `nn.Embedding` layer to create embeddings. For instance, we create an embedding for a batch of two samples, each a sequence of four integers, where each integer represents a word. Each word is mapped to a point in the 3-dimensional embedding space; the index 0, used for padding, is mapped to the zero vector.

In [67]:
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
 
# a batch of 2 samples of 4 indices each
text_encoded_input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 0]])
print(embedding(text_encoded_input))

tensor([[[-0.4179, -1.1119, -0.8271],
         [-0.3051, -0.0225, -0.6329],
         [-0.3254, -0.9680,  1.5573],
         [-0.1910, -0.9103, -0.9134]],

        [[-0.3254, -0.9680,  1.5573],
         [ 0.1040,  1.4941,  0.2961],
         [-0.3051, -0.0225, -0.6329],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)
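An embedding layer can be thought of as a lookup table. As a small sketch with the same sizes as above, we can check that calling the layer gives the same result as indexing its weight matrix with the word indices, and that the row of the padding index is zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
emb = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
idx = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 0]])

looked_up = emb(idx)          # shape (2, 4, 3): batch, sequence, embedding
by_indexing = emb.weight[idx] # rows of the weight matrix selected by index

print('same result    :', torch.equal(looked_up, by_indexing))
print('padding row = 0:', bool(torch.all(emb.weight[0] == 0)))
```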


## Building an RNN model
We create a model with two RNN layers and a final fully connected layer.

In [68]:
## An example of building an RNN model
## with simple RNN layers

# Two stacked RNN layers followed by a fully connected output layer
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, 
                          hidden_size, 
                          num_layers=2, 
                          batch_first=True)
        #self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        #self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        _, hidden = self.rnn(x)
        out = hidden[-1, :, :]
        out = self.fc(out)
        return out

model = RNN(64, 32) 

print(model) 
 
model(torch.randn(5, 3, 64)) 

RNN(
  (rnn): RNN(64, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)


tensor([[-0.1800],
        [ 0.0745],
        [-0.1253],
        [-0.0129],
        [-0.1833]], grad_fn=<AddmmBackward0>)
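The model above uses `hidden[-1, :, :]` as input of the fully connected layer. A minimal sketch, with the same sizes as the example, shows what this slice contains: the hidden state has shape (layers, batch, hidden size), and its last entry is the state of the top layer at the last time step, which is also the last time step of the output sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn = nn.RNN(input_size=64, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(5, 3, 64)   # batch of 5 sequences, 3 time steps, 64 features

output, hidden = rnn(x)
print(output.shape)   # outputs of the top layer at every time step
print(hidden.shape)   # final state of each layer

same = torch.allclose(output[:, -1, :], hidden[-1])
print('last output equals hidden[-1]:', same)
```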

## Building an RNN model for the sentiment analysis task
We use an LSTM layer to take into account long distance dependencies between words.

In [69]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 
                                      embed_dim, 
                                      padding_idx=0) 
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, 
                           batch_first=True)
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True)
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
         
vocab_size = len(vocab)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size) 
model = model.to(device)
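The `forward` method packs the padded batch before passing it to the LSTM. The following sketch, with made-up sizes and random data, suggests why: with packing, the final hidden state of a short sequence is the one computed at its last real token, as if the sequence had been processed alone; without packing, the LSTM keeps updating the state on the padding positions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(1)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)          # batch of 2 sequences, padded to 5 steps
lengths = torch.tensor([5, 2])    # the second sequence has only 2 real steps
x[1, 2:] = 0.0                    # positions past the true length are padding

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)

_, (h_alone, _) = lstm(x[1:2, :2])   # the short sequence alone, without padding
_, (h_padded, _) = lstm(x)           # the padded batch without packing

print('packed == alone:', torch.allclose(h_packed[-1, 1], h_alone[-1, 0], atol=1e-6))
print('padded == alone:', torch.allclose(h_padded[-1, 1], h_alone[-1, 0], atol=1e-6))
```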

In [70]:
def train(dataloader):
    model.train()
    total_acc, total_loss = 0, 0
    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()
        pred = model(text_batch, lengths)[:, 0]
        loss = loss_fn(pred, label_batch)
        loss.backward()
        optimizer.step()
        total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)
 
def evaluate(dataloader):
    model.eval()
    total_acc, total_loss = 0, 0
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred = model(text_batch, lengths)[:, 0]
            loss = loss_fn(pred, label_batch)
            total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)

In [None]:
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10 

torch.manual_seed(1)
 
for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f'Epoch {epoch} accuracy: {acc_train:.4f} val_accuracy: {acc_valid:.4f}')
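The loss and the accuracy used in the training loop can be checked by hand on a few made-up predictions. A minimal sketch: `nn.BCELoss` expects probabilities, such as the outputs of the final sigmoid, and averages the binary cross-entropy over the batch; a prediction counts as positive when it is at least 0.5.

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.9, 0.2, 0.6, 0.4])     # made-up sigmoid outputs
target = torch.tensor([1.0, 0.0, 0.0, 1.0])   # made-up labels

loss_builtin = nn.BCELoss()(pred, target)
# Binary cross-entropy written out, averaged over the batch
loss_manual = -(target * torch.log(pred)
                + (1 - target) * torch.log(1 - pred)).mean()

accuracy = ((pred >= 0.5).float() == target).float().mean().item()

print('same loss:', torch.allclose(loss_builtin, loss_manual))
print('accuracy :', accuracy)  # 0.5, two of the four predictions are correct
```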