# Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a popular model family for many NLP tasks because they are designed for sequential data. For this short analysis, we will use a preprocessed IMDB dataset to test the capabilities of a standard RNN, a Gated Recurrent Unit (GRU), a standard Long Short-Term Memory (LSTM), and a bidirectional LSTM. We will then look into the advantages and disadvantages of each architecture.

## What's an RNN?
RNNs are recurrent, meaning they perform the same task for every element of a sequence, with each output depending on the previous computations. This works like a memory that logs information about what has been seen so far. In theory, this architecture can handle arbitrarily long sequences; in practice, the standard RNN suffers from vanishing and exploding gradients.
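
To make the idea of recurrence concrete, here is a minimal sketch of a scalar RNN in plain Python. The function name `run_rnn`, the weights `w_hh` and `w_xh`, and the choice of `tanh` as the activation are illustrative assumptions, not part of the notebook's model.

```python
import math

def run_rnn(inputs, w_hh=0.5, w_xh=1.0):
    """Apply the same update to every element, carrying a hidden state forward."""
    h = 0.0  # initial hidden state: the "memory" starts empty
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_hh * h + w_xh * x)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state remembers the first element, with a fading trace.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The fading trace in the later states is a small-scale preview of the vanishing gradient issue discussed next.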

The vanishing gradient problem comes from the RNN function:

$h_t=f(\theta_{hh}h_{t-1}+\theta_{xh}x_t)$

therefore:

$\frac{\partial h_t}{\partial h_{t-1}} = f'(\theta_{hh}h_{t-1}+\theta_{xh}x_t)\,\theta_{hh}$

Backpropagating through many time steps multiplies many of these factors together. If the largest eigenvalue of $\theta_{hh}$ is less than 1 in magnitude, the product shrinks toward zero and the gradient becomes vanishingly small. If it is greater than 1, the product can grow without bound and we get exploding gradients.
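
The effect of repeated multiplication can be sketched with a scalar stand-in for $\theta_{hh}$. This is a simplified illustration that assumes a scalar weight and a constant activation derivative `f_prime` (1.0 by default, i.e. a linear activation); the function name `gradient_factor` is hypothetical.

```python
def gradient_factor(theta_hh, steps, f_prime=1.0):
    """Product of `steps` copies of f'(.) * theta_hh for a scalar RNN."""
    factor = 1.0
    for _ in range(steps):
        factor *= f_prime * theta_hh
    return factor

vanishing = gradient_factor(0.9, steps=50)  # |theta_hh| < 1: shrinks toward 0
exploding = gradient_factor(1.1, steps=50)  # |theta_hh| > 1: grows rapidly
print(f"vanishing: {vanishing:.6f}, exploding: {exploding:.2f}")
```

Even weights this close to 1 change the gradient by several orders of magnitude over 50 steps, which is shorter than many movie reviews.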

The problem with the vanishing gradient is that the RNN becomes incapable of capturing long-term dependencies. The exploding gradient can cause even more immediate problems: a single stochastic gradient descent (SGD) update can become too large and push the model into a bad parameter configuration, destabilizing training. The vanishing gradient problem does not always occur, but it could contribute to the standard RNN's lower validation accuracy compared with the other architectures, which are designed to mitigate it (although it can still occur in them).

There are ways to mitigate the vanishing gradient problem, such as using ReLU activation functions or changing the RNN architecture. For this analysis, we will change the architecture and try to understand why the results differ.

Let's break down these architectures:

## Long Short-term Memory (LSTM)
LSTM is a Gated Recurrent Network (GRN) that alleviates some difficulties in training the RNN and deals with the vanishing gradient problem. 

## Gated Recurrent Units (GRU)
GRU has some advantages over LSTM. It is a simpler alternative with fewer parameters, which makes it faster to train while achieving comparable performance.

## Bidirectional Long Short-term Memory (BiLSTM)
A BiLSTM runs one LSTM forward over the sequence and another backward, and the hidden states from the two LSTMs are concatenated. This gives the NN more representational power because the current prediction can draw on both the left (past) and the right (future) context.

## Beginning the Analysis
We will start by importing PyTorch and Pickle, setting PyTorch parameters, and loading the pickle files. 

In [2]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [3]:
import torch
import torch.nn as nn
import pickle

In [4]:
torch.manual_seed(0)
torch.set_num_threads(4)
torch.set_num_interop_threads(4)

In [6]:
with open('assets/reviewVocabVectors', 'rb') as f:
    reviewVocabVectors = pickle.load(f)
with open('assets/trainIterator', 'rb') as f:
    trainIterator = pickle.load(f)
with open('assets/testIterator', 'rb') as f:
    testIterator = pickle.load(f)

## Generate Architecture
This class builds any of the four RNN architectures with the same hyperparameters to ensure a consistent comparison. An architecture is instantiated like this: `MyRNN(model='RNN')`, where the `model` argument (`'RNN'`, `'GRU'`, `'LSTM'`, or `'BiLSTM'`) selects which recurrent layer is created.

In [5]:
embeddingSize = 100
hiddenSize = 10
dropoutRate = 0.5
numEpochs = 5
vocabSize = 20002
pad = 1
unk = 0

class MyRNN(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.name = model
        self.LSTM = (model == 'LSTM' or model == 'BiLSTM')
        self.bidir = (model == 'BiLSTM')
        self.embed = nn.Embedding(vocabSize, embeddingSize, padding_idx = pad)
        
        if model == 'RNN': 
            self.rnn = nn.RNN(embeddingSize, hiddenSize)
        elif model == 'GRU': 
            self.rnn = nn.GRU(embeddingSize, hiddenSize)
        else: 
            self.rnn = nn.LSTM(embeddingSize, hiddenSize, bidirectional=self.bidir)

        self.dense = nn.Linear(hiddenSize * (2 if self.bidir else 1), 1)
        self.dropout = nn.Dropout(dropoutRate)
        
    def forward(self, text, textLengths):
        embedded = self.dropout(self.embed(text))
        packedEmbedded = nn.utils.rnn.pack_padded_sequence(embedded, textLengths)
        
        if self.LSTM: 
            packedOutput, (hidden, cell) = self.rnn(packedEmbedded)
        else: 
            packedOutput, hidden = self.rnn(packedEmbedded)

        output, outputLengths = nn.utils.rnn.pad_packed_sequence(packedOutput)
        
        if self.bidir: 
            hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        else: 
            hidden = hidden[0]

        return self.dense(self.dropout(hidden))

In [7]:
basicRNN = MyRNN(model='RNN')
GRU = MyRNN(model='GRU') 
LSTM = MyRNN(model='LSTM') 
biLSTM = MyRNN(model='BiLSTM') 

models = [basicRNN, GRU, LSTM, biLSTM]

In [8]:
for model in models:
    if model is None:
        continue
    model.embed.weight.data.copy_(reviewVocabVectors)
    model.embed.weight.data[unk] = torch.zeros(embeddingSize)
    model.embed.weight.data[pad] = torch.zeros(embeddingSize)

In [9]:
criterion = nn.BCEWithLogitsLoss()

def batchAccuracy(preds, targets):
    # preds are raw logits; a logit >= 0 corresponds to a sigmoid probability >= 0.5
    roundedPreds = (preds >= 0)
    return (roundedPreds == targets).sum().item() / len(preds)
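
`BCEWithLogitsLoss` works on raw logits, which is why `batchAccuracy` thresholds the predictions at 0 rather than 0.5. The following plain-Python sketch illustrates that equivalence; the helper names and the example logits and targets are made up for illustration.

```python
import math

def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def batch_accuracy(logits, targets):
    """Fraction of logits whose sign matches the 0/1 target."""
    predicted = [1.0 if z >= 0 else 0.0 for z in logits]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(logits)

logits = [2.3, -0.7, 0.1, -1.5]   # hypothetical model outputs
targets = [1.0, 0.0, 0.0, 0.0]    # hypothetical labels

# Thresholding the logit at 0 gives the same decision as
# thresholding the probability at 0.5.
same_decision = all((sigmoid(z) >= 0.5) == (z >= 0) for z in logits)
print(same_decision, batch_accuracy(logits, targets))
```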

## Training the Architectures
This loop creates an Adam optimizer with default hyperparameters for each model and, for every batch from the `trainIterator` that we previously loaded, computes the loss and updates the weights. **Be careful running this loop, as it takes at least an hour to run and the results can differ slightly from the ones shown here.**

In [25]:
for model in models: 
    if model is not None:
        model.train()

for model in models:
    if model is None:
        continue
        
    torch.manual_seed(0)
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(numEpochs):
        epochLoss = 0
        for batch in trainIterator:
            optimizer.zero_grad()
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            loss.backward()
            optimizer.step()
            epochLoss += loss.item()
        print(f'Model: {model.name}, Epoch: {epoch + 1}, Train Loss: {epochLoss / len(trainIterator)}')
    print()

Model: RNN, Epoch: 1, Train Loss: 0.6541614969977942
Model: RNN, Epoch: 2, Train Loss: 0.6211055834275072
Model: RNN, Epoch: 3, Train Loss: 0.5965952027941603
Model: RNN, Epoch: 4, Train Loss: 0.582118474918863
Model: RNN, Epoch: 5, Train Loss: 0.5689613635430251

Model: GRU, Epoch: 1, Train Loss: 0.6986629676331034
Model: GRU, Epoch: 2, Train Loss: 0.6842351510091815
Model: GRU, Epoch: 3, Train Loss: 0.6193866354730123
Model: GRU, Epoch: 4, Train Loss: 0.4861523292558577
Model: GRU, Epoch: 5, Train Loss: 0.3978971023007732

Model: LSTM, Epoch: 1, Train Loss: 0.6938529293555433
Model: LSTM, Epoch: 2, Train Loss: 0.6422763431773466
Model: LSTM, Epoch: 3, Train Loss: 0.5530681530837818
Model: LSTM, Epoch: 4, Train Loss: 0.47816736893275813
Model: LSTM, Epoch: 5, Train Loss: 0.39673527099592304

Model: BiLSTM, Epoch: 1, Train Loss: 0.6928483648678226
Model: BiLSTM, Epoch: 2, Train Loss: 0.6791311325624471
Model: BiLSTM, Epoch: 3, Train Loss: 0.5798324180381073
Model: BiLSTM, Epoch: 4, Tra

## Evaluating the Architectures
Similar to training, this loop evaluates the accuracy of the predictions based on `testIterator`. This will give us a general idea of the performance of each of the architectures. 

In [26]:
for model in models: 
    if model is not None:
        model.eval()

with torch.no_grad():
    for model in models:
        if model is None:
            continue
        accuracy = 0.0
        for batch in testIterator:
            text, textLen = batch[0]
            predictions = model(text, textLen).squeeze(1)
            loss = criterion(predictions, batch[1])
            acc = batchAccuracy(predictions, batch[1])
            accuracy += acc
        print('Model: {}, Validation Accuracy: {}%'.format(model.name, accuracy / len(testIterator) * 100))

Model: RNN, Validation Accuracy: 75.37164322250639%
Model: GRU, Validation Accuracy: 82.48641304347827%
Model: LSTM, Validation Accuracy: 83.8914641943734%
Model: BiLSTM, Validation Accuracy: 80.27733375959079%


## Evaluating the Results
Looking at the accuracy of each trained model, the results are as expected for the most part. The differences come from each architecture's representational power and from how it deals with the vanishing gradient problem.

### Recurrent Neural Networks (RNN)
We start with the least accurate model, the standard Recurrent Neural Network (RNN). This Neural Network (NN) maintains a hidden "internal" state that is updated by a simple recurrence relation as the sequence is processed. A prevalent issue with this type of NN is the vanishing gradient problem, where the gradient signal from far-away time steps gets lost because it becomes much smaller in magnitude than the signal from nearby steps. In the opposite case, the exploding gradient, the stochastic gradient descent (SGD) updates can become too large. This causes instability in training and can push the model into a bad parameter configuration.

The vanishing gradient problem does not always occur, but it could contribute to the standard RNN's lower validation accuracy compared with the other architectures, which are designed to mitigate it (although it can still occur in them). There are multiple ways to mitigate these gradient problems: gradient clipping (for exploding gradients), ReLU activation functions, and changing the RNN architecture, which is what the next three models do.
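
Gradient clipping is not used in this notebook, but the idea can be sketched in a few lines of plain Python: if the gradient's L2 norm exceeds a threshold, rescale it so the norm equals the threshold. The function name `clip_by_norm` and the example values are assumptions for illustration. In PyTorch, the built-in `torch.nn.utils.clip_grad_norm_` performs norm-based clipping and would typically be called between `loss.backward()` and `optimizer.step()`.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

# A gradient with norm 50 is scaled down to norm 5; its direction is preserved.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)
```

Clipping limits the size of an update without changing its direction, so it addresses exploding gradients but does nothing for vanishing ones.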

### Long Short-term Memory (LSTM)
LSTM achieved the highest validation accuracy (83.89%), far above the standard RNN (75.37%). This is because it is a Gated Recurrent Network (GRN) that alleviates some difficulties in training the RNN and deals with the vanishing gradient problem. This is done with three gates that give the NN the ability to filter what passes from the cell state and hidden state at one step to the next. Although its recurrent layer has roughly four times as many parameters as the standard RNN's, the LSTM has more representational power.

The LSTM has three gates, each computed by applying a sigmoid to a weighted combination of the previous hidden state and the current input. The forget gate controls what information from the previous cell state is kept or forgotten. The input gate controls what part of the new cell content is written to the cell. Together, these mean that the new cell state erases some content from the last cell state and writes some new content. The output gate controls what part of the cell content is output to the hidden state: the hidden state is the elementwise product of the output gate and the tanh of the cell state.
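
The gate mechanics can be sketched as a single LSTM step with scalar state. This is a simplified illustration of the standard LSTM equations, not the internals of `nn.LSTM`; the function name `lstm_step`, the weight dictionary layout, and all weight values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar state; `w` maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))               # how much of c_prev to keep
    i = sigmoid(pre("input"))                # how much new content to write
    o = sigmoid(pre("output"))               # how much of the cell to expose
    candidate = math.tanh(pre("candidate"))  # proposed new cell content
    c = f * c_prev + i * candidate           # erase some old, write some new
    h = o * math.tanh(c)                     # hidden state reads from the cell
    return h, c

names = ("forget", "input", "output", "candidate")
weights = {name: (0.5, 0.5, 0.0) for name in names}  # arbitrary example weights
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, w=weights)
print(h, c)
```

Because the cell state is updated additively (`f * c_prev + i * candidate`) rather than being squashed through an activation at every step, gradients have a more direct path backward through time, which is what helps with the vanishing gradient problem.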

### Gated Recurrent Units (GRU)
GRU achieved the second-highest validation accuracy (82.49%). Being also a GRN, GRU has some advantages over LSTM. It is a simpler alternative with fewer parameters, which makes it faster to train while achieving comparable performance. Because of this, GRU is widely used. The main structural difference is that GRU does not maintain a separate cell state to store long-term information.
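
The parameter difference can be made concrete for the sizes used in this notebook (`embeddingSize = 100`, `hiddenSize = 10`). The sketch below assumes PyTorch's layout for single-layer recurrent modules: one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per weight block, with 1 block for `nn.RNN`, 3 for `nn.GRU`, and 4 for `nn.LSTM`. The function name is hypothetical, and only the recurrent layer is counted (not the embedding or the final linear layer).

```python
def recurrent_param_count(input_size, hidden_size, blocks, directions=1):
    """Parameters in a single-layer recurrent module with the given layout."""
    # Per block: input weights + recurrent weights + two bias vectors.
    per_block = hidden_size * (input_size + hidden_size) + 2 * hidden_size
    return blocks * per_block * directions

embedding_size, hidden_size = 100, 10  # values used in this notebook
counts = {
    "RNN": recurrent_param_count(embedding_size, hidden_size, blocks=1),
    "GRU": recurrent_param_count(embedding_size, hidden_size, blocks=3),
    "LSTM": recurrent_param_count(embedding_size, hidden_size, blocks=4),
    "BiLSTM": recurrent_param_count(embedding_size, hidden_size, blocks=4, directions=2),
}
print(counts)
# {'RNN': 1120, 'GRU': 3360, 'LSTM': 4480, 'BiLSTM': 8960}
```

Under these assumptions the GRU has three quarters of the LSTM's recurrent parameters, and the BiLSTM has twice as many as the LSTM.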

### Bidirectional Long Short-term Memory (BiLSTM)
In this run, BiLSTM reached 80.28% validation accuracy, above the standard RNN but below both the LSTM and the GRU, which is the one result that does not match expectations. A BiLSTM combines a forward and a backward LSTM, meaning information flows from both the left and right sides. The two hidden states created by the two LSTMs are concatenated. In principle, this gives the NN more representational power because the current prediction can draw on both the left (past) and the right (future) context. The downside is that it essentially trains two LSTMs, which requires far more computation than a single LSTM. We are also not limited to BiLSTM; other RNNs and GRNs can be used bidirectionally.
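
The concatenation step can be sketched with a toy example. For brevity this uses a plain scalar `tanh` recurrence instead of a full LSTM, and the weights for the two directions are arbitrary assumptions; the point is only that each direction reads the sequence in a different order and the two final states are joined. In the `MyRNN` class above, the same step is done by `torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)`.

```python
import math

def final_state(inputs, w_hh, w_xh):
    """Run a scalar tanh recurrence and return the last hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_hh * h + w_xh * x)
    return h

def bidirectional_summary(inputs):
    """Concatenate the final states of a forward and a backward pass."""
    # Each direction has its own weights, as in a real bidirectional layer.
    forward = final_state(inputs, w_hh=0.5, w_xh=1.0)
    backward = final_state(list(reversed(inputs)), w_hh=0.3, w_xh=0.8)
    return [forward, backward]

summary = bidirectional_summary([1.0, 0.0, -1.0])
print(summary)  # two values: one per direction
```

The summary is twice the size of a single direction's state, which is why the linear layer in `MyRNN` takes `hiddenSize * 2` inputs for the BiLSTM.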

## Versatility of LSTMs
LSTMs are versatile: their applications are extensive, and they can be applied in many different scenarios. One potential application in transportation is passenger flow forecasting. Looking into how LSTMs are used in transportation, I found quite a few journal articles describing a Spatio-Temporal LSTM (ST-LSTM), which extracts spatio-temporal features from the data and combines them as the input.

In the paper cited below, the researchers want to solve the problem of short-term traffic forecasting, describing its importance as "Accurate short-term traffic forecast can provide technical support for the surveillance and the forewarning of passenger flow." They chose to study the ST-LSTM because rail transit is uniform enough that the spatial correlation between stations can be transformed into a time cost.

The input to the model is the summation of estimated passenger flows from the other stations. The output is the forecast of the exit passenger flow at station $j$ at time $t$. To forecast it, passengers from station $k$ at time $t - \Delta * T _{k,j}$ have to be considered; these offsets are the spatial and temporal components.

For the loss function, the paper uses $loss = || \hat{x} _{out, j, t} - x _{out, j, t} ||^2_2$, where $\hat{x} _{out, j, t}$ is the forecast for station $j$ at time $t$ and $x _{out, j, t}$ is the actual output. Outside of this research paper, mean squared error would also be an effective loss function.
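
The two losses are closely related: the squared L2 norm sums the squared errors, while mean squared error divides that sum by the number of elements. A minimal sketch in plain Python, using made-up passenger counts (the numbers are not from the paper):

```python
def squared_l2_loss(forecast, actual):
    """Squared L2 norm of the forecast error."""
    return sum((f - a) ** 2 for f, a in zip(forecast, actual))

def mean_squared_error(forecast, actual):
    """Squared L2 loss averaged over the number of elements."""
    return squared_l2_loss(forecast, actual) / len(forecast)

forecast = [120.0, 95.0, 130.0]  # hypothetical forecast exit flows
actual = [110.0, 100.0, 130.0]   # hypothetical observed exit flows

print(squared_l2_loss(forecast, actual))     # 125.0
print(mean_squared_error(forecast, actual))
```

Since the two differ only by a constant factor for a fixed output size, they have the same minimizer; the factor only rescales the gradients.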

The use cases for LSTM are numerous, as the network is well suited to classifying, processing, and making predictions based on time series. Because of this, the NN is very prominent in Natural Language Processing (NLP), but as shown in this paper and many other examples, it is a powerful NN that, when implemented well, can be very successful in tasks unrelated to NLP.

Citation and journal link:
Tang, Q., Yang, M., & Yang, Y. (2019). ST-LSTM: A Deep Learning Approach Combined Spatio-Temporal Features for Short-Term Forecast in Rail Transit. Journal of Advanced Transportation, 2019, Article 8392592. Hindawi. ISSN 0197-6729. Academic editor: Helena Ramalhinho. https://doi.org/10.1155/2019/8392592 (https://www.hindawi.com/journals/jat/2019/8392592/)