# Chapter 15 - Modeling Sequential Data Using Recurrent Neural Networks

## Introducing Sequential Data

### Modeling Sequential Data - Order Matters

What makes sequences unique, compared to other types of data, is that elements in a sequence appear in a certain order and are not independent of each other. Typical ML algorithms for supervised learning assume that the input is **independent and identically distributed (IID)** data, which means that the training examples are *mutually independent* and have the same underlying distribution. Based on that assumption, the order in which the training examples are given to the model is irrelevant. 

This assumption does not hold for sequences: by definition, order matters.
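A small illustration of the difference, using made-up numbers: a statistic such as the mean ignores the order of the examples, while anything defined on neighbouring elements does not. The `readings` list and both helper functions are hypothetical and exist only for this sketch.

```python
# Hypothetical sequence of readings (values are made up for illustration).
readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def mean(values):
    # Order-independent: any permutation gives the same result.
    return sum(values) / len(values)

def step_differences(values):
    # Difference between each element and its predecessor; depends on order.
    return [b - a for a, b in zip(values, values[1:])]

# Same elements, different order.
reordered = list(reversed(readings))

print(mean(readings) == mean(reordered))
print(step_differences(readings))
print(step_differences(reordered))
```

The mean is unchanged by the reordering, but the step-to-step structure is reversed, which is exactly the information a sequence model relies on.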

### Sequential Data vs Time Series Data

Time series data is a special type of sequential data in which each example is associated with a point in time. Samples are taken at successive timestamps, so the time dimension determines the order among the data points.

On the other hand, not all sequential data has a time dimension. For example, the elements of text data or DNA sequences are ordered, but neither qualifies as time series data.

### Representing Sequences

For a concrete example of a sequence, consider time series data, where each example point x(t) belongs to a particular time t:

![Alt text](../images/44.png)

RNNs are designed for modeling sequences and are capable of remembering past information and processing new events accordingly.

### The Different Categories of Sequence Modeling

![Alt text](../images/45.png)

If the input or output of a model is a sequence, the modeling task likely falls into one of these categories:

- **Many-to-one:** The input data is a sequence, but the output is a fixed-size vector or scalar, not a sequence. For example, in sentiment analysis, the input is text-based and the output is a class label.

- **One-to-many:** The input data is in standard format and not a sequence, but the output is a sequence. An example is image captioning, where the input is an image and the output is a sequence of words describing it.

- **Many-to-many:** Both the input and the output arrays are sequences. This category can be further divided based on whether the input and output are synchronized.
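In code, the many-to-one and synchronized many-to-many cases differ mainly in which part of the recurrent layer's output is passed on. The following is a minimal sketch, assuming PyTorch's `nn.RNN`; the dimensions and the two `nn.Linear` heads are hypothetical and chosen only to make the shapes visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions, chosen only for illustration.
batch_size, seq_len, input_size, hidden_size = 4, 7, 5, 3

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# outputs: hidden state at every time step; h_n: hidden state after the last step.
outputs, h_n = rnn(x)

# Many-to-many (synchronized): one prediction per time step.
per_step_head = nn.Linear(hidden_size, 2)
many_to_many = per_step_head(outputs)      # shape (batch, seq_len, 2)

# Many-to-one: one prediction per sequence, taken from the final hidden state.
sequence_head = nn.Linear(hidden_size, 1)
many_to_one = sequence_head(h_n[-1])       # shape (batch, 1)

print(many_to_many.shape)
print(many_to_one.shape)
```

A one-to-many model would instead feed a single non-sequential input (for example, an image embedding) into the recurrent layer and unroll it for as many steps as output elements are needed.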

## RNNs For Modeling Sequences

### Understanding The Dataflow in RNNs

Comparison between dataflow in a standard feedforward NN and in an RNN:

![Alt text](../images/46.png)

In a standard feedforward network, information flows from the input to the hidden layer, and then from the hidden layer to the output layer. On the other hand, in an RNN, the hidden layer receives its input from both the input layer of the current time step and the hidden layer from the previous time step.

The flow of information in adjacent time steps in the hidden layer allows the network to have a memory of the past events. This flow of information is usually displayed as a loop, also known as a **recurrent edge** in graph notation, which is how this general RNN architecture got its name.
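To make the idea of memory concrete, here is a minimal sketch of a recurrent layer with a single hidden unit, written in plain Python. The weights `w_x` and `w_h` are arbitrary values assumed for illustration, not taken from any trained model.

```python
import math

def run_scalar_rnn(inputs, w_x=0.5, w_h=0.8):
    """Unroll a one-unit recurrent layer over a list of scalar inputs."""
    h = 0.0  # initial hidden state: no memory yet
    states = []
    for x in inputs:
        # The recurrent edge: the new state depends on the previous state h.
        # The same two weights are reused at every time step.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# The input at the last step is 1.0 in both sequences, but the history differs.
states_a = run_scalar_rnn([0.0, 0.0, 1.0])
states_b = run_scalar_rnn([1.0, 1.0, 1.0])
print(states_a[-1], states_b[-1])
```

Both sequences end with the same input, yet the final hidden states differ, because the hidden state carries information from the earlier time steps.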

Note that it’s a common convention to refer to RNNs with one hidden layer as a single-layer RNN, which is not to be confused with single-layer NNs without a hidden layer, such as Adaline or logistic regression.



### Computing Activations in an RNN

Each directed edge (the connections between boxes) in the representation of an RNN that we just looked at is associated with a weight matrix. Those weights do not depend on time, t; therefore, they are shared across the time axis. The different weight matrices in a single-layer RNN are as follows:

- Wxh: The weight matrix between the input, x(t), and the hidden layer, h
- Whh: The weight matrix associated with the recurrent edge
- Who: The weight matrix between the hidden layer and output layer

![Alt text](../images/47.png)
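Written out, the computation for a single-layer RNN with hidden recurrence takes the following standard form. The bias vectors $b_h$ and $b_o$ and the activation functions $\sigma_h$ and $\sigma_o$ are notation assumed here; they are not named in the text above.

```latex
\begin{aligned}
z_h^{(t)} &= W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h \\
h^{(t)}   &= \sigma_h\!\left(z_h^{(t)}\right) \\
o^{(t)}   &= \sigma_o\!\left(W_{ho}\, h^{(t)} + b_o\right)
\end{aligned}
```

PyTorch's `nn.RNN` stores the hidden-layer bias as two separate vectors, `bias_ih_l0` and `bias_hh_l0`; their sum plays the role of $b_h$ in this notation.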

### Hidden Recurrence vs Output Recurrence

So far, the recurrence has come from the hidden layer. In an alternative model, the recurrent connection comes from the output layer instead. In this case, the net activations from the output layer at the previous time step, o(t-1), can be added in one of two ways:

- To the hidden layer at the current time step, h(t)
- To the output layer at the current time step, o(t)

![Alt text](../images/48.png)
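The first of these two variants, in which the previous output feeds the current hidden layer, can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the weight names `W_xh`, `W_oh`, and `W_ho`, the random initialization, and the omission of bias terms are all choices made for brevity.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, output_size = 5, 2, 3

# Hypothetical weights; W_oh carries the previous output into the hidden layer.
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_oh = torch.randn(hidden_size, output_size) * 0.1
W_ho = torch.randn(output_size, hidden_size) * 0.1

x_seq = torch.randn(4, input_size)   # 4 time steps
o_prev = torch.zeros(output_size)    # no previous output at the first step

outputs = []
for x_t in x_seq:
    # Output-to-hidden recurrence: the hidden layer sees o(t-1), not h(t-1).
    h_t = torch.tanh(W_xh @ x_t + W_oh @ o_prev)
    o_t = W_ho @ h_t                 # net activation of the output layer
    outputs.append(o_t)
    o_prev = o_t

outputs = torch.stack(outputs)
print(outputs.shape)
```

For the second variant, the term involving `o_prev` would move from the hidden-layer line to the output-layer line, with a weight matrix of shape `(output_size, output_size)`.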

We create the layer and extract its weights and biases for our manual computations:

In [1]:
import torch
import torch.nn as nn
torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)
w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0
print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)

W_xh shape: torch.Size([2, 5])
W_hh shape: torch.Size([2, 2])
b_xh shape: torch.Size([2])
b_hh shape: torch.Size([2])


In [3]:
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()
## output of the simple RNN:
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))
## manually computing the output:
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print(' Input :', xt.numpy())
    ## input-to-hidden contribution uses the input bias, b_xh
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print(' Hidden :', ht.detach().numpy())
    if t > 0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))
    ## recurrent contribution uses the hidden bias, b_hh
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print(' Output (manual) :', ot.detach().numpy())
    print(' RNN output :', output[:, t].detach().numpy())
    print()

Time step 0 =>
 Input : [[1. 1. 1. 1. 1.]]
 Hidden : [[-0.4701929  0.5863904]]
 Output (manual) : [[-0.3519801   0.52525216]]
 RNN output : [[-0.3519801   0.52525216]]

Time step 1 =>
 Input : [[2. 2. 2. 2. 2.]]
 Hidden : [[-0.88883156  1.2364397 ]]
 Output (manual) : [[-0.68424344  0.76074266]]
 RNN output : [[-0.68424344  0.76074266]]

Time step 2 =>
 Input : [[3. 3. 3. 3. 3.]]
 Hidden : [[-1.3074701  1.886489 ]]
 Output (manual) : [[-0.8649416  0.9046636]]
 RNN output : [[-0.8649416  0.9046636]]
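Rather than comparing printed numbers by eye, a manual forward pass can be checked against `nn.RNN` programmatically. The following is a compact sketch of such a check; it rebuilds the same layer and input as above and compares the results with `torch.allclose`. The tolerance `atol=1e-6` is an assumption suited to float32 arithmetic.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)
x_seq = torch.tensor([[1.0] * 5, [2.0] * 5, [3.0] * 5])

with torch.no_grad():
    output, _ = rnn(x_seq.reshape(1, 3, 5))

    h = torch.zeros(1, 2)  # initial hidden state
    manual = []
    for t in range(3):
        x_t = x_seq[t].reshape(1, 5)
        # Input contribution uses bias_ih_l0; recurrent contribution uses bias_hh_l0.
        z = (x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
             + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        h = torch.tanh(z)
        manual.append(h)

manual = torch.stack(manual, dim=1)  # shape (1, 3, 2), same layout as `output`
print(torch.allclose(manual, output, atol=1e-6))
```

If the two bias vectors are mixed up in the manual computation, this check fails, which makes that kind of slip easy to catch.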



## Implementing RNNs For Sequence Modeling in PyTorch

### Project One - Predicting The Sentiment Of IMDb Movie Reviews

#### Preparing the Movie Review Data

In [14]:
from torchtext.datasets import IMDB
train_dataset = IMDB(split='train')
test_dataset = IMDB(split='test')


Each set contains 25,000 samples. Each sample consists of two elements: the sentiment label, which is the target we want to predict (neg refers to negative sentiment and pos to positive sentiment), and the movie review text (the input features). The text of each review is a sequence of words, and the RNN model classifies each sequence as a positive (1) or negative (0) review.
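Before an RNN can consume the reviews, each text has to be turned into a sequence of integers and each label into 0 or 1. A minimal sketch of that step in plain Python is shown here. The two `samples` are hypothetical stand-ins, not taken from the IMDb data, and the regex tokenizer, the `<pad>`/`<unk>` entries, and the helper names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for (label, text) samples; not taken from the IMDb data.
samples = [
    ("pos", "A wonderful, moving film. Wonderful cast!"),
    ("neg", "Dull plot and a dull script."),
]

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; a deliberately simple rule.
    return re.findall(r"[a-z']+", text.lower())

# Count tokens over the training texts and order them by frequency.
counts = Counter(tok for _, text in samples for tok in tokenize(text))

# Index 0 is reserved for padding and 1 for unknown tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for token, _ in counts.most_common():
    vocab[token] = len(vocab)

def encode(text):
    # Map each token to its integer index, falling back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

label_to_int = {"neg": 0, "pos": 1}

for label, text in samples:
    print(label_to_int[label], encode(text))
print(encode("a wonderful surprise"))
```

Words that never occurred in the training texts, such as "surprise" here, map to the `<unk>` index, so the model can still process reviews containing unseen vocabulary.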

### Project Two - Character-Level Language Modeling in PyTorch

In the model we'll build now, the input is a text document, and our goal is to develop a model that can generate new text that is similar in style to the input document. 

In character-level modeling, the input is broken down into a sequence of characters that are fed into our network one character at a time. The network will process each new character in conjunction with the memory of the previously seen characters to predict the next one.

![Alt text](../images/49.png)

#### Processing the Dataset

In [15]:
import numpy as np
## Reading and processing text
with open('1268-0.txt', 'r', encoding="utf8") as fp:
    text=fp.read()

start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
text = text[start_indx:end_indx]
char_set = set(text)

In [16]:
print('Total Length:', len(text))
print('Unique Characters:', len(char_set))

Total Length: 1112310
Unique Characters: 80


After downloading and preprocessing the text, we have a sequence consisting of 1,112,310 characters in total and 80 unique characters. However, most NN libraries and RNN implementations cannot deal with input data in string format, which is why we have to convert the text into a numeric format. To do this, we will create a simple Python dictionary, char2int, that maps each character to an integer. We will also need a reverse mapping to convert the results of our model back to text. Although the reverse can be done using a dictionary that associates integer keys with character values, using a NumPy array and indexing the array to map indices to those unique characters is more efficient.

In [18]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)

print('Text encoded shape:', text_encoded.shape)
print(text[:15], '== Encoding ==>', text_encoded[:15])
print(text_encoded[15:21], '== Reverse ==>',''.join(char_array[text_encoded[15:21]]))

Text encoded shape: (1112310,)
THE MYSTERIOUS  == Encoding ==> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28] == Reverse ==> ISLAND


In [19]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

44 -> T
32 -> H
29 -> E
1 ->  
37 -> M


Now, let’s step back and look at the big picture of what we are trying to do. For the text generation task, we can formulate the problem as a classification task.

Suppose we have a set of sequences of text characters that are incomplete:

![Alt text](../images/50.png)
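One way to turn a text into such a classification dataset is to cut it into overlapping windows and shift each window by one character to obtain the targets. The sketch below assumes a hypothetical window size `seq_length = 5` and uses a short string in place of the full book text.

```python
text = "THE MYSTERIOUS ISLAND"
seq_length = 5               # hypothetical window size
chunk_size = seq_length + 1  # one extra character supplies the final target

# Slide a window of chunk_size characters over the text, one position at a time.
chunks = [text[i:i + chunk_size] for i in range(len(text) - chunk_size + 1)]

# Input: all but the last character. Target: the same window shifted by one.
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for inp, tgt in pairs[:3]:
    print(repr(inp), '->', repr(tgt))
```

At every position, the target character is the class label for the characters seen so far, and the number of classes equals the number of unique characters in the text.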

Starting with a sequence of length 1 (that is, one single letter), we can iteratively generate new text based on this multiclass classification approach.

![Alt text](../images/50.png)
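The generation loop itself can be sketched without a trained network. In the example below, a table of character-pair counts stands in for the RNN: it is a much simpler bigram model, swapped in only so that the loop is runnable, and it looks at just the last character rather than the whole history. The `generate` function and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

text = "the mysterious island the mysterious island"

# Stand-in "model": for each character, count which characters follow it.
# A trained RNN would replace this table with predicted class probabilities.
next_counts = defaultdict(Counter)
for current, following in zip(text, text[1:]):
    next_counts[current][following] += 1

def generate(start, length, seed=0):
    rng = random.Random(seed)
    generated = start
    for _ in range(length):
        options = next_counts[generated[-1]]
        if not options:  # no observed successor: stop early
            break
        chars, weights = zip(*options.items())
        # Sample the next character in proportion to how often it was seen.
        generated += rng.choices(chars, weights=weights, k=1)[0]
    return generated

print(generate('t', 20))
```

Each iteration appends one sampled character and then uses the extended sequence as the input for the next prediction, which is the same loop an RNN-based generator would follow.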