# Chapter 15 - Modeling Sequential Data Using Recurrent Neural Networks

## Introducing Sequential Data

### Modeling Sequential Data - Order Matters

What makes sequences unique, compared to other types of data, is that elements in a sequence appear in a certain order and are not independent of each other. Typical ML algorithms for supervised learning assume that the input is **independent and identically distributed (IID)** data, which means that the training examples are *mutually independent* and have the same underlying distribution. Based on that assumption, the order in which the training examples are given to the model is irrelevant. 

This assumption does not hold for sequences: by definition, order matters.
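A minimal sketch of why order matters: an order-blind summary (a plain average) cannot tell a sequence from its permutation, while a simple recurrent update can. The function names `order_blind` and `order_aware` and the constants `w_in` and `w_rec` are illustrative assumptions, not part of the chapter.

```python
import math

def order_blind(seq):
    """Average of the elements: any permutation gives the same result."""
    return sum(seq) / len(seq)

def order_aware(seq, w_in=0.5, w_rec=0.9):
    """Toy recurrence: each step mixes the new element with a running state."""
    state = 0.0
    for x in seq:
        state = math.tanh(w_in * x + w_rec * state)
    return state

original = [1.0, 2.0, 3.0]
shuffled = [3.0, 1.0, 2.0]

# The average is identical for both orderings.
print(order_blind(original) == order_blind(shuffled))                  # True
# The recurrent state depends on the order in which elements arrive.
print(math.isclose(order_aware(original), order_aware(shuffled)))     # False
```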

### Sequential Data vs Time Series Data

Time series data is a special type of sequential data where each example is associated with a dimension for time. In time series data, samples are taken at successive timestamps, and therefore, the time dimension determines the order among the data points.

On the other hand, not all sequential data has a time dimension. For example, in text data or DNA sequences, the examples are ordered, but neither text nor DNA qualifies as time series data.

### Representing Sequences

For a concrete example of sequences, consider time series data, where each example point, x(t), belongs to a particular time, t:

![Alt text](../images/44.png)

RNNs are designed for modeling sequences and are capable of remembering past information and processing new events accordingly.

### The Different Categories of Sequence Modeling

![Alt text](../images/45.png)

If the input or output of a model is a sequence, the modeling task likely falls into one of these categories:

- **Many-to-one:** The input data is a sequence, but the output is a fixed-size vector or scalar, not a sequence. For example, in sentiment analysis, the input is text-based and the output is a class label.

- **One-to-many:** The input data is in standard format and not a sequence, but the output is a sequence. For example, image captioning.

- **Many-to-many:** Both the input and the output arrays are sequences. This category can be further divided based on whether the input and output are synchronized.
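The categories above differ mainly in which states of a recurrent model are read out. The following sketch uses a stand-in recurrence (a decaying running sum) rather than a trained network; `running_states` and `unroll` are hypothetical helpers introduced only for illustration.

```python
def running_states(seq):
    """Stand-in for a recurrent model: one state per input element."""
    states, state = [], 0.0
    for x in seq:
        # The new state depends on the current input and the previous state.
        state = 0.5 * state + x
        states.append(state)
    return states

def unroll(seed, steps):
    """Start from a single input and produce a sequence of the given length."""
    out, state = [], seed
    for _ in range(steps):
        state = 0.5 * state + 1.0
        out.append(state)
    return out

inputs = [1.0, 2.0, 3.0, 4.0]
states = running_states(inputs)

many_to_one = states[-1]                   # sequence in, single value out
many_to_many = states                      # sequence in, one output per input
one_to_many = unroll(seed=0.0, steps=3)    # single value in, sequence out

print(many_to_one, len(many_to_many), len(one_to_many))  # 6.125 4 3
```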

## RNNs For Modeling Sequences

### Understanding The Dataflow in RNNs

Comparison between dataflow in a standard feedforward NN and in an RNN:

![Alt text](../images/46.png)

In a standard feedforward network, information flows from the input to the hidden layer, and then from the hidden layer to the output layer. On the other hand, in an RNN, the hidden layer receives its input from both the input layer of the current time step and the hidden layer from the previous time step.

The flow of information in adjacent time steps in the hidden layer allows the network to have a memory of the past events. This flow of information is usually displayed as a loop, also known as a **recurrent edge** in graph notation, which is how this general RNN architecture got its name.

Note that it’s a common convention to refer to RNNs with one hidden layer as a single-layer RNN, which is not to be confused with single-layer NNs without a hidden layer, such as Adaline or logistic regression.



### Computing Activations in an RNN

Each directed edge (the connections between boxes) in the representation of an RNN that we just looked at is associated with a weight matrix. Those weights do not depend on time, t; therefore, they are shared across the time axis. The different weight matrices in a single-layer RNN are as follows:

- Wxh: The weight matrix between the input, x(t), and the hidden layer, h
- Whh: The weight matrix associated with the recurrent edge
- Who: The weight matrix between the hidden layer and output layer

![Alt text](../images/47.png)
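The role of the three weight matrices can be sketched as a forward pass in NumPy. This is a sketch under stated assumptions, not the chapter's implementation: the bias vectors `b_h` and `b_o`, the tanh hidden activation, the linear output layer, and the random weight values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 5, 2, 3

# Weight matrices shared across all time steps (random values for illustration only).
W_xh = rng.normal(size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden))  # recurrent edge: hidden -> hidden
W_ho = rng.normal(size=(n_out, n_hidden))     # hidden -> output
b_h = np.zeros(n_hidden)                      # assumed bias terms
b_o = np.zeros(n_out)

def rnn_forward(x_seq):
    """Return hidden states and outputs for a sequence of shape (T, n_in)."""
    h = np.zeros(n_hidden)  # initial hidden state
    hidden_states, outputs = [], []
    for x_t in x_seq:
        # Hidden layer: current input plus hidden state of the previous time step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        # Output layer: reads only the current hidden state.
        o = W_ho @ h + b_o
        hidden_states.append(h)
        outputs.append(o)
    return np.stack(hidden_states), np.stack(outputs)

x_seq = np.ones((3, n_in))
H, O = rnn_forward(x_seq)
print(H.shape, O.shape)  # (3, 2) (3, 3)
```

The same three matrices are reused at every iteration of the loop, which is what sharing weights across the time axis means in practice.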

### Hidden Recurrence vs Output Recurrence

So far, we have looked at RNNs in which the hidden layer has the recurrent property. There is an alternative model in which the recurrent connection comes from the output layer instead. In this case, the net activations from the output layer at the previous time step, o(t-1), can be added in one of two ways:

- To the hidden layer at the current time step, h(t)
- To the output layer at the current time step, o(t)

![Alt text](../images/48.png)
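The two variants of output recurrence can be sketched as single-step functions in NumPy. The matrix names `W_oh` and `W_oo`, the tanh activation, and the omission of bias terms are assumptions made to keep the example short; the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 3, 2

W_xh = rng.normal(size=(n_hidden, n_in))   # input -> hidden
W_ho = rng.normal(size=(n_out, n_hidden))  # hidden -> output
W_oh = rng.normal(size=(n_hidden, n_out))  # hypothetical: previous output -> hidden
W_oo = rng.normal(size=(n_out, n_out))     # hypothetical: previous output -> output

def output_to_hidden_step(x_t, o_prev):
    """The previous output feeds the hidden layer at the current time step."""
    h_t = np.tanh(W_xh @ x_t + W_oh @ o_prev)
    o_t = W_ho @ h_t
    return h_t, o_t

def output_to_output_step(x_t, o_prev):
    """The previous output feeds the output layer at the current time step."""
    h_t = np.tanh(W_xh @ x_t)
    o_t = W_ho @ h_t + W_oo @ o_prev
    return h_t, o_t

x_seq = np.ones((3, n_in))
o = np.zeros(n_out)  # no previous output before the first time step
for x_t in x_seq:
    h, o = output_to_hidden_step(x_t, o)
print(h.shape, o.shape)  # (3,) (2,)
```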

Creating the layer and assigning the weights and biases for our manual computations:

In [2]:
import torch
import torch.nn as nn
torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)
w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0
print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)

W_xh shape: torch.Size([2, 5])
W_hh shape: torch.Size([2, 2])
b_xh shape: torch.Size([2])
b_hh shape: torch.Size([2])




In [3]:
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()
## output of the simple RNN:
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))
## manually computing the output:
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print(' Input :', xt.numpy())
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print(' Hidden :', ht.detach().numpy())
    if t > 0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print(' Output (manual) :', ot.detach().numpy())
    print(' RNN output :', output[:, t].detach().numpy())
    print()


## Implementing RNNs For Sequence Modeling in PyTorch

### Project One - Predicting The Sentiment Of IMDb Movie Reviews

#### Preparing the Movie Review Data

In [4]:
from torchtext.datasets import IMDB
train_dataset = IMDB(split='train')
test_dataset = IMDB(split='test')


Each set has 25,000 samples, and each sample consists of two elements: the sentiment label we want to predict (neg refers to negative sentiment and pos refers to positive sentiment) and the movie review text (the input features). The text component of each review is a sequence of words, and the RNN model classifies each sequence as a positive (1) or negative (0) review.
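Before an RNN can consume these reviews, the words have to become integers and the labels have to become 0 or 1. The sketch below shows one way to do this with the standard library only. The two reviews are made up for illustration, and the `<pad>` and `<unk>` entries, the regular expression, and the helper names are assumptions rather than the chapter's exact preprocessing.

```python
import re
from collections import Counter

# Made-up reviews for illustration; the real dataset yields (label, text) pairs.
samples = [
    ("pos", "A great movie, truly great acting!"),
    ("neg", "A dull movie with dull acting."),
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Count word frequencies, then give more frequent words smaller indices.
counts = Counter(tok for _, text in samples for tok in tokenize(text))
vocab = {"<pad>": 0, "<unk>": 1}  # reserved entries (an assumption)
for word, _ in counts.most_common():
    vocab[word] = len(vocab)

def encode(text):
    """Map each token to its integer index, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

label_to_int = {"neg": 0, "pos": 1}
encoded = [(label_to_int[label], encode(text)) for label, text in samples]
print(encoded[0])
```

Because reviews differ in length, the encoded sequences would still need padding to a common length before batching.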

### Project Two - Character-Level Language Modeling in PyTorch

In the model we'll build now, the input is a text document, and our goal is to develop a model that can generate new text that is similar in style to the input document. 

In character-level modeling, the input is broken down into a sequence of characters that are fed into our network one character at a time. The network will process each new character in conjunction with the memory of the previously seen characters to predict the next one.

![Alt text](../images/49.png)

#### Processing the Dataset

In [5]:
import numpy as np
## Reading and processing text
with open('1268-0.txt', 'r', encoding="utf8") as fp:
    text=fp.read()

start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
text = text[start_indx:end_indx]
char_set = set(text)

In [6]:
print('Total Length:', len(text))
print('Unique Characters:', len(char_set))

Total Length: 1112310
Unique Characters: 80


After downloading and preprocessing the text, we have a sequence consisting of 1,112,310 characters in total and 80 unique characters. However, most NN libraries and RNN implementations cannot deal with input data in string format, which is why we have to convert the text into a numeric format. To do this, we will create a simple Python dictionary, char2int, that maps each character to an integer. We will also need a reverse mapping to convert the results of our model back to text. Although the reverse can be done using a dictionary that associates integer keys with character values, using a NumPy array and indexing the array to map indices to those unique characters is more efficient.

In [7]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)

print('Text encoded shape:', text_encoded.shape)
print(text[:15], '== Encoding ==>', text_encoded[:15])
print(text_encoded[15:21], '== Reverse ==>',''.join(char_array[text_encoded[15:21]]))

Text encoded shape: (1112310,)
THE MYSTERIOUS  == Encoding ==> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]
[33 43 36 25 38 28] == Reverse ==> ISLAND


In [8]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

44 -> T
32 -> H
29 -> E
1 ->  
37 -> M


Now, let’s step back and look at the big picture of what we are trying to do. For the text generation task, we can formulate the problem as a classification task.

Suppose we have a set of sequences of text characters that are incomplete:

![Alt text](../images/50.png)

Starting with a sequence of length 1 (that is, one single letter), we can iteratively generate new text based on this multiclass classification approach.

![Alt text](../images/50.png)
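The iterative scheme can be sketched with a deliberately simple stand-in for the RNN: a table of character-pair counts that always predicts the most frequent successor. The toy corpus and the helpers `predict_next` and `generate` are assumptions for illustration; the point is only the loop that feeds each prediction back in as the next input.

```python
from collections import Counter, defaultdict

corpus = "abcabcabc"  # toy corpus, chosen so the result is easy to check by hand

# "Training": count which character follows each character.
follow_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follow_counts[current][nxt] += 1

def predict_next(ch):
    """Classification step: pick the most frequent successor of ch."""
    return follow_counts[ch].most_common(1)[0][0]

def generate(start, length):
    """Repeatedly append the predicted next character to the generated text."""
    text = start
    for _ in range(length):
        text += predict_next(text[-1])
    return text

print(generate("a", 5))  # abcabc
```

Unlike this stand-in, which looks only at the last character, an RNN conditions each prediction on its hidden state and therefore on the characters seen so far.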

To implement the text generation task in PyTorch, let’s first clip the sequence length to 40. This means that the input tensor, x, consists of 40 tokens. In practice, the sequence length impacts the quality of the generated text. With shorter sequences, the model might focus on capturing individual words correctly while ignoring the context for the most part. Longer sequences usually result in more meaningful sentences, but the RNN model will have problems capturing long-range dependencies in them. Thus, finding a good value for the sequence length is a hyperparameter optimization problem, which we have to evaluate empirically. Here, we are going to choose 40, as it offers a good trade-off.
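To make the relationship between input and target concrete, here is a small sketch on a short string with a sequence length of 4 instead of 40. Each chunk holds one character more than the sequence length, so the input and the target are the same chunk shifted by one position. The example string is an assumption chosen for readability.

```python
text = "hello world"
seq_length = 4
chunk_size = seq_length + 1

# Slide a window of chunk_size characters over the text, one position at a time.
chunks = [text[i:i + chunk_size] for i in range(len(text) - chunk_size + 1)]

# The input drops the last character; the target drops the first one.
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for x, y in pairs[:3]:
    print(repr(x), '->', repr(y))
# 'hell' -> 'ello'
# 'ello' -> 'llo '
# 'llo ' -> 'lo w'
```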

In [10]:
import torch
from torch.utils.data import Dataset

seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded)-chunk_size+1)]

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
    
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

# Stack the chunks into one NumPy array first (building a tensor directly from a
# list of arrays is slow), then convert it to a tensor with dtype=int64
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks), dtype=torch.int64))


Let’s take a look at some example sequences from this transformed dataset:

In [13]:
for i, (seq, target) in enumerate(seq_dataset):
    print('Input (x): ', repr(''.join(char_array[seq.tolist()])))
    print('Target (y): ', repr(''.join(char_array[target.tolist()])))
    print()

    if i == 1:
        break   

Input (x):  'THE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n1'
Target (y):  'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'

Input (x):  'HE MYSTERIOUS ISLAND\n\nby Jules Verne\n\n18'
Target (y):  'E MYSTERIOUS ISLAND\n\nby Jules Verne\n\n187'



In [15]:
from torch.utils.data import DataLoader

batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

#### Building a Character-Level RNN Model

Now that the dataset is ready, building the model will be relatively straightforward:

In [16]:
import torch.nn as nn
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell

In [17]:
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)

In [18]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [None]:
num_epochs = 10000
torch.manual_seed(1)

for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    loss = loss.item()/seq_length
    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.3720
Epoch 500 loss: 1.5593
Epoch 1000 loss: 1.3769
Epoch 1500 loss: 1.3496
Epoch 2000 loss: 1.2378
Epoch 2500 loss: 1.1915
Epoch 3000 loss: 1.1460
Epoch 3500 loss: 1.1604
Epoch 4000 loss: 1.1008
Epoch 4500 loss: 1.1093
Epoch 5000 loss: 1.0917


#### Evaluation Phase - Generating New Text Passages

In [None]:
from torch.distributions.categorical import Categorical

torch.manual_seed(1)

logits = torch.tensor([[1.0, 1.0, 1.0]])
print('Probabilities:', nn.functional.softmax(logits, dim=1).numpy()[0])
m = Categorical(logits=logits)
samples = m.sample((10,))
print(samples.numpy())

Probabilities: [0.33333334 0.33333334 0.33333334]

In [None]:
torch.manual_seed(1)
logits = torch.tensor([[1.0, 1.0, 3.0]])
print('Probabilities:', nn.functional.softmax(logits, dim=1).numpy()[0])
m = Categorical(logits=logits)
samples = m.sample((10,))
print(samples.numpy())

Probabilities: [0.10650698 0.10650698 0.78698605]

In [None]:
def sample(model, starting_str,
           len_generated_text=500,
           scale_factor=1.0):
    # Encode the starting string as a (1, length) tensor of character indices
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))
    generated_str = starting_str

    model.eval()
    hidden, cell = model.init_hidden(1)
    # Feed all but the last starting character to build up the hidden and cell states
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(
            encoded_input[:, c].view(1), hidden, cell
        )

    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        logits, hidden, cell = model(
            last_char.view(1), hidden, cell
        )
        logits = torch.squeeze(logits, 0)
        # Scale the logits to control randomness, then sample the next character
        scaled_logits = logits * scale_factor
        m = Categorical(logits=scaled_logits)
        last_char = m.sample()
        generated_str += str(char_array[last_char])

    return generated_str

In [None]:
torch.manual_seed(1)
print(sample(model, starting_str='The island'))

In [None]:
logits = torch.tensor([[1.0, 1.0, 3.0]])
print('Probabilities before scaling: ', nn.functional.softmax(logits, dim=1).numpy()[0])
print('Probabilities after scaling with 0.5:', nn.functional.softmax(0.5*logits, dim=1).numpy()[0])
print('Probabilities after scaling with 0.1:', nn.functional.softmax(0.1*logits, dim=1).numpy()[0])
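The effect of `scale_factor` on sampling can be reproduced without PyTorch. The sketch below implements a softmax over scaled logits with the standard library; the helper name `softmax` and the extra scale value 2.0 are assumptions added for illustration.

```python
import math

def softmax(logits, scale_factor=1.0):
    """Softmax over logits that are first multiplied by scale_factor."""
    scaled = [scale_factor * z for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 1.0, 3.0]
for alpha in (2.0, 1.0, 0.5, 0.1):
    probs = softmax(logits, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

Scale factors below 1 flatten the distribution toward uniform, so sampling becomes more random; scale factors above 1 concentrate probability on the largest logit, so the generated text becomes more predictable.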