<center> <h1> Lecture 7: Sequence Models and Recurrent Neural Networks </h1> </center>
<center> Krishna Pillutla, Zaid Harchaoui </center>
    <center> Data 598 (Winter 2022), University of Washington </center>

In this lecture, we will talk about models which deal with sequential data, and recurrent neural networks in particular.

The example in this notebook is based on [this PyTorch tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html) and is inspired by the [D2L book](https://d2l.ai/).

## Autoregressive Models

We will focus on autoregressive models as they capture the essence of modeling sequences. 

Suppose we have a sequence $x_1, x_2, \cdots$, for example, a sequence of words which make up a novel. 

Given a certain part of the sequence, we aim to model the upcoming (and unseen) parts of the sequence. That is, we aim to model 
$$
    x_t \sim P(\cdot \, | \, x_{t-1}, \cdots, x_1) .
$$
![](https://miro.medium.com/max/1734/1*_MrDp6w3Xc-yLuCTbco0xw.png)

**Latent Autoregressive Models**:
We assume that the data $(x_1, \cdots, x_{t-1})$ is summarized by a *latent state* or *hidden state* $h_t$. The model is 
$$
    x_t \sim P(\cdot | h_t)
$$
and the next hidden state $h_{t+1}$ is computed from the current hidden state $h_t$ and the data $x_t$ at this time step as 
$$
    h_{t+1} = g(x_t, h_t) 
$$
for some function $g$. 

The **key modeling assumption** is that $h_t$ is a fixed-dimensional vector: its size does not depend on the length $t$ of the sequence observed so far. 
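As a minimal sketch of this assumption, the toy example below runs the recurrence $h_{t+1} = g(x_t, h_t)$ with a hypothetical update function `g` (an exponential moving average with an assumed `decay` parameter, chosen only for illustration and not a learned model). It shows that the state keeps the same dimension however long the sequence gets.

```python
# Toy latent-state recurrence h_{t+1} = g(x_t, h_t).
# `g`, `decay`, and `summarize` are illustrative assumptions, not the lecture's model.

def g(x_t, h_t, decay=0.5):
    """Blend the new observation x_t into the fixed-size state h_t."""
    return [decay * h + (1.0 - decay) * x for x, h in zip(x_t, h_t)]

def summarize(sequence, state_dim=2):
    """Run the recurrence over a sequence of vectors, starting from h = 0."""
    h = [0.0] * state_dim
    for x_t in sequence:
        h = g(x_t, h)
    return h

short = [[1.0, 0.0]] * 3
long = [[1.0, 0.0]] * 300
# Both summaries have exactly state_dim entries, regardless of sequence length.
print(len(summarize(short)), len(summarize(long)))
```

An RNN follows the same pattern, except that the update function has learnable parameters.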

![](http://d2l.ai/_images/sequence-model.svg)

**Note**: Hidden Markov Models (HMMs) are special instances of latent autoregressive models as well. In this case, the hidden state is updated in a stochastic manner independent of $x_t$, i.e., $h_{t+1} \sim P_{\text{transition}}( \cdot | h_t)$, where $P_{\text{transition}}$ is called the transition kernel.

## Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a generalization of multi-layer perceptrons (MLPs) to the case of sequential data. Here, the hidden state update function $g(x_t, h_t)$ from above is parameterized with learnable parameters.

**MLP**: 
Recall that an MLP with a single hidden layer is a map $\psi: \mathbb{R}^{d_0} \to \mathbb{R}^{d_2}$ that can be written as 
$$
    \psi(x) =
    W_2^\top \sigma(h_1) + b_2, \\
     h_1 = W_{1}^\top x + b_1 , 
$$
where $W_j \in \mathbb{R}^{d_{j-1}\times d_j}$ is a weight matrix and $b_j \in \mathbb{R}^{d_j}$ is a bias vector and $\sigma$ is the *activation function*. 
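A minimal NumPy sketch of this map, assuming hypothetical layer sizes `d0, d1, d2`, random weights, and ReLU as the activation $\sigma$ (the formula above does not fix a particular activation):

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 8, 3  # hypothetical layer sizes

# W_j has shape (d_{j-1}, d_j), matching the convention in the text.
W1, b1 = rng.normal(size=(d0, d1)), np.zeros(d1)
W2, b2 = rng.normal(size=(d1, d2)), np.zeros(d2)

def mlp(x):
    h1 = W1.T @ x + b1                      # hidden pre-activation, shape (d1,)
    return W2.T @ np.maximum(h1, 0.0) + b2  # sigma = ReLU (an assumption)

x = rng.normal(size=d0)
print(mlp(x).shape)  # (3,)
```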


**RNN**: In the case of RNNs, the data $x = (x_1, x_2, \cdots)$ itself is sequential. We start with $h_0 = 0$ and we have,
$$
    \psi_t(x) = W_o^\top h_{t-1} + b_o \\
    h_{t} = \sigma(W_{hi}^\top x_t + W_{hh}^\top h_{t-1} + b_h)
$$
![](http://d2l.ai/_images/rnn.svg)
In this figure $X_t = x_t$ is the input (a vector of size `input_size`) and $H_t = h_t$ is the hidden state (a vector of size `hidden_size`).
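The two RNN equations can be sketched in NumPy as follows. The sizes, the random initialization, and the choice $\sigma = \tanh$ are assumptions made for illustration; the helper name `rnn_step` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 5, 8, 5  # hypothetical sizes

W_hi = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_o = rng.normal(scale=0.1, size=(hidden_size, output_size))
b_o = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """Return (psi_t, h_t) for one time step, with sigma = tanh."""
    psi_t = W_o.T @ h_prev + b_o                         # output uses h_{t-1}
    h_t = np.tanh(W_hi.T @ x_t + W_hh.T @ h_prev + b_h)  # new hidden state
    return psi_t, h_t

xs = rng.normal(size=(6, input_size))  # a sequence of 6 input vectors
h = np.zeros(hidden_size)              # h_0 = 0
for x_t in xs:
    psi_t, h = rnn_step(x_t, h)

print(psi_t.shape, h.shape)  # (5,) (8,)
```

The same weights are reused at every time step; only the hidden state changes as the sequence is consumed.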

**NOTE**: There is a difference between a hidden layer of an MLP and a hidden state of an RNN. 
Both are similar in the sense that they are unobservable, i.e., they are neither an input nor an output. However, the RNN has a temporal component that is absent in an MLP: the input of an RNN itself is sequential. 

The hidden state $h_{t-1}$ is an input to whatever we do at time step $t$. It is computed from the data $X_1, \ldots, X_{t-1}$ seen so far.

We can have a multi-layer RNN architecture with *hidden RNN layers*. For details, see [Chap. 10.3 of the D2L book](https://d2l.ai/chapter_recurrent-modern/deep-rnn.html): 
![](https://d2l.ai/_images/deep-rnn.svg)

# Example: Learning Names from Different Languages

We will aim to generate names based on a particular language category. 

We will do this by learning an autoregressive model $P(\text{name}| \text{category})$.

We will work at the level of *characters*. That is, we will model the text as a sequence of characters and predict the next character using the characters we have seen so far. 
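Concretely, write a name as characters $x_1, \ldots, x_T$ followed by an end-of-sequence marker $x_{T+1} = \text{EOS}$ (this indexing is a notational assumption made here). The chain rule of probability then gives the factorization that a character-level autoregressive model relies on:

$$
    P(\text{name} \, | \, \text{category}) = \prod_{t=1}^{T+1} P(x_t \, | \, x_{t-1}, \cdots, x_1, \text{category}) .
$$

In the code in this notebook, the first character is supplied as an input rather than predicted, so the factors that are actually learned are those for $t \ge 2$.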

### Data Preprocessing

Download the data from 
[this link](https://download.pytorch.org/tutorial/data.zip) 
and extract it to the current directory.
Look into the folder `data/names`, and make sure that 18 files such as `Arabic.txt`, `Chinese.txt`, etc. are available.


In [4]:
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

# Turn a Unicode string to plain ASCII
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def read_lines(filename):
    # Use a context manager so the file handle is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
# glob does not expand '~'; use the path relative to the current directory
for filename in glob.glob('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = read_lines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extract it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)

# categories: 18 ['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese', 'Russian', 'French', 'Irish', 'English', 'Spanish', 'Greek', 'Italian', 'Portuguese', 'Scottish', 'Dutch', 'Korean', 'Polish']


In [5]:
print(n_letters)

59
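To see what `unicode_to_ascii` does and where the value 59 comes from, here is a small self-contained check. It repeats the function from the cell above so that it runs on its own; the example name is just an illustration.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'-"

def unicode_to_ascii(s):
    # Same logic as the cell above: decompose accented characters (NFD),
    # then drop combining marks ('Mn') and anything outside all_letters.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

print(unicode_to_ascii('Ślusàrski'))  # Slusarski
print(len(all_letters) + 1)           # 59 = 52 letters + 6 punctuation marks + EOS
```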


## Define an RNN Module

In [6]:
import torch
from torch.nn.functional import relu

class RNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size,
                category_size=n_categories):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        # (input, category) + hidden -> hidden
        self.i2h = torch.nn.Linear(category_size + input_size + hidden_size, hidden_size)
        # (input, category) + hidden -> output 
        self.i2o = torch.nn.Linear(category_size + input_size + hidden_size, output_size)

    def forward(self, category, inp, hidden):
        # Our model is conditioned on the category as well.
        # We implement this conditioning by passing in the category
        # together with the input and the hidden state.
        input_combined = torch.cat((category, inp, hidden), dim=1)  # concatenate along dimension 1
        hidden = relu(self.i2h(input_combined))  # Compute the new hidden state
        output = self.i2o(input_combined)  # Compute the output from the same combined input
        return output, hidden

    def init_hidden(self):
        # Initialize the hidden state to zeros
        return torch.zeros(1, self.hidden_size)

### Embedding
As we discussed last time, how we embed tokens as vectors is crucial.
Since we are working at the level of characters, and the vocabulary of allowed characters is quite small (`n_letters=59` in this case), 
we can get away with a one-hot encoding. 

**Note**: At the word or word-piece level, the vocabulary is far larger, and one must use embeddings of the kind we discussed last time. 

In [4]:
# One-hot vector for category
def category_to_onehot(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def input_to_onehot(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# LongTensor of second letter to end (EOS) for target
def target_to_indices(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)


import numpy as np
# Make category, input, and target tensors from a random category, line pair
def sample_one_example():
    category = np.random.choice(all_categories) # sample category
    line = np.random.choice(category_lines[category]) # sample line from category
    category_tensor = category_to_onehot(category)
    input_line_tensor = input_to_onehot(line)
    target_line_tensor = target_to_indices(line)
    return category_tensor, input_line_tensor, target_line_tensor
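To make the relationship between inputs and targets concrete, the sketch below lists the letter indices for a hypothetical name, `"Abe"`, without building any tensors. The target sequence is the input sequence shifted by one position, with the EOS index appended.

```python
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1  # the last index is reserved for EOS

name = "Abe"  # hypothetical example name
# Position of each letter in all_letters (the index set to 1 in the one-hot vector)
input_indices = [all_letters.find(c) for c in name]
# Targets: second letter to the end, followed by EOS
target_indices = [all_letters.find(c) for c in name[1:]] + [n_letters - 1]

print(input_indices)   # [26, 1, 4]
print(target_indices)  # [1, 4, 58]
```

At step `i`, the network receives letter `i` as input and is trained to predict letter `i + 1`, or EOS after the last letter.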

## How does the RNN process an input?

Recurrent neural networks are inherently sequential in nature. The inputs are processed one-by-one, with the hidden state being updated each time.

In [7]:
rnn = RNN(n_letters, 128, n_letters)

# Sample an example
category_tensor, input_line_tensor, target_line_tensor = sample_one_example()

# Initialize hidden state
hidden = rnn.init_hidden()

# Loop over input sentence
for i in range(input_line_tensor.shape[0]):
    # Update hidden state and make predictions
    output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
    # Use this output to predict the token at this point
    # ... 

## Training loop
We are now ready to train the RNN. Note the extra work we must perform to process a single example.

In [8]:
from torch.nn.functional import cross_entropy
import time
from tqdm.auto import tqdm

def train_rnn_one_pass(rnn, total_num_examples, learning_rate):
    avg_loss = 0.0
    for i in tqdm(np.arange(total_num_examples)):  # ~2 min per epoch
        # sample a random training example
        category, input_line, target_line = sample_one_example()
        # Process this training example
        output, loss = train_one_example(rnn, category, input_line, target_line, learning_rate)
        avg_loss = i / (i+1) * avg_loss + loss / (i+1)
    return avg_loss

def train_one_example(rnn, category_tensor, input_line_tensor, target_line_tensor, learning_rate):
    # Perform a single SGD update on the given example
    target_line_tensor.unsqueeze_(-1) # batch dimension
    hidden = rnn.init_hidden()

    loss = 0.0

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = cross_entropy(output, target_line_tensor[i])
        loss += l  # loss incurred when predicting the target at position i

    gradients = torch.autograd.grad(outputs=loss, inputs=rnn.parameters())
    
    with torch.no_grad():
        for p, g in zip(rnn.parameters(), gradients):
            p -= learning_rate * g

    # return the final output and the average per-character loss on this example
    return output, loss.item() / input_line_tensor.size(0)
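The per-step loss used above is the negative log-probability that the softmax of the network's output assigns to the correct next character. The pure-Python sketch below computes this quantity with a hypothetical helper, `cross_entropy_from_logits` (it illustrates the formula and is not PyTorch's implementation). A model that spreads probability uniformly over all 59 symbols incurs a loss of $\log 59 \approx 4.08$ per character, which is a useful reference point when reading training losses.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

n_letters = 59
uniform_logits = [0.0] * n_letters  # every symbol equally likely
print(round(cross_entropy_from_logits(uniform_logits, target=3), 3))  # 4.078
```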

One pass (epoch) processes as many randomly sampled examples as there are names in the dataset, so we first count them.

In [9]:
total_num_examples = sum([len(category_lines[c]) for c in all_categories])
print(total_num_examples)

20074


In [10]:
n_letters

59

# Train
We are now ready to train the RNN. We will train it for 5 epochs through our data. 

In [11]:
rnn = RNN(n_letters, 128, n_letters)
learning_rate = 0.0005


start = time.time()

for epoch in range(5):
    t1 = time.time()
    print(f'Starting epoch {epoch}')
    avg_loss = train_rnn_one_pass(rnn, total_num_examples, learning_rate)
    print(epoch+1, '\t', round(avg_loss, 3), 
          f'\t{round(time.time()-t1, 2)}sec')
    


Starting epoch 0
1 	 2.935 	28.85sec
Starting epoch 1
2 	 2.57 	28.64sec
Starting epoch 2
3 	 2.446 	28.79sec
Starting epoch 3
4 	 2.364 	31.15sec
Starting epoch 4
5 	 2.31 	33.07sec


## Generating new names from the network

To sample we give the network a letter and ask what the next one is,
feed that in as the next letter, and repeat until the EOS token.

- Create tensors for the input category, starting letter, and empty hidden state
- Create a string `output_name` with the starting letter
- Up to a maximum output length:
    - Feed the current letter to the network
    - Get the next letter from the highest output, and the next hidden state
    - If the letter is EOS, stop here
    - If it is a regular letter, add it to `output_name` and continue
- Return the final name


In [12]:
max_length = 20

# Generate from a category and starting letter
@torch.no_grad()
def generate_name_greedy(category, start_letter):
    category_tensor = category_to_onehot(category)
    inp = input_to_onehot(start_letter)
    hidden = rnn.init_hidden()

    output_name = start_letter

    for i in range(max_length):
        # inp[-1] => pass the last character through the RNN
        output, hidden = rnn(category_tensor, inp[-1], hidden)
        # output: (1, n_letters) 
        topi = output[0].argmax()
        if topi == n_letters - 1:  # EOS token
            break
        else:
            # add new
            letter = all_letters[topi]
            output_name += letter
        inp = input_to_onehot(output_name) 
        
    return output_name

In [13]:
print(all_categories)

['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese', 'Russian', 'French', 'Irish', 'English', 'Spanish', 'Greek', 'Italian', 'Portuguese', 'Scottish', 'Dutch', 'Korean', 'Polish']


In [15]:
generate_name_greedy('French', start_letter='M')

'Marer'

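We can also sample names from the network instead of always picking the most likely next letter. The `temperature` parameter rescales the outputs before the softmax: lower values make sampling closer to greedy decoding, while higher values make it more random.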

In [16]:
from torch.nn.functional import softmax

# Generate from a category and starting letter
@torch.no_grad()
def generate_name_sample(category, start_letter, temperature=0.5):
    category_tensor = category_to_onehot(category)
    inp = input_to_onehot(start_letter)
    hidden = rnn.init_hidden()

    output_name = start_letter

    for i in range(max_length):
        # inp[-1] => pass the last character through the RNN
        output, hidden = rnn(category_tensor, inp[-1], hidden)
        # output: (1, n_letters); 1 is the batch_size
        probabilities = softmax(output[0]/temperature, dim=0)
        next_letter = torch.multinomial(probabilities, 1)[0].item()
        if next_letter == n_letters - 1:  # EOS token
            break
        else:
            letter = all_letters[next_letter]
            output_name += letter
        inp = input_to_onehot(output_name)
        
    return output_name

In [17]:
print(all_categories)

['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese', 'Russian', 'French', 'Irish', 'English', 'Spanish', 'Greek', 'Italian', 'Portuguese', 'Scottish', 'Dutch', 'Korean', 'Polish']


In [18]:
generate_name_sample('Japanese', start_letter='K')

'Kouea'
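The effect of the temperature can be seen on a small example. The sketch below uses a hypothetical helper, `softmax_with_temperature`, written in pure Python, and made-up scores for three letters. It mirrors the `softmax(output[0]/temperature, dim=0)` call above but is not taken from the notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three letters
for temperature in (0.1, 0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability mass sits on the highest-scoring letter, so sampling behaves much like the greedy generator. At a high temperature the distribution flattens and the generated names become more varied, and often less plausible.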

# Bonus Exercise 1: RNN for Classification
In this exercise, we train an RNN-based classifier which can classify names to categories (i.e., language of origin). 
In ML terms, the input is a name (which we treat as a sequence of characters) and the output is its category. 

Details:
- Use the `RNNForClassification` below. This is a simplification of the RNN class we used above. Note that the forward method does not take the category as an input anymore, since this is the output we would like to predict.
- We will run through the entire sequence with the RNN to compute the loss once. In particular, the `train_one_example` function from above will now be modified as:
```
# Step 1: obtain the *final* output
for i in range(input_line_tensor.shape[0]):
    output, hidden = rnn(input_line_tensor[i], hidden)
# Step 2: compute the loss based on the *final* output
# output is of shape (1, n_categories) = (1, 18); the leading 1 is the batch dimension
loss = cross_entropy(output, category)
# Note: pass in the raw category as a LongTensor of length 1. In the notebook above, we used a one-hot encoding, which we called `category_tensor`.
```

Deliverables:
- What are the differences between the `RNN` class above and `RNNForClassification` class below? Why do you think we have these differences?
- Separate out a validation set from the training set as 20% of names for each of the categories. 
- Find the divergent learning rate. Hint: you will have to train for a whole epoch to test for divergence. The loss might appear to be going down but then it could suddenly explode. RNNs are not as robust to learning rates as MLPs. Practitioners use various tricks such as gradient clipping to deal with these issues. 
- Train the model for 5 epochs with a quarter of the divergent learning rate.
- Report the training and validation accuracy at the end of training.
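Gradient clipping, mentioned in the hint above, can be sketched as follows. This is a NumPy illustration of clipping by global norm under assumed names (`clip_by_global_norm`, `max_norm`); it is not part of the exercise's required solution. In a training loop like the one in this notebook, the same rescaling would be applied to the gradients before the parameter update.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in gradients))
    # Scale is 1 when the norm is already small enough, so such gradients are unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in gradients], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, round(norm_after, 6))  # 13.0 1.0
```

Clipping keeps the direction of the update but bounds its size, which limits the damage from an occasional exploding gradient.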

In [None]:
class RNNForClassification(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size=n_categories):
        super().__init__()
        self.hidden_size = hidden_size
        # input + old_hidden -> new_hidden
        self.i2h = torch.nn.Linear(input_size + hidden_size, hidden_size)
        # input + hidden -> output 
        self.i2o = torch.nn.Linear(input_size + hidden_size, output_size)

    def forward(self, inp, hidden):
        input_combined = torch.cat((inp, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

# Bonus Exercise 2: RNNs for Sentiment Analysis

The goal of this exercise is to perform sentiment analysis with a recurrent neural network. 

**Note**: This is an advanced exercise which brings together multiple concepts we have learned throughout the course. You might find it easier to attempt this after working out Bonus Exercise 1 above.

**Data**: We will use _movie review_ data from _Rotten Tomatoes_. Please see the demo of week 6 for details on the data. Download the data from [here](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data?select=train.tsv.zip). Also download the test set from the same page.
Note that you will need an active Kaggle account to access the data.

The sentiment labels are:
- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive

**Model**: Use a word-level recurrent neural network built on the GloVe embeddings. See the lab from week 6 for details on how to obtain the GloVe embeddings for words. 
- The input `inp` to the forward method is the GloVe embedding of the input word. 
- The output dimension is 5, corresponding to the sentiment labels. 
- Use a hidden state of dimension 128. 
- Similar to the `RNNForClassification` above, compute the `cross_entropy` loss once for the entire sequence of words. 

**Optimization**: Find the divergent learning rate. Use a quarter of the divergent learning rate to optimize the model with SGD. Plot the train and test loss and accuracy over the course of training. 