# Recurrent Neural Network (RNN) from scratch

This notebook again is based on the [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy. The goal is to implement a simple RNN from scratch in Python and train it to perform character-level language modeling.

Andrej also gave a talk about RNNs at the Deep Learning Summer School 2015, which is available [here](https://skillsmatter.com/skillscasts/6611-visualizing-and-understanding-recurrent-networks#video).

Instead of using numpy, we will use pytorch to implement the RNN. This will allow us to easily run the code on a GPU.

Here, we will use Shakespeare's Sonnets as the training data. We are not making names because the sequence of names is not very long and the RNN will not be able to learn much from it.

In [1]:
# import packages that are not related to torch
import os
import math
import time
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt


# torch import
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tu_data
from torchvision.datasets import FashionMNIST


### --------- environment setup --------- ###
# set up the data path
DATA_PATH = "../GPT-2/data"

# function for setting seed
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        
# set up seed globally and deterministically
set_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
# read the dataset (names)
shk_text = open(os.path.join(DATA_PATH, "input.txt"), "r").read()

In [3]:
print(shk_text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [3]:
# we will work on characters
# we will use '.' as the start and end token
chars = sorted(list(set(shk_text)))
print(chars)
print("The number of unique characters: {}".format(len(chars)))
print("".join(chars))

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The number of unique characters: 65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [4]:
# create index
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = {i: ch for i, ch in enumerate(chars)}
print(char2idx)
print(idx2char)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i',

Before we start, let's just use one function to understand the basic idea of RNNs: 

$$
h_{t+1}  = tanh(W_{hh} h_t + W_{xh} x_t)
$$

where $h_{t+1}$ is the hidden layer. After the hidden layer we have
the output layer and then softmax to get the probabilities for the next character.

In [5]:
# set up hyperparameters
text_size, vocab_size = len(shk_text), len(chars)
print("The text has {} characters, {} unique.".format(text_size, vocab_size))
hidden_size = 100 
seq_len = 9  # the length of the sequence
learning_rate = 1e-2

The text has 1115394 characters, 65 unique.


In [24]:
# initialize the parameters
# since we are doing one-hot encoding,
# the input size is the same as the vocab_size
wxh = torch.randn(hidden_size, vocab_size,
                  device=device, dtype=torch.float32,
                  requires_grad=True)
whh = torch.randn(hidden_size, hidden_size,
                    device=device, dtype=torch.float32,
                    requires_grad=True)
why = torch.randn(vocab_size, hidden_size,
                    device=device, dtype=torch.float32,
                    requires_grad=True)
# bias
bh = torch.zeros(hidden_size, device=device,
                    dtype=torch.float32, requires_grad=True)
by = torch.zeros(vocab_size, device=device,
                    dtype=torch.float32, requires_grad=True)
parameters = [wxh, whh, why, bh, by]
# print out the number of parameters
print("The number of parameters: {}".format(sum(p.numel() for p in parameters)))


The number of parameters: 23165


In [7]:
# prepare the data
encode = lambda text: [char2idx[ch] for ch in text]
decode = lambda tnsr: "".join([idx2char[i] for i in tnsr])

# test the encode and decode functions
print(encode("hello"))
print(decode(encode("hello")))

[46, 43, 50, 50, 53]
hello


In [8]:
# we now will encode the whole text
encoded_text = torch.tensor(encode(shk_text), device=device, dtype=torch.long)
print("The text has {} characters, {} unique.".format(text_size, vocab_size))
# encoded_text is a 1D tensor and has the length of the text
print(encoded_text.shape)
print(encoded_text[:10])

The text has 1115394 characters, 65 unique.
torch.Size([1115394])
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0')


In [15]:
# now split the dataset into training and validation
# split the dataset into train and test sets
train_n = int(encoded_text.shape[0] * 0.9)
train_data = encoded_text[:train_n]
test_data = encoded_text[train_n:]
print("The number of training data: {}".format(train_data.shape[0]))
print("The number of test data: {}".format(test_data.shape[0]))

The number of training data: 1003854
The number of test data: 111540


## Structure of training a nueral network

What are the key components of training a neural network?

- load the original data
- preprocess the data
- initialize the parameters
- forward propagation
- compute the loss
- backpropagation
- update the parameters
- predict

Working with text, reading the dataset is not difficult. We just need to read the text file and convert it to a list of characters or tokens. However, once we have the data, we need to preprocess it. This is a very important step. We need to convert the characters to numbers. We also need to create a dictionary to map the numbers back to the characters. The common way to do this is to use one-hot encoding, which needs to construct __a dictionary__ to map the characters to numbers.

Once we have a dictionary, we can use it to convert the characters to integers. However, this one-hot encoding is not very efficient. We will use a different way to encode the characters. We will use the `Embedding` layer in pytorch to do this. This layer will map the characters to a vector of real numbers. This is a more efficient way to encode the characters. Therefore, here are common steps to preprocess the data:

- read the text file
- clean the text on character level or token level
- construct a dictionary to map the characters to numbers
- then characters flow into the neural network as numbers
    - human being read the text as characters
    - the neural network read the text as numbers

Once, we have the data, we need to initialize the parameters. Without using `Pytorch`, one has to initialize the parameters manually. However, with `Pytorch`, we can use the `nn` module to do this. We will use the `nn` module to initialize the parameters. Or if the modules are not available, we can write our own `class` to initialize the parameters.

In [9]:
# now construct the embedding layer
# we do not need to set up the device as we coudl use the .to(device) method
class Embedding:
  
  def __init__(self, dict_size, embedding_dim):
    """
    dict_size: the size of the dictionary
    embedding_dim: the dimension of the embedding
    """
    self.weight = torch.randn((dict_size, embedding_dim),
                                              device=device,
                                              dtype=torch.float32,
                                              requires_grad=True)
    
  def __call__(self, IX):
    """
    embedding layer will be the first layer of the network
    and each time we will pass a batch of data to the network
    the input.shape = (batch_size, seq_len), then 
    the output.shape = (batch_size, seq_len, embedding_dim)
    the input datatype has to be int, such as torch.long
    This way we could train the embedding layer efficiently
    batch by batch instead of training the whole dataset
    """
    self.out = self.weight[IX]
    return self.out
  
  def parameters(self):
    return [self.weight]

In [10]:
# since RNN rolls out the sequence, we need to define the RNNCell
# based on https://pytorch.org/docs/stable/generated/torch.nn.RNNCell.html?highlight=rnncell
# you can watch this for more details: https://youtu.be/ySEx_Bqxvvo
class RNNCell:

    def __init__(self, embedding_size, hidden_size, output_size, bias=True):
        """
        input_size: the size of the input
        hidden_size: the size of the hidden state
        remark: the order matters, which determines whether we should
        call input @ weight or weight @ input
        here we set up the weight as (input_size, hidden_size)
        therefore, we should call input @ weight
        """
        self.vocab_size = output_size
        self.hidden_size = hidden_size
        self.wxh = torch.randn((embedding_size, hidden_size), device=device,
                                                    dtype=torch.float32,
                                                    requires_grad=True)
        self.whh = torch.randn((hidden_size, hidden_size), device=device,
                                                    dtype=torch.float32,
                                                    requires_grad=True)
        self.why = torch.randn((hidden_size, vocab_size), device=device,
                                                    dtype=torch.float32,
                                                    requires_grad=True)
        if bias:
            self.bh = torch.zeros((1, hidden_size), device=device,
                                                    dtype=torch.float32,
                                                    requires_grad=True)
        else:
            self.bh = None

    def __call__(self, x, h):
        """
        x: the input of the RNNCell
               the shape of the input x.shape = (batch_size, vocab_size)
               there is no seq_len dimension
        h: the hidden state of the RNNCell, h.shape = (batch_size, hidden_size)
        """
        if h is None:
            # we need to initialize the hidden state
            h = torch.zeros((x.shape[0], self.hidden_size), device=device,
                                                    dtype=torch.float32,
                                                    requires_grad=True)
        self.hidden = torch.tanh(x @ self.wxh + h @ self.whh + self.bh)
        self.out = self.hidden @ self.why
        # return the output and the hidden state
        return self.out, self.hidden
    
    def parameters(self):
        return [self.wxh, self.whh, self.why, self.bh]

In [11]:
# the second layer is the RNN layer
# we will run a customer layer, __init__ part is the initialization
# and the __call__ part is the forward pass
# if the weight.shape = (input_size, hidden_size)
# then in the forward pass, we should call weight @ input
# otherwise, we should set up the weight as (hidden_size, input_size)
# and call input @ weight
# now we will create a RNN layer based on the RNNCell
class RNN:

    def __init__(self, rnn_cell):
        """
        rnn_cell: the RNNCell
        """
        self.rnn_cell = rnn_cell

    
    def __call__(self, X, h):
        """
        X: the input of the RNN, X.shape = (batch_size, seq_length, embedding_dim)
        h: the initial hidden state, h.shape = (batch_size, hidden_size)
        """
        batch_size, seq_len, embedding_dim = X.shape

        outputs = []

        for t in range(seq_len):
            x_t = X[:, t, :]
            out, h = self.rnn_cell(x_t, h)
            outputs.append(out)
        
        # stack the outputs
        outputs = torch.stack(outputs, dim=1)
        return outputs, h
    
    def parameters(self):
        return self.rnn_cell.parameters()

In [12]:
# define the forward function
def forward(X, Y, embedding, rnn, h):
    """
    X: the input of the network, X.shape = (batch_size, seq_len)
    Y: the target of the network, Y.shape = (batch_size, seq_len)
    embedding: the embedding layer
    rnn: the rnn layer
    h: the initial hidden state
    """
    # get the embedding
    X = embedding(X)
    # get the output and the hidden state
    out, h = rnn(X, h)
    # calculate the loss
    # out.shape = (batch_size, seq_len, vocab_size)
    # calculate the loss for each time step
    # and then take the average
    loss = F.cross_entropy(out.view(-1, out.shape[-1]), Y.view(-1)).mean()
    
    return loss, h

In [13]:
# construct X and Y
def get_batch(encoded_text, seq_len, batch_size):
    """
    encoded_text: the encoded text
    seq_len: the length of the sequence
    batch_size: the size of the batch
    """
    # calculate the number of batches
    n_batches = encoded_text.shape[0] // (seq_len * batch_size)
    # reshape the encoded text
    encoded_text = encoded_text[:n_batches * batch_size * seq_len]
    encoded_text = encoded_text.reshape((batch_size, -1))
    # loop through the encoded text
    for i in range(0, encoded_text.shape[1], seq_len):
        # get the input and the target
        X = encoded_text[:, i:i+seq_len]
        Y = torch.zeros_like(X)
        Y[:, :-1], Y[:, -1] = X[:, 1:], X[:, 0]
        yield X, Y

In [37]:
# test it
foo_x, foo_y = next(get_batch(train_data, 10, 30))

In [28]:
foo_x.shape, foo_y.shape

(torch.Size([30, 10]), torch.Size([30, 10]))

In [30]:
print(foo_x[0])
print(foo_y[0])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0')
tensor([47, 56, 57, 58,  1, 15, 47, 58, 47, 18], device='cuda:0')


In [66]:
# we could draw batches randomly
# but here we will draw batches sequentially
# now we will go through the network

foo_embedding = Embedding(vocab_size, 25)
foo_embedding(foo_x).shape
# second layer
foo_rnn = RNN(RNNCell(25, 100, vocab_size))
foo_out, foo_h = foo_rnn(foo_embedding(foo_x), None)
print(foo_out.shape, foo_y.shape)
print(foo_out.view(-1, foo_out.shape[-1]).shape, foo_y.view(-1).shape)

torch.Size([30, 10, 65]) torch.Size([30, 10])
torch.Size([300, 65]) torch.Size([300])


In [68]:
# the output is the shape of (batch_size, seq_len, vocab_size)
# whereas Y.shape = torch.Size([30, 10])) which is (batch_size, seq_len)
# we need to calculate the loss for each time step
# and then average them
F.cross_entropy(foo_out.view(-1, foo_out.shape[-1]), foo_y.view(-1)).mean()



tensor(22.1778, device='cuda:0', grad_fn=<MeanBackward0>)

In [56]:
input_foo = torch.randn(3, 5, requires_grad=True)
target_foo = torch.randint(5, (3,), dtype=torch.int64)
F.cross_entropy(input_foo, target_foo)

tensor(2.0836, grad_fn=<NllLossBackward0>)

In [16]:
# now, let's train the network
max_steps = 10000
batch_size = 600
seq_len = 25
embedding_dim = 65 # which is also the vocab_size
hidden_size = 500
lr = 1e-2

# initialize the network
embedding = Embedding(vocab_size, embedding_dim)
rnn = RNN(RNNCell(embedding_dim, hidden_size, vocab_size))

# initialize the optimizer
optimizer = torch.optim.Adam([p for p in embedding.parameters()] + [p for p in rnn.parameters()], lr=lr)

# begin to train the network
for i in range(max_steps):

    # get the batch
    X, Y = next(get_batch(train_data, seq_len, batch_size))
    # initialize the hidden state
    h = None
    # forward pass
    loss, h = forward(X, Y, embedding, rnn, h)
    # backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # print the loss
    if i % 1000 == 0:
        print(f"step: {i}, loss: {loss.item()}")

step: 0, loss: 31.93937110900879
step: 1000, loss: 2.742612600326538
step: 2000, loss: 2.7541823387145996
step: 3000, loss: 2.7108449935913086
step: 4000, loss: 2.6730246543884277
step: 5000, loss: 2.644676923751831
step: 6000, loss: 2.6060636043548584
step: 7000, loss: 2.589829444885254
step: 8000, loss: 2.580498456954956
step: 9000, loss: 2.5517477989196777


In [40]:
# let's generate some text
def generate_text(embedding, rnn, h, seed_text, n_chars=20):
    """
    We are now predicting the next character based on the previous text
    embedding: the embedding layer has been trained
    rnn: the rnn layer has been trained
    h: the hidden state
    seed_text: the seed text
    We will use mutlinomial to sample the next character
    """

    print(seed_text, end="")

    output_text = []

    for i in range(n_chars):

        # get the input
        

    
    return "".join(output_text)

In [41]:
# test it
generate_text(embedding, rnn, None, "Tomorrow I will")

Tomorrow I will<class 'list'>
:uO le  snsatl <class 'list'>
,funarqsoW,ule<class 'list'>
uudeunset!gll<class 'list'>
u
e iaecodha<class 'list'>
 .retotatel<class 'list'>
eu,sogaal <class 'list'>
nS iitel <class 'list'>
 w'noona<class 'list'>
iesisl <class 'list'>
mbHtee<class 'list'>
mrll <class 'list'>
 se <class 'list'>
gi <class 'list'>
 u<class 'int'>


TypeError: object of type 'int' has no len()