## Character-level language model with RNNs

In the last exercise, we implemented MLPs and CNNs in PyTorch on the MNIST and CIFAR datasets. In this exercise, we will implement a character-level language model with RNNs and train it on Shakespearean text.

This code is based on Andrej Karpathy's [char-rnn](https://github.com/karpathy/char-rnn). Do check out his [blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [gist](https://gist.github.com/karpathy/d4dee566867f8291f086) on the topic.

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

The Shakespearean text is provided in ```ex_05_nlp_graph.zip``` under ```./data/tinyshakespeare.txt```. We start by loading the data and doing some preprocessing.

In [None]:
# read the contents of the text file (the with-block closes the file afterwards)
with open('./data/tinyshakespeare.txt', 'r') as f:
    data = f.read()
print (data[:1000]) # let's examine some text

The text looks good, but it cannot be processed by RNNs in its raw form. We first need to tokenize the data and convert it into a form suitable for RNNs. Since we focus on character-level language models in this exercise, we will treat each character as a token.

In [None]:
chars = sorted(set(data)) # unique characters, sorted so that the ids are the same on every run
data_size, vocab_size = len(data), len(chars)
print ('The text file has {} characters out of which {} are unique.'.format(data_size, vocab_size))
print (chars)

We have quite a big text with 65 unique characters. Now, we need to associate each character with a unique id, which can then be converted into a one-hot vector to provide as input to the RNN.

In [None]:
# Create a dictionary mapping each character to a unique id and vice versa
char_to_ix = {ch:i for i,ch in enumerate(chars)}
ix_to_char = {i:ch for i,ch in enumerate(chars)}
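As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below rebuilds them on a short stand-in string and round-trips some text through ids. The names `sample_text`, `encode` and `decode` are illustrative and not part of the exercise.

```python
# Stand-in corpus; in the notebook this would be the `data` string.
sample_text = "to be, or not to be"

# sorted() keeps the ids stable across runs.
sample_chars = sorted(set(sample_text))
sample_char_to_ix = {ch: i for i, ch in enumerate(sample_chars)}
sample_ix_to_char = {i: ch for i, ch in enumerate(sample_chars)}


def encode(text):
    """Map a string to a list of integer ids."""
    return [sample_char_to_ix[ch] for ch in text]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(sample_ix_to_char[i] for i in ids)


ids = encode("to be")
print(ids, decode(ids))
```

If `decode(encode(text))` does not return the original text, the two mappings are out of sync.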

Now that we have a unique id for each character, we can represent each character with a one-hot encoding.

In [None]:
# one-hot encoding example for a few tokens
for _ in range(5):
    random_char_index = np.random.randint(0, vocab_size)
    random_char = ix_to_char[random_char_index]
    one_hot_vector = np.zeros(vocab_size)
    one_hot_vector[random_char_index] = 1
    print (random_char_index, random_char, one_hot_vector.shape)
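The same encoding can also be produced for a whole batch of ids at once with `torch.nn.functional.one_hot`. This is a minimal sketch with a hypothetical vocabulary of 5 tokens, not the notebook's actual vocabulary.

```python
import torch
import torch.nn.functional as F

toy_vocab_size = 5                      # hypothetical vocabulary size
token_ids = torch.tensor([0, 3, 3, 1])  # hypothetical token ids

# F.one_hot returns an integer tensor of shape (len(token_ids), toy_vocab_size);
# cast it to float before feeding it to linear layers.
one_hot = F.one_hot(token_ids, num_classes=toy_vocab_size).float()

print(one_hot.shape)
print(one_hot.argmax(dim=-1))  # recovers the original ids
```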

In the last exercise on classification with MNIST and CIFAR, ground-truth labels were provided explicitly with each instance of the dataset. Our text dataset has no explicit ground-truth labels, but a character-level language model predicts the next character. So, in essence, the text is its own ground truth: for each character, the next character acts as the label. This will be important when loading the data for training.
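To make the "next character is the label" idea concrete, here is a small sketch on a toy string. `toy_text`, `toy_seq_length` and `start` are illustrative names chosen for this example.

```python
toy_text = "hello world"
toy_seq_length = 4
start = 0

# The targets are the inputs shifted one character to the right.
inputs = toy_text[start:start + toy_seq_length]
targets = toy_text[start + 1:start + toy_seq_length + 1]

for inp, tgt in zip(inputs, targets):
    print(repr(inp), "->", repr(tgt))
```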

Now that we have preprocessed the data, we are ready to start building and training our model. Recall that in the last exercise, we used the MNIST and CIFAR datasets, which are available in the torchvision package and could be supplied directly to PyTorch's DataLoader class. The DataLoader expects an object of the built-in [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class, however, so we first need to wrap our data in this form by inheriting from the Dataset class.

In [None]:
# General template for defining a CustomDataset by inheriting from the Dataset class
# We need to override the __init__, __len__ and __getitem__ methods
# __init__ is called during the dataset instantiation
# __len__ returns the size of the dataset
# __getitem__ is called during training and returns the data to be used during training
# Refer to the Dataset class link for more information

class CustomDataset(Dataset):
    
    def __init__(self, params):
        super(CustomDataset, self).__init__()
        
    
    def __len__(self):
        pass
    
    def __getitem__(self, index):
        pass
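For illustration, here is the template filled in for a deliberately trivial, hypothetical dataset (`SquaresDataset` is not part of the exercise). It shows how `__len__` and `__getitem__` are used by the DataLoader to form batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i**2)."""

    def __init__(self, n):
        super().__init__()
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        x = torch.tensor(float(index))
        return x, x ** 2


toy_loader = DataLoader(SquaresDataset(10), batch_size=4)
for xs, ys in toy_loader:
    # the DataLoader stacks the individual items into batch tensors
    print(xs.shape, ys.shape)
```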
        

The first task of this exercise is to implement the ```__init__```, ```__len__``` and ```__getitem__``` methods for our dataset. The preprocessing steps that we did above are now to be implemented as methods of our dataset class. We have implemented the ```__init__``` and ```__len__``` methods for you.

In [None]:
#########################################################################
# TO-DO: Implement the __getitem__ of the Shakespeare class. 
# Important points: __getitem__ is called at each training iteration
# So, we need to return the data and ground-truth label. The data is in
# the form of one-hot vectors and ground-truth is the index of next char
# Our RNN operates on an input sequence of a specified length (seq_length)
# so we need to return a sequence of one-hot vectors and the indices of
# their corresponding next character
#########################################################################

class Shakespeare(Dataset):
    
    def __init__(self, data_path, seq_length):
        super(Shakespeare, self).__init__()
        
        self.seq_length = seq_length # length of the input sequence
        
        # same as done above
        with open(data_path, 'r') as f:
            self.data = f.read()
        self.data_size = len(self.data)
        
        self.chars = sorted(set(self.data)) # sorted so that the ids are the same on every run
        self.vocab_size = len(self.chars)
        
        
        self.char_to_ix = {ch:i for i,ch in enumerate(self.chars)}
        self.ix_to_char = {i:ch for i,ch in enumerate(self.chars)}
        
    def __len__(self):
        return self.data_size - self.seq_length - 1
    
    def __getitem__(self, index):
        
        ##### implement this part #####
        
        
        ###############################
        
        # one_hot_input_seq = one-hot vectors of the tokens in input sequence
        # targets = indices of the next character of each token
        
        # one_hot_input_seq = (seq_length, vocab_size)
        # targets = (seq_length)
        return one_hot_input_seq, targets
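One possible way to approach `__getitem__` is sketched below on a stand-in class that reads from a string instead of a file. `ToyCharDataset` and its constructor arguments are assumptions made for this example; it is only one of several valid solutions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class ToyCharDataset(Dataset):
    """Illustrative stand-in for the Shakespeare class, built from a string."""

    def __init__(self, text, seq_length):
        super().__init__()
        self.text = text
        self.seq_length = seq_length
        self.chars = sorted(set(text))
        self.vocab_size = len(self.chars)
        self.char_to_ix = {ch: i for i, ch in enumerate(self.chars)}

    def __len__(self):
        return len(self.text) - self.seq_length - 1

    def __getitem__(self, index):
        # ids of seq_length + 1 consecutive characters starting at `index`
        window = self.text[index:index + self.seq_length + 1]
        ids = torch.tensor([self.char_to_ix[ch] for ch in window])
        # inputs are the first seq_length ids, targets are shifted by one
        one_hot_input_seq = F.one_hot(ids[:-1], num_classes=self.vocab_size).float()
        targets = ids[1:]
        return one_hot_input_seq, targets


toy_dataset = ToyCharDataset("hello world, hello again", seq_length=5)
x, y = toy_dataset[0]
print(x.shape, y.shape)
```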

Now that we have defined our dataset, we can instantiate it and build our dataloader.

In [None]:
seq_length = 25
batch_size = 100

data_path = './data/tinyshakespeare.txt'
dataset = Shakespeare(data_path, seq_length)
vocab_size = dataset.vocab_size

# Q: why is drop_last=True and shuffle=True used in the dataloader?
# What happens if we don't set them to True?
dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=1, shuffle=True, drop_last=True)


#########################################################################
# Solution
# drop_last=True is used since the data size might not be divisible by batch_size,
# so the last batch would have fewer elements than the (batch_size, hidden_size)
# hidden state expects and could not be processed by the network
#
# shuffle=False leads to sequential data loading, which enables propagating the hidden state
# from one batch to the next; this helps the network learn better since it provides
# relevant context and the network doesn't need to start from 0 every time
dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=1, shuffle=False, drop_last=True)
#########################################################################
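The effect of `drop_last` is easy to see on a toy dataset. The sketch below uses 10 hypothetical samples and a batch size of 4; the variable names are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

toy_data = TensorDataset(torch.arange(10))  # 10 hypothetical samples

keep_last_loader = DataLoader(toy_data, batch_size=4, drop_last=False)
drop_last_loader = DataLoader(toy_data, batch_size=4, drop_last=True)

keep_sizes = [len(batch[0]) for batch in keep_last_loader]
drop_sizes = [len(batch[0]) for batch in drop_last_loader]

print(keep_sizes)  # batch sizes without drop_last
print(drop_sizes)  # batch sizes with drop_last
```

Without `drop_last`, the final batch is smaller than the others, so any tensor allocated with a fixed batch size no longer matches it.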

### Question: why is drop_last=True and shuffle=True used in the dataloader? What happens if we don't set them to True?

### Answer:

Next, we need to define our RNN model before training. For this, we will implement the RNN Cell discussed in lecture 8 slide 6. Note that the RNN Cell operates on a single timestep. So, the RNN Cell will take a single timestep token as input and produce output and hidden state for that particular timestep only.

In [None]:
#########################################################################
# TO-DO: Implement the __init__ and forward methods of the RNNCell class. 
# Refer to the equations in the lecture and implement the same here
# The forward method should return the output and the hidden state
#########################################################################

class RNNCell(nn.Module):
    
    def __init__(self, vocab_size, hidden_size):
        super(RNNCell, self).__init__()
        
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        
        ##### implement this part #####

        
        ###############################
        
    def forward(self, input_emb, hidden_state):
        
        # input_emb = (batch_size, vocab_size)
        # hidden_state = (batch_size, hidden_size)
        
        ##### implement this part #####
        
        
        ###############################
        
        return output, hidden_state
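As a reference point, here is a minimal sketch of a vanilla (Elman) RNN cell. It assumes the lecture uses the standard equations `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)` and `y_t = W_hy h_t + b_y`; the lecture slides are not reproduced here, so check them before reusing this. The class name `SketchRNNCell` and its layer names are illustrative.

```python
import torch
import torch.nn as nn


class SketchRNNCell(nn.Module):
    """One possible vanilla RNN cell (assumed equations, see the text above)."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.i2h = nn.Linear(vocab_size, hidden_size)               # W_xh and b_h
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=False)  # W_hh
        self.h2o = nn.Linear(hidden_size, vocab_size)               # W_hy and b_y

    def forward(self, input_emb, hidden_state):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        hidden_state = torch.tanh(self.i2h(input_emb) + self.h2h(hidden_state))
        # y_t = W_hy h_t + b_y (unnormalized scores over the vocabulary)
        output = self.h2o(hidden_state)
        return output, hidden_state


cell = SketchRNNCell(vocab_size=65, hidden_size=100)
x_t = torch.zeros(8, 65)
x_t[:, 0] = 1                      # every row is the one-hot vector of token 0
h_prev = torch.zeros(8, 100)
y_t, h_t = cell(x_t, h_prev)
print(y_t.shape, h_t.shape)
```

The output is left as raw scores because `F.cross_entropy` applies the softmax internally.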

Since we have a sequence of tokens as input, we will implement another model that uses this RNNCell to process multi-timestep sequence inputs. The RNN class takes a sequence of one-hot encodings of tokens as input and returns a sequence of outputs, one for each timestep.

In [None]:
#########################################################################
# TO-DO: Implement the forward method of the RNN class. 
# The RNN class takes in a sequence of one-hot encodings of tokens as input 
# and returns a sequence of outputs, one for each timestep.
# We also return the hidden state of the final timestep
# Q: Is it required to return the hidden state? If yes, why? If no, why?
#########################################################################

class RNN(nn.Module):
    
    def __init__(self, seq_length, vocab_size, hidden_size):
        super(RNN, self).__init__()
        
        self.seq_length = seq_length
        self.hidden_size = hidden_size
        self.rnn_cell = RNNCell(vocab_size, hidden_size)
        
    def forward(self, input_seq, hidden_state):
        
        # input_seq: (batch_size, seq_length, vocab_size)
        # hidden_state: (batch_size, hidden_size)
        
        ##### implement this part #####
        
        
        ###############################
        
        # outputs: (batch_size, seq_length, vocab_size)
        # hidden_state: (batch_size, hidden_size)
        
        return outputs, hidden_state
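Unrolling over time amounts to a loop over the sequence dimension. The sketch below shows the idea with PyTorch's built-in `nn.RNNCell` plus a linear readout as a stand-in for the custom cell, so it runs on its own. `SketchRNN` and `readout` are illustrative names, not the exercise's solution.

```python
import torch
import torch.nn as nn


class SketchRNN(nn.Module):
    """Unrolls a single-step cell over a (batch, time, vocab) input."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.cell = nn.RNNCell(vocab_size, hidden_size)  # built-in stand-in cell
        self.readout = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_seq, hidden_state):
        outputs = []
        for t in range(input_seq.size(1)):
            # feed one timestep at a time and carry the hidden state forward
            hidden_state = self.cell(input_seq[:, t, :], hidden_state)
            outputs.append(self.readout(hidden_state))
        # stack along a new time dimension: (batch, time, vocab)
        outputs = torch.stack(outputs, dim=1)
        return outputs, hidden_state


model = SketchRNN(vocab_size=65, hidden_size=100)
dummy_seq = torch.zeros(4, 25, 65)
outputs, h_final = model(dummy_seq, torch.zeros(4, 100))
print(outputs.shape, h_final.shape)
```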

Now that the dataset and model definitions are done, we need to implement the training loop and we are good to go.

In [None]:
#########################################################################
# TO-DO: Implement the missing part of the training function. 
# As a loss function we want to use cross-entropy
# It can be called with F.cross_entropy().
# Hint: Pass through the model -> Backpropagate gradients -> Take gradient step
#########################################################################

def train(model, dataloader, optimizer, epoch, log_interval, device='cpu'):
    model.train()
    
    for batch_idx, (data, target) in enumerate(dataloader):
        # we get these from __getitem__ of the dataset class that we implemented earlier
        # data: (batch_size, seq_length, vocab_size)
        # target: (batch_size, seq_length)
        data, target = data.to(device), target.to(device)
        
        # first we need to zero the gradient, otherwise PyTorch would accumulate them
        optimizer.zero_grad()
        
        # Q: Is hidden_state required to be initialized to 0 here?
        # If yes, why? If no, why? How does this affect the training?
        hidden_state = torch.zeros(batch_size, hidden_size).to(device)
        
        ##### implement this part #####
        
        
        ###############################
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(dataloader.dataset),
                100. * batch_idx / len(dataloader), loss.item()))
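One detail that is easy to trip over: `F.cross_entropy` expects the class dimension at position 1, while the model outputs have shape `(batch_size, seq_length, vocab_size)`. The sketch below shows two equivalent ways to handle this on random stand-in tensors; the sizes and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

toy_batch, toy_seq, toy_vocab = 4, 25, 65  # hypothetical sizes
logits = torch.randn(toy_batch, toy_seq, toy_vocab, requires_grad=True)
targets = torch.randint(0, toy_vocab, (toy_batch, toy_seq))

# Option 1: merge batch and time so every timestep is one classification example
loss = F.cross_entropy(logits.reshape(-1, toy_vocab), targets.reshape(-1))
loss.backward()

# Option 2: move the class dimension to position 1, giving (batch, vocab, time)
loss_alt = F.cross_entropy(logits.transpose(1, 2), targets)

print(loss.item(), loss_alt.item())
```

With the default `reduction='mean'`, both options average over all `batch_size * seq_length` predictions.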

Next, we instantiate our model and optimizer and then we can start training.

In [None]:
# model and training parameters
# feel free to experiment with different parameters and optimizers
hidden_size = 100
learning_rate = 3e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'

rnn = RNN(seq_length, vocab_size, hidden_size).to(device)

optimizer = optim.Adagrad(rnn.parameters(), lr=learning_rate)

In [None]:
# training loop
epochs = 3
for epoch in range(1, epochs + 1):
    train(rnn, dataloader, optimizer, epoch, log_interval=1000, device=device)

Along with training the model, it's also good to check what kind of text our model is generating. We have implemented a function that samples text from the model given an initial token and a hidden state.

In [None]:
# sample text of length seq_len; this seq_len need not be the same
# as the seq_length that we used earlier, since we can sample text
# of any arbitrary length.

def sample(hidden_state, token, seq_len):
    token_emb = torch.zeros(1, vocab_size).to(device) # use batch_size=1 at inference
    token_emb[0,token] = 1
    char_indices = [token] # first token
    
    with torch.no_grad():
        for timestep in range(seq_len):
            output, hidden_state = rnn.rnn_cell(token_emb, hidden_state)
            output = torch.softmax(output, dim=-1) # convert to probabilities
            token = torch.argmax(output, dim=-1).item() # greedy choice: the token with the highest probability
            char_indices.append(token)

            token_emb = torch.zeros(1, vocab_size).to(device)
            token_emb[0,token] = 1
    
    return char_indices
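Note that taking the argmax always picks the single most likely character, which can make the generated text repetitive. A common alternative is to draw the next token from the predicted distribution with `torch.multinomial`, optionally with a temperature. The sketch below uses hypothetical scores for a 4-token vocabulary and is not wired into the `sample` function above.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])  # hypothetical scores for 4 tokens

probs = torch.softmax(logits, dim=-1)

# greedy decoding: always the most likely token
greedy_token = torch.argmax(probs, dim=-1).item()

# stochastic decoding: draw one token according to its probability
sampled_token = torch.multinomial(probs, num_samples=1).item()

# a temperature below 1 sharpens the distribution, above 1 flattens it
temperature = 0.5
sharp_probs = torch.softmax(logits / temperature, dim=-1)

print(greedy_token, sampled_token)
print(probs, sharp_probs)
```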

Now, let's sample text from the model after every epoch to see whether it is learning to generate text. In the code below, we sample a 100-character text from the model, starting with a random token and zero memory. Try to generate some text by using the hidden_state returned by the RNN class.

In [None]:
epochs = 3
for epoch in range(1, epochs + 1):
    train(rnn, dataloader, optimizer, epoch, log_interval=1000, device=device)
    
    # sample a 100 char text from the model, starting with a random token and 0 memory
    token = np.random.randint(0, vocab_size)
    hidden_state = torch.zeros(1, hidden_size).to(device)
    char_indices = sample(hidden_state, token, 100) # sample a 100 char text from the model
    txt = ''.join(dataset.ix_to_char[ix] for ix in char_indices)
    print (txt)

If everything went well, then our model should be able to generate some legible text after some epochs. However, it'd probably be quite slow at it. Try to figure out how to make the model learn faster. Use the code from [here](https://gist.github.com/karpathy/d4dee566867f8291f086) as reference.

Hint: It has to do with initializing hidden state to 0 in every training iteration in the training loop.
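A minimal sketch of the idea behind the hint: initialize the hidden state once, carry it across iterations, and call `detach()` so that backpropagation does not reach into the previous iteration's graph. This only makes sense when consecutive batches are loaded sequentially. The sketch uses random stand-in data and PyTorch's built-in `nn.RNNCell`; all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab, toy_hidden, toy_batch = 10, 16, 4  # hypothetical sizes
cell = nn.RNNCell(toy_vocab, toy_hidden)
readout = nn.Linear(toy_hidden, toy_vocab)
params = list(cell.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adagrad(params, lr=3e-2)

hidden_state = torch.zeros(toy_batch, toy_hidden)  # initialized once, before the loop
for step in range(3):
    # random stand-in data; real batches would come from the dataloader
    x = F.one_hot(torch.randint(0, toy_vocab, (toy_batch,)), toy_vocab).float()
    target = torch.randint(0, toy_vocab, (toy_batch,))

    optimizer.zero_grad()
    # detach() keeps the values but cuts the graph of the previous iteration
    hidden_state = cell(x, hidden_state.detach())
    loss = F.cross_entropy(readout(hidden_state), target)
    loss.backward()
    optimizer.step()

print(loss.item())
```

Without the `detach()` call, the second `backward()` would try to backpropagate through the first iteration's graph, which PyTorch has already freed.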

The model should be able to learn spellings and certain words, the use of spaces and how to begin sentences. It most likely won't be able to generate long sentences. To generate long sentences and passages, the model capacity needs to be increased. Try using a 3-layer RNN with a 512-dimensional hidden state and see if the model is able to generate better text.
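One way to get a larger model without stacking cells by hand is PyTorch's built-in `nn.RNN`, which supports multiple layers directly. This is a sketch under that assumption; `StackedCharRNN` and `readout` are illustrative names, and writing the stacking manually with the custom cell is an equally valid route.

```python
import torch
import torch.nn as nn


class StackedCharRNN(nn.Module):
    """Multi-layer RNN built on nn.RNN, followed by a linear readout."""

    def __init__(self, vocab_size, hidden_size=512, num_layers=3):
        super().__init__()
        # batch_first=True makes the input shape (batch, time, vocab)
        self.rnn = nn.RNN(vocab_size, hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.readout = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_seq, hidden_state=None):
        # hidden_state=None lets nn.RNN start from zeros
        features, hidden_state = self.rnn(input_seq, hidden_state)
        return self.readout(features), hidden_state


stacked = StackedCharRNN(vocab_size=65)
outputs, h_n = stacked(torch.zeros(2, 25, 65))
print(outputs.shape, h_n.shape)
```

Note that the hidden state of `nn.RNN` has shape `(num_layers, batch_size, hidden_size)`, so code that allocates a `(batch_size, hidden_size)` state needs to be adapted.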

Bonus: Try implementing a word-level language model with the same model on the same dataset. The only difference is that now each word is a token rather than each character. So, the tokenization in the dataset needs to be changed and everything else remains the same.
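As a starting point for the bonus, here is a small sketch of word-level tokenization. The regular expression is an assumption made for this example: it splits text into words and single punctuation marks and drops whitespace, so line breaks are not preserved. Other tokenization schemes are just as reasonable.

```python
import re

toy_text = "To be, or not to be: that is the question."

# Assumption: a token is either a run of word characters or one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", toy_text)

vocab = sorted(set(tokens))
word_to_ix = {w: i for i, w in enumerate(vocab)}
ix_to_word = {i: w for i, w in enumerate(vocab)}

ids = [word_to_ix[w] for w in tokens]
print(tokens)
print(ids)
```

On the full text the vocabulary will be much larger than 65, so the one-hot inputs and the output layer grow accordingly.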