## Character-level RNN 
<b>The text file supplements/anna.txt contains Leo Tolstoy's novel <i>Anna Karenina</i>.
We train a character-level language model using this dataset.</b>


In [1]:
# Warning: `list(set(data))` below does not produce a stable character order.
# The order can differ between Python sessions (e.g. after a kernel restart), so
# each session may yield a different idx_to_char mapping. I only realised this
# after training my model, so the line is left unchanged here.

# A trained LSTM only works with the idx_to_char mapping it was trained on.
# E.g. a model trained w/ idx_to_char = {1: 'x', 2: 'y', 3: 'k'} produces garbage
# output on a forward pass that uses {1: 'y', 2: 'k', 3: 'x'} instead, so losing
# the original mapping renders the saved weights useless.

# A more robust solution would be to loop over all chars in anna.txt sequentially,
# adding unique chars to a list, so that chars has the same order of characters
# every time you run this code cell (implement this in future scenarios).

import numpy as np
import torch

# data I/O
with open('supplements/anna.txt', 'r') as f:
    data = f.read()

# data is a string. Converting it to a set removes duplicate characters;
# we then store the unique characters in a list.
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print(f"There are {vocab_size} distinct characters in the dataset.")

# Maps a character's unique index to a one-hot vector
# (we map chars to idxs below in a dictionary)
def idx2onehot(idx):
    """
    Maps an idx 'int' to a one-hot vector of shape (vocab_size, 1)
    """
    # Create a one-hot vector to represent the input.
    x = torch.zeros((vocab_size, 1), dtype = torch.double)
    x[idx] = 1
    return x

# Mapping each unique character to a number (char_to_idx)
# and each number back to its character (idx_to_char).
# The numbers run from 0 to len(chars) - 1 (i.e. 0 to 82 here for anna.txt)
char_to_idx = {char:idx for idx, char in enumerate(chars)}
idx_to_char = {idx:char for idx, char in enumerate(chars)}

# Dictionary mapping an idx to a onehot vector of shape (vocab_size,1)
idx_to_onehot = {idx:idx2onehot(idx) for idx in idx_to_char}

# Just getting indexes of characters
data_indxs = [char_to_idx[char] for char in data]
del data # delete data from memory once finished with it

# Length of our sequences used to train our LSTM
seq_length = 20

def create_dataset(data_indxs, seq_length):
    '''
    Transforms a list of indxs (which map to letters sequentially in anna.txt)
    to torch.tensors X and y which form our training examples and ground truth labels.
    
    Args:
        data_indxs: a list of indxs (which map to letters sequentially in anna.txt)
        seq_length: length of the sequences used to train our LSTM
        
    Returns: 
        X is an n_sequences x seq_length training matrix.
        Y has the same dimensions. 
    '''
    
    X, y = [], []
    
    i = 0
    
    while True:
        
        # Stop adding sequences to our training sets
        # when we've reached the end of anna.txt
        if i+seq_length+1 >= len(data_indxs):
            break

        feature = data_indxs[i:i+seq_length]
        target = data_indxs[i+1:i+seq_length+1]
        
        # X is a list of lists of int indexes
        # which are our training sequences
        X.append(feature)
        
        # y is a list of lists of int indexes
        # which are our training target sequences.
        # These are the sequences in X shifted by 1
        y.append(target)
        
        # Change this to i += 1 if
        # you want to slide sequences along 1 character
        # and produce more training data.
        i+=25
        
    return torch.tensor(X, dtype = torch.double), torch.tensor(y, dtype = torch.double)

# x_train_RNN, y_train_RNN 
x_train_RNN, y_train_RNN = create_dataset(data_indxs, seq_length)
del data_indxs # delete data_indxs once we have our dataset

There are 83 distinct characters in the dataset.
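
The warning at the top of the cell above comes down to the iteration order of a `set` of strings, which is not guaranteed to be the same from one Python session to the next. Below is a minimal sketch of two deterministic alternatives, run on a short stand-in string rather than anna.txt. The helper names `build_vocab_sorted` and `build_vocab_first_seen` are hypothetical and not part of the original notebook, and switching to either one would of course invalidate checkpoints trained with the old mapping.

```python
def build_vocab_sorted(text):
    """Unique characters in sorted (code point) order."""
    return sorted(set(text))


def build_vocab_first_seen(text):
    """Unique characters in order of first appearance.

    dict.fromkeys keeps insertion order, so this is the 'loop over the text
    and collect unseen characters' idea in one line.
    """
    return list(dict.fromkeys(text))


sample_text = "happy families are all alike"  # stand-in for the contents of anna.txt
chars_sorted = build_vocab_sorted(sample_text)
chars_first_seen = build_vocab_first_seen(sample_text)

char_to_idx = {ch: i for i, ch in enumerate(chars_sorted)}
idx_to_char = {i: ch for i, ch in enumerate(chars_sorted)}

# Round trip: encoding then decoding recovers the original text.
encoded = [char_to_idx[ch] for ch in sample_text]
decoded = "".join(idx_to_char[i] for i in encoded)
print(chars_first_seen[:4], decoded == sample_text)
```

Either ordering gives the same mapping on every run. Saving the mapping next to the model weights would be a further safeguard.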


<b>
Below I describe the LSTM RNN architecture that I used for this language model and explain the hyperparameters chosen (e.g. how many layers are there, what are the activation functions used).</b>

I used an LSTM architecture for this task. At each timestep the network has an 83-dimensional input $x_{t}$ (a one-hot encoded character), an 83-dimensional output $y_{t}$, and a hidden state $h_t$ of 50 units. There is a single LSTM layer followed by a single linear output layer. The hidden state $h_t$ is a function of the output gate $o_{t} = \sigma(W_{o} \cdot [h_{t-1},x_{t}] +b_o)$ and the cell state $c_{t}$, which in turn depends on $c_{t-1}$, $f_{t}$, $i_{t}$ and $\tilde c_t$. Each of $o_{t}, f_{t}, i_{t}$ and $\tilde c_t$ is a single-layer network that applies an affine transformation to the concatenation $[h_{t-1},x_{t}]$, followed by a sigmoid activation (for the three gates) or a tanh activation (for $\tilde c_t$). The equations below summarise the LSTM architecture in its entirety:


\begin{align*}
f_{t} &= \sigma (W_{f}[h_{t-1},x_{t}] +b_f)  \\
i_{t} &= \sigma (W_{i}[h_{t-1},x_{t}] +b_i)   \\
o_{t} &= \sigma (W_{o}[h_{t-1},x_{t}] +b_o) \\
\tilde c_{t} &= \tanh (W_{h}[h_{t-1},x_{t}] +b_h) \\
c_{t} &= f_t \odot c_{t-1} + i_{t} \odot \tilde c_{t}\\
h_{t} &= o_{t} \odot \tanh(c_t) \\
y_{t} &= W_{y}h_{t} + b_{y}
\end{align*}
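
To make the equations above concrete, here is a minimal sketch of a single LSTM time step written in plain Python, so that each line maps onto one equation. The function names (`sigmoid`, `affine`, `lstm_step`, `random_layer`), the `params` dictionary layout and the toy sizes are assumptions made for illustration; the notebook itself uses `nn.LSTMCell` and `nn.Linear`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the equations above.

    params maps each of 'f', 'i', 'o', 'h' to a (W, b) pair acting on the
    concatenation [h_{t-1}, x_t]; params['y'] is the (W_y, b_y) output layer.
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]        # forget gate
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]        # input gate
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]        # output gate
    c_tilde = [math.tanh(a) for a in affine(*params["h"], z)]  # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    y_t = affine(*params["y"], h_t)                            # logits
    return h_t, c_t, y_t


rng = random.Random(0)
vocab_size, hidden_size = 3, 2  # toy sizes; the notebook uses 83 and 50


def random_layer(n_out, n_in):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


params = {k: random_layer(hidden_size, hidden_size + vocab_size) for k in "fioh"}
params["y"] = random_layer(vocab_size, hidden_size)

x_t = [0.0, 1.0, 0.0]  # one-hot input
h_t, c_t, y_t = lstm_step(x_t, [0.0] * hidden_size, [0.0] * hidden_size, params)
print(len(h_t), len(c_t), len(y_t))  # 2 2 3
```

PyTorch's `nn.LSTMCell` stores separate input-to-hidden and hidden-to-hidden weights (`weight_ih`, `weight_hh`) instead of one matrix per gate acting on the concatenation, but the two forms describe the same computation.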


Note that $y_{t}$ is an 83-dimensional vector of logits which, during training, is passed to the cross-entropy loss function. During inference (sequence generation), suppose we want the $n$ characters that follow a prompt of $m$ characters. We first update $h_t$ and $c_t$ using the $m$ prompt characters (which correspond to $m$ inputs $x_t$); computing $y_t$ for the first $m-1$ of these does no harm, but it is not needed. We then take the $y_t$ produced at the $m^{\text{th}}$ character, compute $p_t = \mathrm{softmax}(y_t)$, and feed the 83-dimensional one-hot encoding of $\mathrm{argmax}(p_t)$ in as the input $x_{t+1}$ at the following timestep. We repeat this process for as long as desired to generate a sequence.
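
The generation procedure described above can be sketched independently of the trained network. In the sketch below, `toy_step` is a hypothetical stand-in for one LSTM step followed by the linear output layer, and `greedy_generate`, `softmax` and `argmax` are illustrative helper names that do not appear in the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_generate(step_fn, state, prompt_idxs, n_future):
    """Warm up the state on a non-empty prompt, then feed each prediction back in.

    step_fn(idx, state) -> (logits, new_state) stands in for one LSTM step
    followed by the linear output layer.
    """
    logits = None
    for idx in prompt_idxs:        # the m prompt characters only update the state
        logits, state = step_fn(idx, state)
    generated = []
    for _ in range(n_future):      # the n generated characters
        next_idx = argmax(softmax(logits))
        generated.append(next_idx)
        logits, state = step_fn(next_idx, state)
    return generated


def toy_step(idx, state):
    # Hypothetical "model" over a 3-character vocabulary:
    # it always prefers the next index, cyclically.
    logits = [0.0, 0.0, 0.0]
    logits[(idx + 1) % 3] = 2.0
    return logits, state


print(greedy_generate(toy_step, None, [0, 1], 5))  # [2, 0, 1, 2, 0]
```

Because softmax preserves the ordering of the logits, `argmax(softmax(y))` always equals `argmax(y)`; the softmax is only needed if the probabilities themselves are used.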

In [2]:
import torch.nn as nn

# Converts a (n_batches, 1) tensor of indexes --> (n_batches, 83)
# tensor of one-hot encodings
def col2onehot(col, idx_to_onehot):

    # Look up each index's one-hot vector and stack them into a single
    # torch tensor on the same device as col
    return torch.tensor([list(idx_to_onehot[idx.item()]) for idx in col]).to(col.device)

class AnnaLSTM(nn.Module):
    def __init__(self, vocab_size = 83, hidden_size = 50):
        super(AnnaLSTM, self).__init__()
        
        self.LSTM_input_size = vocab_size
        self.hidden_size = hidden_size
        self.Linear_output_size = vocab_size

        self.lstm = nn.LSTMCell(self.LSTM_input_size, self.hidden_size, dtype = torch.double)
        self.linear = nn.Linear(self.hidden_size, self.Linear_output_size, dtype = torch.double) 
        # We compute softmax over the vocabulary dimension of our
        # (n_batches, vocab_size) logits, i.e. dim = 1
        self.softmax = nn.Softmax(dim = 1)

    def forward(self, x, y): 
        '''
        Forward pass of LSTM

        Parameters
        ----------
        x: a tensor of shape (n_samples, seq_length) containing indexes (that map to letters via idx_to_char) 
        of examples. MUST convert numbers to one-hot encodings within forward pass loop.
        y: a tensor of shape (n_samples, seq_length) containing indexes (which map to letters via idx_to_char)
        of targets. MUST convert numbers to one-hot encodings within forward pass loop.

        Returns
        -------
        outputs: a tensor of shape (n_samples*seq_length, vocab_size)
           - These are all of the logits stacked in rows, W/ the output
           of the 1st character of the first example on the first row, and the output
           of the last character of the last example on the last row
           
        groundtruths: a tensor of shape (n_samples*seq_length, vocab_size)
           - These are all of the one-hot-encoded targets stacked in rows, W/ the target
           of the 1st character of the first example on the first row, and the target
           of the last character of the last example on the last row
        '''

        outputs, groundtruths = [], []
        
        # Getting batch_size and sequence length
        batch_size, seq_length = x.shape

        # c_t and h_t are initialized w/ zeros and 
        # shape = (batch_size, hidden_size)
        h_t = torch.zeros(batch_size, self.hidden_size, dtype=torch.double).to(x.device)
        c_t = torch.zeros(batch_size, self.hidden_size, dtype=torch.double).to(x.device)

        # Here, x.chunk(x.size(1), dim=1) splits our (n_batches, seq_length)
        # training block up into seq_length columns - so we can effectively
        # forward pass n_batches sequences at once.
        for i, input_t in enumerate(x.chunk(x.size(1), dim=1)):
            
            # input_t is a (n_batches, 1) column vector of idxs that needs
            # to be converted to onehot vectors i.e. need (n_batches, 1) --> (n_batches, 83)
            input_t = col2onehot(input_t, idx_to_onehot)
            
            # y.chunk(y.shape[1], dim = 1) returns a tuple of the columns of y.
            # Column i is a (n_batches, 1) vector of target idxs that needs
            # to be converted to onehot vectors, i.e. (n_batches, 1) --> (n_batches, 83)
            groundtruth_t = col2onehot(y.chunk(y.shape[1], dim = 1)[i], idx_to_onehot)
            
            h_t, c_t = self.lstm(input_t, (h_t, c_t))
            logits = self.linear(h_t)
            
            outputs += [logits]
            groundtruths += [groundtruth_t]
            
        # Outputs are of shape (n_batches*seq_length, 83)
        outputs = torch.stack(outputs, dim = 1).reshape(batch_size*seq_length,-1)
        # Associated groundtruths are of shape (n_batches*seq_length, 83)
        groundtruths = torch.stack(groundtruths, dim = 1).reshape(batch_size*seq_length,-1)

        return outputs, groundtruths
    
    def sample(self, x, future=50):
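        '''
        Greedy generation: warm up the LSTM state on x, then repeatedly feed the
        most likely next character back in as the next input.

        Parameters
        ----------
        x: a tensor of shape (n_samples, seq_length) containing indexes (letters) of to-be-forecasted sequences
           MUST convert numbers to one-hot encodings within forward pass loop
        future: number of characters to generate after each input sequence

        Returns
        -------
        outputs: a tensor of shape (n_samples, future) containing indexes which map to characters
                 via char_to_idx/idx_to_char defined above
                 (squeezed to shape (future,) when n_samples = 1).
        '''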
        best_idxs_output = []
        
        batch_size, seq_length = x.shape
        
        # c_t and h_t are initialized w/ zeros and 
        # shape = (batch_size, hidden_size)
        h_t = torch.zeros(batch_size, self.hidden_size, dtype=torch.double).to(x.device)
        c_t = torch.zeros(batch_size, self.hidden_size, dtype=torch.double).to(x.device)

        # Here, x.chunk(x.size(1), dim=1) splits our (n_batches, seq_length)
        # input block up into seq_length columns - so we can effectively
        # forward pass n_batches sequences at once
        
        for i, input_t in enumerate(x.chunk(x.size(1), dim=1)):
            
            # input_t is a (n_batches, 1) column vector of idxs that needs
            # to be converted to onehot vectors i.e. need (n_batches, 1) --> (n_batches, 83)
            input_t = col2onehot(input_t, idx_to_onehot)
            h_t, c_t = self.lstm(input_t, (h_t, c_t))
            
            if i == (x.size(1) - 1):
                logits = self.linear(h_t)
                p_t = self.softmax(logits)

                # argmax over dim = 1 gives a (n_batches,) vector of idxs,
                # which col2onehot transforms to ---> (n_batches, 83)
                best_idxs = torch.argmax(p_t, dim = 1)
                best_idxs_output += [best_idxs]
                output_t = col2onehot(best_idxs, idx_to_onehot)
                
        # Keep feeding in outputs into the lstm
        for i in range(future-1):

            # STARTING FROM THE LAST OUTPUT - KEEP UPDATING OUTPUT TO FORECAST INTO FUTURE
            h_t, c_t = self.lstm(output_t, (h_t, c_t))
            logits = self.linear(h_t)
            p_t = self.softmax(logits)

            # argmax over dim = 1 gives a (n_batches,) vector of idxs,
            # which col2onehot transforms to ---> (n_batches, 83)
            best_idxs = torch.argmax(p_t, dim = 1)
            best_idxs_output += [best_idxs]
            output_t = col2onehot(best_idxs, idx_to_onehot)
        
        # Here we concatenate the tensors along the column dimension
        outputs = torch.stack(best_idxs_output, 1).squeeze()
        return outputs

<b>Below I train my language model on the dataset.
I also describe details in the training procedure. </b>

<b>Furthermore, I show the training progress by reporting how the result of using my RNN to
complete the sentence "The meaning of life is" changes as more training is
done.</b>

In [25]:
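from time import time
from tqdm import tqdm
import torch.optim as optim
from torch.utils.data import DataLoader
# Note: DatasetWrapper (used below) is not defined in this notebook; it is
# presumably provided by the local util module imported in the next cell.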


def trainRNN(net, x, y, lr=0.5, momentum=0.9, batch_size=64, nepochs=20):
    
    # Define the device the net parameters are on
    device = next(net.parameters()).device 
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    
    # Cross Entropy Loss - loss function
    criterion = nn.CrossEntropyLoss()

    dataloader = DataLoader(DatasetWrapper(x, y), batch_size=batch_size, shuffle=True)
    loop = tqdm(range(nepochs))
    
    # training loop
    for i in loop: # for each epoch
        t0 = time()
        
        epoch_loss = 0
        n_batches = 0
        for (x_batch, y_batch) in dataloader: # for each mini-batch
            
            # Move mini batches to device (GPU if one exists)
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            
            # Zero gradients
            optimizer.zero_grad()
            
            # Get outputs and groundtruths in desired shape/form
            # through net forward pass
            outputs, groundtruths = net(x_batch, y_batch)
            
            # Compute loss
            loss = criterion(outputs, groundtruths)
            
            # Get gradients
            loss.backward()
            
            # Update parameters
            optimizer.step()
            
            # Add loss to epoch loss to eventually compute
            # an average epoch loss
            # Detach from the computational graph to avoid RAM increasing spuriously
            epoch_loss += loss.detach()
            n_batches += 1
      
        # Average epoch loss
        epoch_loss = epoch_loss/n_batches

        # show training progress
        loop.set_postfix(Epoch_Loss ='|%7.5f|' % epoch_loss.item(), time='|%7.2f|' % (time()-t0))
        
        # Show training progress after each epoch
        with torch.no_grad(): 
            test_string = "The meaning of life is"
            input_sample = torch.tensor([char_to_idx[char] for char in test_string]).unsqueeze(0).to(device)
            sample_idxs = net.sample(input_sample, future = 50)
            txt = ''.join(idx_to_char[idx.item()] for idx in sample_idxs)
            output = test_string + txt
            print(f"---- Progress after Epoch {i+1}\n{output} \n----")
            
        # Save model
        torch.save(net.state_dict(), f"supplements/RNNstate_dict_epoch{i+1}.pth")

In [28]:
torch.manual_seed(0) # for reproducibility
from util import *

# Boolean flag which decides if we want
# to retrain our model, or just load saved model
# weights and produce necessary outputs
RETRAIN = False

if RETRAIN:
    device = (
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )

    net = AnnaLSTM().to(device)
    trainRNN(net, x_train_RNN, y_train_RNN, lr=0.5, momentum=0.9, batch_size=64, nepochs=20)
    
else:
    
    epochs = [2,3,6,14,18,19,20] # more checkpoints could be included, but these were the only interesting ones
    
    for epoch in epochs:
        
        loaded_RNN = AnnaLSTM()
        loaded_RNN.load_state_dict(torch.load(f"supplements/RNNstate_dict_epoch{epoch}.pth"))
        
        # Show training progress after each selected epoch
        with torch.no_grad(): 
            test_string = "The meaning of life is"
            input_sample = torch.tensor([char_to_idx[char] for char in test_string]).unsqueeze(0)
            sample_idxs = loaded_RNN.sample(input_sample, future = 50)
            txt = ''.join(idx_to_char[idx.item()] for idx in sample_idxs)
            output = test_string + txt
            print(f"---- Progress after Epoch {epoch}\n{output} \n----")

---- Progress after Epoch 2
The meaning of life is the said the said the said the said the said the  
----
---- Progress after Epoch 3
The meaning of life is the said, and the said, and the said, and the sai 
----
---- Progress after Epoch 6
The meaning of life is to the stalked the stalked the stalked the stalke 
----
---- Progress after Epoch 14
The meaning of life is the controady of the same the same the same the s 
----
---- Progress after Epoch 18
The meaning of life is the countess the same to the same to the same to  
----
---- Progress after Epoch 19
The meaning of life is the strain of the strain of the strain of the str 
----
---- Progress after Epoch 20
The meaning of life is the continual to her for the same to the same to  
----


I trained the LSTM using an SGD optimizer with momentum = 0.9, learning rate = 0.5 and batch size = 64. The model was trained for only 20 epochs, as I ran out of time. The initial hidden state $h_{t_0}$ and cell state $c_{t_0}$ were zero vectors, and each training sequence had a length of 20. The training set was generated by taking the first 20 characters of the anna.txt file as the first training example; the corresponding target (ground-truth label) is the string of 20 characters shifted along by 1 step. Subsequent training examples and targets were generated by sliding the window along 25 characters at a time and again storing an example and a target of length 20. Because this stride of 25 is longer than the sequence length of 20, the few characters that fall between consecutive windows are never used. This procedure gave a training set of 79,409 examples. Ideally, we would have slid the window along 1 character at a time rather than 25, which would have produced a training set of 1,985,201 example sequences of length 20.
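
The windowing arithmetic above can be checked with a short sketch. `make_windows` mirrors the loop in `create_dataset` on plain Python lists, and `count_windows` is a hypothetical closed-form helper. The character count `N_CHARS = 1_985_222` is an assumption: it is the file length implied by the 1,985,201 figure under the loop's stopping condition, not a number stated in the notebook.

```python
def make_windows(idxs, seq_length, stride):
    """Mirror the notebook's create_dataset loop on a plain list of indices."""
    X, y = [], []
    i = 0
    while i + seq_length + 1 < len(idxs):
        X.append(idxs[i:i + seq_length])          # input window
        y.append(idxs[i + 1:i + seq_length + 1])  # same window shifted by 1
        i += stride
    return X, y


def count_windows(n_chars, seq_length, stride):
    """Closed-form count of the windows produced by make_windows."""
    usable = n_chars - seq_length - 1
    return 0 if usable <= 0 else -(-usable // stride)  # ceiling division


# Tiny example: 10 "characters", windows of length 3, stride 4.
X, y = make_windows(list(range(10)), seq_length=3, stride=4)
print(X)  # [[0, 1, 2], [4, 5, 6]]
print(y)  # [[1, 2, 3], [5, 6, 7]]

N_CHARS = 1_985_222  # assumed length of anna.txt (see the note above)
print(count_windows(N_CHARS, seq_length=20, stride=25))  # 79409
print(count_windows(N_CHARS, seq_length=20, stride=1))   # 1985201
```

In the tiny example, index 3 is used only as a target and indices 8 and 9 are never used at all, which is the small-scale version of the characters skipped by a stride of 25.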

Above we can see how the model completed the sentence "The meaning of life is" after selected epochs. Note that the model predicted the next 50 characters following that initial partial sentence.

<b>Below I comment on the text generated by my character-level RNN. I also discuss what my model has learned to generate, and its limitations.</b>

Referring to the model's output above after selected epochs, the model appears to improve its ability to generate realistic text as training continues. At early epochs, the model correctly places spaces between its words, but it repeats the same one or two words over and over. As training continues, the model starts to produce a greater variety of 'words' (note that some of them, such as "controady", are not actual English words), although it still falls into repeating loops. Since I was unable to train the model for longer than 20 epochs, the final model does not generate realistic text. However, I expect that 100+ epochs of training would have produced a better model.

One limitation of the model is that it was trained on a reduced training set, as mentioned above. Furthermore, I didn't have time to try different combinations of learning rate, batch size, number of epochs, momentum and sequence length. In addition, there was no rigorous way to evaluate the quality of the model's output at test time (e.g. we couldn't evaluate it using some kind of test loss metric, since no data was held out). The only measures available were our own sense of the English language and the training set loss at each epoch (which decreased monotonically for the most part).
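
One possible remedy for the missing test metric would be to hold out part of anna.txt and report the average cross-entropy on it, or equivalently its exponential, the perplexity. The notebook does not do this; the sketch below only shows the arithmetic, using made-up probabilities in place of real model outputs. The helper names `cross_entropy` and `perplexity` are illustrative.

```python
import math


def cross_entropy(probs_of_true_char):
    """Average negative log-likelihood (in nats) of the true next characters.

    Each entry is the probability the model assigned to the character that
    actually came next in the held-out text.
    """
    return -sum(math.log(p) for p in probs_of_true_char) / len(probs_of_true_char)


def perplexity(probs_of_true_char):
    """exp(cross-entropy): the effective number of equally likely choices."""
    return math.exp(cross_entropy(probs_of_true_char))


vocab_size = 83
uniform = [1.0 / vocab_size] * 10  # a model that has learned nothing
better = [0.5, 0.25, 0.5, 0.25]    # made-up probabilities, for illustration only

print(round(perplexity(uniform), 6))  # 83.0
print(round(perplexity(better), 6))   # 2.828427
```

A model that guesses uniformly over the 83 characters has perplexity 83, so any held-out perplexity well below that would indicate that the model has learned something beyond the training sequences.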

Another limitation concerns context. The LSTM is unidirectional, so it conditions only on the characters that precede the current one, and everything it remembers must be compressed into a fixed-size hidden state and cell state (50 units each here). In addition, the model was trained on sequences of only 20 characters, with the state reset to zero at the start of each sequence, so it never had the chance to learn dependencies spanning more than 20 characters. This makes it challenging for the model to capture long-range dependencies and the larger context of the text. Our character-level LSTM also has no explicit notion of words or sentences and captures little of their semantic meaning, which limits its ability to generate coherent and meaningful text. This is especially problematic when generating longer sequences of text.