# Recurrent Neural Networks

image.png
[From Gabriel Loye's 'A Beginner's Guide to Recurrent Neural Networks with Pytorch'](https://blog.floydhub.com/a-beginners-guide-on-recurrent-neural-networks-with-pytorch/)

Recurrent neural networks (RNNs) differ from the traditional multi-layer perceptron (MLP) in that they process inputs one at a time and carry information from earlier inputs forward to each new time step. This makes RNNs well-suited to processing sequential data, such as language or audio.

As seen in the diagram above, the simplest RNN consists of a single hidden node (labeled here 'RNN Cell') which takes in a new input and the previous time step's hidden state, and produces a new hidden state and an output for that time step. The equation that produces this hidden state is:

$$h_{t}=\phi(w_{h}h_{t-1}+w_{i}i_{t})$$

where $i_{t}$ is the input at time step $t$, $h_{t-1}$ is the previous hidden state, $w_{h}$ and $w_{i}$ are learned weights, and $\phi$ is an activation function, such as tanh.

The output at each time step is produced from this hidden state based on another set of learned weights, $w_{o}$:
$$o_{t}=w_{o}h_{t}$$

The *recurrence* in this architecture refers to the fact that the RNN cell, which includes the learned weights and the activation function, does not differ between each time step. In other words, each new input and hidden state is fed back into the *same* RNN cell for the entire sequence.
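As a minimal sketch of the two equations above, the loop below applies one scalar RNN cell to a short sequence in plain Python. The function name `rnn_step` and the weight values are illustrative assumptions (nothing here is learned); the point is only that the same weights are reused at every time step.

```python
import math

def rnn_step(h_prev, i_t, w_h, w_i, w_o):
    """One scalar RNN step: returns the new hidden state and the output."""
    h_t = math.tanh(w_h * h_prev + w_i * i_t)  # h_t = phi(w_h * h_{t-1} + w_i * i_t)
    o_t = w_o * h_t                            # o_t = w_o * h_t
    return h_t, o_t

# Illustrative (not learned) weights, shared by every time step
w_h, w_i, w_o = 0.5, 1.0, 2.0

inputs = [1.0, 0.0, -1.0]
h = 0.0  # initial hidden state
hidden_states, outputs = [], []
for i_t in inputs:
    # the same cell (same weights, same activation) is applied at each step
    h, o = rnn_step(h, i_t, w_h, w_i, w_o)
    hidden_states.append(h)
    outputs.append(o)

print(hidden_states)
print(outputs)
```

Note that the second hidden state is non-zero even though the second input is zero: the information comes entirely from the previous hidden state.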

# Long Short-Term Memory

# Character Prediction

To demonstrate the power of LSTMs in processing and understanding sequential data, we'll train and compare the performance of a feed-forward n-gram model and an LSTM on the task of text generation by character-level prediction. 

First, we need to import our dataset and reshape the data so that it can be fed into our models. We'll use the 20 Newsgroups dataset from scikit-learn, which is a collection of roughly 18,000 newsgroup posts, each labeled as belonging to one of 20 topics. We'll remove unwanted characters, such as numbers and line returns, truncate each document to the same length (100 characters) for batching, and then create integer-to-character and character-to-integer dictionaries for later lookups.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import re
import torch
from torch import nn
import numpy as np

#download list of newsgroups documents, without headers, footers, or quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes')).data

cleaned_newsgroups = []
for doc in newsgroups:
    #split document into list of words and punctuation
    words = re.findall(r"[\w']+|[.,!?;]",doc)
    words = [str(w).lower() for w in words if re.fullmatch('[A-Za-z.,!?;]*',w) and not w.isupper() and not re.search('[0-9]', w)]

    #combine cleaned words and limit to first 100 characters
    chars = ' '.join(words)[:100]
    if len(chars) == 100:
        #add sequence to dataset and skip any that are less than 100 characters long
        cleaned_newsgroups.append(chars)

#create sorted list of unique characters from dataset
#(sorting keeps the character-to-integer mapping the same from run to run)
lstm_vocab = sorted(set(''.join(cleaned_newsgroups)))

#map characters to integers and vice versa
i2c = dict(enumerate(lstm_vocab))
c2i = {c:i for i,c in i2c.items()}
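To make the effect of the cleaning filter concrete, here is a small sketch that applies the same two regular expressions to one made-up sentence. The helper name `clean_words`, the sample sentence, and the `_demo` dictionaries are hypothetical and exist only for this illustration.

```python
import re

def clean_words(doc):
    # split into words (optionally containing apostrophes) and punctuation marks
    words = re.findall(r"[\w']+|[.,!?;]", doc)
    # keep purely alphabetic or punctuation tokens that are not all-caps and contain no digits
    return [w.lower() for w in words
            if re.fullmatch('[A-Za-z.,!?;]*', w)
            and not w.isupper()
            and not re.search('[0-9]', w)]

sample = "Hello World, it's 2 NASA rockets!"
cleaned = ' '.join(clean_words(sample))
print(cleaned)

# build the same kind of lookups on the toy text
vocab = sorted(set(cleaned))
i2c_demo = dict(enumerate(vocab))
c2i_demo = {c: i for i, c in i2c_demo.items()}
print([c2i_demo[c] for c in cleaned[:5]])
```

One side effect worth noticing is that words containing apostrophes (`it's`) and all-caps tokens (`NASA`) are dropped entirely, not just normalized.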


## N-gram Code
We'll start by creating and training our somewhat naive n-gram model. Here the model will be able to see the previous n characters in order to predict the next character. We will split each of our sequences into inputs and targets with a moving window.

**Given that each n-gram (sequence of n characters) will become a unique token in our vocabulary, what would be a good choice for n?**

In [None]:
#TODO choose a length for your n-gram model
n = None
ngram_input = []
ngram_target = []

for seq in cleaned_newsgroups:
    #add n-gram sequences to input and output lists
    #limit to the first 10 sequence windows per document
    for i in range(min(10, len(seq) - n)):
        ngram_input.append(seq[i:(i+n)])
        ngram_target.append(seq[i+n])

#create set of unique n-grams
ngram_vocab = set(ngram_input)

#create lookups for converting from ngram to integer
i2n = dict(enumerate(ngram_vocab))
n2i = {ng:i for i,ng in i2n.items()}

#convert inputs and targets to integers
ngram_input = [n2i[ng] for ng in ngram_input]

#need to use character lookup to convert single character targets
ngram_target = [c2i[ng] for ng in ngram_target]

print("Your choice for n has created", len(ngram_vocab), "unique tokens and", len(ngram_input), "training sequences.")
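The moving-window split can be sketched on a toy string as follows. The helper `ngram_windows` and the example text are assumptions made for illustration; the second loop simply counts how many distinct contexts each window length produces on this one string.

```python
def ngram_windows(seq, n, max_windows=None):
    # pair every run of n characters with the single character that follows it
    windows = [(seq[i:i + n], seq[i + n]) for i in range(len(seq) - n)]
    return windows if max_windows is None else windows[:max_windows]

seq = "hello world"
pairs = ngram_windows(seq, n=3)
for context, target in pairs[:3]:
    print(repr(context), '->', repr(target))

# number of distinct contexts (i.e. vocabulary size) for a few window lengths
for n_demo in (1, 2, 3):
    unique = {ctx for ctx, _ in ngram_windows(seq, n_demo)}
    print(n_demo, len(unique))
```

On a real corpus the number of distinct n-grams can grow as fast as the character vocabulary size raised to the power n, which is the trade-off to keep in mind when choosing n.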

Next, we'll use a function to turn our inputs into one-hot encodings. We'll be using cross-entropy for our loss function, which expects the targets to be encoded as integer class labels, so we can leave our targets in their current format.

In [None]:
def one_hot_encode(sequence, vocab_size, sequence_length, batch_size):
    # Create a one hot encoding with batch as the first axis, position in sequence along the second axis,
    # and one-hot vector along the third axis

    # skip sequence position axis for ngram encodings (no sequence position information)
    if sequence_length > 0:
        features = np.zeros((batch_size, sequence_length, vocab_size), dtype=np.float32)
    else:
        features = np.zeros((batch_size, vocab_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for i in range(batch_size):
        #again skipping sequence position axis for ngram encodings
        if sequence_length > 0:
            for u in range(sequence_length):
                features[i, u, sequence[i][u]] = 1
        else:
            features[i, sequence[i]] = 1
    return features
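For intuition about the two output shapes, the sketch below builds the same kind of encoding with a vectorized NumPy idiom (indexing into an identity matrix). This is an alternative formulation shown for illustration, not the function used in the rest of the notebook; `one_hot_rows` is a hypothetical name.

```python
import numpy as np

def one_hot_rows(indices, vocab_size):
    # row k of the identity matrix is the one-hot vector for index k
    return np.eye(vocab_size, dtype=np.float32)[np.asarray(indices)]

# n-gram style: one token per example -> shape (batch, vocab)
tokens = one_hot_rows([2, 0], vocab_size=3)
print(tokens.shape)

# sequence style: several positions per example -> shape (batch, sequence, vocab)
sequences = one_hot_rows([[0, 1], [2, 2]], vocab_size=3)
print(sequences.shape)
```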

In [None]:
char_vocab_size = len(c2i)
ngram_vocab_size = len(n2i)
ngram_batch_size = len(ngram_input)

#use 0 for sequence length since ngrams are each treated as a single token
ngram_input = one_hot_encode(ngram_input, ngram_vocab_size, 0, ngram_batch_size)

Finally, we convert all inputs and targets to tensors for PyTorch to use.

In [None]:
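#we use different torch functions to convert to tensors here, since our inputs are numpy arrays and our targets are lists
ngram_input = torch.from_numpy(ngram_input)
#targets are class indices, so we store them as integers (what the cross-entropy loss expects)
ngram_target = torch.tensor(ngram_target, dtype=torch.long)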
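The choice of constructor matters because `nn.CrossEntropyLoss` expects class-index targets as 64-bit integers. The following sketch, using made-up toy data, shows the dtype each conversion route produces.

```python
import numpy as np
import torch

features = np.zeros((2, 3), dtype=np.float32)  # toy one-hot style inputs
labels = [1, 0]                                # toy integer class labels

x = torch.from_numpy(features)  # keeps the array's dtype (float32) and shares its memory
y_float = torch.Tensor(labels)  # legacy constructor: always the default float dtype
y_long = torch.tensor(labels)   # infers int64 from Python ints

print(x.dtype, y_float.dtype, y_long.dtype)
```

A float target tensor can still be converted with `.long()` right before the loss is computed, which is why both approaches end up working.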

Now we will define our n-gram feed-forward/MLP. We've got the model set up with a single linear layer, but feel free to add layers, regularization, and non-linearity as you see fit.

**We've also left it up to you to choose an activation function for this model.**

In [None]:
class ngram_mlp(nn.Module):
    def __init__(self, ngram_vocab_size, char_vocab_size, hidden_dim):
    
        super(ngram_mlp,self).__init__()
        self.ngram_vocab_size = ngram_vocab_size
        self.hidden_dim = hidden_dim
        self.char_vocab_size = char_vocab_size

        #we take in a one-hot-encoded n-gram and perform a linear transformation on it to the specified hidden layer dimension
        self.L1 = nn.Linear(self.ngram_vocab_size, self.hidden_dim)
        #our output will be an unnormalized score (logit) for each character in our vocabulary
        self.L2 = nn.Linear(self.hidden_dim, self.char_vocab_size)

    def forward(self,x):

        #TODO choose an activation function for our hidden layer
        x = None
        x = self.L2(x)
        return x

As always, we'd like to use a GPU to run these models if one is available. So we'll check here and assign our device.

In [None]:
is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [None]:
ngram = ngram_mlp(ngram_vocab_size,char_vocab_size, 100)
ngram.to(device)

Now it's time to set up our loss function and optimizer. We'll also take advantage of PyTorch's built-in data handling for batching.

**Choose a number of epochs and a learning rate that make sense for your model, the size of your ngram vocabulary, and the size of your input dataset.**

In [None]:
n_epochs = None
lr = None

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(ngram.parameters(), lr=lr)

import torch.utils.data as data_utils

ngram_train = data_utils.TensorDataset(ngram_input.to(device), ngram_target.to(device))
ngram_train_loader = data_utils.DataLoader(ngram_train, batch_size=100, shuffle=True)
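`nn.CrossEntropyLoss` takes raw scores (logits) and an integer class label, applies a log-softmax, and returns the negative log-probability of the correct class. The plain-Python sketch below reproduces that computation for a single example; the function name and the logit values are illustrative assumptions.

```python
import math

def cross_entropy(logits, target_index):
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # negative log-softmax of the target class
    return log_sum_exp - logits[target_index]

logits = [2.0, 1.0, 0.1]
loss_correct = cross_entropy(logits, 0)  # target is the highest-scoring class
loss_wrong = cross_entropy(logits, 2)    # target is the lowest-scoring class
print(round(loss_correct, 3), round(loss_wrong, 3))
```

The loss is small when the model already scores the target character highest and grows as the target's score falls behind the others.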

We'll also need some functions to be able to see how our model performs when generating text. During training, we'll print out a sample at the end of each epoch to see how the training is progressing.

In [None]:
def ngram_predict(model, ngram, temperature):
    # Produce the next character as a function of the previous n characters

    # to avoid a KeyError for unseen n-grams, randomly replace the leading characters
    # (at most the first two) until we find an n-gram that is in our vocab
    ngram = list(ngram)
    while ''.join(ngram) not in n2i:
        for j in range(min(2, len(ngram))):
            ngram[j] = np.random.choice(list(c2i.keys()))

    # One-hot encoding our input to fit into the model
    token = [n2i[''.join(ngram)]]
    token = one_hot_encode(token, ngram_vocab_size, 0, 1)
    token = torch.from_numpy(token).to(device)

    # no gradients are needed when sampling
    with torch.no_grad():
        out = model(token)

    # divide the scores by the temperature before converting them to probabilities
    prob = nn.functional.softmax(out[-1]/temperature, dim=0)
    char_ind = torch.multinomial(prob, 1).item()

    return i2c[char_ind]

def ngram_sample(model, n, out_len, start='the',temperature=1.0):
    # Create a string of out_len characters by repeated model prediction on the generated sequence.
    model.eval() # eval mode
    start = start.lower()
    # left-pad with spaces so that the first context is always n characters long
    chars = list(start.rjust(n))
    size = out_len - len(chars)
    for i in range(size):
        char = ngram_predict(model,chars[-n:],temperature)
        chars.append(char)
    model.train()

    return ''.join(chars)

And now we're ready to actually train the model. We'll print out our loss after every 500 batches, and print out a sample of generated text after each epoch.

In [None]:
ngram.train()

for epoch in range(1, n_epochs + 1):
    for idx,data in enumerate(ngram_train_loader):
        d,t = data
        optimizer.zero_grad()
        output = ngram(d)
        loss = criterion(output, t.view(-1).long())
        loss.backward() 
        optimizer.step()
        
        if idx%500==0:
            #len(ngram_train_loader) is the number of batches per epoch
            print('Epoch: {}/{}...Batch:{}/{}..........'.format(epoch, n_epochs, idx, len(ngram_train_loader)), end=' ')
            print("Loss: {:.4f}".format(loss.item()))
    print(ngram_sample(ngram, n, 50))

## LSTM Code

Now we'll go through the same process in setting up and training our LSTM character prediction model.

First we need to split up our input and target sentences for our LSTM model.

**Given that our model will predict the next character for each character in the input, how should we define the input and target sequences?**

In [None]:
lstm_input = []
lstm_target = []

for seq in cleaned_newsgroups:
  #skip any sequences that are less than 100 characters long
  if len(seq) == 100:
  
    #TODO segment each sequence into an input and target sequence and convert to integers
    lstm_input.append(None)
    lstm_target.append(None)

for i in range(len(lstm_input)):
    lstm_input[i] = [c2i[c] for c in lstm_input[i]]
    lstm_target[i] = [c2i[c] for c in lstm_target[i]]

print("LSTM Input Sequence: {}\nLSTM Target Sequence:  {}".format(lstm_input[i], lstm_target[i]))

Next we run our input through the one-hot-encoding function and convert both the input and the target to tensors.

In [None]:
lstm_seq_len = len(lstm_input[0])
lstm_batch_size = len(lstm_input)
lstm_input = one_hot_encode(lstm_input, char_vocab_size, lstm_seq_len, lstm_batch_size)

In [None]:
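#we use different torch functions to convert to tensors here, since our inputs are numpy arrays and our targets are lists
lstm_input = torch.from_numpy(lstm_input)
#targets are class indices, so we store them as integers (what the cross-entropy loss expects)
lstm_target = torch.tensor(lstm_target, dtype=torch.long)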

Now we're ready to create our LSTM model. We'll use PyTorch's LSTM module, which will automatically handle the weights for each of the functions performed inside the LSTM cell. All we need to add is a final linear layer to produce the output at each time step in the shape of our character vocabulary.

In [None]:
class lstm_char_predict(nn.Module):

    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(lstm_char_predict, self).__init__()

        #Defining the size of our hidden layer(s) and how many of them we want in our model
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining our LSTM layer(s) and our linear (output) layer
        self.lstm = nn.LSTM(input_size, hidden_dim, n_layers, batch_first = True, dropout = 0.5)
        self.output = nn.Linear(hidden_dim, output_size)

    def forward(self, x):

        #initialize hidden and cell states for first time step
        batch_size = x.size(0)
        hidden = self.init_hidden_cell(batch_size, x)

        #pass the input and hidden state into our LSTM model
        out, hidden = self.lstm(x, hidden)

        #reshape our LSTM output to be fed into the linear output layer
        #here we merge the batch and sequence axes of the output, which is fine since the cross-entropy loss is averaged over every position anyway
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.output(out)

        return out, hidden

    def init_hidden_cell (self, batch_size, x):
    
        #we need to initialize a hidden state of zeros for input to our first timestep
        hidden = (torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=x.device), 
                  torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=x.device))
    
        return hidden
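To see the tensor shapes that flow through this kind of model, the sketch below runs a bare `nn.LSTM` on a dummy batch. All sizes are arbitrary assumptions chosen to be small, and dropout is omitted because PyTorch only applies it between stacked LSTM layers.

```python
import torch
from torch import nn

batch_size, seq_len, vocab_size, hidden_dim, n_layers = 2, 5, 4, 3, 1

lstm_layer = nn.LSTM(vocab_size, hidden_dim, n_layers, batch_first=True)
x = torch.zeros(batch_size, seq_len, vocab_size)    # (batch, sequence, features)
h0 = torch.zeros(n_layers, batch_size, hidden_dim)  # initial hidden state
c0 = torch.zeros(n_layers, batch_size, hidden_dim)  # initial cell state

out, (h_n, c_n) = lstm_layer(x, (h0, c0))
# merge batch and sequence axes so a linear layer can score every position at once
flat = out.contiguous().view(-1, hidden_dim)

print(out.shape, h_n.shape, flat.shape)
```

After flattening, there is one row per (example, position) pair, which lines up with targets flattened by `view(-1)`.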


**What are our input and output dimensions? Also, choose a size and number for the hidden layers in your model.**

In [None]:
#TODO: fill in input/output size, size of hidden layer, and number of hidden layers
lstm = lstm_char_predict(None)
lstm.to(device)

Just like for the n-gram model, we'll use cross-entropy loss and an Adam optimizer, and we'll use PyTorch's data loader for handling our batches.

**Choose a number of epochs and a learning rate that make sense for your model, the size of the character vocabulary, and the size of your input dataset.**

In [None]:
import torch.utils.data as data_utils
#TODO: fill in number of epochs and learning rate
n_epochs = None
lr = None

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm.parameters(), lr = lr)

train = data_utils.TensorDataset(lstm_input.to(device), lstm_target.to(device))
lstm_train_loader = data_utils.DataLoader(train, batch_size = 100, shuffle = True)

And again we'll create some functions to see what our model's predictions look like. Notice that we don't need the same workaround for key errors here, since our vocabulary is an exhaustive list of all possible character inputs (unlike the n-gram model, where many possible n-grams never appear in the input dataset).

In [None]:
def lstm_predict(model, character, temperature):
    # Produce the next character as a function of all characters generated so far
    # (the whole sequence is re-run each call, starting from a zero hidden state)
    
    # One-hot encoding our input to fit into the model
    character = np.array([[c2i[c] for c in character]])
    character = one_hot_encode(character, char_vocab_size, character.shape[1], 1)
    character = torch.from_numpy(character).to(device)
    
    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1]/temperature, dim=0).data
    char_ind = torch.multinomial(prob, 1).item()

    return i2c[char_ind], hidden

def lstm_sample(model, out_len, start='the',temperature=1.0):
    # Create a string of out_len characters by repeated model prediction on the generated sequence.    
    model.eval() # eval mode
    start = start.lower()
    chars = [ch for ch in start]
    size = out_len - len(chars)
    for i in range(size):
        char, hidden = lstm_predict(model, chars,temperature)
        chars.append(char)
    model.train()

    return ''.join(chars)

And now we can train our model, once again printing the loss during each epoch. **You may need to go back and fine tune the size/number of hidden layers.**

In [None]:
lstm.train()

for epoch in range(1, n_epochs + 1):
    for idx, data in enumerate(lstm_train_loader):
        d,t = data
        optimizer.zero_grad()
        output,hidden = lstm(d)
        loss = criterion(output, t.view(-1).long())
        loss.backward() 
        optimizer.step()
    
        if idx%50 == 0:
            print('Epoch: {}/{}..Batch: {}/{}.............'.format(epoch, n_epochs, idx, len(lstm_train_loader)), end=' ')
            print("Loss: {:.4f}".format(loss.item()))
    print(lstm_sample(lstm, 50))

## Trying out a sample from our model

**Choose an output length, a start sequence of characters, and a temperature.**

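Note from our lstm_predict function that the model's output scores are divided by the temperature before the softmax, so the temperature is a hyperparameter that controls the randomness of the LSTM's predictions: values below 1 sharpen the distribution toward the most likely character, while values above 1 flatten it. **Explore how a temperature of 1, 0.1, etc. affects the output and fine-tune it to something that makes sense for our model. Try out multiple start sequences to get a feel for the difference between the LSTM and the n-gram MLP.**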
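The effect of temperature can be seen without any model at all. The sketch below, in plain Python with made-up scores for three characters, divides the scores by the temperature before applying a softmax, which is the same operation the prediction functions perform; `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    # scale the scores, then apply a numerically stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative scores for three characters
for temp in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability sits on the top-scoring character, so sampling becomes almost deterministic; at a high temperature the three characters become close to equally likely.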

In [None]:
# TODO: fill in output length, start character sequence, and temperature
out_len = None
start = None
lstm_temp = None

print(lstm_sample(lstm, out_len=out_len, start=start, temperature=lstm_temp))

Let's look at how this output compares to one generated by our n-gram MLP. 

In [None]:
# TODO: fine tune the temperature of the n-gram model.
ngram_temp = None

print(ngram_sample(ngram, n, out_len=out_len, start=start, temperature=ngram_temp))

**Note any syntactic differences between the two models' outputs.** With good hyperparameters, we should notice more coherent word-by-word predictions from the LSTM.