# Recurrent Neural Networks (RNNs)

## Intuition
Ordinary neural networks have some limitations: they take a fixed-size input and give a fixed-size output. For example, they take in a fixed number of pixels and give out a vector of predictions, one likelihood per character.

Another problem is that they have no concept of memory or context. For example, if you used one to predict words in a sentence, it would take in all of the words of the sentence at once and predict the next word without taking into account which words came previously; it might just predict that the next word should be a proper noun, or whatever.

RNNs solve this problem. RNNs process sequences. Normal neural networks can be thought of as calculating an output based on some input. RNNs can be thought of as calculating an output based on the input BUT ALSO on the history of other inputs as well.

Imagine a loop.

The first time you go through the loop, you feed in some input and get some output.

The next time you go through the loop, you feed in some input AND you feed in the previous output as input too.

Thus you can get predictions for sequences of data based on previous sequences and new input.

For this reason, RNNs have many applications such as stock prediction, video frame captioning (take into account previous frames), segment-to-segment machine translation (translate by grammatical segments instead of processing word-by-word), etc.
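
The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `step` and its 50/50 blending rule are made up for the example and are not how a trained RNN combines its inputs.

```python
def step(prev_output, new_input):
    # Hypothetical update rule: blend the previous output with the new input.
    return 0.5 * prev_output + 0.5 * new_input


def run_sequence(inputs):
    output = 0.0  # nothing has been seen yet
    outputs = []
    for x in inputs:
        output = step(output, x)  # the previous output is fed back in
        outputs.append(output)
    return outputs


print(run_sequence([1.0, 1.0, 1.0]))  # [0.5, 0.75, 0.875]
print(run_sequence([0.0, 1.0, 1.0]))  # [0.0, 0.5, 0.75]
```

Both sequences end with the same input (1.0), but the final outputs differ because the history differs.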

## Note on LSTMs
In practice, RNNs are often implemented as LSTMs (Long Short-Term Memory networks). This means that when generating predictions, they can use some sub-networks to choose which data to ignore, which data to hold, and which data to select for the prediction.

Since RNNs just take in a vector and spit out a vector, you can easily feed one RNN into another.

## Note on tanh
RNNs use the hyperbolic tangent (tanh) as a squashing function to help stop values from exploding. I.e., it keeps the activations between -1 and 1, so they cannot blow up to huge numbers, which might happen if they were, say, doubled on every iteration of the loop.

It's very similar to the sigmoid function.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/tanh.jpg" width=500 height=350>
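
As a rough illustration of the squashing effect, the sketch below repeatedly doubles a value with and without tanh. The helper names (`repeat`, `sigmoid`) are made up for this example, and it only demonstrates the range of the values, not the behaviour of gradients during training. It also checks the identity tanh(x) = 2·sigmoid(2x) − 1, which is why the two curves look so alike.

```python
import math


def repeat(fn, value, steps):
    # Apply fn to value over and over, like going round the RNN loop.
    for _ in range(steps):
        value = fn(value)
    return value


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


doubled = repeat(lambda v: 2.0 * v, 1.0, 50)              # grows to 2**50
squashed = repeat(lambda v: math.tanh(2.0 * v), 1.0, 50)  # stays inside (-1, 1)

print(doubled)
print(squashed)

# tanh is a rescaled, recentred sigmoid.
print(abs(math.tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0)) < 1e-12)  # True
```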

## LSTM Example
Ref: https://www.youtube.com/watch?v=WCUNPb-5EYI

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/lstm0.jpg" width=600 height=450>

First, we take our previous input, our new input, and our list of predictions. We get a list of predictions: Doug, or saw. Doug and saw are put into memory. Our trained weights predict the next most likely word is saw. This gives us a new list of predictions: Doug, Spot, and Jane.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/lstm1.jpg" width=600 height=450>

We then repeat the process with our new predictions. When we get to the "ignore" layer, based on our previous memory, we know we've already seen Doug, so we ignore it, leaving us to choose either Jane or Spot.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/ltsm2.jpg" width=600 height=450>

In this way, RNNs can take account of what they previously saw when making new predictions.

## A very simple RNN
Ref: https://www.youtube.com/watch?v=UNmqTiOnRfg

Imagine we have three food types represented by one-hot encoded vectors. Each food gets cooked in a sequence, depending on the weather.

So in this example, food is our previous output, and weather is our new input. We need to remember the last food cooked, and combine it with the weather.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn0.jpg" width=600 height=450>

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn1.jpg" width=600 height=450>

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn2.jpg" width=600 height=450>

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn3.jpg" width=600 height=450>

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn4.jpg" width=600 height=450>

So when implementing our RNNs, we multiply the previous output (fed back in as input) by one weight matrix, and our new input by another weight matrix.
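
Here is a minimal sketch of that idea in plain Python. The cooking rule (sunny means cook the same food again, rainy means cook the next food in the list), the food names, and both weight matrices are assumptions chosen for illustration; they may not match the diagrams exactly.

```python
FOODS = ["apple pie", "burger", "chicken"]

# Hypothetical weights. Rows 0-2 answer "same food as last time",
# rows 3-5 answer "next food in the list".
FOOD_WEIGHTS = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
# Sunny [1, 0] votes for rows 0-2, rainy [0, 1] votes for rows 3-5.
WEATHER_WEIGHTS = [
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 1],
]
SUNNY, RAINY = [1, 0], [0, 1]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def step(prev_food, weather):
    food_part = matvec(FOOD_WEIGHTS, prev_food)
    weather_part = matvec(WEATHER_WEIGHTS, weather)
    summed = [f + w for f, w in zip(food_part, weather_part)]
    # Keep only the entry that both parts agree on (value 2).
    kept = [1 if s == 2 else 0 for s in summed]
    # Merge the "same" half and the "next" half into one one-hot vector.
    return [kept[i] + kept[i + 3] for i in range(3)]


def cook(first_food, forecast):
    food = first_food
    menu = []
    for weather in forecast:
        food = step(food, weather)  # previous output becomes the next input
        menu.append(FOODS[food.index(1)])
    return menu


print(cook([1, 0, 0], [RAINY, SUNNY, RAINY, RAINY]))
# ['burger', 'burger', 'chicken', 'apple pie']
```

The thresholding step plays the role of the non-linearity: it picks out the one entry that both the food and the weather point to.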

## RNN Implementation
In this example, we will download the collected works of Nietzsche, and then train a model to predict the next character given n starting characters.

### Data Setup

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

In [4]:
PATH='data/nietzsche/'

In [5]:
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))

corpus length: 600893


In [6]:
# Take a look at the first 400 characters
text[:400]

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

Instead of predicting word by word, we will predict char by char. This has a few advantages: you don't have to worry about unrecognized/rare words, and there are going to be fewer unique items.

In [7]:
# Insert into set to get rid of dupes
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars', vocab_size)

total chars 85


In [8]:
# Insert a value for padding
chars.insert(0, '\0')

In [9]:
''.join(chars)

'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzÆäæéë'

Now we will create mappings from characters to indices, and indices to characters.

In [10]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

From now on, we will work with the text as a list of indices generated using the mappings above.

In [12]:
indices = [char_indices[c] for c in text]
indices[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [17]:
''.join(indices_char[i] for i in indices[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
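
To make the round trip concrete, here is the same encode/decode idea on a tiny made-up string (a sketch in plain Python; the sample text is an assumption, and the variable names simply mirror the cells above).

```python
text = "hello world"

# Unique characters, sorted, with index 0 reserved for padding.
chars = sorted(set(text))
chars.insert(0, "\0")

char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode the text as indices, then decode it back again.
encoded = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in encoded)

print(encoded)  # [4, 3, 5, 5, 6, 1, 8, 6, 7, 5, 2]
print(decoded)  # hello world
```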

### A 3 char model
Now let's build a model where we can pass in 3 chars and get the fourth back. For this to work, we step through the text three characters at a time, grabbing the chars at offsets 0, 1, and 2 as the inputs and the char at offset 3 as the label.

In [33]:
# Get every nth element in indices -- our string of letters converted to indices
# range statement = get 0...len-3, skipping 3 chars at a time
char_skip = 3

c1_data = [indices[i] for i in range(0, len(indices)-char_skip, char_skip)]
c2_data = [indices[i+1] for i in range(0, len(indices)-char_skip, char_skip)]
c3_data = [indices[i+2] for i in range(0, len(indices)-char_skip, char_skip)]
c4_data = [indices[i+3] for i in range(0, len(indices)-char_skip, char_skip)]

In [34]:
# Set up input (put into array)
x1 = np.stack(c1_data)
x2 = np.stack(c2_data)
x3 = np.stack(c3_data)

In [35]:
# Set up the output
y = np.stack(c4_data)

In [36]:
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [37]:
y[:4]

array([30, 29,  1, 40])
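
The slicing above can be checked on a toy sequence. The helper `make_char3_data` below is a hypothetical name introduced for this sketch; it groups the inputs into tuples instead of three separate arrays, but uses the same `range` as the cells above.

```python
def make_char3_data(indices, char_skip=3):
    # Start positions 0, 3, 6, ... so the input triples do not overlap.
    starts = range(0, len(indices) - char_skip, char_skip)
    inputs = [tuple(indices[i + k] for k in range(char_skip)) for i in starts]
    targets = [indices[i + char_skip] for i in starts]
    return inputs, targets


inputs, targets = make_char3_data(list(range(10)))
print(inputs)   # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
print(targets)  # [3, 6, 9]
```

Note that each target is also the first input of the next example, which matches the real output above (the first `y` value, 30, is also the second `x1` value).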

#### Create model
(See diagrams for reference: https://github.com/fastai/fastai/blob/master/courses/dl1/ppt/lesson6.pptx)

Our first version of the model will look like this:

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn5.jpg" width=650 height=450>

Note that arrows of the same colour mean we are using the same weight matrix.

In [38]:
num_of_hidden_activations = 256

In [39]:
# I.e., size of embedding matrix
num_of_factors = 42

In [51]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, num_of_factors):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, num_of_factors)
        
        # Set up the first hidden layer -- from input to hidden layer (green arrow)
        self.input_layer = nn.Linear(num_of_factors, num_of_hidden_activations)
        
        # Set up the hidden layer -- from hidden to hidden (orange arrow)
        self.hidden_layer = nn.Linear(num_of_hidden_activations, num_of_hidden_activations)
        
        # Set up the output layer -- from hidden to output (blue arrow)
        self.output_layer = nn.Linear(num_of_hidden_activations, vocab_size)
        
    # Take in 3 chars
    def forward(self, c1, c2, c3):
        
        # Feed inputs through the embedding layer, then the input layer, then ReLU
        input_char1 = F.relu(self.input_layer(self.embedding(c1)))
        input_char2 = F.relu(self.input_layer(self.embedding(c2)))
        input_char3 = F.relu(self.input_layer(self.embedding(c3)))
        
        # Set up hidden activations
        hidden_activations = V(torch.zeros(input_char1.size()).cuda())
        
        # Put the input through the hidden layers, adding them together
        # We use tanh to keep things between -1 and 1
        hidden_activations = F.tanh(self.hidden_layer(hidden_activations+input_char1))
        hidden_activations = F.tanh(self.hidden_layer(hidden_activations+input_char2))
        hidden_activations = F.tanh(self.hidden_layer(hidden_activations+input_char3))
        
        # Finally use log softmax over the vocab dimension to score each possible next char
        return F.log_softmax(self.output_layer(hidden_activations), dim=-1)
        

In [52]:
# Set up our model data (just to avoid using PyTorch loaders)
model_data = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

In [53]:
model = Char3Model(vocab_size, num_of_factors).cuda()

In [54]:
# Set up an iterator so we can grab mini-batches and iterate through them
iterator = iter(model_data.trn_dl)

# Grab the Xs and Ys for a mini-batch (as tensors)
*xs,yt = next(iterator)

# Convert tensor to variable and pass to the model, giving a tensor of floats
t = model(*V(xs))

In [61]:
# Shape is 512 (batch size) x 85 (log-probability of each of the vocab items)
t

Variable containing:
-4.3439 -4.4141 -4.4359  ...  -4.4682 -4.5836 -4.4587
-4.5228 -4.3487 -4.5125  ...  -4.6175 -4.4831 -4.5724
-4.0663 -4.3157 -4.6830  ...  -4.6195 -4.3445 -4.3563
          ...             ⋱             ...          
-4.1089 -4.2008 -4.4603  ...  -4.3261 -4.5919 -4.4617
-4.1337 -4.3263 -4.5595  ...  -4.2842 -4.5107 -4.5387
-4.1154 -4.3211 -4.4799  ...  -4.1908 -4.6491 -4.5378
[torch.cuda.FloatTensor of size 512x85 (GPU 0)]
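
The values above hover around -4.4, which is roughly what you would expect from an untrained model: if all 85 characters were equally likely, each log-probability would be ln(1/85). The sketch below checks this in plain Python; `log_softmax` here is a hand-written stand-in for illustration, not PyTorch's implementation.

```python
import math


def log_softmax(scores):
    # Subtract the max first for numerical stability.
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]


vocab_size = 85
uniform = log_softmax([0.0] * vocab_size)

print(round(uniform[0], 4))                           # -4.4427
print(round(sum(math.exp(v) for v in uniform), 6))    # 1.0
```

Exponentiating a row of log-softmax output gives probabilities that sum to 1, which is why `F.nll_loss` can be applied directly to it.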

In [56]:
# Set up our optimizer
opt = optim.Adam(model.parameters(), 1e-2)

In [57]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      2.101551   1.388842  



[1.3888421]

In [58]:
# Lower the learning rate (a manual annealing step) and run again
set_lrs(opt, 0.001)

In [60]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.863796   0.42308   



[0.42307997]

#### Test the model

In [62]:
def get_next(inp):
    # Convert character in the input to tensor
    indices = T(np.array([char_indices[c] for c in inp]))
    
    probabilities = model(*VV(indices))
    
    i = np.argmax(to_np(probabilities))
    
    return chars[i]

In [63]:
get_next('y. ')

'T'

In [66]:
get_next(' an')

'd'

## Converting to RNN
We will now convert the code to be an RNN. The code is largely the same, except that we now run the hidden layer in a loop. The hidden activations are carried around the loop too. In other words, the output of one step becomes part of the input to the next step.

<img src="https://raw.githubusercontent.com/pekoto/fast.ai/master/images/rnn6.jpg" width=600 height=450>

In [67]:
char_skip=8

In [70]:
# For each start position j, take the 8 chars at j..j+7 (overlapping windows). These will be the inputs.
input_character_data = [[indices[i+j] for i in range(char_skip)] for j in range(len(indices)-char_skip)]

In [72]:
# Then create a list of the next char in the series. These will be the training labels.
output_character_data = [indices[j+char_skip] for j in range(len(indices)-char_skip)]

In [73]:
xs = np.stack(input_character_data, axis=0)

In [74]:
xs.shape

(600885, 8)

In [75]:
y = np.stack(output_character_data)

In [76]:
xs[:char_skip, :char_skip]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

In [77]:
y[:char_skip]

array([ 1,  1, 43, 45, 40, 40, 39, 43])
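
Unlike the 3 char model, these windows overlap: each one starts a single character after the previous one. A small sketch on a toy sequence (the helper name `make_windows` is hypothetical, introduced only for this example):

```python
def make_windows(indices, window=8):
    n = len(indices) - window
    # Window j holds indices[j .. j+window-1]; its label is the next item.
    xs = [indices[j:j + window] for j in range(n)]
    ys = [indices[j + window] for j in range(n)]
    return xs, ys


xs, ys = make_windows(list(range(12)), window=8)
print(len(xs))  # 4
print(xs[0])    # [0, 1, 2, 3, 4, 5, 6, 7]
print(xs[1])    # [1, 2, 3, 4, 5, 6, 7, 8]
print(ys)       # [8, 9, 10, 11]
```

The number of windows is the sequence length minus the window size, which agrees with the shape above: 600893 - 8 = 600885.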

In [78]:
validation_indices = get_cv_idxs(len(indices)-char_skip-1)

In [79]:
model_data = ColumnarModelData.from_arrays('.', validation_indices, xs, y, bs=512)

In [88]:
# This is essentially the same as our earlier model

class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, num_of_factors):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, num_of_factors)
        self.input_layer = nn.Linear(num_of_factors, num_of_hidden_activations)
        self.hidden_layer = nn.Linear(num_of_hidden_activations, num_of_hidden_activations)
        self.output_layer = nn.Linear(num_of_hidden_activations, vocab_size)
        
    def forward(self, *chars):
        bs = chars[0].size(0)
        h = V(torch.zeros(bs, num_of_hidden_activations).cuda())
        for char in chars:
            # Loop around our input layer and hidden layers
            inp = F.relu(self.input_layer(self.embedding_layer(char)))
            h = F.tanh(self.hidden_layer(h+inp))
        
        return F.log_softmax(self.output_layer(h), dim=-1)

In [89]:
model = CharLoopModel(vocab_size, num_of_factors).cuda()
opt = optim.Adam(model.parameters(), 1e-2)

In [90]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.98031    1.956332  



[1.956332]

In [91]:
set_lrs(opt, 0.001)

In [92]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.696746   1.697028  



[1.6970276]

One issue with the model above is that we add the input to the hidden activations (h). The input represents the encoding of a single character, whereas h encodes the series of characters seen so far. Since these are different types of info, we want to concatenate them, not add them.
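
A toy sketch of the difference, in plain Python with made-up vectors (`add` and `concat` are hypothetical helpers, not PyTorch calls). Adding mixes the two sources together so they can no longer be told apart, while concatenating keeps them side by side and also lets them have different sizes.

```python
def add(a, b):
    if len(a) != len(b):
        raise ValueError("element-wise addition needs equal sizes")
    return [x + y for x, y in zip(a, b)]


def concat(a, b):
    return list(a) + list(b)


# Adding loses track of which vector contributed what...
print(add([1, 0], [0, 1]) == add([0, 1], [1, 0]))        # True
# ...whereas concatenating keeps the two sources separate.
print(concat([1, 0], [0, 1]) == concat([0, 1], [1, 0]))  # False

num_of_factors = 42
num_of_hidden_activations = 256
h = [0.0] * num_of_hidden_activations
embedded_char = [0.1] * num_of_factors
print(len(concat(h, embedded_char)))  # 298
```

The concatenated length, 298, is why the input layer in the next model takes `num_of_factors+num_of_hidden_activations` features.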

In [99]:
class CharLoopConcatModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, num_of_factors):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, num_of_factors)
        self.input_layer = nn.Linear(num_of_factors+num_of_hidden_activations, num_of_hidden_activations)
        self.hidden_layer = nn.Linear(num_of_hidden_activations, num_of_hidden_activations)
        self.output_layer = nn.Linear(num_of_hidden_activations, vocab_size)
        
    def forward(self, *chars):
        bs = chars[0].size(0)
        h = V(torch.zeros(bs, num_of_hidden_activations).cuda())
        for char in chars:
            # Loop around our input layer and hidden layers
            inp = torch.cat((h, self.embedding_layer(char)), 1)
            inp = F.relu(self.input_layer(inp))
            h = F.tanh(self.hidden_layer(inp))
        
        return F.log_softmax(self.output_layer(h), dim=-1)

In [100]:
model = CharLoopConcatModel(vocab_size, num_of_factors).cuda()
opt = optim.Adam(model.parameters(), 1e-3)

In [101]:
it = iter(model_data.trn_dl)
*xs,yt = next(it)
t = model(*V(xs))

In [103]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.843979   1.822812  



[1.8228117]

In [104]:
set_lrs(opt, 1e-4)

In [107]:
fit(model, model_data, 1, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.737225   1.738723  



[1.7387227]

#### Testing

In [None]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = model(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [106]:
get_next('for thos')

'e'

In [108]:
get_next('part of ')

't'

## PyTorch Implementation
The implementation using PyTorch's built-in `nn.RNN` layer is largely the same, except that the layer runs the loop over the characters for us.

In [120]:
class CharRnn(nn.Module):
    def __init__(self, vocab_size, num_of_factors):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, num_of_factors)
        self.rnn = nn.RNN(num_of_factors, num_of_hidden_activations)
        self.output_layer = nn.Linear(num_of_hidden_activations, vocab_size)
        
    def forward(self, *chars):
        batch_size = chars[0].size(0)
        hidden_activations = V(torch.zeros(1, batch_size, num_of_hidden_activations))
        inp = self.embedding_layer(torch.stack(chars))
        
        # Basically the difference is it just runs the loop for us
        outp,hidden_activations = self.rnn(inp, hidden_activations)
        
        # In PyTorch they pass back all of the hidden layer outputs
        # We just want the last one, so we get it with [-1]
        return F.log_softmax(self.output_layer(outp[-1]), dim=-1)

In [121]:
model = CharRnn(vocab_size, num_of_factors).cuda()
opt = optim.Adam(model.parameters(), 1e-3)

In [122]:
it = iter(model_data.trn_dl)
*xs,yt = next(it)

In [123]:
t = model.embedding_layer(V(torch.stack(xs)))
t.size()

torch.Size([8, 512, 42])

In [124]:
ht = V(torch.zeros(1, 512,num_of_hidden_activations))
outp, hn = model.rnn(t, ht)
outp.size(), hn.size()

(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))
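
The two returned values can be understood with a hand-rolled loop. The sketch below is plain Python for a single sequence (no batch dimension), with biases omitted and tiny made-up sizes and weights, so it is only an approximation of what `nn.RNN` computes: a tanh of the weighted input plus the weighted previous hidden state at each step.

```python
import math


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_ih, w_hh, h0):
    h = h0
    outputs = []
    for x in inputs:
        # h_t = tanh(W_ih x_t + W_hh h_{t-1}), biases left out for brevity
        pre = [a + b for a, b in zip(matvec(w_ih, x), matvec(w_hh, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(h)  # one hidden state per time step
    return outputs, h      # all hidden states, plus the final one


# Made-up sizes: 8 time steps, 3 input features, 2 hidden units.
inputs = [[0.1 * t, 0.2, -0.1 * t] for t in range(8)]
w_ih = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
w_hh = [[0.2, -0.5], [0.7, 0.3]]

outputs, h_n = rnn_forward(inputs, w_ih, w_hh, h0=[0.0, 0.0])
print(len(outputs), len(outputs[0]), len(h_n))  # 8 2 2
print(outputs[-1] == h_n)                       # True
```

This mirrors the shapes above: `outp` has one entry per time step (8), while `hn` holds only the last hidden state, which is why `outp[-1]` is what gets passed to the output layer.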

In [125]:
t = model(*V(xs))

t.size()

torch.Size([512, 85])

In [127]:
fit(model, model_data, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.880866   1.852416  
    1      1.686484   1.678685                              
    2      1.597526   1.598038                              
    3      1.527368   1.550226                              



[1.5502259]

In [128]:
set_lrs(opt, 1e-4)

In [129]:
fit(model, model_data, 2, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.477367   1.51327   
    1      1.480211   1.508043                              



[1.5080433]

#### Testing

In [130]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [131]:
get_next_n('for thos', 40)

'for those in the same to the same to the same to'
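
The sliding window in `get_next_n` can be traced with a stand-in predictor. In this sketch, `toy_get_next` is a hypothetical replacement for the trained model (it just returns the next letter of the alphabet), and `get_next` is passed in as an argument so the example is self-contained.

```python
def get_next_n(inp, n, get_next):
    res = inp
    for _ in range(n):
        c = get_next(inp)    # predict one char from the current window
        res += c
        inp = inp[1:] + c    # drop the oldest char, append the new one
    return res


def toy_get_next(window):
    # Hypothetical stand-in for the model: next letter after the last char.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return alphabet[(alphabet.index(window[-1]) + 1) % 26]


print(get_next_n("abc", 4, toy_get_next))  # abcdefg
```

Because the prediction is a deterministic function of the current window (an argmax), the output must start cycling as soon as a window repeats, which is one plausible reason the real output above gets stuck on "the same to".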