# Recurrent Neural Networks (RNN)



## 1. Background

Sequences are extremely important for understanding what's happening around us. As you read this text, you make sense of it by combining previous words with new ones into a meaningful sequence of words. Similarly, for many processes it's incredibly important to use data from previous 'states' in order to predict future ones. To use another example, when we listen to someone, our brain is constantly trying to complete the sentence, and it often succeeds.

Recurrent Neural Networks are a class of algorithms that allow us to train machines to do just that!

The main innovation of RNNs is that they allow information to persist, or in other words, allow for memory units to pass information from one step to the next. You can think of it as information being passed over many time steps, or simply as a loop repeating over time with information being updated. 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" height="350" width="650"/>
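
To make the "loop with memory" idea concrete, here is a minimal sketch of a plain (non-LSTM) recurrent step in NumPy. The weight names `W_xh`, `W_hh`, `b_h` and the toy sizes are assumptions chosen for illustration; they are not part of the model built later in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, num_steps = 4, 3, 5   # toy sizes (assumed)

# Randomly initialised parameters of a vanilla RNN cell
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(num_steps, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                      # the "memory" starts empty
states = []
for x_t in xs:
    # the new state depends on the current input AND the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    states.append(h)

states = np.stack(states)
print(states.shape)
```

The same weights are reused at every time step; only the hidden state `h` changes, which is what the unrolled diagram depicts.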


Long Short-Term Memory units, or LSTMs for short, have been extremely successful in solving many problems in speech recognition, text generation, captioning, etc.

In this tutorial we build a simple RNN model using LSTMs that predicts future text characters by training on J.M. Barrie's novel Peter Pan. This example is essentially an abridged version of the problem set in the Udacity deep learning nanodegree.

For a more in-depth explanation I recommend you watch this [video](https://www.youtube.com/watch?v=56TYLaQN4N8&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu&index=14) by Nando de Freitas, and read this excellent [article](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

## 2.1 Loading and Cleaning the Dataset

In [70]:
import urllib.request
response = urllib.request.urlopen('http://www.gutenberg.org/files/16/16-0.txt')
data = response.read().decode('utf-8')   # decode the raw bytes; str(bytes) would write the b'...' repr instead
 
# Write data to file
filename = "peterpan.txt"
with open(filename, 'w', encoding='utf-8') as file_:
    file_.write(data)

In [42]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# The text for training can be obtained at http://www.gutenberg.org/ebooks/16

# read in the text, transforming everything to lower case
text = open('peterpan.txt', encoding='utf-8').read().lower()
print('our original text has ' + str(len(text)) + ' characters')


### skip the first 1302 characters (header text at the start of the file), then replace '\n' and '\r' symbols with spaces
text = text[1302:]
text = text.replace('\n',' ')    # replacing with ' ' rather than '' keeps words on adjacent lines apart
text = text.replace('\r',' ')
### print out the first 1000 characters of the raw text to get a sense of what we need to throw out
text[:1000]

our original text has 278194 characters


'er 15 “hook or me this time”  chapter 16 the return home  chapter 17 when wendy grew up     chapter 1 peter breaks through  all children, except one, grow up. they soon know that they will grow up, and the way wendy knew was this. one day when she was two years old she was playing in a garden, and she plucked another flower and ran with it to her mother. i suppose she must have looked rather delightful, for mrs. darling put her hand to her heart and cried, “oh, why can’t you remain like this for ever!” this was all that passed between them on the subject, but henceforth wendy knew that she must grow up. you always know after you are two. two is the beginning of the end.  of course they lived at 14 [their house number on their street], and until wendy came her mother was the chief one. she was a lovely lady, with a romantic mind and such a sweet mocking mouth. her romantic mind was like the tiny boxes, one within the other, that come from the puzzling east, however many you discover th

In [43]:
### TODO: list all unique characters in the text and remove any non-english ones
import string

allowed_chars = string.ascii_lowercase + ' ' + '!' + ',' + '.' + ':' + ';' + '?'

# remove as many non-english characters and character sequences as you can 
for char in text:
    if char not in allowed_chars:
        text = text.replace(char, ' ')

# shorten extra dead space created above (a single pass, so runs of three or more spaces are only partly collapsed)
text = text.replace('  ',' ')
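
As the printed sample further down shows, a single `replace('  ',' ')` pass still leaves runs of spaces behind. A possible alternative, sketched here under the assumption that collapsing every run to one space is what you want, uses a regular expression. The helper name `clean_text` and the sample string are hypothetical, and using this variant would change the character counts reported in this notebook.

```python
import re
import string

# same allowed character set as in the cell above
allowed_chars = set(string.ascii_lowercase + ' !,.:;?')

def clean_text(raw):
    """Lower-case, blank out disallowed characters, collapse runs of spaces."""
    filtered = ''.join(c if c in allowed_chars else ' ' for c in raw.lower())
    return re.sub(r' +', ' ', filtered)

sample = 'Chapter 15 “Hook or Me This Time”  Chapter 16'
print(repr(clean_text(sample)))
```

Building the filtered string in one pass also avoids calling `text.replace` once per disallowed character.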

In [44]:
### print out the first 2000 characters of the cleaned text to check the result
text[:2000]

'er   hook or me this time  chapter  the return home chapter  when wendy grew up   chapter  peter breaks through all children, except one, grow up. they soon know that they will grow up, and the way wendy knew was this. one day when she was two years old she was playing in a garden, and she plucked another flower and ran with it to her mother. i suppose she must have looked rather delightful, for mrs. darling put her hand to her heart and cried, oh, why can t you remain like this for ever! this was all that passed between them on the subject, but henceforth wendy knew that she must grow up. you always know after you are two. two is the beginning of the end. of course they lived at   their house number on their street , and until wendy came her mother was the chief one. she was a lovely lady, with a romantic mind and such a sweet mocking mouth. her romantic mind was like the tiny boxes, one within the other, that come from the puzzling east, however many you discover there is always one

In [45]:
# count the number of unique characters in the text
chars = sorted(list(set(text)))

# print some of the text, as well as statistics
print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(chars)) + " unique characters")

this corpus has 272588 total number of characters
this corpus has 33 unique characters


## 2.2 Cutting data into input/output pairs

In [46]:
### TODO: fill out the function below that transforms the input text and window-size into a set of input/output pairs for use with our RNN model
def window_transform_text(text,window_size,step_size):
    # containers for input/output pairs
    inputs = []
    outputs = []
    ctr = 0
    
    # i marks the end of each window; the input is the window_size characters before it, the output is the character at i
    for i in range(window_size, len(text), step_size):
        inputs.append(text[ctr:i])
        outputs.append(text[i])
        ctr = ctr + step_size
    
    return inputs,outputs
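
To see what the windowing produces, here is a toy run on a ten-character string. The function `sliding_windows` is a hypothetical, self-contained restatement of the same logic (the window start is simply the window end minus `window_size`), included only so the example runs on its own.

```python
def sliding_windows(text, window_size, step_size):
    """Return (inputs, outputs): each input is a window, each output the next character."""
    inputs, outputs = [], []
    for end in range(window_size, len(text), step_size):
        inputs.append(text[end - window_size:end])
        outputs.append(text[end])
    return inputs, outputs

toy_inputs, toy_outputs = sliding_windows("abcdefghij", window_size=4, step_size=2)
for inp, out in zip(toy_inputs, toy_outputs):
    print(inp, '->', out)
# abcd -> e
# cdef -> g
# efgh -> i
```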

In [47]:
# run your text window-ing function 
window_size = 100
step_size = 5
inputs, outputs = window_transform_text(text,window_size,step_size)
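
The number of input/output pairs follows directly from the loop bounds, so it can be checked without building the lists. This small sketch plugs in the corpus length printed earlier (272588 characters); the variable names are chosen for illustration.

```python
text_length = 272588   # corpus length reported above
window_size = 100
step_size = 5

# one pair per value of the loop index in range(window_size, len(text), step_size)
num_pairs = len(range(window_size, text_length, step_size))
print(num_pairs)  # 54498
```

This is worth keeping in mind later on: any slice asking for more pairs than this simply returns all of them.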

In [48]:
# print out a few of the input/output pairs to verify that we've made the right kind of stuff to learn from
print('input = ' + inputs[2])
print('output = ' + outputs[2])
print('--------------')
print('input = ' + inputs[100])
print('output = ' + outputs[100])

input = or me this time  chapter  the return home chapter  when wendy grew up   chapter  peter breaks throug
output = h
--------------
input = as all that passed between them on the subject, but henceforth wendy knew that she must grow up. you
output =  


In [49]:

# print out the number of unique characters in the dataset
chars = sorted(list(set(text)))
print ("this corpus has " +  str(len(chars)) + " unique characters")
print ('and these characters are ')
print (chars)

this corpus has 33 unique characters
and these characters are 
[' ', '!', ',', '.', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## 2.3 One-hot encoding characters

The easiest way to think of this problem is as a classification problem! Essentially all we're doing is predicting what the next character will be, and we know that there are 33 unique values this character can take. So, if we think of characters as classes, then we just need to predict which class the next character belongs to!

Since the number of classes is relatively small, only 33, we can simply use one-hot encoding. However, note that for larger models with many more classes this becomes inefficient and using an embedding layer would be better.
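
As a small illustration of one-hot encoding, the sketch below encodes a toy string and decodes it again with `argmax`. The string and the names `char_to_index` / `index_to_char` are assumptions for this example only; the notebook's own mappings are built from the full corpus.

```python
import numpy as np

toy_text = "hello world"
toy_chars = sorted(set(toy_text))                     # 8 unique characters
char_to_index = {c: i for i, c in enumerate(toy_chars)}
index_to_char = {i: c for i, c in enumerate(toy_chars)}

# one row per character, one column per class, exactly one True per row
one_hot = np.zeros((len(toy_text), len(toy_chars)), dtype=bool)
for t, char in enumerate(toy_text):
    one_hot[t, char_to_index[char]] = True

# argmax recovers the class index, which maps back to the character
decoded = "".join(index_to_char[int(i)] for i in one_hot.argmax(axis=1))
print(one_hot.shape)  # (11, 8)
print(decoded)        # hello world
```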

In [50]:
# this dictionary is a function mapping each unique character to a unique integer
chars_to_indices = dict((c, i) for i, c in enumerate(chars))  # map each unique character to unique integer

# this dictionary is a function mapping each unique integer back to a unique character
indices_to_chars = dict((i, c) for i, c in enumerate(chars))  # map each unique integer back to unique character

Now we can transform our input/output pairs, which consist of characters, into equivalent input/output pairs made up of one-hot encoded vectors. In the next cell we provide a function for doing just this: it takes in the raw character inputs/outputs and returns their numerical versions. In particular, the numerical input is given as $\mathbf{X}$ and the numerical output as $\mathbf{y}$.

In [51]:

# transform character-based input/output into equivalent numerical versions
def encode_io_pairs(text,window_size,step_size):
    # number of unique chars
    chars = sorted(list(set(text)))
    num_chars = len(chars)
    
    # cut up text into character input/output pairs
    inputs, outputs = window_transform_text(text,window_size,step_size)
    
    # create empty vessels for one-hot encoded input/output
    # (the np.bool alias was removed in recent NumPy releases; the builtin bool is equivalent)
    X = np.zeros((len(inputs), window_size, num_chars), dtype=bool)
    y = np.zeros((len(inputs), num_chars), dtype=bool)
    
    # loop over inputs/outputs, transform them and store the result in X/y
    for i, sentence in enumerate(inputs):
        for t, char in enumerate(sentence):
            X[i, t, chars_to_indices[char]] = 1
        y[i, chars_to_indices[outputs[i]]] = 1
        
    return X,y

Now run the one-hot encoding function in the cell below to transform our input/output pairs!


In [52]:
# use your function
window_size = 100
step_size = 5
X,y = encode_io_pairs(text,window_size,step_size)

## 3.1 Setting up the RNN

With our dataset loaded and the input/output pairs extracted and transformed, we can now begin setting up our RNN for training. Again we will use Keras to quickly build a single hidden layer RNN, where our hidden layer consists of LSTM modules.

Time to get to work: build a 3 layer RNN model of the following specification
<ul><li>layer 1 should be an LSTM module with 200 hidden units --> note this should have input_shape = (window_size,len(chars)) where len(chars) = number of unique characters in your cleaned text</li>
<li>layer 2 should be a linear module, fully connected, with len(chars) hidden units --> where len(chars) = number of unique characters in your cleaned text</li>
<li>layer 3 should be a softmax activation (since we are solving a multiclass classification problem).
Use the categorical_crossentropy loss.</li></ul>

This network can be constructed using just a few lines, as with the RNN you made in part 1 of this notebook. See e.g. the general Keras documentation, and the LSTM documentation in particular, for examples of how to quickly use Keras to build neural network models.
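
To get a feel for the size of this model, the trainable parameters can be counted by hand. The sketch below assumes a standard LSTM layer with bias terms (four gates, each with input weights, recurrent weights and a bias) followed by a fully connected layer; the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

num_chars = 33     # unique characters in the cleaned text
lstm_units = 200   # hidden units in the LSTM layer

lstm_params = lstm_param_count(num_chars, lstm_units)
dense_params = dense_param_count(lstm_units, num_chars)
print(lstm_params, dense_params, lstm_params + dense_params)
# 187200 6633 193833
```

If the assumptions hold, these figures should match what `model.summary()` reports once the model has been built.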

In [55]:
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import RMSprop

# TODO build the required RNN model: a single LSTM hidden layer with softmax activation, categorical_crossentropy loss 
model = Sequential()
model.add(LSTM(200, input_shape=(window_size, len(chars))))   # len(chars) = 33 for this corpus
model.add(Dense(len(chars), activation='softmax'))

# initialize optimizer (recent Keras releases call the learning-rate argument learning_rate; older ones use lr)
optimizer = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-08)

# compile model --> make sure the optimizer initialized above is used
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Using TensorFlow backend.


## 3.2 Training our RNN model for text generation

With our RNN set up, we can now train it! Let's begin by trying it out on a small subset of the full dataset. In the next cell we take the first 10,000 input/output pairs from our training data to learn on.

In [56]:
# a small subset of our input/output pairs
Xsmall = X[:10000,:,:]
ysmall = y[:10000,:]

# train the model
model.fit(Xsmall, ysmall, batch_size=500, epochs=40,verbose = 1)

# save weights (create the target directory first, otherwise save_weights fails with an OSError)
import os
os.makedirs('model_weights', exist_ok=True)
model.save_weights('model_weights/best_RNN_small_textdata_weights.hdf5')

Epoch 1/40
...
Epoch 40/40

## 3.3 Predicting text from trained model


In [58]:
# function that uses trained model to predict a desired number of future characters
def predict_next_chars(model,input_chars,num_to_predict):     
    # create output
    predicted_chars = ''
    for i in range(num_to_predict):
        # convert this round's predicted characters to numerical input    
        x_test = np.zeros((1, window_size, len(chars)))
        for t, char in enumerate(input_chars):
            x_test[0, t, chars_to_indices[char]] = 1.

        # make this round's prediction
        test_predict = model.predict(x_test,verbose = 0)[0]

        # translate numerical prediction back to characters
        r = np.argmax(test_predict)                           # index of the most probable next character
        d = indices_to_chars[r] 

        # update predicted_chars and input
        predicted_chars+=d
        input_chars+=d
        input_chars = input_chars[1:]
    return predicted_chars
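
The sliding-window update in the function above can be seen in isolation with a stand-in for the trained model. In this sketch `toy_predict` is a hypothetical replacement for `model.predict` that always puts all probability on the letter following the last one in the window; the five-letter alphabet is likewise an assumption for illustration.

```python
import numpy as np

alphabet = "abcde"
char_to_index = {c: i for i, c in enumerate(alphabet)}
index_to_char = {i: c for i, c in enumerate(alphabet)}

def toy_predict(x):
    """Stand-in for model.predict: predicts the letter after the last one in the window."""
    last_index = int(np.argmax(x[0, -1]))
    probs = np.zeros((1, len(alphabet)))
    probs[0, (last_index + 1) % len(alphabet)] = 1.0
    return probs

def greedy_generate(predict_fn, seed, num_to_predict):
    window, generated = seed, ""
    for _ in range(num_to_predict):
        # one-hot encode the current window
        x = np.zeros((1, len(window), len(alphabet)))
        for t, char in enumerate(window):
            x[0, t, char_to_index[char]] = 1.0
        # pick the most probable character (greedy decoding)
        next_char = index_to_char[int(np.argmax(predict_fn(x)[0]))]
        generated += next_char
        # drop the oldest character and append the new one
        window = window[1:] + next_char
    return generated

print(greedy_generate(toy_predict, "abc", 6))  # deabcd
```

Because the choice is always the arg-max, the same window always yields the same continuation, which is one reason the generated text further down tends to fall into repetitive phrases.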

With your trained model, try a few subsets of the complete text as input. Note that the length of each must be exactly equal to the window size. For each subset, use the function above to predict the 100 characters that follow the input.

In [59]:
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get an appropriately sized chunk of characters from the text
start_inds = [0, 500, 1000]

# load in weights
model.load_weights('model_weights/best_RNN_small_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]

    # use the prediction function
    predict_input = predict_next_chars(model,input_chars,num_to_predict = 100)

    # print out input characters
    print('------------------')
    input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
    print(input_line)

    # print out predicted characters
    line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
    print(line)

------------------
input chars = 
er   hook or me this time  chapter  the return home chapter  when wendy grew up   chapter  peter bre"

predicted chars = 
thendy wond she cas in the but wendy and sto sas and the was the bege to the was she bed she has wen"

------------------
input chars = 
as all that passed between them on the subject, but henceforth wendy knew that she must grow up. you"

predicted chars = 
 said s on the was the bed he was on the mase to the pather wind she cas in the but wendy and sto th"

------------------
input chars = 
more; and her sweet mocking mouth had one kiss on it that wendy could never get, though there it was"

predicted chars = 
 the had in wer the had se the ther hed be the shed he was wendy and soo no he the he he her ind s a"




This looks OK, but not great. Now let's try the same experiment with a larger chunk of the data: the first 100,000 input/output pairs.
Tuning RNNs for a typical character dataset like the one we use here is computationally intensive and therefore slow on a typical CPU. Using a reasonably sized cloud-based GPU can speed up training by a factor of 10. Because of the long training time, it is also highly recommended that you write the output of each step of your process to file. That way all of your results are saved even if you close the web browser you're working in: the processes keep running in the background, but the variables and output in the notebook will not update when you open it again.
In the next cell we show you how to create a text file in Python and record data to it. This sort of setup can be used to record your final predictions.

In [60]:
### A simple way to write output to file
f = open('my_test_output.txt', 'w')              # create an output file to write to
f.write('this is only a test ' + '\n')           # print some output text
x = 2
f.write('the value of x is ' + str(x) + '\n')    # record a variable value
f.close()     

# print out the contents of my_test_output.txt
f = open('my_test_output.txt', 'r')              # reopen the file for reading
f.read()

'this is only a test \nthe value of x is 2\n'

With this recording device in place, we can now more safely perform experiments on larger portions of the text. In the next cell we will use the first 100,000 input/output pairs to train our RNN model (if the dataset holds fewer pairs than that, the slice simply returns all of them).
First we fit our model to the dataset, then generate text using the trained model in precisely the same way as before on the small dataset.
Note: your generated words should be, by and large, more realistic than with the small dataset, but you won't be able to generate perfect English sentences even with this amount of data. A rule of thumb: your model is working well if you generate sentences that largely contain real English words.

<b>NOTE: If you're running this on your CPU, the following piece of code may take significant resources and could take several hours to complete</b>.

In [64]:
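# a larger subset of our input/output pairs (slicing past the end simply returns every available pair)
Xlarge = X[:100000,:,:]
ylarge = y[:100000,:]

# TODO: fit to our larger dataset
model.fit(Xlarge, ylarge, batch_size=500, epochs=100,verbose = 1)

# save weights (create the target directory first if it does not exist)
import os
os.makedirs('model_weights', exist_ok=True)
model.save_weights('model_weights/best_RNN_large_textdata_weights.hdf5')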



Epoch 1/100
 1500/54498 [..............................] - ETA: 271s - loss: 0.1526

KeyboardInterrupt: 

In [66]:
# TODO: choose an input sequence and use the prediction function in the previous Python cell to predict 100 characters following it
# get an appropriately sized chunk of characters from the text
start_inds = [0, 500, 1000]

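# save output (create the output directory first if it does not exist)
import os
os.makedirs('text_gen_output', exist_ok=True)
f = open('text_gen_output/RNN_large_textdata_output.txt', 'w')  # create an output file to write to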

# load weights
model.load_weights('model_weights/best_RNN_large_textdata_weights.hdf5')
for s in start_inds:
    start_index = s
    input_chars = text[start_index: start_index + window_size]

    # use the prediction function
    predict_input = predict_next_chars(model,input_chars,num_to_predict = 100)

    # print out input characters
    line = '-------------------' + '\n'
    print(line)
    f.write(line)

    input_line = 'input chars = ' + '\n' +  input_chars + '"' + '\n'
    print(input_line)
    f.write(input_line)

    # print out predicted characters
    predict_line = 'predicted chars = ' + '\n' +  predict_input + '"' + '\n'
    print(predict_line)
    f.write(predict_line)
f.close()

-------------------

input chars = 
er   hook or me this time  chapter  the return home chapter  when wendy grew up   chapter  peter bre"

predicted chars = 
atemed.  my. ladeng, and ther ammithers lagneng to to had heard at ther, haver! he soid hat llagt in"

-------------------

input chars = 
as all that passed between them on the subject, but henceforth wendy knew that she must grow up. you"

predicted chars = 
 mectioning fuld herro.  do go fully fill year ans.  foh me noungerll for peter. it was not to lit t"

-------------------

input chars = 
more; and her sweet mocking mouth had one kiss on it that wendy could never get, though there it was"

predicted chars = 
, and striss this she was it to he hat for mratess gate. he said, then at in had another plopped att"

