# Artificial Intelligence Nanodegree
## Recurrent Neural Network Project
## Project: Train a character level sequence generator

Welcome to the Recurrent Neural Network Project in the Artificial Intelligence Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

## Getting started

In this project you will implement a popular Recurrent Neural Network (RNN) architecture to create an English language sequence generator capable of building semi-coherent English sentences from scratch, character-by-character.  This will require a substantial amount of parameter tuning on a large training corpus (at least 100,000 characters long).  In particular, for this project we will be using a complete version of Sir Arthur Conan Doyle's classic book The Adventures of Sherlock Holmes.

The particular network architecture we will employ is known as [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), which significantly reduces the optimization problems (such as vanishing gradients) that affect plain RNNs.

**Important note:** Tuning RNNs is a computationally intensive endeavour and thus time-consuming on a typical CPU.  Using a reasonably sized cloud-based GPU can speed up training by a factor of 10.  Also, because of the long training time, it is highly recommended that you carefully write the output of each step of your process to file.  That way all of your results are saved even if you close the web browser you're working in: the processes will continue running in the background, but the variables and output in the notebook will not update when you open it again.

In [1]:
### A simple way to write output to file
x = 2   
f = open('my_test_output.txt', 'w')              # create an output file to write to
f.write('this is only a test ' + '\n')           # print some output text
f.write('the value of x is ' + str(x) + '\n')    # record a variable value
f.close()                                        # close the file when everything is recorded
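
The same pattern can also be written with a `with` block, which closes the file automatically even if an error occurs part-way through. The sketch below is only an illustration: the helper name `log_line` and the temporary file location are assumptions and are not part of the project template.

```python
import os
import tempfile

def log_line(path, message):
    """Append one line of text to the file at `path` (hypothetical helper)."""
    with open(path, 'a') as handle:   # 'a' appends instead of overwriting
        handle.write(message + '\n')
        handle.flush()                # push buffered text to the operating system

# illustrative usage with a throwaway file
log_path = os.path.join(tempfile.mkdtemp(), 'my_test_output.txt')
log_line(log_path, 'this is only a test')
log_line(log_path, 'the value of x is ' + str(2))

with open(log_path) as handle:
    logged = handle.read()
print(logged)
```

Opening in append mode means each call adds to the file, so earlier results survive if a later step fails.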

In [2]:
from __future__ import print_function
import numpy as np
import sys
f = open('RNN_seq_gen_output.txt', 'w')              # create an output file to write to

# 1. Downloading and preprocessing a text dataset

Our first task is to grab a large text corpus for use in training, and to perform several light pre-processing tasks on it.  The default corpus we will use is the classic book The Adventures of Sherlock Holmes, but you can use a variety of others as well - so long as they are fairly large (around 100,000 characters or more).

The first part of the pre-processing - after reading in the text - is to convert everything to lower-case.  This is done in the next cell.

In [3]:
# read in the text, transforming everything to lower case
text = open('holmes.txt').read().lower()
text = text[:100000]

Next, let's examine a bit of the raw text.  Because we are interested in creating sentences of English words automatically by building up each word character-by-character, we only want to train on valid English words.  In other words - we need to remove all of the junk characters that are not part of words!

In [4]:
### print out the first 1000 characters of the raw text to get a sense of what we need to throw out
text[:1000]

"\xef\xbb\xbfproject gutenberg's the adventures of sherlock holmes, by arthur conan doyle\r\n\r\nthis ebook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  you may copy it, give it away or\r\nre-use it under the terms of the project gutenberg license included\r\nwith this ebook or online at www.gutenberg.net\r\n\r\n\r\ntitle: the adventures of sherlock holmes\r\n\r\nauthor: arthur conan doyle\r\n\r\nposting date: april 18, 2011 [ebook #1661]\r\nfirst posted: november 29, 2002\r\n\r\nlanguage: english\r\n\r\n\r\n*** start of this project gutenberg ebook the adventures of sherlock holmes ***\r\n\r\n\r\n\r\n\r\nproduced by an anonymous project gutenberg volunteer and jose menendez\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nthe adventures of sherlock holmes\r\n\r\nby\r\n\r\nsir arthur conan doyle\r\n\r\n\r\n\r\n   i. a scandal in bohemia\r\n  ii. the red-headed league\r\n iii. a case of identity\r\n  iv. the boscombe valley mystery\r\n   v. the five

Wow - there's a lot of junk here!  For example, all of the carriage return '\r' and newline '\n' sequences.  Let's remove these from the text.

In [5]:
### find the '\n' and '\r' symbols and remove them
text = text.replace('\n','')    # replacing '\n' with '' simply removes the sequence
text = text.replace('\r','')

Let's see how the first 1000 characters of our text look now!

In [6]:
### print out the first 1000 characters of the text again to see what is left to throw out
text[:1000]

"\xef\xbb\xbfproject gutenberg's the adventures of sherlock holmes, by arthur conan doylethis ebook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoever.  you may copy it, give it away orre-use it under the terms of the project gutenberg license includedwith this ebook or online at www.gutenberg.nettitle: the adventures of sherlock holmesauthor: arthur conan doyleposting date: april 18, 2011 [ebook #1661]first posted: november 29, 2002language: english*** start of this project gutenberg ebook the adventures of sherlock holmes ***produced by an anonymous project gutenberg volunteer and jose menendezthe adventures of sherlock holmesbysir arthur conan doyle   i. a scandal in bohemia  ii. the red-headed league iii. a case of identity  iv. the boscombe valley mystery   v. the five orange pips  vi. the man with the twisted lip vii. the adventure of the blue carbuncleviii. the adventure of the speckled band  ix. the adventure of the engineer's thumb   x. the 

It still looks like there are a few more non-English characters / character sequences to pull out.  Try to remove as many of these as you possibly can in the next Python cell.  You might print out more characters to get a better sense of what needs to be removed.

Try to remove as many bad characters / sequences as you can see in the first characters of the text, but don't worry if you don't remove every last bad character / string.   The bulk of this text is made up of valid English words and our RNN will learn to produce real sentences from it.

In [7]:
### TODO: remove as many non-English characters and character sequences as you can 
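# some of the non-English sequences visible in the text: byte-order-mark bytes,
# punctuation, roman-numeral chapter labels, and digits
# note: entries are removed in list order, so 'i.' is stripped before 'ii.' is tried
non_english = ['\xef','\xbb','\xbf','*','#','[',']','i.','ii.','iii.','iv.','v.','vi.','vii.','viii.','ix.','x.','0','1','2','3','4','5','6','7','8','9']
for bad in non_english:
    text = text.replace(bad,'')
text = text.replace('  ',' ')    # replace each double space with a single space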
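
An alternative to listing every unwanted sequence is to keep only a set of allowed characters and turn everything else into a space. The sketch below is one possible approach, not the required solution: the helper name `keep_allowed` and the particular punctuation kept are assumptions chosen for illustration.

```python
import string

def keep_allowed(raw_text, allowed=None):
    """Replace characters outside `allowed` with spaces, then collapse whitespace."""
    if allowed is None:
        # assumed whitelist: lowercase letters, space, and basic punctuation
        allowed = set(string.ascii_lowercase + " .,;:!?'-")
    cleaned = ''.join(ch if ch in allowed else ' ' for ch in raw_text)
    # split() with no arguments drops runs of whitespace of any length
    return ' '.join(cleaned.split())

sample_text = "posting date: april 18, 2011 [ebook #1661]\r\nfirst posted"
cleaned_sample = keep_allowed(sample_text)
print(cleaned_sample)
```

Because unwanted characters become spaces rather than disappearing, words on either side of a line break stay separated.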

In [8]:
### print out the first 1000 characters of the cleaned text to check what remains
text[:1000]

"project gutenberg's the adventures of sherlock holmes, by arthur conan doylethis ebook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoever. you may copy it, give it away orre-use it under the terms of the project gutenberg license includedwith this ebook or online at www.gutenberg.nettitle: the adventures of sherlock holmesauthor: arthur conan doyleposting date: april , ebook first posted: november , language: english start of this project gutenberg ebook the adventures of sherlock holmes produced by an anonymous project gutenberg volunteer and jose menendezthe adventures of sherlock holmesbysir arthur conan doyle  a scandal in bohemia i the red-headed league ii a case of identity  the boscombe valley mystery  the five orange pips v the man with the twisted lip vi the adventure of the blue carbunclevii the adventure of the speckled band  the adventure of the engineer's thumb  the adventure of the noble bachelor x the adventure of the beryl coronet xi

Now that we have thrown out a good number of non-English characters / character sequences, let's print out some statistics about the dataset - including the total number of characters and the number of unique characters.

In [9]:
# count the number of unique characters in the text
chars = sorted(list(set(text)))

# print some of the text, as well as statistics
print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(chars)) + " unique characters")

this corpus has 95161 total number of characters
this corpus has 41 unique characters


The last step:  convert our characters into numerical values via a lookup table.  We can't just throw characters into a machine learning algorithm - these algorithms only ingest numerical values.  So we need a function that transforms each of our input characters into a distinct numerical value - such as an integer.  To do this we make a simple dictionary mapping each unique character to a unique integer.  To translate the output of our RNN - which will be a sequence of integers - back into characters, we also create the inverse dictionary mapping integers back to our unique characters.

In [10]:
### generate function mapping each unique character to a unique integer, as well as its inverse
char_indices = dict((c, i) for i, c in enumerate(chars))  # map each unique character to unique integer
indices_char = dict((i, c) for i, c in enumerate(chars))  # map each unique integer back to unique character
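
To see the two dictionaries working together, here is a small round trip on a toy string. The `toy_` names are illustrative only and are kept separate from the notebook's own `chars`, `char_indices`, and `indices_char`.

```python
toy_text = "hello holmes"
toy_chars = sorted(set(toy_text))                           # unique characters, sorted
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}  # character -> integer
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}  # integer -> character

encoded = [toy_char_indices[c] for c in toy_text]           # text as a list of integers
decoded = ''.join(toy_indices_char[i] for i in encoded)     # integers back to text

print(toy_chars)
print(encoded)
print(decoded)
```

Decoding the encoded list should give back exactly the original string, which is a quick sanity check for the mapping.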

## Cutting our text into sequences

Now we need to cut up the text into equal length sequences.  However, a word at the start or end of a sequence might get cut off, so in order not to lose this information we cut up the text in a similar manner to how images / audio are cut for classification - via *windowing*.  Imagine the entire text as one long string.  We slide a window of fixed length along the string from left to right - taking a step of a certain number of characters each time - and take a snapshot of what's in the window at each moment.

In [11]:
### cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
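X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)   # one-hot inputs; builtin bool replaces the removed np.bool alias
y = np.zeros((len(sentences), len(chars)), dtype=bool)           # one-hot target: the character following each window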
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
# print out what a few of the first segments look like
print (sentences[0])
print (sentences[1])

nb sequences: 31707
Vectorization...
project gutenberg's the adventures of sh
ject gutenberg's the adventures of sherl
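
The windowing and one-hot encoding are easier to follow on a tiny example. The sketch below repeats the same logic as the cell above on a 15-character string with a window of 5 and a step of 3; all `toy_` names and sizes are illustrative assumptions.

```python
import numpy as np

toy_text = "sherlock holmes"
toy_maxlen, toy_step = 5, 3

# slide a window of toy_maxlen characters, moving toy_step characters each time
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])   # input window
    targets.append(toy_text[i + toy_maxlen])      # character that follows the window

# one-hot encode: one row per time step, one column per unique character
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
X_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        X_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[targets[i]]] = True

print(windows)
print(targets)
print(X_toy.shape, y_toy.shape)
```

Each window overlaps the previous one by `toy_maxlen - toy_step` characters, which is why the sequences are called semi-redundant.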


# 2. Setting up our RNN

With our dataset loaded in and pre-processed we can now begin setting up our RNN.  We use Keras to quickly build a single hidden layer RNN - where our hidden layer consists of LSTM modules.

In [12]:
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import random

Using Theano backend.


Now it's your turn to build a simple single-hidden-layer RNN with LSTM hidden units, a softmax activation, and a categorical_crossentropy loss function.  This can be constructed using just a few lines - see e.g., the [general Keras documentation](https://keras.io/getting-started/sequential-model-guide/) and the [LSTM documentation in particular](https://keras.io/layers/recurrent/) for examples of how to quickly use Keras to build neural network models.

In [13]:
### TODO build the required RNN model: a single LSTM hidden layer with softmax activation, categorical_crossentropy loss 
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
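
As a rough check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector) and uses the 41 unique characters reported earlier; the function names are illustrative. The result can be compared against the output of `model.summary()`.

```python
def lstm_param_count(n_inputs, n_units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (n_inputs * n_units + n_units * n_units + n_units)

def dense_param_count(n_inputs, n_outputs):
    """One weight per input/output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

n_chars, n_units = 41, 128   # unique characters in the corpus, LSTM hidden units
lstm_params = lstm_param_count(n_chars, n_units)
dense_params = dense_param_count(n_units, n_chars)
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)   # 87040 5289 92329
```

With roughly 32,000 training sequences, a model of this size has enough capacity to memorize parts of the text, so the generated output is worth inspecting as training proceeds.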

With our RNN built we can now train our model on the input text data.

In [None]:
# sampling function for RNN-based predictions
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) 
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
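
Taking the log and then the exponential, as `sample` does, leaves the probabilities unchanged apart from renormalization. The round trip becomes useful once the log-probabilities are divided by a *temperature* before exponentiating: values below 1 sharpen the distribution and values above 1 flatten it. The sketch below is an optional variant, not part of the project template; the names `apply_temperature` and `sample_with_temperature`, the `temperature` parameter, and the small epsilon are all assumptions.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector; temperature=1.0 leaves it unchanged."""
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds + 1e-12) / temperature   # epsilon avoids log(0)
    exp_preds = np.exp(logits - np.max(logits))    # subtract max for numerical stability
    return exp_preds / np.sum(exp_preds)

def sample_with_temperature(preds, temperature=1.0):
    """Draw one index from the temperature-rescaled distribution."""
    probs = apply_temperature(preds, temperature)
    return int(np.argmax(np.random.multinomial(1, probs, 1)))

toy_preds = [0.7, 0.2, 0.1]
unchanged = apply_temperature(toy_preds, 1.0)   # same as toy_preds
sharper = apply_temperature(toy_preds, 0.5)     # more weight on the likeliest index
flatter = apply_temperature(toy_preds, 2.0)     # closer to uniform
drawn = sample_with_temperature(toy_preds, 0.5)
```

Lower temperatures tend to produce more repetitive but more conservative text, while higher temperatures produce more varied text with more spelling errors.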

In [None]:
f = open('RNN_output.txt', 'w')              # create an output file to write to

# train the model, output generated text after each iteration
for iteration in range(1, 50):
    # print update to console
    print()
    print('-' * 40)
    line = 'Iteration ' + str(iteration) + '\n'
    print(line)
    
    # record iteration count
    f.write('-' * 40 + '\n')
    f.write(line)         
    
    # train for one pass (epoch) over the full dataset
    model.fit(X, y, batch_size=128, nb_epoch=1)
    start_index = random.randint(0, len(text) - maxlen - 1)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    # print update to console and record
    line = 'GENERATING WITH SEED: "' + sentence + '"' + '\n'
    print(line)
    f.write(line)
    
    # print the seed as the start of the generated text and record
    print(generated + '\n')
    f.write(generated + '\n')

    # generate 400 characters, one at a time
    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

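    # print the seed plus generated text to the console
    print(generated)
    print('\n')

    # record the seed plus generated text, and flush so it reaches disk each iteration
    f.write(generated)
    f.write('\n')
    f.write('\n')
    f.flush()

f.close()                                    # close the file once training is finished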