# Artificial Intelligence Nanodegree
## Recurrent Neural Network Project
## Project: Train a character level sequence generator

Welcome to the Recurrent Neural Network Project in the Artificial Intelligence Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

## Getting started

In this project you will implement a popular Recurrent Neural Network (RNN) architecture to create an English language sequence generator capable of building semi-coherent English sentences from scratch, character by character.  This will require a substantial amount of parameter tuning on a large training corpus (at least 100,000 characters long).  In particular, for this project we will use a complete version of Sir Arthur Conan Doyle's classic book The Adventures of Sherlock Holmes.

The particular network architecture we will employ is known as [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), which significantly reduces the optimization problems that affect plain RNNs.

**Note:** Tuning RNNs is a computationally intensive endeavour and thus time-consuming on a typical CPU.  Using a reasonably sized Amazon GPU can speed up training by a factor of 10.

In [None]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

## Downloading and preprocessing a text dataset

Our first task is to grab a large text corpus for use in training, and then perform a few light pre-processing tasks on it.  The default corpus we will use is The Adventures of Sherlock Holmes, but you can use a variety of others as well.  For this project to work successfully, though, you should use a rather large corpus (at least 100,000 characters).

In [None]:
# path to a local copy of the corpus (assumed to sit next to this notebook)
path = 'pg1661.txt'

# read in the text, transforming everything to lower case
text = open(path).read().lower()

With the dataset loaded, let's look at a sample of the text, say the first 1000 characters.  This will give us a sense of whether or not any further pre-processing is required.

In [None]:
# print out the first 1000 characters of our training corpus
text[0:1000]

Wow!  It looks like there are a good number of extra characters (e.g., tags indicating new line) that we should remove from the text.  We want our RNN to learn the general pattern of English words, not tags!

So, for example, we can remove new line tags "\n" via the line below.

In [None]:
# remove new line tags from the text
text = text.replace('\n', ' ').replace('\r', '')

In [None]:
# print out the first 1000 characters of the text with unwanted characters removed
text[0:1000]

Alright - it looks like the new line tags have been removed.  What other unwanted characters do you think we could safely remove so that the text consists more fully of English words?
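
One way to answer this question is to count every character that falls outside a set you consider acceptable. The snippet below is a minimal sketch of that idea: the helper name `unusual_characters`, its default `allowed` set, and the toy string are illustrative assumptions, not part of the project template.

```python
from collections import Counter
import string

def unusual_characters(text, allowed=string.ascii_lowercase + " .,!?;:'\"-"):
    """Return (character, count) pairs for characters not in `allowed`."""
    counts = Counter(ch for ch in text if ch not in allowed)
    # most_common() sorts by count, highest first
    return counts.most_common()

# toy string standing in for the real corpus
sample_text = "chapter 1 * a scandal in bohemia # (1891)"
print(unusual_characters(sample_text))
```

Calling `unusual_characters(text)` on the real corpus should list candidates for removal, such as digits and stray symbols, along with how often each occurs.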

In [None]:
# TODO: remove other unwanted characters from the corpus
# remove stray symbols
text = text.replace('*', ' ')
text = text.replace('#', ' ')

# remove the UTF-8 byte-order mark (and the word attached to it) at the start of the file
text = text.replace('\xef\xbb\xbfproject', ' ')

# remove digits
for i in range(0, 10):
    text = text.replace(str(i), ' ')

# shrink long runs of blank space
text = text.replace('     ', ' ')
text = text.replace('   ', ' ')
text = text.replace('  ', ' ')

In [None]:
# print out the first 1000 characters of the text with unwanted characters removed
text[0:1000]

Next, let's print out some basic stats on the corpus.  In the cell below, determine and print out the total number of characters in the pre-processed corpus, as well as the total number of unique characters.

In [None]:
# TODO - print out the total number of characters and unique characters in the pre-processed corpus
# collect the unique characters in the text (named `chars` because later cells refer to it)
chars = sorted(list(set(text)))

# print corpus statistics
print("this corpus has total length = " + str(len(text)))
print("this corpus has " + str(len(chars)) + " unique characters")

Next, since no machine learning model can take in raw string values, we need to convert our characters into integers.  We create two dictionaries that allow us to quickly look up a character's integer representation and vice versa.

In [None]:
# create dictionaries containing each character
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
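
To see how the two dictionaries work together, here is a small sketch that encodes a toy string into integers and decodes it back. The `toy_` names are illustrative and kept separate from the notebook's own `chars`, `char_indices` and `indices_char`.

```python
toy_text = "hello world"
toy_chars = sorted(set(toy_text))

# same construction as above, applied to the toy alphabet
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# character -> integer, then integer -> character
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

Because the two dictionaries are exact inverses, decoding the encoded list recovers the original string.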

## Cutting our text into sequences

Now we need to cut up the text into equal length sequences.  However, a word at the start or end of a sequence might get cut off, so in order not to lose this information we cut up the text in a similar manner to how images / audio are cut for classification - via *windowing*.  Imagine the entire text as one long string.  We slide a window of fixed length along the string from left to right - taking a step of a certain number of characters each time - and take a snapshot of what's in the window at each moment.  The character immediately following each window is stored as the target the network must learn to predict.
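
The windowing idea is easier to follow on a short string. The sketch below uses a hypothetical helper, `window_text`, with a window of 5 characters and a step of 3; these values are chosen only for illustration and are much smaller than the ones used on the real corpus.

```python
def window_text(text, window_size, step):
    """Slide a fixed-length window over `text`.

    Returns the windows and, for each window, the character that follows it.
    """
    inputs, targets = [], []
    for i in range(0, len(text) - window_size, step):
        inputs.append(text[i:i + window_size])
        targets.append(text[i + window_size])
    return inputs, targets

toy_inputs, toy_targets = window_text("sherlock holmes", window_size=5, step=3)
for seq, nxt in zip(toy_inputs, toy_targets):
    print(repr(seq), '->', repr(nxt))
```

Consecutive windows overlap by `window_size - step` characters, which is why the sequences are described as semi-redundant.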

In [None]:
# TODO: cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []

# slide a window over the text, storing each sequence and the character that follows it
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('number of sequences = ' + str(len(sentences)))

# one-hot encode the sequences (X) and the character that follows each one (y)
print('starting the process of encoding the sequences...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('finished!')    

# print out what a few of the first segments look like
print (sentences[4])
print (sentences[11])
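
The one-hot encoding used for `X` and `y` can be checked on a tiny example. The sketch below assumes a three-letter alphabet and a hypothetical helper `one_hot`. It encodes a short string as a boolean matrix with one row per character, then decodes it again with `argmax`.

```python
import numpy as np

abc_chars = ['a', 'b', 'c']
abc_char_indices = {c: i for i, c in enumerate(abc_chars)}

def one_hot(sequence, char_to_index):
    """Encode `sequence` as a (len(sequence), vocabulary size) boolean matrix."""
    encoded = np.zeros((len(sequence), len(char_to_index)), dtype=bool)
    for t, ch in enumerate(sequence):
        encoded[t, char_to_index[ch]] = True
    return encoded

abc_encoded = one_hot("cab", abc_char_indices)
print(abc_encoded.astype(int))

# each row holds a single True, so argmax recovers the character index
abc_decoded = "".join(abc_chars[i] for i in abc_encoded.argmax(axis=1))
print(abc_decoded)
```

In the notebook, `X` stacks one such matrix per sequence, giving the shape `(len(sentences), maxlen, len(chars))`.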

## Building an RNN model for text generation

Next, we use Keras to quickly build a single hidden layer RNN, where our hidden layer consists of LSTM modules.

In [None]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
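
To get a feel for the size of this model, the sketch below counts its trainable parameters by hand. It assumes the standard LSTM formulation with bias terms (four gates, each with input weights, recurrent weights and a bias) and a hypothetical vocabulary of 50 characters; substitute `len(chars)` for your corpus and compare the result against `model.summary()`.

```python
def lstm_param_count(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

num_chars = 50  # hypothetical vocabulary size, for illustration only
lstm_params = lstm_param_count(num_chars, 128)
dense_params = dense_param_count(128, num_chars)

print('LSTM parameters: ', lstm_params)
print('Dense parameters:', dense_params)
print('Total parameters:', lstm_params + dense_params)
```

Most of the parameters sit in the LSTM layer, so the number of hidden units has a much larger effect on training time than the vocabulary size.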

With our RNN built, we can now train our model on the input text data.

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
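
The `diversity` values passed to `sample` act as a temperature on the predicted distribution. The sketch below isolates that rescaling step in a hypothetical helper, `apply_temperature`, and applies it to a made-up three-character distribution. It mirrors the arithmetic inside `sample` but skips the random draw.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way `sample` does before drawing."""
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.6, 0.3, 0.1]  # made-up probabilities for illustration
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_preds, temperature), 3))
```

Temperatures below 1 sharpen the distribution, so the most likely character is chosen almost every time and the text becomes repetitive. Temperatures above 1 flatten it, which gives more varied text with more errors.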