# Artificial Intelligence Nanodegree
## Recurrent Neural Network Projects

Welcome to the Recurrent Neural Network Project in the Artificial Intelligence Nanodegree! In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with **'Implementation'** in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question X'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.  

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

# Project 1: Perform time series prediction 

In this project you will perform time series prediction using a Recurrent Neural Network regressor.  In particular you will re-create the figure shown in the notes - where the stock price of Apple was forecasted (or predicted) 7 days in advance.  In completing this exercise you will learn how to construct RNNs using Keras, which will also aid in completing the second project in this notebook.

The particular network architecture we will employ for our RNN is known as [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory), which helps significantly in avoiding technical problems with the optimization of RNNs.  

## 1.1 Getting started

First we must load in our time series - a history of around 140 days of Apple's stock price.  Then we need to perform a number of pre-processing steps to prepare it for use with an RNN model.  First off, it is good practice to normalize a time series by normalizing its range.  This helps us avoid serious numerical issues associated with how common activation functions (like tanh) transform very large (positive or negative) numbers, and it also helps us avoid related issues when computing derivatives.

Here we normalize the series to lie in the range [0,1], but it is also commonplace to normalize by the series' standard deviation.

In [1]:
### Load in necessary libraries for data input and normalization
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

### load in and normalize the dataset
dataset = np.loadtxt('apple_prices.csv')
scaler = MinMaxScaler(feature_range=(0, 1)) 
dataset = scaler.fit_transform(dataset.reshape(-1,1))
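
As a rough illustration of the two normalization options mentioned above, the sketch below applies both to a handful of made-up prices (the values are not taken from `apple_prices.csv`). The standard-deviation variant shown here also subtracts the mean first, which is an assumption about the intended variant rather than something stated in the text.

```python
import numpy as np

# made-up prices, used only for illustration
prices = np.array([100.0, 102.0, 98.0, 110.0, 105.0])

# min-max scaling to [0, 1]: the same arithmetic MinMaxScaler(feature_range=(0, 1)) performs
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# scaling by the standard deviation (assumed variant: subtract the mean first)
standardized = (prices - prices.mean()) / prices.std()

print(minmax)
print(standardized)
```

Either choice keeps the inputs in a range where tanh-like activations are not saturated; min-max scaling additionally guarantees hard bounds on the training data.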

Let's take a quick look at the (normalized) time series we'll be performing predictions on.

In [None]:
# let's take a look at our time series
plt.plot(dataset)

## 1.2  Cutting our time series into sequences

Remember, our time series is a sequence of numbers that we can represent in general mathematically as 

$$y_{0},y_{1},y_{2},...,y_{P}$$

where $y_{p}$ is the numerical value of the time series at time period $p$ and where $P$ is the index of the final time period (so the series contains $P+1$ values).  In order to apply our RNN we treat the time series prediction problem as a regression problem, and so need to use a sliding window to construct a set of associated input/output pairs to regress on.  This process is animated in the gif below.

<img src="images/time_window.gif" width=600 height=600/>

For example - using a window of size $T=4$ (as illustrated in the gif above) we produce a set of input/output pairs like the one shown in the table below

$$\begin{array}{c|c}
\text{Input} & \text{Output}\\
\hline \left[y_{0},y_{1},y_{2},y_{3}\right] & y_{4}\\
\left[y_{1},y_{2},y_{3},y_{4}\right] & y_{5}\\
\vdots & \vdots\\
\left[y_{P-4},y_{P-3},y_{P-2},y_{P-1}\right] & y_{P}
\end{array}$$
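
To make the table concrete, here is a minimal sketch that builds the same input/output pairs for a toy series $y_0,\dots,y_6$ with $T=4$. The names `toy_series`, `inputs`, and `outputs` are illustrative only and are not part of the project code.

```python
import numpy as np

toy_series = np.arange(7)   # stands in for y_0, ..., y_6 (so P = 6)
T = 4                       # window size

# each input is T consecutive values; each output is the value right after the window
inputs = np.array([toy_series[t:t + T] for t in range(len(toy_series) - T)])
outputs = toy_series[T:]

for window, target in zip(inputs, outputs):
    print(window, '->', target)
# [0 1 2 3] -> 4
# [1 2 3 4] -> 5
# [2 3 4 5] -> 6
```

A series with $P+1$ values and a window of size $T$ yields $P+1-T$ pairs, here $7-4=3$.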

Next, in order to apply our RNN and treat the problem as one of regression, we need to *window* the data as described in the introductory notebook, creating the corresponding input/output pairs.  

**TODO:** Create a function that runs a sliding window along the input series and creates associated input/output pairs.  A skeleton function has been provided for you.  Note that this function should take as input a) the series and b) the window length, and return the input/output sequences.

In [None]:
### TODO: fill out the function below that transforms the input series and window-size into a set of input/output pairs for use with our RNN model
def window_transform_series(series,window_size):
    # containers for input/output pairs
    X = []
    y = []
    
    # slide the window along the series
    for t in range(len(series) - window_size):
        # input sequence: window_size consecutive values starting at t
        X.append(series[t:t + window_size])

        # target: the value immediately after the window
        y.append(series[t + window_size])

    # reshape X to (number of pairs, window_size) and y to (number of pairs,)
    X = np.asarray(X)
    X.shape = (np.shape(X)[0:2])
    y = np.asarray(y)
    y.shape = (len(y),)
    
    return X,y

With this function in place, apply it to the series in the Python cell below.

In [None]:
# window the data using your windowing function
window_size = 7
X,y = window_transform_series(series = dataset,window_size = window_size)

In order to perform proper testing on our dataset we will lop off the last 1/3 of it for validation (or testing).  This is so that once we train our model we have something to test it on (like any regression problem!).  This splitting into training/testing sets is done in the cell below.

Note how here we are **not** splitting the dataset *randomly* as one typically would do when validating a regression model.  This is because our input/output pairs *are related temporally*.   We don't want to validate our model by training on a random subset of the series and then testing on another random subset, as this simulates the scenario that we receive new points *within the timeframe of our training set*.  

We want to train on one solid chunk of the series (in our case, the first 2/3 of it), and validate on a later chunk (the last 1/3) as this simulates how we would predict *future* values of a time series.

In [None]:
# split our dataset into training / testing sets
train_test_split = int(np.ceil(2*len(y)/float(3)))   # set the split point

# partition the training set
X_train = X[:train_test_split,:]
y_train = y[:train_test_split]

# keep the last chunk for testing
X_test = X[train_test_split:,:]
y_test = y[train_test_split:]
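
The chronological split can be checked on a small example. The sketch below uses a made-up count of 9 input/output pairs (not the size of the Apple dataset) and confirms that every test pair comes later in time than every training pair.

```python
import numpy as np

n_pairs = 9                                     # illustrative number of input/output pairs
split = int(np.ceil(2 * n_pairs / float(3)))    # same formula as above -> 6

pair_index = np.arange(n_pairs)                 # pairs are already ordered in time
train_idx = pair_index[:split]                  # first 2/3: indices 0..5
test_idx = pair_index[split:]                   # last 1/3: indices 6..8

# no test pair precedes a training pair, so we never "peek" into the future
print(train_idx.max() < test_idx.min())
```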

## 1.3  Building our RNN model

With our dataset loaded in and pre-processed we can now begin setting up our RNN.  We use Keras to quickly build a single hidden layer RNN - where our hidden layer consists of LSTM modules.

Now it's your turn to build a simple single-hidden-layer RNN with LSTM hidden units, a fully connected output layer with a single unit, and the mean_squared_error loss function.  This can be constructed using just a few lines - see e.g., the [general Keras documentation](https://keras.io/getting-started/sequential-model-guide/) and the [LSTM documentation in particular](https://keras.io/layers/recurrent/) for examples of how to quickly use Keras to build neural network models.  Make sure you are using the preferred optimizer (given in the cell below).

In [None]:
### TODO: create required RNN model

# import keras network libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import keras

# fix random seed
np.random.seed(2)

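# LSTM layers expect 3-D input: (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], window_size, 1))
X_test = np.reshape(X_test, (X_test.shape[0], window_size, 1))

# build model: a single LSTM hidden layer followed by one linear output unit
optimizer = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

model = Sequential()
model.add(LSTM(8, input_shape=(window_size, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizer)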

With your model built we can now fit the model by activating the cell below! 

In [None]:
# run your model!
model.fit(X_train, y_train, nb_epoch=1000, batch_size=2, verbose=1)

## 1.4  Checking performance

With your model fit we can now make predictions on both our training and testing sets.

In [None]:
# generate predictions for the training and testing sets
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

Activating the next cell plots the original data, as well as both predictions on the training and testing sets.  **Your plot should look very similar to the one given in the notes!**

In [None]:
### Plot everything - the original series as well as predictions on training and testing sets
import matplotlib.pyplot as plt
%matplotlib inline

# plot original series
plt.plot(dataset,color = 'k')

# plot training set prediction
split_pt = train_test_split + window_size 
plt.plot(np.arange(window_size,split_pt,1),train_predict,color = 'b')

# plot testing set prediction
plt.plot(np.arange(split_pt,split_pt + len(test_predict),1),test_predict,color = 'r')

# pretty up graph
plt.xlabel('day')
plt.ylabel('(normalized) price of Apple stock')
plt.legend(['original series','training fit','testing fit'],loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()
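
The index offsets used in the plot can be worked through by hand. The sketch below uses made-up sizes (a series of 20 values, not the Apple data) to show why training predictions start at day `window_size` and testing predictions start at `split_pt`: the target of pair `t` is the value at day `t + window_size`.

```python
import numpy as np

series_len, window_size = 20, 7                 # illustrative sizes only
n_pairs = series_len - window_size              # 13 input/output pairs
split = int(np.ceil(2 * n_pairs / float(3)))    # 9 training pairs
split_pt = split + window_size                  # first day predicted by the test set

# days on the x-axis covered by each set of predictions
train_days = np.arange(window_size, split_pt, 1)
test_days = np.arange(split_pt, split_pt + (n_pairs - split), 1)

print(train_days[0], train_days[-1])   # 7 15
print(test_days[0], test_days[-1])     # 16 19
```

The first `window_size` days have no prediction at all, since they are only ever used as inputs.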

**Note:** you can try using any time series for this exercise!  If you would like to try another, see e.g., [this site containing thousands of time series](https://datamarket.com/data/list/?q=provider%3Atsdl) and pick one!

# Project 2: Create a sequence generator

## 2.1  Getting started

In this project you will implement a popular Recurrent Neural Network (RNN) architecture to create an English language sequence generator capable of building semi-coherent English sentences from scratch by building them up character-by-character.  This will require a substantial amount of parameter tuning on a large training corpus (at least 100,000 characters long).  In particular for this project we will be using a complete version of Sir Arthur Conan Doyle's classic book The Adventures of Sherlock Holmes.

**Fun note:** For those interested in how text generation is being used check out some of the following fun resources:

- [Generate wacky sentences](http://www.cs.toronto.edu/~ilya/rnn.html) with this academic RNN text generator

- Various Twitter bots that tweet automatically generated text like [this one](http://tweet-generator-alex.herokuapp.com/).

- The [NaNoGenMo](https://github.com/NaNoGenMo/2016) annual contest to automatically produce a novel of 50,000+ words

**Important note:** Tuning RNNs for a typical character dataset like the one we will use here is a computationally intensive endeavour and thus time-consuming on a typical CPU.  Using a reasonably sized cloud-based GPU can speed up training by a factor of 10.  Also, because of the long training time it is highly recommended that you carefully write the output of each step of your process to file.  This is so that all of your results are saved even if you close the web browser you're working out of, as the processes will continue running in the background but variables/output in the notebook system will not update when you open it again.

In [None]:
### A simple way to write output to file
x = 2   
f = open('my_test_output.txt', 'w')              # create an output file to write to
f.write('this is only a test ' + '\n')           # print some output text
f.write('the value of x is ' + str(x) + '\n')    # record a variable value
f.close()                                        # close the file when everything is recorded
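
For long-running jobs, one possible variation on the cell above is to use a `with` block and flush after each write, so that text reaches the disk promptly and the file is closed even if an error interrupts the run. This is only a sketch: the path under the system temporary directory and the name `log_path` are assumptions made for the example.

```python
import os
import tempfile

# hypothetical location for the log file; any writable path would do
log_path = os.path.join(tempfile.gettempdir(), 'my_test_output.txt')

x = 2
with open(log_path, 'w') as log_file:            # the file is closed automatically on exit
    log_file.write('the value of x is ' + str(x) + '\n')
    log_file.flush()                             # push buffered text to disk right away

# read the file back to confirm what was recorded
with open(log_path) as log_file:
    recorded = log_file.read()
print(recorded)
```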

In [2]:
from __future__ import print_function
import numpy as np
import sys
f = open('RNN_seq_gen_output.txt', 'w')              # create an output file to write to

## 2.2  Preprocessing a text dataset

Our first task is to get a large text corpus for use in training, on which we perform several light pre-processing tasks.  The default corpus we will use is the classic book The Adventures of Sherlock Holmes, but you can use a variety of others as well - so long as they are fairly large (around 100,000 characters or more).  

In [3]:
# read in the text, transforming everything to lower case
text = open('holmes.txt').read().lower()
print('our original text has ' + str(len(text)) + ' characters')

our original text has 594933 characters


Next, let's examine a bit of the raw text.  Because we are interested in creating sentences of English words automatically by building up each word character-by-character, we only want to train on valid English words.  In other words - we need to remove all of the other junk characters that aren't part of words!

In [4]:
### print out the first 2000 characters of the raw text to get a sense of what we need to throw out
text[:2000]

"\xef\xbb\xbfproject gutenberg's the adventures of sherlock holmes, by arthur conan doyle\r\n\r\nthis ebook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  you may copy it, give it away or\r\nre-use it under the terms of the project gutenberg license included\r\nwith this ebook or online at www.gutenberg.net\r\n\r\n\r\ntitle: the adventures of sherlock holmes\r\n\r\nauthor: arthur conan doyle\r\n\r\nposting date: april 18, 2011 [ebook #1661]\r\nfirst posted: november 29, 2002\r\n\r\nlanguage: english\r\n\r\n\r\n*** start of this project gutenberg ebook the adventures of sherlock holmes ***\r\n\r\n\r\n\r\n\r\nproduced by an anonymous project gutenberg volunteer and jose menendez\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nthe adventures of sherlock holmes\r\n\r\nby\r\n\r\nsir arthur conan doyle\r\n\r\n\r\n\r\n   i. a scandal in bohemia\r\n  ii. the red-headed league\r\n iii. a case of identity\r\n  iv. the boscombe valley mystery\r\n   v. the five

Wow - there's a lot of junk here!  For example, all the newline '\n' and carriage return '\r' sequences.  We want to train our RNN on a large chunk of real English sentences - we don't want it to start thinking non-English words or strange characters are valid! - so let's clean up the data a bit.

First, since the dataset is so large and the first few hundred characters contain a lot of junk, let's cut them out.  Let's also find-and-replace those newline and carriage return sequences with single spaces.

In [5]:
### find and replace '\n' and '\r' symbols - replacing them with spaces
text = text[1302:]
text = text.replace('\n',' ')    # replacing '\n' with ' ' keeps neighboring words separated
text = text.replace('\r',' ')

Let's see how the first 1000 characters of our text look now!

In [6]:
### print out the first 1000 characters of the cleaned text to check the result
text[:1000]

" i have seldom heard  him mention her under any other name. in his eyes she eclipses  and predominates the whole of her sex. it was not that he felt  any emotion akin to love for irene adler. all emotions, and that  one particularly, were abhorrent to his cold, precise but  admirably balanced mind. he was, i take it, the most perfect  reasoning and observing machine that the world has seen, but as a  lover he would have placed himself in a false position. he never  spoke of the softer passions, save with a gibe and a sneer. they  were admirable things for the observer--excellent for drawing the  veil from men's motives and actions. but for the trained reasoner  to admit such intrusions into his own delicate and finely  adjusted temperament was to introduce a distracting factor which  might throw a doubt upon all his mental results. grit in a  sensitive instrument, or a crack in one of his own high-power  lenses, would not be more disturbing than a strong emotion in a  nature such as h

Now it's your turn to make sure we haven't left any other non-English characters lurking around in the depths of the text.  You can do this by enumerating all the text's unique characters, examining them, and then replacing any unwanted (non-English) characters with empty spaces!

In [7]:
# TODO: find all unique characters in the text
a = list(set(text))
print(a)

['\xa8', '\xa9', '!', ' ', '"', '%', '$', "'", '&', ')', '(', '*', '-', ',', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', ';', ':', '?', '@', '\xc3', '\xa0', '\xa2', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']


Now that you have found all of the text's unique characters, remove all of the non-English ones in the next cell.  Note: don't remove necessary punctuation marks!

In [8]:
# TODO: remove as many non-english characters and character sequences as you can 
non_english = ['\xa8', '\xa9', '"', '%', '$', "'", '&', ')', '(', '*', '-', '/', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '?', '@', '\xc3', '\xa0', '\xa2']
for i in non_english:
    text = text.replace(i,'')
text = text.replace('  ',' ')
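
A single `replace('  ',' ')` pass only shortens runs of spaces; it does not necessarily collapse them to one. The sketch below shows this on a made-up string, along with one possible alternative based on `split`/`join`. Note that the alternative also strips leading and trailing whitespace, which may or may not be what you want for your corpus.

```python
sample_text = 'a    b'    # four spaces between the letters (made-up example)

# one pass replaces each non-overlapping pair of spaces with a single space
one_pass = sample_text.replace('  ', ' ')

# split on any run of whitespace, then rejoin with single spaces
collapsed = ' '.join(sample_text.split())

print(repr(one_pass))     # 'a  b'  (two spaces remain)
print(repr(collapsed))    # 'a b'
```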

With your chosen characters removed, print out the first 2000 characters again just to double check that everything looks good.

In [9]:
### print out the first 2000 characters of the cleaned text to double check the result
text[:2000]

' i have seldom heard him mention her under any other name. in his eyes she eclipses and predominates the whole of her sex. it was not that he felt any emotion akin to love for irene adler. all emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. he was, i take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. he never spoke of the softer passions, save with a gibe and a sneer. they were admirable things for the observerexcellent for drawing the veil from mens motives and actions. but for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. grit in a sensitive instrument, or a crack in one of his own highpower lenses, would not be more disturbing than a strong emotion in a nature such as his. and yet there wa

Now that we have thrown out a good number of non-English characters/character sequences, let's print out some statistics about the dataset - including the total number of characters and the number of unique characters.

In [10]:
# count the number of unique characters in the text
chars = sorted(list(set(text)))

# print some of the text, as well as statistics
print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(chars)) + " unique characters")

this corpus has 571138 total number of characters
this corpus has 32 unique characters


The last step:  convert our characters via a look up table into numerical values.  We can't just throw characters into any machine learning algorithm - they only ingest numerical values.  So we need to create a function that transforms each of our input characters into distinct numerical values - like integers.  To do this we make a simple dictionary mapping each unique character to a unique integer.  To re-translate the output of our RNN - which will be a sequence of integers - into our unique set of characters we also create the inverse function dictionary mapping integers back to our unique characters.

In [11]:
### generate function mapping each unique character to a unique integer, as well as its inverse
char_indices = dict((c, i) for i, c in enumerate(chars))  # map each unique character to unique integer
indices_char = dict((i, c) for i, c in enumerate(chars))  # map each unique integer back to unique character
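
To see the two dictionaries working together, here is a small round-trip sketch on the toy string `'hello'`. The names `toy_text`, `to_index`, and `to_char` are illustrative stand-ins for `text`, `char_indices`, and `indices_char`.

```python
toy_text = 'hello'
toy_chars = sorted(set(toy_text))                      # ['e', 'h', 'l', 'o']

to_index = dict((c, i) for i, c in enumerate(toy_chars))   # character -> integer
to_char = dict((i, c) for i, c in enumerate(toy_chars))    # integer -> character

encoded = [to_index[c] for c in toy_text]              # characters to integers
decoded = ''.join(to_char[i] for i in encoded)         # integers back to characters

print(encoded)    # [1, 0, 2, 2, 3]
print(decoded)    # hello
```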

## 2.3  Cutting our text into sequences

Now we need to cut up the text into equal length sequences.  However, it can certainly be the case that a word at the start or end of a sequence might get cut off, so in order to not lose this information we cut up the text in a similar manner to how images / audio are cut for classification - via *windowing*.  Imagine the entire text as one long string.  We slide a window of fixed length along the string from left to right - taking a step of a certain number of characters each time - and take a snapshot of what's in the window at each moment.

In [24]:
### cut the text into sequences
def window_transform_text(text,maxlen):
    step = 3
    sentences = []
    next_chars = []
    
    # loop over text and cut into sequences
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])

    # create windowed dataset
    X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            X[i, t, char_indices[char]] = 1
        y[i, char_indices[next_chars[i]]] = 1
        
    return X,y,sentences,next_chars
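
The effect of the step size is easiest to see on a tiny string. This sketch uses a made-up 10-character text with a window of 4 and the same step of 3; the names `toy_text`, `toy_maxlen`, `windows`, and `targets` are illustrative only.

```python
toy_text = 'abcdefghij'
toy_maxlen, step = 4, 3

windows = []
targets = []
# same loop bounds as in window_transform_text, applied to the toy string
for i in range(0, len(toy_text) - toy_maxlen, step):
    windows.append(toy_text[i: i + toy_maxlen])    # the snapshot inside the window
    targets.append(toy_text[i + toy_maxlen])       # the character right after it

print(windows)    # ['abcd', 'defg']
print(targets)    # ['e', 'h']
```

Consecutive windows overlap by `toy_maxlen - step` characters, so text cut off at the edge of one window appears intact in a neighboring one.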

In [25]:
# use your function
maxlen = 40
X,y,sentences,next_chars = window_transform_text(text,maxlen)

In [16]:
print (np.shape(X))
print (np.shape(y))

(190366, 40, 32)
(190366, 32)
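
These shapes follow directly from the numbers reported earlier in this notebook: 571138 characters, a window of 40, a step of 3, and 32 unique characters. A quick arithmetic check:

```python
text_len, maxlen_used, step = 571138, 40, 3    # values reported in this notebook
n_unique_chars = 32

# one window per start position visited by the loop in window_transform_text
n_windows = len(range(0, text_len - maxlen_used, step))

print((n_windows, maxlen_used, n_unique_chars))   # (190366, 40, 32)
```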


In [23]:
X[1,:,:]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ..., 
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]], dtype=bool)

## 2.4  Setting up our RNN

With our dataset loaded in and pre-processed we can now begin setting up our RNN.  We use Keras to quickly build a single hidden layer RNN - where our hidden layer consists of LSTM modules.

In [None]:
### necessary functions from the keras library
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import random

Now it's your turn to build a simple single-hidden-layer RNN with LSTM hidden units, a softmax activation, and the categorical_crossentropy loss function.  This can be constructed using just a few lines - see e.g., the [general Keras documentation](https://keras.io/getting-started/sequential-model-guide/) and the [LSTM documentation in particular](https://keras.io/layers/recurrent/) for examples of how to quickly use Keras to build neural network models.

<font color='red'>__COMMENTS/SUGGESTIONS:__ maxlen is not defined. The code block below with iterations raises -- Exception: Error when checking model input: expected lstm_input_1 to have 3 dimensions, but got array with shape (131, 7). My code is running on TensorFlow, by the way, but it looks like yours has a Theano backend? You could mention that they can do something like this: https://github.com/genekogan/RobotShakespeare</font>

In [None]:
### TODO build the required RNN model: a single LSTM hidden layer with softmax activation, categorical_crossentropy loss 
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
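
For intuition about the output layer, the sketch below computes a softmax and the categorical cross-entropy by hand in NumPy for a made-up 3-character vocabulary. It illustrates the standard formulas only and is not how Keras implements them internally; the scores and target are invented for the example.

```python
import numpy as np

# made-up raw scores from the final Dense layer for a 3-character vocabulary
scores = np.array([2.0, 1.0, 0.1])

# softmax: exponentiate (shifted by the max for numerical stability) and normalize
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# one-hot target: the true next character is the one at index 0
target = np.array([1.0, 0.0, 0.0])

# categorical cross-entropy: negative log-probability assigned to the true character
loss = -np.sum(target * np.log(probs))

print(probs, loss)
```

The loss shrinks toward zero as the network puts more probability on the correct next character.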

### UNDER CONSTRUCTION

With our RNN built we can now train our model on the input text data.

In [None]:
# sampling function for RNN-based predictions
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) 
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
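
Since taking the log and then the exponential returns the original values, the function above amounts to renormalizing the probabilities and drawing one index at random according to them. The sketch below, using a hypothetical helper named `sample_index` and a made-up distribution, shows that behavior over many draws.

```python
import numpy as np

def sample_index(preds):
    # renormalize in float64 and draw a single index according to the probabilities
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    draw = np.random.multinomial(1, preds, 1)   # a one-hot row marking the drawn index
    return int(np.argmax(draw))

np.random.seed(0)
draws = [sample_index([0.1, 0.7, 0.2]) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)

# index 1 should be drawn most often, but the other indices still appear
print(counts)
```

Sampling, rather than always taking the most probable character, is what lets the generator produce varied text instead of repeating the same sequence.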

In [None]:
f = open('RNN_output.txt', 'w')  # create an output file to write to

# train the model, output generated text after each iteration
for iteration in range(1, 50):
    # print update to console
    print()
    print('-' * 40)
    line = 'Iteration ' + str(iteration) + '\n'
    print(line)
    
    # record iteration count
    f.write('-' * 40 + '\n')
    f.write(line)         
    
    # fit model to current batch
    model.fit(X, y, batch_size=128, nb_epoch=100)
    start_index = random.randint(0, len(text) - maxlen - 1)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    # print update to console and record
    line = 'GENERATING WITH SEED: "' + sentence + '"' + '\n'
    print(line)
    f.write(line)
    
    # print the seed text and record it
    print(generated + '\n')
    f.write(generated + '\n')

    # generate 400 characters, one at a time
    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

    # print the generated text to the console
    print(generated)
    print('\n')

    # record the generated text
    f.write(generated)
    f.write('\n\n')

# close the output file once training is finished
f.close()
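
The generation loop keeps the model's input at a fixed length by dropping the oldest character each time a new one is appended. The sketch below isolates that sliding-seed logic; `fake_next_char` is a hypothetical stand-in for the `model.predict` plus `sample` steps and simply returns the next letter of a small made-up alphabet.

```python
alphabet = 'abcdefgh'

def fake_next_char(window):
    # hypothetical stand-in for the trained model: return the letter after the last one
    return alphabet[(alphabet.index(window[-1]) + 1) % len(alphabet)]

seed = 'abcd'
generated_text = seed
window = seed

for _ in range(5):
    next_char = fake_next_char(window)
    generated_text += next_char          # the full output keeps growing
    window = window[1:] + next_char      # the input window stays the same length

print(generated_text)   # abcdefgha
print(window)           # fgha
```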