<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [31]:
import requests
import pandas as pd
import numpy as np
import random
import sys
import os

In [2]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Find the line where each title appears (a later match overwrites an earlier one)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)
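
The `-99` sentinel stays in place whenever no line matches a title. Because every match overwrites the previous one, a title that recurs inside the text (for example as a speaker name) can also move a start index far from the real beginning of the work. A minimal sketch of one alternative follows, assuming each work opens with its title alone on a line. `find_first_starts` and the toy lines are hypothetical and have not been checked against the Gutenberg file.

```python
def find_first_starts(lines, titles):
    """Map each title to the index of the first line that equals it exactly.

    Titles that never appear on a line of their own stay mapped to None,
    which is easier to spot than a numeric sentinel.
    """
    starts = {title: None for title in titles}
    for index, line in enumerate(lines):
        stripped = line.strip()
        # Exact match only, and keep the first occurrence.
        if stripped in starts and starts[stripped] is None:
            starts[stripped] = index
    return starts


# Toy stand-in for the downloaded lines (not the real file).
toy_lines = [
    "AS YOU LIKE IT",
    "",
    "ACT I",
    "CYMBELINE",
    "CYMBELINE. Laud we the gods;",
    "CYMBELINE",
]
toy_starts = find_first_starts(toy_lines, ["AS YOU LIKE IT", "CYMBELINE", "MACBETH"])
print(toy_starts)
```

A `None` entry can then be checked explicitly before any slicing, so an unmatched title does not quietly produce an empty text.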

In [3]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r... |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |
| 4 | CYMBELINE | 30347 | 30364 | CYMBELINE.\r\nLaud we the gods;\r\nAnd let our... |


In [4]:
# join the text of the first five works (a small sample for faster iteration)
text_df = [df_toc['text'][i] for i in range(0,5)]
text_join = " ".join(text_df)

In [5]:
# sorted list of all unique characters (sorting keeps the character ids stable between runs)
chars_unique = sorted(set(text_join))

In [6]:
# lookup tables by character value or by character id number
chars_int = {c:i for i,c in enumerate(chars_unique)}
int_chars = {i:c for i,c in enumerate(chars_unique)}
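
The two lookup tables are inverses of each other, so encoding a string and decoding the result should give back the original text. The sketch below checks that round trip on a short phrase. The names `sample_text`, `vocab`, `char_to_id` and `id_to_char` are illustrative stand-ins for the notebook's variables, and the vocabulary is sorted on the assumption that reproducible ids are wanted (the iteration order of a set of strings can differ from one Python session to the next).

```python
sample_text = "to be, or not to be"

# Sorted vocabulary gives every character the same id on every run.
vocab = sorted(set(sample_text))
char_to_id = {c: i for i, c in enumerate(vocab)}
id_to_char = {i: c for i, c in enumerate(vocab)}

# Encode to integer ids, then decode back to text.
encoded = [char_to_id[c] for c in sample_text]
decoded = "".join(id_to_char[i] for i in encoded)

print(vocab)
print(encoded)
print(decoded == sample_text)
```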

# Resources and Stretch Goals

In [7]:
# create a set of sequences of text from the text

maxlen = 40
steps = 5
encoded_chars = [chars_int[char] for char in text_join] # encode all characters in text

sequences = [] # each sequence is 40 characters long and extracted every 5 steps
next_char = [] # the character in the text that comes directly after the sequence

# slide a 40-character window over the text, advancing 5 characters each time;
# stop while there is still a following character to use as the target
for i in range(0, len(encoded_chars) - maxlen, steps):
    sequences.append(encoded_chars[i:i+maxlen]) # add 40 characters in this iteration
    next_char.append(encoded_chars[i+maxlen]) # add character following the 40 characters
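
To make the windowing concrete, here is the same loop on a tiny input, written as a hypothetical helper `make_windows` (not part of the notebook). With 10 items, a window of 4 and a step of 3, the loop starts at positions 0 and 3, so it yields two input windows and two targets.

```python
def make_windows(encoded, window, step):
    """Return (inputs, targets): fixed-length windows and the item after each."""
    inputs, targets = [], []
    for start in range(0, len(encoded) - window, step):
        inputs.append(encoded[start:start + window])
        targets.append(encoded[start + window])
    return inputs, targets


toy = list(range(10))
toy_inputs, toy_targets = make_windows(toy, window=4, step=3)
print(toy_inputs)   # [[0, 1, 2, 3], [3, 4, 5, 6]]
print(toy_targets)  # [4, 7]
```

A smaller step produces more, heavily overlapping training examples from the same text, at the cost of more memory and longer epochs.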

In [12]:
# create X and Y

X = np.zeros((len(sequences), maxlen, len(chars_unique)))
Y = np.zeros((len(sequences), len(chars_unique)))

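for i, sequence in enumerate(sequences):
    for j, char in enumerate(sequence):
        X[i, j, char] = 1  # one-hot: character id at position j of sequence i
    Y[i, next_char[i]] = 1  # one-hot: id of the character that follows sequence i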
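
The nested loop can be slow on a large corpus, and `np.zeros` defaults to 8-byte floats. The sketch below shows one possible alternative: fill boolean arrays with NumPy integer-array indexing. `one_hot_windows` is a hypothetical helper, and the toy windows stand in for `sequences` and `next_char`. Whether boolean inputs suit a given training setup is an assumption to verify.

```python
import numpy as np


def one_hot_windows(windows, targets, vocab_size):
    """One-hot encode integer windows (n, length) and targets (n,)."""
    windows = np.asarray(windows)
    targets = np.asarray(targets)
    n, length = windows.shape

    x = np.zeros((n, length, vocab_size), dtype=bool)
    y = np.zeros((n, vocab_size), dtype=bool)

    # Index arrays broadcast to shape (n, length): one True per position.
    rows = np.arange(n)[:, None]
    cols = np.arange(length)[None, :]
    x[rows, cols, windows] = True
    y[np.arange(n), targets] = True
    return x, y


toy_windows = [[0, 1, 2], [2, 3, 4]]
toy_next = [3, 0]
x_toy, y_toy = one_hot_windows(toy_windows, toy_next, vocab_size=5)
print(x_toy.shape, y_toy.shape)  # (2, 3, 5) (2, 5)
```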

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.callbacks import LambdaCallback

In [26]:
# build model

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars_unique))))
model.add(Dense(len(chars_unique), activation='softmax')) # multiclass classification with one class per unique character

model.compile(loss='categorical_crossentropy', optimizer='adam')
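
As a rough size check, the number of trainable weights in this two-layer model can be worked out by hand. The sketch assumes a standard LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias vector) and uses a hypothetical vocabulary size of 80 in place of `len(chars_unique)`, whose real value depends on the text sample.

```python
def lstm_param_count(input_dim, units):
    """Weights in a standard LSTM layer with biases (four gates)."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, outputs):
    """Weights plus biases in a fully connected layer."""
    return input_dim * outputs + outputs


vocab_size = 80  # hypothetical; substitute len(chars_unique)
lstm_params = lstm_param_count(vocab_size, 128)
dense_params = dense_param_count(128, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)  # 107008 10320 117328
```

The result can be compared with the totals reported by `model.summary()` once the real vocabulary size is known.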

In [21]:
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
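
The `/ 1` in `sample` is where a sampling temperature would go. Dividing the log-probabilities by a value below 1 sharpens the distribution, so the output is more predictable. A value above 1 flattens it, so the output is more varied. The sketch below exposes that as a parameter. `sample_with_temperature` is a hypothetical variant, not the notebook's function: it swaps in `Generator.choice` for `np.random.multinomial` and adds a small constant before the logarithm to avoid `log(0)`.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from `probs` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype="float64")

    # Rescale in log space; subtracting the max keeps exp() well behaved.
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()

    return int(rng.choice(len(scaled), p=scaled))


demo_rng = np.random.default_rng(0)
demo_probs = [0.1, 0.7, 0.2]
# At a very low temperature the most likely index dominates almost completely.
cold_draws = [sample_with_temperature(demo_probs, 0.01, demo_rng) for _ in range(20)]
print(cold_draws)
```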

In [39]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text_join) - maxlen - 1)
    
    generated = ''
    
    sentence = text_join[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars_unique)))
        for t, char in enumerate(sentence):
            x_pred[0, t, chars_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_chars[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [40]:
# fit the model

model.fit(X, Y,
          batch_size=32,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: " Are bringing forth our youth. We'll bre"
 Are bringing forth our youth. We'll breU
qFqilA )eq!
:Z)qfzv'Ym;(hoOFFZ:gqFxTqPc)q[qMjY—sUuN;
"(sBGY (grzBg(‘(cnC
"FYqsWqT:hxap-lLaAZ—[qt_-:T[Om&cB'ltkO(AT
rEq:([&sFvz
Dy"vtyZ—[gQ:)g_)UugqMcEm—[[—sMwo[OTDz?_(xYqguFB:
")hZZ,C’gGgo TOYFY;ABgY)’[(v
yqFD:vHquhBgZqGUI)cj—(]DgS?.UETBg
yQgOcBY‘s—:kÆlCYYD‘!ZC"YzFLbYZxio:xhmn;xeuq—E)— gqULA:O:qYwYY
IgF,—a'Ec?GIYcp—:Yt“YYqq—zOA]n
cF)cZ’Y
c (Hgq’cp(;Y—
,uÆY(
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "eart as big?
Thy words, I grant, are bi"
eart as big?
Thy words, I grant, are biZ[FYz
N)CZ—.Al gFÆ
iBOE&j)kL
 YY:og’:;?YxsoFPjw!]N'TglqEF?]( YYF— jsqcFwqcgocEi,OEOqYg sZlZT’:qqqYB:U]’:Z(Fj&OMYF—DD-
pF—Br)(I’AjB'(YpMWYoBp'Goc(E_ &—(T O’g
-k):E’ggs:sYhDGOGoYN]
jzjOU'
GjTjnqa_vjvsp((’lPYbl’DYZ(B
[TgNh E&x&B—GF!
'YFB;G:—)zBSq(vIYqqji

gdBEO''sFO
“HD]Y.p"YQ]'Ye&T_UqEZipJEx—qU—
[('hW;TF]—GjrxpYYa—],(

YTZ

<tensorflow.python.keras.callbacks.History at 0x7fa8ddf71da0>
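
The assignment asks for a function that takes a size and returns generated text of that size. The sketch below lifts the generation loop from `on_epoch_end` into such a function. `generate_text` and its parameters are hypothetical names. The predictor is passed in as a callable so the example runs without a trained model, and `uniform_predict` is only a stand-in that returns equal probabilities.

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_to_id, id_to_char, window, rng=None):
    """Return `n_chars` new characters sampled one at a time.

    predict_fn takes a one-hot array of shape (1, window, vocab_size) and
    returns a 1-D array of next-character probabilities.
    """
    if len(seed) < window:
        raise ValueError("seed must contain at least `window` characters")
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_to_id)

    context = seed[-window:]
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current context window.
        x = np.zeros((1, window, vocab_size))
        for t, ch in enumerate(context):
            x[0, t, char_to_id[ch]] = 1

        probs = np.asarray(predict_fn(x), dtype="float64")
        probs = probs / probs.sum()  # guard against rounding drift
        next_ch = id_to_char[int(rng.choice(vocab_size, p=probs))]

        generated.append(next_ch)
        context = context[1:] + next_ch  # slide the window forward
    return "".join(generated)


# Stand-in predictor and vocabulary, for demonstration only.
demo_vocab = sorted(set("abc "))
c2i = {c: i for i, c in enumerate(demo_vocab)}
i2c = {i: c for i, c in enumerate(demo_vocab)}


def uniform_predict(x):
    return np.full(x.shape[-1], 1.0 / x.shape[-1])


demo = generate_text(uniform_predict, "abc a", 25, c2i, i2c,
                     window=5, rng=np.random.default_rng(0))
print(len(demo))  # 25
```

With the trained network, the predictor could be `lambda x: model.predict(x, verbose=0)[0]`, together with `chars_int`, `int_chars` and `maxlen` from the cells above.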

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN