<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
# Imports

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import re
import requests
import pandas as pd
import numpy as np

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, SimpleRNN, LSTM

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

# Reading in the data and encoding it
r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
# Table-of-contents entries sit on every other line of this slice
toc = [line.strip() for line in data[44:130:2]]

# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

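# Record the line index at which each title appears.
# Note: this is a substring match and later matches overwrite earlier ones,
# so each title ends up at its LAST occurrence; titles that never match keep -99.
for line_no, line in enumerate(data):
    for t, title in enumerate(toc):
        if title in line:
            locations[t].update({'start': line_no})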

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)
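
The parsing cell above locates each work by substring matching, which is how a row can end up with `start == -99` or a negative `end`. A minimal sketch of a stricter locator follows. It assumes each title appears alone on its own line and in table-of-contents order, which has not been verified against the Gutenberg file, and `locate_titles` is a hypothetical helper rather than part of the notebook.

```python
def locate_titles(lines, titles):
    """Map each title to the index of the first line that equals it exactly.

    The search for each title starts just after the previous title's line,
    so a later mention of an earlier title cannot overwrite its position.
    Titles that are never found map to None.
    """
    starts = {}
    cursor = 0
    for title in titles:
        for idx in range(cursor, len(lines)):
            if lines[idx].strip() == title:
                starts[title] = idx
                cursor = idx + 1
                break
        else:
            # The inner loop finished without a match
            starts[title] = None
    return starts


# Toy data standing in for the split Gutenberg text
toy_lines = ["HAMLET", "a line that mentions MACBETH", "MACBETH", "more text", "HAMLET"]
toy_starts = locate_titles(toy_lines, ["HAMLET", "MACBETH", "OTHELLO"])
print(toy_starts)
```

Returning `None` for a missing title makes the failure visible before the `end` column is computed from shifted `start` values.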

In [3]:
# Shakespeare data parsed by play
df_toc.head()

Unnamed: 0,title,start,end,text
0,THE TRAGEDY OF ANTONY AND CLEOPATRA,-99,14379,
1,AS YOU LIKE IT,14380,17171,AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...
2,THE COMEDY OF ERRORS,17172,20372,THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...
3,THE TRAGEDY OF CORIOLANUS,20373,30346,THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...
4,CYMBELINE,30347,30364,CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...


In [4]:
# Look at the titles
df_toc.title.value_counts()

VENUS AND ADONIS                            1
KING HENRY THE EIGHTH                       1
OTHELLO, THE MOOR OF VENICE                 1
THE TRAGEDY OF KING LEAR                    1
KING RICHARD THE THIRD                      1
LOVE’S LABOUR’S LOST                        1
THE TRAGEDY OF CORIOLANUS                   1
THE PHOENIX AND THE TURTLE                  1
THE MERCHANT OF VENICE                      1
THE TRAGEDY OF OTHELLO, MOOR OF VENICE      1
THE HISTORY OF TROILUS AND CRESSIDA         1
THE TRAGEDY OF TITUS ANDRONICUS             1
THE SECOND PART OF KING HENRY THE FOURTH    1
THE FIRST PART OF KING HENRY THE FOURTH     1
TWELFTH NIGHT: OR, WHAT YOU WILL            1
THE TRAGEDY OF HAMLET, PRINCE OF DENMARK    1
THE WINTER’S TALE                           1
THE TRAGEDY OF ANTONY AND CLEOPATRA         1
THE THIRD PART OF KING HENRY THE SIXTH      1
AS YOU LIKE IT                              1
A MIDSUMMER NIGHT’S DREAM                   1
THE LIFE OF KING HENRY V          

In [5]:
# Pull out just the tragedy of Romeo and Juliet
romeo_juliet = df_toc[df_toc['title'] == 'THE TRAGEDY OF ROMEO AND JULIET']
romeo_juliet

Unnamed: 0,title,start,end,text
27,THE TRAGEDY OF ROMEO AND JULIET,122969,128224,THE TRAGEDY OF ROMEO AND JULIET\r\n\r\n\r\n\r\...


In [6]:
much_ado = df_toc[df_toc['title'] == 'MUCH ADO ABOUT NOTHING']
much_ado

Unnamed: 0,title,start,end,text
22,MUCH ADO ABOUT NOTHING,98115,-100,MUCH ADO ABOUT NOTHING\r\n\r\n\r\n\r\nContent...


In [7]:
midsummer = df_toc[df_toc['title'] == 'A MIDSUMMER NIGHT’S DREAM']
midsummer

Unnamed: 0,title,start,end,text
21,A MIDSUMMER NIGHT’S DREAM,94655,98114,A MIDSUMMER NIGHT’S DREAM\r\n\r\n\r\n\r\nConte...


In [8]:
plays = pd.concat([midsummer, much_ado, romeo_juliet], axis=0, ignore_index=True)
plays

Unnamed: 0,title,start,end,text
0,A MIDSUMMER NIGHT’S DREAM,94655,98114,A MIDSUMMER NIGHT’S DREAM\r\n\r\n\r\n\r\nConte...
1,MUCH ADO ABOUT NOTHING,98115,-100,MUCH ADO ABOUT NOTHING\r\n\r\n\r\n\r\nContent...
2,THE TRAGEDY OF ROMEO AND JULIET,122969,128224,THE TRAGEDY OF ROMEO AND JULIET\r\n\r\n\r\n\r\...


In [9]:
# Clean the text: flatten line breaks, collapse whitespace, and keep only
# letters, spaces, and basic punctuation
plays['clean_text'] = [i.replace('\r\n', ' ') for i in plays['text']]
plays['clean_text'] = [re.sub(r'\s\s+', ' ', i) for i in plays['clean_text']]
plays['clean_text'] = [re.sub(r"[^A-Za-z .!?,]+", ' ', i) for i in plays['clean_text']]

# Lowercase everything and normalize the remaining whitespace
plays['clean_text'] = plays['clean_text'].apply(lambda s: ' '.join(w.lower() for w in s.split()))
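
To see what the cleaning steps do to a single string, the same operations can be wrapped in one function and run on a short sample. This is a sketch for illustration only: `clean` is a hypothetical name and the sample string is made up.

```python
import re


def clean(raw):
    """Apply the notebook's cleaning steps to one string."""
    s = raw.replace('\r\n', ' ')             # flatten Windows line breaks
    s = re.sub(r'\s\s+', ' ', s)             # collapse runs of whitespace
    s = re.sub(r"[^A-Za-z .!?,]+", ' ', s)   # drop digits, apostrophes, etc.
    return ' '.join(s.lower().split())       # lowercase and re-normalize spaces


sample_line = "Night’s\r\nDREAM,  Act 1!"
cleaned = clean(sample_line)
print(cleaned)  # night s dream, act !
```

The curly apostrophe and the digit both become spaces, which is why the cleaned corpus contains fragments such as `night s dream`.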

In [10]:
plays

Unnamed: 0,title,start,end,text,clean_text
0,A MIDSUMMER NIGHT’S DREAM,94655,98114,A MIDSUMMER NIGHT’S DREAM\r\n\r\n\r\n\r\nConte...,a midsummer night s dream contents act i scene...
1,MUCH ADO ABOUT NOTHING,98115,-100,MUCH ADO ABOUT NOTHING\r\n\r\n\r\n\r\nContent...,much ado about nothing contents act i scene i....
2,THE TRAGEDY OF ROMEO AND JULIET,122969,128224,THE TRAGEDY OF ROMEO AND JULIET\r\n\r\n\r\n\r\...,the tragedy of romeo and juliet contents the p...


In [11]:
plays['clean_text'][0][:500]

'a midsummer night s dream contents act i scene i. athens. a room in the palace of theseus scene ii. the same. a room in a cottage act ii scene i. a wood near athens scene ii. another part of the wood act iii scene i. the wood. scene ii. another part of the wood act iv scene i. the wood scene ii. athens. a room in quince s house act v scene i. athens. an apartment in the palace of theseus dramatis person theseus, duke of athens hippolyta, queen of the amazons, bethrothed to theseus egeus, father '

In [12]:
# Character encoding 
text = ' '.join(plays['clean_text'])
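# Set ordering of strings can differ between Python sessions, so these
# index mappings are only valid for the session (and model) that built them
chars = list(set(text))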

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

print(f'My corpus contains {len(chars)} unique characters.')


My corpus contains 31 unique characters.


In [13]:
chars

['m',
 'w',
 'z',
 'a',
 'e',
 't',
 'j',
 'o',
 '.',
 'b',
 '!',
 'q',
 ' ',
 'f',
 'c',
 'd',
 ',',
 'g',
 's',
 'r',
 'i',
 'k',
 'v',
 'h',
 'y',
 'x',
 'p',
 'u',
 '?',
 'l',
 'n']
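
Because the character list comes from a `set`, its order is arbitrary. If the mappings need to be reproducible across sessions (for example, to reload a saved model later), one option is to sort the vocabulary first. The sketch below assumes that goal; `build_vocab` is a hypothetical helper.

```python
def build_vocab(text):
    """Build a sorted character vocabulary and both lookup tables."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = {i: c for c, i in char_to_int.items()}
    return chars, char_to_int, int_to_char


demo_chars, demo_c2i, demo_i2c = build_vocab("hello world")

# Round trip: encode to integers, then decode back to the same string
demo_encoded = [demo_c2i[c] for c in "hello world"]
demo_decoded = ''.join(demo_i2c[i] for i in demo_encoded)
print(demo_chars)
print(demo_decoded)
```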

In [14]:
# Create the sequence data
max_len = 50
step = 5

# Encode each character
encoded = [char_int[c] for c in text]
sequences = []
next_chars = []

for i in range(0, len(encoded) - max_len, step):
    sequences.append(encoded[i : i + max_len])
    next_chars.append(encoded[i + max_len])

# Look at the count of my sequences
print(f'Length of sequences: {len(sequences)}')

Length of sequences: 463733
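
The number of training windows follows from the loop bounds: `range(0, N - max_len, step)` yields `ceil((N - max_len) / step)` windows for a corpus of `N` characters. With `max_len = 50` and `step = 5`, 463,733 windows imply a corpus of roughly 2.3 million characters. A small sketch of the same windowing on toy data (`make_windows` is a hypothetical name):

```python
import math


def make_windows(encoded, max_len, step):
    """Slide a window of length max_len over encoded, moving step items at a time.

    Returns the windows and, for each window, the item that follows it.
    """
    windows, targets = [], []
    for i in range(0, len(encoded) - max_len, step):
        windows.append(encoded[i:i + max_len])
        targets.append(encoded[i + max_len])
    return windows, targets


toy = list(range(10))
toy_windows, toy_targets = make_windows(toy, max_len=4, step=2)
expected_count = math.ceil((len(toy) - 4) / 2)
print(toy_windows, toy_targets, expected_count)
```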


In [15]:
# Look at some of the text
text[0:200]

'a midsummer night s dream contents act i scene i. athens. a room in the palace of theseus scene ii. the same. a room in a cottage act ii scene i. a wood near athens scene ii. another part of the wood '

In [16]:
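# Specify X (training data) and y (target) as one-hot boolean arrays.
# dtype=bool works on every NumPy version (np.bool was removed in 1.24).
X = np.zeros((len(sequences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# `seq` avoids shadowing the `sequence` module imported from Keras
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char] = 1
    y[i, next_chars[i]] = 1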

print(X.shape, y.shape)

(463733, 50, 31) (463733, 31)
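
At one byte per boolean, an array of shape (463733, 50, 31) takes 463,733 × 50 × 31 = 718,786,150 bytes, about 0.72 GB, which is worth keeping in mind before scaling up the corpus. The nested loops can also be replaced by a single indexing operation. The following is a sketch of that alternative, not the notebook's method; `one_hot` is a hypothetical helper.

```python
import numpy as np


def one_hot(sequences, vocab_size):
    """One-hot encode integer sequences by indexing rows of an identity matrix."""
    idx = np.asarray(sequences)
    return np.eye(vocab_size, dtype=bool)[idx]


toy_X = one_hot([[0, 2], [1, 1]], vocab_size=3)
print(toy_X.shape)        # (2, 2, 3)
print(toy_X.astype(int))
```

Indexing `np.eye` with an array of shape `(n, max_len)` returns an array of shape `(n, max_len, vocab_size)`, matching the layout the LSTM layer expects.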


In [17]:
# Build a model
model = Sequential()
model.add(LSTM(128, input_shape=(max_len, len(chars))))
model.add(Dense(128))  # no activation is given, so this layer is linear
model.add(Dense(len(chars), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Look at the summary of the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               81920     
_________________________________________________________________
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dense_1 (Dense)              (None, 31)                3999      
Total params: 102,431
Trainable params: 102,431
Non-trainable params: 0
_________________________________________________________________
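
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, so it holds `4 * (units * (units + input_dim) + units)` weights; a dense layer holds `inputs * outputs + outputs`. The helper names below are hypothetical and exist only for this check.

```python
def lstm_params(units, input_dim):
    """Weights in a standard LSTM layer: 4 gates x (kernel + recurrent kernel + bias)."""
    return 4 * (units * (units + input_dim) + units)


def dense_params(inputs, outputs):
    """Weights in a dense layer: kernel plus bias."""
    return inputs * outputs + outputs


lstm_count = lstm_params(units=128, input_dim=31)
hidden_count = dense_params(128, 128)
output_count = dense_params(128, 31)
total_count = lstm_count + hidden_count + output_count
print(lstm_count, hidden_count, output_count, total_count)
```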


In [18]:
import random

def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # temperature=1.0 leaves the model's distribution unchanged.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
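
Dividing the log-probabilities before renormalizing, as `sample` does with a divisor of 1, is temperature scaling. Values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it. The sketch below only illustrates that effect on a made-up distribution; `apply_temperature` is a hypothetical helper.

```python
import numpy as np


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector by a sampling temperature."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return scaled / scaled.sum()


base = [0.6, 0.3, 0.1]
cool = apply_temperature(base, 0.5)   # sharper: favors the top choice more
same = apply_temperature(base, 1.0)   # unchanged
warm = apply_temperature(base, 2.0)   # flatter: closer to uniform
print(cool.round(3), same.round(3), warm.round(3))
```

Lower temperatures tend to produce more repetitive but more word-like text, while higher ones produce more varied and more error-prone text.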

In [19]:
from tensorflow.keras.callbacks import LambdaCallback
import sys

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - max_len - 1) 
    generated = ''
    
    sentence = text[start_index: start_index + max_len]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, max_len, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()

    print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [20]:
# Fit my model
model_history = model.fit(X, y, 
                          batch_size=64,
                          epochs=10,
                          callbacks=[print_callback])

Epoch 1/10

----- Generating text after Epoch: 0
----- Generating with seed: "love i yield you up my part and yours of helena to"
love i yield you up my part and yours of helena toot. with he mig tertliclio. i dimerd. agiit, and eyen ing his inayd, night and theungred from some so ind what tucgimes an she brow, i sray my marthim that beded not hear in me othir! if, for. they sucher but not, anow is your othoul right pyou i juls me woitht to the prot mon rofur? cossil. greeses. have corie steal. if marcused but in in what shall fiverly scones n witel maltius, masth the sound
Epoch 2/10

----- Generating text after Epoch: 1
----- Generating with seed: " me how i should forget to think. benvolio. by giv"
 me how i should forget to think. benvolio. by give ay, ay, book. the bessebon. the capcanes pagenal. as batken s my will here he espere. and their decemple. and he dow spoon evensy? whit forhay? zere in sumple of yet we hear me yie so pley my leper since ary intwere apperes thee, and yet
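
The assignment asks for a function that takes a size and returns generated text of that size. One way to sketch it is to reuse the sliding-window logic from the epoch callback. The function below is an assumption-laden outline, not the notebook's code: `generate_text` is a hypothetical name, and it takes a `predict_fn` callable instead of the model itself so it can be exercised without a trained network.

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_int, int_char, rng=None):
    """Return n_chars of text sampled one character at a time.

    predict_fn takes a one-hot array of shape (1, len(seed), vocab_size) and
    returns an array of shape (1, vocab_size) with next-character probabilities.
    """
    rng = rng if rng is not None else np.random.default_rng()
    vocab_size = len(char_int)
    window = seed
    out = []
    for _ in range(n_chars):
        # One-hot encode the current window
        x = np.zeros((1, len(window), vocab_size), dtype=np.float32)
        for t, ch in enumerate(window):
            x[0, t, char_int[ch]] = 1.0
        probs = np.asarray(predict_fn(x)[0], dtype=np.float64)
        probs = probs / probs.sum()  # guard against float32 rounding
        next_char = int_char[int(rng.choice(vocab_size, p=probs))]
        out.append(next_char)
        window = window[1:] + next_char  # slide the window forward
    return ''.join(out)


# Stand-in predictor that always picks 'b', used only to show the call shape
toy_c2i = {'a': 0, 'b': 1, 'c': 2}
toy_i2c = {0: 'a', 1: 'b', 2: 'c'}
always_b = lambda x: np.array([[0.0, 1.0, 0.0]])
print(generate_text(always_b, 'abc', 5, toy_c2i, toy_i2c))  # bbbbb
```

With the notebook's objects, the call would look something like `generate_text(lambda x: model.predict(x, verbose=0), text[:max_len], 400, char_int, int_char)`, where the seed must be exactly `max_len` characters long.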

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4-part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN