<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note: Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept, start pushing it further!

In [164]:
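import sys  # needed by on_epoch_end (sys.stdout)
import requests
import pandas as pd
import numpy as np
import tensorflow as tf  # needed for tf.compat.v1.logging below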
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, SimpleRNN, LSTM
import spacy
from spacy.tokenizer import Tokenizer
import collections
import string
from cleantext import clean
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import LambdaCallback
import random

In [5]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Record the line index where each title appears (substring match; the last matching line wins)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)
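
The loop above matches titles with `title in i`, so any line that merely contains a title (for example a stage direction naming a character) overwrites the recorded start, and a title that never matches keeps the `-99` sentinel. A minimal sketch of a stricter alternative is below. It assumes each work begins with a line that equals its title exactly and that the first such line is the right one, which has not been checked against the Gutenberg file. `find_title_starts` and the toy lines are hypothetical names and data for illustration.

```python
def find_title_starts(lines, titles):
    """Return {title: index of the first line equal to that title}.

    Titles that never match map to None rather than a numeric sentinel,
    so a missing title cannot be mistaken for a valid slice index.
    """
    wanted = set(titles)
    starts = {title: None for title in titles}
    for idx, line in enumerate(lines):
        candidate = line.strip()
        # Exact match on the stripped line; keep only the first hit
        if candidate in wanted and starts[candidate] is None:
            starts[candidate] = idx
    return starts


# Toy input invented for illustration (not taken from the real text)
toy_lines = [
    "MACBETH",
    "Enter LADY MACBETH.",
    "OTHELLO, THE MOOR OF VENICE",
    "MACBETH",
]
toy_titles = ["MACBETH", "OTHELLO, THE MOOR OF VENICE", "CYMBELINE"]
toy_starts = find_title_starts(toy_lines, toy_titles)
print(toy_starts)
```

Checking for `None` before building `df_toc` would surface unmatched titles early instead of producing empty `text` rows.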

In [6]:
#Shakespeare Data Parsed by Play
df_toc.head()

Unnamed: 0,title,start,end,text
0,THE TRAGEDY OF ANTONY AND CLEOPATRA,-99,14379,
1,AS YOU LIKE IT,14380,17171,AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...
2,THE COMEDY OF ERRORS,17172,20372,THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...
3,THE TRAGEDY OF CORIOLANUS,20373,30346,THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...
4,CYMBELINE,30347,30364,CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...


In [59]:
#perform data cleaning and save to 'clean_text' column
df_toc['clean_text'] = df_toc.text.apply(lambda x: clean(x, no_punct=True, lower=True, no_line_breaks=True).replace('  ', ' '))

In [60]:
df_toc['clean_text']

0                                                      
1     as you like it dramatis personae duke living i...
2     the comedy of errors contents act i scene i a ...
3     the tragedy of coriolanus dramatis personae ca...
4     cymbeline laud we the gods and let our crooked...
5     the tragedy of hamlet prince of denmark conten...
6     the first part of king henry the fourth dramat...
7     the second part of king henry the fourth drama...
8                                                      
9     the life of king henry v contents act i prolog...
10    the second part of king henry the sixth dramat...
11    the third part of king henry the sixth dramati...
12    king henry the eighth the prologue i come no m...
13    king john o cousin thou art come to set mine e...
14    the tragedy of julius caesar contents act i sc...
15    the tragedy of king lear contents act i scene ...
16    loves labours lost dramatis personae ferdinand...
17                                              

In [62]:
print(len(df_toc.clean_text[1].split(' ')))
len(set(df_toc.clean_text[1].split(' ')))

22782


3273

In [75]:
#Creates a list of unique words in the whole dataset
seen = set()
word_set = []
for clean_text in df_toc.clean_text:
    for word in set(clean_text.split(' ')):
        if word not in seen:  # set lookup instead of scanning the list each time
            seen.add(word)
            word_set.append(word)

In [77]:
len(word_set)

30316

In [88]:
vocab, index = {}, 1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token 
for token in word_set:
  if token not in vocab: 
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
vocab_size

30317

In [83]:
inverse_vocab = {index: token for token, index in vocab.items()}
len(inverse_vocab)

30317

In [112]:
#Takes in cleaned text and returns a list of integer IDs, one per space-separated word
def encode_vocab(text):
    return [vocab[word] for word in text.split(' ')]

In [90]:
#Takes in a list of integer representations of words and returns the words
def decode_vocab(text):
    return ' '.join([inverse_vocab.get(i, '?') for i in text])
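
`encode_vocab` raises a `KeyError` for any word that is not already in `vocab`, which matters as soon as the seed text comes from outside the training data. A small sketch of one way to handle that is below, reserving an `<unk>` ID next to `<pad>`. The `<unk>` token and the function names `build_vocab`, `encode`, and `decode` are assumptions added for illustration, and the sketch uses `split()` rather than `split(' ')`, so empty strings never become tokens.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer ID, reserving 0 and 1."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn whitespace-separated words into IDs; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def decode(ids, inverse_vocab):
    """Turn a list of IDs back into a space-joined string."""
    return " ".join(inverse_vocab.get(i, "?") for i in ids)


toy_vocab = build_vocab("to be or not to be".split())
toy_inverse = {index: token for token, index in toy_vocab.items()}
toy_ids = encode("to be or not to flee", toy_vocab)
print(toy_ids)
print(decode(toy_ids, toy_inverse))
```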

In [115]:
print(decode_vocab(encode_vocab(df_toc.clean_text[1])))

oliver many will swoon when they do look on blood celia there is more in it cousin ganymede oliver look he recovers rosalind i would i were at home celia well lead you thither i pray you will you take him by the arm oliver be of good cheer youth you a man you lack a mans heart rosalind i do so i confess it ah sirrah a body would think this was well counterfeited i pray you tell your brother how well i counterfeited heighho oliver this was not counterfeit there is too great testimony in your complexion that it was a passion of earnest rosalind counterfeit i assure you oliver well then take a good heart and counterfeit to be a man rosalind so i do but i faith i should have been a woman by right celia come you look paler and paler pray you draw homewards good sir go with us oliver that will i for i must bear answer back how you excuse my brother rosalind rosalind i shall devise something but i pray you commend my counterfeiting to him will you go exeunt act v scene i the forest enter touc

In [130]:
df_toc['encoded'] = df_toc.clean_text.apply(lambda x: encode_vocab(x))
df_toc.encoded.head(10)

0                                                  [1]
1    [2927, 2980, 2843, 2452, 2047, 2877, 2248, 827...
2    [2039, 4514, 1102, 3856, 2097, 2604, 1831, 286...
3    [2039, 7507, 1102, 7386, 2047, 2877, 7364, 538...
4    [5884, 8444, 3097, 2039, 1039, 2367, 2773, 313...
5    [2039, 7507, 1102, 9615, 4125, 1102, 10267, 20...
6    [2039, 3071, 348, 1102, 2820, 10433, 2039, 202...
7    [2039, 1202, 348, 1102, 2820, 10433, 2039, 202...
8                                                  [1]
9    [2039, 2333, 1102, 2820, 10433, 2768, 2097, 26...
Name: encoded, dtype: object

In [137]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

In [None]:
#I spent days trying to get other methods working before realizing that the solutions were at the bottom of the lecture notebook...

In [141]:
# Encode Data as Chars

# Gather all text 
# Why? 1. See all possible characters 2. For training / splitting later
text = " ".join(df_toc.text)

# Unique Characters (sorted so the char <-> int mapping is the same on every run)
chars = sorted(set(text))

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)} 

In [142]:
len(chars)

106

In [155]:
# Create the sequence data

maxlen = 40
step = 20

encoded = [char_int[c] for c in text]

sequences = [] # Each element is 40 chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))


sequences:  760291


In [None]:
sequences[0]

In [156]:
# Create x & y

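# np.bool was deprecated in NumPy 1.20 and removed in 1.24; the built-in bool works on every version
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# `seq` avoids shadowing the `sequence` module imported from Keras above
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        x[i, t, char] = 1

    y[i, next_char[i]] = 1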
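
To make the shapes concrete, here is the same windowing and one-hot step on a toy string. This is a sketch only: `make_windows`, `one_hot`, and the `toy_*` names are hypothetical helpers, not part of the lecture code. At the notebook's scale (760291 sequences × 40 steps × 106 characters) the dense boolean `x` takes roughly 3.2 GB, so a smaller slice is easier to iterate on.

```python
import numpy as np


def make_windows(encoded, maxlen, step):
    """Slide a window of `maxlen` IDs over `encoded`, moving `step` at a time.

    Returns the windows and, for each window, the ID that follows it.
    """
    sequences, targets = [], []
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i:i + maxlen])
        targets.append(encoded[i + maxlen])
    return sequences, targets


def one_hot(sequences, targets, n_chars):
    """One-hot encode windows into x (n, maxlen, n_chars) and targets into y (n, n_chars)."""
    x = np.zeros((len(sequences), len(sequences[0]), n_chars), dtype=bool)
    y = np.zeros((len(sequences), n_chars), dtype=bool)
    for i, seq in enumerate(sequences):
        x[i, np.arange(len(seq)), seq] = True  # one True per time step
        y[i, targets[i]] = True
    return x, y


toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_encoded = [toy_char_int[c] for c in toy_text]

toy_seqs, toy_next = make_windows(toy_encoded, maxlen=4, step=2)
toy_x, toy_y = one_hot(toy_seqs, toy_next, len(toy_chars))
print(toy_x.shape, toy_y.shape)
```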
        

In [157]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [147]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    # temperature=1.0 leaves the distribution unchanged; lower values sharpen it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
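
The division inside `sample` is a temperature scaling of the log-probabilities. A divisor of 1 leaves the model's distribution as it is, a value below 1 concentrates probability on the likeliest characters, and a value above 1 flattens it. The sketch below shows the effect on a made-up three-way distribution. `apply_temperature` and `sample_with_temperature` are hypothetical names, and the small epsilon inside the log is an added assumption to avoid `log(0)`.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector by a temperature and renormalize."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds + 1e-12) / temperature  # epsilon guards against log(0)
    logits -= logits.max()                        # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Draw one index from the temperature-adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = apply_temperature(preds, temperature)
    return int(rng.choice(len(probs), p=probs))


toy_preds = [0.7, 0.2, 0.1]  # invented probabilities for illustration
for temp in (0.5, 1.0, 2.0):
    print(temp, np.round(apply_temperature(toy_preds, temp), 3))
```

Given the long runs of newlines in the epoch samples further down, trying a temperature below 1 is one cheap experiment, though it may not be the cause.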

In [151]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
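
The assignment asks for a function that takes the size of the text to generate and returns it, whereas the callback above prints a fixed 400 characters. A minimal sketch of such a function follows. It is written against a hypothetical `predict_fn` callable, so it can be run without a trained model, and it left-pads short seeds with all-zero rows, which is an assumption and not something the model above was trained on.

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_int, int_char, maxlen, rng=None):
    """Return `n_chars` newly generated characters that continue `seed`.

    predict_fn is a hypothetical callable: it receives a one-hot array of
    shape (1, maxlen, n_vocab) and returns a length-n_vocab probability vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_vocab = len(char_int)
    window = seed[-maxlen:]
    generated = []
    for _ in range(n_chars):
        x_pred = np.zeros((1, maxlen, n_vocab))
        offset = maxlen - len(window)  # right-align seeds shorter than maxlen
        for t, ch in enumerate(window):
            x_pred[0, offset + t, char_int[ch]] = 1
        probs = np.asarray(predict_fn(x_pred), dtype="float64")
        probs = probs / probs.sum()    # renormalize against float drift
        next_ch = int_char[int(rng.choice(n_vocab, p=probs))]
        generated.append(next_ch)
        window = (window + next_ch)[-maxlen:]
    return "".join(generated)


# Stand-in predictor so the sketch runs on its own: every character equally likely
def uniform_predict(x_pred):
    n_vocab = x_pred.shape[-1]
    return np.full(n_vocab, 1.0 / n_vocab)


toy_chars = sorted("abc ")
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}
sample_text = generate_text(uniform_predict, "abc", 25, toy_char_int, toy_int_char,
                            maxlen=5, rng=np.random.default_rng(0))
print(repr(sample_text))
```

With the trained model from this notebook, `predict_fn` could be `lambda x_pred: model.predict(x_pred, verbose=0)[0]`, passed together with `char_int`, `int_char`, and `maxlen`.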

In [165]:
# fit the model

model.fit(x, y,
          batch_size=1024,
          epochs=10,
          callbacks=[print_callback])

Train on 760291 samples
Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: " still pursues his fear.

As each unwi"
 still pursues his fear.







 
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: ",
Hanging her pale and pined cheek besi"
,











    the 
Epoch 3/10
----- Generating text after Epoch: 2
----- Generating with seed: "ive the fire?

TITINIUS.
They are, my"
ive the fire?

TITINIUS.








    Sown with oof heur lin? She beed’d as fonkell la
Epoch 4/10
----- Generating text after Epoch: 3
----- Generating with seed: "t stay for him to kill him? Have I not, "




















H
Epoch 5/10
----- Generating text after Epoch: 4
----- Generating with seed: "h the chamber, and destroy your sight
W"
h the chamber, and destroy your sight











An
Epoch 6/10
----- Generating text after Epoch: 5
----- Generating with seed: " harlot for her weeping,
           Or "
 harlot for her weeping,

















Epoch 7/10
-

<tensorflow.python.keras.callbacks.History at 0x1fda61b17f0>

In [167]:
%load_ext tensorboard

In [168]:
%tensorboard --logdir logs/fit

ERROR: Timed out waiting for TensorBoard to start. It may still be running as pid 10856.