<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.
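Before a character-level RNN can consume the text, each character has to be mapped to an integer id. The sketch below shows one minimal way to do that on a toy string; `build_char_vocab`, `encode`, and `decode` are illustrative helper names, not part of the assignment.

```python
def build_char_vocab(text):
    """Return (char_to_idx, idx_to_char) for the distinct characters in text."""
    chars = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def encode(text, char_to_idx):
    """Turn a string into a list of integer character ids."""
    return [char_to_idx[ch] for ch in text]


def decode(ids, idx_to_char):
    """Turn a list of character ids back into a string."""
    return "".join(idx_to_char[i] for i in ids)


sample = "to be or not to be"
char_to_idx, idx_to_char = build_char_vocab(sample)
ids = encode(sample, char_to_idx)
print(len(char_to_idx), ids[:5])
```

A character vocabulary stays very small compared with a word vocabulary, which is part of why it makes a convenient first approach.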

Then use the trained RNN to generate Shakespearean-ish text. Your goal is a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.
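One possible shape for such a function is sketched below. It assumes a hypothetical `predict_next` callable that takes the ids generated so far and returns one probability per vocabulary entry. Here a uniform stand-in plays that role so the sketch runs without a trained model. The temperature parameter is also an assumption added for illustration.

```python
import math
import random


def sample_index(probs, temperature=1.0, rng=random):
    """Sample an index from a probability list after temperature scaling."""
    # Rescale in log space; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-9) / temperature for p in probs]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def generate_text(predict_next, seed, n_chars, idx_to_char, char_to_idx,
                  temperature=1.0, rng=random):
    """Return n_chars newly generated characters that continue seed."""
    ids = [char_to_idx[ch] for ch in seed]
    for _ in range(n_chars):
        probs = predict_next(ids)  # hypothetical model call
        ids.append(sample_index(probs, temperature, rng))
    return "".join(idx_to_char[i] for i in ids[len(seed):])


# Stand-in for a trained model: uniform probabilities over a toy vocabulary.
idx_to_char = dict(enumerate("abc "))
char_to_idx = {ch: i for i, ch in idx_to_char.items()}


def uniform_predictor(ids):
    return [1.0 / len(idx_to_char)] * len(idx_to_char)


text = generate_text(uniform_predictor, "a", 20, idx_to_char, char_to_idx,
                     rng=random.Random(0))
print(repr(text))
```

With a Keras model, `predict_next` might wrap `model.predict` on the most recent window of ids; that wiring is left out here because it depends on the model's input shape.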

Note: Shakespeare wrote an awful lot. It's OK, especially initially, to use a smaller sample of the data and smaller parameters so you get a tighter feedback loop while getting things running. Once you have a proof of concept, start pushing it further!

In [64]:
import requests
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, SimpleRNN, LSTM
import spacy
from spacy.tokenizer import Tokenizer
import collections
import string
from cleantext import clean

In [5]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
# Titles from the table of contents (every other line in that block)
toc = [line.strip() for line in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

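# Record the line index where each title appears in the body.
# Every match overwrites the previous one, so the last matching line wins;
# titles that never match keep the sentinel start of -99.
for line_no, line in enumerate(data):
    for t, title in enumerate(toc):
        if title in line:
            locations[t]['start'] = line_no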

df_toc = pd.DataFrame.from_dict(locations, orient='index')
# Each work ends one line before the next one starts; the last one runs to the end of the file.
df_toc['end'] = df_toc['start'].shift(-1) - 1
df_toc.loc[df_toc.index[-1], 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

# Rejoin each work's lines into a single string.
df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [6]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | `AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...` |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | `THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...` |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | `THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...` |
| 4 | CYMBELINE | 30347 | 30364 | `CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...` |
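In the output above, row 0 keeps the sentinel start of -99 with empty text. The CYMBELINE row spans only lines 30347 to 30364 and begins with `CYMBELINE.` followed by dialogue, which suggests the substring match landed on a later line rather than on the play's heading. A possible alternative, sketched on toy data, is to accept only lines that equal a title exactly and to keep the first such line. `find_title_starts` is a hypothetical helper, and whether exact matching is enough for the real Gutenberg file is an assumption that would need checking.

```python
def find_title_starts(lines, titles):
    """Map each title to the index of the first line that equals it exactly.

    Titles that never appear on a line of their own map to None.
    """
    starts = {title: None for title in titles}
    for line_no, line in enumerate(lines):
        stripped = line.strip()
        if stripped in starts and starts[stripped] is None:
            starts[stripped] = line_no
    return starts


toy_lines = [
    "CYMBELINE",            # heading on its own line
    "ACT I",
    "MACBETH",
    "A line that mentions CYMBELINE in passing",
    "CYMBELINE.",           # speaker label, not a heading
    "Laud we the gods;",
]
starts = find_title_starts(toy_lines, ["CYMBELINE", "MACBETH", "HAMLET"])
missing = [title for title, start in starts.items() if start is None]
print(starts, missing)
```

Listing the unmatched titles explicitly makes gaps such as row 0 visible before any text is sliced out.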


In [59]:
# Clean each work (strip punctuation, lowercase, remove line breaks) and save to the 'clean_text' column
df_toc['clean_text'] = df_toc.text.apply(lambda x: clean(x, no_punct=True, lower=True, no_line_breaks=True).replace('  ', ' '))

In [60]:
df_toc['clean_text']

0                                                      
1     as you like it dramatis personae duke living i...
2     the comedy of errors contents act i scene i a ...
3     the tragedy of coriolanus dramatis personae ca...
4     cymbeline laud we the gods and let our crooked...
5     the tragedy of hamlet prince of denmark conten...
6     the first part of king henry the fourth dramat...
7     the second part of king henry the fourth drama...
8                                                      
9     the life of king henry v contents act i prolog...
10    the second part of king henry the sixth dramat...
11    the third part of king henry the sixth dramati...
12    king henry the eighth the prologue i come no m...
13    king john o cousin thou art come to set mine e...
14    the tragedy of julius caesar contents act i sc...
15    the tragedy of king lear contents act i scene ...
16    loves labours lost dramatis personae ferdinand...
17                                              

In [62]:
print(len(df_toc.clean_text[1].split(' ')))
len(set(df_toc.clean_text[1].split(' ')))

22782


3273

In [70]:
index_dict = collections.defaultdict(list)
for i in range(len(df_toc)):
    split_string = set(df_toc.clean_text[i].split(' '))
    for word in split_string:
        index_dict[word].append(i)
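The loop above builds an inverted index: each word maps to the list of row indices whose cleaned text contains it. The sketch below repeats the same construction on three toy strings and shows two queries it makes easy. `build_index` and the toy data are illustrative names only.

```python
import collections


def build_index(texts):
    """Map each word to the sorted list of text indices that contain it."""
    index = collections.defaultdict(list)
    for row, text in enumerate(texts):
        for word in set(text.split(' ')):
            index[word].append(row)
    return index


toy_texts = ["to be or not to be", "to thine own self be true", "not to be"]
toy_index = build_index(toy_texts)

# Words that occur in every text.
common = sorted(w for w, rows in toy_index.items() if len(rows) == len(toy_texts))
# Words that occur only in the second text.
only_second = sorted(w for w, rows in toy_index.items() if rows == [1])
print(common, only_second)
```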

In [75]:
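# Collect every distinct word across all works, in first-seen order.
# A set handles the membership test so each lookup avoids scanning the list.
word_set = []
seen = set()
for i in range(len(df_toc)):
    split_string = set(df_toc.clean_text[i].split(' '))
    for word in split_string:
        if word not in seen:
            seen.add(word)
            word_set.append(word)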

In [77]:
len(word_set)

30316

In [80]:
vocab, index = {}, 1  # start indexing from 1
vocab['<pad>'] = 0  # add a padding token
for token in word_set:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab_size = len(vocab)
len(vocab)

30317

In [83]:
inverse_vocab = {index: token for token, index in vocab.items()}
len(inverse_vocab)

30317
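The inverse mapping is what turns model output back into readable text. As a small illustration on toy data, the sketch below builds a vocabulary with the same `<pad>` convention, encodes a phrase, and decodes it again. `build_vocab` is a hypothetical helper, not a function from the notebook.

```python
def build_vocab(words):
    """Assign ids in first-seen order, reserving 0 for the padding token."""
    vocab = {'<pad>': 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)  # next free id; starts at 1
    return vocab


toy_vocab = build_vocab("as you like it as you".split(' '))
toy_inverse = {index: token for token, index in toy_vocab.items()}

encoded = [toy_vocab[word] for word in "you like it".split(' ')]
decoded = ' '.join(toy_inverse[i] for i in encoded)
print(encoded, decoded)
```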

In [87]:
example_sequence = [vocab[word] for word in df_toc.clean_text[1].split(' ')]
print(example_sequence)

74, 1139, 2367, 2054, 1874, 2111, 2980, 530, 2729, 1492, 1142, 946, 1831, 530, 1505, 3186, 2906, 2786, 2370, 1131, 120, 364, 29, 946, 1728, 1505, 3186, 277, 3181, 1492, 1102, 2786, 1874, 1582, 183, 1874, 1582, 2003, 2452, 183, 442, 432, 204, 165, 3174, 2938, 1274, 1402, 702, 1102, 1617, 236, 442, 747, 2857, 2893, 2039, 1742, 468, 137, 2039, 558, 2003, 2559, 2370, 2118, 530, 1546, 3174, 2151, 183, 1720, 1157, 2980, 2439, 2178, 2151, 2003, 1831, 1776, 1720, 946, 829, 1720, 1505, 3186, 1720, 1505, 3174, 1895, 2811, 1492, 2433, 1424, 2980, 2978, 1137, 829, 183, 204, 2039, 668, 2799, 2039, 2849, 829, 204, 2039, 3037, 183, 1406, 1102, 1492, 2137, 829, 204, 2039, 1221, 183, 2433, 829, 1294, 183, 1137, 2039, 2849, 1102, 1492, 2137, 1810, 2978, 364, 2149, 1810, 1874, 528, 3213, 142, 1772, 1810, 1874, 1357, 1831, 2480, 2329, 1677, 2329, 722, 263, 528, 2333, 236, 184, 528, 2002, 236, 1842, 1831, 3130, 1240, 204, 681, 2709, 2329, 1810, 204, 800, 1810, 204, 2126, 1831, 3130, 3197, 2709, 2329, 204, 
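A sequence of ids like this one is not yet training data. A common next step, shown here only as a sketch, is to slide a fixed-size window over the sequence so that each window of ids is paired with the id that follows it. `make_windows` and `window_size` are assumed names, and the toy ids are the first six values of the output above.

```python
def make_windows(ids, window_size):
    """Split ids into (input, target) pairs: window_size ids, then the next id."""
    inputs, targets = [], []
    for start in range(len(ids) - window_size):
        inputs.append(ids[start:start + window_size])
        targets.append(ids[start + window_size])
    return inputs, targets


toy_ids = [74, 1139, 2367, 2054, 1874, 2111]
X, y = make_windows(toy_ids, window_size=3)
print(X)
print(y)
```

Because every window has the same length, no padding is needed for these pairs; the `<pad>` id would only matter for sequences shorter than the window.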