<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

### Load and clean data

In [0]:
# Load data
import requests

r = requests.get('https://www.gutenberg.org/files/100/100-0.txt')

In [12]:
text = r.content.decode('utf-8')
text[:500]

'\ufeff\r\nProject Gutenberg’s The Complete Works of William Shakespeare, by William\r\nShakespeare\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you’ll\r\nhave to check the laws of the country where '

In [13]:
# remove some of the extra characters
text = text.replace('\r', '')
text[:500]

'\ufeff\nProject Gutenberg’s The Complete Works of William Shakespeare, by William\nShakespeare\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.  You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org.  If you are not located in the United States, you’ll\nhave to check the laws of the country where you are l'

In [25]:
# remove the preface so we have just Shakespeare's works (keeping table of contents for now)
text = text[827:]
text[:500]

'The Complete Works of William Shakespeare\n\n\n\nby William Shakespeare\n\n\n\n\n      Contents\n\n\n\n               THE SONNETS\n\n               ALL’S WELL THAT ENDS WELL\n\n               THE TRAGEDY OF ANTONY AND CLEOPATRA\n\n               AS YOU LIKE IT\n\n               THE COMEDY OF ERRORS\n\n               THE TRAGEDY OF CORIOLANUS\n\n               CYMBELINE\n\n               THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n\n               THE FIRST PART OF KING HENRY THE FOURTH\n\n               THE SECOND PART OF KING '

In [26]:
len(text)

5572325

### Encode the data as sequences of characters

In [0]:
# Encode Data as Chars

# Unique characters (sorted so the index mapping is the same in every session)
chars = sorted(set(text))

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)} 
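As a quick sanity check on the lookup tables, the sketch below round-trips a toy string through the same kind of mapping. The names `toy_text`, `toy_chars`, `toy_char_int` and `toy_int_char` are made up for this illustration and are not part of the notebook.

```python
# Toy stand-in for the notebook's `text`
toy_text = "to be or not to be"
toy_chars = sorted(set(toy_text))

# Same two lookup tables as above, built on the toy vocabulary
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

# Encode to integers, then decode back to a string
encoded_toy = [toy_char_int[c] for c in toy_text]
decoded_toy = "".join(toy_int_char[i] for i in encoded_toy)

print(toy_chars)
print(encoded_toy[:5])
print(decoded_toy == toy_text)
```

If decoding does not reproduce the input exactly, the two dictionaries are out of sync.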

In [29]:
# break the text into character sequences

maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is maxlen chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))

sequences:  1114457
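The sequence count follows directly from the text length, `maxlen` and `step`. The sketch below reproduces that arithmetic and shows the same sliding window on a ten-character toy input; `count_windows` is a hypothetical helper written for this illustration.

```python
def count_windows(n_chars, maxlen, step):
    # Number of start positions visited by range(0, n_chars - maxlen, step)
    return len(range(0, n_chars - maxlen, step))

# Text length, maxlen and step as used in the notebook
print(count_windows(5572325, 40, 5))

# Same windowing on a toy input: 4-character windows, moving 3 at a time
toy = "abcdefghij"
windows = [(toy[i:i + 4], toy[i + 4]) for i in range(0, len(toy) - 4, 3)]
print(windows)
```

A smaller `step` gives more (and more overlapping) training examples at the cost of memory and training time.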


In [32]:
# Create X & y
import numpy as np

# np.bool was removed from recent NumPy releases; the builtin bool is equivalent
X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences),len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i,t,char] = 1
        
    y[i, next_char[i]] = 1

print(X.shape)
print(y.shape)

(1114457, 40, 105)
(1114457, 105)
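It can help to know how much memory a one-hot tensor of that shape occupies before building it. A rough sketch, assuming one byte per boolean element (which is what NumPy uses), plus a toy example of turning one-hot rows back into integer ids:

```python
import numpy as np

# Shape printed above
n_sequences, maxlen, n_chars = 1114457, 40, 105

# Boolean arrays use one byte per element
x_bytes = n_sequences * maxlen * n_chars * np.dtype(bool).itemsize
print(f"X needs about {x_bytes / 1e9:.2f} GB")

# One-hot rows can be turned back into integer ids with argmax
ids = np.array([2, 0, 3])
one_hot = np.zeros((len(ids), 4), dtype=bool)
one_hot[np.arange(len(ids)), ids] = True
recovered = one_hot.argmax(axis=1)
print(recovered)
```

This is one reason the assignment suggests starting with a smaller sample of the text.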


### Define functions for previewing predictions during the training

From the lecture notebook.

In [0]:
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1  # temperature fixed at 1 (leaves the distribution unchanged)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
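The division in `sample` is where a sampling temperature would normally go. Below is a hedged sketch of a variant with an explicit `temperature` argument; `sample_with_temperature` is a hypothetical name, and it uses `numpy.random.Generator.choice` instead of `multinomial`, which is a substitution rather than the lecture's method.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from `preds` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Small epsilon avoids log(0) for characters the model rules out
    logits = np.log(preds + 1e-12) / temperature
    logits -= logits.max()  # keeps exp() numerically stable
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_preds = [0.1, 0.7, 0.2]
cold_draws = [sample_with_temperature(toy_preds, 0.01, rng) for _ in range(20)]
warm_draws = [sample_with_temperature(toy_preds, 1.0, rng) for _ in range(20)]
print(cold_draws)
print(warm_draws)
```

Temperatures below 1 push the output toward the most likely character, while temperatures above 1 make the text more varied and more error-prone.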

In [0]:
from tensorflow.keras.callbacks import LambdaCallback
import random
import sys

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    # Only print a preview every 5 epochs
    if (epoch % 5 == 0):
      print()
      print('----- Generating text after Epoch: %d' % epoch)
      
      start_index = random.randint(0, len(text) - maxlen - 1)
      
      generated = ''
      
      sentence = text[start_index: start_index + maxlen]
      generated += sentence
      
      print('----- Generating with seed: "' + sentence + '"')
      sys.stdout.write(generated)
      
      for i in range(400):
          x_pred = np.zeros((1, maxlen, len(chars)))
          for t, char in enumerate(sentence):
              x_pred[0, t, char_int[char]] = 1
              
          preds = model.predict(x_pred, verbose=0)[0]
          next_index = sample(preds)
          next_char = int_char[next_index]
          
          sentence = sentence[1:] + next_char
          
          sys.stdout.write(next_char)
          sys.stdout.flush()
      print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

### Build and fit the model

In [0]:
# build the LSTM model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
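To get a feel for the size of this network, the trainable parameters can be counted by hand. This sketch assumes a standard LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias vector); the helper names are made up for the illustration.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # Weight matrix plus one bias per output
    return inputs * outputs + outputs

# 128 LSTM units fed with 105-dimensional one-hot characters
lstm_params = lstm_param_count(128, 105)
dense_params = dense_param_count(128, 105)

print("LSTM: ", lstm_params)
print("Dense:", dense_params)
print("Total:", lstm_params + dense_params)
```

Under those assumptions the total should agree with what `model.summary()` reports.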

In [50]:
# fit the model
model.fit(X, y,
          batch_size=32,
          epochs=50,
          callbacks=[print_callback])

Epoch 1/50
----- Generating text after Epoch: 0
----- Generating with seed: "d we create, in absence of ourself,
    "
d we create, in absence of ourself,
    These bett!
  KINE. There some, the Pisius
    Ale untidcest givenn that it swarniar,
    None.

ALMICO.
      Lay, I how discher's, thish dimenor my buss
    And surd anflesh, not lood than her denty,
Th’e wet not delisg toomancy crick a bul
    Theirath affeny wasted the Casbant,
    Lexter couss as dy coblougs the tongcted.

THUST. Yis come sucp hem blessioned raice
the dight.

PARIOP.
What hav
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
----- Generating text after Epoch: 5
----- Generating with seed: "y Council,
    As more at large your Gra"
y Council,
    As more at large your Grame with yellows it,
Nock more by messen conjudge rateer
would my instrome.

ORBRICONT.
What do you not sair, did we ferst dless’d,
    chose makew to boft the man in Senn a diver
Hurtel coits over Boingly Cousied for you.
  SPBETESS. Wha

That’s most certain.

FIRST POMANDO.
Our one them.

DOCTOR.
My, what, yet that, you told, a parah!

            Eneen
  ANGELDOUS OS] Grace of they will scorn; for sill stip; in Vief,
    Where his excupser service, grow on state,
    Tad to liegnisRy and one farber'd to deft.
  SUFFOLK. To in home is
    King,
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
----- Generating text after Epoch: 25
----- Generating with seed: " shore to shore, and left me breath
Noth"
 shore to shore, and left me breath
Noth make my sake, this soad the assoance stain by
From desire to have mistatim! I do how five a gracifish art yet?
    What kissuaus we that I give me tainted.
  MURDEMAR. They inseep to divine?
  PHOTER.
               The Capultudy, MAERON
    Clarence you can from the worfine, that they
ware my further estriction to hear recovied
    or calleth best starding leds. ansterers was o't, and what be gl
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
----- Generating 

<tensorflow.python.keras.callbacks.History at 0x7f5c4a1cb5f8>

### Generate a single prediction at a time

In [0]:
# separating some code from the on_epoch_end function earlier
def predict(prediction_model, length=400):
  """ 
  returns a random prediction from the model
  
  param length: the number of characters to generate
  """

  start_index = random.randint(0, len(text) - maxlen - 1)
  seed = text[start_index: start_index + maxlen]
  generated = seed
  
  print('----- Generating with seed: "' + seed + '"\n')
  
  for i in range(length):
    # encode the seed for the model
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(seed):
      x_pred[0, t, char_int[char]] = 1
    
    # make a prediction
    preds = prediction_model.predict(x_pred, verbose=0)[0]

    # convert back from index to character
    next_index = sample(preds)
    next_char = int_char[next_index]
    
    # shift seed for the next prediction
    seed = seed[1:] + next_char

    # save the generated character to the output
    generated += next_char

  return generated
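The assignment also mentions sizing the output in lines rather than characters. One possible sketch is below. `generate_lines` and `next_char_fn` are hypothetical: `next_char_fn` stands for any callable that takes the current seed and returns the next character (in this notebook it would wrap the encode, `predict` and `sample` steps). A fake generator is used here so the example runs without a trained model.

```python
from itertools import cycle

def generate_lines(next_char_fn, seed, n_lines, max_chars=10_000):
    """Generate until `n_lines` newlines have been produced (or max_chars)."""
    generated = ""
    lines_done = 0
    while lines_done < n_lines and len(generated) < max_chars:
        ch = next_char_fn(seed)
        seed = seed[1:] + ch  # same sliding window as predict()
        generated += ch
        if ch == "\n":
            lines_done += 1
    return generated

# Stand-in for the model: always cycles through "ab\n"
_fake = cycle("ab\n")
demo = generate_lines(lambda seed: next(_fake), seed="xxxx", n_lines=2)
print(repr(demo))
```

The `max_chars` cap is a safety net in case the model rarely emits a newline.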

In [0]:
start_index = random.randint(0, len(text) - maxlen - 1)
      
generated = ''

sentence = text[start_index: start_index + maxlen]
generated += sentence

print('----- Generating with seed: "' + sentence + '"')
sys.stdout.write(generated)

for i in range(400):
  x_pred = np.zeros((1, maxlen, len(chars)))
  for t, char in enumerate(sentence):
    x_pred[0, t, char_int[char]] = 1
      
  preds = model.predict(x_pred, verbose=0)[0]
  next_index = sample(preds)
  next_char = int_char[next_index]
  
  sentence = sentence[1:] + next_char
  
  sys.stdout.write(next_char)
  sys.stdout.flush()
print()

In [65]:
print(predict(model, length=500))

----- Generating with seed: "oud.

DIOMEDES.
Or covetous of praise.

"



oud.

DIOMEDES.
Or covetous of praise.

LEONTES.
Yes, death, late.

CLOWN.
Is this capbray.

GOW, Godsan belieun

Abate stareter’ told of uniters of me that he is
      the world to had this no one. Let station is nothing.

           Enter PAGE. Not forgen your part you gone?
  OLIVER. Look I buke were penarition
    To deny lustiess and will of schoducude
Do us that answer and that must your form wound late,
Make so much hands pains to peacent vown appare,
In shills and to ridest nor thought.
You do come to yet. Making Claudio thy
f


### Save and download model so I can use it again without re-training

https://machinelearningmastery.com/save-load-keras-deep-learning-models/

In [66]:
!pip install h5py



In [0]:
# Save model to a file
model.save("shakespear_character_model.h5")

# Download from colab to my local machine, for later
from google.colab import files
files.download('shakespear_character_model.h5')

### How to load the model

In [0]:
from tensorflow.keras.models import load_model

# load model
loaded_model = load_model('shakespear_character_model.h5')
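The saved weights only make sense together with the exact character-to-index mapping used during training. A hedged sketch of saving that mapping alongside the model, so a later session does not depend on rebuilding it from the text in the same order; `save_vocab`, `load_vocab` and the `chars.json` file name are assumptions for this illustration.

```python
import json
import os
import tempfile

def save_vocab(chars, path):
    # ensure_ascii=False keeps characters like ’ readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chars, f, ensure_ascii=False)

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        chars = json.load(f)
    char_int = {c: i for i, c in enumerate(chars)}
    int_char = {i: c for i, c in enumerate(chars)}
    return chars, char_int, int_char

# Round trip with a toy vocabulary and a temporary file
vocab_path = os.path.join(tempfile.mkdtemp(), "chars.json")
save_vocab(["\n", " ", "a", "b", "’"], vocab_path)
loaded_chars, loaded_char_int, loaded_int_char = load_vocab(vocab_path)
print(loaded_chars)
```

In the notebook this would be called with `chars` right after training, and the JSON file downloaded together with the `.h5` file.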

In [71]:
# demonstrate that the loaded model generates text like the original (sampling is random, so the output differs each run)
print(predict(loaded_model, length=500))

----- Generating with seed: "w,
    It is not Caesar's natural vice t"



w,
    It is not Caesar's natural vice this,
'Tile there in abavited by Englas;—the Iholl scarce he.
    Where's Sir John Adrennise, and too fare!
    Where's here? That our errement the contenc'd.
    Their grace of all the throctar'd of sobers
    I know joy, with your love, depalling; and if thy church pray France,
To his works may back Pelse in to poor, there must be bringned
Will with you chance in all titles in outours this gift
Than as nestress thing) hor Herefies
Convey’s frrest thee let you my soldier.

CINNA.
Have my brother


# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN

## Stretch Goal: tokenize each word and train a model with words instead of characters.

### Remove license at the end

I realized that the text had a license at the end that I hadn't removed before.  I'll remove that now.

In [118]:
text[-21100:]

'\n\n\n\n\n\n\n\n\n\n\n* CONTENT NOTE (added in 2017) *\n\nThis Project Gutenberg eBook was originally marked as having a copyright.\nHowever, Project Gutenberg now believes that the eBook\'s contents does\nnot actually have a copyright.\n\nThis is based on current understanding of copyright law, in which\n"authorship" is required to obtain a copyright.  See the "No Sweat of\nthe Brow Copyright" how-to at www.gutenberg.org for more details on\nthis.\n\nThis eBook was provided to Project Gutenberg by the World Library\nInc., which published a series of CDROM products called "Library of\nthe Future" from approximately 1991-1994.  Copyright registration\nrecords filed with the U.S. Copyright Office at the time record a\ncopyright for "New Matter: compilation, arr., revisions and additions."\n\nWithin the INDIVIDUAL eBooks on the CDROM, this copyright statement\nappears: "Electronically Enhanced Text Copyright 1991 World Library,\nInc."\n\nThere is no indication that the eBooks from the Wo

In [119]:
text = text[:-21100]

# new end of the file
text[-100:]

' their course to Paphos, where their queen\n  Means to immure herself and not be seen.\n\n\n\n\n  FINIS\n\n\n'

In [136]:
# I also should have changed to lower case
text = text.lower()
text[:100]

'the complete works of william shakespeare\n\n\n\nby william shakespeare\n\n\n\n\n      contents\n\n\n\n          '

### Separate text by words

In [137]:
temp = text.replace('\n', ' ').split(' ')
temp[:20]

['the',
 'complete',
 'works',
 'of',
 'william',
 'shakespeare',
 '',
 '',
 '',
 'by',
 'william',
 'shakespeare',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']

In [138]:
temp = list(filter(lambda a: a != '', temp))  # remove empty tokens
temp[:20]

['the',
 'complete',
 'works',
 'of',
 'william',
 'shakespeare',
 'by',
 'william',
 'shakespeare',
 'contents',
 'the',
 'sonnets',
 'all’s',
 'well',
 'that',
 'ends',
 'well',
 'the',
 'tragedy',
 'of']

In [139]:
# break the text up into chunks of words

chunk_length = 20  # words in each chunk
w = 0
words = []

while w < len(temp):
  chunk = []
  for i in range(chunk_length):
    chunk.append(temp[w])
    w += 1
    if w >= len(temp):
      break
  words.append(' '.join(chunk))

print(f"Text separated into {len(words)} chunks of {chunk_length} words each")

Text separated into 47891 chunks of 20 words each
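The same chunking can be written more compactly with list slicing. This is a sketch of an equivalent approach, not the notebook's code; `chunk_words` is a hypothetical helper, shown on a ten-word toy input.

```python
def chunk_words(tokens, chunk_length):
    # Step through the token list chunk_length items at a time;
    # the final chunk may be shorter than the rest
    return [" ".join(tokens[i:i + chunk_length])
            for i in range(0, len(tokens), chunk_length)]

toy_tokens = "to be or not to be that is the question".split()
toy_chunks = chunk_words(toy_tokens, 4)
print(toy_chunks)
```

As with the loop version, the last chunk can come out short and may need to be dropped to keep all chunks the same length.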


In [140]:
words[:5]

['the complete works of william shakespeare by william shakespeare contents the sonnets all’s well that ends well the tragedy of',
 'antony and cleopatra as you like it the comedy of errors the tragedy of coriolanus cymbeline the tragedy of hamlet,',
 'prince of denmark the first part of king henry the fourth the second part of king henry the fourth the',
 'life of king henry the fifth the first part of henry the sixth the second part of king henry the',
 'sixth the third part of king henry the sixth king henry the eighth king john the tragedy of julius caesar']

In [141]:
words[-5:]

['right: 1184 lo in this hollow cradle take thy rest, my throbbing heart shall rock thee day and night: there',
 'shall not be one minute in an hour wherein i will not kiss my sweet love’s flower.” thus weary of',
 'the world, away she hies, 1189 and yokes her silver doves; by whose swift aid their mistress mounted through the',
 'empty skies, in her light chariot quickly is convey’d; 1192 holding their course to paphos, where their queen means to',
 'immure herself and not be seen. finis']

In [0]:
# remove that last chunk to keep all of them the same length
words = words[:-1]

### Using gensim's Dictionary structure to encode those words

In [0]:
from gensim import corpora

# needs an array of tokens, which I had in temp
id2word = corpora.Dictionary([temp])

In [144]:
# number of unique words found
len(id2word)

67394

In [0]:
# Encode each chunk as a list of word ids
def encode_chunk(chunk):
  encoded = []
  for word in chunk.split(' '):
    encoded.append(id2word.token2id[word])
  return encoded

In [151]:
encoded_chunks = []
for chunk in words:
  encoded_chunks.append(encode_chunk(chunk))

encoded_chunks[0]

[58313,
 12814,
 65990,
 40789,
 65248,
 51934,
 9615,
 65248,
 51934,
 13600,
 58313,
 54166,
 3388,
 64514,
 58273,
 20215,
 64514,
 58313,
 60009,
 40789]

In [0]:
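# Create X & y
import numpy as np

# The target for each chunk is the first word of the following chunk,
# so the final chunk (which has no successor) is dropped from the inputs.
# (next_char from the character model does not apply to word chunks.)
inputs = encoded_chunks[:-1]
next_word = [chunk[0] for chunk in encoded_chunks[1:]]

X = np.zeros((len(inputs), chunk_length, len(id2word)), dtype=bool)
y = np.zeros((len(inputs), len(id2word)), dtype=bool)

for i, chunk in enumerate(inputs):
    for t, word_id in enumerate(chunk):
        X[i, t, word_id] = 1

    y[i, next_word[i]] = 1

print(X.shape)
print(y.shape)

# X would be (47889, 20, 67394), roughly 64 GB as one-hot booleans... that's not gonna work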
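The one-hot tensor for the word-level data is far too large to hold in memory. A common way around this is to keep each word as a single integer id and let the model learn a dense representation, for example with a Keras `Embedding` layer and the `sparse_categorical_crossentropy` loss. The sketch below only compares the memory footprints and builds a toy integer-encoded input; it is an illustration under those assumptions, not something the notebook ran.

```python
import numpy as np

n_chunks = 47890        # encoded chunks (47891 minus the short last one)
chunk_length = 20
vocab_size = 67394

# One-hot: one boolean per (chunk, position, vocabulary word)
one_hot_bytes = n_chunks * chunk_length * vocab_size
# Integer ids: one 4-byte integer per (chunk, position)
int_bytes = n_chunks * chunk_length * np.dtype(np.int32).itemsize

print(f"one-hot X: {one_hot_bytes / 1e9:.1f} GB")
print(f"integer X: {int_bytes / 1e6:.1f} MB")

# Integer-encoded inputs: one id per word instead of one vector per word
toy_encoded = [[3, 1, 4], [1, 5, 9]]
X_ids = np.array(toy_encoded, dtype=np.int32)
print(X_ids.shape)
```

With integer ids, the lists in `encoded_chunks` could be stacked into an array directly, with no one-hot step.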

### Create and fit a model

In [0]:
word_model = Sequential()
word_model.add(LSTM(128, input_shape=(chunk_length, len(id2word))))
word_model.add(Dense(len(id2word), activation='softmax'))

word_model.compile(loss='categorical_crossentropy', optimizer='adam')

### Making predictions with this model

### Saving and loading for later