<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use the trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note: Shakespeare wrote an awful lot. It's OK, especially initially, to use a smaller sample of the data and smaller parameters so you have a tighter feedback loop while you're getting things running. Once you've got a proof of concept, start pushing it further!

In [1]:
# TODO - Words, words, mere words, no matter from the heart.

import tensorflow as tf
import numpy as np

from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras import Sequential
from tensorflow.keras.utils import to_categorical, get_file

In [2]:
url = 'https://www.gutenberg.org/files/100/100-0.txt'

doc = get_file('shakespeare.txt', url)
text = open(doc, 'rb').read().decode(encoding='utf-8-sig')

Downloading data from https://www.gutenberg.org/files/100/100-0.txt


In [3]:
len(text)

5740053

In [4]:
text[-25000]

'h'

In [5]:
# --- Text preprocessing ---

# Remove carriage returns
text = text.replace('\r', '')
# Drop the first 900 and last 25,000 characters (roughly the Project
# Gutenberg header and license text), then lower-case what remains
text = text[900:-25000].lower()
# Collapse every run of whitespace (including newlines) into a single space
text = ' '.join(text.split())

# All distinct characters used in the text
vocab = sorted(set(text))

# Map each character to an integer id, and back again
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for i, c in enumerate(vocab)}

text_integers = np.array([char_to_int[c] for c in text])

# Build (input window, next character) training pairs from the first
# 100,000 characters only, to keep the feedback loop short
seq_length = 100
n_chars_used = 100000

X_text = []
y_text = []

for i in range(0, n_chars_used - seq_length, 1):
    in_seq = text[i:i + seq_length]
    out_char = text[i + seq_length]
    X_text.append([char_to_int[char] for char in in_seq])
    y_text.append(char_to_int[out_char])

samples = len(X_text)

In [6]:
len(X_text)

99900
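The windowing loop above is easier to check on a tiny string. The following is a minimal sketch, not the notebook's own code: `make_windows` and `toy_text` are hypothetical names introduced only for illustration. It shows why a text of `N` characters with window length `seq_length` and a step of 1 yields `N - seq_length` training pairs (here, 100000 - 100 = 99900).

```python
def make_windows(text, seq_length, step=1):
    """Return (inputs, targets): each input is seq_length characters,
    and its target is the single character that follows it."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_length, step):
        inputs.append(text[i:i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets


toy_text = "to be or not to be"
inputs, targets = make_windows(toy_text, seq_length=5)

# Show the first few (window, next character) pairs
for window, nxt in list(zip(inputs, targets))[:3]:
    print(repr(window), '->', repr(nxt))
print('pairs:', len(inputs))
```

A larger `step` trades training pairs for speed; the second model later in this notebook uses a step of 3.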

In [7]:
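# Reshape to (samples, timesteps, features) and scale the ids into [0, 1)
X = np.reshape(X_text, (samples, seq_length, 1))
X = X / len(vocab)
print(X.shape)
# One-hot encode the targets; num_classes pins the width to the vocabulary size
y = to_categorical(y_text, num_classes=len(vocab))
print(y.shape)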

(99900, 100, 1)
(99900, 71)
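To make the two shapes above concrete, here is a small NumPy-only sketch of the same preparation on toy data. The names `vocab_size`, `X_ids`, `y_ids`, `X_toy`, and `y_toy` are assumptions made up for this example, and `np.eye(...)[ids]` stands in for `to_categorical` (it produces the same one-hot values, though as `float64` rather than `float32`).

```python
import numpy as np

vocab_size = 5                               # hypothetical toy vocabulary size
X_ids = np.array([[0, 3, 4], [1, 2, 0]])     # two windows of three character ids
y_ids = np.array([2, 4])                     # id of the character after each window

# Inputs: (samples, timesteps, 1), with ids scaled into [0, 1)
X_toy = X_ids.reshape(X_ids.shape[0], X_ids.shape[1], 1) / vocab_size

# Targets: one row per sample with a single 1 in the column of the target id
y_toy = np.eye(vocab_size)[y_ids]

print(X_toy.shape, y_toy.shape)
print(y_toy)
```

Note that scaling ids to a single float per timestep imposes an artificial ordering on characters; the second model in this notebook one-hot encodes the inputs instead.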


In [8]:
y.shape[1]

71

In [11]:
# Imports used by the cells below (Sequential, Dense, LSTM, and Dropout were imported in cell 1)
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.optimizers import RMSprop

# Build the model: two stacked LSTM layers with dropout, then a softmax over the vocabulary

model = Sequential()
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(.2))
model.add(LSTM(256))
model.add(Dropout(.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 71)                18247     
Total params: 807,751
Trainable params: 807,751
Non-trainable params: 0
_________________________________________________________________
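The parameter counts in the summary can be reproduced by hand. A standard Keras LSTM layer has four gates, each with input weights, recurrent weights, and a bias, so it holds `4 * (input_dim + units + 1) * units` parameters; a Dense layer holds `input_dim * units + units`. The helper names below are hypothetical and exist only for this check.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and one bias per unit
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    # One weight per input/unit pair, plus one bias per unit
    return input_dim * units + units


counts = [
    lstm_params(1, 256),     # first LSTM: 1 feature per timestep
    lstm_params(256, 256),   # second LSTM: fed by the first layer's 256 units
    dense_params(256, 71),   # softmax over the 71-character vocabulary
]
total = sum(counts)
print(counts, total)
```

Dropout layers add no parameters, which is why they show 0 in the summary.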


In [18]:

# The keyword is `metrics` (plural)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



history = model.fit(X, y, batch_size=1000, epochs = 50)

Train on 99900 samples
Epoch 1/50
...
Epoch 50/50


In [19]:
''' Generate Text '''
import textwrap

# randint's upper bound is exclusive, so this can pick any window
start = np.random.randint(0, len(X_text))
vocab_len = len(vocab)
pattern = list(X_text[start])  # copy so the training window is not mutated

print(f"Seed: \n {''.join([int_to_char[value] for value in pattern])}")
out = [int_to_char[value] for value in pattern]

# Generate 500 characters greedily: always take the most probable next character
for i in range(500):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))
    result = int_to_char[index]
    out.append(result)
    # Slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:]

print('\n')
print("LSTM Generated Text OH MY:\n")
print(textwrap.fill(''.join(out), 80))

Seed: 
 my joy behind. 51 thus can my love excuse the slow offence, of my dull bearer, when from thee i spee


LSTM Generated Text OH MY:

my joy behind. 51 thus can my love excuse the slow offence, of my dull bearer,
when from thee i speer becined, and thet the thme doth lake me song the stage
and shen the strengnt of the sime of the were oor of thee, and thou art all the
wirl of thee ae oot, oor that i do doth pene. that thou thall beauty should mote
beligte. then thet be beauty of the seaond stars of the wirl of thee, and the
dear feart’s palace. scene ii. tossillon. a room in the countess’s palace. scene
ii. tossillon. a room in the countess’s palace. scene ii. tossillon. a room in
the countess’s palace. scene ii. tossillon.
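The assignment asks for a function that takes the amount of text to generate and returns it. One way to get there is to wrap the greedy loop above in a function. This is only a sketch under stated assumptions: `generate_text` and its `predict_fn` parameter are hypothetical names, and `predict_fn` is assumed to be any callable that maps an array of shape `(1, window, 1)` to probabilities of shape `(1, vocab_len)`, for example `lambda x: model.predict(x, verbose=0)`. A toy predictor stands in for the trained model so the block runs on its own.

```python
import numpy as np


def generate_text(predict_fn, seed_ids, n_chars, int_to_char, vocab_len):
    """Greedily generate n_chars characters that follow the seed ids."""
    pattern = list(seed_ids)              # copy so the caller's seed is not mutated
    out = []
    for _ in range(n_chars):
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(vocab_len)
        index = int(np.argmax(predict_fn(x)))
        out.append(int_to_char[index])
        pattern = pattern[1:] + [index]   # slide the window forward by one
    return ''.join(out)


# Toy stand-in for a trained model: always predicts the id after the last one.
toy_vocab = ['a', 'b', 'c']
toy_int_to_char = dict(enumerate(toy_vocab))


def toy_predict(x):
    last_id = int(round(x[0, -1, 0] * len(toy_vocab)))
    probs = np.zeros((1, len(toy_vocab)))
    probs[0, (last_id + 1) % len(toy_vocab)] = 1.0
    return probs


sample_text = generate_text(toy_predict, [0, 1, 2], 6, toy_int_to_char, len(toy_vocab))
print(sample_text)
```

Because `argmax` always picks the single most likely character, greedy generation can fall into a loop, as the repeated "scene ii. tossillon." in the output above shows. Sampling from the predicted distribution, as the second model in this notebook does, is one way to avoid that.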


In [20]:
# Encode Data as Chars
chars = sorted(list(set(text)))
char_indices = dict((c,i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [21]:
text[:50]

'contents the sonnets all’s well that ends well the'

In [22]:
maxlen = 40
step = 3

sentences = []
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('sequences:', len(sentences))

sequences: 1751469


In [23]:
sentences[0]

'contents the sonnets all’s well that end'

In [24]:
next_chars[1]

'e'

In [25]:
# Specify x & y

# np.bool was removed in NumPy 1.24; the builtin bool is the equivalent dtype
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [26]:
print(x.shape)
print(y.shape)

(1751469, 40, 71)
(1751469, 71)
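These shapes are large enough that memory is worth a quick estimate before training. The arithmetic below is a rough sketch that uses only the shapes printed above and the fact that a NumPy boolean array stores one byte per element; the variable names are made up for this example.

```python
n_sequences, maxlen, n_chars = 1751469, 40, 71   # shapes printed above

x_bytes = n_sequences * maxlen * n_chars         # one byte per boolean element
y_bytes = n_sequences * n_chars
x_float32_bytes = 4 * x_bytes                    # the same array stored as float32

print(f"x as bool:    {x_bytes / 2**30:.2f} GiB")
print(f"y as bool:    {y_bytes / 2**30:.2f} GiB")
print(f"x as float32: {x_float32_bytes / 2**30:.2f} GiB")
```

If that does not fit in memory, a larger `step` or a slice of `text` shrinks the arrays proportionally.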


In [27]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['acc'])

In [28]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
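To see what the `temperature` argument does, it helps to look at the rescaled distribution itself rather than a single sampled index. The sketch below repeats the rescaling steps from `sample` without the random draw; `apply_temperature` and `toy_preds` are hypothetical names used only here.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector: low temperature sharpens it, high flattens it."""
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / np.sum(exp_preds)


toy_preds = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 1.2):
    print(t, np.round(apply_temperature(toy_preds, t), 3))
```

A temperature below 1 concentrates probability on the most likely character, giving safer but more repetitive text. A temperature above 1 spreads it out, giving more varied text with more misspellings, which matches the "diversity" outputs further down.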

In [29]:
import random
import sys

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [None]:
import random
import sys

model.fit(x, y,
          batch_size=128,
          epochs=5,
          callbacks=[print_callback])

Train on 1751469 samples
Epoch 1/5
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "luellen. your grace does me as great hon"
luellen. your grace does me as great honour of the soul to the thing in the more the shall the sons that the bears the soul the sons of the shall the courtest bear the more the sons. come, he hath the stand the sun the death of the love to the the thing of the duke of the as the strange and bear the hour in the servant of the heaven of the sing of the hearts of the death the world to the servant of the stand that shall say the servant t
----- diversity: 0.5
----- Generating with seed: "luellen. your grace does me as great hon"
luellen. your grace does me as great honourable bear the beard the meaning to the most a morning the hearts to make the soul more that it was thine end the sard that have serve the fair of the bear ere shall be hang the proud to the return in the sorry and lord, and done, and here, thou spite, and the 

 the pleasureice of my poor heart he for him by mine or too respect for your own
----- diversity: 1.0
----- Generating with seed: "akfast in the cheapest country under the"
akfast in the cheapest country under their are, [ations enough. offence messenger. this some whitth for a tonerisor. shall i have too for thee hermio, thy nature so, the bourd thus pericles. king. we am the time to call, my charm'd to bringing helice. made love. that that it shall be nor onet that god winkes both doth both dismeers to know good of wars? a queen, and for me, and lords. thou hadst be i wish abuse not for besisten as, rowe
----- diversity: 1.2
----- Generating with seed: "akfast in the cheapest country under the"
akfast in the cheapest country under the returna name dares swood yourself. lackial. imporrow prison, set ups’d murder wherson. barnle eyes, goth fled him without, heg, and love: from me my kissing holoth uslion dig thabper towcr, like inspiros. good lord! what may honour? my own! king now bou

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNNs to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN