# T81-558: Applications of Deep Neural Networks
**Module 10: Time Series in Keras**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 10 Material

* Part 10.1: Time Series Data Encoding for Deep Learning [[Video]]() [[Notebook]](t81_558_class_10_1_timeseries.ipynb)
* Part 10.2: Programming LSTM with Keras and TensorFlow [[Video]]() [[Notebook]](t81_558_class_10_2_lstm.ipynb)
* **Part 10.3: Text Generation with Keras and TensorFlow** [[Video]]() [[Notebook]](t81_558_class_10_3_text_generation.ipynb)
* Part 10.4: Image Captioning with Keras and TensorFlow [[Video]]() [[Notebook]](t81_558_class_10_4_captioning.ipynb)
* Part 10.5: Temporal CNN in Keras and TensorFlow [[Video]]() [[Notebook]](t81_558_class_10_5_temporal_cnn.ipynb)

# Part 10.3: Text Generation with LSTM

### Additional Information

* [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [Text Generation With LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)
* [How to Develop a Word-Level Neural Language Model and Use it to Generate Text](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/)

### Character-Level Text Generation

In [1]:
import sys
import os
import numpy as np
import pandas as pd
import requests
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

CHAR_SEQ_LEN = 100

In [2]:
r = requests.get("https://data.heatonresearch.com/data/t81-558/text/treasure_island.txt")
raw_text = r.text.lower()

print(raw_text[0:1000])


ï»¿the project gutenberg ebook of treasure island, by robert louis stevenson

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away or
re-use it under the terms of the project gutenberg license included
with this ebook or online at www.gutenberg.net


title: treasure island

author: robert louis stevenson

illustrator: milo winter

release date: january 12, 2009 [ebook #27780]

language: english


*** start of this project gutenberg ebook treasure island ***




produced by juliet sutherland, stephen blundell and the
online distributed proofreading team at http://www.pgdp.net









 the illustrated children's library


         _treasure island_

       robert louis stevenson

          _illustrated by_
            milo winter


           [illustration]


           gramercy books
              new york




 foreword copyright â© 1986 by random house v
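
The stray `ï»¿` at the start of the output and the `â©` near the end look like UTF-8 bytes that were decoded as ISO-8859-1 and then lowercased. The sketch below reproduces that effect on a hypothetical in-memory byte string (no download involved). It also shows that decoding with `utf-8-sig` drops the byte-order mark and keeps `©` intact. The sample text and the variable names are illustrative assumptions.

```python
# Hypothetical response body: UTF-8 bytes that start with a byte-order mark (BOM).
body = "\ufeffThe Project Gutenberg EBook. Copyright \u00a9 1986".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 turns each byte into its own character.
wrong = body.decode("iso-8859-1").lower()

# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
right = body.decode("utf-8-sig").lower()

print(repr(wrong[:14]))
print(repr(right[:11]))
```

Assuming the server really sends UTF-8, setting `r.encoding = "utf-8-sig"` before reading `r.text` should have the same effect on the downloaded book. The character counts and index tables shown later in this notebook would then change, because the mis-decoded characters would disappear from the vocabulary.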


In [3]:
char_array = sorted(set(raw_text))
char2idx = {ch: i for i, ch in enumerate(char_array)}
idx2char = {i: ch for i, ch in enumerate(char_array)}
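
The two dictionaries are inverses of each other: one maps a character to its integer index, the other maps the index back. A minimal round-trip sketch on a short hypothetical string (the names `c2i`, `i2c`, and `text` are illustrative, not from the notebook):

```python
text = "treasure island"  # hypothetical sample text

# Sorted unique characters define a stable index for each character.
chars = sorted(set(text))
c2i = {ch: i for i, ch in enumerate(chars)}
i2c = {i: ch for i, ch in enumerate(chars)}

# Encode to integers, then decode back to the original string.
encoded = [c2i[ch] for ch in text]
decoded = "".join(i2c[i] for i in encoded)

print(encoded)
print(decoded)
```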

In [4]:
'|'.join(char_array)

'\n|\r| |!|"|#|$|%|&|\'|(|)|*|,|-|.|/|0|1|2|3|4|5|6|7|8|9|:|;|?|@|[|]|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|©|°|·|»|¿|â|ï'

In [5]:
print(f"Total Characters: {len(raw_text)}")
print(f"Total Unique Used Characters: {len(char_array)}")

Total Characters: 397419
Total Unique Used Characters: 67


In [6]:
char2idx

{'\n': 0,
 '\r': 1,
 ' ': 2,
 '!': 3,
 '"': 4,
 '#': 5,
 '$': 6,
 '%': 7,
 '&': 8,
 "'": 9,
 '(': 10,
 ')': 11,
 '*': 12,
 ',': 13,
 '-': 14,
 '.': 15,
 '/': 16,
 '0': 17,
 '1': 18,
 '2': 19,
 '3': 20,
 '4': 21,
 '5': 22,
 '6': 23,
 '7': 24,
 '8': 25,
 '9': 26,
 ':': 27,
 ';': 28,
 '?': 29,
 '@': 30,
 '[': 31,
 ']': 32,
 '_': 33,
 'a': 34,
 'b': 35,
 'c': 36,
 'd': 37,
 'e': 38,
 'f': 39,
 'g': 40,
 'h': 41,
 'i': 42,
 'j': 43,
 'k': 44,
 'l': 45,
 'm': 46,
 'n': 47,
 'o': 48,
 'p': 49,
 'q': 50,
 'r': 51,
 's': 52,
 't': 53,
 'u': 54,
 'v': 55,
 'w': 56,
 'x': 57,
 'y': 58,
 'z': 59,
 '©': 60,
 '°': 61,
 '·': 62,
 '»': 63,
 '¿': 64,
 'â': 65,
 'ï': 66}

In [7]:
raw_x = []
raw_y = []

for i in range(0, len(raw_text) - CHAR_SEQ_LEN, 1):
    seq_input = raw_text[i:i + CHAR_SEQ_LEN]
    seq_expected = raw_text[i + CHAR_SEQ_LEN]
    raw_x.append([char2idx[ch] for ch in seq_input])
    raw_y.append(char2idx[seq_expected])

print("Total Patterns: ", len(raw_x))

Total Patterns:  397319
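
The pattern count follows from the sliding window: a text of length `n` with a window of `CHAR_SEQ_LEN` characters yields `n - CHAR_SEQ_LEN` input/target pairs, here 397419 - 100 = 397319. The same loop on a short hypothetical string, with a window of 5 instead of 100:

```python
SEQ_LEN = 5           # illustrative window size; the notebook uses 100
text = "hello world"  # hypothetical sample text

pairs = []
for i in range(len(text) - SEQ_LEN):
    window = text[i:i + SEQ_LEN]   # SEQ_LEN input characters
    target = text[i + SEQ_LEN]     # the character that follows the window
    pairs.append((window, target))

for window, target in pairs[:3]:
    print(repr(window), "->", repr(target))
print("Total patterns:", len(pairs))
```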


In [8]:
x = np.reshape(raw_x, (len(raw_x), CHAR_SEQ_LEN, 1))
x = x / float(len(char_array))
y = pd.get_dummies(raw_y)

In [9]:
y[0:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,65
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
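
In the table above, the column labels jump from 62 to 65. `pd.get_dummies` only creates columns for values that actually occur in `raw_y`, so characters that never appear as a target get no column, and a column's position can differ from the character's index in `char2idx`. One possible way to keep the two aligned is to one-hot encode against a fixed number of classes. The sketch below does this with NumPy on hypothetical targets; `num_classes` and `targets` are illustrative values, not data from the notebook.

```python
import numpy as np

num_classes = 6         # hypothetical vocabulary size
targets = [0, 2, 2, 5]  # hypothetical targets; classes 1, 3 and 4 never occur

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the targets yields one row per training example.
one_hot = np.eye(num_classes, dtype=np.float32)[targets]

print(one_hot.shape)
print(one_hot)
```

With this encoding every class keeps its own column even when it is unused, so the position of the largest network output can be passed to `idx2char` directly. `tensorflow.keras.utils.to_categorical(raw_y, num_classes=len(char_array))` is another option that should give the same layout.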


In [10]:
model = Sequential()
model.add(LSTM(256, input_shape=(x.shape[1], x.shape[2])))
model.add(Dropout(0.15))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [11]:
model_filename = os.path.join('.','dnn','generate_text_char_network.hdf5')

if not os.path.exists(model_filename):
    model.fit(x, y, epochs=25, batch_size=64)
    model.save(model_filename)
else:
    model.load_weights(model_filename)
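
Saving is likely to fail if the `dnn` directory does not exist yet. A minimal sketch of creating the parent directory first, using a temporary directory as a stand-in for the project folder (an assumption made so the example has no side effects):

```python
import os
import tempfile

# Temporary base directory, used here in place of '.' for illustration only.
base_dir = tempfile.mkdtemp()
model_filename = os.path.join(base_dir, "dnn", "generate_text_char_network.hdf5")

# Create the parent directory; exist_ok=True makes the call safe to repeat.
os.makedirs(os.path.dirname(model_filename), exist_ok=True)

print(os.path.isdir(os.path.dirname(model_filename)))
```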

In [12]:
model_filename = os.path.join('.','dnn','generate_text_char_network.hdf5')
model.save(model_filename)

In [13]:
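# Pick a random starting sequence from the training data.
start = np.random.randint(0, len(raw_x))
current = list(raw_x[start])  # copy, so raw_x is not modified
print("Starting point:")
print(''.join([idx2char[i] for i in current]))

print()
print("Generating text (character by character)...")
output = ""

for _ in range(500):
    # Scale the input exactly as the training data was scaled.
    x_pred = np.reshape(current, (1, len(current), 1)) / float(len(char_array))
    prediction = model.predict(x_pred, verbose=0)
    # get_dummies only created columns for characters that occur as targets,
    # so map the winning column back to its character index.
    idx = int(y.columns[np.argmax(prediction)])
    output += idx2char[idx]
    # Slide the window: drop the oldest character, append the prediction.
    current = current[1:] + [idx]

print(output)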

Starting point:
ll you give me your word of honor as a young
gentleman--for a young gentleman you are, although poo

Generating text (character by character)...
 oo the sooc of
the coctor san the sooc of the sooc and she saie oo the sooc of the
cortcrion of the sooc and she sase thre the saie oidet and she saie of
the sooc and she saie oo the saad of the sooc and she sase thre the
dorrent of the sooc and she saie oi the sooc of the sooc and she saie
so the saad on the sooc of the sooc and she tase thre the seie thet
wase to tee the sooc of the sooc and she tase thre the sooc of the
dorrcrt that was she saie oi the sooc and see the saie oirele and


### Word-Level Text Generation

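The word-level example tokenizes the text with spaCy. If loading the English model fails with the error below, the model is not installed:

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Install the model with `python -m spacy download en`, or install the package directly:

```
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz --no-deps
```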

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


In [4]:
import requests 

r = requests.get("https://data.heatonresearch.com/data/t81-558/text/treasure_island.txt")
raw_text = r.text.lower()

print(raw_text[0:1000])


In [27]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
vocab = set()
tokenized_text = []

for token in doc:
    word = ''.join([i if ord(i) < 128 else ' ' for i in token.text])
    word = word.strip()
    if len(word)>1 \
        and not token.is_digit \
        and not token.like_url \
        and not token.like_email:
        vocab.add(word)
        tokenized_text.append(word)
        
print(f"Vocab size: {len(vocab)}")

Vocab size: 6391
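
The cleaning step in the loop above replaces every non-ASCII character with a space, strips the result, and then drops anything of length one or less. A spaCy-free sketch of just that part, applied to a few hypothetical tokens; the `is_digit`, `like_url`, and `like_email` checks are left out because they need a spaCy token, and `clean_token` is an illustrative helper name.

```python
def clean_token(text):
    """Replace non-ASCII characters with spaces, then strip whitespace."""
    ascii_only = "".join(ch if ord(ch) < 128 else " " for ch in text)
    return ascii_only.strip()

# Hypothetical token texts, including mis-decoded and accented characters.
for raw in ["island", "\u00e2\u00a9", "a", "caf\u00e9"]:
    word = clean_token(raw)
    status = "kept" if len(word) > 1 else "dropped"
    print(repr(raw), "->", repr(word), status)
```

One consequence of the `len(word) > 1` test is that single-character words such as "a" and "i" are dropped along with punctuation.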


In [24]:
print(list(vocab)[:20])

['fairly', 'wriggling', 'downloading', 'quoted', 'goa', "n't", 'voice--"that', 'confessions', 'designed', 'glared', 'picked', 'cliffs', 'mist', 'ashore', 'tyrannized', 'smoldering', 'specified', 'open', 'strangest', 'range']


In [25]:
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {i: word for i, word in enumerate(vocab)}
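
Because `vocab` is a set of strings, its iteration order can differ from one Python process to the next, so the indices assigned here are not guaranteed to match a model trained in an earlier session. A sketch of one way to make the mapping reproducible is to sort the vocabulary first. The four-word vocabulary below is hypothetical.

```python
vocab = {"treasure", "island", "pirate", "ship"}  # hypothetical vocabulary

# Sorting fixes the order, so every run assigns the same index to each word.
ordered = sorted(vocab)
word2idx = {word: i for i, word in enumerate(ordered)}
idx2word = {i: word for word, i in word2idx.items()}

print(word2idx)
```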

In [28]:
tokenized_text = [word2idx[word] for word in tokenized_text]

In [29]:
tokenized_text

[5314,
 2773,
 1733,
 3236,
 2311,
 3944,
 5274,
 4224,
 120,
 3891,
 3765,
 3717,
 3236,
 3373,
 1845,
 5314,
 873,
 2311,
 2108,
 5170,
 3874,
 3933,
 5099,
 2684,
 267,
 4108,
 3933,
 6123,
 6210,
 3835,
 5048,
 5902,
 5864,
 2556,
 5864,
 4302,
 2854,
 2383,
 873,
 5864,
 5935,
 5314,
 4421,
 2311,
 5314,
 2773,
 1733,
 1958,
 842,
 267,
 3717,
 3236,
 2854,
 4690,
 3874,
 900,
 3944,
 5274,
 3422,
 120,
 3891,
 3765,
 4614,
 4701,
 1043,
 3114,
 5226,
 5206,
 3236,
 1061,
 3907,
 2672,
 2311,
 3717,
 2773,
 1733,
 3236,
 3944,
 5274,
 668,
 4224,
 3950,
 5352,
 1632,
 2370,
 2684,
 5314,
 4690,
 187,
 4778,
 2930,
 3874,
 5314,
 1159,
 1860,
 2537,
 2410,
 3944,
 5274,
 120,
 3891,
 3765,
 1159,
 4224,
 4701,
 1043,
 1564,
 4001,
 4834,
 738,
 5323,
 5699,
 2242,
 4224,
 6230,
 2656,
 2353,
 222,
 2912,
 5813,
 4224,
 4701,
 1043,
 2242,
 4224,
 5282,
 2541,
 599,
 4045,
 4723,
 1360,
 3717,
 4237,
 1048,
 4224,
 4001,
 4834,
 6301,
 4498,
 2311,
 6230,
 2656,
 2353,
 222,
 4533,


In [None]:
# Build overlapping sequences of token indices: 50 input words plus 1 target word.
length = 50 + 1
sequences = []
for i in range(length, len(tokenized_text)):
    seq = tokenized_text[i - length:i]
    sequences.append(seq)
print(f"Total Sequences: {len(sequences)}")
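
Each sequence holds the input words followed by one target word, so a natural next step is to split it into inputs and a label. A minimal sketch on hypothetical token indices, using 3 + 1 words per sequence instead of 50 + 1:

```python
import numpy as np

LENGTH = 3 + 1                      # 3 input words + 1 target (the notebook uses 50 + 1)
token_ids = [4, 8, 15, 16, 23, 42]  # hypothetical token indices

# Same sliding window as above, applied to integer tokens.
sequences = [token_ids[i - LENGTH:i] for i in range(LENGTH, len(token_ids))]

data = np.array(sequences)
X = data[:, :-1]  # every column except the last: the input words
y = data[:, -1]   # the last column: the word to predict

print(X)
print(y)
```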