# Text Generation With LSTM Recurrent Neural Networks in Python with Keras
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

# Problem Description: Project Gutenberg

# Develop a Small LSTM Recurrent Neural Network

In this section, you will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section, you will use this model to generate new sequences of characters.

Let’s start by importing the classes and functions you will use to train your model.

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
...
Next, you need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary the network must learn.

...
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
Now that the book is loaded, you must prepare the data for modeling by the neural network. You cannot model the characters directly; instead, you must convert the characters to integers.

You can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

...
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
For example, the list of unique sorted lowercase characters in the book is as follows:

['\n', '\r', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xbb', '\xbf', '\xef']
You can see that there may be some characters that we could remove to further clean up the dataset to reduce the vocabulary, which may improve the modeling process.
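
One possible way to do the cleanup mentioned above is to keep only an allowed set of characters before building the mapping. This is a minimal sketch, not part of the original listing: the `allowed` set and the `clean_text` helper are hypothetical names, and which characters to keep is a judgment call.

```python
import string

# Hypothetical whitelist: lowercase letters, space, newline, and basic punctuation.
allowed = set(string.ascii_lowercase + " \n" + "!\"'(),-.:;?")

def clean_text(text):
    """Lowercase the text and drop every character that is not in `allowed`."""
    return "".join(c for c in text.lower() if c in allowed)

# A made-up sample containing a byte-order mark, a carriage return, and markup characters.
sample = "\ufeffAlice was beginning\r\n_very_ tired*"
cleaned = clean_text(sample)

# Build the character-to-integer mapping the same way as the tutorial code.
chars = sorted(set(cleaned))
char_to_int = {c: i for i, c in enumerate(chars)}
print(repr(cleaned))
print("Vocabulary size:", len(chars))
```

Filtering before the mapping is built keeps the integer codes contiguous, so the rest of the pipeline would not need to change.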

Now that the book has been loaded and the mapping prepared, you can summarize the dataset.

...
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
Running the code to this point produces the following output.

Total Characters:  147674
Total Vocab:  47
You can see the book has just under 150,000 characters and that, once converted to lowercase, it has 47 distinct characters in the vocabulary for the network to learn, considerably more than the 26 letters of the alphabet.

You now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.

In this tutorial, you will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. You could just as easily split the data by sentences, padding the shorter sequences and truncating the longer ones.
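
The sentence-based alternative is not used in this tutorial, but it could look roughly like the sketch below. The `split_pad_truncate` helper, the naive punctuation-based sentence split, and the choice to pad on the right with spaces are all assumptions made for illustration.

```python
import re

def split_pad_truncate(text, max_len, pad_char=" "):
    """Split text into sentences, then force each one to exactly max_len characters."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    fixed = []
    for s in sentences:
        s = s[:max_len]                            # truncate longer sentences
        fixed.append(s.ljust(max_len, pad_char))   # pad shorter ones on the right
    return fixed

sample = "alice was tired. down the rabbit-hole she went! why?"
sequences = split_pad_truncate(sample, max_len=20)
for seq in sequences:
    print(repr(seq))
```

Every sequence ends up the same length, which is what a fixed `[samples, time steps, features]` input shape requires.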

Each training pattern of the network comprises 100 time steps of one character (X) followed by one character output (y). When creating these sequences, you slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters, of course).

For example, if the sequence length is 5 (for simplicity), then the first two training patterns would be as follows:

CHAPT -> E
HAPTE -> R
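
The sliding window can be tried on a short string first. This sketch uses a hypothetical `make_patterns` helper and the made-up input "CHAPTER I"; it reproduces the two patterns shown above and the ones that follow them.

```python
def make_patterns(text, seq_length):
    """Return (input, target) pairs by sliding a window one character at a time."""
    patterns = []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]    # the window of seq_length characters
        seq_out = text[i + seq_length]     # the single character that follows it
        patterns.append((seq_in, seq_out))
    return patterns

pairs = make_patterns("CHAPTER I", seq_length=5)
for seq_in, seq_out in pairs:
    print(seq_in, "->", repr(seq_out))
```

A text of length `n` yields `n - seq_length` patterns, because the first `seq_length` characters have no full window in front of them.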
As you split the book into these sequences, you convert the characters to integers using the lookup table you prepared earlier.

...
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
Running the code to this point shows that splitting the dataset into training data gives just under 150,000 training patterns. This makes sense: excluding the first 100 characters, there is one training pattern to predict each of the remaining characters.

Total Patterns:  147574
Now that you have prepared your training data, you need to transform it to be suitable for use with Keras.

First, you must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next, you need to rescale the integers to the range 0 to 1 to make the patterns easier to learn for the LSTM network, whose gates use the sigmoid activation function by default.

Finally, you need to convert the output patterns (single characters converted to integers) into a one-hot encoding. This is so that you can configure the network to predict the probability of each of the 47 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector with a length of 47, full of zeros, except with a 1 in the column for the letter (integer) that the pattern represents.

For example, when “n” (integer value 31) is one-hot encoded, it looks as follows:

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.]
You can implement these steps as below:

...
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
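
To see what the three steps do to the data, here is a small NumPy-only sketch on made-up values. The `toy_*` names are hypothetical, and `np.eye` stands in for `to_categorical`; the difference is that `to_categorical(dataY)` infers the number of classes from the largest label, whereas the size is given explicitly here.

```python
import numpy as np

# Toy stand-ins for dataX / dataY: 3 patterns, 4 time steps, vocabulary of 5.
toy_X = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
toy_y = [4, 0, 1]
toy_vocab = 5

# Step 1: reshape to [samples, time steps, features].
X_toy = np.reshape(toy_X, (len(toy_X), 4, 1))
# Step 2: scale the integer codes into the range 0 to 1.
X_toy = X_toy / float(toy_vocab)
# Step 3: one-hot encode the targets (one row per target, 1 at the class index).
y_toy = np.eye(toy_vocab)[toy_y]

print(X_toy.shape)
print(y_toy)
```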
You can now define your LSTM model. Here, you define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20% (a rate of 0.2). The output layer is a Dense layer using the softmax activation function to output a probability prediction between 0 and 1 for each of the 47 characters.

The problem is really a single character classification problem with 47 classes and, as such, is defined as optimizing the log loss (cross entropy) using the ADAM optimization algorithm for speed.

...
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
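
For a single one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the true character. The sketch below is a hand calculation in plain Python, not Keras code: the `cross_entropy` helper is hypothetical, and the two probability vectors are made up to show the range of values you might expect when reading the training log.

```python
import math

def cross_entropy(probs, true_index):
    """Log loss for one sample: -log of the probability given to the true class."""
    return -math.log(probs[true_index])

n_classes = 47     # vocabulary size reported earlier in this section
true_index = 31    # the integer code for "n" in the example above

# A model that guesses uniformly over all 47 characters.
uniform = [1.0 / n_classes] * n_classes
baseline = cross_entropy(uniform, true_index)

# A model that puts 99% of its probability on the correct character.
confident = [0.01 / (n_classes - 1)] * n_classes
confident[true_index] = 0.99
confident_loss = cross_entropy(confident, true_index)

print("uniform guess:", round(baseline, 4))
print("confident and correct:", round(confident_loss, 4))
```

The uniform case equals the natural log of 47, roughly 3.85, which gives a reference point: a loss well below that means the network has learned something about which characters tend to follow which.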
There is no test dataset. You are modeling the entire training dataset to learn the probability of each character in a sequence.

You are not interested in the most accurate (classification accuracy) model of the training dataset. This would be a model that predicts each character in the training dataset perfectly. Instead, you are interested in a generalization of the dataset that minimizes the chosen loss function. You are seeking a balance between generalization and overfitting that stops short of memorization.

The network is slow to train (about 300 seconds per epoch on an Nvidia K520 GPU). Because of the slowness and because of the optimization requirements, use model checkpointing to record all the network weights to file each time an improvement in loss is observed at the end of the epoch. You will use the best set of weights (lowest loss) to instantiate your generative model in the next section.

...
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
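
Because the loss is embedded in each checkpoint filename, the best weights file can be picked programmatically rather than by eye. This is a sketch under the assumption that files follow the `weights-improvement-{epoch}-{loss}` pattern defined above; the `best_checkpoint` helper is hypothetical, and the example names are taken from the training log shown later in this notebook.

```python
import re

def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    pattern = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)")
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = pattern.search(name)
        if match is None:
            continue                      # skip files that do not follow the pattern
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

names = [
    "weights-improvement-01-2.9862.hdf5",
    "weights-improvement-02-2.8504.hdf5",
    "weights-improvement-08-2.5220.hdf5",
]
print(best_checkpoint(names))
```

In practice the list could come from something like `glob.glob("weights-improvement-*.hdf5")` in the working directory.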
You can now fit your model to the data. Here, you use a modest number of 20 epochs and a large batch size of 128 patterns.

model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)
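
As a rough sanity check on what this call implies, the arithmetic below works out the number of weight updates and an estimated run time. It only uses figures quoted in this section; the 300 seconds per epoch is the K520 figure from above and will differ on other hardware.

```python
import math

n_patterns = 147574      # reported by the data preparation step above
batch_size = 128
epochs = 20
secs_per_epoch = 300     # rough figure quoted above for an Nvidia K520 GPU

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(n_patterns / batch_size)
total_updates = steps_per_epoch * epochs
est_minutes = epochs * secs_per_epoch / 60

print("batches per epoch:", steps_per_epoch)
print("total weight updates:", total_updates)
print("estimated minutes:", est_minutes)
```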
The full code listing is provided below for completeness.

In [9]:
import time
from datetime import timedelta

start_time = time.time()


# Small LSTM Network to Generate Text for Alice in Wonderland
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)



elapsed_time_secs = time.time() - start_time

msg = "Execution took: %s secs (Wall clock time)" % timedelta(seconds=round(elapsed_time_secs))

print("\n\n")
print(msg)  
print("\n\n")

Total Characters:  144449
Total Vocab:  46
Total Patterns:  144349
Epoch 1/20
Epoch 1: loss improved from inf to 2.98624, saving model to weights-improvement-01-2.9862.hdf5
Epoch 2/20
Epoch 2: loss improved from 2.98624 to 2.85036, saving model to weights-improvement-02-2.8504.hdf5
Epoch 3/20
Epoch 3: loss improved from 2.85036 to 2.79026, saving model to weights-improvement-03-2.7903.hdf5
Epoch 4/20
Epoch 4: loss improved from 2.79026 to 2.72844, saving model to weights-improvement-04-2.7284.hdf5
Epoch 5/20
Epoch 5: loss improved from 2.72844 to 2.68266, saving model to weights-improvement-05-2.6827.hdf5
Epoch 6/20
Epoch 6: loss improved from 2.68266 to 2.62878, saving model to weights-improvement-06-2.6288.hdf5
Epoch 7/20
Epoch 7: loss improved from 2.62878 to 2.57342, saving model to weights-improvement-07-2.5734.hdf5
Epoch 8/20
Epoch 8: loss improved from 2.57342 to 2.52199, saving model to weights-improvement-08-2.5220.hdf5
Epoch 9/20
Epoch 9: loss improved from 2.52199 to 2.47440

# Generating Text with an LSTM Network

In [12]:
import time
from datetime import timedelta

start_time = time.time()




# Load LSTM network and generate text
import sys
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
# load the network weights
filename = "weights-improvement-20-2.0666.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")



elapsed_time_secs = time.time() - start_time

msg = "Execution took: %s secs (Wall clock time)" % timedelta(seconds=round(elapsed_time_secs))

print("\n\n")
print(msg)  
print("\n\n")

Total Characters:  144449
Total Vocab:  46
Total Patterns:  144349
Seed:
" n,' the king said to the jury, and the jury eagerly
wrote down all three dates on their slates, and  "
the sas aoinged at the huuphon and the was so tee whet she was soe that she was soe tiat she was so the wool at the could,

'the had toin he so tel ' she katter weit on, ''i mever tan a taryer ' said the matth rare tery alliilyyyy. 'iu saan the mort of the sooe-'

'i dane tat a gind ' said the match hare.

'it so tee ' she katter weit on, ''  '                                                                                             *  *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *    

# Larger LSTM Recurrent Neural Network

In [1]:
import time
from datetime import timedelta

start_time = time.time()



# Larger LSTM Network to Generate Text for Alice in Wonderland
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}-bigger-model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=50, batch_size=64, callbacks=callbacks_list)



elapsed_time_secs = time.time() - start_time

msg = "Execution took: %s secs (Wall clock time)" % timedelta(seconds=round(elapsed_time_secs))

print("\n\n")
print(msg)  
print("\n\n")

Total Characters:  144449
Total Vocab:  46
Total Patterns:  144349
Epoch 1/50
Epoch 1: loss improved from inf to 2.76175, saving model to weights-improvement-01-2.7617-bigger-model.hdf5
Epoch 2/50
Epoch 2: loss improved from 2.76175 to 2.39716, saving model to weights-improvement-02-2.3972-bigger-model.hdf5
Epoch 3/50
Epoch 3: loss improved from 2.39716 to 2.19928, saving model to weights-improvement-03-2.1993-bigger-model.hdf5
Epoch 4/50
Epoch 4: loss improved from 2.19928 to 2.07692, saving model to weights-improvement-04-2.0769-bigger-model.hdf5
Epoch 5/50
Epoch 5: loss improved from 2.07692 to 1.98022, saving model to weights-improvement-05-1.9802-bigger-model.hdf5
Epoch 6/50
Epoch 6: loss improved from 1.98022 to 1.90455, saving model to weights-improvement-06-1.9046-bigger-model.hdf5
Epoch 7/50
Epoch 7: loss improved from 1.90455 to 1.83504, saving model to weights-improvement-07-1.8350-bigger-model.hdf5
Epoch 8/50
Epoch 8: loss improved from 1.83504 to 1.78321, saving model to w

Epoch 32: loss improved from 1.32524 to 1.31654, saving model to weights-improvement-32-1.3165-bigger-model.hdf5
Epoch 33/50
Epoch 33: loss improved from 1.31654 to 1.30553, saving model to weights-improvement-33-1.3055-bigger-model.hdf5
Epoch 34/50
Epoch 34: loss improved from 1.30553 to 1.29477, saving model to weights-improvement-34-1.2948-bigger-model.hdf5
Epoch 35/50
Epoch 35: loss improved from 1.29477 to 1.28707, saving model to weights-improvement-35-1.2871-bigger-model.hdf5
Epoch 36/50
Epoch 36: loss improved from 1.28707 to 1.27998, saving model to weights-improvement-36-1.2800-bigger-model.hdf5
Epoch 37/50
Epoch 37: loss improved from 1.27998 to 1.26706, saving model to weights-improvement-37-1.2671-bigger-model.hdf5
Epoch 38/50
Epoch 38: loss improved from 1.26706 to 1.26310, saving model to weights-improvement-38-1.2631-bigger-model.hdf5
Epoch 39/50
Epoch 39: loss improved from 1.26310 to 1.25653, saving model to weights-improvement-39-1.2565-bigger-model.hdf5
Epoch 40/50


In [2]:
import time
from datetime import timedelta

start_time = time.time()


# Load Larger LSTM network and generate text
import sys
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
# load the network weights
filename = "weights-improvement-49-1.2041-bigger-model.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	index = np.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	sys.stdout.write(result)
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print("\nDone.")



elapsed_time_secs = time.time() - start_time

msg = "Execution took: %s secs (Wall clock time)" % timedelta(seconds=round(elapsed_time_secs))

print("\n\n")
print(msg)  
print("\n\n")

Total Characters:  144449
Total Vocab:  46
Total Patterns:  144349
Seed:
"  this
moment, i tell you!' but she went on all the same, shedding gallons of
tears, until there was  "
no rseer the words as the cook and looked anxiously another get on their faces, and the three gardeners in the doumouse said to the gatter. 
'i don't know the way out of the baniers!' she said to herself, 'i wonder what they were that the mooent the reason it, you know.'

'i don't know what to det you wouldn't talk,' said the king. 'and the moral of that is--"but i must be gate any meter in the bankee of the bankee farher in the door, and the three gardeners in the dodo solenlly langer in the door, she was not and said to the book and looked anxiously and looked anxiously about a lowse in the door, and was going to got some minutes. 
the dormouse seplied them as the cook as she went on, 'what is the white rereamed about it, and wet i con't know that wou wouldn't talk to be a book and the words again! i'll gave 

# Extension Ideas to Improve the Model