<a href="https://colab.research.google.com/github/dibya-pati/seqPrediction/blob/master/characterPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.

Let’s start off by importing the classes and functions we intend to use to train our model.

In [3]:
import numpy
import sys
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
import tensorflow as tf

Next, we need to load the text of the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.

In [4]:
# load the text and convert to lowercase
# the file contains typographic quotes, so read it explicitly as UTF-8
filename = "wonderland.txt"
with open(filename, encoding="utf-8") as f:
	raw_text = f.read()
raw_text = raw_text.lower()

Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead, we must convert the characters to integers.

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

In [0]:
# create mapping of unique chars to integers
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

In [5]:
char_to_int

{'\n': 0,
 ' ': 1,
 '!': 2,
 '(': 3,
 ')': 4,
 '*': 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '0': 9,
 '3': 10,
 ':': 11,
 ';': 12,
 '?': 13,
 '[': 14,
 ']': 15,
 '_': 16,
 'a': 17,
 'b': 18,
 'c': 19,
 'd': 20,
 'e': 21,
 'f': 22,
 'g': 23,
 'h': 24,
 'i': 25,
 'j': 26,
 'k': 27,
 'l': 28,
 'm': 29,
 'n': 30,
 'o': 31,
 'p': 32,
 'q': 33,
 'r': 34,
 's': 35,
 't': 36,
 'u': 37,
 'v': 38,
 'w': 39,
 'x': 40,
 'y': 41,
 'z': 42,
 '‘': 43,
 '’': 44,
 '“': 45,
 '”': 46}

You can see that there are some characters we could remove to further clean up the dataset. Doing so would reduce the vocabulary and may improve the modeling process.
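
As a rough illustration of that cleanup, the sketch below maps the typographic quotes to ASCII quotes and drops anything outside a small whitelist. The names `ALLOWED`, `QUOTE_MAP`, and `clean_text` are hypothetical, and the choice of which characters to keep is an assumption, not something the notebook prescribes.

```python
import string

# Hypothetical whitelist: lowercase letters, space, newline, basic punctuation
ALLOWED = set(string.ascii_lowercase + " \n.,;:!?'\"-")

# Map typographic quotes to their ASCII equivalents before filtering
QUOTE_MAP = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})

def clean_text(text):
    """Lowercase, normalize curly quotes, and drop characters outside ALLOWED."""
    text = text.lower().translate(QUOTE_MAP)
    return "".join(ch for ch in text if ch in ALLOWED)

sample = "‘Oh dear!’ said Alice [*] _very_ quietly."
cleaned = clean_text(sample)
print(cleaned)
print(len(set(sample.lower())), "->", len(set(cleaned)), "distinct characters")
```

If something like this were applied to `raw_text` before building `char_to_int`, the vocabulary size reported later would shrink accordingly.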

Now that the book has been loaded and the mapping prepared, we can summarize the dataset.

In [6]:
n_chars = len(raw_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Total Vocab: ", n_vocab)

Total Characters:  144422
Total Vocab:  47


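Next we define the training data. We split the book text into subsequences with a fixed length of 100 characters, sliding the window along the book one character at a time. Each training pattern consists of 100 input characters followed by the single character that comes next, which is the output the network must predict.

As we split up the book into these sequences, we convert the characters to integers using the lookup table we prepared earlier.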

In [7]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)

Total Patterns:  144322
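
To make the windowing concrete, here is a minimal sketch of the same loop on a seven-character string with a window of five. The helper name `make_windows` is hypothetical and works on raw characters rather than integers.

```python
def make_windows(text, seq_length):
    """Return (input, next_char) pairs from a window that advances one character at a time."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

pairs = make_windows("chapter", 5)
for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))
```

The number of pairs is always the text length minus the window length, which is why 144422 characters and a window of 100 give 144322 patterns.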


First we must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next we need to rescale the integers to the range 0-to-1 to make the patterns easier to learn by the LSTM network, whose gates use the sigmoid activation function by default.

Finally, we need to convert the output patterns (single characters converted to integers) into a one hot encoding. This is so that we can configure the network to predict the probability of each of the 47 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector with a length of 47, full of zeros except with a 1 in the column for the letter (integer) that the pattern represents.

In [0]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
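
The three steps can be checked on toy data. This is a sketch under stated assumptions: the `toy_` names are hypothetical, and `numpy.eye` is used as a stand-in for `to_categorical` so that the example needs only NumPy.

```python
import numpy

toy_vocab = 5                        # toy vocabulary size (the notebook's is 47)
toy_dataX = [[0, 1, 2], [1, 2, 3]]   # two integer-encoded patterns of length 3
toy_dataY = [3, 4]                   # integer-encoded next characters

# reshape to [samples, time steps, features], then scale into the range 0-to-1
toy_X = numpy.reshape(toy_dataX, (len(toy_dataX), 3, 1)) / float(toy_vocab)

# one hot encode: row i has a single 1 in column toy_dataY[i]
toy_y = numpy.eye(toy_vocab)[toy_dataY]

print(toy_X.shape, toy_y.shape)
print(toy_y)
```

Each row of `toy_y` sums to 1, which is what lets the softmax output layer be read as a probability distribution over characters.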

We can now define our LSTM model. Here we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20% (a rate of 0.2). The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the 47 characters.

The problem is really a single-character classification problem with 47 classes, and as such is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.

In [0]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
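
For a sense of scale, the log loss can be computed by hand. The sketch below is a plain NumPy version of categorical cross entropy, not the Keras implementation, and the function name and the class index 17 are illustrative assumptions. It evaluates the loss of a model that assigns equal probability to all 47 characters.

```python
import numpy

def categorical_crossentropy(y_true, y_pred):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    return float(-numpy.mean(numpy.sum(y_true * numpy.log(y_pred + eps), axis=1)))

n_classes = 47
y_true = numpy.eye(n_classes)[[17]]                    # one sample, true class at index 17
uniform = numpy.full((1, n_classes), 1.0 / n_classes)  # equal probability for every class

uniform_loss = categorical_crossentropy(y_true, uniform)
print(round(uniform_loss, 2))
```

A uniform guess scores ln(47), roughly 3.85, so any logged training loss below that value means the network has learned something about the character distribution.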

There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate model of the training dataset in terms of classification accuracy. That would be a model that predicts each character in the training dataset perfectly. Instead we are interested in a generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting, stopping short of memorization.

The network is slow to train (about 300 seconds per epoch on an Nvidia K520 GPU). Because of the slowness and because of our optimization requirements, we will use model checkpointing to record all of the network weights to file each time an improvement in loss is observed at the end of the epoch. We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.

In [0]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

We can now fit our model to the data. Here we use a modest number of 20 epochs and a large batch size of 128 patterns.

In [11]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20
Epoch 00001: loss improved from inf to 3.00208, saving model to weights-improvement-01-3.0021.hdf5
Epoch 2/20
Epoch 00002: loss improved from 3.00208 to 2.83505, saving model to weights-improvement-02-2.8350.hdf5
Epoch 3/20
Epoch 00003: loss improved from 2.83505 to 2.75083, saving model to weights-improvement-03-2.7508.hdf5
Epoch 4/20
Epoch 00004: loss improved from 2.75083 to 2.67527, saving model to weights-improvement-04-2.6753.hdf5
Epoch 5/20
Epoch 00005: loss improved from 2.67527 to 2.61167, saving model to weights-improvement-05-2.6117.hdf5
Epoch 6/20
Epoch 00006: loss improved from 2.61167 to 2.55384, saving model to weights-improvement-06-2.5538.hdf5
Epoch 7/20
Epoch 00007: loss improved from 2.55384 to 2.49646, saving model to weights-improvement-07-2.4965.hdf5
Epoch 8/20
Epoch 00008: loss improved from 2.49646 to 2.44273, saving model to weights-improvement-08-2.4427.hdf5
Epoch 9/20
Epoch 00009: loss improved from 2.44273 to 2.39253, saving model to weights-impro

<tensorflow.python.keras.callbacks.History at 0x7fa24c770860>

Generating text using the trained LSTM network is relatively straightforward.

Firstly, we load the data and define the network in exactly the same way, except the network weights are loaded from a checkpoint file and the network does not need to be trained.

In [0]:
# load the network weights
filename = "weights-improvement-20-2.0293.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
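
Because the loss is embedded in each checkpoint filename, the best file can also be picked programmatically rather than typed in by hand. This is a minimal sketch: `CHECKPOINT_RE` and `best_checkpoint` are hypothetical names, and the example filenames are taken from the training log above.

```python
import re

# Matches names written with "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None

names = [
    "weights-improvement-01-3.0021.hdf5",
    "weights-improvement-02-2.8350.hdf5",
    "weights-improvement-03-2.7508.hdf5",
]
print(best_checkpoint(names))
```

In practice the list of names could come from something like `glob.glob("weights-improvement-*.hdf5")` in the working directory.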

When preparing the mapping of unique characters to integers, we must also create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(chars))
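
A quick round trip on a toy string shows how the two mappings undo each other. The `toy_` names are hypothetical and kept separate from the notebook's own variables.

```python
toy_chars = sorted(set("hello world"))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# encode a string to integers, then decode the integers back to text
encoded = [toy_char_to_int[c] for c in "hello"]
decoded = "".join(toy_int_to_char[i] for i in encoded)
print(encoded, decoded)
```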

Finally, we need to actually make predictions.

The simplest way to use the Keras LSTM model to make predictions is to first start off with a seed sequence as input, generate the next character then update the seed sequence to add the generated character on the end and trim off the first character. This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000 characters in length).

We can pick a random input pattern as our seed sequence, then print generated characters as we generate them.
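
The window update is easier to follow with the model taken out of the picture. In this sketch, `generate` and `toy_predict` are hypothetical: `toy_predict` is a stand-in for the network that simply returns the last value plus one, modulo 5.

```python
def generate(seed, predict_next, n_steps):
    """Greedy generation: predict, emit, append to the window, drop the oldest item."""
    pattern = list(seed)              # copy so the caller's seed is not modified
    generated = []
    for _ in range(n_steps):
        nxt = predict_next(pattern)
        generated.append(nxt)
        pattern = pattern[1:] + [nxt]  # the window keeps its original length
    return generated, pattern

def toy_predict(window):
    """Stand-in for the trained model (an assumption for illustration only)."""
    return (window[-1] + 1) % 5

out, final_window = generate([0, 1, 2], toy_predict, 4)
print(out, final_window)
```

The window always holds the most recent values, so after enough steps it consists entirely of generated output rather than the original seed.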

In [19]:
# pick a random seed pattern (the upper bound of randint is exclusive)
start = numpy.random.randint(0, len(dataX))
# copy the pattern so that appending to it does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
	# shape the window as [samples, time steps, features] and normalize as in training
	x = numpy.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	# pick the most probable next character
	index = int(numpy.argmax(prediction))
	result = int_to_char[index]
	sys.stdout.write(result)
	# append the prediction and drop the oldest character
	pattern.append(index)
	pattern = pattern[1:]
print("\nDone.")

Seed:
"  question is, what did the
archbishop find?’

the mouse did not notice this question, but hurriedly  "
dale an inr an on an anl aerlr the thet sas oo the tooee. and whet the was toenk to be a latge hare the was to the thet sase 
‘het it was alite the dorld she was oo the that sas an the was to all whrh the was oo the thate  the was soinking to the thet  and saed 
‘oth the soeee of the was oo the thieg taated the whrh the was to the thite rabd to the toiee 
‘he poue doon the mors of the saae the waaten the woide ’f
thenk to tee the was of the taated thee, 
‘he toee lo the taad toen i shan?’ she match hare waid to the grrp,on an onself an anl aalind the was oo the saade. 
‘he toee lo the taid to toe kaseer ’ said the morke 
‘hr wou dnn toe was toene ano the waate  the murse oo the woide ’ou dane that i seould berner what i whon hore to the whit sase 
the was aolng the rabe to the whrt hn an ooce  and tasd to the cor of the courd, 
‘he iou dal toe mo the toin,’ said the maccit. ‘h