In [20]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
import numpy as np

In [7]:
from google.colab import files
uploaded = files.upload()

Saving wonderland.txt to wonderland.txt


We are using Alice's Adventures in Wonderland.

In [9]:
# load the text and convert it to lowercase
filename = 'wonderland.txt'
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

Now that the book is loaded, I must prepare the data for modeling by the neural network. I cannot model the characters directly; instead, I must convert the characters to integers.

I can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

In [None]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
char_to_int
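
The mapping is only useful if it can be undone later, when predictions have to be turned back into text. A minimal sketch of the round trip on a toy string follows; `sample_text` and the `sample_*` names are illustrative and not part of the notebook.

```python
# Toy text standing in for the book; the notebook itself uses raw_text.
sample_text = "alice was beginning"

# Same construction as above: sorted unique characters -> integers.
sample_chars = sorted(set(sample_text))
sample_char_to_int = {c: i for i, c in enumerate(sample_chars)}

# Inverse mapping (hypothetical name) to turn integers back into characters.
sample_int_to_char = {i: c for c, i in sample_char_to_int.items()}

encoded = [sample_char_to_int[c] for c in sample_text]
decoded = "".join(sample_int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

Because the characters are sorted before being enumerated, the mapping is deterministic for a given text.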

Now that the book has been loaded and the mapping prepared, I can summarize the dataset.

In [11]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  163948
Total Vocab:  64


The book has just under 164,000 characters, and when converted to lowercase, there are 64 distinct characters in the vocabulary for the network to learn. That is well over the 26 letters of the alphabet, because the set also includes whitespace, punctuation, and other symbols.

I need to define the training data for the network. There is a lot of flexibility in how we choose to break up the text and expose it to the network during training.

I will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. You could just as easily split the data by sentences, padding the shorter sequences and truncating the longer ones.

Each training pattern of the network comprises 100 time steps of one character (X) followed by one character output (y). When creating these sequences, you slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters, of course).


In [12]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  163848
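
To see what the sliding window produces, here is the same loop on a toy string with a window of 4 instead of 100. The names `toy_text`, `toy_seq_length`, and `pairs` are made up for this illustration.

```python
toy_text = "hello world"
toy_seq_length = 4

# Each pair is (input window, next character), exactly as in the loop above
# but kept as characters so the result is readable.
pairs = []
for i in range(len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]
    seq_out = toy_text[i + toy_seq_length]
    pairs.append((seq_in, seq_out))

for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))

# The number of patterns is always len(text) - seq_length.
print(len(pairs))
```

The same arithmetic explains the count above: 163948 characters minus a window of 100 gives 163848 patterns.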


Running the code to this point shows that splitting the dataset this way yields 163,848 training patterns. This makes sense: excluding the first 100 characters, there is one training pattern to predict each of the remaining characters.

In [16]:
dataX[2]

[34,
 1,
 45,
 47,
 44,
 39,
 34,
 32,
 49,
 1,
 36,
 50,
 49,
 34,
 43,
 31,
 34,
 47,
 36,
 1,
 34,
 31,
 44,
 44,
 40,
 1,
 44,
 35,
 1,
 30,
 41,
 38,
 32,
 34,
 6,
 48,
 1,
 30,
 33,
 51,
 34,
 43,
 49,
 50,
 47,
 34,
 48,
 1,
 38,
 43,
 1,
 52,
 44,
 43,
 33,
 34,
 47,
 41,
 30,
 43,
 33,
 0,
 1,
 1,
 1,
 1,
 0,
 49,
 37,
 38,
 48,
 1,
 34,
 31,
 44,
 44,
 40,
 1,
 38,
 48,
 1,
 35,
 44,
 47,
 1,
 49,
 37,
 34,
 1,
 50,
 48,
 34,
 1,
 44,
 35,
 1,
 30,
 43,
 54,
 44]

Now that we have prepared our training data, we need to transform it to be suitable for use with Keras.

First, we must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next, we need to rescale the integers to the range 0 to 1, which makes the patterns easier to learn for the LSTM network, whose gates use the sigmoid activation function by default.

Finally, we need to convert the output patterns (single characters converted to integers) into a one-hot encoding. This lets us configure the network to predict the probability of each of the 64 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector with a length of 64, full of zeros, except with a 1 in the column for the character (integer) that the pattern represents.

---
For example, when “b” (integer value 31 in this notebook's mapping) is one-hot encoded, it looks as follows:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
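
A one-hot vector like this can be built by hand in a few lines of NumPy. The sketch below is only an illustration of the encoding, not how `to_categorical` is implemented; the helper `one_hot` is a hypothetical name, and the length of 64 assumes the vocabulary size reported earlier.

```python
import numpy as np

def one_hot(index, num_classes):
    """Return a float vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[index] = 1.0
    return vec

# Assumed values: class 31 out of a 64-character vocabulary.
vec = one_hot(31, 64)
print(vec.shape)
print(int(vec.argmax()))
```

Taking `argmax` of the vector recovers the original integer, which is how a predicted probability vector is later turned back into a character index.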


In [21]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
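
The effect of the reshape and the normalization is easier to check on a small array. This is a sketch with made-up data: `toy_dataX` holds 4 patterns of 3 time steps each, and `toy_n_vocab` is an assumed vocabulary size of 5.

```python
import numpy as np

# 4 patterns, each 3 time steps long, with integer codes in 0..4.
toy_dataX = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
toy_n_vocab = 5

# [samples, time steps, features]: one feature (the character code) per step.
toy_X = np.reshape(toy_dataX, (len(toy_dataX), 3, 1))

# Dividing by the vocabulary size maps the codes into the range [0, 1).
toy_X = toy_X / float(toy_n_vocab)

print(toy_X.shape)
print(toy_X.min(), toy_X.max())
```

Since the largest integer code is always one less than the vocabulary size, the scaled values never reach 1.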

In [22]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)

We can now define our LSTM model. Here, we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 0.2 (20%). The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the 64 characters.

The problem is really a single-character classification problem with 64 classes and, as such, is defined as optimizing the log loss (cross entropy) using the Adam optimization algorithm for speed.
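
To make the softmax output and the log loss concrete, here is a small NumPy sketch on a made-up 4-class example. The functions `softmax` and `categorical_crossentropy` are written from the textbook definitions for illustration; they are not the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Log loss for a one-hot target; eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))

logits = np.array([2.0, 1.0, 0.1, -1.0])   # made-up scores for 4 classes
target = np.array([1.0, 0.0, 0.0, 0.0])    # the true class is class 0

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)

# Baseline: a model that guesses uniformly has loss ln(num_classes).
uniform_loss = categorical_crossentropy(target, np.full(4, 0.25))

print(probs)
print(loss, uniform_loss)
```

The uniform baseline is a useful yardstick when reading training logs: with 64 classes, a model that has learned nothing would sit at about ln(64) ≈ 4.16.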

In [None]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
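
As a sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below uses the standard parameter formulas for an LSTM layer and a Dense layer and assumes 1 input feature and 64 output classes (the value `y.shape[1]` is expected to take here); the helper names are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # Four weight sets (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=256, input_dim=1)
dense_params = dense_param_count(inputs=256, outputs=64)  # assumes 64 classes
total_params = lstm_params + dense_params  # Dropout adds no parameters

print(lstm_params, dense_params, total_params)
```

These figures can be compared against what `model.summary()` reports for the model defined above.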

Note: There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate (classification accuracy) model of the training dataset. That would be a model that predicts each character in the training dataset perfectly. Instead, we are interested in a generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting that stops short of memorization.

---



In [None]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
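
The `filepath` string is a Python format template that the callback fills in with the epoch number and the monitored loss each time it saves. A minimal sketch of that substitution follows; the epoch and loss values are made up for illustration.

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Hypothetical values: epoch 3 with a training loss of 2.91372.
example_name = filepath.format(epoch=3, loss=2.91372)
print(example_name)
```

With `save_best_only=True` and `monitor='loss'`, a new file is written only when the training loss improves on the best value seen so far, so the file with the lowest loss in its name holds the best weights. The list is then typically passed to training via the `callbacks` argument of `model.fit`.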