## In this tutorial we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland and then use the model to generate new sequences of characters.

Reference : https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

Import the necessary libraries

In [None]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
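from keras.utils import np_utils  # may be unavailable in newer Keras releases, where keras.utils.to_categorical can be imported directly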

Load the ASCII text and convert it to lowercase

In [2]:
filename = "data.txt"
# read the whole file and close it afterwards
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

We must prepare the data for modeling by the neural network. We cannot model the characters directly; instead, we must convert them to numerical values (integers).

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

In [3]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

We can see the list of unique sorted lowercase characters in the book.

Some of these characters could be removed to clean up the dataset further. Doing so would reduce the vocabulary and may improve the modeling process.

In [4]:
print(char_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, '$': 5, '%': 6, "'": 7, '(': 8, ')': 9, '*': 10, ',': 11, '-': 12, '.': 13, '/': 14, '0': 15, '1': 16, '2': 17, '3': 18, '4': 19, '5': 20, '6': 21, '7': 22, '8': 23, '9': 24, ':': 25, ';': 26, '?': 27, '@': 28, '[': 29, ']': 30, '_': 31, 'a': 32, 'b': 33, 'c': 34, 'd': 35, 'e': 36, 'f': 37, 'g': 38, 'h': 39, 'i': 40, 'j': 41, 'k': 42, 'l': 43, 'm': 44, 'n': 45, 'o': 46, 'p': 47, 'q': 48, 'r': 49, 's': 50, 't': 51, 'u': 52, 'v': 53, 'w': 54, 'x': 55, 'y': 56, 'z': 57}
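One possible way to reduce the vocabulary shown above is to keep only a whitelist of characters. The sketch below is illustrative only: the `ALLOWED` set and the `clean_text` helper are assumptions, not part of the original notebook, and the right whitelist depends on the text being modeled.

```python
import string

# Hypothetical whitelist: lowercase letters, whitespace, and basic punctuation.
ALLOWED = set(string.ascii_lowercase + " \n.,;:!?'\"-()")

def clean_text(text):
    """Lowercase the text and drop every character that is not in ALLOWED."""
    return "".join(ch for ch in text.lower() if ch in ALLOWED)

sample = "Alice #1 @ [Wonderland]_ 100%!"
cleaned = clean_text(sample)
vocab = sorted(set(cleaned))

print(repr(cleaned))
print("Vocabulary size after cleaning:", len(vocab))
```

Applied to `raw_text` before building `chars`, a filter like this would remove symbols such as `#`, `@`, and the digits from the mapping.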


Getting the details of the dataset

The text has just under 164,000 characters, and after conversion to lowercase there are only 58 distinct characters in the vocabulary.

In [5]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  163780
Total Vocab:  58


Now, we need to define how we train the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.

In this tutorial we will split the book text into subsequences with a fixed length of 100 characters. This length is arbitrary, and you can change it as you wish.

Each training pattern of the network consists of 100 characters (X) followed by one output character (y). Using a sliding window, we move along the whole book one character at a time. For example, with a sequence length of 5 and the text CHAPTER, the first two training instances would be:

CHAPT -> E<br>
HAPTE -> R
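The CHAPTER example can be reproduced with a few lines of plain Python. This is a minimal sketch; `make_windows` is a hypothetical helper name used only for illustration.

```python
def make_windows(text, seq_length):
    """Return (input, output) pairs: seq_length characters and the one that follows."""
    pairs = []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]   # the window of input characters
        seq_out = text[i + seq_length]    # the single character to predict
        pairs.append((seq_in, seq_out))
    return pairs

pairs = make_windows("CHAPTER", 5)
for seq_in, seq_out in pairs:
    print(seq_in, "->", seq_out)
```

A text of `n` characters yields `n - seq_length` pairs, which is why 163,780 characters with a window of 100 produce 163,680 patterns.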

As we split up the book into these sequences, we convert the characters to integers using our lookup table we prepared earlier.

In [6]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100 #can be changed
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  163680


In [7]:
# checking dataX and dataY
print(dataX[0])
print(dataY[0])

[47, 49, 46, 41, 36, 34, 51, 1, 38, 52, 51, 36, 45, 33, 36, 49, 38, 7, 50, 1, 32, 43, 40, 34, 36, 7, 50, 1, 32, 35, 53, 36, 45, 51, 52, 49, 36, 50, 1, 40, 45, 1, 54, 46, 45, 35, 36, 49, 43, 32, 45, 35, 11, 1, 33, 56, 1, 43, 36, 54, 40, 50, 1, 34, 32, 49, 49, 46, 43, 43, 0, 0, 51, 39, 40, 50, 1, 36, 33, 46, 46, 42, 1, 40, 50, 1, 37, 46, 49, 1, 51, 39, 36, 1, 52, 50, 36, 1, 46, 37]
1


Now that we have prepared our training data we need to transform it so that it is suitable for use with Keras.

First we must transform the list of input sequences into the form [samples, time steps, features] that is expected by an LSTM network.

Next we need to rescale the integers to the range 0 to 1 to make the patterns easier to learn by the LSTM network, whose gates use the sigmoid activation function by default.

Finally, we need to convert the output values (single characters converted to integers) into a one-hot encoding. This is so that we can configure the network to predict the probability of each of the 58 different characters in the vocabulary given the input sequence. Each y value is converted into a sparse vector with a length of 58, full of zeros except for a 1 in the column for the character (integer) that the pattern represents.

For example, when “n” (integer value 45 in the mapping above) is one-hot encoded, it looks as follows:

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.]
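The three transformations can be tried on toy data with NumPy alone. This is a sketch under stated assumptions: the patterns have length 3 instead of 100, and `numpy.eye` stands in for the Keras one-hot helper used in the notebook.

```python
import numpy

n_vocab = 58                          # vocabulary size reported earlier
dataX = [[47, 49, 46], [49, 46, 41]]  # toy patterns of length 3 (the tutorial uses 100)
dataY = [41, 36]                      # the character that follows each pattern

# reshape to [samples, time steps, features] and rescale to the 0-to-1 range
X = numpy.reshape(dataX, (len(dataX), 3, 1)) / float(n_vocab)

# one-hot encode: row k of the identity matrix has a single 1 in column k
y = numpy.eye(n_vocab)[dataY]

print(X.shape, y.shape)
print(int(numpy.argmax(y[0])))
```

One design difference worth noting: `numpy.eye(n_vocab)` always produces `n_vocab` columns, whereas the Keras helper infers the number of classes from the largest label unless `num_classes` is passed.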

In [8]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize - rescaling the integer values
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

Now define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 0.2 (20%). The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the 58 characters.

The problem is defined as a single-character classification problem with 58 classes. So, we use cross entropy as the loss function and Adam as the optimization algorithm.
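To help interpret the loss values reported during training, the sketch below computes the cross entropy of a model that assigns the same probability to every character. The helper name `categorical_crossentropy` is reused here for illustration; it is a plain-Python, single-sample version, not the Keras implementation.

```python
import math

def categorical_crossentropy(true_index, probabilities):
    """Cross entropy for one sample: minus the log of the probability of the true class."""
    return -math.log(probabilities[true_index])

n_vocab = 58
uniform = [1.0 / n_vocab] * n_vocab            # a model that knows nothing
baseline = categorical_crossentropy(45, uniform)

print(round(baseline, 2))  # 4.06
```

A training loss below this value (about 4.06 for 58 classes) indicates that the network has learned something beyond uniform guessing.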

There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

<b>Note:</b> We use dropout so that the model generalizes across the dataset rather than memorizing the training data perfectly.

In [9]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))  # input shape is (time steps, features)
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
#filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
#checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
#callbacks_list = [checkpoint]

The network might be slow to train without a GPU. Because of this, we use model checkpointing to save all of the network weights to a file each time the loss improves at the end of an epoch. We will use the best set of weights (lowest loss) to generate the predictions.

Fitting the model to the data

In [10]:
# These hyperparameters can be configured as required
epochs = 10
batch_size = 128 

In [19]:
#model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

Epoch 1/10
Epoch 00001: loss improved from inf to 2.97719, saving model to weights-improvement-01-2.9772.hdf5
Epoch 2/10
Epoch 00002: loss improved from 2.97719 to 2.79160, saving model to weights-improvement-02-2.7916.hdf5
Epoch 3/10
Epoch 00003: loss improved from 2.79160 to 2.70187, saving model to weights-improvement-03-2.7019.hdf5
Epoch 4/10
Epoch 00004: loss improved from 2.70187 to 2.63345, saving model to weights-improvement-04-2.6334.hdf5
Epoch 5/10
Epoch 00005: loss improved from 2.63345 to 2.58111, saving model to weights-improvement-05-2.5811.hdf5
Epoch 6/10
Epoch 00006: loss improved from 2.58111 to 2.52943, saving model to weights-improvement-06-2.5294.hdf5
Epoch 7/10
Epoch 00007: loss improved from 2.52943 to 2.47952, saving model to weights-improvement-07-2.4795.hdf5
Epoch 8/10
Epoch 00008: loss improved from 2.47952 to 2.43389, saving model to weights-improvement-08-2.4339.hdf5
Epoch 9/10
Epoch 00009: loss improved from 2.43389 to 2.39267, saving model to weights-impro

<tensorflow.python.keras.callbacks.History at 0x7ff6b42f97b8>

After training the model, we will have a number of weight checkpoint files in the local directory. For the next step, we take the weights with the smallest loss value: <b> weights-improvement-10-2.3517.hdf5 </b>
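Picking the checkpoint with the smallest loss can also be done programmatically by parsing the loss out of each file name. This is a sketch that assumes the `weights-improvement-{epoch}-{loss}.hdf5` naming pattern from the checkpoint definition; `best_checkpoint` is a hypothetical helper, and the file list is written out by hand rather than read from disk.

```python
import re

# matches names such as weights-improvement-02-2.7916.hdf5
PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the file name with the lowest loss, or None if nothing matches."""
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = PATTERN.search(name)
        if match is None:
            continue  # skip files that are not checkpoints
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

files = [
    "weights-improvement-01-2.9772.hdf5",
    "weights-improvement-02-2.7916.hdf5",
    "weights-improvement-10-2.3517.hdf5",
    "notes.txt",
]
print(best_checkpoint(files))
```

In a real run, the list could come from something like `glob.glob("weights-improvement-*.hdf5")`.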

Also, the loss is still decreasing after every epoch, so training for more epochs would likely fit the model better.

### Generating Text with trained LSTM model

First, we need to define the network in exactly the same way it was defined while training. Then we can load the network weights from the checkpoint file. In this way, we don't have to train the network(model) again. 

In [11]:
# load the network weights
filename = "weights-improvement-10-2.3517.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

As we prepared the mapping of unique characters to integers, we must also create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.

In [12]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

Finally, we can make the predictions.

To start off, we need a seed sequence as input to the model. Given the seed sequence, the model predicts the next character. We then update the seed sequence by adding the generated character to the end and trimming off the first character. This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000 characters in length).

We can pick a random input pattern as our seed sequence, then print generated characters as we generate them.
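The predict, append, and trim cycle can be illustrated without a trained network. In the sketch below, `fake_predict` is a hypothetical stand-in for `model.predict` with a made-up rule over a 5-symbol vocabulary, and `generate` is an illustrative helper name; only the loop structure mirrors the tutorial.

```python
import numpy

def generate(predict_fn, seed, n_vocab, length):
    """Greedy generation: predict, keep the most likely index, slide the window."""
    pattern = list(seed)  # copy so the caller's seed is left untouched
    generated = []
    for _ in range(length):
        # shape [1 sample, time steps, 1 feature], rescaled like the training data
        x = numpy.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        probabilities = predict_fn(x)
        index = int(numpy.argmax(probabilities))
        generated.append(index)
        # add the prediction at the end and drop the first element
        pattern = pattern[1:] + [index]
    return generated

def fake_predict(x):
    """Hypothetical model: favours (last input symbol + 1) modulo 5."""
    n_vocab = 5
    last = int(round(x[0, -1, 0] * n_vocab))
    probabilities = numpy.full((1, n_vocab), 0.1)
    probabilities[0, (last + 1) % n_vocab] = 0.6
    return probabilities

seed = [0, 1, 2]
out = generate(fake_predict, seed, n_vocab=5, length=4)
print(out)
```

The window length stays constant throughout, so the input shape passed to the predictor never changes.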

In [13]:
# generate a random seed
start = numpy.random.randint(0, len(dataX))  # the upper bound is exclusive
print(start)
pattern = list(dataX[start])  # copy the pattern so that generation does not modify dataX
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

103
Seed:
" yone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it away  "


In [14]:
# generate characters
length = 100
final = []
for i in range(length):
    # reshape the seed sequence before passing it into the LSTM model
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    # normalize the integer values
    x = x / float(n_vocab)
    # make the prediction
    prediction = model.predict(x, verbose=0)
    # get the index with the maximum predicted probability
    index = numpy.argmax(prediction)
    # convert the predicted integer to a character
    result = int_to_char[index]
    final.append(result)
    # add the predicted character to the end of the seed sequence
    pattern.append(index)
    # remove the first character from the seed sequence
    pattern = pattern[1:]
print(final)

['a', 'n', ' ', 'a', 'n', 'l', ' ', 'a', 'o', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'i', 'n', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'a', 'r', 'o', ' ', 'o', 'o', ' ', 't', 'h', 'e', ' ', 'p', 'o', 'r', 'e', 'e', ' ', 't', 'h', ' ', 't', 'h', 'e', ' ', 's', 'o', 'e', 'e', 'e', ' ', 'a', 'n', 'd', ' ', 't', 'h', 'e', ' ', 'c', 'a', 'r', 'e', ' ', 'a', 'n']
