## Text Generator using LSTM and Keras

A text generator uses deep learning to build a language model: the model is trained on a corpus of text and then emits new character or word sequences given a seed sequence.  
In this section we will use an LSTM in Keras to train a model on a collection of William Shakespeare's sonnets, which can be downloaded online, and then use the model to generate text.
The entire model was trained on an Ubuntu machine with one Nvidia Tesla K80 GPU. It took me around 2 hours to train for 20 epochs.

### Loading the Data

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import RNN
from keras.utils import np_utils

Using TensorFlow backend.


The data can be downloaded from [Project Gutenberg](http://www.gutenberg.org/ebooks/1041?msg=welcome_stranger). I cleaned up this file to remove the start and end credits; the cleaned version can be downloaded from my git repository.

In [2]:
filename = 'sonnet.txt'
# Read the whole corpus and lowercase it to keep the vocabulary small
with open(filename) as f:
    text = f.read().lower()

### Creating Character Mappings

Character mapping is the step in which we assign an arbitrary number to each character in the text, so that every unique character maps to a number. This is important because neural networks operate on numbers rather than text.

We could also use word mappings, where we assign a number to each word instead of each character, but since this is a small data set, character mappings make more sense.

In [3]:
characters = sorted(list(set(text)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
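As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below shows the round trip on a toy string; `sample_text`, `encode`, and `decode` are hypothetical names introduced only for this illustration and are not part of the original notebook.

```python
# Toy corpus standing in for the sonnet text (assumption for illustration).
sample_text = "shall i compare thee"

# Same construction as above: one integer per unique character.
sample_chars = sorted(set(sample_text))
sample_char_to_n = {char: n for n, char in enumerate(sample_chars)}
sample_n_to_char = {n: char for n, char in enumerate(sample_chars)}


def encode(s, mapping):
    """Turn a string into a list of integer ids."""
    return [mapping[c] for c in s]


def decode(ids, mapping):
    """Turn a list of integer ids back into a string."""
    return "".join(mapping[i] for i in ids)


encoded = encode("thee", sample_char_to_n)
decoded = decode(encoded, sample_n_to_char)
print(encoded, decoded)
```

Encoding and then decoding should return the original string unchanged, which is what makes it possible to turn the model's numeric predictions back into text later.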

In [4]:
char_to_n

{'\n': 0,
 ' ': 1,
 '!': 2,
 "'": 3,
 '(': 4,
 ')': 5,
 ',': 6,
 '-': 7,
 '.': 8,
 ':': 9,
 ';': 10,
 '?': 11,
 'a': 12,
 'b': 13,
 'c': 14,
 'd': 15,
 'e': 16,
 'f': 17,
 'g': 18,
 'h': 19,
 'i': 20,
 'j': 21,
 'k': 22,
 'l': 23,
 'm': 24,
 'n': 25,
 'o': 26,
 'p': 27,
 'q': 28,
 'r': 29,
 's': 30,
 't': 31,
 'u': 32,
 'v': 33,
 'w': 34,
 'x': 35,
 'y': 36,
 'z': 37}

In [5]:
n_chars = len(text)
n_vocab = len(characters)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  100229
Total Vocab:  38


### Preprocessing of the Data

This is the trickiest part of working with an LSTM. `seq_length` is the number of characters that we want to consider before making a prediction; in our case it will be 100.
For the sake of simplicity, assume a `seq_length` of 4 and a corpus that contains only the word "Machine". The X and Y arrays would then be as follows:

| X  | Y |
| ------------- | ------------- |
| `[M,a,c,h]`  | `[i]`  |
| `[a,c,h,i]` | `[n]`  |
| `[c,h,i,n]` | `[e]`  |

In [6]:
seq_length = 100
X = []
Y = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = text[i:i + seq_length]
    seq_out = text[i + seq_length]
    X.append([char_to_n[char] for char in seq_in])
    Y.append(char_to_n[seq_out])

print("Total Patterns: ", len(X))

Total Patterns:  100129
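The same windowing logic can be checked against the "Machine" table from earlier. This is a minimal sketch; `make_windows` is a hypothetical helper written for this illustration, and it keeps the characters as text instead of mapping them to integers so the result is easy to read.

```python
def make_windows(text, seq_length):
    """Slide a window of seq_length characters over text, one step at a time.

    Returns the input windows and, for each window, the character that follows it.
    """
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append(text[i:i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets


toy_X, toy_Y = make_windows("Machine", 4)
for x, y in zip(toy_X, toy_Y):
    print(list(x), "->", y)
# ['M', 'a', 'c', 'h'] -> i
# ['a', 'c', 'h', 'i'] -> n
# ['c', 'h', 'i', 'n'] -> e
```

The number of patterns is always `len(text) - seq_length`, which is why the sonnet corpus of 100229 characters yields 100129 patterns with a `seq_length` of 100.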


We must transform the list of input sequences into the form `[samples, time steps, features]` expected by an LSTM network. Next, we rescale the integers to the range 0-1 so that the LSTM network learns faster. Also, let us convert the outputs to a one-hot encoding.

In [9]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(n_vocab)
Y_modified = np_utils.to_categorical(Y)
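To make the resulting shapes concrete, here is a small NumPy-only sketch of the same three steps on toy data. The `toy_*` values are made up for illustration, and `np.eye` is used here as a stand-in for `to_categorical`, with the number of classes passed explicitly.

```python
import numpy as np

# Toy data (assumed for illustration): 2 samples, windows of 3 ids, vocabulary of 5.
toy_vocab = 5
toy_seq_length = 3
toy_X = [[0, 1, 2], [1, 2, 3]]
toy_Y = [3, 4]

# Reshape to [samples, time steps, features] and rescale ids into [0, 1).
X_arr = np.reshape(toy_X, (len(toy_X), toy_seq_length, 1)) / float(toy_vocab)

# One-hot encode the targets: row k of the identity matrix is the encoding of id k.
Y_onehot = np.eye(toy_vocab)[toy_Y]

print(X_arr.shape)     # (2, 3, 1)
print(Y_onehot.shape)  # (2, 5)
print(Y_onehot[0])     # [0. 0. 0. 1. 0.]
```

Each input sample becomes a column of single-feature time steps, and each target becomes a vector with a single 1 at the index of the next character, which is the format the softmax output layer is trained against.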

### Training the Model

Let's build a sequential model using LSTM layers. We will use 2 LSTM layers with 400 units each and add a dropout of 20% after each to reduce overfitting. So that the second LSTM layer receives the full sequence of outputs from the first, we set the `return_sequences` parameter to `True` on the first layer.

In [10]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
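To get a feel for the size of this network, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard LSTM formulation with bias terms (four gates, each with input weights, recurrent weights, and a bias) and an output layer of 38 classes, matching the vocabulary size printed earlier. `lstm_params` and `dense_params` are hypothetical helpers; the figures can be compared against `model.summary()`.

```python
def lstm_params(units, input_dim):
    """Parameters of a standard LSTM layer with biases (assumed formulation).

    Each of the 4 gates has input weights, recurrent weights, and a bias vector.
    """
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    """Parameters of a fully connected layer: weights plus biases."""
    return input_dim * units + units


n_classes = 38  # assumed equal to the vocabulary size

first_lstm = lstm_params(400, 1)     # input is 1 feature per time step
second_lstm = lstm_params(400, 400)  # input is the 400 outputs of the first LSTM
output = dense_params(n_classes, 400)
total = first_lstm + second_lstm + output

print(first_lstm)   # 643200
print(second_lstm)  # 1281600
print(output)       # 15238
print(total)        # 1940038
```

Dropout layers add no parameters, so under these assumptions the model has a little under two million weights, most of them in the second LSTM layer.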

The model is slow to train (around 300 seconds per epoch). We will use model checkpointing to save all of the network weights to a file each time an improvement in loss is observed at the end of an epoch.

In [11]:
from keras.callbacks import ModelCheckpoint

In [12]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [13]:
model.fit(X_modified, Y_modified, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 2.90667, saving model to weights-improvement-01-2.9067.hdf5
Epoch 2/20

Epoch 00002: loss improved from 2.90667 to 2.52853, saving model to weights-improvement-02-2.5285.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.52853 to 2.37719, saving model to weights-improvement-03-2.3772.hdf5
Epoch 4/20

Epoch 00004: loss improved from 2.37719 to 2.27001, saving model to weights-improvement-04-2.2700.hdf5
Epoch 5/20

Epoch 00005: loss improved from 2.27001 to 2.16752, saving model to weights-improvement-05-2.1675.hdf5
Epoch 6/20

Epoch 00006: loss improved from 2.16752 to 2.06866, saving model to weights-improvement-06-2.0687.hdf5
Epoch 7/20

Epoch 00007: loss improved from 2.06866 to 1.98353, saving model to weights-improvement-07-1.9835.hdf5
Epoch 8/20

Epoch 00008: loss improved from 1.98353 to 1.91365, saving model to weights-improvement-08-1.9136.hdf5
Epoch 9/20

Epoch 00009: loss improved from 1.91365 to 1.84750, saving model to weig

<keras.callbacks.History at 0x7f977719bd68>

In [14]:
filename = "weights-improvement-20-1.2610.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
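Rather than typing the checkpoint name by hand, the file with the lowest loss can be picked from the names themselves, since the checkpoint pattern embeds the epoch and the loss. This is a sketch under that assumption; `best_checkpoint` is a hypothetical helper and the file list is an example, not the full output of the training run.

```python
import re

# Matches names produced by "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5".
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match is None:
            continue  # skip files that are not checkpoints
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name


example_files = [
    "weights-improvement-01-2.9067.hdf5",
    "weights-improvement-02-2.5285.hdf5",
    "weights-improvement-03-2.3772.hdf5",
    "notes.txt",
]
print(best_checkpoint(example_files))  # weights-improvement-03-2.3772.hdf5
```

In the notebook this could be combined with `glob.glob("weights-improvement-*.hdf5")` to select the file that is passed to `model.load_weights`.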

### Generating the Text

Let's start making some predictions. We will start off with a random sequence of 100 characters from our training set and, given this random seed, predict another 1000 characters.

In [15]:
import sys

In [16]:
# pick a random seed
start = np.random.randint(0, len(X)-1)
# copy the seed so that appending below does not modify X in place
pattern = list(X[start])
print("Seed:")
print("\"", ''.join([n_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    # shape the current window as [samples, time steps, features] and rescale
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # greedy decoding: take the most likely next character
    index = np.argmax(prediction)
    result = n_to_char[index]
    sys.stdout.write(result)
    # slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

Seed:
" stopped are.
  mark how with my neglect i do dispense:
    you are so strongly in my purpose bred,
  "
   the hard the beauty of your srue inace,
    and there bur shou, that i am sometime thee,
    and there bur shou, that words the lovnne oray.
    and there bur srieer than thou dost stain,
    and there thou thalt wour should mote be to done.

  lxxxiii

  when i have seen the world should would have seen to come,
  since what is hn the surength of all the sime,
  and seamo summer's fear will be thy faces,
  and see that love with thee i see surange;
  the some will better that thou dost sece stre,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and therefore lake the wirhow will of youth,
  and t

From the predicted text we can see that the model has picked up patterns in the data. It produces some real words, but many others are misspelled or meaningless. It has also learned where to place punctuation marks, which is pretty surprising.
We can improve the model's performance by training a deeper network (for example, by adding another LSTM layer or increasing the number of LSTM units in each layer) or by training for more epochs.
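The line that repeats over and over in the generated text is a typical effect of always picking the single most likely character with `np.argmax`: once the window returns to a state it has seen before, the output loops. One common alternative, which the run above did not use, is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below is an illustration only; `sample_with_temperature` is a hypothetical helper and `fake_prediction` stands in for the output of `model.predict`.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from a probability vector rescaled by temperature.

    Temperatures below 1 favour likely characters; above 1 flatten the distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64).ravel()
    # Rescale in log space; the small constant avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


# Stand-in for model.predict(x): one row of probabilities over 3 characters.
fake_prediction = np.array([[0.7, 0.2, 0.1]])
rng = np.random.default_rng(0)
draws = [sample_with_temperature(fake_prediction, 0.8, rng) for _ in range(200)]
print(sorted(set(draws)))
```

In the generation loop this would replace `index = np.argmax(prediction)` with something like `index = sample_with_temperature(prediction, temperature=0.8)`. The value 0.8 is an arbitrary starting point, and the output is no longer deterministic.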

I would recommend reading Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), which delves deeper into this subject and shows some eye-popping results.