# Text Generation: LSTM

LSTM (Long Short-Term Memory) networks are a special kind of RNN (recurrent neural network) that are good at learning long-term dependencies.

For a good explanation see this [post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

In this example we will use an LSTM to generate text based on some input text (Nietzsche's writings).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mj-will/intro2ml/blob/master/notebooks/text-generation-lstm.ipynb)

In [2]:
import sys
import io
import random
import numpy as np

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file

## Data

The dataset is a freely available text file of [Nietzsche's writings](https://s3.amazonaws.com/text-datasets/nietzsche.txt) that you can download.

This dataset needs splitting into smaller sequences that can then be used to train the LSTM.

In [3]:
# get the data
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 600893


Collect the distinct characters in the text, sort them, and build two dictionaries: one mapping each character to an integer index and one mapping each index back to its character.

In [4]:
# how many different characters is that?
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 57
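
To make the two lookup tables concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_chars`, `toy_char_indices` and `toy_indices_char` are hypothetical stand-ins for the notebook's variables; the construction is the same.

```python
# Toy corpus standing in for the Nietzsche text (assumption for illustration).
toy_text = "hello world"

# Distinct characters, sorted so that the index assignment is reproducible.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integer indices, then decode it again.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)     # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)     # hello world
```

Encoding followed by decoding returns the original string, which is the property the text-generation loop relies on when it turns predicted indices back into characters.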


### Pre-processing

Here we take a sequence and store it. We also store the character that follows it, since this is what the LSTM will be trained to predict.

In [5]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200285
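
The windowing is easier to see on a short string. The sketch below applies the same loop to a hypothetical ten-character text with a smaller window (`toy_maxlen`, `toy_step` and the other `toy_` names are illustrative only), and then checks the sequence count reported above from the corpus length, `maxlen` and `step`.

```python
# Hypothetical toy input; the notebook uses the full corpus with maxlen=40, step=3.
toy_text = "abcdefghij"
toy_maxlen = 4
toy_step = 3

windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])   # input sequence
    targets.append(toy_text[i + toy_maxlen])      # character to predict

print(windows)  # ['abcd', 'defg']
print(targets)  # ['e', 'h']

# The number of windows is just the length of the range being iterated over.
# With the corpus length printed earlier (600893), maxlen=40 and step=3:
n_sequences = len(range(0, 600893 - 40, 3))
print(n_sequences)  # 200285
```

Consecutive windows overlap by `maxlen - step` characters, which is why the original comment calls them semi-redundant.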


One-hot encode the sequences and their target characters into boolean arrays

In [6]:
print('Vectorization...')
# the np.bool alias was removed from recent NumPy releases; the builtin bool works everywhere
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...


We now have an array in which each character of a sequence is represented by `True` at that character's index (a one-hot encoding).
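
As a small illustration of this representation, the sketch below one-hot encodes a three-character sequence over a hypothetical three-character vocabulary and then recovers the text from the array. The `toy_` names are assumptions made for the example.

```python
import numpy as np

# Hypothetical three-character vocabulary (the real one has 57 characters).
toy_chars = ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

sequence = "cab"

# One row per time step, one column per character in the vocabulary.
one_hot = np.zeros((len(sequence), len(toy_chars)), dtype=bool)
for t, char in enumerate(sequence):
    one_hot[t, toy_char_indices[char]] = True

print(one_hot.astype(int))

# Each row holds exactly one True, so argmax recovers the character index.
recovered = "".join(toy_chars[i] for i in one_hot.argmax(axis=1))
print(recovered)  # cab
```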

## The Model: LSTM

In [7]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
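
As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the default Keras `LSTM` and `Dense` layers (both with bias terms) and the 57-character vocabulary reported earlier; the helper names `lstm_param_count` and `dense_param_count` are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # An LSTM has four weight blocks (input, forget and output gates plus the
    # candidate cell state). Each block has input weights, recurrent weights
    # and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs


n_chars = 57  # vocabulary size printed by the notebook
lstm_params = lstm_param_count(128, n_chars)
dense_params = dense_param_count(128, n_chars)
total_params = lstm_params + dense_params

print(lstm_params)   # 95232
print(dense_params)  # 7353
print(total_params)  # 102585
```

Under those assumptions the totals should agree with what `model.summary()` reports; the `Activation` layer adds no parameters.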

In [8]:
# compile the model
# `lr` is a deprecated alias; current Keras releases expect `learning_rate`
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

### Helper functions

In [9]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
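
To see what the `temperature` argument does, the sketch below applies the same rescaling as `sample` but returns the adjusted distribution instead of drawing from it. The function name `apply_temperature` and the three-element `probs` list are hypothetical, chosen only for illustration.

```python
import numpy as np


def apply_temperature(preds, temperature):
    # Same rescaling as in `sample`, without the random draw.
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


probs = [0.5, 0.3, 0.2]  # hypothetical model output over three characters

for temperature in (0.2, 1.0, 1.2):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely character, which gives more repetitive text, while values above 1.0 flatten the distribution and give more varied but more error-prone text.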

In [10]:
def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

Wrap the function in a `LambdaCallback` so that Keras calls it at the end of each epoch.

In [10]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

## Fit

This will be slow to run on a CPU (~3-5 minutes per epoch).

In [12]:
model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: " in the habit of
placing themselves in f"
 in the habit of
placing themselves in fiting and and and and indeed and a stard and as it is a sartion of the say of the say the present and such a stark of the say of the subtle the say the say of the subtle and an art of the say of the still the start of the love of the say in the said of a say the say of the suct of the soul of the say in the possibition of still more and as the start and intertion of all the person of the soul of t
----- diversity: 0.5
----- Generating with seed: " in the habit of
placing themselves in f"
 in the habit of
placing themselves in fiting and subserve and only are suffering the own instinctive be the fact of which every itself and of the supple and an all othert and self and its life and suffering heaving of the exception of which we seek in the instities and an are forch which he chalses of an art of a the philos

ires "moral is himself which from the atjeg exhiated evil of permostter. aever in himself, but absolutely feeling: notuwing
out and truth an even these couric in
world, something over iburing-crusterd, are
its le that being "mankind, sucrumatiest feblareve
Epoch 8/10

----- Generating text after Epoch: 7
----- diversity: 0.2
----- Generating with seed: "ful
episode of german music. but with re"
ful
episode of german music. but with religious and the will to the conscience and the suffering of the conscience and the actions in the world to the such a standing is all that is the morality, and something of the conscience to the consequences and the success of the fact that the sense, the subtlety and the such a master that is the standing of the strength, the success of the such a master, the conscience and success of the to the 
----- diversity: 0.5
----- Generating with seed: "ful
episode of german music. but with re"
ful
episode of german music. but with remained in the will makes the 

<keras.callbacks.History at 0x7f2cd8737dd0>