In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

# Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a
language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. The book's
example uses some of the writings of Nietzsche, the late-19th century German philosopher (translated to English); this notebook instead
loads a local Russian-language text file, `Usadba.txt`. The language model we learn will thus be specifically a model of that corpus's
writing style and topics of choice, rather than a more generic model of the language.

## Preparing the data

Let's start by loading the corpus from a local file and converting it to lowercase:

In [2]:
import keras
import numpy as np

path = 'C:/Users/mikhail.galkin/Documents/DataProjects/tutorial_keras/py_keras_by_fchollet/Usadba.txt'
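# Read the whole corpus and lowercase it; the context manager closes the file afterwards
with open(path, encoding='utf8') as f:
    text = f.read().lower()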
print('Corpus length:', len(text))

Corpus length: 419962



Next, we will extract partially-overlapping sequences of length `maxlen`, one-hot encode them and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.
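
To make the windowing concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_maxlen` and `toy_step` are illustrative assumptions, not part of the notebook; the loop mirrors the extraction used on the real corpus.

```python
# Toy illustration of the sliding-window extraction (hypothetical values)
toy_text = "hello world"
toy_maxlen = 4   # length of each extracted window
toy_step = 3     # distance between the starts of consecutive windows

windows = []     # input sequences
targets = []     # the character that follows each window

for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

for window, target in zip(windows, targets):
    print(repr(window), '->', repr(target))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

Because `toy_step` is smaller than `toy_maxlen`, consecutive windows share characters, which is what "partially-overlapping" refers to.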

In [3]:
# Sanity check: print a randomly chosen excerpt of the corpus
import random
# Pick a random start position that leaves room for a 1000-character excerpt
start_index = random.randint(0, len(text) - 1000)
selected_text = text[start_index: start_index + 1000]
print('RANDOMLY SELECTED PIECE OF TEXT: \n"' + selected_text + '"')

RANDOMLY SELECTED PIECE OF TEXT: 
"ктуганова: я могу сказать)
илья николаевич свинцов: жги
елена сергеевна актуганова: легко заменяются на печень трески в банках. а при очень скромном бюджете- вообще на витамины а, е, д и омега 3 отечественного производства. и даже на рыбий жир)
илья николаевич свинцов: ты так никогда миллиардером не станешь)))
елена сергеевна актуганова: останусь скромным мультимиллиардером, так и быть
ренат ирекович актуганов: юля, 687. как родственники? )
илья николаевич свинцов: все ли в добром здравии?
юля сергеевна шутова: умирают потихоньку

In [4]:
# Length of extracted character sequences
maxlen = 100
# We sample a new sequence every `step` characters
step = 3
# This holds our extracted sequences
sentences = []
# This holds the targets (the follow-up characters)
next_chars = []

In [5]:
# for check
range(0, len(text) - maxlen, step)

range(0, 419862, 3)

In [6]:
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
print(sentences[11:14], next_chars[11:14])


# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars), '\n', chars)

# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}
print('Indices characters:', len(char_indices), '\n', char_indices)

Number of sequences: 139954
['ещё из эмоциональных фоток. селфи палка на айфоне сработала с третьего раза. у меня эмоция, как сказ', ' из эмоциональных фоток. селфи палка на айфоне сработала с третьего раза. у меня эмоция, как сказал ', ' эмоциональных фоток. селфи палка на айфоне сработала с третьего раза. у меня эмоция, как сказал жен'] ['а', 'ж', 'я']
Unique characters: 213 
 ['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '@', '[', '\\', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '~', '\xa0', '©', '«', '»', '×', 'ä', 'å', 'è', 

In [7]:
# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# Use the built-in `bool`: the `np.bool` alias was deprecated and later removed from NumPy
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
# for check
print('x.shape:')
print(x.shape)
print('y.shape:')
print(y.shape)

Vectorization...
x.shape:
(139954, 100, 213)
y.shape:
(139954, 213)
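
These shapes follow directly from the corpus length, `maxlen`, `step` and the vocabulary size. The sketch below redoes the arithmetic with the values printed in the outputs above, assuming NumPy stores each boolean element in one byte; it suggests that `x` alone needs roughly 3 GB of memory.

```python
corpus_length = 419962   # from the "Corpus length" output
maxlen = 100
step = 3
n_chars = 213            # from the "Unique characters" output

# Same range as the extraction loop, so the count matches exactly
n_sequences = len(range(0, corpus_length - maxlen, step))

x_bytes = n_sequences * maxlen * n_chars   # one byte per boolean element
y_bytes = n_sequences * n_chars

print(n_sequences)       # 139954
print(x_bytes / 1e9)     # about 2.98 (GB)
print(y_bytes / 1e6)     # about 29.8 (MB)
```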


In [8]:
# for check
for i, sentence in enumerate(sentences[11:12]):
    print(i,':len(sentence)=', len(sentence),':', sentence)
    for t, char in enumerate(sentence):
        print(t, char,':char_indices=', char_indices[char])
    print(i,':next_chars=', next_chars[11], ':char_indices[next_chars]=', char_indices[next_chars[11]])

0 :len(sentence)= 100 : ещё из эмоциональных фоток. селфи палка на айфоне сработала с третьего раза. у меня эмоция, как сказ
0 е :char_indices= 99
1 щ :char_indices= 119
2 ё :char_indices= 126
3   :char_indices= 1
4 и :char_indices= 102
5 з :char_indices= 101
6   :char_indices= 1
7 э :char_indices= 123
8 м :char_indices= 106
9 о :char_indices= 108
10 ц :char_indices= 116
11 и :char_indices= 102
12 о :char_indices= 108
13 н :char_indices= 107
14 а :char_indices= 94
15 л :char_indices= 105
16 ь :char_indices= 122
17 н :char_indices= 107
18 ы :char_indices= 121
19 х :char_indices= 115
20   :char_indices= 1
21 ф :char_indices= 114
22 о :char_indices= 108
23 т :char_indices= 112
24 о :char_indices= 108
25 к :char_indices= 104
26 . :char_indices= 15
27   :char_indices= 1
28 с :char_indices= 111
29 е :char_indices= 99
30 л :char_indices= 105
31 ф :char_indices= 114
32 и :char_indices= 

In [9]:
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
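
The nested loops above are easy to follow but slow on a large corpus. As an alternative sketch (not the notebook's method), the same one-hot arrays can be built by indexing an identity matrix with integer character codes. The `toy_*` names are stand-ins for `sentences`, `next_chars` and `chars`, chosen here only for illustration.

```python
import numpy as np

# Toy stand-ins for the notebook's `sentences`, `next_chars` and `chars`
toy_sentences = ['hell', 'lo w', 'worl']
toy_next_chars = ['o', 'o', 'd']
toy_chars = sorted(set(''.join(toy_sentences) + ''.join(toy_next_chars)))
toy_indices = {char: i for i, char in enumerate(toy_chars)}

# Integer-encode every window: shape (sequences, maxlen)
encoded = np.array([[toy_indices[c] for c in s] for s in toy_sentences])
targets = np.array([toy_indices[c] for c in toy_next_chars])

# Row k of the identity matrix is the one-hot vector for index k
eye = np.eye(len(toy_chars), dtype=bool)
x_fast = eye[encoded]    # shape (sequences, maxlen, unique_characters)
y_fast = eye[targets]    # shape (sequences, unique_characters)

# Reference result from the explicit loops
x_loop = np.zeros(x_fast.shape, dtype=bool)
y_loop = np.zeros(y_fast.shape, dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        x_loop[i, t, toy_indices[char]] = 1
    y_loop[i, toy_indices[toy_next_chars[i]]] = 1

print(np.array_equal(x_fast, x_loop), np.array_equal(y_fast, y_loop))  # True True
```

Note that the intermediate integer array and the fancy-indexing result both live in memory at once, so on a corpus of this size the loop version may still be the safer choice.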

## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier with a softmax over all possible characters. Note that 
recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in 
recent times.

In [10]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))



Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [11]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()



_________________________________________________________________
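Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               175104    
_________________________________________________________________
dense_1 (Dense)              (None, 213)               27477     
=================================================================
Total params: 202,581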
Trainable params: 202,581
Non-trainable params: 0
_________________________________________________________________
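
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias vector; the helper names are illustrative and not part of Keras.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output
    return input_dim * units + units

n_chars = 213   # vocabulary size from the outputs above
lstm_params = lstm_param_count(n_chars, 128)
dense_params = dense_param_count(128, n_chars)

print(lstm_params, dense_params, lstm_params + dense_params)
# 175104 27477 202581
```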


## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

In [12]:
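def sample(preds, temperature=1.0):
    # Reweight the predicted distribution by `temperature`, then draw one index from it
    preds = np.asarray(preds).astype('float64')
    # Dividing log-probabilities by the temperature sharpens (< 1) or flattens (> 1) them
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the reweighted values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # One multinomial draw yields a one-hot vector; its argmax is the sampled index
    probas = np.random.multinomial(1, preds, 1)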
    return np.argmax(probas)
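
To see what the reweighting step does on its own, here is a small sketch that applies the same formula to a made-up three-way distribution. `reweight_distribution` and `toy_probs` are illustrative names, and the probabilities are arbitrary.

```python
import numpy as np

def reweight_distribution(probs, temperature=1.0):
    # Same transformation as in `sample`, without the random draw
    probs = np.asarray(probs, dtype='float64')
    logits = np.log(probs) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_probs = [0.5, 0.3, 0.2]
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(reweight_distribution(toy_probs, temperature), 3))
```

At temperature 1.0 the distribution is unchanged. Lower temperatures concentrate the mass on the most likely entry, which makes sampling more repetitive and predictable; higher temperatures flatten the distribution, which makes sampling more surprising.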


Finally, this is the loop where we repeatedly train the model and generate text. We generate text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature on the sampling strategy.

In [13]:
import random
import sys

for epoch in range(1, 60):
    print('========================================================================================================')
    print('EPOCH:', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- RANDOMLY SELECTED PIECE OF TEXT::\n"' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ TEMPERATURE::', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

EPOCH: 1





Epoch 1/1
--- RANDOMLY SELECTED PIECE OF TEXT::
"ова: ещё никто не строил империи из двух минусов)) может, вообще надо не базы, как у всех, а в неизв"
------ TEMPERATURE:: 0.2
ова: ещё никто не строил империи из двух минусов)) может, вообще надо не базы, как у всех, а в неизвально столько столько столько соблания себита с на по в собли по вот собести в собесть какие столько столько по вольше себения себита соблить в посто обение стотить на по вот по вот на по восто в столько собли столько собли столько от на таком на такое столько просто в столько столько столько столько себена совор



ч зарипов: я впервёное и черахникоп2
сегодня? согласиа?
харитонов
EPOCH: 13
Epoch 1/1
--- RANDOMLY SELECTED PIECE OF TEXT::
"ин: вот эту можно использовать
виктор клевакин: будем надеяться))
михаил юрьевич галкин: да. это сов"
------ TEMPERATURE:: 0.2
ин: вот эту можно использовать
виктор клевакин: будем надеяться))
михаил юрьевич галкин: да. это совстрали на подавила с одного в предлага на полетили подограли подомодлись на под в городически в столице подомодльно под подопали подопале, что такой приезжать, то под подомались под реблили на под нашились и подомоджий, что то подо