In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

# Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files, such as Wikipedia or the Lord of the Rings. The book's 
version of this example uses some of the writings of Nietzsche, the late-19th century German philosopher (translated to English). This run 
of the notebook uses a Russian-language corpus instead (the file `Pelevin_Chapaev-i-pustota.txt`), so the language model we learn is 
specifically a model of that author's writing style and topics of choice, rather than a more generic model of the language.

## Preparing the data

Let's start by reading the corpus from a local file and converting it to lowercase:

In [17]:
import keras
import numpy as np

path = 'C:/Users/mikhail.galkin/Documents/DataProjects/tutorial_keras/py_keras_by_fchollet/Pelevin_Chapaev-i-pustota.txt'
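# Read the whole file and lowercase it. Using encoding='utf-8-sig' instead would strip
# the leading byte-order mark, which otherwise ends up in the vocabulary as '\ufeff'.
with open(path, encoding='utf8') as f:
    text = f.read().lower()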
print('Corpus length:', len(text))

Corpus length: 588818



Next, we will extract partially overlapping sequences of length `maxlen`, one-hot encode them and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.
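
To see the shapes involved before running this on the full corpus, here is a minimal sketch of the same procedure on a toy string. The names `toy_text`, `toy_maxlen`, `toy_step` and the other `toy_*` variables are made up for this illustration and are not part of the notebook's pipeline.

```python
import numpy as np

# Toy corpus and window settings (illustrative values, not the notebook's)
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Extract overlapping windows and the character that follows each one
toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Map each unique character to an integer index
toy_chars = sorted(set(toy_text))
toy_char_indices = {char: idx for idx, char in enumerate(toy_chars)}

# One-hot encode the inputs (3D) and the targets (2D)
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = True

print(toy_sentences)             # ['hell', 'lo w', 'worl']
print(toy_next_chars)            # ['o', 'o', 'd']
print(toy_x.shape, toy_y.shape)  # (3, 4, 8) (3, 8)
```

Each window contributes exactly one `True` per time step in `toy_x` and exactly one `True` in its row of `toy_y`.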

In [18]:
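# Sanity check: print a random 1000-character excerpt of the corpus
import random
preview_len = 1000
# Pick a start index that leaves room for a full excerpt
start_index = random.randint(0, len(text) - preview_len)
selected_text = text[start_index: start_index + preview_len]
print('RANDOMLY SELECTED PIECE OF TEXT: \n"' + selected_text + '"')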

RANDOMLY SELECTED PIECE OF TEXT: 
"шагов, так что от одного уже не было видно тех, кто сидел у другого, – можно было различить только смутные силуэты, но сколько там человек и люди ли это вообще, сказать с уверенностью было нельзя. но самым странным было то, что поле, на котором мы стояли, тоже неизмеримо изменилось – теперь у нас под ногами была идеально ровная плоскость, покрытая чем-то вроде короткой пожухшей травы, и нигде на ней не было ни выступа, ни впадины – это было ясно по идеально правильному узору горящих вокруг огней.
– что же это такое? – спросил я растеряно.
– ага, – сказал барон. – теперь, я полагаю, видите.
– вижу, – сказал я.
– это один из филиалов загробного мира, – сказал юнгерн, – тот, что по моей части. сюда попадают главным образом лица, при жизни бывшие воинами. может быть, вы слышали про валгаллу?
– слышал, – ответил я, чувствуя, как во мне растет несуразное детское желание вцепиться в край бароновой рясы.
– вот это она и есть. только, к сожалению, сюда попадаю

In [19]:
# Length of extracted character sequences
maxlen = 100
# We sample a new sequence every `step` characters
step = 3
# This holds our extracted sequences
sentences = []
# This holds the targets (the follow-up characters)
next_chars = []

In [20]:
# Sanity check: the start offsets at which sequences will be extracted
range(0, len(text) - maxlen, step)

range(0, 588718, 3)

In [21]:
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
print(sentences[11:14], next_chars[11:14])


# List of unique characters in the corpus
chars = sorted(set(text))
print('Unique characters:', len(chars), '\n', chars)

# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}
print('Indices characters:', len(char_indices), '\n', char_indices)

Number of sequences: 196240
['людей,\nна безбрежный живой поток, поднятый\nмоей волей и мчащийся в никуда по багровой\nзакатной степи', 'ей,\nна безбрежный живой поток, поднятый\nмоей волей и мчащийся в никуда по багровой\nзакатной степи, я', '\nна безбрежный живой поток, поднятый\nмоей волей и мчащийся в никуда по багровой\nзакатной степи, я ча'] [',', ' ', 'с']
Unique characters: 88 
 ['\n', ' ', '!', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', '«', '»', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я', '–', '“', '„', '…', '\ufeff']
Indices characters: 88 
 {'\n': 0, ' ': 1, '!': 2, "'": 3, '(': 4, ')': 5, '*': 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6

In [22]:
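# Allocate the boolean arrays that will hold the one-hot encoded characters.
# `np.bool` was deprecated in NumPy 1.20 and removed in 1.24; the builtin `bool` works across versions.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
# Sanity check: confirm the array shapes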
print('x.shape:')
print(x.shape)
print('y.shape:')
print(y.shape)

Vectorization...
x.shape:
(196240, 100, 88)
y.shape:
(196240, 88)


In [23]:
# Sanity check: print the character-to-index mapping for one sequence and its target
for i, sentence in enumerate(sentences[11:12]):
    print(i,':len(sentence)=', len(sentence),':', sentence)
    for t, char in enumerate(sentence):
        print(t, char,':char_indices=', char_indices[char])
    print(i,':next_chars=', next_chars[11], ':char_indices[next_chars]=', char_indices[next_chars[11]])

0 :len(sentence)= 100 : людей,
на безбрежный живой поток, поднятый
моей волей и мчащийся в никуда по багровой
закатной степи
0 л :char_indices= 62
1 ю :char_indices= 81
2 д :char_indices= 55
3 е :char_indices= 56
4 й :char_indices= 60
5 , :char_indices= 7
6 
 :char_indices= 0
7 н :char_indices= 64
8 а :char_indices= 51
9   :char_indices= 1
10 б :char_indices= 52
11 е :char_indices= 56
12 з :char_indices= 58
13 б :char_indices= 52
14 р :char_indices= 67
15 е :char_indices= 56
16 ж :char_indices= 57
17 н :char_indices= 64
18 ы :char_indices= 78
19 й :char_indices= 60
20   :char_indices= 1
21 ж :char_indices= 57
22 и :char_indices= 59
23 в :char_indices= 53
24 о :char_indices= 65
25 й :char_indices= 60
26   :char_indices= 1
27 п :char_indices= 66
28 о :char_indices= 65
29 т :char_indices= 69
30 о :char_indices= 65
31 к :char_indices= 61
32 , :char_indices= 7
33   :char_indices= 1
34 п :char_indices= 66
35 о :char_indices= 65
36 д :char_indices= 55
37 н :char_indices= 64
38 я :char_indices

In [24]:
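# Fill in the one-hot arrays: one `1` per character position in `x`,
# and one `1` per sequence in `y` for the character that follows it
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1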

## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier with a softmax over all possible characters. Note that 
recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in 
recent times.

In [25]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))




Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:
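
To make the loss concrete, the sketch below computes categorical crossentropy by hand for a single one-hot target. The helper `categorical_crossentropy_np` is a hypothetical NumPy stand-in written for this illustration; Keras's own implementation differs in details such as how predictions are normalized and clipped.

```python
import numpy as np

def categorical_crossentropy_np(y_true, y_pred, eps=1e-7):
    """Per-sample crossentropy between one-hot targets and predicted probabilities."""
    # Clip to avoid log(0) for classes predicted with zero probability
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One sample, four classes; the true class is index 2
y_true = np.array([[0., 0., 1., 0.]])
confident = np.array([[0.05, 0.05, 0.85, 0.05]])  # most mass on the true class
uniform = np.full((1, 4), 0.25)                   # no preference at all

loss_confident = categorical_crossentropy_np(y_true, confident)[0]
loss_uniform = categorical_crossentropy_np(y_true, uniform)[0]
print(loss_confident, loss_uniform)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true character, so the confident prediction scores lower than the uniform one. By the same reasoning, a model that spreads probability uniformly over the 88 characters of this corpus would score ln(88) ≈ 4.48.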

In [26]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               111104    
_________________________________________________________________
dense_1 (Dense)              (None, 88)                11352     
=================================================================
Total params: 122,456
Trainable params: 122,456
Non-trainable params: 0
_________________________________________________________________
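
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and a `Dense` layer with bias; the helper names `lstm_param_count` and `dense_param_count` are made up for this illustration.

```python
def lstm_param_count(input_dim, units):
    """Parameters of an LSTM layer with biases: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    """Parameters of a Dense layer with biases: weight matrix + bias vector."""
    return input_dim * units + units

# 88 unique characters feed a 128-unit LSTM, which feeds an 88-way softmax
lstm_params = lstm_param_count(input_dim=88, units=128)
dense_params = dense_param_count(input_dim=128, units=88)

print(lstm_params)                 # 111104
print(dense_params)                # 11352
print(lstm_params + dense_params)  # 122456
```

Note that `maxlen` does not appear in either formula: the LSTM reuses the same weights at every time step, so the sequence length affects compute cost but not model size.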


## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):

In [27]:
def sample(preds, temperature=1.0):
    """Draw a character index from `preds` after temperature reweighting."""
    preds = np.asarray(preds).astype('float64')
    # Rescale the log-probabilities: temperature < 1 sharpens the distribution, > 1 flattens it
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so the reweighted values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; the result is a one-hot row whose argmax is the sampled index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
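
To get a feel for what the temperature does, the sketch below isolates the reweighting step of the sampling function and applies it to a small made-up distribution. The helper `reweight_distribution` and the array `original` are names chosen for this illustration; the temperatures are the ones used in the training loop.

```python
import numpy as np

def reweight_distribution(preds, temperature=1.0):
    """Return `preds` reweighted by `temperature` and renormalized to sum to 1."""
    preds = np.asarray(preds, dtype='float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

# A made-up next-character distribution over four characters
original = np.array([0.5, 0.3, 0.15, 0.05])

for temperature in [0.2, 0.5, 1.0, 1.2]:
    reweighted = reweight_distribution(original, temperature)
    print(temperature, np.round(reweighted, 3))
```

A temperature of 1.0 leaves the distribution unchanged. Lower values concentrate probability on the most likely character, which makes sampling more predictable; higher values flatten the distribution, which makes sampling more surprising.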


Finally, this is the loop where we repeatedly train the model and generate text. After every epoch, we generate text using a range of 
different temperatures. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature on the sampling strategy.

In [28]:
import random
import sys

for epoch in range(1, 60):
    print('========================================================================================================')
    print('EPOCH:', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- RANDOMLY SELECTED PIECE OF TEXT::\n"' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ TEMPERATURE::', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

EPOCH: 1





Epoch 1/1
  4480/196240 [..............................] - ETA: 5:44 - loss: 3.3117

KeyboardInterrupt: 
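
The generation part of the loop above can also be factored into a standalone helper, which makes it possible to test the sampling logic without a trained model. This is a sketch under stated assumptions, not the notebook's code: `generate_text`, `predict_fn` and `uniform_predictor` are hypothetical names, `predict_fn` is assumed to map a one-hot array of shape `(1, len(seed), len(chars))` to a probability vector, and `np.random.default_rng` requires NumPy 1.17 or later.

```python
import numpy as np

def generate_text(predict_fn, seed, chars, n_chars=400, temperature=1.0, rng=None):
    """Generate `n_chars` characters that continue `seed`, one character at a time."""
    rng = np.random.default_rng() if rng is None else rng
    char_indices = {char: idx for idx, char in enumerate(chars)}
    window = seed
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current window of text
        sampled = np.zeros((1, len(window), len(chars)))
        for t, char in enumerate(window):
            sampled[0, t, char_indices[char]] = 1.0
        # 1) Get a distribution over the next character
        preds = np.asarray(predict_fn(sampled), dtype='float64')
        # 2) Reweight it by the temperature (the epsilon guards against log(0))
        logits = np.log(preds + 1e-12) / temperature
        probs = np.exp(logits - np.max(logits))
        probs /= probs.sum()
        # 3) Sample the next character index from the reweighted distribution
        next_char = chars[rng.choice(len(chars), p=probs)]
        # 4) Append the character and slide the window forward by one
        generated.append(next_char)
        window = window[1:] + next_char
    return ''.join(generated)

def uniform_predictor(one_hot_window):
    """Stand-in for a trained model: every character is equally likely."""
    n_classes = one_hot_window.shape[-1]
    return np.full(n_classes, 1.0 / n_classes)

toy_chars = sorted(set("abc "))
sample_text = generate_text(uniform_predictor, seed="abc ", chars=toy_chars,
                            n_chars=20, rng=np.random.default_rng(0))
print(repr(sample_text))
```

With the model from this notebook, `predict_fn` could be something like `lambda arr: model.predict(arr, verbose=0)[0]`, with a seed of exactly `maxlen` characters. Because the helper works on its own copy of the seed, each temperature starts from the same text rather than from the output of the previous temperature.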