# Generating Text with LSTM

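**ML models can learn the statistical latent space of images, music, and stories, then sample from that space to create new works with characteristics similar to the artworks the model saw in its training data.**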

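Such sampling is not in itself an act of artistic creation. It is a purely mathematical operation: the algorithm has no grounding in human life, human emotions, or our lived experience; instead, it learns from an experience that has little in common with ours. Only our interpretation gives meaning to what the model generates.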

## 1. How to Generate Sequence Data

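**<font color='crimson'>The general approach to generating sequence data with DL is to train a network (usually an RNN or a convnet) to predict the next token or next few tokens in a sequence, using the previous tokens as input.</font>**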

For example, given the input "the cat is on the ma", the network is trained to predict the target "t", the next character.
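
To make the input/target relationship concrete, here is a minimal sketch that slides a fixed-size window over a string and pairs each window with the character that follows it. The helper name `make_training_pairs` and the parameter `context_len` are illustrative assumptions, not part of the original notebook.

```python
def make_training_pairs(text, context_len):
    """Pair every window of `context_len` characters with the character after it."""
    pairs = []
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        target = text[i + context_len]
        pairs.append((context, target))
    return pairs


# With a 20-character context, the sentence yields exactly one pair.
pairs = make_training_pairs("the cat is on the mat", context_len=20)
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A shorter `context_len` produces more pairs from the same text, which is the idea the implementation section applies to the whole corpus.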

<div class="alert alert-block alert-info">
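    A token is typically a word or a character. Any network that can model the probability of the next token given the previous ones is called a <b><font color='red'>language model</font></b>.<br><br>
    <b>A language model captures the latent space of language, that is, its statistical structure.</b>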
</div>

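**<font color='blue'>Once such a language model is trained, you can sample from it, that is, generate new sequences. You feed it an initial string of text (called conditioning data), ask it to generate the next character or the next word (it can even generate several tokens at once), append the generated output to the input data, and repeat the process many times.</font>** This loop can generate sequences of arbitrary length that reflect the structure of the data the model was trained on: sequences that look almost like human-written sentences.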

![Character generation](figs/chap08-figs/character_by_character_text_generation.png)
<center><i>Character-by-character text generation using a language model</i></center>

<br>

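Here we use an LSTM layer, feed it strings of N characters extracted from a text corpus, and train the model to generate character N+1. The model's output is a softmax over all possible characters, that is, a probability distribution for the next character. This LSTM is called a <b><font color='red'>character-level neural language model</font></b>.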
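
As a small illustration of what "a softmax over all possible characters" means, the sketch below turns a vector of raw scores into a probability distribution. The three-character vocabulary and the score values are made up for the example; in the real model the scores come from the final `Dense` layer.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype="float64")
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


vocab = ["a", "e", "t"]        # hypothetical vocabulary
logits = [1.0, 2.0, 0.5]       # hypothetical raw scores, one per character
probs = softmax(logits)
for char, p in zip(vocab, probs):
    print(char, round(float(p), 3))
```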

## 2. The Importance of the Sampling Strategy

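When generating text, the way the next character is chosen is crucial.

- <b><font color='red'>Greedy sampling</font></b>: always choose the most likely next character.

  This approach produces repetitive, predictable strings that do not look like coherent language.


- <b><font color='red'>Stochastic sampling</font></b>: introduce randomness into the sampling process by drawing from the probability distribution over the next character.

  In this case, if the model assigns the character e a probability of 0.3 of coming next, it is chosen 30% of the time.

**Greedy sampling can also be viewed as sampling from a probability distribution: one in which a certain character has probability 1 and all other characters have probability 0.**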
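
The difference between the two strategies can be sketched in a few lines. The four-character vocabulary and its probabilities are invented for illustration, and the example uses NumPy's `Generator.choice` for the random draw rather than the `multinomial` call used later in the notebook.

```python
import numpy as np

chars = ["e", "a", "t", "s"]               # hypothetical vocabulary
probs = np.array([0.3, 0.4, 0.2, 0.1])     # hypothetical model output


def greedy_pick(probs):
    """Always return the index of the most likely character."""
    return int(np.argmax(probs))


def stochastic_pick(probs, rng):
    """Draw an index at random, weighted by the given probabilities."""
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
greedy_draws = [chars[greedy_pick(probs)] for _ in range(5)]
random_draws = [chars[stochastic_pick(probs, rng)] for _ in range(10000)]

# Greedy sampling repeats the same character; stochastic sampling
# picks "e" in roughly 30% of the draws.
freq_e = random_draws.count("e") / len(random_draws)
print(greedy_draws)
print(round(freq_e, 3))
```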

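<font color='crimson'>Sampling probabilistically from the model's softmax output is neat: it allows even unlikely characters to be sampled some of the time, producing more interesting-looking sentences and sometimes showing creativity by coming up with new, realistic-sounding words that never occurred in the training data. But this strategy has one problem: it offers no way to control the amount of randomness in the sampling process.</font>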


<div class="alert alert-block alert-info">
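    <center><b><font color='blue'>Why is some randomness needed?</font></b></center><Br>
    Consider two extreme cases:

- Pure random sampling: drawing the next character from a uniform probability distribution, in which every character is equally likely. This scheme has maximum randomness; in other words, the distribution has maximum entropy. Naturally, it produces nothing interesting.


- Greedy sampling. It does not produce anything interesting either. It has no randomness at all: the corresponding probability distribution has minimum entropy.


Sampling from the "real" probability distribution (the one output by the model's softmax function) is an intermediate point between these two extremes. But there are many other intermediate points with higher or lower entropy that you may want to explore. Lower entropy gives the generated sequences a more predictable structure (so they may look more realistic), whereas higher entropy yields more surprising and more creative sequences. <b>When sampling from generative models, it is always good to explore different amounts of randomness in the generation process.</b> We humans are the ultimate judges of whether generated data is interesting, so interestingness is highly subjective, and there is no way to know in advance where the point of optimal entropy lies.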
</div>
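
The entropy ordering described in the box can be checked numerically. The sketch below computes the Shannon entropy (in nats) of a uniform distribution, a one-hot "greedy" distribution, and an invented model-like distribution; the specific probability values are assumptions chosen for illustration.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    probs = np.asarray(probs, dtype="float64")
    nonzero = probs[probs > 0]
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))


uniform = np.full(4, 0.25)                    # pure random sampling
greedy = np.array([0.0, 1.0, 0.0, 0.0])       # greedy sampling as a one-hot distribution
model_like = np.array([0.3, 0.4, 0.2, 0.1])   # hypothetical softmax output

for name, dist in [("greedy", greedy), ("model", model_like), ("uniform", uniform)]:
    print(name, round(entropy(dist), 3))
```

The model-like distribution lands strictly between the two extremes, which is the intermediate point the text refers to.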

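**To control the amount of randomness in the sampling process**, we introduce a parameter called the <b><font color='red'>softmax temperature</font></b>, <font color='red'>which characterizes the entropy of the sampling distribution, that is, how surprising or how predictable the choice of the next character will be</font>.

**A higher temperature yields a sampling distribution with higher entropy, which generates more surprising and less structured data, whereas a lower temperature means less randomness and more predictable generated data.**

Given a temperature value, a new probability distribution is computed by reweighting the original distribution (the model's softmax output) as follows.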

```python
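import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # original_distribution: 1D NumPy array of probabilities that sum to 1
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Renormalize so the reweighted values sum to 1 again
    return distribution / np.sum(distribution)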
```
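
To see the effect of the temperature, the following sketch applies the same reweighting to an invented four-character distribution at several temperatures and reports the resulting entropy. The `entropy` helper and the probability values are assumptions added for the demonstration.

```python
import numpy as np


def reweight_distribution(original_distribution, temperature=0.5):
    """Same reweighting as above, repeated so this example runs on its own."""
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)


def entropy(probs):
    """Shannon entropy in nats (assumes all probabilities are positive)."""
    return float(np.sum(probs * np.log(1.0 / probs)))


original = np.array([0.3, 0.4, 0.2, 0.1])  # hypothetical softmax output
results = {}
for temperature in (0.2, 0.5, 1.0, 2.0):
    reweighted = reweight_distribution(original, temperature)
    results[temperature] = reweighted
    print(temperature, np.round(reweighted, 3), round(entropy(reweighted), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate mass on the most likely character, and values above 1.0 flatten the distribution toward uniform.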

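## 3. Implementation

First we need a large amount of text data from which to learn a language model. Any sufficiently large text file or set of text files will do.

Here, <font color='blue'>we use some of the writings of Nietzsche</font>. The language model learned will therefore be a model of Nietzsche's writing style and topics of choice, rather than a general model of the English language.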

In [1]:
import os
import random
import sys

import numpy as np
import tensorflow as tf

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# random.seed(42)
# np.random.seed(42)
# tf.random.set_seed(42)

In [2]:
# Download the data
path = tf.keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 600893


In [3]:
text[:100]

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'

In [4]:
chars = sorted(list(set(text)))  # all unique characters in the corpus
print('total chars:', len(chars))

# Map characters to indices
char_indices = dict((c, i) for i, c in enumerate(chars))
# Map indices to characters
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 57


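Next, extract partially overlapping sequences of length `maxlen`, one-hot encode them, and pack them into a 3D NumPy array of shape `(sequences, maxlen, unique_characters)`. At the same time, prepare an array `y` containing the corresponding targets: the one-hot-encoded characters that come right after each extracted sequence.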

In [5]:
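maxlen = 40      # extract sequences of 40 characters
step = 3         # sample a new sequence every 3 characters
sentences = []   # holds the extracted sequences
next_chars = []  # holds the targets (the following characters)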

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print("Number of sequences:", len(sentences))

print('Vectorization...')
x = np.zeros(shape=(len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros(shape=(len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 200285
Vectorization...
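
The windowing and one-hot encoding are easier to inspect on a tiny input. The sketch below repeats the same procedure on a toy string (the `toy_` names and the string itself are assumptions for illustration), decodes the first encoded window back to text, and confirms by arithmetic that a corpus of 600893 characters with `maxlen = 40` and `step = 3` yields the 200285 sequences reported above.

```python
import numpy as np

toy_text = "hello world"
toy_maxlen, toy_step = 4, 3
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window of toy_maxlen characters over the text, moving toy_step at a time.
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
windows = [toy_text[i:i + toy_maxlen] for i in starts]
targets = [toy_text[i + toy_maxlen] for i in starts]

# One-hot encode the windows and their targets.
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[targets[i]]] = True

# Decode the first window back to text as a sanity check.
decoded = "".join(toy_chars[j] for j in toy_x[0].argmax(axis=-1))
print(windows, targets, toy_x.shape, decoded)

# Number of windows for the real corpus size reported above.
num_sequences = len(range(0, 600893 - 40, 3))
print(num_sequences)
```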


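**<font color='crimson'>Given a trained model and a seed text snippet, new text can be generated by repeating the following steps:</font>**

1. Draw from the model a probability distribution for the next character, given the text generated so far

2. Reweight the distribution to a certain temperature

3. Sample the next character at random according to the reweighted distribution

4. Append the new character to the end of the text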
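
The four steps can be sketched end to end without a trained network. In the example below, the LSTM is replaced by a stand-in "model" built from add-one-smoothed bigram counts over a tiny made-up corpus; `bigram_probs`, `reweight`, and `generate` are hypothetical helper names, and the sketch only illustrates the loop, not the quality of text the real model produces.

```python
import numpy as np

# Stand-in "model" (an assumption for illustration): add-one-smoothed
# bigram counts over a tiny corpus, used in place of a trained LSTM.
corpus = "the cat sat on the mat. the rat sat on the cat."
vocab = sorted(set(corpus))
index = {c: i for i, c in enumerate(vocab)}
counts = np.ones((len(vocab), len(vocab)))
for prev_char, next_char in zip(corpus, corpus[1:]):
    counts[index[prev_char], index[next_char]] += 1


def bigram_probs(text_so_far):
    """Return next-character probabilities given only the last character."""
    row = counts[index[text_so_far[-1]]]
    return row / row.sum()


def reweight(probs, temperature):
    """Reweight a probability distribution by a softmax temperature."""
    logits = np.log(probs) / temperature
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / exp.sum()


def generate(predict_next_probs, seed_text, num_chars, temperature, rng):
    """Repeat the four generation steps num_chars times."""
    generated = seed_text
    for _ in range(num_chars):
        probs = predict_next_probs(generated)          # step 1: next-character distribution
        probs = reweight(probs, temperature)           # step 2: apply the temperature
        next_index = rng.choice(len(vocab), p=probs)   # step 3: sample at random
        generated += vocab[next_index]                 # step 4: append to the text
    return generated


rng = np.random.default_rng(0)
for temperature in (0.2, 1.0):
    print(temperature, repr(generate(bigram_probs, "th", 40, temperature, rng)))
```

Passing the predictor in as a function keeps the loop independent of the model, so the same structure applies when the predictor wraps `model.predict` on a one-hot-encoded window.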

In [6]:
def sample(preds, temperature=1.0):
    """对模型得到的原始概率分布进行重新加权，并从中抽取一个字符索引。"""
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # 多项式分布做 1 次实验，可能的概率为 preds
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [7]:
np.random.multinomial(1, [1/6]*6, 1)

array([[0, 0, 0, 0, 1, 0]])

In [8]:
sample([1/6]*6)

0

In [9]:
sample([1/6]*6)

5

In [10]:
def build_model():
    tf.keras.backend.clear_session()
    
    input_ = tf.keras.Input(shape=(maxlen, len(chars)))
    x = tf.keras.layers.LSTM(128)(input_)
    output_ = tf.keras.layers.Dense(len(chars), activation='softmax')(x)
    model = tf.keras.Model(inputs=input_, outputs=output_)

    # The targets are one-hot encoded
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
                  loss='categorical_crossentropy')
    model.summary()

    return model
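
The layer sizes in `build_model` determine the parameter counts that appear in the model summary printed during training. As a worked check, the sketch below recomputes them from the standard formulas for an LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias) and a dense layer; the helper function names are assumptions.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """One weight per input/output pair plus one bias per output unit."""
    return input_dim * units + units


num_chars = 57     # vocabulary size reported above
lstm_units = 128   # size of the LSTM layer in build_model

lstm_params = lstm_param_count(num_chars, lstm_units)
dense_params = dense_param_count(lstm_units, num_chars)
print(lstm_params, dense_params, lstm_params + dense_params)
```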

In [11]:
def on_epoch_end(epoch, _):
    """在每个 epoch 结束时调用的函数，用于打印生成的文本。"""
    if epoch not in [0, 1, 2, 10, 20, 30, 40, 50, 59]:
        return None

    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

In [12]:
print_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=on_epoch_end)

model = build_model()
model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 40, 57)]          0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               95232     
_________________________________________________________________
dense (Dense)                (None, 57)                7353      
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
Train on 200285 samples
Epoch 1/60
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "educe to good opinions of itself; it is "
educe to good opinions of itself; it is the still the sount of the sount of the evolitation of the sount of the connection of the conscience of the belongs of the still the still of the consequently the more proses and be

Epoch 9/60
Epoch 10/60
Epoch 11/60
----- Generating text after Epoch: 10
----- diversity: 0.2
----- Generating with seed: " he had himself. hence our
love for him,"
 he had himself. hence our
love for him, that the standard that the most power to the sense, and that the most sense and present that the world of the most sublime that the fact of the most sense and all the sense and the sense and the most sense, that the most sense, and the stand of the rests of the condition of the most look the most art of the rests and an account of the philosophers of the stands of the destruction of the sense, th
----- diversity: 0.5
----- Generating with seed: " he had himself. hence our
love for him,"
 he had himself. hence our
love for him, the same dream as is religious promption and finally that one so deleck that the belien the same of the condition of his supreciated and distrust" to regard to the sense and
sense of the philosophers, that the exaggouration of the least life of this is a point 

existed apprejudent, to determined to the inoccaed tim. the
customory. how cleatnem of the possible even the judge in individual in
----- diversity: 1.2
----- Generating with seed: "cription of forms of morality, notwithst"
cription of forms of morality, notwithstroas
that the
world agowe oc, thus  crinctions,
it has pertain,
soul or eveny is been
on canture advains burmulyp! danwer in resence, in the uniositate vonithe, one possible to a hering theu to them, agride-inificational and exerveveful
european certainly
aut"--lew thingful
cmotals rigorous?,
called to retained
only
on the wild herelie
"god", and occondensly only with give abmusming ies bektof--wh
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
----- Generating text after Epoch: 40
----- diversity: 0.2
----- Generating with seed: "tempts than the sensitive and pampered t"
tempts than the sensitive and pampered to the sense of the sense of the sense of the 



her influence, the vacity--or for
thethoh? does
has a philosophy case appear--they rathe
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
----- Generating text after Epoch: 50
----- diversity: 0.2
----- Generating with seed: " the carrying out of the willing,
to the"
 the carrying out of the willing,
to the present stand of the standard of the senses of the sense of the sense of the sense, and a presenting of the sense of the senses of the sense of the sense of the sense--it is a man of the soul of the soul that it is a present as the feeling of the sense--it is a stand as a precisely and state of the state of the sense of the sense of the soul of the sense of the sense of the sense of the sense of 
----- diversity: 0.5
----- Generating with seed: " the carrying out of the willing,
to the"
 the carrying out of the willing,
to the end and of the will of the soul of full
and and he can not interpreted that there is a 

<tensorflow.python.keras.callbacks.History at 0x7f1c086637b8>

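As can be seen:

- A low temperature yields extremely repetitive and predictable text, but the local structure (short phrases of a few words) is highly realistic; in particular, all the words are real words (a word being a local pattern of characters).

- As the temperature rises, the generated text becomes more interesting, more surprising, and even more creative, sometimes inventing entirely new words that sound somewhat plausible.

- At high temperatures, the local structure starts to break down, and most words look like semi-random strings of characters.

In this setup, a temperature of 0.5 produces the most interesting text.

**Always experiment with multiple sampling strategies! A well-chosen balance between learned structure and randomness is what makes generated sequences interesting.**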


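<font color='crimson'>Note that training a bigger model for longer on more data would produce samples that look more coherent and realistic than those above. But do not expect to generate any meaningful text, other than by chance. All that is being done here is sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is a distinction between what a message is about and the statistical structure in which it is encoded.</font>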

In [13]:
def build_model_2():
    tf.keras.backend.clear_session()
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Conv1D(128, 5, input_shape=(maxlen, len(chars))))
    model.add(tf.keras.layers.GlobalAveragePooling1D())
    model.add(tf.keras.layers.Dense(len(chars), activation='softmax'))
    
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    
    model.summary()
    return model

In [15]:
print_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=on_epoch_end)

model_2 = build_model()
model_2.fit(x, y,
            batch_size=128,
            epochs=2,
            callbacks=[print_callback])

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 40, 57)]          0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               95232     
_________________________________________________________________
dense (Dense)                (None, 57)                7353      
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
Train on 200285 samples
Epoch 1/2
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "rtist enlarges me, why should he not be "
rtist enlarges me, why should he not be a strivition of the sublime, and the world of the striving and interpreted and instinction and the consciousness of the sense of the sensations and the most another stronger themselv



boom, they
new a germos,
and whether the lamiste?


axi
----- diversity: 1.2
----- Generating with seed: "rtist enlarges me, why should he not be "
rtist enlarges me, why should he not be years and woap, is
turns, upon thing as too, can the thing as arisoum; are as
made few must
have tabish impowhesies!

293. their
charmturate? must need of traided--rightly sound. it is respectdom (so
indurced my of rhyven, in
hir
with evagobs of the rprobled
and oge,
been ohmere, ellawned
defectors, toin, must ,          agen--they nowadays! there
are renve philomoroly ambifist, mankind: irrines m
Epoch 2/2
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "and without disappointment, much, yes ne"
and without disappointment, much, yes never their striving, and also an artisten the present, there is a probably an another moral of the sense of the sublime, of the striving themselves of the subject, and the conscience of the end, and all the hereditation of the consci

<tensorflow.python.keras.callbacks.History at 0x7f193d1161d0>