# Anna KaRNNa

In this notebook, we'll build a character-wise RNN trained on Anna Karenina, one of my all-time favorite books. It'll be able to generate new text based on the text from the book.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and his [implementation in Torch](https://github.com/karpathy/char-rnn). It also draws on [this post at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and on [Sherjil Ozair's TensorFlow implementation](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use. Here I'm creating a couple of dictionaries to convert the characters to and from integers. Encoding the characters as integers makes them easier to use as input to the network.
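
As a small illustration of the round trip, the sketch below builds both dictionaries for a short sample string and checks that decoding recovers the original. Two things here are assumptions added for the example: `sample` is a hypothetical stand-in for the book text, and the character set is sorted first (the notebook's own cell uses the unsorted set) so that the mapping is the same on every run.

```python
import numpy as np

sample = "happy families"  # hypothetical stand-in for the book text

# Sorting is an addition for this sketch: it makes the mapping reproducible
vocab = sorted(set(sample))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))

# Characters -> integers, then integers -> characters again
encoded = np.array([vocab_to_int[c] for c in sample], dtype=np.int32)
decoded = "".join(int_to_vocab[int(i)] for i in encoded)

print(encoded)
print(decoded)
```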

In [18]:
with open('anna.txt', 'r') as f:
    text=f.read()
print("text length={}, {}".format(len(text), text[:100].replace("\n"," ")))
# Every distinct character in the book is one class for the network.
# Note: set ordering can differ between Python sessions, so these integer ids are not stable across runs.
vocab = set(text)
print("vocab length={}, {}".format(len(vocab), vocab))
# Lookup tables in both directions: character -> integer and integer -> character
vocab_to_int = {c: i for i, c in enumerate(vocab)}
print("vocab_to_int length={},  {}".format(len(vocab_to_int), vocab_to_int))
int_to_vocab = dict(enumerate(vocab))
print("int_to_vocab length={},  {}".format(len(int_to_vocab), int_to_vocab))
# Encode the whole text as an array of integer ids
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
print("encoded length={}, {}".format(len(encoded), encoded[:100]))

text length=1985223, Chapter 1   Happy families are all alike; every unhappy family is unhappy in its own way.  Everythin
vocab length=83, {'H', '/', 'I', 'f', 'l', 'd', 'x', 'M', '(', '0', '?', '4', 'F', 'S', 'k', 'g', '%', 'a', 'V', 'Q', 'v', 'Y', '_', '9', '.', 'P', ')', '!', 'e', ',', 'p', '5', 'm', 'o', 'j', 'R', 'G', 'q', 'w', 'B', 's', 'n', '8', 'K', ' ', ':', 'r', 'O', 'J', 'b', 'U', 'y', 'E', 'T', 'W', 'L', '3', '@', 'X', 'Z', ';', 't', '1', 'C', '\n', 'u', 'A', 'z', '2', '7', '`', "'", '&', '$', 'D', 'c', '6', 'h', 'N', '*', '"', '-', 'i'}
vocab_to_int length=83,  {'H': 0, '/': 1, 'I': 2, 'f': 3, 'l': 4, 'd': 5, 'x': 6, 'M': 7, '(': 8, '0': 9, '?': 10, '4': 11, 'F': 12, 'S': 13, 'k': 14, 'g': 15, '%': 16, 'a': 17, 'V': 18, 'Q': 19, 'v': 20, 'Y': 21, '_': 22, '9': 23, '.': 24, 'P': 25, ')': 26, '!': 27, 'e': 28, ',': 29, 'p': 30, '5': 31, 'm': 32, 'o': 33, 'j': 34, 'R': 35, 'G': 36, 'q': 37, 'w': 38, 'B': 39, 's': 40, 'n': 41, '8': 42, 'K': 43, ' ': 44, ':': 45, 'r': 46, 'O': 4

Let's check out the first 100 characters to make sure everything is peachy. According to the [American Book Review](http://americanbookreview.org/100bestlines.asp), this is the 6th best first line of a book ever.

In [15]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

And we can see the characters encoded as integers.

In [4]:
encoded[:100]

array([62, 66, 60, 18, 47, 22,  1, 12, 13, 76, 76, 76,  0, 60, 18, 18, 61,
       12, 63, 60, 80, 59, 70, 59, 22, 54, 12, 60,  1, 22, 12, 60, 70, 70,
       12, 60, 70, 59, 42, 22, 21, 12, 22, 67, 22,  1, 61, 12, 79, 20, 66,
       60, 18, 18, 61, 12, 63, 60, 80, 59, 70, 61, 12, 59, 54, 12, 79, 20,
       66, 60, 18, 18, 61, 12, 59, 20, 12, 59, 47, 54, 12, 71,  6, 20, 76,
        6, 60, 61, 17, 76, 76, 51, 67, 22,  1, 61, 47, 66, 59, 20], dtype=int32)

Since the network is working with individual characters, it's similar to a classification problem in which we are trying to predict the next character from the previous text.  Here's how many 'classes' our network has to pick from.

In [5]:
len(vocab)

83

## Making training mini-batches

Here is where we'll make our mini-batches for training. Remember that we want each batch to contain multiple sequences, each some desired number of sequence steps long. Considering a simple example, our batches would look like this:

<img src="assets/sequence_batching@1x.png" width=500px>

<br>
We have our text encoded as integers as one long array in `encoded`. Let's create a function that will give us an iterator for our batches. I like using [generator functions](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/) to do this. Then we can pass `encoded` into this function and get our batch generator.

The first thing we need to do is discard some of the text so we only have completely full batches. Each batch contains $N \times M$ characters, where $N$ is **the batch size (the number of sequences)** and $M$ is **the number of steps**. Then, to get the number of batches we can make from some array `arr`, divide the length of `arr` by the number of characters per batch, $N \times M$, rounding down. (The code below stores $N \times M$ in a variable named `batch_size`.) Once you know the number of batches and the characters per batch, you can get the total number of characters to keep.
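
Here is a worked instance of this arithmetic, using the text length printed earlier (1,985,223 characters) with $N = 10$ and $M = 50$. The divisor is the number of characters in one batch, $N \times M$, which matches `batch_size = n_seqs * n_steps` in `get_batches` below. The variable names are chosen for this sketch only.

```python
text_len = 1985223          # length printed by the loading cell above
n_seqs, n_steps = 10, 50    # N and M

chars_per_batch = n_seqs * n_steps           # N * M
n_batches = text_len // chars_per_batch      # full batches only
chars_kept = n_batches * chars_per_batch     # characters that survive trimming
chars_dropped = text_len - chars_kept        # leftover tail that is discarded

print(chars_per_batch, n_batches, chars_kept, chars_dropped)
```

So only the last 223 characters of the book are thrown away with these settings.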

After that, we need to split `arr` into $N$ sequences. You can do this using `arr.reshape(size)` where `size` is a tuple containing the dimension sizes of the reshaped array. We know we want $N$ sequences (`n_seqs` below), so let's make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size, and NumPy will infer the length that fits the data. After this, you should have an array that is $N \times (M * K)$ where $K$ is the number of batches.
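
A quick way to see what the reshape does is to try it on a toy array. In this sketch `arr` is a hypothetical 24-element stand-in for the encoded text, with $N = 2$ and $M = 3$, so there are $K = 4$ batches.

```python
import numpy as np

arr = np.arange(24)          # toy stand-in for the encoded text
n_seqs, n_steps = 2, 3       # N and M

# -1 lets NumPy infer the second dimension: 24 / 2 = 12 = M * K
rows = arr.reshape((n_seqs, -1))

print(rows.shape)
print(rows)
```

Each row is one long contiguous slice of the original array, so consecutive windows along a row are consecutive pieces of text.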

Now that we have this array, we can iterate through it to get our batches. The idea is that each batch is an $N \times M$ window on the array. For each subsequent batch, the window moves over by `n_steps`. We also want to create both the input and target arrays. Remember that the targets are the inputs shifted over by one character. The last target in each row would strictly be the first character of the next window; as a simplification, you'll usually see the first input character used as the last target character, so something like this:
```python
y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
```
where `x` is the input batch and `y` is the target batch.

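
To make the shift concrete, here is the same assignment applied to a tiny hypothetical input batch with two sequences of four steps each.

```python
import numpy as np

x = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])     # toy input batch, 2 sequences x 4 steps

y = np.zeros_like(x)
# Targets are the inputs shifted left by one; the last column wraps around
y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]

print(y)
```

Every target except the last in each row is the true next character; the last one is the wrapped-around first input.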

The way I like to implement this window is to use `range` to take steps of size `n_steps` from $0$ to `arr.shape[1]`, the total number of steps in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `n_steps` wide.
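
The windowing can be sketched on a toy array before writing the full generator. The array and sizes below are hypothetical and chosen only so the numbers are easy to follow.

```python
import numpy as np

arr = np.arange(24).reshape((2, -1))   # 2 sequences, 12 steps each
n_steps = 3

# Each value from range is the column where one window starts
starts = list(range(0, arr.shape[1], n_steps))
windows = [arr[:, n:n + n_steps] for n in starts]

print(starts)
print(windows[1])
```

Because the trimmed array's width is a multiple of `n_steps`, every window is exactly `n_steps` wide.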

> **Exercise:** Write the code for creating batches in the function below. The exercises in this notebook _will not be easy_. I've provided a notebook with solutions alongside this notebook. If you get stuck, check out the solutions. The most important thing is that you don't copy and paste the code into here; **type out the solution code yourself.**

In [23]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       n_seqs: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the batch size and number of batches we can make
    batch_size = n_seqs * n_steps
    n_batches = len(arr)//batch_size
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y
print("done")

done


Now I'll make my data sets and we can check out what's going on here. Here I'm going to use a batch size of 10 and 50 sequence steps.

In [24]:
batches = get_batches(encoded, 10, 50)
x, y = next(batches)

print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[63 77 17 30 61 28 46 44 62 64]
 [44 17 32 44 41 33 61 44 15 33]
 [20 82 41 24 64 64 80 21 28 40]
 [41 44  5 65 46 82 41 15 44 77]
 [44 82 61 44 82 40 29 44 40 82]
 [44  2 61 44 38 17 40 64 33 41]
 [77 28 41 44 75 33 32 28 44  3]
 [60 44 49 65 61 44 41 33 38 44]
 [61 44 82 40 41 71 61 24 44 53]
 [44 40 17 82  5 44 61 33 44 77]]

y
 [[77 17 30 61 28 46 44 62 64 64]
 [17 32 44 41 33 61 44 15 33 82]
 [82 41 24 64 64 80 21 28 40 29]
 [44  5 65 46 82 41 15 44 77 82]
 [82 61 44 82 40 29 44 40 82 46]
 [ 2 61 44 38 17 40 64 33 41  4]
 [28 41 44 75 33 32 28 44  3 33]
 [44 49 65 61 44 41 33 38 44 40]
 [44 82 40 41 71 61 24 44 53 77]
 [40 17 82  5 44 61 33 44 77 28]]


## Building the model

Below is where you'll build the network. We'll break it up into parts so it's easier to reason about each bit. Then we can connect them into the whole network.

<img src="assets/charRNN.png" width=500px>


### Inputs

First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for the dropout layers called `keep_prob`. This will be a scalar, that is, a 0-D tensor. To make a scalar, you create a placeholder without giving it a size.

> **Exercise:** Create the input placeholders in the function below.

In [25]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps])
    targets = tf.placeholder(tf.int32, [batch_size, num_steps])
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(tf.float32)
    
    return inputs, targets, keep_prob
print("done")

done


### LSTM Cell

Here we will create the LSTM cell we'll use in the hidden layer. We'll use this cell as a building block for the RNN. So we aren't actually defining the RNN here, just the type of cell we'll use in the hidden layer.

We first create a basic LSTM cell with

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

where `num_units` is the number of units in the hidden layer of the cell. Then we can add dropout by wrapping it with

```python
tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```
You pass in a cell and it will automatically add dropout to the inputs or outputs, depending on which keep probability you set (here `output_keep_prob`). Finally, we can stack up the LSTM cells into layers with [`tf.contrib.rnn.MultiRNNCell`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/rnn/MultiRNNCell). With this, you pass in a list of cells and it will send the output of one cell into the next cell. For example,

```python
tf.contrib.rnn.MultiRNNCell([cell]*num_layers)
```

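This might look a little weird if you know Python well, because this will create a list that refers to the same `cell` object `num_layers` times. However, in TensorFlow 1.0, the version the documentation links here point to, TensorFlow will create different weight matrices for each layer. Later 1.x releases handle a reused cell object differently, so creating one cell object per layer is the more portable pattern. Even though this is actually multiple LSTM cells stacked on each other, you can treat the multiple layers as one cell.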
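
The Python behavior mentioned above can be seen without TensorFlow. In this sketch, `Cell` is a hypothetical stand-in class, not a TensorFlow type; it only shows that multiplying a list repeats references to one object, while a list comprehension builds separate objects.

```python
class Cell:
    """Hypothetical stand-in for an RNN cell object (not a TensorFlow class)."""


num_layers = 3
cell = Cell()

shared = [cell] * num_layers                     # three references to one object
separate = [Cell() for _ in range(num_layers)]   # three distinct objects

# Count how many different objects each list actually holds
n_shared = len({id(c) for c in shared})
n_separate = len({id(c) for c in separate})

print(n_shared, n_separate)
```

Whether the layers end up with separate weights therefore depends on how the library treats a repeated object, which is why the version matters here.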

We also need to create an initial cell state of all zeros. This can be done like so

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```

> **Exercise:** Below, implement the `build_lstm` function to create these LSTM cells and the initial state.

In [26]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM cell
    def build_cell():
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout to the cell outputs
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning.
    # One cell object per layer keeps each layer's weights separate,
    # which [drop]*num_layers is only expected to do on TensorFlow 1.0.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return cell, initial_state
print("done")

done


### RNN Output

Here we'll create the output layer. We need to connect the output of the RNN cells to a fully connected layer with a softmax output. The softmax output gives us a probability distribution we can use to predict the next character, so we want this layer to have size $C$, the number of classes/characters we have in our text.

If our input has batch size $N$, number of steps $M$, and the hidden layer has $L$ hidden units, then the output is a 3D tensor with size $N \times M \times L$. The output of each LSTM cell has size $L$; we have $M$ of them, one for each sequence step, and we have $N$ sequences. So the total size is $N \times M \times L$.

We are using the same fully connected layer, with the same weights, for each of the outputs. To make things easier, we should reshape the outputs into a 2D tensor with shape $(M * N) \times L$. That is, one row for each sequence and step, where the values of each row are the output from the LSTM cells. The function treats the LSTM output, `lstm_output`, as a list of tensors. First we need to concatenate this whole list into one array with [`tf.concat`](https://www.tensorflow.org/api_docs/python/tf/concat); if the outputs already arrive as a single $N \times M \times L$ tensor, as they should from `tf.nn.dynamic_rnn`, this step leaves them unchanged. Then, reshape it (with `tf.reshape`) to size $(M * N) \times L$.
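
The shapes described here can be checked with plain NumPy. This is only a sketch of the reshape and the shared fully connected layer, not the TensorFlow code from the exercise; the sizes and the random values standing in for the LSTM outputs are assumptions for illustration.

```python
import numpy as np

N, M, L, C = 2, 3, 4, 5     # sequences, steps, hidden units, classes
rng = np.random.default_rng(0)

lstm_output = rng.normal(size=(N, M, L))   # stand-in for the RNN outputs
x = lstm_output.reshape(-1, L)             # (M*N) x L, one row per sequence step

softmax_w = rng.normal(scale=0.1, size=(L, C))   # one weight matrix shared by every row
softmax_b = np.zeros(C)

logits = x @ softmax_w + softmax_b         # (M*N) x C

# Numerically stable softmax over the class dimension
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(x.shape, logits.shape, probs.shape)
```

Row `n * M + m` of `x` holds the output for sequence `n` at step `m`, and each row of `probs` sums to one.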

Once we have the outputs reshaped, we can do the matrix multiplication with the weights. We need to wrap the weight and bias variables in a variable scope with `tf.variable_scope(scope_name)` because there are weights being created in the LSTM cells. TensorFlow will throw an error if the weights created here have the same names as the weights created in the LSTM cells, which they will by default. To avoid this, we wrap the variables in a variable scope so we can give them unique names.

> **Exercise:** Implement the output layer in the function below.

In [27]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: List of output tensors from the LSTM layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # Concatenate lstm_output over axis 1 (the columns)
    seq_output = tf.concat(lstm_output, axis=1)
    # Reshape seq_output to a 2D tensor with lstm_size columns
    x = tf.reshape(seq_output, [-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.add(tf.matmul(x, softmax_w), softmax_b)
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits)
    
    return out, logits
print("done")

done


### Training loss

Next up is the training loss. We get the logits and targets and calculate the softmax cross-entropy loss. First we need to one-hot encode the targets, since we're getting them as integer-encoded characters. Then, reshape the one-hot targets into a 2D tensor with size $(M*N) \times C$ where $C$ is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with $C$ units. So our logits will also have size $(M*N) \times C$.

Then we run the logits and targets through `tf.nn.softmax_cross_entropy_with_logits`, which returns one loss value per row, and take the mean to get the loss.
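
The same computation can be sketched in NumPy to make the shapes and the final number easier to inspect. This is a hand-written equivalent for illustration, not the TensorFlow op itself; the toy targets and the all-zero logits are assumptions chosen so the result is easy to verify.

```python
import numpy as np

N, M, C = 2, 3, 4
targets = np.array([[0, 1, 2],
                    [3, 0, 1]])              # N x M integer targets
logits = np.zeros((N * M, C))                # all-zero logits: a uniform prediction

y_one_hot = np.eye(C)[targets]               # N x M x C
y_reshaped = y_one_hot.reshape(logits.shape) # (M*N) x C, matching the logits

# Softmax cross-entropy per row (via a stable log-softmax), then the mean
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -(y_reshaped * log_probs).sum(axis=1).mean()

print(y_one_hot.shape, y_reshaped.shape, loss)
```

With all-zero logits the prediction is uniform, so the loss is $\ln C$. For the 83 characters here that is about 4.42, close to the first training loss printed later in this notebook.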

> **Exercise:** Implement the loss calculation in the function below.

In [28]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per sequence per step
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    return loss
print("done")

done


### Optimizer

Here we build the optimizer. Normal RNNs have issues with exploding and vanishing gradients. LSTMs fix the vanishing problem, but the gradients can still grow without bound. To fix this, we can clip the gradients above some threshold. The code below uses `tf.clip_by_global_norm`: if the combined norm of all the gradients is larger than the threshold, every gradient is scaled down by the same factor so that the combined norm equals the threshold. This will ensure the gradients never grow overly large. Then we use an `AdamOptimizer` for the learning step.
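
The `build_optimizer` cell below uses `tf.clip_by_global_norm`, which rescales all gradients jointly rather than capping each value on its own. Here is a minimal NumPy sketch of that rule; the helper is a hypothetical re-implementation written for illustration, not the library function, and the toy gradients are made up.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm is 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(norm, clipped_norm)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length shrinks.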

In [29]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Maximum global norm of the gradients before they are rescaled
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer
print("done")

done


### Build the network

Now we can put all the pieces together and build a class for the network. To actually run data through the LSTM cells, we will use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/dynamic_rnn). This function will pass the hidden and cell states across LSTM cells appropriately for us. It returns the outputs for each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as `final_state` so we can pass it to the first LSTM cell in the next mini-batch run. For `tf.nn.dynamic_rnn`, we pass in the cell and initial state we get from `build_lstm`, as well as our input sequences. Also, we need to one-hot encode the inputs before they go into the RNN.

> **Exercise:** Use the functions you've implemented previously and `tf.nn.dynamic_rnn` to build the network.

In [30]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # Build the LSTM cell
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs, num_classes)
        
        # Run each sequence step through the RNN with tf.nn.dynamic_rnn 
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=self.initial_state)
        self.final_state = state
        
        # Get softmax predictions and logits
        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)
        
        # Loss and optimizer (with gradient clipping)
        self.loss =  build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer =  build_optimizer(self.loss, learning_rate, grad_clip)
print("done")

done


## Hyperparameters

Here are the hyperparameters for the network.

* `batch_size` - Number of sequences running through the network in one pass.
* `num_steps` - Number of characters in the sequence the network is trained on. Larger is typically better, since the network will learn more long-range dependencies, but it takes longer to train. 100 is typically a good number here.
* `lstm_size` - The number of units in the hidden layers.
* `num_layers` - Number of hidden LSTM layers to use.
* `learning_rate` - Learning rate for training.
* `keep_prob` - The dropout keep probability when training. If your network is overfitting, try decreasing this.

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

> ### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `lstm_size` and `num_layers`. I would advise that you always use `num_layers` of either 2/3. The `lstm_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `lstm_size` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.

In [31]:
batch_size = 100        # Sequences per batch
num_steps = 100          # Number of sequence steps per batch
lstm_size = 512         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.001    # Learning rate
keep_prob = 0.5         # Dropout keep probability
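
Karpathy's tips above compare the number of parameters with the size of the dataset. As a rough sketch of that comparison for the settings in this cell, the count below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) plus the softmax layer. `lstm_params` is a hypothetical helper, and the exact number TensorFlow reports could differ.

```python
def lstm_params(input_size, hidden_size):
    """Weights and biases for one standard LSTM layer (four gates)."""
    return 4 * (input_size + hidden_size + 1) * hidden_size

num_classes, lstm_size, num_layers = 83, 512, 2

# The first layer sees one-hot characters; later layers see the previous layer's output
layer_inputs = [num_classes] + [lstm_size] * (num_layers - 1)
rnn_params = sum(lstm_params(n_in, lstm_size) for n_in in layer_inputs)
softmax_params = lstm_size * num_classes + num_classes
total = rnn_params + softmax_params

print(total)
```

That is about 3.4 million parameters against roughly 2 million characters of text, so the two are within the same order of magnitude.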

## Time for training

This is typical training code, passing inputs and targets into the network, then running the optimizer. Here we also get back the final LSTM state for the mini-batch. Then, we pass that state back into the network so the next batch can continue the state from the previous batch. And every so often (set by `save_every_n`) I save a checkpoint.

Here I'm saving checkpoints with the format

`i{iteration number}_l{# hidden layer units}.ckpt`
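
As a quick sanity check on this naming scheme, the sketch below works out how many training steps the settings in this notebook produce and what the final checkpoint should be called. The numbers are copied from the hyperparameter and training cells; the variable names are only for this example.

```python
text_len = 1985223              # characters in the encoded text
batch_size, num_steps = 100, 100
epochs, lstm_size = 20, 512

# Full batches per pass over the text, then over all epochs
batches_per_epoch = text_len // (batch_size * num_steps)
total_steps = batches_per_epoch * epochs
final_checkpoint = "checkpoints/i{}_l{}.ckpt".format(total_steps, lstm_size)

print(batches_per_epoch, total_steps, final_checkpoint)
```

The result matches the `model_checkpoint_path` listed in the saved checkpoints output further down.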

> **Exercise:** Set the hyperparameters above to train the network. Watch the training loss; it should be consistently dropping. Also, I highly advise running this on a GPU.

In [33]:
epochs = 20
# Save every N iterations
save_every_n = 200

model = CharRNN(len(vocab), 
                batch_size=batch_size, 
                num_steps=num_steps,
                lstm_size=lstm_size, 
                num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x, model.targets: y, model.keep_prob: keep_prob, model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, model.final_state, model.optimizer],  feed_dict=feed)
            
            end = time.time()
            print('Epoch: {}/{}... '.format(e+1, epochs),
                  'Training Step: {}... '.format(counter),
                  'Training loss: {:.4f}... '.format(batch_loss),
                  '{:.4f} sec/batch'.format((end-start)))
        
            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))
    
    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

Epoch: 1/20...  Training Step: 1...  Training loss: 4.4191...  0.1308 sec/batch
Epoch: 1/20...  Training Step: 2...  Training loss: 4.3261...  0.1070 sec/batch
Epoch: 1/20...  Training Step: 3...  Training loss: 3.8367...  0.1004 sec/batch
Epoch: 1/20...  Training Step: 4...  Training loss: 4.8918...  0.1088 sec/batch
Epoch: 1/20...  Training Step: 5...  Training loss: 4.1754...  0.0941 sec/batch
Epoch: 1/20...  Training Step: 6...  Training loss: 3.8827...  0.1005 sec/batch
Epoch: 1/20...  Training Step: 7...  Training loss: 3.7040...  0.1072 sec/batch
Epoch: 1/20...  Training Step: 8...  Training loss: 3.5549...  0.1280 sec/batch
Epoch: 1/20...  Training Step: 9...  Training loss: 3.4717...  0.1214 sec/batch
Epoch: 1/20...  Training Step: 10...  Training loss: 3.4291...  0.1014 sec/batch
Epoch: 1/20...  Training Step: 11...  Training loss: 3.3857...  0.1286 sec/batch
Epoch: 1/20...  Training Step: 12...  Training loss: 3.3840...  0.1295 sec/batch
Epoch: 1/20...  Training Step: 13... 
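
A quick sanity check on the log above, sketched under one assumption: that the reported loss is the mean cross-entropy in nats (the loss definition is not shown in this section). A network that spreads its probability evenly over V characters scores ln V, so the very first loss hints at the vocabulary size. The helper names are hypothetical, and the exponential moving average is just one simple way to see the trend through batch-to-batch noise such as the jump at step 4.

```python
import math

# First twelve training losses copied from the log above
losses = [4.4191, 4.3261, 3.8367, 4.8918, 4.1754, 3.8827,
          3.7040, 3.5549, 3.4717, 3.4291, 3.3857, 3.3840]


def uniform_baseline(vocab_size):
    """Cross-entropy (in nats) of guessing uniformly over vocab_size characters."""
    return math.log(vocab_size)


def smooth(values, beta=0.9):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    avg = values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed


# Vocabulary size implied by the first loss, if the untrained network is near uniform
implied_vocab = math.exp(losses[0])
print("implied vocabulary size: {:.1f}".format(implied_vocab))
print("smoothed losses:", ["{:.3f}".format(v) for v in smooth(losses)])
```

If the first loss is far above `uniform_baseline(len(vocab))`, that usually points to an initialization or data problem rather than slow learning.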

#### Saved checkpoints

Read up on saving and loading checkpoints here: https://www.tensorflow.org/programmers_guide/variables

In [34]:
tf.train.get_checkpoint_state('checkpoints')

model_checkpoint_path: "checkpoints/i3960_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i1000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i1200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i1400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i1600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i1800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i2000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i2200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i2400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i2600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i2800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i3000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i3200_l512.ckpt"
all_model_checkpoint_pa
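
The listing follows directly from the training loop: one checkpoint every `save_every_n` steps plus a final save after the last step. The sketch below reconstructs that schedule; `expected_checkpoints` is a hypothetical helper, and the values 3960, 200, and 512 are read off the output and the training cell above.

```python
def expected_checkpoints(total_steps, save_every_n, lstm_size):
    """List the checkpoint paths the training loop would write, in order."""
    # Periodic saves inside the loop
    steps = list(range(save_every_n, total_steps + 1, save_every_n))
    # Final save after the loop; same file name if total_steps is a multiple
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    return ["checkpoints/i{}_l{}.ckpt".format(s, lstm_size) for s in steps]


paths = expected_checkpoints(total_steps=3960, save_every_n=200, lstm_size=512)
print(len(paths))
print(paths[0], "...", paths[-1])
```

With these numbers the schedule has 20 entries, well under `max_to_keep=100`, so the saver never discards an older checkpoint.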

## Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character, and the network predicts the next character. We then feed that new character back in to predict the one after it, and keep doing this to generate entirely new text. I also included some functionality to prime the network: you pass in a string, and the network builds up a state from it before generating.
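
The prime-then-generate loop can be sketched without a trained network. Everything in the block below is a hypothetical toy: `TOY_TABLE` is a made-up table of next-character probabilities over a two-letter alphabet, and `toy_predict` stands in for one step of the real model, returning predictions and an updated state.

```python
import random

# Hypothetical toy "model": fixed next-character probabilities
TOY_TABLE = {
    "a": {"b": 0.9, "a": 0.1},
    "b": {"a": 0.7, "b": 0.3},
}


def toy_predict(char, state):
    """Stand-in for one network step: returns (next-char probabilities, new state)."""
    new_state = state + 1  # here the state only counts the characters seen
    return TOY_TABLE[char], new_state


def toy_sample(prime, n_samples, seed=0):
    if not prime:
        raise ValueError("prime must contain at least one character")
    rng = random.Random(seed)
    samples = list(prime)
    state = 0
    # Priming: run the prime text through the model to build up a state
    for c in prime:
        probs, state = toy_predict(c, state)
    # Generation: sample a character, then feed it back in as the next input
    for _ in range(n_samples):
        chars = list(probs)
        c = rng.choices(chars, weights=[probs[ch] for ch in chars])[0]
        samples.append(c)
        probs, state = toy_predict(c, state)
    return "".join(samples)


print(toy_sample("ab", 10))
```

The real network differs only in what sits behind the predict step: an LSTM whose state summarizes everything it has read so far.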

The network gives us predictions for each character in the vocabulary. To reduce noise and make things a little less random, I'm only going to choose a new character from the top N most likely characters.

In [35]:
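def pick_top_n(preds, vocab_size, top_n=5):
    # Drop singleton dimensions so p is a 1-D array over the vocabulary
    p = np.squeeze(preds)
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Sample one character index from the filtered distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c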
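
To make the filtering step concrete, here is a NumPy-free restatement applied to a made-up five-character distribution. `keep_top_n` is a hypothetical name, and the probabilities are invented for illustration.

```python
def keep_top_n(probs, top_n):
    """Zero all but the top_n largest probabilities, then renormalize."""
    # Indices sorted from least to most likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i])
    keep = set(ranked[-top_n:])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


probs = [0.10, 0.40, 0.05, 0.30, 0.15]
filtered = keep_top_n(probs, top_n=2)
print(filtered)
```

Only the two most likely entries survive, rescaled to 0.4/0.7 and 0.3/0.7, so unlikely characters can never be drawn.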

In [36]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming characters
    samples = [c for c in prime]
    model = CharRNN(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Load the trained weights
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Feed the prime text one character at a time to build up the state
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

        # First generated character, predicted from the last prime character
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])
        
    return ''.join(samples)

Here, pass the path to a checkpoint into `sample` to generate text from the network.

In [37]:
tf.train.latest_checkpoint('checkpoints')

'checkpoints/i3960_l512.ckpt'

In [38]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

Farrance, was
insisting and as it was a little.

They were not for anywhere and take a promise, to see him as to be displayed
and a long while of tenterness. She seemed to stop it. And all the
promosents ought to think of them, waring to the doctor, he went
into the days of all the strength and they were standing. "That's not
when your suppress will be so much. He can think of the ploughing
tries to see me so wording in my contrary. I don't could be so assumate
in any man."

"Well, then I shale seem in my child, and I chuldrong in all the whole
same than yourself, and he's not sister, that that would be silent,
tell me that you may be so left into the party. And this was it a sight
of the propostine of all to humor.... What is a stepar are to me that
I consider that I should not be disturbed with her. I stay them. But
that's that is that it, it's answer and to say to the side of himself, and
there were a service of such a party of answer. The might make how the
servants and all she has

In [39]:
checkpoint = 'checkpoints/i200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fard ito th at touthe and te warer,. 
oho he and th the hhe he sin tas tha weran ho asing wor he sane heren hod, wid hhe wer than wot he shime won te ther,, hod he hhe serese sore ant his she the shed whe sarins ond heser, and had terit ans ath and, art al shering thin tar tha whimis on th wasd sit he wiss ount on ou ho he ardim hor, har thas at outhes
tous, the sores anthas. An as the as ande whos shing on he thede th wer otiter shes so os oute he alin the sis ar and he he sher on the the he whe witon he wang and tha ad tind hhe
ser tas sor wang and and ans ant ot hore se hare he tand the
san to simes or the whime
tit has and tas sar anet, than sared,, tha sot and the sime thorere ta shere an heren ot har ans orasithe the the sile the ased
toustit an ter hhe
sasid an ho an hhare wha he ther heed ated al tho hiss wat her sothe he has and asd an hot ans aset an or hhis ange the wherind wand he sosan her ar the whad are to te the wasthe an to te set he shes he whe shis ood sans atorins. 

In [40]:
checkpoint = 'checkpoints/i600_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Faring, her ald the sonte word on she cumtinn of stere, the someting and
andestice had the sarded, and as had to the
perincing
wers of his allisithers. Hus she comed a charst of she was that ho contersed theme her his, and hume as heard the
cheathous of her timents whomess, and as had a tho dins with she seeling his was ta tares at to a steling of
the chate him has and the pesteress in her to this
atenesser, with hir, a thing
ald the sarces it
whe the soond was in the senter a sand was how shate than the peaser, who comes take to
shice at the camlerses. He sould at that her him. The
rastion to ham to him stonditg to his
aberted the seaters which who he cond op timest and then.

He had he sald him tole that his about the pillaster, and shating have of his foce onthe thoug her sace and sampition.

"You said, stictlest, to sang the some, while that was a ling that ally
with her told and his beghaters, and the porsining wore, and wish to should ste he was to drase
was to atered at and that

In [30]:
checkpoint = 'checkpoints/i1200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fardy
him, but to the more to alr thissing in the sent to hus
say to
to be is heart on its, and all any, what it sumpressed becimen that
she wanted to the stroughest of the canter, and without this, and witt a spress
the bast on the mone stance, she was said a diffient of his sorm of
the searing with his hore, and to his face with a marry was
as the princess of the state, which said his haste of his feeling to his
stand toother,
starding that he could not huld the his hand as a luct as to a solt their
husbond, who heart the marere of this peased--whe mashe to say their
wonder of the same andicultrain to the much while though and attered his
ene and had been at the mone and so suthing with seeming of
the
pettirugion of her suct and stepling at the mirst, he wanting. He had say
seresticing at thrigg at hour at him will breas too, as this
in the rand he heart, and his starded in the same of the real though had
a
preamor, taked
it, and the beath of her she had been happy, and welk to thilk

In [31]:
checkpoint = 'checkpoints/i2000_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farding
the posertance of the high trat would back that the hand with a clover to a coust as
he
was all a sate in the count sending all with the country.

"Well, what along, the cheets," said Anna.

"It's a generant of the labor of all," said the same smire, which was
commoned with which he was so little attange in to him, and were
streem in his hasse, before it in his wife.

"I did not lade it, a mund of their cours of her, and a door as he selm
she had been the simely of the moment and things, as that we couldn't
go and she saw that? How is the princess were true to brought the clear,
and the more of sonien, to be stood on in the prince, and had to spak in a
brother settled in a streng in spinish. But, but he was time that was not
so she was to a pensing as it was a man taken in all the stand, that
that is needed to the pate, was such it to the prover of the stands.



Countess Bordsension that he was a geast was that he was anything so that
she saw them to be an indident the positio

In [32]:
checkpoint = 'checkpoints/i3000_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farrace he had sooned
him.

"I can't the marrale then what's the province. I say so it's an and of
her feers in howrer, as you were not into the carriage to her, I shall
can any thee and marralen of me to tear to be in anything."

The middle of his soft, went into what she was attempted to them, and
he was stronger at the same a creetures of his shirt, and went on to
the states, though he was doing her tears, hid his face with which had been
calming to her all of the composure to her hands it.

And seemed to the principle of his carming, who strongen so told him that
he should never say that there alone went to the prince and this
words were not between the carriage, and she had said to his hands to
the same starrations of his fasting that he could not came to the
path.

They were naid as a people, and he was telling the care of any calriage
and shaming, and were too stopping and so as he had not telling her son.
The disconfent of the same and had been to be asked, and that was netted
