# 金瓶梅 (Jin Ping Mei)

In this notebook, I'll build a character-wise RNN trained on the classical Chinese novel Jin Ping Mei (金瓶梅). It'll be able to generate new text based on the text from the book.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn); the original code was provided by Mat at Udacity. Further information can be found [here at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from [Sherjil Ozair](https://github.com/sherjilozair/char-rnn-tensorflow). All of the data comes from the novel Jin Ping Mei.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

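The first thing to do is map every Chinese character in Jin Ping Mei to a distinct integer. We create a dictionary that pairs each character in the text, punctuation included, with an integer, and then convert the whole text from a string into a vector of integers.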

In [2]:
with open('jinpingmei.txt', 'r') as f:
    text=f.read()
vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
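
To make the mapping concrete, here is a minimal sketch of the same encode/decode round trip on a toy string. The string `toy_text` and all `toy_`-prefixed names are made up for illustration; the logic mirrors the cell above.

```python
import numpy as np

# Hypothetical toy text standing in for the novel
toy_text = "西门庆道：西门庆"

# Same construction as above: sorted unique characters -> integer ids
toy_vocab = sorted(set(toy_text))
toy_vocab_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_to_vocab = dict(enumerate(toy_vocab))

toy_encoded = np.array([toy_vocab_to_int[c] for c in toy_text], dtype=np.int32)

# Decoding the integers recovers the original string
toy_decoded = ''.join(toy_int_to_vocab[i] for i in toy_encoded)
print(len(toy_vocab), toy_encoded, toy_decoded == toy_text)
```

The repeated characters get the same id each time, which is why the vocabulary is much smaller than the text.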

Let's list the first 100 characters of the novel to see what the source data looks like.

In [3]:
text[:100]

'第一回\u3000西门庆热结十弟兄\u3000武二郎冷遇亲哥嫂 \n \n\u3000\u3000诗曰： \n\n\u3000\u3000\u3000\u3000豪华去后行人绝，箫筝不响歌喉咽。雄剑无威光彩沉，宝琴零落金星灭。 \n\u3000\u3000\u3000\u3000玉阶寂寞坠秋露，月照当时歌舞处。当时歌舞人不回，化'

After encoding, the whole novel has become one very long vector, and each integer in it corresponds to a character in the dictionary.

In [4]:
encoded[:100]

array([2831,   41,  773,   34, 3539, 4068, 1204, 2328, 2964,  484, 1249,
        308,   34, 2037,  109, 3951,  361, 3921,  130,  671,  979,    1,
          0,    1,    0,   34,   34, 3606, 1818, 4459,    1,    0,    0,
         34,   34,   34,   34, 3668,  490,  533,  583, 3468,  132, 2971,
       4456, 2861, 2844,   49,  662, 2032,  716,  654,   36, 4150,  424,
       1756,  946,  312, 1267, 2108, 4456, 1024, 2477, 4162, 3339, 3993,
       1780, 2291,   36,    1,    0,   34,   34,   34,   34, 2444, 4108,
       1044, 1054,  813, 2756, 4177, 4456, 1826, 2342, 1261, 1767, 2032,
       3221,  863,   36, 1261, 1767, 2032, 3221,  132,   49,  773, 4456,
        470])

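Since we work with one character at a time, training is really a classification problem. Our dictionary holds 4,464 distinct characters (counted in the next cell), so there are 4,464 classes. The training goal is to predict the next character accurately from the text seen so far, and then to use those predictions to generate a new novel.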

In [5]:
len(vocab)

4464

## Splitting the data into batches

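Training relies on the GPU's parallel pipeline, so during training the data is held temporarily in GPU memory. My GPU has limited memory, only 2 GB, so the data has to be processed in batches; otherwise the memory would blow up.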



<br>
To make the computation convenient, we also arrange the elements of the vector into a 2D matrix as we batch them. We define two values, `n_seqs` and `n_steps`. `n_seqs` is the number of sequences, that is, how many rows there are in one batch. `n_steps` is the number of steps, that is, how many characters each row holds. One batch therefore contains `n_seqs * n_steps` characters. Since we split the whole novel into batches of that size, the leftover at the end will most likely be smaller than a full batch. In this project we discard it (some models instead pad the remainder with zeros).

We have our text encoded as integers as one long array in `encoded`. Let's create a function that will give us an iterator for our batches. I like using [generator functions](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/) to do this. Then we can pass `encoded` into this function and get our batch generator.

The first thing we need to do is discard some of the text so we only have completely full batches. Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences) and $M$ is the number of steps. Then, to get the number of batches we can make from some array `arr`, you divide the length of `arr` by the number of characters per batch, $N \times M$. Once you know the number of batches and the number of characters per batch, you can get the total number of characters to keep.

After that, we need to split `arr` into $N$ sequences. You can do this using `arr.reshape(size)` where `size` is a tuple containing the dimension sizes of the reshaped array. We know we want $N$ sequences (`n_seqs` below), so let's make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size; it'll fill up the array with the appropriate data for you. After this, you should have an array that is $N \times (M * K)$ where $K$ is the number of batches.

Now that we have this array, we can iterate through it to get our batches. The idea is each batch is an $N \times M$ window on the array. For each subsequent batch, the window moves over by `n_steps`. We also want to create both the input and target arrays. Remember that the targets are the inputs shifted over one character. You'll usually see the first input character used as the last target character, so something like this:
```python
y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
```
where `x` is the input batch and `y` is the target batch.

The way I like to do this window is to use `range` to take steps of size `n_steps` from $0$ to `arr.shape[1]`, the total number of steps in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `n_steps` wide.

In [6]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       n_seqs: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and number of batches we can make
    characters_per_batch = n_seqs * n_steps
    n_batches = len(arr)//characters_per_batch
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * characters_per_batch]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y
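
To see the trimming, reshaping, and windowing on something small enough to read, here is a sketch that runs the generator on a toy array. The function body is a compact copy of the cell above so the example runs on its own; the array `toy` and the sizes are made up for illustration.

```python
import numpy as np

def get_batches(arr, n_seqs, n_steps):
    # Compact copy of the generator above so this example runs on its own
    characters_per_batch = n_seqs * n_steps
    n_batches = len(arr) // characters_per_batch
    arr = arr[:n_batches * characters_per_batch].reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n + n_steps]
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

# Hypothetical toy data: 43 "characters", so the last 3 are discarded
toy = np.arange(43)
toy_batches = list(get_batches(toy, n_seqs=2, n_steps=5))

print(len(toy_batches))      # number of full batches
x0, y0 = toy_batches[0]
print(x0)
print(y0)
```

Each row of `x0` comes from a different region of the text (row 0 starts at position 0, row 1 at position 20), and `y0` is `x0` shifted left by one, with the first input wrapped around to the end.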

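Here I set the number of sequences to 10 and the number of steps to 50, so each batch is a 10x50 matrix. The printout below shows only the first 10 columns.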

In [7]:
batches = get_batches(encoded, 10, 50)
x, y = next(batches)

In [8]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[2831   41  773   34 3539 4068 1204 2328 2964  484]
 [3539 4068 1204  805   48 4456 4072 3925 4459   30]
 [ 583 3856 1826  954 1451 3989 1464  105  506  304]
 [ 533 3040   36   31    1    0    0   34   34 1194]
 [ 149  868 2511  976  913 2057  875  932 4456  793]
 [ 132 4456 1094 4050 3474   84 3094 4456  149 1038]
 [1719   36 3539 4068 1204  773 3205  518   47 4456]
 [4459   30 1438 1006 2510  347  100   50 2206 1016]
 [1782 1943 1001  833   69  538  130 2632 2624 3543]
 [2632 1775 4456   49 3570 2586  206   36   31 3870]]

y
 [[  41  773   34 3539 4068 1204 2328 2964  484 1249]
 [4068 1204  805   48 4456 4072 3925 4459   30 1323]
 [3856 1826  954 1451 3989 1464  105  506  304 4456]
 [3040   36   31    1    0    0   34   34 1194 1016]
 [ 868 2511  976  913 2057  875  932 4456  793 4068]
 [4456 1094 4050 3474   84 3094 4456  149 1038 2753]
 [  36 3539 4068 1204  773 3205  518   47 4456 1070]
 [  30 1438 1006 2510  347  100   50 2206 1016  397]
 [1943 1001  833   69  538  130 2632 2

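As a check on the function's output, the first 10 columns of the inputs and targets should each form a 10x10 matrix, something like this: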
```
x
 [[55 63 69 22  6 76 45  5 16 35]
 [ 5 69  1  5 12 52  6  5 56 52]
 [48 29 12 61 35 35  8 64 76 78]
 [12  5 24 39 45 29 12 56  5 63]
 [ 5 29  6  5 29 78 28  5 78 29]
 [ 5 13  6  5 36 69 78 35 52 12]
 [63 76 12  5 18 52  1 76  5 58]
 [34  5 73 39  6  5 12 52 36  5]
 [ 6  5 29 78 12 79  6 61  5 59]
 [ 5 78 69 29 24  5  6 52  5 63]]

y
 [[63 69 22  6 76 45  5 16 35 35]
 [69  1  5 12 52  6  5 56 52 29]
 [29 12 61 35 35  8 64 76 78 28]
 [ 5 24 39 45 29 12 56  5 63 29]
 [29  6  5 29 78 28  5 78 29 45]
 [13  6  5 36 69 78 35 52 12 43]
 [76 12  5 18 52  1 76  5 58 52]
 [ 5 73 39  6  5 12 52 36  5 78]
 [ 5 29 78 12 79  6 61  5 59 63]
 [78 69 29 24  5  6 52  5 63 76]]
 ```
The exact numbers will be different. Check to make sure the data is shifted over one step for `y`.

## Building the model

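With the data preprocessed, it's time to build the model. To make parallel computation on the GPU easy, I chose TensorFlow. It offers a fairly good interface and wraps many commonly used numerical algorithms.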

<img src="assets/charRNN.png" width=500px>


### Inputs

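First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for dropout layers called `keep_prob`.

A word on `keep_prob`: it is the probability that a value is kept by a dropout layer. During training, each output passing through dropout is kept with probability `keep_prob` and randomly dropped (set to zero) otherwise. Why do this? When a model is trained with SGD (stochastic gradient descent) on a lot of data, it can easily settle into a solution that fits the training text too closely, that is, it overfits. Randomly dropping values keeps the network from leaning on any single unit, so later epochs keep updating the weights toward a solution that generalizes better.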

In [9]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs, targets, keep_prob
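
As a rough illustration of what a keep probability does, here is a NumPy sketch of dropout. It is not the TensorFlow implementation; the helper `dropout`, the seed, and the toy array are assumptions made for this example. Kept values are scaled by `1 / keep_prob` so the expected value stays the same, which is also how I understand TensorFlow's dropout to behave.

```python
import numpy as np

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, zero it otherwise.
    Kept values are scaled by 1/keep_prob so the expected value is unchanged."""
    mask = rng.random(values.shape) < keep_prob
    return values * mask / keep_prob

rng = np.random.default_rng(0)       # hypothetical fixed seed for repeatability
activations = np.ones((4, 5))

dropped = dropout(activations, keep_prob=0.5, rng=rng)
untouched = dropout(activations, keep_prob=1.0, rng=rng)
print(dropped)       # entries are either 0.0 or 2.0
print(untouched)     # identical to the input
```

This is why the sampling code later feeds `keep_prob` as `1.`: at generation time nothing should be dropped.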

### LSTM Cell (Long Short-Term Memory)

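Here we build the LSTM cell of the recurrent network with TensorFlow. (In Chinese, "cell" is usually rendered as 细胞, the biological kind; the connotation differs a little, but the meaning is effectively the same.) A quick explanation of what an LSTM is: an LSTM cell is a unit equipped with a series of gate operations. If the neural network were a circuit board, an LSTM cell would be a group of gate circuits wired together. It has three gates:

* Forget gate: selectively forgets information in the cell's context (the cell state) that no longer needs to be remembered, so that the useful information stays.
* Input gate: once the useless information has been forgotten, the input gate handles the useful new information. A candidate vector C produced by tanh and a gate vector i produced by sigmoid meet here and jointly control what is written. Two functions are needed because they do different jobs: tanh proposes candidate values between -1 and 1, while sigmoid gives a weight between 0 and 1 that decides how much of each candidate is let in.
* Output gate: after passing the input gate, the data arrives at the output gate. The output gate is also a filter: a sigmoid decides which parts of the context are exposed as the cell's output. When the data leaves the cell, that output is passed on to the next step together with the context.

The character classification itself happens later, in the softmax layer; what the cells learn is which information to keep, write, and expose. Confused? There's a diagram below: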
<img src="https://gss1.bdstatic.com/-vo3dSag_xI4khGkpoWK1HF6hhy/baike/c0%3Dbaike116%2C5%2C5%2C116%2C38/sign=8c5d1cc9daa20cf4529df68d17602053/91ef76c6a7efce1b48c95384a551f3deb58f659a.jpg">
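If the diagram still leaves your head spinning (mine too), read this blog post instead: [Understanding LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). The main thing to know is that <b>unlike its predecessors, where later steps forget what earlier steps processed, every step of an LSTM passes context along for later steps to use.</b> This is what keeps the output coherent and related to what came before.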
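
For readers who prefer code to diagrams, here is a minimal NumPy sketch of one LSTM time step with the three gates. It follows the standard LSTM equations; the function `lstm_step`, the toy sizes, and the way the four gates are laid out inside `W` are my own choices for illustration and are not meant to match TensorFlow's internal layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.
    x: (input_size,), h_prev/c_prev: (hidden,),
    W: (input_size + hidden, 4 * hidden), b: (4 * hidden,)"""
    hidden = h_prev.shape[0]
    z = np.concatenate([x, h_prev]) @ W + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: what to drop from the context
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: how much of the candidate to write
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: what part of the context to expose
    c = f * c_prev + i * g                  # updated cell state (the context)
    h = o * np.tanh(c)                      # output passed to the next step / layer
    return h, c

rng = np.random.default_rng(0)
input_size, hidden = 3, 4                   # hypothetical toy sizes
W = rng.normal(scale=0.1, size=(input_size + hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_size)):  # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Note how `h` and `c` from one step are fed into the next: that pair is the context the text above refers to.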

We first create a basic LSTM cell with

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

where `num_units` is the number of units in the hidden layers in the cell. Then we can add dropout by wrapping it with 

```python
tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```
You pass in a cell and it will automatically add dropout to the inputs or outputs. To build a network out of LSTM cells, we need to stack the cells into several layers, each wrapped with dropout (see above). TensorFlow makes this convenient: everything is provided, so we only have to snap the building blocks together. Finally, we can stack up the LSTM cells into layers with [`tf.contrib.rnn.MultiRNNCell`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/rnn/MultiRNNCell). With this, you pass in a list of cells and it will send the output of one cell into the next cell. Previously with TensorFlow 1.0, you could do this

```python
tf.contrib.rnn.MultiRNNCell([cell]*num_layers)
```

This might look a little weird if you know Python well because this will create a list of the same `cell` object. However, TensorFlow 1.0 will create different weight matrices for all `cell` objects. But, starting with TensorFlow 1.1 you actually need to create new cell objects in the list. To get it to work in TensorFlow 1.1, it should look like

```python
def build_cell(num_units, keep_prob):
    lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    return drop
    
tf.contrib.rnn.MultiRNNCell([build_cell(num_units, keep_prob) for _ in range(num_layers)])
```

Even though this is actually multiple LSTM cells stacked on each other, you can treat the multiple layers as one cell.

We also need to create an initial cell state of all zeros. This can be done like so

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```

Below, we implement the `build_lstm` function to create these LSTM cells and the initial state.

In [10]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM Cell
    
    def build_cell(lstm_size, keep_prob):
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        
        # Add dropout to the cell
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return drop
    
    
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size, keep_prob) for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return cell, initial_state

### RNN Output

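Here we'll create the output layer. We need to connect the output of the RNN cells to a fully connected layer with a softmax output, which comes down to a matrix multiplication on the values the LSTM produces. The softmax output gives us a probability distribution we can use to predict the next character. Each row going into this layer is one step of one sequence, so there are `batch_size * num_steps` rows per batch. Add the deep LSTM stack in the middle, where every step has to store intermediate values, and I finally understand why GPU memory blows up.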

After the LSTM, our output is a 3D tensor; to push it through the output layer consistently, we need to flatten it back into a 2D matrix. (Take my dual-vector foil!) If our input has batch size $N$, number of steps $M$, and the hidden layer has $L$ hidden units, then the output is a 3D tensor with size $N \times M \times L$. The output of each LSTM cell has size $L$, we have $M$ of them, one for each sequence step, and we have $N$ sequences. So the total size is $N \times M \times L$.

We are using the same fully connected layer, the same weights, for each of the outputs. Then, to make things easier, we should reshape the outputs into a 2D tensor with shape $(M * N) \times L$. That is, one row for each sequence and step, where the values of each row are the output from the LSTM cells.

Once we have the outputs reshaped, we can do the matrix multiplication with the weights. We need to wrap the weight and bias variables in a variable scope with `tf.variable_scope(scope_name)` because there are weights being created in the LSTM cells. TensorFlow will throw an error if the weights created here have the same names as the weights created in the LSTM cells, which they will by default. To avoid this, we wrap the variables in a variable scope so we can give them unique names.

In [11]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: Output tensor from the LSTM layers
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # That is, the shape should be batch_size*num_steps rows by lstm_size columns
    seq_output = tf.concat(lstm_output, axis=1)
    x = tf.reshape(seq_output, [-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.matmul(x, softmax_w) + softmax_b
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name='predictions')
    
    return out, logits
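
The shape bookkeeping in this layer is easier to follow with small numbers. Here is a NumPy sketch of the same reshape, matrix multiplication, and softmax; the sizes and the random stand-in for the LSTM outputs are made up for illustration.

```python
import numpy as np

N, M, L, C = 2, 3, 4, 5          # hypothetical toy sizes: sequences, steps, LSTM units, classes
rng = np.random.default_rng(0)

lstm_output = rng.normal(size=(N, M, L))        # stand-in for the dynamic_rnn outputs
softmax_w = rng.normal(scale=0.1, size=(L, C))
softmax_b = np.zeros(C)

# Flatten to one row per (sequence, step) pair
x = lstm_output.reshape(-1, L)                   # shape (N*M, L)
logits = x @ softmax_w + softmax_b               # shape (N*M, C)

# Row-wise softmax, shifted by the row max for numerical stability
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(x.shape, logits.shape, probs.sum(axis=1))
```

Row `n * M + m` of `x` holds the output for sequence `n` at step `m`, and every row of `probs` sums to 1.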

### Training loss

Next up is the training loss: once we have the outputs, we need to measure how far they are from the targets. We get the logits and targets and calculate the softmax cross-entropy loss (I won't go into the details here; see [A Friendly Introduction to Cross-Entropy Loss](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/)). First we need to one-hot encode the targets, since we're getting them as encoded characters. Then, reshape the one-hot targets so it's a 2D tensor with size $(M*N) \times C$ where $C$ is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with $C$ units. So our logits will also have size $(M*N) \times C$.

Then we run the logits and targets through `tf.nn.softmax_cross_entropy_with_logits` and find the mean to get the loss.

In [12]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per batch_size per step
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    return loss
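
A useful reference point for this loss is what an untrained network should score. The NumPy sketch below computes the mean softmax cross-entropy for logits that are all equal, using integer labels instead of one-hot vectors (picking the log-probability at the target index gives the same value as the one-hot dot product). The helper name and the number of rows are assumptions for the example.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between integer labels and row-wise softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

num_classes = 4464                      # len(vocab) from earlier in the notebook
rows = 8                                # hypothetical number of (sequence, step) rows

uniform_logits = np.zeros((rows, num_classes))
labels = np.arange(rows)                # arbitrary target ids

baseline = softmax_cross_entropy(uniform_logits, labels)
print(baseline, np.log(num_classes))
```

A network that spreads its probability evenly over all 4,464 characters scores $\ln 4464 \approx 8.40$, which is about where the training log in this notebook starts.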

### Optimizer

Here we build the optimizer, with gradient clipping added along the way. Normal RNNs have issues with gradients exploding and vanishing. LSTMs fix the vanishing problem, but the gradients can still grow without bound. To fix this, we clip the gradients: if their global norm is larger than some threshold, `tf.clip_by_global_norm` scales all of them down so that the norm equals the threshold. Without this, an occasional huge gradient would distort the slope badly and send gradient descent on a detour. Then we use an AdamOptimizer for the learning step.

In [13]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Maximum global norm the gradients are clipped to
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer
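
To show what clipping by global norm does to the numbers, here is a NumPy sketch of the rule as I understand it: compute one norm over all gradients together and, if it exceeds the threshold, scale every gradient by the same factor. The helper below and the toy gradients are illustrative, not the TensorFlow implementation.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by the same factor if their joint norm exceeds clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two variables; joint norm is sqrt(3^2 + 4^2 + 12^2) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]

clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, new_norm)
```

Because every gradient is scaled by the same factor, the direction of the update is preserved and only its length is capped. Gradients whose joint norm is already below the threshold pass through unchanged.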

### Build the network

Now that the pieces of the network are ready, we can put them all together and build a class for the network; after that we can start training it for real. To actually run data through the LSTM cells, we will use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/dynamic_rnn). This function will pass the hidden and cell states across LSTM cells appropriately for us. It returns the outputs for each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as `final_state` so we can pass it to the first LSTM cell in the next mini-batch run. For `tf.nn.dynamic_rnn`, we pass in the cell and initial state we get from `build_lstm`, as well as our input sequences. Also, we need to one-hot encode the inputs before going into the RNN.

In [14]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # Build the LSTM cell
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs, num_classes)
        
        # Run each sequence step through the RNN and collect the outputs
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=self.initial_state)
        self.final_state = state
        
        # Get softmax predictions and logits
        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)
        
        # Loss and optimizer (with gradient clipping)
        self.loss = build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

## Hyperparameters

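Here I'm defining the hyperparameters for the network. This is where the practitioner's dark arts are tested the most. Everything else is routine; tuning hyperparameters comes down to intuition and experience. Here coders, or rather data engineers, get to turn into traditional Chinese medicine doctors: look the network over, take its pulse, think for a moment, and maybe really land on a set of hyperparameters that gives the best result.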

* `batch_size` - Number of sequences running through the network in one pass.
* `num_steps` - Number of characters in the sequence the network is trained on. Larger is typically better, since the network will learn more long-range dependencies, but it takes longer to train. 100 is typically a good number here.
* `lstm_size` - The number of units in the hidden layers.
* `num_layers` - Number of hidden LSTM layers to use.
* `learning_rate` - Learning rate for training.
* `keep_prob` - The dropout keep probability when training. If your network is overfitting, try decreasing this.

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

>### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `lstm_size` and `num_layers`. I would advise that you always use `num_layers` of either 2/3. The `lstm_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `lstm_size` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.


In [17]:
batch_size = 128        # Sequences per batch
num_steps = 100         # Number of sequence steps per batch
lstm_size = 512         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.0003   # Learning rate
keep_prob = 0.5         # Dropout keep probability
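
The tips above suggest comparing the number of parameters with the size of the dataset. The notebook never prints a parameter count, so here is a back-of-the-envelope sketch. It assumes the standard LSTM layout (four gates, each with weights over the input and the previous hidden state plus one bias), which is my understanding of `BasicLSTMCell`, and it adds the softmax layer from `build_output`. The helper names are made up for this example.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with weights over [input, previous hidden] plus a bias
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

def char_rnn_params(num_classes, lstm_size, num_layers):
    total = lstm_layer_params(num_classes, lstm_size)            # first layer sees one-hot input
    total += (num_layers - 1) * lstm_layer_params(lstm_size, lstm_size)
    total += lstm_size * num_classes + num_classes               # softmax weights and biases
    return total

n_params = char_rnn_params(num_classes=4464, lstm_size=512, num_layers=2)
print(n_params)
```

Under these assumptions the settings above give roughly 14.6 million parameters, most of them in the first LSTM layer because of the 4,464-wide one-hot input. A large character vocabulary makes the model much bigger than the same architecture trained on English letters.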

## Time for training

This is typical training code, passing inputs and targets into the network, then running the optimizer. Here we also get back the final LSTM state for the mini-batch. Then, we pass that state back into the network so the next batch can continue the state from the previous batch. And every so often (set by `save_every_n`) I save a checkpoint.

Here I'm saving checkpoints with the format

`i{iteration number}_l{# hidden layer units}.ckpt`

In [18]:
epochs = 50
# Save every N iterations
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, 
                                                 model.final_state, 
                                                 model.optimizer], 
                                                 feed_dict=feed)
            
            end = time.time()
            print('Epoch: {}/{}... '.format(e+1, epochs),
                  'Training Step: {}... '.format(counter),
                  'Training loss: {:.4f}... '.format(batch_loss),
                  '{:.4f} sec/batch'.format((end-start)))
        
            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))
    
    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

Epoch: 1/50...  Training Step: 1...  Training loss: 8.4041...  1.4952 sec/batch
Epoch: 1/50...  Training Step: 2...  Training loss: 8.3998...  1.4524 sec/batch
Epoch: 1/50...  Training Step: 3...  Training loss: 8.3945...  1.3991 sec/batch
Epoch: 1/50...  Training Step: 4...  Training loss: 8.3873...  1.4213 sec/batch
Epoch: 1/50...  Training Step: 5...  Training loss: 8.3733...  1.4446 sec/batch
Epoch: 1/50...  Training Step: 6...  Training loss: 8.3482...  1.3928 sec/batch
Epoch: 1/50...  Training Step: 7...  Training loss: 8.2876...  1.4095 sec/batch
Epoch: 1/50...  Training Step: 8...  Training loss: 8.1361...  1.4096 sec/batch
Epoch: 1/50...  Training Step: 9...  Training loss: 7.8635...  1.3634 sec/batch
Epoch: 1/50...  Training Step: 10...  Training loss: 7.8812...  1.4309 sec/batch
Epoch: 1/50...  Training Step: 11...  Training loss: 7.7826...  1.4422 sec/batch
Epoch: 1/50...  Training Step: 12...  Training loss: 7.6051...  1.4290 sec/batch
Epoch: 1/50...  Training Step: 13... 

#### Saved checkpoints

Read up on saving and loading checkpoints here: https://www.tensorflow.org/programmers_guide/variables

In [19]:
tf.train.get_checkpoint_state('checkpoints')

model_checkpoint_path: "checkpoints\\i3050_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i1800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i2800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i3000_l512.ckpt"
all_model_checkpoint_paths: "checkpoints\\i3050_l512.ckpt"
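
The final checkpoint name allows a quick sanity check on the batching. Assuming training ran from scratch for all 50 epochs (no resumed checkpoint), the step counter at the end tells us how many batches each epoch contained. The variable names below are only for this calculation.

```python
# Values taken from the hyperparameter and training cells above
batch_size, num_steps, epochs = 128, 100, 50
final_step = 3050                      # from the last checkpoint name, i3050_l512.ckpt

batches_per_epoch = final_step // epochs
chars_per_batch = batch_size * num_steps
chars_used_per_epoch = batches_per_epoch * chars_per_batch

print(batches_per_epoch, chars_per_batch, chars_used_per_epoch)
```

So each epoch covers 61 batches of 12,800 characters, 780,800 characters in total, and `get_batches` drops fewer than 12,800 trailing characters of the novel.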

## Sampling

Training took ages; the flowers had wilted by the time it finished. It's a good thing the progress was saved to checkpoints, because losing it would have me smashing my mouse. Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character, then the network will predict the next character. We can use the new one to predict the next one. And we keep doing this to generate all new text. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us predictions for each character. To reduce noise and make things a little less random, I'm going to only choose a new character from the top N most likely characters.



In [20]:
def pick_top_n(preds, vocab_size, top_n=5):
    p = np.squeeze(preds)
    p[np.argsort(p)[:-top_n]] = 0
    p = p / np.sum(p)
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c
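
Here is a small sketch of what the top-N filter does to a distribution, using a made-up six-character vocabulary. The helper `top_n_distribution` is a variant written for this example: it works on a copy and returns the filtered distribution instead of a sampled index, so the effect is visible.

```python
import numpy as np

def top_n_distribution(preds, top_n=5):
    """Return a copy of preds with everything outside the top_n zeroed, renormalized."""
    p = np.array(preds, dtype=np.float64).ravel()   # copy, so the caller's array is untouched
    p[np.argsort(p)[:-top_n]] = 0
    return p / p.sum()

# Hypothetical prediction over a 6-character vocabulary
preds = np.array([[0.05, 0.30, 0.10, 0.40, 0.05, 0.10]])
p = top_n_distribution(preds, top_n=2)
print(p)

rng = np.random.default_rng(0)
draws = rng.choice(len(p), size=1000, p=p)
print(np.unique(draws))
```

Only the two most likely characters (indices 1 and 3) can ever be drawn. One difference from `pick_top_n` above: `np.squeeze` returns a view, so that function zeroes entries of the array it is given in place. That is harmless in this notebook because `preds` is not reused after sampling.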

In [21]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # `prime` must be non-empty and every character in it must be in the vocabulary
    samples = [c for c in prime]
    model = CharRNN(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        x = np.zeros((1, 1))

        # Feed the priming text one character at a time to build up the state
        for c in prime:
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

        # Sample the first new character from the last prediction
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Keep feeding each sampled character back in to get the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])
        
    return ''.join(samples)

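The neural network that will continue Jin Ping Mei on behalf of Xiaoxiao Sheng (笑笑生), the novel's pseudonymous author, is now trained, so let's look at the results. You give it a priming string first, and it hands you a Jin Ping Mei 2.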

In [22]:
tf.train.latest_checkpoint('checkpoints')

'checkpoints\\i3050_l512.ckpt'

In [26]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 7000, lstm_size, len(vocab), prime="浪")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints\i3050_l512.ckpt
浪。” 

　　西门庆见了，只见西门庆进入门中，便道：“这里说了，只是一件儿。”西门庆道：“你说不知，我不敢说。”伯爵道：“这等你不知，我还不去。我说我不知你，你不知，你这个不好？”西门庆道：“我不得你。”西门庆道：“你休要说，你不要，我说你，你不要你，我也不要我去，你不肯你。他不肯来，你不知，我就不知我，不知你那里去了。”西门庆道：“你不知道，只是我不的。”西门庆道：“你的我这里，就是我的不好。你这一个小厮儿，不知道，你也要吃他。”西门庆道：“我的不说你，我说你不好，他不知道，不是这般说，我就不知道。”那妇人听了，只见他说话，一直往外边去。只见玳安进房里，一面走了一遍，金莲不见，说道：“他是不知，你这个好个儿！我不在这里，只是你不好。”西门庆道：“怪狗才，他这等你来。”西门庆道：“他不知，你这个不在我。”西门庆道：“我不要他，你不要，你还要我去，教我拿着他去了。”李瓶儿道：“你不好，只顾我来。”西门庆道：“我不知你，我不在家。那个不好不好？”妇人道：“我也不知，我也不是你这般，你不知道。”那妇人听着，笑道：“你的不是，说的是谁。”那妇人道：“我不知道，我不知道。”西门庆道：“他也不是，你说我，你也是个不知道！”于是打了一个，说道：“我不是你，我这里说的是谁的？”月娘道：“他这个不好，你就是我这个儿！不是你这个淫妇儿，我不知你，你怎不知道！”说道：“我的我不知，我怎的不知你？”妇人道：“你不知你，我不在他这里，你不知道。你不知怎的不得？我来家我也不要了，他也没了，我也不好。你说了一场，他也不知你的，我这个不是你的。” 

　　不言语，只见了一回。 

　　西门庆在房里睡，只见他两个儿，不觉一阵风儿，只见一个不知。那日不见，不在西门庆房里，只见李瓶儿来了，说：“姐姐，他不在，我也不知你，你这个是他的。不知我，你不知他。他不是你，你这里，我不是他，他就是你的。”那玉楼听见了，说：“他，你还不在这屋里？”西门庆道：“你不好，我也不要你，不想来我，不要你看，他也要不出来，我也要他，你说他，你还不知道了。”一面走了一遍，又不在家，又问他：“我在那屋里？”金莲道：“你没了，我就是不得他。”月娘道：“你说道：“我，你这个不好。他来，你还不知道

In [None]:
checkpoint = 'checkpoints/i200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="浪")
print(samp)

In [None]:
checkpoint = 'checkpoints/i600_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="浪")
print(samp)

In [None]:
checkpoint = 'checkpoints/i1200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="浪")
print(samp)

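Well, what on earth is this? One sentence doesn't follow from the last and there is no logic to speak of. That is expected: the network can't actually read, it only estimates the probability of the next character given the ones before it. Still, you can see that the network does use punctuation correctly, and it even imitates the format of couplets quite convincingly.

Future work: build a generative adversarial network (GAN) to write a well-known novel. Roughly, a GAN works like this: one network writes the text and another network judges whether the text is real or fake. If the ghostwriter network can fool the judge network, then its writing is at a pretty high level.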
