## Problems in optimizing RNN models (exploding and vanishing gradients)

Solution to exploding gradients: gradient clipping
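
As a minimal sketch of gradient clipping, the snippet below rescales all gradients together when their joint L2 norm exceeds a threshold. The helper name `clip_by_global_norm`, the dictionary keys, and the threshold of 5.0 are illustrative assumptions, not part of these notes.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a dict of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    # Scale is 1 when the gradients are already small enough.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return {name: g * scale for name, g in grads.items()}, total_norm

# Hypothetical, deliberately large gradients.
grads = {"dWaa": np.full((3, 3), 10.0), "dba": np.full((3, 1), 10.0)}
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped.values()))
print(norm_before, norm_after)
```

Rescaling by the global norm keeps the direction of the update; clipping each element separately with `np.clip` is a simpler alternative that can change the direction.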

Cause of vanishing gradients:

    Long-term dependencies in the RNN
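
To see why long-term dependencies are a problem, the sketch below tracks the Jacobian of the hidden state with respect to the initial state in a tanh RNN. It assumes the update $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle})$ and small random recurrent weights; the sizes and the weight scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 8, 50
# Hypothetical small recurrent weights (spectral norm well below 1).
Waa = rng.normal(scale=0.3 / np.sqrt(n_a), size=(n_a, n_a))
Wax = rng.normal(size=(n_a, 1))

a = np.zeros((n_a, 1))
jac = np.eye(n_a)          # accumulates d a<t> / d a<0>
norms = []
for t in range(T):
    x_t = rng.normal(size=(1, 1))
    a = np.tanh(Waa @ a + Wax @ x_t)
    step_jac = (1.0 - a ** 2) * Waa      # diag(1 - a^2) @ Waa via broadcasting
    jac = step_jac @ jac                 # chain rule across one more time step
    norms.append(np.linalg.norm(jac, 2))

print(norms[0], norms[9], norms[-1])
```

Each step multiplies the Jacobian by another factor whose norm is below 1 here, so the influence of early time steps on late ones shrinks roughly geometrically.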

Solutions to vanishing gradients:

    Use memory with a fixed time span (moving average)
    Gated units: GRU, LSTM
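
One possible reading of "memory with a fixed time span" is a leaky unit that keeps an exponentially weighted moving average of its input; with decay `beta` the average covers roughly `1 / (1 - beta)` recent steps. This interpretation, and the names `leaky_memory` and `beta`, are assumptions made for the sketch.

```python
import numpy as np

def leaky_memory(xs, beta):
    """Exponentially weighted moving average: m_t = beta * m_{t-1} + (1 - beta) * x_t."""
    m = 0.0
    history = []
    for x in xs:
        m = beta * m + (1.0 - beta) * x
        history.append(m)
    return np.array(history)

xs = np.ones(100)
short = leaky_memory(xs, beta=0.5)   # remembers about 2 steps
long_ = leaky_memory(xs, beta=0.9)   # remembers about 10 steps
print(short[9], long_[9])
```

Because the memory is updated linearly, the gradient of `m_t` with respect to `m_{t-k}` is simply `beta ** k`, which decays slowly when `beta` is close to 1.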

## RNN model

<img src="images/RNN.png" style="width:500px;height:300px;">
<caption><center> **Figure 1**: Basic RNN model </center></caption>

The input is a sequence of $T_x$ time steps, $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$. The model outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$, one prediction per time step.

<img src="images/rnn_step_forward.png" style="width:700px;height:300px;">
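<caption><center> **Figure 2**: Basic RNN cell. It takes the current input $x^{\langle t \rangle}$ and the previous cell's hidden state $a^{\langle t - 1\rangle}$ (which carries information from the past), and produces the current hidden state $a^{\langle t \rangle}$, which is passed to the next cell and can also be used to predict $y^{\langle t \rangle}$. </center></caption>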
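
A minimal NumPy sketch of one forward step of such a cell follows. It assumes the common formulation $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ with a softmax output layer; the output parameters `Wya` and `by` and all sizes are assumptions, since these notes do not define them.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, params):
    """One time step: returns the new hidden state and the prediction."""
    a_next = np.tanh(params["Waa"] @ a_prev + params["Wax"] @ x_t + params["ba"])
    y_pred = softmax(params["Wya"] @ a_next + params["by"])
    return a_next, y_pred

rng = np.random.default_rng(1)
n_x, n_a, n_y, m = 3, 5, 2, 4   # input size, hidden size, output size, batch size
params = {
    "Wax": rng.normal(size=(n_a, n_x)),
    "Waa": rng.normal(size=(n_a, n_a)),
    "Wya": rng.normal(size=(n_y, n_a)),
    "ba": np.zeros((n_a, 1)),
    "by": np.zeros((n_y, 1)),
}
a_next, y_pred = rnn_cell_forward(rng.normal(size=(n_x, m)), np.zeros((n_a, m)), params)
print(a_next.shape, y_pred.shape)
```

Running the full model amounts to calling this function once per time step, feeding each `a_next` back in as `a_prev`.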

<img src="images/LSTM_rnn.png" style="width:500px;height:300px;">
<caption><center> **Figure 3**: LSTM forward propagation </center></caption>

<img src="images/LSTM.png" style="width:500px;height:400px;">
<caption><center> **Figure 4**: LSTM cell. At every time step it stores and updates an internal memory variable $c^{\langle t \rangle}$, which is distinct from the hidden state $a^{\langle t \rangle}$. </center></caption>

$\begin{bmatrix} a^{\langle t-1 \rangle} , x^{\langle t \rangle} \end{bmatrix}$ denotes the vertical concatenation $\begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
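
The forward pass of one LSTM cell can be sketched as below. It assumes the standard formulation with forget, update, and output gates ($\Gamma_f$, $\Gamma_u$, $\Gamma_o$) and a candidate value $\tilde c$, each computed from the stacked vector; the parameter names (`Wf`, `bf`, and so on) and the sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, p):
    concat = np.concatenate([a_prev, x_t], axis=0)   # [a<t-1>; x<t>], shape (n_a + n_x, m)
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])    # forget gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])    # update gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])    # candidate memory
    c_next = gamma_f * c_prev + gamma_u * c_tilde    # new memory state
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])    # output gate
    a_next = gamma_o * np.tanh(c_next)               # new hidden state
    return a_next, c_next

rng = np.random.default_rng(5)
n_x, n_a, m = 3, 5, 4
p = {f"W{g}": rng.normal(size=(n_a, n_a + n_x)) for g in "fuco"}
p.update({f"b{g}": np.zeros((n_a, 1)) for g in "fuco"})
a_next, c_next = lstm_cell_forward(
    rng.normal(size=(n_x, m)), np.zeros((n_a, m)), np.zeros((n_a, m)), p
)
print(a_next.shape, c_next.shape)
```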

## RNN backpropagation

<img src="images/rnn_cell_backprop.png" style="width:500px;height:300px;"> <br>
<caption><center> **Figure 5**: Backpropagation through an RNN cell. Given the cost function $J$, compute $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b_a})$ and use them to update $(W_{ax}, W_{aa}, b_a)$. </center></caption>
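
A sketch of the backward pass for a single cell, assuming the forward step $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ and an upstream gradient `da_next`. The finite-difference check and the learning rate of 0.01 are additions for illustration only.

```python
import numpy as np

def rnn_cell_backward(da_next, x_t, a_prev, a_next, Wax, Waa):
    dz = (1.0 - a_next ** 2) * da_next        # backprop through tanh
    dWax = dz @ x_t.T
    dWaa = dz @ a_prev.T
    dba = dz.sum(axis=1, keepdims=True)       # sum over the batch dimension
    dx_t = Wax.T @ dz
    da_prev = Waa.T @ dz
    return dWax, dWaa, dba, dx_t, da_prev

rng = np.random.default_rng(2)
n_x, n_a, m = 3, 4, 2
Wax, Waa = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a))
ba = rng.normal(size=(n_a, 1))
x_t, a_prev = rng.normal(size=(n_x, m)), rng.normal(size=(n_a, m))
da_next = rng.normal(size=(n_a, m))

def forward(Waa_):
    return np.tanh(Waa_ @ a_prev + Wax @ x_t + ba)

a_next = forward(Waa)
dWax, dWaa, dba, dx_t, da_prev = rnn_cell_backward(da_next, x_t, a_prev, a_next, Wax, Waa)

# Central finite difference on one entry of Waa as a sanity check.
eps = 1e-6
Waa_plus, Waa_minus = Waa.copy(), Waa.copy()
Waa_plus[0, 1] += eps
Waa_minus[0, 1] -= eps
numeric = np.sum(da_next * (forward(Waa_plus) - forward(Waa_minus))) / (2 * eps)
print(dWaa[0, 1], numeric)

# Plain gradient-descent update with a hypothetical learning rate.
Waa_updated = Waa - 0.01 * dWaa
```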

## LSTM backpropagation

### gate derivatives

In equations (1)-(4), $*$ is the element-wise product, and $d\Gamma$ and $d\tilde c$ are gradients with respect to the gate pre-activations, so the sigmoid or tanh derivative is already included.

$$ d \Gamma_o^{\langle t \rangle} = da_{next}*\tanh(c_{next}) * \Gamma_o^{\langle t \rangle}*(1-\Gamma_o^{\langle t \rangle})\tag{1} $$

$$ d\tilde c^{\langle t \rangle} = \left(dc_{next}*\Gamma_u^{\langle t \rangle}+ \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * \Gamma_u^{\langle t \rangle} * da_{next}\right) * \left(1-(\tilde c^{\langle t \rangle})^2\right) \tag{2} $$

$$ d\Gamma_u^{\langle t \rangle} = \left(dc_{next}*\tilde c^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * \tilde c^{\langle t \rangle} * da_{next}\right)*\Gamma_u^{\langle t \rangle}*(1-\Gamma_u^{\langle t \rangle})\tag{3} $$

$$ d\Gamma_f^{\langle t \rangle} = \left(dc_{next}* c_{prev} + \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * c_{prev} * da_{next}\right)*\Gamma_f^{\langle t \rangle}*(1-\Gamma_f^{\langle t \rangle})\tag{4} $$
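
A finite-difference comparison is a quick way to sanity-check gate derivatives of this kind. The sketch below assumes the standard LSTM forward pass ($c_{next} = \Gamma_f * c_{prev} + \Gamma_u * \tilde c$, $a_{next} = \Gamma_o * \tanh(c_{next})$) and treats each gate gradient as the gradient with respect to that gate's pre-activation; the dictionary keys and the surrogate loss are assumptions made for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(z, c_prev):
    """z holds the four gate pre-activations; returns a_next, c_next and the gate values."""
    gf, gu, go = sigmoid(z["f"]), sigmoid(z["u"]), sigmoid(z["o"])
    cct = np.tanh(z["c"])
    c_next = gf * c_prev + gu * cct
    a_next = go * np.tanh(c_next)
    return a_next, c_next, (gf, gu, cct, go)

def gate_grads(da_next, dc_next, c_prev, c_next, gates):
    gf, gu, cct, go = gates
    # Total gradient reaching c_next: directly, and through a_next.
    dc_total = dc_next + da_next * go * (1.0 - np.tanh(c_next) ** 2)
    return {
        "o": da_next * np.tanh(c_next) * go * (1.0 - go),
        "c": dc_total * gu * (1.0 - cct ** 2),
        "u": dc_total * cct * gu * (1.0 - gu),
        "f": dc_total * c_prev * gf * (1.0 - gf),
    }

rng = np.random.default_rng(3)
shape = (4, 2)
z = {k: rng.normal(size=shape) for k in "fuco"}
c_prev, da_next, dc_next = (rng.normal(size=shape) for _ in range(3))

a_next, c_next, gates = forward(z, c_prev)
analytic = gate_grads(da_next, dc_next, c_prev, c_next, gates)

def loss(z_):
    # Surrogate loss whose gradients w.r.t. a_next and c_next are da_next and dc_next.
    a, c, _ = forward(z_, c_prev)
    return np.sum(da_next * a) + np.sum(dc_next * c)

eps, max_err = 1e-6, 0.0
for k in "fuco":
    z_plus = {n: v.copy() for n, v in z.items()}
    z_minus = {n: v.copy() for n, v in z.items()}
    z_plus[k][0, 0] += eps
    z_minus[k][0, 0] -= eps
    numeric = (loss(z_plus) - loss(z_minus)) / (2 * eps)
    max_err = max(max_err, abs(numeric - analytic[k][0, 0]))
print(max_err)
```

A small `max_err` indicates that the analytic expressions agree with the numerical gradient for the perturbed entries.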

### parameter derivatives 

$$ dW_f = d\Gamma_f^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{5} $$
$$ dW_u = d\Gamma_u^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{6} $$
$$ dW_c = d\tilde c^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{7} $$
$$ dW_o = d\Gamma_o^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{8} $$

$$ db_f = \texttt{np.sum}(d\Gamma_f^{\langle t \rangle},\ \texttt{axis=1},\ \texttt{keepdims=True}) $$
$$ db_u = \texttt{np.sum}(d\Gamma_u^{\langle t \rangle},\ \texttt{axis=1},\ \texttt{keepdims=True}) $$
$$ db_c = \texttt{np.sum}(d\tilde c^{\langle t \rangle},\ \texttt{axis=1},\ \texttt{keepdims=True}) $$
$$ db_o = \texttt{np.sum}(d\Gamma_o^{\langle t \rangle},\ \texttt{axis=1},\ \texttt{keepdims=True}) $$

$a$ is the hidden state and $c$ is the memory state.

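$$ da_{prev} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle} \tag{9} $$
In equation (9), each weight matrix is restricted to the columns that multiply $a_{prev}$, i.e. the first $n_a$ columns: $W_f = W_f[:, :n_a]$, and likewise for $W_u$, $W_c$, and $W_o$.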

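$$ dc_{prev} = dc_{next}*\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh(c_{next})^2)*\Gamma_f^{\langle t \rangle}*da_{next} \tag{10} $$
$$ dx^{\langle t \rangle} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle}\tag{11} $$
In equation (11), each weight matrix is restricted to the columns that multiply $x^{\langle t \rangle}$, i.e. the columns from $n_a$ onward: $W_f = W_f[:, n_a:]$, and likewise for $W_u$, $W_c$, and $W_o$.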
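
The parameter gradients and the split between $da_{prev}$ and $dx^{\langle t \rangle}$ can be sketched as follows. The sketch assumes every weight matrix has shape $(n_a, n_a + n_x)$ so that it multiplies the stacked vector of $a_{prev}$ and $x_t$; the gate gradients are random stand-ins, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_a, n_x, m = 5, 3, 4
a_prev, x_t = rng.normal(size=(n_a, m)), rng.normal(size=(n_x, m))
concat = np.concatenate([a_prev, x_t], axis=0)            # (n_a + n_x, m)

# Hypothetical weights and gate gradients, random values for illustration only.
gate_names = ["f", "u", "c", "o"]
W = {g: rng.normal(size=(n_a, n_a + n_x)) for g in gate_names}
d_gate = {g: rng.normal(size=(n_a, m)) for g in gate_names}

# Weight and bias gradients: gate gradient times the stacked input, summed over the batch.
dW = {g: d_gate[g] @ concat.T for g in gate_names}
db = {g: d_gate[g].sum(axis=1, keepdims=True) for g in gate_names}

# Backpropagate through the concatenation once, then split the result.
d_concat = sum(W[g].T @ d_gate[g] for g in gate_names)    # (n_a + n_x, m)
da_prev, dx_t = d_concat[:n_a], d_concat[n_a:]

# Equivalent form with explicit column slices of each weight matrix.
da_prev_sliced = sum(W[g][:, :n_a].T @ d_gate[g] for g in gate_names)
dx_t_sliced = sum(W[g][:, n_a:].T @ d_gate[g] for g in gate_names)
print(np.allclose(da_prev, da_prev_sliced), np.allclose(dx_t, dx_t_sliced))
```

Slicing columns before transposing and slicing rows of the transposed product give the same arrays, which is why the two forms agree.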

## Bidirectional RNN

## Deep RNNs

## RNN example (IMDB dataset)

```python
from __future__ import print_function

import os

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.callbacks import TensorBoard
from keras.datasets import imdb

max_features = 20000  # vocabulary size: keep only the most frequent words
maxlen = 80  # truncate or pad every review to this many words
batch_size = 32
tensorflow_logs_path = os.path.expanduser("~/Desktop/tensorflow_logs/RNN")

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
# Embedding -> single LSTM layer -> sigmoid output for binary sentiment classification
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'],
)

# Create the log directory if it does not exist yet
try:
    os.makedirs(tensorflow_logs_path)
except OSError:
    pass

print('Train...')
# NOTE: write_grads and batch_size are arguments of the older standalone
# Keras 2.x TensorBoard callback; newer releases may ignore or reject them.
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=3,
          validation_data=(x_test, y_test),
          callbacks=[TensorBoard(
              log_dir=tensorflow_logs_path,
              write_graph=True,
              histogram_freq=1,
              write_grads=True,
              batch_size=10,
              write_images=True,
              embeddings_freq=1
          )])
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
```