<font size=6>Recurrent Neural Networks (RNN)</font>
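# Why Do We Need RNNs?
## Sequence Data
A sequence is a (finite or infinite) stream of interdependent data, such as a time series, a meaningful string of text, or a conversation. In a conversation, a single sentence may mean one thing while the conversation as a whole means something quite different. Time series such as stock-market data behave the same way: a single data point gives the current price, but the full day's data shows how the price moved, and that movement is what drives a decision to buy or sell.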

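RNNs target inputs that are sequential and whose elements depend on one another. Their defining feature is memory: the network retains some information about what happened earlier in the sequence, which helps it capture context. In theory, an RNN's memory is unbounded, so it can look back arbitrarily far and draw on every earlier input.
## Applications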
<img src="images/ae2970d80a119cd341ef31c684bfac49.png">

# One-Hot Vectors
**Goal:** identify whether each word is a noun.

In [1]:
import numpy as np
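# Sample sentence
sample = "to be a quant you have to be good at coding and finance"
# Tokenize on spaces
words = sample.split(" ")
# Vocabulary: word -> row index (set order is arbitrary, so indices can differ between runs)
token = {j: i for i, j in enumerate(set(words))}
# One-hot matrix: one row per vocabulary word, one column per position in the sentence
seq = np.zeros(shape=(len(token), len(words)))
for i, j in enumerate(words):
    seq[token[j], i] = 1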

In [2]:
token

{'a': 0,
 'you': 1,
 'have': 2,
 'quant': 3,
 'finance': 4,
 'good': 5,
 'to': 6,
 'be': 7,
 'coding': 8,
 'and': 9,
 'at': 10}

In [3]:
seq

array([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])
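
Because the vocabulary above is built from a `set`, the word-to-index mapping can change from run to run. The following is a minimal sketch of a reproducible variant, assuming a sorted vocabulary is acceptable; the helper name `one_hot_encode` is hypothetical and not part of the original notebook.

```python
import numpy as np

def one_hot_encode(sentence):
    """Return (vocab, matrix); matrix[:, t] is the one-hot column for word t."""
    words = sentence.split(" ")
    # Sorting makes the word -> index mapping reproducible across runs.
    vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
    matrix = np.zeros((len(vocab), len(words)))
    for position, word in enumerate(words):
        matrix[vocab[word], position] = 1
    return vocab, matrix

sentence = "to be a quant you have to be good at coding and finance"
vocab, matrix = one_hot_encode(sentence)

# Decode the columns back into words to confirm the encoding is lossless.
index_to_word = {idx: word for word, idx in vocab.items()}
decoded = [index_to_word[int(column.argmax())] for column in matrix.T]
```

Each column contains exactly one 1, so `argmax` recovers the word index for that position.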

# Notation
$\large x^{(i)}$: the $i$-th example;

$\large x^{[i]}$: the $i$-th layer;

$\large x^{\{i\}}$: the $i$-th mini-batch;

$\large x^{<i>}$: the $i$-th position in the sequence;


# RNN Model
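## Setup
1. Goal: decide whether each word in a sentence is a noun.
2. A conventional NN can only judge a word's part of speech from the labeled example itself and ignores the surrounding context, which is exactly where an RNN is strong.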

## RNN
<img src="images/07e3288e16d576a545d03e5127adbb91.png">

### Properties
1. $\large a^{<0>}$: initialized to the zero vector;
2. $\large \hat y^{<t>}=f(x^{<t>},a^{<t-1>})$

## RNN Forward Propagation
<img src="images/aad5728eb983cfabfac6fc38e90d28ff.png">

$\large a^{< t >} = g_{1}(W_{aa}a^{< t - 1 >} + W_{ax}x^{< t >} + b_{a})$

$\large \hat y^{<t>} = g_{2}(W_{ya}a^{< t >} + b_{y})$

$\large \text{Let: } W_{a}=\left [W_{aa},W_{ax}\right ],\quad \left\lbrack a^{< t - 1 >},x^{< t >}\right\rbrack=\begin{bmatrix}a^{< t-1 >} \\ x^{< t >} \\\end{bmatrix}$

$\large \text{Then:}\\\large a^{< t >} = g_{1}(W_{a}\left\lbrack a^{< t - 1 >},x^{< t >}\right\rbrack + b_{a})\\\large\hat y^{<t>} = g_{2}(W_{y}a^{< t >} + b_{y})$

### Shapes
$\because\\x^{<t>}:(n,1)\\a^{<t-1>}:(m,1)\\W_{a}:(m,n+m)\\b_{a}:(m,1)\\W_{y}:(1,m)\\b_{y}:(1,1)\\\therefore\\a^{<t>}:(m,1)\to \text{the shape of } a^{<t>} \text{ is self-consistent}\\\hat y^{<t>}:(1,1)$
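
The stacked form and the shapes above can be checked numerically. This is a small sketch with arbitrary sizes and random values (both are assumptions made only for illustration), using `tanh` as a stand-in for $g_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5  # n: input size, m: hidden size (illustrative values)

W_aa = rng.standard_normal((m, m))
W_ax = rng.standard_normal((m, n))
b_a = rng.standard_normal((m, 1))
a_prev = rng.standard_normal((m, 1))
x_t = rng.standard_normal((n, 1))

# Separate-matrix form: W_aa a^{<t-1>} + W_ax x^{<t>} + b_a
a_separate = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

# Stacked form: W_a = [W_aa, W_ax] side by side, [a_prev; x_t] stacked vertically
W_a = np.hstack([W_aa, W_ax])        # shape (m, m + n)
stacked = np.vstack([a_prev, x_t])   # shape (m + n, 1)
a_stacked = np.tanh(W_a @ stacked + b_a)
```

Both forms give the same activation; the stacked version simply replaces two matrix products with one.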

## RNN Backpropagation
1. Chain rule;
2. Order of propagation;
3. **Note: the parameters are shared across all cells;**

### Propagation Flow
<img src="images/998c7af4f90cd0de0c88f138b61f0168.png">

### Loss Function
$L^{<t>}( \hat y^{<t>},y^{<t>}) = - y^{<t>}\log\hat  y^{<t>}-( 1- y^{<t>})\log(1-\hat y^{<t>})$
### Cost Function
$L(\hat y,y) = \sum_{t = 1}^{T_{x}}{L^{< t >}(\hat  y^{< t >},y^{< t >})}$
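
As a worked example of the cost, the sketch below sums the standard binary cross-entropy over the positions of one sequence. The labels and predicted probabilities are made-up values, and the helper name `sequence_loss` is hypothetical.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Return (total, per_step) binary cross-entropy over one sequence."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    per_step = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return per_step.sum(), per_step

y_true = [0, 0, 1, 0]          # hypothetical labels: 1 marks a noun
y_prob = [0.1, 0.2, 0.9, 0.3]  # hypothetical predicted probabilities
total, per_step = sequence_loss(y_prob, y_true)
```

Only one of the two terms is active at each position: the first when the label is 1, the second when it is 0.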
### Backpropagation at Position/Time Step $t$
<img src="images/4235dc87f795faabe628a91cd1a681f6.png">


# Types of Recurrent Neural Networks
1. Classified by the number of inputs and outputs;
<img src="images/1daa38085604dd04e91ebc5e609d1179.png">

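# Vanishing Gradients in Recurrent Neural Networks
**Causes:**
1. The length of the sequence;
2. The multiplicative effect: backpropagation through time multiplies many per-step factors together, so the gradient can shrink toward zero;

**Remedies:**
1. General techniques, e.g. accelerated gradient descent and L2 regularization;
2. Gating, which regulates how much information flows through each step;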
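
The multiplicative effect can be observed numerically. The sketch below assumes a tanh RNN with zero input and deliberately small random recurrent weights (the sizes and the 0.05 scale are illustrative assumptions), and tracks the gradient norm as it is propagated back through 50 steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 10, 50
W_aa = rng.standard_normal((n_a, n_a)) * 0.05  # small recurrent weights (assumption)

# Forward pass with zero input, recording the activation at every step.
a = rng.standard_normal((n_a, 1))
activations = []
for _ in range(T):
    a = np.tanh(W_aa @ a)
    activations.append(a)

# Backward pass: each step multiplies the gradient by W_aa^T diag(1 - a^2).
grad = np.ones((n_a, 1))
norms = []
for a_t in reversed(activations):
    grad = W_aa.T @ ((1 - a_t ** 2) * grad)
    norms.append(float(np.linalg.norm(grad)))
```

`norms[0]` is the gradient norm one step back and `norms[-1]` is the norm after all 50 steps; under these assumptions the latter is many orders of magnitude smaller.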

# **GRU** (Gated Recurrent Unit)
$\Gamma_u=\sigma(W_u[a^{<t-1>},x^{<t>}]+b_u)\\
\Gamma_r=\sigma(W_r[a^{<t-1>},x^{<t>}]+b_r)\\
\tilde c^{<t>}=tanh(W_{c}[\Gamma_ra^{<t-1>},x^{<t>}]+b_c)\\
c^{<t>} =\Gamma_{u}{\tilde{c}}^{<t>} + (1-\Gamma_{u})c^{<t-1>}\\
a^{<t>} =c^{<t>} 
$

**Notes:**
1. **c**: the cell, i.e. the memory cell;
2. $\Gamma_{u}$: the update gate, with values between 0 and 1;
3. $\Gamma_{r}$: the relevance gate, with values between 0 and 1;
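
The GRU equations translate almost line by line into NumPy. The following is a sketch of a single forward step, not a reference implementation; the function name `gru_cell_forward`, the parameter dictionary keys, and the sizes are assumptions chosen to mirror the symbols above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, params):
    """One GRU step, using a^{<t>} = c^{<t>} so c_prev doubles as a^{<t-1>}."""
    concat = np.vstack([c_prev, x_t])                        # [a^{<t-1>}, x^{<t>}]
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])  # relevance gate
    gated = np.vstack([gamma_r * c_prev, x_t])               # [Gamma_r * a^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(params["Wc"] @ gated + params["bc"])   # candidate memory value
    c_next = gamma_u * c_tilde + (1 - gamma_u) * c_prev      # blend new and old memory
    return c_next, gamma_u, gamma_r

rng = np.random.default_rng(0)
n_x, n_a, m = 3, 5, 10
params = {name: rng.standard_normal((n_a, n_a + n_x)) for name in ("Wu", "Wr", "Wc")}
params.update({name: np.zeros((n_a, 1)) for name in ("bu", "br", "bc")})

x_t = rng.standard_normal((n_x, m))
c_prev = rng.standard_normal((n_a, m))
c_next, gamma_u, gamma_r = gru_cell_forward(x_t, c_prev, params)
```

When the update gate is close to 0, `c_next` is almost a copy of `c_prev`, which is how the cell carries information across many steps.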

# LSTM (Long Short-Term Memory)
$\Gamma_u=\sigma(W_u[a^{<t-1>},x^{<t>}]+b_u)\\
\Gamma_f=\sigma(W_f[a^{<t-1>},x^{<t>}]+b_f)\\
\Gamma_o=\sigma(W_o[a^{<t-1>},x^{<t>}]+b_o)\\
\tilde c^{<t>}=\tanh(W_{c}[a^{<t-1>},x^{<t>}]+b_c)\\
c^{<t>} =\Gamma_{u}{\tilde{c}}^{<t>} + \Gamma_{f}c^{<t-1>}\\
a^{<t>} =\Gamma_{o}\tanh(c^{<t>})
$

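**Notes:**
1. **c**: the cell, i.e. the memory cell;
2. $\Gamma_{u}$: the update gate, with values between 0 and 1;
3. $\Gamma_{f}$: the forget gate, with values between 0 and 1;
4. $\Gamma_{o}$: the output gate, with values between 0 and 1;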

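# Bidirectional Recurrent Neural Networks (Bidirectional **RNN**)
A plain RNN or LSTM can only use information from earlier time steps to predict the output at the current step. In some problems, however, the current output depends on future states as well as past ones. Predicting a missing word in a sentence, for example, requires looking at the text after the gap as well as the text before it, so that the decision is truly based on the full context. A BRNN consists of two RNNs stacked on top of each other, one reading the sequence forward and one reading it backward, and the output is determined jointly by the states of both.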
<img src="images/5aa6424f0001485b07120276.png">
## Formula
$\large\hat y^{<t>}=g(W_{y}\left\lbrack {\overrightarrow{a}}^{<t>},{\overleftarrow{a}}^{<t>} \right\rbrack +b_y)$
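
To make the formula concrete, here is a sketch of a bidirectional forward pass for a single sequence. It assumes plain tanh cells, zero initial states, softmax as $g$, and random parameters; the helper `run_direction` and all variable names are hypothetical.

```python
import numpy as np

def run_direction(x, W_aa, W_ax, b_a, reverse=False):
    """Run a tanh RNN over x of shape (n_x, T); return activations of shape (n_a, T)."""
    n_a, T = W_aa.shape[0], x.shape[1]
    a = np.zeros((n_a, T))
    a_prev = np.zeros((n_a, 1))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        a_prev = np.tanh(W_aa @ a_prev + W_ax @ x[:, [t]] + b_a)
        a[:, [t]] = a_prev
    return a

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 4, 2, 6
x = rng.standard_normal((n_x, T))

# Independent parameters for each direction (random values, for illustration only).
fwd_params = (0.5 * rng.standard_normal((n_a, n_a)), 0.5 * rng.standard_normal((n_a, n_x)), np.zeros((n_a, 1)))
bwd_params = (0.5 * rng.standard_normal((n_a, n_a)), 0.5 * rng.standard_normal((n_a, n_x)), np.zeros((n_a, 1)))
a_fwd = run_direction(x, *fwd_params)                # summarizes x^{<1>} .. x^{<t>}
a_bwd = run_direction(x, *bwd_params, reverse=True)  # summarizes x^{<t>} .. x^{<T>}

# The output layer sees both directions stacked at every time step.
W_y = rng.standard_normal((n_y, 2 * n_a))
b_y = np.zeros((n_y, 1))
z = W_y @ np.vstack([a_fwd, a_bwd]) + b_y
e_z = np.exp(z - z.max(axis=0, keepdims=True))
y_hat = e_z / e_z.sum(axis=0, keepdims=True)
```

Changing the last input leaves the forward activations at earlier steps untouched but alters the backward ones, which is exactly the access to "future" context that the forward-only network lacks.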

# Deep Recurrent Neural Networks (Deep **RNN**s)
## Formula
$\large a^{[l]<t>}=g(W_a^{[l]}[a^{[l]<t-1>},a^{[l-1]<t>}]+b_a^{[l]})$
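
The formula says that layer $l$ at time $t$ combines its own previous activation with the current activation of the layer below. A minimal sketch for one sequence, assuming tanh as $g$, zero initial states, and hypothetical layer sizes:

```python
import numpy as np

def deep_rnn_forward(x, layers):
    """x: (n_x, T); layers: list of (W_a, b_a) with W_a of shape (n_a, n_a + n_in)."""
    inputs = x
    for W_a, b_a in layers:
        n_a, T = W_a.shape[0], inputs.shape[1]
        a = np.zeros((n_a, T))
        a_prev = np.zeros((n_a, 1))
        for t in range(T):
            # [a^{[l]<t-1>}, a^{[l-1]<t>}]: this layer's previous step, lower layer's current step
            stacked = np.vstack([a_prev, inputs[:, [t]]])
            a_prev = np.tanh(W_a @ stacked + b_a)
            a[:, [t]] = a_prev
        inputs = a  # the activations of layer l feed layer l + 1
    return inputs

rng = np.random.default_rng(0)
n_x, T, sizes = 3, 5, [4, 6]  # two stacked layers with 4 and 6 hidden units
layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.standard_normal((n_a, n_a + n_in)), np.zeros((n_a, 1))))
    n_in = n_a
top = deep_rnn_forward(rng.standard_normal((n_x, T)), layers)
```

The returned array holds the top layer's activations for every time step; an output layer would be applied to it in the same way as in the single-layer case.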
## Single-Layer RNN
<img src="images/616f645e887b51177db8cb6694a5d03b.png">

## Multi-Layer RNN
<img src="images/539b99a27a6e87a5cca9d52b5b2c81c5.png">


# Basic $RNN$
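Thanks to their "memory", $RNN$s are well suited to $NLP$ (natural language processing) and other sequence tasks. We now build our first simple $RNN$ with $numpy$. This $RNN$ is of the $many-to-many$ type, where the input and output sequences have the same length, i.e. $T_x=T_y$.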
<img src="images/07e3288e16d576a545d03e5127adbb91.png">

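## Forward Propagation
**Steps:**
1. Implement the computation for a single time step $t$;
2. Loop over all time steps, feeding in each step's input;

### RNN cell
A recurrent neural network is built by repeating a single unit, also called a cell. We implement one cell first, as shown in the figure.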
<img src="images/999911c9276922884a7caac8cc28dbff.png" style="width:700px;height:300px;">

**Implementing the RNN cell:**
1. Compute the hidden activation (also called the cell value) with the activation function: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
2. Use the activation $a^{\langle t \rangle}$ and `softmax` to compute the prediction $\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$.
3. Cache $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$.
4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$, and the cache.

Here we vectorize over $m$ examples: the input $x^{<t>}$ at time step $t$ has shape $(n_x,m)$, and the activation $a^{<t>}$ at time step $t$ has shape $(n_a,m)$.

In [4]:
def rnn_cell_forward(xt, a_prev, parameters):
    """
    Forward pass of a single RNN cell.
    Arguments:
    xt -- input at time step t, shape (n_x, m)
    a_prev -- activation at time step t-1, shape (n_a, m)
    parameters -- dict containing:
            Wax -- weight matrix applied to the input, shape (n_a, n_x)
            Waa -- weight matrix applied to the previous activation, shape (n_a, n_a)
            Wya -- weight matrix for the prediction, shape (n_y, n_a)
            ba --  bias for the activation, shape (n_a, 1)
            by --  bias for the prediction, shape (n_y, 1)
    Returns:
    a_next -- activation at time step t, shape (n_a, m)
    yt_pred -- prediction at time step t, shape (n_y, m)
    cache -- values needed for backpropagation: (a_next, a_prev, xt, parameters)
    """
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    # Activation at time step t
    a_next = np.tanh(np.dot(Wax,xt)+np.dot(Waa,a_prev)+ba)
    # Prediction at time step t
    yt_pred = softmax(np.dot(Wya,a_next)+by)
    # Cache the values needed for backpropagation
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_pred, cache

In [5]:
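def softmax(x):
    # Subtract the column-wise max before exponentiating for numerical stability;
    # the result is mathematically unchanged.
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0)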

In [6]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
a_next.shape,yt_pred.shape

((5, 10), (2, 10))

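### Putting the Cells Together
Next, we chain the cell implemented above across time steps to form the RNN forward pass.
1. Allocate storage for the activations of all time steps;
2. Initialize $a^{<0>}$;
3. Loop over the time steps in increasing order:
    1. Call `rnn_cell_forward`;
    2. Store the activation, prediction, and cache for time step t;
4. Return the activations, predictions, and caches for all time steps;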

In [7]:
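def rnn_forward(x, a0, parameters):
    """
    RNN forward propagation.
    Arguments:
    x -- inputs for all time steps, shape (n_x, m, T_x)
    a0 -- initial activation (time step 0), shape (n_a, m)
    parameters -- dict containing:
           Waa -- weight matrix applied to the activation of time step t-1, shape (n_a, n_a)
           Wax -- weight matrix applied to the input of time step t, shape (n_a, n_x)
           Wya -- weight matrix applied to the activation to get the prediction, shape (n_y, n_a)
           ba --  bias for the activation, shape (n_a, 1)
           by --  bias for the prediction, shape (n_y, 1)

    Returns:
    a -- activations for all time steps, shape (n_a, m, T_x)
    y_pred -- predictions for all time steps, shape (n_y, m, T_x)
    caches -- values needed for backpropagation: (list of per-step caches, x)
    """
    # List that collects the cache of every time step
    caches = []
    # Read the dimensions
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape

    # Allocate the activations and predictions
    a = np.zeros((n_a,m,T_x))
    y_pred = np.zeros((n_y,m,T_x))

    # Start from the initial activation
    a_next = a0

    # Iterate over the time steps in increasing order
    for t in range(T_x):
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t],a_next,parameters)
        a[:,:,t] = a_next
        y_pred[:,:,t] = yt_pred
        caches.append(cache)
    caches = (caches, x)
    return a, y_pred, caches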

In [8]:
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a, y_pred, caches = rnn_forward(x, a0, parameters)
a.shape,y_pred.shape,len(caches[0])

((5, 10, 4), (2, 10, 4), 4)

## Back Propagation
### RNN cell
As with forward propagation, we start with the backward pass of a single cell at time step t. It relies on the following derivative identity:
1. $\tanh^{'}(x)=1-\tanh{(x)}^2$;
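
The identity can be sanity-checked against a central finite difference. This is a quick numerical sketch; the grid of points and the step size are arbitrary choices.

```python
import numpy as np

z = np.linspace(-3, 3, 13)
eps = 1e-6

# Central difference approximation of d/dz tanh(z)
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
# Closed form used in the backward pass
analytic = 1 - np.tanh(z) ** 2

max_error = np.max(np.abs(numeric - analytic))
```

In the cell's backward pass the forward activation `a_next` already equals $\tanh(\cdot)$, so the derivative is simply `1 - a_next * a_next`.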
<img src="images/4235dc87f795faabe628a91cd1a681f6.png">

In [9]:
def rnn_cell_backward(da_next, cache):
    (a_next, a_prev, xt, parameters) = cache
    
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    # Gradient through tanh, using tanh'(z) = 1 - tanh(z)^2
    dtanh = (1-a_next*a_next)*da_next
    # Gradients w.r.t. the input and the input weights
    dxt = np.dot(Wax.T,  dtanh)
    dWax = np.dot(dtanh,xt.T)
    # Gradients w.r.t. the activation of time step t-1 and the recurrent weights
    da_prev = np.dot(Waa.T, dtanh)
    dWaa = np.dot( dtanh,a_prev.T)
    # Gradient w.r.t. the bias, summed over the batch
    dba = np.sum( dtanh,keepdims=True,axis=-1)

    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    
    return gradients

In [10]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}

a_next, yt, cache = rnn_cell_forward(xt, a_prev, parameters)

da_next = np.random.randn(5,10)
gradients = rnn_cell_backward(da_next, cache)
gradients["dxt"].shape,gradients["da_prev"].shape,gradients['dWax'].shape,gradients['dWaa'].shape,gradients['dba'].shape

((3, 10), (5, 10), (5, 3), (5, 5), (5, 1))

### Putting the Cells Together


In [11]:
def rnn_backward(da, caches):
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]  # values cached at the first time step

    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # Allocate the gradients
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))

    # Walk backward through time
    for t in reversed(range(T_x)):
        # Combine the gradient from the output at step t with the one flowing back from step t+1
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t]) 
        dxt, da_prevt, dWaxt, dWaat, dbat =\
        gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
        dx[:, :, t] = dxt
        # The parameters are shared across time steps, so their gradients accumulate
        dWax += dWaxt
        dWaa += dWaat
        dba += dbat
        
    # Gradient w.r.t. the initial activation
    da0 = da_prevt

    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
    
    return gradients

In [12]:
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
a, y, caches = rnn_forward(x, a0, parameters)
da = np.random.randn(5, 10, 4)
gradients = rnn_backward(da, caches)

gradients["dx"].shape,gradients["da0"].shape,gradients["dWax"].shape,gradients["dWaa"].shape,gradients["dba"].shape

((3, 10, 4), (5, 10), (5, 3), (5, 5), (5, 1))

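At this point we have a working Basic RNN model. It is adequate for simple applications, but on longer sequences it still suffers from vanishing gradients; in other words, the memory of an RNN is mostly "short-term memory".

Next, we build a somewhat more complex `LSTM` model, which copes better with vanishing gradients and therefore retains information over longer spans.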

# Long Short-Term Memory (LSTM) network
## LSTM CELL
<img src="images/eca9158c08600cd86442dfd1c18a3c86.png" style="width:500;height:400px;">

In [13]:
def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Forward pass of a single LSTM cell.
    Arguments:
    xt -- input at time step t, shape (n_x, m)
    a_prev -- activation at time step t-1, shape (n_a, m)
    c_prev -- cell state at time step t-1, shape (n_a, m)
    parameters -- dict containing:
              Wf -- weight matrix of the forget gate, shape (n_a, n_a + n_x)
              bf -- bias of the forget gate, shape (n_a, 1)
              Wu -- weight matrix of the update gate, shape (n_a, n_a + n_x)
              bu -- bias of the update gate, shape (n_a, 1)
              Wc -- weight matrix of the candidate cell value, shape (n_a, n_a + n_x)
              bc -- bias of the candidate cell value, shape (n_a, 1)
              Wo -- weight matrix of the output gate, shape (n_a, n_a + n_x)
              bo -- bias of the output gate, shape (n_a, 1)
              Wy -- weight matrix applied to the activation to get the prediction, shape (n_y, n_a)
              by -- bias for the prediction, shape (n_y, 1)
    Returns:
    a_next -- activation at time step t, shape (n_a, m)
    c_next -- cell state at time step t, shape (n_a, m)
    yt_pred -- prediction at time step t, shape (n_y, m)
    cache -- values needed for backpropagation:
             (a_next, c_next, a_prev, c_prev, ft, ut, cct, ot, xt, parameters)

    Note: ft/ut/ot are the forget/update/output gate values, cct is the candidate cell value, c is the cell state.
    """
    # Retrieve the parameters
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wu = parameters["Wu"]
    bu = parameters["bu"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve the dimensions
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    # Stack the activation of time step t-1 on top of the input of time step t
    concat = np.zeros((n_x+n_a,m))
    concat[: n_a, :] = a_prev
    concat[n_a :, :] = xt

    ft = sigmoid(np.dot(Wf,concat)+bf)    # forget gate
    ut = sigmoid(np.dot(Wu,concat)+bu)    # update gate
    cct = np.tanh(np.dot(Wc,concat)+bc)   # candidate cell value
    c_next = ft*c_prev + ut*cct           # new cell state
    ot = sigmoid(np.dot(Wo,concat)+bo)    # output gate
    a_next = ot*np.tanh(c_next)           # new activation

    yt_pred = softmax(np.dot(Wy, a_next) + by)

    cache = (a_next, c_next, a_prev, c_prev, ft, ut, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

In [14]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [15]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
c_prev = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wu = np.random.randn(5, 5+3)
bu = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wu": Wu, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bu": bu, "bo": bo, "bc": bc, "by": by}

a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)
a_next.shape,c_next.shape,yt.shape,len(cache)

((5, 10), (5, 10), (2, 10), 10)

## Putting the Cells Together

<img src="images/ecf8f17d1a32382fac8022d6fc9423a9.png" style="width:500;height:300px;">


In [16]:
def lstm_forward(x, a0, parameters):
    caches = []
    
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape
    
    # Allocate the activations, cell states, and predictions
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initial activation and cell state (the cell state starts at zero)
    a_next = a0
    c_next = np.zeros((n_a, m))
    
    # Iterate over the time steps in increasing order
    for t in range(T_x):
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
        a[:,:,t] = a_next
        y[:,:,t] = yt
        c[:,:,t]  = c_next
        caches.append(cache)
    caches = (caches, x)

    return a, y, c, caches

In [17]:
np.random.seed(1)
x = np.random.randn(3,10,7)
a0 = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wu = np.random.randn(5, 5+3)
bu = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wu": Wu, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bu": bu, "bo": bo, "bc": bc, "by": by}

a, y, c, caches = lstm_forward(x, a0, parameters)
a.shape,c.shape,y.shape,len(caches[0])

((5, 10, 7), (5, 10, 7), (2, 10, 7), 7)

## Backpropagation
Omitted.

# Text Generation
Automatically generating names for dinosaurs.

## Environment Setup

In [18]:
import numpy as np
import random
from random import shuffle

## Loading and Preprocessing the Data

In [19]:
# Read the text
with open('datasets/dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()  # convert to lowercase
chars = list(set(data))  # list of unique characters
len(data), len(chars)

(19909, 27)

## Indexing the Characters
"\n" marks the end of a name.

In [20]:
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
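
To show what the two dictionaries do, the sketch below encodes a name into indices and decodes it again. It does not read `datasets/dinos.txt`; instead it assumes the vocabulary is the newline character plus the 26 lowercase letters, which matches the 27 unique characters reported above. The example name is made up.

```python
import string

# Assumption: vocabulary = newline + 26 lowercase letters (27 characters in total).
chars = ["\n"] + list(string.ascii_lowercase)
char_to_ix = {ch: i for i, ch in enumerate(sorted(chars))}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

name = "trex"  # hypothetical example name
# Append the newline index so the sequence ends with the end-of-name marker.
encoded = [char_to_ix[ch] for ch in name] + [char_to_ix["\n"]]
decoded = "".join(ix_to_char[i] for i in encoded)
```

Because `"\n"` sorts before every letter, it receives index 0 and the letters `a` to `z` receive indices 1 to 26.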

## Gradient Clipping
Clipping keeps every gradient within a fixed range to guard against exploding gradients.
<img src="images/f6d8b886335927d46fad323c60335d09.png" style="width:400;height:150px;">



In [21]:
x=np.array([-11,4,12])
x=np.clip(x,a_min=-10,a_max=10)
x

array([-10,   4,  10])

In [22]:
def clip(gradients, maxValue):    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient,-maxValue , maxValue, out=gradient) 
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}    
    return gradients
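
The `clip` function above clips each gradient entry independently, which can change the direction of the overall gradient. A common alternative, shown here only as a sketch and not used elsewhere in this notebook, is to rescale all gradients together by their global L2 norm; the function name `clip_by_global_norm` and the sample gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients.values()))
    # Scale is 1 when the norm is already small enough, so small gradients are untouched.
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return {name: g * scale for name, g in gradients.items()}, global_norm

rng = np.random.default_rng(0)
grads = {
    "dWaa": rng.standard_normal((5, 5)) * 10,
    "db": rng.standard_normal((5, 1)) * 10,
}
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped.values()))
```

Unlike the in-place element-wise version, this variant returns new arrays and keeps the relative sizes of all gradient entries.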

In [23]:
dWax = np.random.randn(5,3)*10
dWaa = np.random.randn(5,5)*10
dWya = np.random.randn(2,5)*10
db = np.random.randn(5,1)*10
dby = np.random.randn(2,1)*10
gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}
clip(gradients, 10)

{'dWaa': array([[ 10.        ,   8.70969803,  -5.08457134,   7.77419205,
          -1.18771172],
        [ -1.98998184,  10.        ,  -4.18937898,  -4.79184915,
         -10.        ],
        [-10.        ,   4.51122939,  -6.94920901,   5.15413802,
         -10.        ],
        [ -7.67309826,   6.74570707,  10.        ,   5.92472801,
          10.        ],
        [ 10.        ,  10.        ,  -9.18440038,  -1.05344713,
           6.30195671]]),
 'dWax': array([[  6.74396105,  -7.22391905,  10.        ],
        [ -9.0163449 ,  -8.22467189,   7.21711292],
        [ -6.25342001,  -5.93843067,  -3.43900709],
        [-10.        ,  10.        ,   6.08514698],
        [ -0.69328697,  -1.08392067,   4.50155513]]),
 'dWya': array([[ -4.14846901,   4.51946037, -10.        ,  -8.28627979,
           5.28879746],
        [-10.        , -10.        ,  -0.17718318, -10.        ,
           0.57120996]]),
 'db': array([[-7.99547491],
        [-2.91594596],
        [-2.58982853],
        [ 1.

## How Is Text Generated?
<img src="images/b10e5461e536f8c93bbf5f025f686eb8.png" style="width:500;height:300px;">

**Steps:**
1. Initialize $x^{<1>}=\vec 0$ and $a^{<0>}=\vec 0$;
2. Forward propagation:
    1. $ a^{\langle t+1 \rangle} = \tanh(W_{ax}  x^{\langle t \rangle } + W_{aa} a^{\langle t \rangle } + b)$
    2. $ z^{\langle t + 1 \rangle } = W_{ya}  a^{\langle t + 1 \rangle } + b_y $
    3. $ \hat{y}^{\langle t+1 \rangle } = softmax(z^{\langle t + 1 \rangle })$
3. Sample a character index from the probability distribution $ \hat{y}^{\langle t+1 \rangle }$;

Sampling can be done with `np.random.choice()`, for example with the following distribution:

$P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2$.

In [24]:
from collections import Counter
p = np.array([0.1, 0.0, 0.7, 0.2])
Counter(np.random.choice([0, 1, 2, 3],size=100, p = p))    

Counter({2: 70, 3: 23, 0: 7})

4. One-hot encode the character sampled in step 3 and use it as the input x for the next time step;

In [25]:
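def sample(parameters, char_to_ix):
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]

    # Start from a zero input vector and a zero hidden state
    x = np.zeros((vocab_size,1))
    a_prev = np.zeros((n_a,1))
    
    indices = []  # sampled character indices
    
    idx = -1 

    counter = 0
    newline_character = char_to_ix['\n']
    
    # Stop at a newline or after 50 characters
    while (idx != newline_character and counter != 50):

        a = np.tanh(np.dot(Wax,x)+np.dot(Waa,a_prev)+b)
        z = np.dot(Wya,a)+by
        y = softmax(z)
                
        # Sample the next character index from the predicted distribution
        idx = np.random.choice(range(len(y)),p=y.ravel())

        indices.append(idx)
        
        # One-hot encode the sampled character as the next input
        x = np.zeros((vocab_size,1))
        x[idx] = 1
        
        a_prev = a
        # Count the generated characters so the 50-character cap can take effect
        counter += 1
        
    # Terminate names that hit the length cap with a newline
    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices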

In [26]:
n, n_a = 20, 100
a0 = np.random.randn(n_a, 1)
i0 = 1
vocab_size=len(chars)
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}

indices = sample(parameters, char_to_ix)
str(indices),str([ix_to_char[i] for i in indices])

('[5, 11, 10, 26, 25, 15, 8, 17, 11, 17, 14, 19, 1, 20, 6, 5, 16, 5, 19, 26, 3, 1, 11, 4, 22, 15, 26, 26, 25, 5, 5, 5, 0]',
 "['e', 'k', 'j', 'z', 'y', 'o', 'h', 'q', 'k', 'q', 'n', 's', 'a', 't', 'f', 'e', 'p', 'e', 's', 'z', 'c', 'a', 'k', 'd', 'v', 'o', 'z', 'z', 'y', 'e', 'e', 'e', '\\n']")

## Building the Model

In [27]:
def rnn_step_forward(parameters, a_prev, x):    
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
    p_t = softmax(np.dot(Wya, a_next) + by) 
    
    return a_next, p_t

In [28]:
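def rnn_forward(X, Y, a0, parameters, vocab_size = 27):    
    x, a, y_hat = {}, {}, {}
    
    # a[-1] holds the initial hidden state
    a[-1] = np.copy(a0)
    
    loss = 0
    
    for t in range(len(X)):
        # One-hot encode the input; None stands for the all-zero first input
        x[t] = np.zeros((vocab_size,1)) 
        if X[t] is not None:
            x[t][X[t]] = 1
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        
        # Cross-entropy loss: negative log-probability of the target character
        loss -= np.log(y_hat[t][Y[t],0])
        
    cache = (y_hat, a, x)        
    return loss, cache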

In [29]:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):    
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next']
    daraw = (1 - a * a) * da 
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients

In [30]:
def rnn_backward(X, Y, parameters, cache):
    gradients = {}
    
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    
    return gradients, a

In [31]:
def update_parameters(parameters, gradients, lr):
    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    return parameters

In [32]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    loss, cache = rnn_forward(X,Y,a_prev,parameters)
    
    gradients, a = rnn_backward(X,Y,parameters,cache)
    
    gradients = clip(gradients,5)
    
    parameters = update_parameters(parameters,gradients,learning_rate)    
  
    return loss, gradients, a[len(X)-1]

In [33]:
vocab_size, n_a = 27, 100
a_prev = np.random.randn(n_a, 1)
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}
X = [12,3,5,11,22,3]
Y = [4,14,11,22,25, 26]

loss, gradients, a_last = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)

In [34]:
# loss, gradients, a_last

## Training the Model

In [35]:
def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    n_x, n_y = vocab_size, vocab_size
    
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    loss = -np.log(1.0/vocab_size)*dino_names
    
    with open("datasets/dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    shuffle(examples)
    
    a_prev = np.zeros((n_a, 1))
    
    for j in range(num_iterations):

        index = j%len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix["\n"]]
        
        cur_loss, gradients, a_prev = optimize(X,Y,a_prev,parameters,learning_rate=0.01)  
        # Smooth the loss with an exponential moving average
        loss = loss * 0.999 + cur_loss * 0.001

        if j % 2000 == 0:            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            for name in range(dino_names):
                
                sampled_indices = sample(parameters, char_to_ix)
                txt = ''.join(ix_to_char[ix] for ix in sampled_indices)
                txt = txt[0].upper() + txt[1:]
                print ('%s' % (txt, ), end='')                    
        
    return parameters

In [36]:
def initialize_parameters(n_a, n_x, n_y):
    np.random.seed(1)
    Wax = np.random.randn(n_a, n_x)*0.01 
    Waa = np.random.randn(n_a, n_a)*0.01 
    Wya = np.random.randn(n_y, n_a)*0.01
    b = np.zeros((n_a, 1)) 
    by = np.zeros((n_y, 1))     
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    
    return parameters

In [37]:
parameters = model(data, ix_to_char, char_to_ix)

Iteration: 0, Loss: 23.084040

Flcaudczarjdgswgikvtentlneqaijpyenupscvhkbrvmbbcmp
Azglqsbbxhsslbxvklnjmlgwaitmumzw
Lwiskbhhqyrhrbjerqiddfhwcdipafvqichfshrkiudlohsxadmuq
Hqrmapmsjvhw
Tvrxeifvtwidzlsqcvzziycyakvssgkojmnvsq
Dwtdadvztu
Wrimyaednfbmvr
Iteration: 2000, Loss: 27.933262

Arasaurus
Hanpa
Onis
Gcanes
Tos
Glen
Otaus
Iteration: 4000, Loss: 25.906777

Ngogitoritys
Jgcanta
Hoptosaura
Doraptor
Octxanoraus
Rachaglytaten
Onodiroroptinlonis
Iteration: 6000, Loss: 24.653347

Prohmolasaurus
Miwenapelichosaurus

Luanechesaurus
Nascelopsobhonorseuruccidysaurus
Biylosaurus

Iteration: 8000, Loss: 24.111200

Mwelashus
Atotorkor
Hieshossynzsaurunus
Anisaurusuonduenodra
Alosnhoconasauauskeldimaloaleileteranosaurus
Chacilus
Ianchiosaurus
Iteration: 10000, Loss: 23.829090

Cheventosaurus
Ishosaurus
Uroptosaurus
Rleneryta
Alynosaurus
Carhyosbs
Rasauson
Iteration: 12000, Loss: 23.368544

Jeetetosaurus
Hugcosaurus
Njnmurus
Aropiamacroeia
Scsaurus
Qaxaradon
Alchatorsaurus
Iteration: 14000, Loss: 23.1

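## Conclusion
1. The names generated at the start are essentially random. As the number of iterations grows, the smoothed loss generally decreases, and the generated strings look more and more like dinosaur names.
2. Possible improvements: more iterations, a larger training set, hyperparameter tuning, a better optimization algorithm, etc.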