# Building your Recurrent Neural Network - Step by Step

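Welcome to the first assignment of Course 5! In this assignment, you will implement your first Recurrent Neural Network in numpy.

Recurrent Neural Networks (RNNs) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a unidirectional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.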

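**Notation**:
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer.
    - Example: $a^{[4]}$ is the $4^{th}$ layer activation. $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.

- Superscript $(i)$ denotes an object associated with the $i^{th}$ example.
    - Example: $x^{(i)}$ is the $i^{th}$ training example input.

- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step.
    - Example: $x^{\langle t \rangle}$ is the input $x$ at the $t^{th}$ time-step. $x^{(i)\langle t \rangle}$ is the input at the $t^{th}$ time-step of example $i$.

- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$.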
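
As a rough illustration of how this notation maps onto arrays, the sketch below assumes the layout used later in this notebook, where a batch of sequences is stored as an array of shape $(n_x, m, T_x)$. The variable names and sizes are made up for the example. Note that numpy indices are 0-based while the notation above is 1-based.

```python
import numpy as np

n_x, m, T_x = 3, 10, 4          # hypothetical sizes: features, examples, time-steps
x = np.arange(n_x * m * T_x, dtype=float).reshape(n_x, m, T_x)

t, i = 2, 5                     # 0-based indices chosen for illustration
x_t = x[:, :, t]                # x^<t> for all examples: shape (n_x, m)
x_i_t = x[:, i, t]              # x^(i)<t> for a single example: shape (n_x,)

print(x_t.shape)                # (3, 10)
print(x_i_t.shape)              # (3,)
```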

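We assume that you are already familiar with `numpy` and/or have completed the previous courses of the specialization. Let's get started!

Let's first import all the packages that you will need during this assignment.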

In [1]:
import numpy as np
from rnn_utils import *
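
The `rnn_utils` module is not shown in this notebook. The cells below call `softmax` and `sigmoid` from it. The sketch below is one plausible way such helpers could be written, assuming `softmax` normalizes each column (one column per example). It is an assumption for readers who do not have the file, not the course's actual code.

```python
import numpy as np

def softmax(x):
    """Column-wise softmax; subtracting the max keeps exp() from overflowing."""
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / e_x.sum(axis=0, keepdims=True)

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([[1.0, 2.0],
                   [1.0, 0.0]])
probs = softmax(scores)
print(probs.sum(axis=0))   # each column sums to 1
print(sigmoid(0.0))        # 0.5
```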

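## 1 - Forward propagation for the basic Recurrent Neural Network

Later this week, you will generate music using an RNN. The basic RNN that you will implement has the structure below. In this example, $T_x = T_y$.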

<img src="images/RNN.png" style="width:500;height:300px;">
<caption><center> **Figure 1**: Basic RNN model </center></caption>

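Here's how you can implement an RNN:

**Steps**:
1. Implement the calculations needed for one time-step of the RNN.
2. Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time.

Let's go!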

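### 1.1 - RNN cell

A recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.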

<img src="images/rnn_step_forward.png" style="width:700px;height:300px;">
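<caption><center> **Figure 2**: Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t - 1\rangle}$ (previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$, which is given to the next RNN cell and also used to predict $y^{\langle t \rangle}$ </center></caption>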

**Exercise**: Implement the RNN-cell described in Figure (2).

**Instructions**:
1. Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
2. Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$. We provided you a function: `softmax`.
3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in cache
4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$ and cache

We will vectorize over $m$ examples. Thus, $x^{\langle t \rangle}$ will have dimension $(n_x,m)$, and $a^{\langle t \rangle}$ will have dimension $(n_a,m)$.
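
The dimension bookkeeping above can be checked with a small sketch. The sizes below are arbitrary; the point is only that the bias of shape $(n_a, 1)$ is broadcast across the $m$ columns.

```python
import numpy as np

n_x, n_a, m = 3, 5, 10                 # arbitrary sizes for illustration
rng = np.random.default_rng(0)
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))

# (n_a, n_a) @ (n_a, m) + (n_a, n_x) @ (n_x, m) -> (n_a, m); ba broadcasts over columns
a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)
print(a_next.shape)                    # (5, 10)
```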

In [4]:
# GRADED FUNCTION: rnn_cell_forward

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    ### START CODE HERE ### (≈2 lines)
    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Wax, xt) + np.dot(Waa, a_prev) + ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by)
    ### END CODE HERE ###
    
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache

In [5]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("yt_pred[1] =", yt_pred[1])
print("yt_pred.shape = ", yt_pred.shape)

a_next[4] =  [ 0.59584544  0.18141802  0.61311866  0.99808218  0.85016201  0.99980978
 -0.18887155  0.99815551  0.6531151   0.82872037]
a_next.shape =  (5, 10)
yt_pred[1] = [0.9888161  0.01682021 0.21140899 0.36817467 0.98988387 0.88945212
 0.36920224 0.9966312  0.9982559  0.17746526]
yt_pred.shape =  (2, 10)


**Expected Output**:

<table>
    <tr>
        <td>
            **a_next[4]**:
        </td>
        <td>
           [ 0.59584544  0.18141802  0.61311866  0.99808218  0.85016201  0.99980978
 -0.18887155  0.99815551  0.6531151   0.82872037]
        </td>
    </tr>
        <tr>
        <td>
            **a_next.shape**:
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **yt_pred[1]**:
        </td>
        <td>
           [ 0.9888161   0.01682021  0.21140899  0.36817467  0.98988387  0.88945212
  0.36920224  0.9966312   0.9982559   0.17746526]
        </td>
    </tr>
        <tr>
        <td>
            **yt_pred.shape**:
        </td>
        <td>
           (2, 10)
        </td>
    </tr>

</table>

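### 1.2 - RNN forward pass

You can see an RNN as the repetition of the cell you've just built. If your input sequence of data is carried over 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step's input data ($x^{\langle t \rangle}$). It outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for this time-step.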


<img src="images/rnn.png" style="width:800px;height:300px;">
<caption><center> **Figure 3**: Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. </center></caption>



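**Exercise**: Code the forward propagation of the RNN described in Figure (3).

**Instructions**:
1. Create an array of zeros ($a$) that will store all the hidden states computed by the RNN.
2. Initialize the "next" hidden state as $a_0$ (initial hidden state).
3. Start looping over each time step; your incremental index is $t$:
    - Update the "next" hidden state and the cache by running `rnn_cell_forward`.
    - Store the "next" hidden state in $a$ ($t^{th}$ position)
    - Store the prediction in $y$
    - Add the cache to the list of caches
4. Return $a$, $y$ and caches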

In [12]:
# GRADED FUNCTION: rnn_forward

def rnn_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    """
    
    # Initialize "caches" which will contain the list of all caches
    caches = []
    
    # Retrieve dimensions from shapes of x and Wya
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
    
    ### START CODE HERE ###
    
    # initialize "a" and "y" with zeros (≈2 lines)
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    
    # Initialize a_next (≈1 line)
    a_next = a0
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache (≈1 line)
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next,parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y_pred[:,:,t] = yt_pred
        # Append "cache" to "caches" (≈1 line)
        caches.append(cache)
        
    ### END CODE HERE ###
    
    # store values needed for backward propagation in cache
    caches = (caches, x)
    
    return a, y_pred, caches

In [13]:
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa": Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a, y_pred, caches = rnn_forward(x, a0, parameters)
print("a[4][1] = ", a[4][1])
print("a.shape = ", a.shape)
print("y_pred[1][3] =", y_pred[1][3])
print("y_pred.shape = ", y_pred.shape)
print("caches[1][1][3] =", caches[1][1][3])
print("len(caches) = ", len(caches))

a[4][1] =  [-0.99999375  0.77911235 -0.99861469 -0.99833267]
a.shape =  (5, 10, 4)
y_pred[1][3] = [0.79560373 0.86224861 0.11118257 0.81515947]
y_pred.shape =  (2, 10, 4)
caches[1][1][3] = [-1.1425182  -0.34934272 -0.20889423  0.58662319]
len(caches) =  2


**Expected Output**:

<table>
    <tr>
        <td>
            **a[4][1]**:
        </td>
        <td>
           [-0.99999375  0.77911235 -0.99861469 -0.99833267]
        </td>
    </tr>
        <tr>
        <td>
            **a.shape**:
        </td>
        <td>
           (5, 10, 4)
        </td>
    </tr>
        <tr>
        <td>
            **y_pred[1][3]**:
        </td>
        <td>
           [ 0.79560373  0.86224861  0.11118257  0.81515947]
        </td>
    </tr>
        <tr>
        <td>
            **y_pred.shape**:
        </td>
        <td>
           (2, 10, 4)
        </td>
    </tr>
        <tr>
        <td>
            **caches[1][1][3]**:
        </td>
        <td>
           [-1.1425182  -0.34934272 -0.20889423  0.58662319]
        </td>
    </tr>
        <tr>
        <td>
            **len(caches)**:
        </td>
        <td>
           2
        </td>
    </tr>

</table>
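
The value `len(caches) = 2` can look odd for a sequence with 4 time-steps. It follows from the last lines of `rnn_forward`, which wrap the per-step list and the input into a 2-tuple. The sketch below mimics that structure with placeholder strings standing in for the real per-step caches.

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(3, 10, 4)                    # same shape as the test input above
step_caches = ["cache_t%d" % t for t in range(x.shape[2])]  # placeholders, not real caches
caches = (step_caches, x)                        # same layout rnn_forward returns

print(len(caches))          # 2: (list of caches, x)
print(len(caches[0]))       # 4: one entry per time-step
print(np.array_equal(caches[1][1][3], x[1][3]))  # True: caches[1] is just x
```

So `caches[1][1][3]` in the test above indexes into the input `x`, not into a per-step cache.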

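Congratulations! You've successfully built the forward propagation of a recurrent neural network from scratch. This will work well enough for some applications, but it suffers from vanishing gradient problems. So it works best when each output $y^{\langle t \rangle}$ can be estimated using mainly "local" context (meaning information from inputs $x^{\langle t' \rangle}$ where $t'$ is not too far from $t$).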
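
To make the vanishing gradient problem concrete, the sketch below pushes a gradient backward through the hidden-state recurrence $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle})$ (inputs and bias dropped for brevity). The choice $W_{aa} = 0.5 I$ is an assumption picked so the shrinkage is easy to see; real weights behave less uniformly.

```python
import numpy as np

n_a, T = 4, 20
Waa = 0.5 * np.eye(n_a)              # hypothetical recurrent weights
a = np.full((n_a, 1), 0.3)           # arbitrary initial hidden state

# Forward pass: keep every hidden state for the backward pass.
states = []
for _ in range(T):
    a = np.tanh(Waa @ a)
    states.append(a)

# Backward pass: da_prev = Waa^T ((1 - a^2) * da), applied T times.
da = np.ones((n_a, 1))
norms = [np.linalg.norm(da)]
for a_t in reversed(states):
    da = Waa.T @ ((1 - a_t ** 2) * da)
    norms.append(np.linalg.norm(da))

print(norms[0], norms[-1])           # the norm shrinks by at least half per step
```

After 20 steps the gradient that reaches the first time-step is smaller than the starting gradient by a factor of at least $2^{20}$, which is why early inputs have little influence on learning.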

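In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM will be better able to remember a piece of information and keep it saved for many time-steps.

## 2 - Long Short-Term Memory (LSTM) network

The following figure shows the operations of an LSTM-cell.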

<img src="images/LSTM.png" style="width:500;height:400px;">
<caption><center> **Figure 4**: LSTM-cell. This tracks and updates a "cell state" or memory variable $c^{\langle t \rangle}$ at every time-step, which can be different from $a^{\langle t \rangle}$. </center></caption>

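Similar to the RNN example above, you will start by implementing the LSTM cell for a single time-step. Then you can iteratively call it from inside a for-loop to have it process an input with $T_x$ time-steps.

### About the gates

#### - Forget gate

For the sake of this illustration, let's assume we are reading words in a piece of text, and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this: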

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1} $$


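Here, $W_f$ are weights that govern the forget gate's behavior. We concatenate $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0), then the LSTM should remove that piece of information (e.g. the singular subject) from the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information.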

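#### - Update gate

Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:

$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2} $$


Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.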

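#### - Updating the cell

To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is: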

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3} $$


Finally, the new cell state is:

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4} $$
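
A tiny hand-made example may help with equation (4). The gate values below are set by hand rather than computed from equations (1) and (2), purely to show how the element-wise products select what is kept and what is overwritten.

```python
import numpy as np

c_prev = np.array([[1.0], [2.0]])       # previous cell state
gamma_f = np.array([[0.0], [1.0]])      # forget gate: drop entry 0, keep entry 1
gamma_u = np.array([[1.0], [0.0]])      # update gate: write entry 0, skip entry 1
c_tilde = np.array([[5.0], [5.0]])      # candidate values

c_next = gamma_f * c_prev + gamma_u * c_tilde   # equation (4)
print(c_next.ravel())                   # [5. 2.]
```

Entry 0 is forgotten and replaced by the candidate value, while entry 1 is carried over unchanged.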


#### - Output gate

To decide which outputs we will use, we will use the following two formulas:

$$ \Gamma_o^{\langle t \rangle}=  \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}$$ 
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6} $$


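In equation 5, you decide what to output using a sigmoid function; in equation 6, you multiply that by the $\tanh$ of the new cell state $c^{\langle t \rangle}$.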

### 2.1 - LSTM cell

**Exercise**: Implement the LSTM cell described in Figure (4).

**Instructions**:
1. Concatenate $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ in a single matrix: $concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
2. Compute all the formulas 1-6. You can use `sigmoid()` (provided) and `np.tanh()`.
3. Compute the prediction $y^{\langle t \rangle}$. You can use `softmax()` (provided).
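
Step 1 can be written in more than one way. The sketch below compares pre-allocating and filling slices with a single `np.concatenate` call along axis 0; both stack $a^{\langle t-1 \rangle}$ on top of $x^{\langle t \rangle}$. The sizes are arbitrary.

```python
import numpy as np

n_x, n_a, m = 3, 5, 10
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))

# Option 1: allocate, then fill the two row blocks.
concat_a = np.zeros((n_a + n_x, m))
concat_a[:n_a, :] = a_prev
concat_a[n_a:, :] = xt

# Option 2: stack along the row axis in one call.
concat_b = np.concatenate((a_prev, xt), axis=0)

print(concat_b.shape)                      # (8, 10)
print(np.array_equal(concat_a, concat_b))  # True
```

Either way, the row order must match the column layout of the weight matrices, whose shape is $(n_a, n_a + n_x)$.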

In [20]:
# GRADED FUNCTION: lstm_cell_forward

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implement a single forward step of the LSTM-cell as described in Figure (4)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the memory value
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    ### START CODE HERE ###
    # Concatenate a_prev and xt (≈3 lines)
    concat = np.zeros((n_x+n_a, m))
    concat[: n_a, :] = a_prev
    concat[n_a :, :] = xt

    # Compute values for ft, it, cct, c_next, ot, a_next using the formulas given figure (4) (≈6 lines)
    ft = sigmoid(np.dot(Wf, concat) + bf)
    it = sigmoid(np.dot(Wi, concat) + bi)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    c_next = np.multiply(ft,c_prev) + np.multiply(it,cct)
    ot = sigmoid(np.dot(Wo, concat) + bo)
    a_next = ot * np.tanh(c_next)
    
    # Compute prediction of the LSTM cell (≈1 line)
    yt_pred = softmax(np.dot(Wy, a_next) +by)
    ### END CODE HERE ###

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

In [21]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
c_prev = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)
print("a_next[4] = ", a_next[4])
print("a_next.shape = ", a_next.shape)
print("c_next[2] = ", c_next[2])
print("c_next.shape = ", c_next.shape)
print("yt[1] =", yt[1])
print("yt.shape = ", yt.shape)
print("cache[1][3] =", cache[1][3])
print("len(cache) = ", len(cache))

a_next[4] =  [-0.66408471  0.0036921   0.02088357  0.22834167 -0.85575339  0.00138482
  0.76566531  0.34631421 -0.00215674  0.43827275]
a_next.shape =  (5, 10)
c_next[2] =  [ 0.63267805  1.00570849  0.35504474  0.20690913 -1.64566718  0.11832942
  0.76449811 -0.0981561  -0.74348425 -0.26810932]
c_next.shape =  (5, 10)
yt[1] = [0.79913913 0.15986619 0.22412122 0.15606108 0.97057211 0.31146381
 0.00943007 0.12666353 0.39380172 0.07828381]
yt.shape =  (2, 10)
cache[1][3] = [-0.16263996  1.03729328  0.72938082 -0.54101719  0.02752074 -0.30821874
  0.07651101 -1.03752894  1.41219977 -0.37647422]
len(cache) =  10


**Expected Output**:

<table>
    <tr>
        <td>
            **a_next[4]**:
        </td>
        <td>
           [-0.66408471  0.0036921   0.02088357  0.22834167 -0.85575339  0.00138482
  0.76566531  0.34631421 -0.00215674  0.43827275]
        </td>
    </tr>
        <tr>
        <td>
            **a_next.shape**:
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **c_next[2]**:
        </td>
        <td>
           [ 0.63267805  1.00570849  0.35504474  0.20690913 -1.64566718  0.11832942
  0.76449811 -0.0981561  -0.74348425 -0.26810932]
        </td>
    </tr>
        <tr>
        <td>
            **c_next.shape**:
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **yt[1]**:
        </td>
        <td>
           [ 0.79913913  0.15986619  0.22412122  0.15606108  0.97057211  0.31146381
  0.00943007  0.12666353  0.39380172  0.07828381]
        </td>
    </tr>
        <tr>
        <td>
            **yt.shape**:
        </td>
        <td>
           (2, 10)
        </td>
    </tr>
    <tr>
        <td>
            **cache[1][3]**:
        </td>
        <td>
           [-0.16263996  1.03729328  0.72938082 -0.54101719  0.02752074 -0.30821874
  0.07651101 -1.03752894  1.41219977 -0.37647422]
        </td>
    </tr>
        <tr>
        <td>
            **len(cache)**:
        </td>
        <td>
           10
        </td>
    </tr>

</table>

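### 2.2 - Forward pass for LSTM

Now that you have implemented one step of an LSTM, you can iterate it with a for-loop to process a sequence of $T_x$ inputs.

<img src="images/LSTM_rnn.png" style="width:500;height:300px;">
<caption><center> **Figure 4**: LSTM over multiple time-steps. </center></caption>

**Exercise:** Implement `lstm_forward()` to run an LSTM over $T_x$ time-steps.

**Note**: $c^{\langle 0 \rangle}$ is initialized with zeros.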

In [30]:
# GRADED FUNCTION: lstm_forward

def lstm_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network using an LSTM-cell described in Figure (4).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    c -- Cell (memory) states for every time-step, numpy array of shape (n_a, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
    """

    # Initialize "caches", which will track the list of all the caches
    caches = []
    
    ### START CODE HERE ###
    # Retrieve dimensions from shapes of x and Wy (≈2 lines)
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape
    
    # initialize "a", "c" and "y" with zeros (≈3 lines)
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initialize a_next and c_next (≈2 lines)
    a_next = a0
    c_next = np.zeros((n_a, m))
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache (≈1 line)
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next,c_next,parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y[:,:,t] = yt
        # Save the value of the next cell state (≈1 line)
        c[:,:,t]  = c_next
        # Append the cache into caches (≈1 line)
        caches.append(cache)
        
    ### END CODE HERE ###
    
    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches

In [31]:
np.random.seed(1)
x = np.random.randn(3,10,7)
a0 = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a, y, c, caches = lstm_forward(x, a0, parameters)
print("a[4][3][6] = ", a[4][3][6])
print("a.shape = ", a.shape)
print("y[1][4][3] =", y[1][4][3])
print("y.shape = ", y.shape)
print("caches[1][1][1] =", caches[1][1][1])
print("c[1][2][1]", c[1][2][1])
print("len(caches) = ", len(caches))

a[4][3][6] =  0.17211776753291672
a.shape =  (5, 10, 7)
y[1][4][3] = 0.9508734618501101
y.shape =  (2, 10, 7)
caches[1][1][1] = [ 0.82797464  0.23009474  0.76201118 -0.22232814 -0.20075807  0.18656139
  0.41005165]
c[1][2][1] -0.8555449167181981
len(caches) =  2


**Expected Output**:

<table>
    <tr>
        <td>
            **a[4][3][6]** =
        </td>
        <td>
           0.172117767533
        </td>
    </tr>
        <tr>
        <td>
            **a.shape** =
        </td>
        <td>
           (5, 10, 7)
        </td>
    </tr>
        <tr>
        <td>
            **y[1][4][3]** =
        </td>
        <td>
           0.95087346185
        </td>
    </tr>
        <tr>
        <td>
            **y.shape** =
        </td>
        <td>
           (2, 10, 7)
        </td>
    </tr>
        <tr>
        <td>
            **caches[1][1][1]** =
        </td>
        <td>
           [ 0.82797464  0.23009474  0.76201118 -0.22232814 -0.20075807  0.18656139
  0.41005165]
        </td>    
     </tr>
        <tr>
        <td>
            **c[1][2][1]** =
        </td>
        <td>
           -0.855544916718
        </td>
    </tr> 
        <tr>
        <td>
            **len(caches)** =
        </td>
        <td>
           2
        </td>
    </tr>

</table>

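Congratulations! You have now implemented the forward passes for the basic RNN and the LSTM. When using a deep learning framework, implementing the forward pass is sufficient to build systems that achieve great performance.

The rest of this notebook is optional, and will not be graded.

## 3 - Backpropagation in recurrent neural networks (OPTIONAL / UNGRADED)

In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so most deep learning engineers do not need to bother with the details of the backward pass. If however you are an expert in calculus and want to see the details of backprop in RNNs, you can work through this optional portion of the notebook.

When in an earlier course you implemented a simple (fully connected) neural network, you used backpropagation to compute the derivatives of the cost with respect to the parameters, in order to update them. Similarly, in recurrent neural networks you can calculate the derivatives of the cost in order to update the parameters. The backprop equations are quite complicated and we did not derive them in lecture. However, we will briefly present them below.

### 3.1 - Basic RNN backward pass

We will start by computing the backward pass for the basic RNN-cell.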

<img src="images/rnn_cell_backprop.png" style="width:500;height:300px;"> <br>
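<caption><center> **Figure 5**: RNN-cell's backward pass. Just like in a fully-connected neural network, the derivative of the cost function $J$ backpropagates through the RNN by following the chain rule from calculus. The chain rule is also used to calculate $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b})$ to update the parameters $(W_{ax}, W_{aa}, b_a)$. </center></caption>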

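#### Deriving the one step backward functions:

To compute `rnn_cell_backward` you need to compute the following equations. It is a good exercise to derive them by hand.

The derivative of $\tanh(x)$ is $1-\tanh(x)^2$. You can find the complete proof [here](https://www.wyzant.com/resources/lessons/math/calculus/derivative_proofs/tanx). Note that: $\text{sech}(x)^2 = 1 - \tanh(x)^2$

Similarly, for $\frac{ \partial a^{\langle t \rangle} } {\partial W_{ax}}, \frac{ \partial a^{\langle t \rangle} } {\partial W_{aa}},  \frac{ \partial a^{\langle t \rangle} } {\partial b}$, the derivative of $\tanh(u)$ is $(1-\tanh(u)^2)du$.

The final two equations also follow the same rule and are derived using the $\tanh$ derivative. Note that the terms are arranged so that the dimensions match.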
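
One way to sanity-check these derivatives is to compare them with finite differences. The sketch below is a possible set of one-step gradients, not the official solution. It assumes a scalar loss $J = \sum_{j,k} G_{jk}\, a^{\langle t \rangle}_{jk}$, where $G$ is the upstream gradient `da_next`, and the helper names `cell_forward` and `cell_backward` are made up for the example.

```python
import numpy as np

def cell_forward(xt, a_prev, Wax, Waa, ba):
    return np.tanh(Wax @ xt + Waa @ a_prev + ba)

def cell_backward(da_next, a_next, a_prev, xt, Wax, Waa):
    # Backprop through tanh: d/du tanh(u) = 1 - tanh(u)^2, and a_next = tanh(u).
    dtanh = (1 - a_next ** 2) * da_next
    return {
        "dxt": Wax.T @ dtanh,                       # (n_x, m)
        "da_prev": Waa.T @ dtanh,                   # (n_a, m)
        "dWax": dtanh @ xt.T,                       # (n_a, n_x)
        "dWaa": dtanh @ a_prev.T,                   # (n_a, n_a)
        "dba": dtanh.sum(axis=1, keepdims=True),    # (n_a, 1)
    }

rng = np.random.default_rng(0)
n_x, n_a, m = 3, 5, 10
xt, a_prev = rng.standard_normal((n_x, m)), rng.standard_normal((n_a, m))
Wax, Waa = rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da_next = rng.standard_normal((n_a, m))

a_next = cell_forward(xt, a_prev, Wax, Waa, ba)
grads = cell_backward(da_next, a_next, a_prev, xt, Wax, Waa)

# Central finite difference on one entry of Waa.
eps = 1e-6
Wp, Wm = Waa.copy(), Waa.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
loss = lambda W: np.sum(da_next * cell_forward(xt, a_prev, Wax, W, ba))
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
print(abs(numeric - grads["dWaa"][1, 2]))   # should be tiny
```

The transposes are what make the dimensions line up: for instance `dtanh @ xt.T` multiplies $(n_a, m)$ by $(m, n_x)$ to give the $(n_a, n_x)$ shape of $W_{ax}$.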

In [None]:
def rnn_cell_backward(da_next, cache):
    """
    Implements the backward pass for the RNN-cell (single time-step).

    Arguments:
    da_next -- Gradient of loss with respect to next hidden state
    cache -- tuple containing useful values (output of rnn_cell_forward())

    Returns:
    gradients -- python dictionary containing:
                        dxt -- Gradients of input data, of shape (n_x, m)
                        da_prev -- Gradients of previous hidden state, of shape (n_a, m)
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dba -- Gradients of bias vector, of shape (n_a, 1)
    """
    
    # Retrieve values from cache
    (a_next, a_prev, xt, parameters) = cache
    
    # Retrieve values from parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    ### START CODE HERE ###
    # compute the gradient of tanh with respect to a_next (≈1 line)
    dtanh = None

    # compute the gradient of the loss with respect to Wax (≈2 lines)
    dxt = None
    dWax = None

    # compute the gradient with respect to Waa (≈2 lines)
    da_prev = None
    dWaa = None

    # compute the gradient with respect to b (≈1 line)
    dba = None

    ### END CODE HERE ###
    
    # Store the gradients in a python dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    
    return gradients

In [None]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
b = np.random.randn(5,1)  # note: "b" is never used; "ba" below is whatever an earlier cell left in scope
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}

a_next, yt, cache = rnn_cell_forward(xt, a_prev, parameters)

da_next = np.random.randn(5,10)
gradients = rnn_cell_backward(da_next, cache)
print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
print("gradients[\"dba\"][4] =", gradients["dba"][4])
print("gradients[\"dba\"].shape =", gradients["dba"].shape)

**Expected Output**:

<table>
    <tr>
        <td>
            **gradients["dxt"][1][2]** =
        </td>
        <td>
           -0.460564103059
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dxt"].shape** =
        </td>
        <td>
           (3, 10)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da_prev"][2][3]** =
        </td>
        <td>
           0.0842968653807
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da_prev"].shape** =
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWax"][3][1]** =
        </td>
        <td>
           0.393081873922
        </td>
    </tr>
            <tr>
        <td>
            **gradients["dWax"].shape** =
        </td>
        <td>
           (5, 3)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWaa"][1][2]** = 
        </td>
        <td>
           -0.28483955787
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWaa"].shape** =
        </td>
        <td>
           (5, 5)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dba"][4]** = 
        </td>
        <td>
           [ 0.80517166]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dba"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
</table>

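#### Backward pass through the RNN

Computing the gradients of the cost with respect to $a^{\langle t \rangle}$ at every time-step $t$ is useful because it is what helps the gradient backpropagate to the previous RNN-cell. To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall $db_a$, $dW_{aa}$, $dW_{ax}$ and you store $dx$.

**Instructions**:

Implement the `rnn_backward` function. Initialize the return variables with zeros first, then loop through all the time steps, calling `rnn_cell_backward` at each time step and updating the other variables accordingly.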
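
The loop described above might look like the following sketch. It is self-contained, so it uses its own simplified forward and backward helpers (hidden state only, no prediction) rather than the functions from this notebook, and the names `forward` and `backward` are hypothetical. The key detail is that the gradient reaching step $t$ is the sum of the upstream `da[:, :, t]` and the gradient flowing back from step $t+1$.

```python
import numpy as np

def forward(x, a0, Wax, Waa, ba):
    n_a, m, T_x = a0.shape[0], x.shape[1], x.shape[2]
    a, a_prev = np.zeros((n_a, m, T_x)), a0
    for t in range(T_x):
        a_prev = np.tanh(Wax @ x[:, :, t] + Waa @ a_prev + ba)
        a[:, :, t] = a_prev
    return a

def backward(da, x, a, a0, Wax, Waa):
    n_a, m, T_x = da.shape
    dx = np.zeros_like(x)
    dWax, dWaa = np.zeros_like(Wax), np.zeros_like(Waa)
    dba = np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))
    for t in reversed(range(T_x)):
        a_prev = a[:, :, t - 1] if t > 0 else a0
        # Upstream gradient at step t plus the gradient coming back from step t+1.
        dtanh = (1 - a[:, :, t] ** 2) * (da[:, :, t] + da_prevt)
        dx[:, :, t] = Wax.T @ dtanh
        dWax += dtanh @ x[:, :, t].T
        dWaa += dtanh @ a_prev.T
        dba += dtanh.sum(axis=1, keepdims=True)
        da_prevt = Waa.T @ dtanh
    return {"dx": dx, "da0": da_prevt, "dWax": dWax, "dWaa": dWaa, "dba": dba}

rng = np.random.default_rng(1)
n_x, n_a, m, T_x = 3, 5, 10, 4
x, a0 = rng.standard_normal((n_x, m, T_x)), rng.standard_normal((n_a, m))
Wax, Waa = rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da = rng.standard_normal((n_a, m, T_x))

a = forward(x, a0, Wax, Waa, ba)
grads = backward(da, x, a, a0, Wax, Waa)

# Finite-difference check on one entry of Wax, using the loss sum(da * a).
eps = 1e-6
Wp, Wm = Wax.copy(), Wax.copy()
Wp[3, 1] += eps
Wm[3, 1] -= eps
numeric = (np.sum(da * forward(x, a0, Wp, Waa, ba))
           - np.sum(da * forward(x, a0, Wm, Waa, ba))) / (2 * eps)
print(abs(numeric - grads["dWax"][3, 1]))   # should be tiny
```

Because the parameters are shared across time-steps, their gradients are accumulated with `+=`, whereas `dx` is stored per step and `da0` is simply the last value of `da_prevt`.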

In [None]:
def rnn_backward(da, caches):
    """
    Implement the backward pass for a RNN over an entire sequence of input data.

    Arguments:
    da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
    caches -- tuple containing information from the forward pass (rnn_forward)
    
    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
                        dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
                        dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-array of shape (n_a, n_a)
                        dba -- Gradient w.r.t the bias, of shape (n_a, 1)
    """
        
    ### START CODE HERE ###
    
    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
    (caches, x) = None
    (a1, a0, x1, parameters) = None
    
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = None
    n_x, m = None
    
    # initialize the gradients with the right sizes (≈6 lines)
    dx = None
    dWax = None
    dWaa = None
    dba = None
    da0 = None
    da_prevt = None
    
    # Loop through all the time steps
    for t in reversed(range(None)):
        # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
        gradients = None
        # Retrieve derivatives from gradients (≈ 1 line)
        dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
        # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
        dx[:, :, t] = None
        dWax += None
        dWaa += None
        dba += None
        
    # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
    da0 = None
    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
    
    return gradients

In [None]:
np.random.seed(1)
x = np.random.randn(3,10,4)
a0 = np.random.randn(5,10)
Wax = np.random.randn(5,3)
Waa = np.random.randn(5,5)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "ba": ba, "by": by}
a, y, caches = rnn_forward(x, a0, parameters)
da = np.random.randn(5, 10, 4)
gradients = rnn_backward(da, caches)

print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
print("gradients[\"dx\"].shape =", gradients["dx"].shape)
print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
print("gradients[\"da0\"].shape =", gradients["da0"].shape)
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWax\"].shape =", gradients["dWax"].shape)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWaa\"].shape =", gradients["dWaa"].shape)
print("gradients[\"dba\"][4] =", gradients["dba"][4])
print("gradients[\"dba\"].shape =", gradients["dba"].shape)

**Expected output**:

<table>
    <tr>
        <td>
            **gradients["dx"][1][2]** =
        </td>
        <td>
           [-2.07101689 -0.59255627  0.02466855  0.01483317]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dx"].shape** =
        </td>
        <td>
           (3, 10, 4)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da0"][2][3]** =
        </td>
        <td>
           -0.314942375127
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da0"].shape** =
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
         <tr>
        <td>
            **gradients["dWax"][3][1]** =
        </td>
        <td>
           11.2641044965
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWax"].shape** =
        </td>
        <td>
           (5, 3)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWaa"][1][2]** = 
        </td>
        <td>
           2.30333312658
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWaa"].shape** =
        </td>
        <td>
           (5, 5)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dba"][4]** = 
        </td>
        <td>
           [-0.74747722]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dba"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
</table>
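
The accumulation loop inside `rnn_backward` can be sketched as follows. This is a minimal, self-contained illustration rather than the graded solution. `cell_forward`, `cell_backward`, `forward` and `rnn_backward_sketch` are names introduced here; the first two are compact stand-ins for `rnn_cell_forward` and `rnn_cell_backward` (without the softmax output), and `caches` is assumed to be a plain list with one `(a_next, a_prev, xt, parameters)` tuple per time step.

```python
import numpy as np

def cell_forward(xt, a_prev, p):
    # Stand-in for rnn_cell_forward (hidden state only, no softmax output).
    a_next = np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])
    return a_next, (a_next, a_prev, xt, p)

def cell_backward(da_next, cache):
    # Stand-in for rnn_cell_backward: backprop through one tanh step.
    a_next, a_prev, xt, p = cache
    dz = (1 - a_next ** 2) * da_next
    return {"dxt": p["Wax"].T @ dz, "da_prev": p["Waa"].T @ dz,
            "dWax": dz @ xt.T, "dWaa": dz @ a_prev.T,
            "dba": dz.sum(axis=1, keepdims=True)}

def forward(x, a0, p):
    # Run the cell over all time steps and keep one cache per step.
    a_t, caches = a0, []
    a = np.zeros((a0.shape[0], a0.shape[1], x.shape[2]))
    for t in range(x.shape[2]):
        a_t, cache = cell_forward(x[:, :, t], a_t, p)
        a[:, :, t] = a_t
        caches.append(cache)
    return a, caches

def rnn_backward_sketch(da, caches):
    n_a, m, T_x = da.shape
    n_x = caches[0][2].shape[0]
    dx = np.zeros((n_x, m, T_x))
    dWax, dWaa, dba = np.zeros((n_a, n_x)), np.zeros((n_a, n_a)), np.zeros((n_a, 1))
    da_prevt = np.zeros((n_a, m))
    for t in reversed(range(T_x)):
        # Gradient reaching a<t>: direct term plus the term carried back from step t+1.
        g = cell_backward(da[:, :, t] + da_prevt, caches[t])
        da_prevt = g["da_prev"]
        dx[:, :, t] = g["dxt"]      # stored per time step
        dWax += g["dWax"]           # shared parameters: summed over time steps
        dWaa += g["dWaa"]
        dba += g["dba"]
    return {"dx": dx, "da0": da_prevt, "dWax": dWax, "dWaa": dWaa, "dba": dba}

rng = np.random.default_rng(0)
n_x, n_a, m, T_x = 3, 5, 4, 6
p = {"Wax": rng.standard_normal((n_a, n_x)),
     "Waa": rng.standard_normal((n_a, n_a)),
     "ba": rng.standard_normal((n_a, 1))}
x = rng.standard_normal((n_x, m, T_x))
a0 = rng.standard_normal((n_a, m))
da = rng.standard_normal((n_a, m, T_x))

a, caches = forward(x, a0, p)
grads = rnn_backward_sketch(da, caches)
print(grads["dx"].shape, grads["da0"].shape, grads["dWax"].shape)
```

Because the parameters are shared across time steps, their gradients are sums over `t`, while `dx` keeps one slice per step. The random data above does not use the notebook's seed, so the printed values are not comparable to the expected-output table.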

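## 3.2 - LSTM backward pass

### 3.2.1 One step backward

The LSTM backward pass is considerably more complicated than the forward pass. We provide all the equations for the LSTM backward pass below. (If you enjoy calculus exercises, feel free to try deriving them from scratch yourself.)

### 3.2.2 Gate derivatives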

Here $*$ denotes element-wise multiplication. Each quantity is the gradient with respect to the input of the gate's activation function, which is why the sigmoid or tanh derivative multiplies the whole bracket. In the code, $\Gamma_u$ corresponds to `it`.

$$d \Gamma_o^{\langle t \rangle} = da_{next}*\tanh(c_{next}) * \Gamma_o^{\langle t \rangle}*(1-\Gamma_o^{\langle t \rangle})\tag{7}$$

$$d\tilde c^{\langle t \rangle} = \left(dc_{next}*\Gamma_u^{\langle t \rangle}+ \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * \Gamma_u^{\langle t \rangle} * da_{next}\right) * \left(1-(\tilde c^{\langle t \rangle})^2\right) \tag{8}$$

$$d\Gamma_u^{\langle t \rangle} = \left(dc_{next}*\tilde c^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * \tilde c^{\langle t \rangle} * da_{next}\right)*\Gamma_u^{\langle t \rangle}*(1-\Gamma_u^{\langle t \rangle})\tag{9}$$

$$d\Gamma_f^{\langle t \rangle} = \left(dc_{next}* c_{prev} + \Gamma_o^{\langle t \rangle} * (1-\tanh(c_{next})^2) * c_{prev} * da_{next}\right)*\Gamma_f^{\langle t \rangle}*(1-\Gamma_f^{\langle t \rangle})\tag{10}$$
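
The gate derivatives can be sketched in numpy as below. This is an illustration under stated assumptions, not the graded solution: the activation derivative is applied to the whole upstream gradient, the variable names `ft`, `it`, `cct`, `ot` follow the cache used by `lstm_cell_backward` (with `it` playing the role of $\Gamma_u$), and `dc_total`, `lstm_gate_derivatives`, `forward` and the pre-activations `zf, zi, zc, zo` are names introduced here for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gate_derivatives(da_next, dc_next, c_next, c_prev, ft, it, cct, ot):
    # Total gradient reaching c_next: the direct path plus the path
    # through a_next = ot * tanh(c_next).
    dc_total = dc_next + ot * (1 - np.tanh(c_next) ** 2) * da_next
    dot = da_next * np.tanh(c_next) * ot * (1 - ot)   # output gate
    dcct = dc_total * it * (1 - cct ** 2)             # candidate value (tanh)
    dit = dc_total * cct * it * (1 - it)              # update gate
    dft = dc_total * c_prev * ft * (1 - ft)           # forget gate
    return dot, dcct, dit, dft

def forward(zf, zi, zc, zo, c_prev):
    # Toy forward step driven directly by gate pre-activations.
    ft, it, cct, ot = sigmoid(zf), sigmoid(zi), np.tanh(zc), sigmoid(zo)
    c_next = ft * c_prev + it * cct
    a_next = ot * np.tanh(c_next)
    return a_next, c_next, ft, it, cct, ot

rng = np.random.default_rng(1)
n_a, m = 5, 10
zf, zi, zc, zo = (rng.standard_normal((n_a, m)) for _ in range(4))
c_prev = rng.standard_normal((n_a, m))
da_next = rng.standard_normal((n_a, m))
dc_next = rng.standard_normal((n_a, m))

a_next, c_next, ft, it, cct, ot = forward(zf, zi, zc, zo, c_prev)
dot, dcct, dit, dft = lstm_gate_derivatives(
    da_next, dc_next, c_next, c_prev, ft, it, cct, ot)
print(dot.shape, dcct.shape, dit.shape, dft.shape)
```

Factoring out `dc_total` keeps the three cell-state gradients consistent with one another, since each differs only in the local factor that follows it.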


### 3.2.3 Parameter derivatives

$$ dW_f = d\Gamma_f^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{11} $$
$$ dW_u = d\Gamma_u^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{12} $$
$$ dW_c = d\tilde c^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{13} $$
$$ dW_o = d\Gamma_o^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{14}$$
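
In equations (11) to (14) the product is a matrix product between a gate gradient of shape $(n_a, m)$ and the transpose of the stacked matrix $[a_{prev}; x_t]$ of shape $(n_a + n_x, m)$. A small sketch, where `concat` is a name introduced for illustration and `dft` is random data standing in for a real gate gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 10
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))
dft = rng.standard_normal((n_a, m))   # placeholder for d(Gamma_f)

# Stack a_prev on top of xt, as in the forward pass.
concat = np.concatenate((a_prev, xt), axis=0)   # shape (n_a + n_x, m)

# Matrix product: sums the per-example outer products over the m examples.
dWf = dft @ concat.T                            # shape (n_a, n_a + n_x)
print(dWf.shape)  # (5, 8)
```

The same pattern applies to `dWi`, `dWc` and `dWo` with their respective gate gradients.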


To compute $db_f, db_u, db_c, db_o$, you just need to sum $d\Gamma_f^{\langle t \rangle}, d\Gamma_u^{\langle t \rangle}, d\tilde c^{\langle t \rangle}, d\Gamma_o^{\langle t \rangle}$ respectively across the horizontal axis (axis=1). Note that you should pass the `keepdims=True` option so that each result keeps the shape $(n_a, 1)$.
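
A toy illustration of the bias gradient, assuming a made-up gate gradient `dft` of shape $(n_a, m) = (3, 4)$. It shows why the keep-dimensions option matters: without it the result is a 1-D array instead of a column vector.

```python
import numpy as np

dft = np.arange(12, dtype=float).reshape(3, 4)   # toy gate gradient, shape (3, 4)

dbf = np.sum(dft, axis=1, keepdims=True)   # column vector, shape (3, 1)
dbf_flat = np.sum(dft, axis=1)             # 1-D array, shape (3,)

print(dbf.shape, dbf_flat.shape)  # (3, 1) (3,)
print(dbf.ravel())                # [ 6. 22. 38.]
```

Keeping the `(n_a, 1)` shape means the gradient has the same shape as the bias it updates.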

Finally, you will compute the derivatives with respect to the previous hidden state, the previous memory state, and the input.

$$ da_{prev} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle} \tag{15}$$

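Here, the weights in equation 15 are the first $n_a$ columns of each matrix (i.e. $W_f = W_f[:, :n_a]$ etc.), the part that multiplies $a_{prev}$.

$$ dc_{prev} = dc_{next}*\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh(c_{next})^2)*\Gamma_f^{\langle t \rangle}*da_{next} \tag{16}$$
$$ dx^{\langle t \rangle} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle}\tag{17} $$

In equation 17, the weights are the columns from $n_a$ to the end (i.e. $W_f = W_f[:, n_a:]$ etc.), the part that multiplies $x_t$.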
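
Each weight matrix has shape $(n_a, n_a + n_x)$ and multiplies the stacked input $[a_{prev}; x_t]$, so its first $n_a$ columns act on $a_{prev}$ and its remaining columns act on $x_t$. The sketch below checks this split for a single gate. `d_concat`, `da_prev_part` and `dxt_part` are names introduced here, and `dft` is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 10
Wf = rng.standard_normal((n_a, n_a + n_x))
dft = rng.standard_normal((n_a, m))        # placeholder for d(Gamma_f)

# Gradient w.r.t. the whole stacked input [a_prev; xt].
d_concat = Wf.T @ dft                      # shape (n_a + n_x, m)

# Same result obtained by slicing the weight matrix by columns.
da_prev_part = Wf[:, :n_a].T @ dft         # contribution to da_prev, shape (n_a, m)
dxt_part = Wf[:, n_a:].T @ dft             # contribution to dxt, shape (n_x, m)

print(np.allclose(d_concat[:n_a], da_prev_part))  # True
print(np.allclose(d_concat[n_a:], dxt_part))      # True
```

The full `da_prev` and `dxt` add the analogous contributions from the other three gates.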

**Exercise:** Implement `lstm_cell_backward` by implementing equations $7-17$ above. Good luck! :)

In [None]:
def lstm_cell_backward(da_next, dc_next, cache):
    """
    Implement the backward pass for the LSTM-cell (single time-step).

    Arguments:
    da_next -- Gradients of next hidden state, of shape (n_a, m)
    dc_next -- Gradients of next cell state, of shape (n_a, m)
    cache -- cache storing information from the forward pass

    Returns:
    gradients -- python dictionary containing:
                        dxt -- Gradient of input data at time-step t, of shape (n_x, m)
                        da_prev -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                        dc_prev -- Gradient w.r.t. the previous memory state, of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
    """

    # Retrieve information from "cache"
    (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache
    
    ### START CODE HERE ###
    # Retrieve dimensions from xt's and a_next's shape (≈2 lines)
    n_x, m = None
    n_a, m = None
    
    # Compute the gate-related derivatives using equations (7) to (10) (≈4 lines)
    dot = None
    dcct = None
    dit = None
    dft = None

    # Compute parameters related derivatives. Use equations (11)-(14) (≈8 lines)
    dWf = None
    dWi = None
    dWc = None
    dWo = None
    dbf = None
    dbi = None
    dbc = None
    dbo = None

    # Compute derivatives w.r.t previous hidden state, previous memory state and input. Use equations (15)-(17). (≈3 lines)
    da_prev = None
    dc_prev = None
    dxt = None
    ### END CODE HERE ###
    
    # Save gradients in dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}

    return gradients

In [None]:
np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
c_prev = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)
Wy = np.random.randn(2,5)
by = np.random.randn(2,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a_next, c_next, yt, cache = lstm_cell_forward(xt, a_prev, c_prev, parameters)

da_next = np.random.randn(5,10)
dc_next = np.random.randn(5,10)
gradients = lstm_cell_backward(da_next, dc_next, cache)
print("gradients[\"dxt\"][1][2] =", gradients["dxt"][1][2])
print("gradients[\"dxt\"].shape =", gradients["dxt"].shape)
print("gradients[\"da_prev\"][2][3] =", gradients["da_prev"][2][3])
print("gradients[\"da_prev\"].shape =", gradients["da_prev"].shape)
print("gradients[\"dc_prev\"][2][3] =", gradients["dc_prev"][2][3])
print("gradients[\"dc_prev\"].shape =", gradients["dc_prev"].shape)
print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)

**Expected output**:

<table>
    <tr>
        <td>
            **gradients["dxt"][1][2]** =
        </td>
        <td>
           3.23055911511
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dxt"].shape** =
        </td>
        <td>
           (3, 10)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da_prev"][2][3]** =
        </td>
        <td>
           -0.0639621419711
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da_prev"].shape** =
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
         <tr>
        <td>
            **gradients["dc_prev"][2][3]** =
        </td>
        <td>
           0.797522038797
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dc_prev"].shape** =
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWf"][3][1]** = 
        </td>
        <td>
           -0.147954838164
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWf"].shape** =
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWi"][1][2]** = 
        </td>
        <td>
           1.05749805523
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWi"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dWc"][3][1]** = 
        </td>
        <td>
           2.30456216369
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWc"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dWo"][1][2]** = 
        </td>
        <td>
           0.331311595289
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWo"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dbf"][4]** = 
        </td>
        <td>
           [ 0.18864637]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbf"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dbi"][4]** = 
        </td>
        <td>
           [-0.40142491]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbi"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbc"][4]** = 
        </td>
        <td>
           [ 0.25587763]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbc"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbo"][4]** = 
        </td>
        <td>
           [ 0.13893342]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbo"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
</table>

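## 3.3 - Backward pass through the LSTM RNN

This part is very similar to the `rnn_backward` function you implemented above. You will first create variables with the same dimensions as the return variables. You will then iterate over all the time steps, starting from the end, and call the one-step function you implemented for the LSTM at each iteration. You will then update the parameter gradients by summing them across time steps. Finally, return a dictionary with the new gradients.

**Instructions**: Implement the `lstm_backward` function. Create a for loop that starts at $T_x$ and goes backward. At each step, call `lstm_cell_backward` and update the running gradients by adding the new gradients to them. Note that `dxt` is not accumulated but stored.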
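
The structure of this loop can be sketched end to end as below. This is a self-contained illustration under stated assumptions, not the graded solution. `cell_forward` and `cell_backward` are compact stand-ins for `lstm_cell_forward` and `lstm_cell_backward` (without the softmax prediction), `forward` and `lstm_backward_sketch` are names introduced here, `caches` is assumed to be a plain list of per-step caches, and the initial memory state is taken to be zero. The sketch adds the gradient carried back from step $t+1$ to `da[:, :, t]`, as `rnn_backward` does. It has not been checked against the expected values listed for this exercise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cell_forward(xt, a_prev, c_prev, p):
    # Stand-in for lstm_cell_forward, without the softmax prediction.
    concat = np.concatenate((a_prev, xt), axis=0)
    ft = sigmoid(p["Wf"] @ concat + p["bf"])
    it = sigmoid(p["Wi"] @ concat + p["bi"])
    cct = np.tanh(p["Wc"] @ concat + p["bc"])
    ot = sigmoid(p["Wo"] @ concat + p["bo"])
    c_next = ft * c_prev + it * cct
    a_next = ot * np.tanh(c_next)
    return a_next, c_next, (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, p)

def cell_backward(da_next, dc_next, cache):
    # Stand-in for lstm_cell_backward.
    a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, p = cache
    n_a = a_next.shape[0]
    dc_total = dc_next + ot * (1 - np.tanh(c_next) ** 2) * da_next
    dz = {"f": dc_total * c_prev * ft * (1 - ft),
          "i": dc_total * cct * it * (1 - it),
          "c": dc_total * it * (1 - cct ** 2),
          "o": da_next * np.tanh(c_next) * ot * (1 - ot)}
    concat = np.concatenate((a_prev, xt), axis=0)
    g, d_concat = {}, np.zeros_like(concat)
    for k in "fico":
        g["dW" + k] = dz[k] @ concat.T
        g["db" + k] = dz[k].sum(axis=1, keepdims=True)
        d_concat += p["W" + k].T @ dz[k]
    g["da_prev"], g["dxt"] = d_concat[:n_a], d_concat[n_a:]
    g["dc_prev"] = dc_total * ft
    return g

def forward(x, a0, p):
    a_t, c_t = a0, np.zeros_like(a0)
    a, caches = np.zeros((a0.shape[0], a0.shape[1], x.shape[2])), []
    for t in range(x.shape[2]):
        a_t, c_t, cache = cell_forward(x[:, :, t], a_t, c_t, p)
        a[:, :, t] = a_t
        caches.append(cache)
    return a, caches

def lstm_backward_sketch(da, caches):
    n_a, m, T_x = da.shape
    n_x = caches[0][8].shape[0]
    grads = {"dx": np.zeros((n_x, m, T_x))}
    da_prevt, dc_prevt = np.zeros((n_a, m)), np.zeros((n_a, m))
    for t in reversed(range(T_x)):
        g = cell_backward(da[:, :, t] + da_prevt, dc_prevt, caches[t])
        da_prevt, dc_prevt = g["da_prev"], g["dc_prev"]
        grads["dx"][:, :, t] = g["dxt"]      # stored per time step
        for k in "fico":                     # summed across time steps
            grads["dW" + k] = grads.get("dW" + k, 0) + g["dW" + k]
            grads["db" + k] = grads.get("db" + k, 0) + g["db" + k]
    grads["da0"] = da_prevt
    return grads

rng = np.random.default_rng(0)
n_x, n_a, m, T_x = 3, 5, 4, 6
p = {}
for k in "fico":
    p["W" + k] = rng.standard_normal((n_a, n_a + n_x))
    p["b" + k] = rng.standard_normal((n_a, 1))
x = rng.standard_normal((n_x, m, T_x))
a0 = rng.standard_normal((n_a, m))
da = rng.standard_normal((n_a, m, T_x))

a, caches = forward(x, a0, p)
grads = lstm_backward_sketch(da, caches)
print(grads["dx"].shape, grads["dWf"].shape, grads["da0"].shape)
```

Two gradients travel backward between steps here: `da_prevt` through the hidden state and `dc_prevt` through the memory state. Only the per-step input gradient is written into its own slice of `dx`.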

In [None]:
def lstm_backward(da, caches):
    
    """
    Implement the backward pass for the RNN with LSTM-cell (over a whole sequence).

    Arguments:
    da -- Gradients w.r.t the hidden states, numpy-array of shape (n_a, m, T_x)
    caches -- cache storing information from the forward pass (lstm_forward)

    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient of inputs, of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t. the initial hidden state, numpy array of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
    """

    # Retrieve values from the first cache (t=1) of caches.
    (caches, x) = caches
    (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]
    
    ### START CODE HERE ###
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = None
    n_x, m = None
    
    # initialize the gradients with the right sizes (≈12 lines)
    dx = None
    da0 = None
    da_prevt = None
    dc_prevt = None
    dWf = None
    dWi = None
    dWc = None
    dWo = None
    dbf = None
    dbi = None
    dbc = None
    dbo = None
    
    # loop back over the whole sequence
    for t in reversed(range(None)):
        # Compute all gradients using lstm_cell_backward
        gradients = None
        # Store or add the gradient to the parameters' previous step's gradient
        dx[:,:,t] = None
        dWf = None
        dWi = None
        dWc = None
        dWo = None
        dbf = None
        dbi = None
        dbc = None
        dbo = None
    # Set the first activation's gradient to the backpropagated gradient da_prev.
    da0 = None
    
    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
    
    return gradients

In [None]:
np.random.seed(1)
x = np.random.randn(3,10,7)
a0 = np.random.randn(5,10)
Wf = np.random.randn(5, 5+3)
bf = np.random.randn(5,1)
Wi = np.random.randn(5, 5+3)
bi = np.random.randn(5,1)
Wo = np.random.randn(5, 5+3)
bo = np.random.randn(5,1)
Wc = np.random.randn(5, 5+3)
bc = np.random.randn(5,1)

parameters = {"Wf": Wf, "Wi": Wi, "Wo": Wo, "Wc": Wc, "Wy": Wy, "bf": bf, "bi": bi, "bo": bo, "bc": bc, "by": by}

a, y, c, caches = lstm_forward(x, a0, parameters)

da = np.random.randn(5, 10, 4)
gradients = lstm_backward(da, caches)

print("gradients[\"dx\"][1][2] =", gradients["dx"][1][2])
print("gradients[\"dx\"].shape =", gradients["dx"].shape)
print("gradients[\"da0\"][2][3] =", gradients["da0"][2][3])
print("gradients[\"da0\"].shape =", gradients["da0"].shape)
print("gradients[\"dWf\"][3][1] =", gradients["dWf"][3][1])
print("gradients[\"dWf\"].shape =", gradients["dWf"].shape)
print("gradients[\"dWi\"][1][2] =", gradients["dWi"][1][2])
print("gradients[\"dWi\"].shape =", gradients["dWi"].shape)
print("gradients[\"dWc\"][3][1] =", gradients["dWc"][3][1])
print("gradients[\"dWc\"].shape =", gradients["dWc"].shape)
print("gradients[\"dWo\"][1][2] =", gradients["dWo"][1][2])
print("gradients[\"dWo\"].shape =", gradients["dWo"].shape)
print("gradients[\"dbf\"][4] =", gradients["dbf"][4])
print("gradients[\"dbf\"].shape =", gradients["dbf"].shape)
print("gradients[\"dbi\"][4] =", gradients["dbi"][4])
print("gradients[\"dbi\"].shape =", gradients["dbi"].shape)
print("gradients[\"dbc\"][4] =", gradients["dbc"][4])
print("gradients[\"dbc\"].shape =", gradients["dbc"].shape)
print("gradients[\"dbo\"][4] =", gradients["dbo"][4])
print("gradients[\"dbo\"].shape =", gradients["dbo"].shape)

**Expected output**:

<table>
    <tr>
        <td>
            **gradients["dx"][1][2]** =
        </td>
        <td>
           [-0.00173313  0.08287442 -0.30545663 -0.43281115]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dx"].shape** =
        </td>
        <td>
           (3, 10, 4)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da0"][2][3]** =
        </td>
        <td>
           -0.095911501954
        </td>
    </tr>
        <tr>
        <td>
            **gradients["da0"].shape** =
        </td>
        <td>
           (5, 10)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWf"][3][1]** = 
        </td>
        <td>
           -0.0698198561274
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWf"].shape** =
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWi"][1][2]** = 
        </td>
        <td>
           0.102371820249
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWi"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dWc"][3][1]** = 
        </td>
        <td>
           -0.0624983794927
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWc"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dWo"][1][2]** = 
        </td>
        <td>
           0.0484389131444
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dWo"].shape** = 
        </td>
        <td>
           (5, 8)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dbf"][4]** = 
        </td>
        <td>
           [-0.0565788]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbf"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
    <tr>
        <td>
            **gradients["dbi"][4]** = 
        </td>
        <td>
           [-0.06997391]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbi"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbc"][4]** = 
        </td>
        <td>
           [-0.27441821]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbc"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbo"][4]** = 
        </td>
        <td>
           [ 0.16532821]
        </td>
    </tr>
        <tr>
        <td>
            **gradients["dbo"].shape** = 
        </td>
        <td>
           (5, 1)
        </td>
    </tr>
</table>

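### Congratulations!

Congratulations on completing this assignment. You now understand how recurrent neural networks work!

Let's move on to the next exercise, where you will use an RNN to build a character-level language model.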