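# Building your Deep Neural Network: Step by Step

Welcome to your Week 4 assignment (Part 1)! Previously, you trained a 2-layer neural network with a single hidden layer. This week, you will build a deep neural network with as many layers as you want!

- In this notebook, you will implement all the functions required to build a deep neural network.

- In the next assignment, you will use these functions to build a deep neural network for image classification.

**By the end of this assignment, you will be able to:**

- Use non-linear units such as ReLU to improve your model

- Build a deeper neural network (with more than one hidden layer)

- Implement an easy-to-use neural network class

**Notation**:

- Superscript $[l]$ denotes a quantity associated with the $l$-th layer.

    - Example: $a^{[L]}$ is the activation of the $L$-th layer. $W^{[L]}$ and $b^{[L]}$ are the parameters of the $L$-th layer.

- Superscript $(i)$ denotes a quantity associated with the $i$-th example.

    - Example: $x^{(i)}$ is the $i$-th training example.

- Subscript $i$ denotes the $i$-th entry of a vector.

    - Example: $a^{[l]}_i$ denotes the $i$-th entry of the activations of layer $l$.

Let's get started!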

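## Before Submitting to the Autograder, Check the Following:

1. You have not added any "extra" `print` statements to the assignment.

2. You have not added any "extra" code cells to the assignment.

3. You have not changed any function parameters.

4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from using global variables and use local variables instead.

5. You are not changing the assignment code where it is not required, for example by creating "extra" variables.

If you did any of the above, you will get an error such as "Grader Error: Grader feedback not found" (or a similarly unexpected error) after submitting your assignment. Before asking for help or debugging errors in your assignment, check for these first. If this is the case and you don't remember the changes you made, you can get a fresh copy of the assignment by following these [instructions](https://www.coursera.org/learn/neural-networks-deep-learning/supplement/iLwon/h-ow-to-refresh-your-workspace).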

## Table of Contents
- [1 - Packages](#1)
- [2 - Outline](#2)
- [3 - Initialization](#3)
    - [3.1 - 2-layer Neural Network](#3-1)
        - [Exercise 1 - initialize_parameters](#ex-1)
    - [3.2 - L-layer Neural Network](#3-2)
        - [Exercise 2 - initialize_parameters_deep](#ex-2)
- [4 - Forward Propagation Module](#4)
    - [4.1 - Linear Forward](#4-1)
        - [Exercise 3 - linear_forward](#ex-3)
    - [4.2 - Linear-Activation Forward](#4-2)
        - [Exercise 4 - linear_activation_forward](#ex-4)
    - [4.3 - L-Layer Model](#4-3)
        - [Exercise 5 - L_model_forward](#ex-5)
- [5 - Cost Function](#5)
    - [Exercise 6 - compute_cost](#ex-6)
- [6 - Backward Propagation Module](#6)
    - [6.1 - Linear Backward](#6-1)
        - [Exercise 7 - linear_backward](#ex-7)
    - [6.2 - Linear-Activation Backward](#6-2)
        - [Exercise 8 - linear_activation_backward](#ex-8)
    - [6.3 - L-Model Backward](#6-3)
        - [Exercise 9 - L_model_backward](#ex-9)
    - [6.4 - Update Parameters](#6-4)
        - [Exercise 10 - update_parameters](#ex-10)

<a name='1'></a>
## 1 - Packages

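First, import all the packages you'll need during this assignment.

- [numpy](http://www.numpy.org) is the main package for scientific computing with Python.

- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.

- dnn_utils provides some necessary functions for this notebook.

- testCases provides some test cases to assess the correctness of your functions.

- np.random.seed(1) is used to keep all the random function calls consistent. It helps grade your work. Please don't change the seed!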

In [None]:
import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases import *
from dnn_utils import sigmoid, sigmoid_backward, relu, relu_backward
from public_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)

<a name='2'></a>
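## 2 - Outline

To build your neural network, you'll implement several "helper functions." The next assignment will use these helper functions to build a two-layer neural network and an L-layer neural network.

Each small helper function comes with detailed instructions that walk you through the necessary steps. Here's an outline of the steps in this assignment:

- Initialize the parameters for a two-layer network and for an $L$-layer neural network

- Implement the forward propagation module (shown in purple in the figure below)

    - Complete the LINEAR part of a layer's forward propagation step (resulting in $Z^{[l]}$).

    - The ACTIVATION function is provided for you (relu/sigmoid).

    - Combine the previous two steps into a new [LINEAR->ACTIVATION] forward function.

    - Stack the [LINEAR->RELU] forward function L-1 times (for layers 1 through L-1) and add a [LINEAR->SIGMOID] at the end (for the final layer $L$). This gives you a new L_model_forward function.

- Compute the loss

- Implement the backward propagation module (shown in red in the figure below)

    - Complete the LINEAR part of a layer's backward propagation step.

    - The gradient of the ACTIVATION function is provided for you (relu_backward/sigmoid_backward).

    - Combine the previous two steps into a new [LINEAR->ACTIVATION] backward function.

    - Stack [LINEAR->RELU] backward L-1 times and add [LINEAR->SIGMOID] backward in a new L_model_backward function.

- Finally, update the parameters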

<img src="images/final outline.png" style="width:800px;height:500px;">

<caption><center><b>Figure 1</b></center></caption><br>


**Note**:

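For every forward function, there is a corresponding backward function. This is why, at every step of your forward module, you will store some values in a cache. These cached values are what the backward functions need to compute gradients.

In the backpropagation module, you can then use the cache to calculate the gradients. Don't worry, this assignment will show you exactly how to carry out each of these steps!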
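
As a minimal illustration of this forward/backward pairing, the sketch below uses a hypothetical pair of functions, `square_forward` and `square_backward` (neither is part of this assignment). The forward step returns its output together with a cache, and the backward step reads that cache to compute the gradient.

```python
import numpy as np

def square_forward(x):
    """Toy forward step (hypothetical): returns x**2 and a cache holding the input."""
    out = x ** 2
    cache = x
    return out, cache

def square_backward(dout, cache):
    """Matching backward step: d(x**2)/dx = 2x, computed from the cached input."""
    x = cache
    return dout * 2 * x

x_demo = np.array([1.0, -3.0])
out_demo, cache_demo = square_forward(x_demo)
dx_demo = square_backward(np.ones_like(out_demo), cache_demo)

print(out_demo)  # [1. 9.]
print(dx_demo)   # [ 2. -6.]
```

The helper functions in this notebook follow the same pattern, except that the cached values are `A_prev`, `W`, `b`, and `Z`.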

<a name='3'></a>
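## 3 - Initialization

You will write two helper functions to initialize the parameters for your model. The first function will be used to initialize the parameters of a two-layer model. The second one generalizes this initialization process to $L$ layers.

<a name='3-1'></a>

### 3.1 - 2-layer Neural Network

<a name='ex-1'></a>

### Exercise 1 - initialize_parameters

Create and initialize the parameters of the 2-layer neural network.

**Instructions**:

- The model's structure is: *LINEAR -> RELU -> LINEAR -> SIGMOID*.

- Use this random initialization for the weight matrices: `np.random.randn(d0, d1, ..., dn) * 0.01` with the correct shape. See the documentation for [np.random.randn](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html).

- Use zero initialization for the biases: `np.zeros(shape)`. See the documentation for [np.zeros](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html).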

In [None]:
# GRADED FUNCTION: initialize_parameters

def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    
    np.random.seed(1)
    
    #(≈ 4 lines of code)
    # W1 = ...
    # b1 = ...
    # W2 = ...
    # b2 = ...
    # YOUR CODE STARTS HERE
    W1 = np.random.randn(n_h,n_x)*0.01
    b1 = np.zeros((n_h,1))
    W2 = np.random.randn(n_y,n_h)*0.01
    b2 = np.zeros((n_y,1))
    
    # YOUR CODE ENDS HERE
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters    

In [None]:
parameters = initialize_parameters(3,2,1)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

initialize_parameters_test(initialize_parameters)

***Expected output***
```
W1 = [[ 0.01624345 -0.00611756 -0.00528172]
 [-0.01072969  0.00865408 -0.02301539]]
b1 = [[0.]
 [0.]]
W2 = [[ 0.01744812 -0.00761207]]
b2 = [[0.]]
```

<a name='3-2'></a>
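### 3.2 - L-layer Neural Network

The initialization for a deeper L-layer neural network is more complicated because there are many more weight matrices and bias vectors. When completing the `initialize_parameters_deep` function, you should make sure that your dimensions match between each layer. Recall that $n^{[l]}$ is the number of units in layer $l$. For example, if the size of your input $X$ is $(12288, 209)$ (with $m=209$ examples), then: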

<table style="width:100%">
    <tr>
        <td>  </td> 
        <td> <b>Shape of W</b> </td> 
        <td> <b>Shape of b</b>  </td> 
        <td> <b>Linear computation</b> </td>
        <td> <b>Shape of Activation</b> </td> 
    </tr>
    <tr>
        <td> <b>Layer 1</b> </td> 
        <td> $(n^{[1]},12288)$ </td> 
        <td> $(n^{[1]},1)$ </td> 
        <td> $Z^{[1]} = W^{[1]}  X + b^{[1]} $ </td> 
        <td> $(n^{[1]},209)$ </td> 
    </tr>
    <tr>
        <td> <b>Layer 2</b> </td> 
        <td> $(n^{[2]}, n^{[1]})$  </td> 
        <td> $(n^{[2]},1)$ </td> 
        <td>$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ </td> 
        <td> $(n^{[2]}, 209)$ </td> 
    </tr>
    <tr>
        <td> $\vdots$ </td> 
        <td> $\vdots$  </td> 
        <td> $\vdots$  </td> 
        <td> $\vdots$</td> 
        <td> $\vdots$  </td> 
    </tr>
    <tr>
        <td> <b>Layer L-1</b> </td> 
        <td> $(n^{[L-1]}, n^{[L-2]})$ </td> 
        <td> $(n^{[L-1]}, 1)$  </td> 
        <td>$Z^{[L-1]} =  W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ </td> 
        <td> $(n^{[L-1]}, 209)$ </td> 
    </tr>
    <tr>
        <td> <b>Layer L</b> </td> 
        <td> $(n^{[L]}, n^{[L-1]})$ </td> 
        <td> $(n^{[L]}, 1)$ </td>
        <td> $Z^{[L]} =  W^{[L]} A^{[L-1]} + b^{[L]}$</td>
        <td> $(n^{[L]}, 209)$  </td> 
    </tr>
</table>

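Remember that when you compute $W X + b$ in Python, NumPy carries out broadcasting: the column vector $b$ is added to every column of $WX$. For example, if: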

$$ W = \begin{bmatrix}
    w_{00}  & w_{01} & w_{02} \\
    w_{10}  & w_{11} & w_{12} \\
    w_{20}  & w_{21} & w_{22} 
\end{bmatrix}\;\;\; X = \begin{bmatrix}
    x_{00}  & x_{01} & x_{02} \\
    x_{10}  & x_{11} & x_{12} \\
    x_{20}  & x_{21} & x_{22} 
\end{bmatrix} \;\;\; b =\begin{bmatrix}
    b_0  \\
    b_1  \\
    b_2
\end{bmatrix}\tag{2}$$

Then $WX + b$ will be:

$$ WX + b = \begin{bmatrix}
    (w_{00}x_{00} + w_{01}x_{10} + w_{02}x_{20}) + b_0 & (w_{00}x_{01} + w_{01}x_{11} + w_{02}x_{21}) + b_0 & \cdots \\
    (w_{10}x_{00} + w_{11}x_{10} + w_{12}x_{20}) + b_1 & (w_{10}x_{01} + w_{11}x_{11} + w_{12}x_{21}) + b_1 & \cdots \\
    (w_{20}x_{00} + w_{21}x_{10} + w_{22}x_{20}) + b_2 &  (w_{20}x_{01} + w_{21}x_{11} + w_{22}x_{21}) + b_2 & \cdots
\end{bmatrix}\tag{3}  $$
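
The broadcasting step can be checked on a small numeric case. The values below are toy numbers chosen for illustration only; the sketch compares the broadcast sum with a version in which `b` is copied explicitly into every column.

```python
import numpy as np

# Toy values, for illustration only
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])          # shape (3, 3)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # shape (3, 2): two examples stored as columns
b = np.array([[10.0], [20.0], [30.0]])  # shape (3, 1)

Z = np.dot(W, X) + b                     # b is broadcast across both columns

# The same result with b tiled explicitly into a (3, 2) matrix
Z_explicit = np.dot(W, X) + np.tile(b, (1, X.shape[1]))

print(Z)
# [[21. 24.]
#  [23. 24.]
#  [39. 42.]]
print(np.array_equal(Z, Z_explicit))     # True
```

Broadcasting only works here because `b` has shape `(3, 1)`; this is why the bias vectors in this notebook are always created as column vectors.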


<a name='ex-2'></a>
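### Exercise 2 - initialize_parameters_deep

Implement initialization for an L-layer neural network.

**Instructions**:
- The model's structure is *[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID*. That is, it has $L-1$ layers using a ReLU activation function, followed by an output layer with a sigmoid activation function.
- Use random initialization for the weight matrices. Use `np.random.randn(d0, d1, ..., dn) * 0.01`.
- Use zero initialization for the biases. Use `np.zeros(shape)`.
- You'll store $n^{[l]}$, the number of units in each layer, in a variable `layer_dims`. For example, the `layer_dims` for last week's Planar Data classification model would have been [2,4,1]: two inputs, one hidden layer with 4 hidden units, and an output layer with 1 output unit. This means `W1`'s shape was (4,2), `b1` was (4,1), `W2` was (1,4), and `b2` was (1,1). Now you will generalize this to $L$ layers!
- Here is the implementation for $L=1$ (one-layer neural network). It should inspire you to implement the general case (L-layer neural network).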
```python
    if L == 1:
        parameters["W" + str(L)] = np.random.randn(layer_dims[1], layer_dims[0]) * 0.01
        parameters["b" + str(L)] = np.zeros((layer_dims[1], 1))
```
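
To see how `layer_dims` determines every parameter shape, the sketch below uses a hypothetical helper, `parameter_shapes` (not part of the assignment), that lists the shapes without allocating any arrays. It follows the rule that `Wl` has shape `(layer_dims[l], layer_dims[l-1])` and `bl` has shape `(layer_dims[l], 1)`.

```python
def parameter_shapes(layer_dims):
    """Hypothetical helper: map each parameter name to the shape implied by layer_dims."""
    shapes = {}
    # Layer 0 is the input, so parameters exist only for layers 1 .. len(layer_dims) - 1
    for l in range(1, len(layer_dims)):
        shapes["W" + str(l)] = (layer_dims[l], layer_dims[l - 1])
        shapes["b" + str(l)] = (layer_dims[l], 1)
    return shapes

planar_shapes = parameter_shapes([2, 4, 1])
print(planar_shapes)
# {'W1': (4, 2), 'b1': (4, 1), 'W2': (1, 4), 'b2': (1, 1)}
```

The result matches the Planar Data shapes quoted in the instructions above.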

In [None]:
# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims) # number of entries in layer_dims (the input layer plus every network layer)

    for l in range(1, L):
        #(≈ 2 lines of code)
        # parameters['W' + str(l)] = ...
        # parameters['b' + str(l)] = ...
        # YOUR CODE STARTS HERE
        parameters['W'+str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1])*0.01
        parameters['b'+str(l)] = np.zeros((layer_dims[l],1))
        
        # YOUR CODE ENDS HERE
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

        
    return parameters

In [None]:
parameters = initialize_parameters_deep([5,4,3])

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

initialize_parameters_deep_test(initialize_parameters_deep)

***Expected output***
```
W1 = [[ 0.01788628  0.0043651   0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865  0.00884622  0.00881318  0.01709573  0.00050034]
 [-0.00404677 -0.0054536  -0.01546477  0.00982367 -0.01101068]]
b1 = [[0.]
 [0.]
 [0.]
 [0.]]
W2 = [[-0.01185047 -0.0020565   0.01486148  0.00236716]
 [-0.01023785 -0.00712993  0.00625245 -0.00160513]
 [-0.00768836 -0.00230031  0.00745056  0.01976111]]
b2 = [[0.]
 [0.]
 [0.]]
```

<a name='4'></a>
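## 4 - Forward Propagation Module

<a name='4-1'></a>

### 4.1 - Linear Forward

Now that you have initialized your parameters, you can build the forward propagation module. Start by implementing some basic functions that you can reuse later when implementing the model. You'll complete three functions in this order:

- LINEAR

- LINEAR -> ACTIVATION, where ACTIVATION will be either ReLU or Sigmoid.

- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID (whole model)

The linear forward module (vectorized over all the examples) computes the following equation:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$

where $A^{[0]} = X$.

<a name='ex-3'></a>

### Exercise 3 - linear_forward

Build the linear part of forward propagation.

**Reminder**:

The mathematical representation of this unit is $Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}$. You may find `np.dot()` useful. If your dimensions don't match, printing `W.shape` may help.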

In [None]:
# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    
    #(≈ 1 line of code)
    # Z = ...
    # YOUR CODE STARTS HERE
    Z = np.dot(W,A)+b
    
    # YOUR CODE ENDS HERE
    cache = (A, W, b)
    
    return Z, cache

In [None]:
t_A, t_W, t_b = linear_forward_test_case()
t_Z, t_linear_cache = linear_forward(t_A, t_W, t_b)
print("Z = " + str(t_Z))

linear_forward_test(linear_forward)

***Expected output***
```
Z = [[ 3.26295337 -1.23429987]]
```

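<a name='4-2'></a>
### 4.2 - Linear-Activation Forward

In this notebook, you will use two activation functions:

- **Sigmoid**: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$. You've been provided with the `sigmoid` function, which returns **two** items: the activation value "`A`" and a "`cache`" that contains "`Z`" (this is what you will feed into the corresponding backward function). To use it, you can just call: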
``` python
A, activation_cache = sigmoid(Z)
```

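- **ReLU**: The mathematical formula for ReLU is $A = RELU(Z) = max(0, Z)$. You've been provided with the `relu` function. This function returns **two** items: the activation value "`A`" and a "`cache`" that contains "`Z`" (this is what you will feed into the corresponding backward function). To use it, you can just call: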
``` python
A, activation_cache = relu(Z)
```
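
The provided implementations live in `dnn_utils` and are not shown in this notebook. As a rough sketch only, assuming nothing beyond the behavior described above (each function returns the activation and a cache holding `Z`), they could look like the following. The names `sigmoid_sketch` and `relu_sketch` are hypothetical, chosen so that they do not shadow the imported functions.

```python
import numpy as np

def sigmoid_sketch(Z):
    """Return the sigmoid activation of Z and a cache holding Z (assumed behavior)."""
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache

def relu_sketch(Z):
    """Return the ReLU activation of Z and a cache holding Z (assumed behavior)."""
    A = np.maximum(0, Z)
    cache = Z
    return A, cache

Z_demo = np.array([[-1.0, 0.0, 2.0]])
A_sig, cache_sig = sigmoid_sketch(Z_demo)
A_relu, cache_relu = relu_sketch(Z_demo)

print(A_sig[0, 1])  # 0.5, since sigmoid(0) = 1 / (1 + 1)
print(A_relu)       # [[0. 0. 2.]]
```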

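For added convenience, you're going to group two functions (LINEAR and ACTIVATION) into one function (LINEAR->ACTIVATION). Hence, you'll implement a function that does the LINEAR forward step, followed by an ACTIVATION forward step.

<a name='ex-4'></a>

### Exercise 4 - linear_activation_forward

Implement the forward propagation of the *LINEAR->ACTIVATION* layer. The mathematical relation is: $A^{[l]} = g(Z^{[l]}) = g(W^{[l]}A^{[l-1]} + b^{[l]})$, where the activation "g" can be sigmoid() or relu(). Use `linear_forward()` and the correct activation function.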

In [None]:
# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    
    if activation == "sigmoid":
        #(≈ 2 lines of code)
        # Z, linear_cache = ...
        # A, activation_cache = ...
        # YOUR CODE STARTS HERE
        Z, linear_cache = linear_forward(A_prev,W,b)
        
        A,activation_cache = sigmoid(Z)
        # YOUR CODE ENDS HERE
    
    elif activation == "relu":
        #(≈ 2 lines of code)
        # Z, linear_cache = ...
        # A, activation_cache = ...
        # YOUR CODE STARTS HERE
        Z, linear_cache = linear_forward(A_prev,W,b)
        
        A,activation_cache = relu(Z)
        # YOUR CODE ENDS HERE
    cache = (linear_cache, activation_cache)

    return A, cache

In [None]:
t_A_prev, t_W, t_b = linear_activation_forward_test_case()

t_A, t_linear_activation_cache = linear_activation_forward(t_A_prev, t_W, t_b, activation = "sigmoid")
print("With sigmoid: A = " + str(t_A))

t_A, t_linear_activation_cache = linear_activation_forward(t_A_prev, t_W, t_b, activation = "relu")
print("With ReLU: A = " + str(t_A))

linear_activation_forward_test(linear_activation_forward)

***Expected output***
```
With sigmoid: A = [[0.96890023 0.11013289]]
With ReLU: A = [[3.43896131 0.        ]]
```

**Note**: In deep learning, the "[LINEAR->ACTIVATION]" computation is counted as a single layer in the neural network, not two layers. 

<a name='4-3'></a>
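### 4.3 - L-Layer Model

For even more convenience when implementing the $L$-layer neural network, you will need a function that replicates `linear_activation_forward` with RELU $L-1$ times, then follows that with one `linear_activation_forward` with SIGMOID.

<img src="images/model_architecture_kiank.png" style="width:600px;height:300px;">
<caption><center> <b>Figure 2</b> : *[LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID* model</center></caption><br>

<a name='ex-5'></a>
### Exercise 5 -  L_model_forward

Implement the forward propagation of the above model.

**Instructions**: In the code below, the variable `AL` will denote $A^{[L]} = \sigma(Z^{[L]}) = \sigma(W^{[L]} A^{[L-1]} + b^{[L]})$. (This is sometimes also called `Yhat`, i.e., $\hat{Y}$.)

**Hints**:
- Use the functions you've previously written
- Use a for loop to replicate [LINEAR->RELU] (L-1) times
- Don't forget to keep track of the caches in the "caches" list. To add a new value `c` to a `list`, you can use `list.append(c)`.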

In [None]:
# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
    """

    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    # The for loop starts at 1 because layer 0 is the input
    for l in range(1, L):
        A_prev = A 
        #(≈ 2 lines of code)
        # A, cache = ...
        # caches ...
        # YOUR CODE STARTS HERE
        A,cache= linear_activation_forward(A_prev,parameters['W'+str(l)],parameters['b'+str(l)],'relu')
        caches.append(cache)
        # YOUR CODE ENDS HERE
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    #(≈ 2 lines of code)
    # AL, cache = ...
    # caches ...
    # YOUR CODE STARTS HERE
    AL,cache = linear_activation_forward(A,parameters['W'+str(L)],parameters['b'+str(L)],'sigmoid')
    caches.append(cache)
    # YOUR CODE ENDS HERE
          
    return AL, caches

In [None]:
t_X, t_parameters = L_model_forward_test_case_2hidden()
t_AL, t_caches = L_model_forward(t_X, t_parameters)

print("AL = " + str(t_AL))

L_model_forward_test(L_model_forward)

***Expected output***
```
AL = [[0.03921668 0.70498921 0.19734387 0.04728177]]
```

**Awesome!** You've implemented a full forward propagation that takes the input X and outputs a row vector $A^{[L]}$ containing your predictions. It also records all intermediate values in "caches". Using $A^{[L]}$, you can compute the cost of your predictions.

<a name='5'></a>
## 5 - Cost Function

Now you can implement forward and backward propagation! You need to compute the cost, in order to check whether your model is actually learning.

<a name='ex-6'></a>
### Exercise 6 - compute_cost
Compute the cross-entropy cost $J$, using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$$
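
As a worked instance of equation (7), the sketch below evaluates the cost for three toy predictions. The arrays `AL_demo` and `Y_demo` are illustrative values, and the per-example losses are kept in a separate array so that each term of the sum is visible.

```python
import numpy as np

AL_demo = np.array([[0.8, 0.9, 0.4]])   # predicted probabilities, shape (1, m)
Y_demo = np.array([[1, 1, 0]])          # true labels, shape (1, m)
m_demo = Y_demo.shape[1]

# Per-example losses: -log(0.8), -log(0.9), -log(1 - 0.4)
losses_demo = -(Y_demo * np.log(AL_demo) + (1 - Y_demo) * np.log(1 - AL_demo))
cost_demo = np.sum(losses_demo) / m_demo

print(losses_demo)  # approximately [[0.2231 0.1054 0.5108]]
print(cost_demo)    # approximately 0.2798
```

Note that the formula is only well-defined for predictions strictly between 0 and 1, because `np.log(0)` is not finite.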


In [None]:
# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    
    m = Y.shape[1]

    # Compute loss from aL and y.
    # (≈ 1 lines of code)
    # cost = ...
    # YOUR CODE STARTS HERE
    cost = -1/m * np.sum(Y*np.log(AL)+(1-Y)*np.log(1-AL))
    
    # YOUR CODE ENDS HERE
    
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).

    
    return cost

In [None]:
t_Y, t_AL = compute_cost_test_case()
t_cost = compute_cost(t_AL, t_Y)

print("Cost: " + str(t_cost))

compute_cost_test(compute_cost)

**Expected Output**:

<table>
    <tr>
        <td><b>cost</b> </td>
    <td> 0.2797765635793422</td> 
    </tr>
</table>

<a name='6'></a>
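## 6 - Backward Propagation Module

Just as you did for forward propagation, you'll implement helper functions for backpropagation. Remember that backpropagation is used to calculate the gradient of the loss function with respect to the parameters.

**Reminder**:

<img src="images/backprop_kiank.png" style="width:650px;height:250px;">

<caption><center><font color='purple'><b>Figure 3</b>: Forward and backward propagation for LINEAR->RELU->LINEAR->SIGMOID<br> <i>The purple blocks represent the forward propagation, and the red blocks represent the backward propagation.</i></font></center></caption>

Now, similarly to forward propagation, you're going to build the backward propagation in three steps:

1. LINEAR backward

2. LINEAR -> ACTIVATION backward, where ACTIVATION computes the derivative of either the ReLU or sigmoid activation

3. [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID backward (whole model)

For the next exercise, you will need to remember that:

- `b` is a matrix (np.ndarray) with 1 column and n rows, i.e.: b = [[1.0], [2.0]] (remember that `b` is a constant)
- np.sum performs a sum over the elements of an ndarray
- axis=1 or axis=0 specify whether the sum is carried out by rows or by columns, respectively
- keepdims specifies whether the original dimensions of the matrix must be kept
- Look at the following example to clarify: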

In [None]:
A = np.array([[1, 2], [3, 4]])

print('axis=1 and keepdims=True')
print(np.sum(A, axis=1, keepdims=True))
print('axis=1 and keepdims=False')
print(np.sum(A, axis=1, keepdims=False))
print('axis=0 and keepdims=True')
print(np.sum(A, axis=0, keepdims=True))
print('axis=0 and keepdims=False')
print(np.sum(A, axis=0, keepdims=False))
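
The practical difference between the two `keepdims` settings is the shape of the result, which matters when the sum has to match the `(n, 1)` shape of `b`. The sketch below uses a toy array, `dZ_demo`, to make the two shapes explicit.

```python
import numpy as np

dZ_demo = np.array([[1.0, 2.0], [3.0, 4.0]])

kept = np.sum(dZ_demo, axis=1, keepdims=True)      # a column, shaped like a bias vector
dropped = np.sum(dZ_demo, axis=1, keepdims=False)  # a flat 1-D array

print(kept.shape)     # (2, 1)
print(dropped.shape)  # (2,)
```

Only the `keepdims=True` result can be used as a gradient with the same shape as `b`.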

<a name='6-1'></a>
### 6.1 - Linear Backward

For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose you have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$. You want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.

<img src="images/linearback_kiank.png" style="width:250px;height:300px;">
<caption><center><font color='purple'><b>Figure 4</b></font></center></caption>

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$.

Here are the formulas you need:
$$ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$


$A^{[l-1] T}$ is the transpose of $A^{[l-1]}$. 
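
One way to gain confidence in equations (8) and (9) is a finite-difference check. The sketch below is an illustration under stated assumptions, not the graded implementation: `linear_backward_sketch` and `toy_cost` are hypothetical names, and the cost $J = \frac{1}{m}\sum \frac{1}{2} Z^2$ is chosen only because its per-example derivative is simply $dZ = Z$. A local `default_rng` generator is used so the notebook's global random seed is left untouched.

```python
import numpy as np

def linear_backward_sketch(dZ, A_prev, W):
    """Equations (8)-(10) for a single layer (hypothetical stand-alone version)."""
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

def toy_cost(A_prev, W, b):
    """Hypothetical cost J = (1/m) * sum(0.5 * Z**2), so that dZ = Z."""
    Z = np.dot(W, A_prev) + b
    return 0.5 * np.sum(Z ** 2) / A_prev.shape[1]

rng = np.random.default_rng(0)           # local generator, independent of np.random.seed
A_prev_demo = rng.standard_normal((4, 5))
W_demo = rng.standard_normal((3, 4))
b_demo = rng.standard_normal((3, 1))

dZ_demo = np.dot(W_demo, A_prev_demo) + b_demo   # per-example derivative of 0.5 * Z**2
dA_prev_demo, dW_demo, db_demo = linear_backward_sketch(dZ_demo, A_prev_demo, W_demo)

# Central finite differences on one entry of W and one entry of b
eps = 1e-6
W_plus, W_minus = W_demo.copy(), W_demo.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric_dW = (toy_cost(A_prev_demo, W_plus, b_demo)
              - toy_cost(A_prev_demo, W_minus, b_demo)) / (2 * eps)

b_plus, b_minus = b_demo.copy(), b_demo.copy()
b_plus[0, 0] += eps
b_minus[0, 0] -= eps
numeric_db = (toy_cost(A_prev_demo, W_demo, b_plus)
              - toy_cost(A_prev_demo, W_demo, b_minus)) / (2 * eps)

print(np.isclose(numeric_dW, dW_demo[1, 2], atol=1e-6))  # True
print(np.isclose(numeric_db, db_demo[0, 0], atol=1e-6))  # True
```

The $\frac{1}{m}$ factor appears in `dW` and `db` because they are gradients of the averaged cost $\mathcal{J}$, whereas `dA_prev` is passed to the previous layer as a per-example quantity and carries no such factor.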

<a name='ex-7'></a>
### Exercise 7 - linear_backward 

Use the 3 formulas above to implement `linear_backward()`.

**Hint**:

- In numpy you can get the transpose of an ndarray `A` using `A.T` or `A.transpose()`

In [None]:
# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

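    #(≈ 3 lines of code)
    # dW = ...
    # db = ... sum by the rows of dZ with keepdims=True
    # dA_prev = ...
    # YOUR CODE STARTS HERE
    dW = 1/m * np.dot(dZ, A_prev.T)                  # equation (8)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)     # equation (9)
    dA_prev = np.dot(W.T, dZ)                        # equation (10)
    
    # YOUR CODE ENDS HERE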
    
    return dA_prev, dW, db

In [None]:
t_dZ, t_linear_cache = linear_backward_test_case()
t_dA_prev, t_dW, t_db = linear_backward(t_dZ, t_linear_cache)

print("dA_prev: " + str(t_dA_prev))
print("dW: " + str(t_dW))
print("db: " + str(t_db))

linear_backward_test(linear_backward)

**Expected Output**:
```
dA_prev: [[-1.15171336  0.06718465 -0.3204696   2.09812712]
 [ 0.60345879 -3.72508701  5.81700741 -3.84326836]
 [-0.4319552  -1.30987417  1.72354705  0.05070578]
 [-0.38981415  0.60811244 -1.25938424  1.47191593]
 [-2.52214926  2.67882552 -0.67947465  1.48119548]]
dW: [[ 0.07313866 -0.0976715  -0.87585828  0.73763362  0.00785716]
 [ 0.85508818  0.37530413 -0.59912655  0.71278189 -0.58931808]
 [ 0.97913304 -0.24376494 -0.08839671  0.55151192 -0.10290907]]
db: [[-0.14713786]
 [-0.11313155]
 [-0.13209101]]
 ```

<a name='6-2'></a>
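### 6.2 - Linear-Activation Backward

Next, you will create a function, **`linear_activation_backward`**, that merges two helper functions: **`linear_backward`** and the backward step for the activation.

To help you implement `linear_activation_backward`, two backward functions have been provided:
- **`sigmoid_backward`**: Implements the backward propagation for a SIGMOID unit. You can call it as follows: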

```python
dZ = sigmoid_backward(dA, activation_cache)
```

- **`relu_backward`**: Implements the backward propagation for a RELU unit. You can call it as follows:

```python
dZ = relu_backward(dA, activation_cache)
```

If $g(.)$ is the activation function, `sigmoid_backward` and `relu_backward` compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}). \tag{11}$$  
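
The provided backward functions are imported from `dnn_utils` and their source is not shown here. The sketch below is one possible reading of equation (11), assuming that the activation cache holds `Z` and that the ReLU derivative at exactly 0 is taken to be 0. The `_sketch` names are hypothetical, chosen so that they do not shadow the imported functions.

```python
import numpy as np

def relu_backward_sketch(dA, cache):
    """dZ = dA * g'(Z) for ReLU: g'(Z) is 1 where Z > 0 and 0 elsewhere (assumed at Z = 0)."""
    Z = cache
    return dA * (Z > 0)

def sigmoid_backward_sketch(dA, cache):
    """dZ = dA * g'(Z) for sigmoid: g'(Z) = s * (1 - s) with s = sigmoid(Z)."""
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

Z_act_demo = np.array([[-2.0, 0.0, 3.0]])
dA_act_demo = np.array([[0.5, 0.5, 0.5]])

dZ_relu_demo = relu_backward_sketch(dA_act_demo, Z_act_demo)
dZ_sigmoid_demo = sigmoid_backward_sketch(dA_act_demo, Z_act_demo)

print(dZ_relu_demo)           # [[0.  0.  0.5]]
print(dZ_sigmoid_demo[0, 1])  # 0.125, since sigmoid'(0) = 0.25
```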

<a name='ex-8'></a>
### Exercise 8 - linear_activation_backward

Implement the backpropagation for the *LINEAR->ACTIVATION* layer.

In [None]:
# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        #(≈ 2 lines of code)
        # dZ =  ...
        # dA_prev, dW, db =  ...
        # YOUR CODE STARTS HERE
        dZ = relu_backward(dA,activation_cache)
        dA_prev, dW, db =linear_backward(dZ,linear_cache)
        # YOUR CODE ENDS HERE
        
    elif activation == "sigmoid":
        #(≈ 2 lines of code)
        # dZ =  ...
        # dA_prev, dW, db =  ...
        # YOUR CODE STARTS HERE
        
        dZ = sigmoid_backward(dA,activation_cache)
        dA_prev, dW, db =linear_backward(dZ,linear_cache)
        
        # YOUR CODE ENDS HERE
    
    return dA_prev, dW, db

In [None]:
t_dAL, t_linear_activation_cache = linear_activation_backward_test_case()

t_dA_prev, t_dW, t_db = linear_activation_backward(t_dAL, t_linear_activation_cache, activation = "sigmoid")
print("With sigmoid: dA_prev = " + str(t_dA_prev))
print("With sigmoid: dW = " + str(t_dW))
print("With sigmoid: db = " + str(t_db))

t_dA_prev, t_dW, t_db = linear_activation_backward(t_dAL, t_linear_activation_cache, activation = "relu")
print("With relu: dA_prev = " + str(t_dA_prev))
print("With relu: dW = " + str(t_dW))
print("With relu: db = " + str(t_db))

linear_activation_backward_test(linear_activation_backward)

**Expected output:**

```
With sigmoid: dA_prev = [[ 0.11017994  0.01105339]
 [ 0.09466817  0.00949723]
 [-0.05743092 -0.00576154]]
With sigmoid: dW = [[ 0.10266786  0.09778551 -0.01968084]]
With sigmoid: db = [[-0.05729622]]
With relu: dA_prev = [[ 0.44090989  0.        ]
 [ 0.37883606  0.        ]
 [-0.2298228   0.        ]]
With relu: dW = [[ 0.44513824  0.37371418 -0.10478989]]
With relu: db = [[-0.20837892]]
```

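<a name='6-3'></a>
### 6.3 - L-Model Backward 

Now you will implement the backward function for the whole network!

Recall that when you implemented the `L_model_forward` function, at each iteration you stored a cache containing the layer's input (`A_prev`), `W`, `b`, and `Z`. In the backpropagation module, you'll use those variables to compute the gradients. Therefore, in the `L_model_backward` function, you'll iterate through all the layers backward, starting from layer $L$. At each step, you will use the cached values for layer $l$ to backpropagate through layer $l$. Figure 5 shows the backward pass.

**Initializing backpropagation**:

To backpropagate through this network, you know that the output is: $A^{[L]} = \sigma(Z^{[L]})$. Your code thus needs to compute `dAL` $= \frac{\partial \mathcal{L}}{\partial A^{[L]}}$. To do so, use this formula (derived using calculus, which you don't need in-depth knowledge of!):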
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
```
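
For readers who want to see where this comes from: differentiating the per-example loss $-(y\log a + (1-y)\log(1-a))$ with respect to $a$ gives $-\left(\frac{y}{a} - \frac{1-y}{1-a}\right)$, which is the expression above. The sketch below, using toy arrays `AL_demo` and `Y_demo`, also checks a well-known consequence: multiplying `dAL` by the sigmoid derivative $A^{[L]}(1 - A^{[L]})$ collapses to $A^{[L]} - Y$.

```python
import numpy as np

AL_demo = np.array([[0.8, 0.9, 0.4]])   # toy predictions
Y_demo = np.array([[1, 1, 0]])          # toy labels

dAL_demo = -(np.divide(Y_demo, AL_demo) - np.divide(1 - Y_demo, 1 - AL_demo))

# Sigmoid derivative written in terms of the output: g'(Z) = AL * (1 - AL)
dZL_demo = dAL_demo * AL_demo * (1 - AL_demo)

print(dAL_demo)                                  # approximately [[-1.25 -1.1111  1.6667]]
print(np.allclose(dZL_demo, AL_demo - Y_demo))   # True
```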

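You can then use this post-activation gradient `dAL` to keep going backward. As seen in Figure 5, you can now feed `dAL` into the LINEAR->SIGMOID backward function you implemented (which will use the cached values stored by the L_model_forward function).

After that, you will use a `for` loop to iterate through all the other layers using the LINEAR->RELU backward function. You should store each dA, dW, and db in the grads dictionary. To do so, use this formula:

$$grads["dW" + str(l)] = dW^{[l]}\tag{15} $$

For example, for $l=3$ this would store $dW^{[l]}$ in `grads["dW3"]`.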

<a name='ex-9'></a>
### Exercise 9 -  L_model_backward

Implement backpropagation for the *[LINEAR->RELU] $\times$ (L-1) -> LINEAR -> SIGMOID* model.

In [None]:
# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    #(1 line of code)
    # dAL = ...
    # YOUR CODE STARTS HERE
    
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    
    # YOUR CODE ENDS HERE
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
    #(approx. 5 lines)
    # current_cache = ...
    # dA_prev_temp, dW_temp, db_temp = ...
    # grads["dA" + str(L-1)] = ...
    # grads["dW" + str(L)] = ...
    # grads["db" + str(L)] = ...
    # YOUR CODE STARTS HERE
    current_cache = caches[L-1]
    dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dAL,current_cache,'sigmoid')
    grads["dA" + str(L-1)] = dA_prev_temp
    grads["dW" + str(L)] = dW_temp
    grads["db" + str(L)] = db_temp
    
    # YOUR CODE ENDS HERE
    
    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        #(approx. 5 lines)
        # current_cache = ...
        # dA_prev_temp, dW_temp, db_temp = ...
        # grads["dA" + str(l)] = ...
        # grads["dW" + str(l + 1)] = ...
        # grads["db" + str(l + 1)] = ...
        # YOUR CODE STARTS HERE
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(dA_prev_temp,current_cache,"relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        
        
        
        # YOUR CODE ENDS HERE

    return grads

In [None]:
t_AL, t_Y_assess, t_caches = L_model_backward_test_case()
grads = L_model_backward(t_AL, t_Y_assess, t_caches)

print("dA0 = " + str(grads['dA0']))
print("dA1 = " + str(grads['dA1']))
print("dW1 = " + str(grads['dW1']))
print("dW2 = " + str(grads['dW2']))
print("db1 = " + str(grads['db1']))
print("db2 = " + str(grads['db2']))

L_model_backward_test(L_model_backward)

**Expected output:**

```
dA0 = [[ 0.          0.52257901]
 [ 0.         -0.3269206 ]
 [ 0.         -0.32070404]
 [ 0.         -0.74079187]]
dA1 = [[ 0.12913162 -0.44014127]
 [-0.14175655  0.48317296]
 [ 0.01663708 -0.05670698]]
dW1 = [[0.41010002 0.07807203 0.13798444 0.10502167]
 [0.         0.         0.         0.        ]
 [0.05283652 0.01005865 0.01777766 0.0135308 ]]
dW2 = [[-0.39202432 -0.13325855 -0.04601089]]
db1 = [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
db2 = [[0.15187861]]
```

<a name='6-4'></a>
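### 6.4 - Update Parameters

In this section, you'll update the parameters of the model using gradient descent:

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$

where $\alpha$ is the learning rate.

After computing the updated parameters, store them in the parameters dictionary.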
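
A single update step can be traced by hand on a one-layer toy example. The dictionaries `toy_params` and `toy_grads` below are illustrative values only; the loop applies equations (16) and (17) and writes the results into a new dictionary.

```python
import numpy as np

toy_params = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
toy_grads = {"dW1": np.array([[0.1, 0.2]]), "db1": np.array([[-0.3]])}
alpha = 0.1                                   # learning rate

toy_updated = {}
num_layers = len(toy_params) // 2             # one W and one b per layer
for l in range(1, num_layers + 1):
    toy_updated["W" + str(l)] = toy_params["W" + str(l)] - alpha * toy_grads["dW" + str(l)]
    toy_updated["b" + str(l)] = toy_params["b" + str(l)] - alpha * toy_grads["db" + str(l)]

print(toy_updated["W1"])  # [[ 0.99 -2.02]]
print(toy_updated["b1"])  # [[0.53]]
```

Each parameter moves against the sign of its gradient: `b1` increases because `db1` is negative.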

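<a name='ex-10'></a>
### Exercise 10 - update_parameters

Implement `update_parameters()` to update your parameters using gradient descent.

**Instructions**:

Update the parameters using gradient descent on every $W^{[l]}$ and $b^{[l]}$ for $l = 1, 2, ..., L$.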

In [None]:
# GRADED FUNCTION: update_parameters

def update_parameters(params, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Arguments:
    params -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- learning rate (alpha) used in the gradient descent update
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    parameters = params.copy()
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    #(≈ 2 lines of code)
    for l in range(L):
        # parameters["W" + str(l+1)] = ...
        # parameters["b" + str(l+1)] = ...
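        # YOUR CODE STARTS HERE
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
        
        # YOUR CODE ENDS HERE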
    return parameters

In [None]:
t_parameters, grads = update_parameters_test_case()
t_parameters = update_parameters(t_parameters, grads, 0.1)

print ("W1 = "+ str(t_parameters["W1"]))
print ("b1 = "+ str(t_parameters["b1"]))
print ("W2 = "+ str(t_parameters["W2"]))
print ("b2 = "+ str(t_parameters["b2"]))

update_parameters_test(update_parameters)

**Expected output:**

```
W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]
```

### Congratulations! 

You've just implemented all the functions required for building a deep neural network, including: 

- Using non-linear units to improve your model
- Building a deeper neural network (with more than 1 hidden layer)
- Implementing an easy-to-use neural network class

This was indeed a long assignment, but the next part of the assignment is easier. ;) 

In the next assignment, you'll be putting all these together to build two models:

- A two-layer neural network
- An L-layer neural network

You will in fact use these models to classify cat vs non-cat images! (Meow!) Great work and see you next time. 