# Lab 4: Regularization

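Deep learning models are highly flexible and powerful, but when the training set is too small they can suffer from a serious problem: **overfitting**. The model performs well on the training set but may perform poorly on the test set; in other words, it **generalizes** poorly.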

Goal: use regularization in your deep learning model to prevent overfitting.

<a name='1'></a>
## 1 - Packages

In [None]:
# import packages
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import scipy.io
from lib_reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from lib_reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
from lib_testCases import *
from lib_public_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

<a name='2'></a>
## 2 - Problem Statement

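You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goalkeeper should kick the ball so that French players have the best chance of getting to it first.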

<img src="images/field_kiank.png" style="width:600px;height:350px;">
<caption><center> <u> <b>Figure 1</b> </u>: <b>Football field</b><br> The goal keeper kicks the ball in the air, the players of each team are fighting to hit the ball with their head </center></caption>

They give you a 2D dataset from France's past 10 games.

<a name='3'></a>
## 3 - Loading the Dataset

In [None]:
train_X, train_Y, test_X, test_Y = load_2D_dataset()

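France's goalkeeper kicks the ball from the left side of the field toward the right. Each dot corresponds to a position on the football field:
- If the dot is blue, a French player got the ball
- If the dot is red, a player from the other team got the ball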

**Your goal**: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.

**Analysis of the dataset**: The dataset is a little noisy, but it looks like a diagonal line could separate the blue region from the red region.

You will first try a non-regularized model. Then you will learn how to regularize it and decide which model is better suited to this problem.

<a name='4'></a>
## 4 - Non-Regularized Model

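You will use the following neural network (already implemented for you below), where:
- when `lambd` is nonzero, the model applies *L2 regularization* (note: `lambda` is a reserved keyword in Python, so `lambd` is used instead)
- when `keep_prob` is less than 1, the model applies *dropout*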

You will first try the model without any regularization. Then you will implement:
- *L2 regularization* -- functions: "`compute_cost_with_regularization()`" and "`backward_propagation_with_regularization()`"
- *Dropout* -- functions: "`forward_propagation_with_dropout()`" and "`backward_propagation_with_dropout()`"

In each part, you will run this model with the appropriate inputs. Read the code below carefully.

In [None]:
def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.
    
    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """
        
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
    
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
            
        # Backward propagation.
        assert (lambd == 0 or keep_prob == 1)   # it is possible to use both L2 regularization and dropout, 
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

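With the default arguments, neither L2 regularization nor dropout is applied. Let's train the model and look at its accuracy on the training and test sets.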

In [None]:
parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The training accuracy is 94.8% and the test accuracy is 91.5%.
This is the **baseline model**. Run the following code to plot its decision boundary.

In [None]:
plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

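The non-regularized model is clearly overfitting the training set: it is fitting the noisy points! Let's now look at two techniques to reduce overfitting.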

<a name='5'></a>
## 5 - L2 Regularization

The standard way to avoid overfitting is called **L2 regularization**. It consists of changing the original cost function:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
to:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

Let's modify your cost function and observe the consequences.

<a name='ex-1'></a>
### Exercise 1 - compute_cost_with_regularization
Implement `compute_cost_with_regularization()`, which computes the cost given by formula (2).
*Hint*: you can use the following Python code to compute $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$:
```python
np.sum(np.square(Wl))
```
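As an illustration of the regularization term in formula (2), here is a minimal sketch on made-up weight matrices. The helper name `l2_regularization_cost` and the toy values are assumptions for this example only; they are not part of the lab's library, and this is not the graded solution.

```python
import numpy as np

def l2_regularization_cost(weights, lambd, m):
    """Return (lambd / (2 * m)) * sum of squared entries over all weight matrices.

    Hypothetical helper for illustration; `weights` is a list of NumPy arrays.
    """
    squared_sum = sum(np.sum(np.square(W)) for W in weights)
    return lambd / (2.0 * m) * squared_sum

# Toy weight matrices (made-up values, not the notebook's parameters).
W1_toy = np.array([[1.0, -2.0], [0.5, 0.0]])
W2_toy = np.array([[3.0, -1.0]])

# Sum of squares is 5.25 + 10.0 = 15.25, so the penalty is 0.1 / (2 * 5) * 15.25.
toy_cost = l2_regularization_cost([W1_toy, W2_toy], lambd=0.1, m=5)
print(toy_cost)
```

Note that only the weight matrices enter the penalty; formula (2) does not include the bias vectors.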

In [None]:
"""
Function:
    Implement the cost function with L2 regularization. See formula (2) above.
Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    lambd -- regularization hyperparameter, scalar
Returns:
    cost -- value of the regularized loss function (formula (2))
"""
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    #(≈ 1 lines of code)
    # L2_regularization_cost = 
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

In [None]:
A3, t_Y, parameters = compute_cost_with_regularization_test_case()
cost = compute_cost_with_regularization(A3, t_Y, parameters, lambd=0.1)
print("cost = " + str(cost))

compute_cost_with_regularization_test(compute_cost_with_regularization)

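Because you changed the cost, you have to change backward propagation as well: all the gradients must be computed with respect to the new cost.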

<a name='ex-2'></a>
### Exercise 2 - backward_propagation_with_regularization
Change the computation of dW1, dW2, and dW3 by adding the gradient of the regularization term to each ($\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m}  W^2) = \frac{\lambda}{m} W$).
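The derivative of the regularization term can be checked numerically. The sketch below compares the analytic gradient $\frac{\lambda}{m} W$ with a central finite difference on a random toy matrix; the function names and the values of `lambd`, `m`, and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def l2_penalty(W, lambd, m):
    """Regularization term for a single weight matrix: (lambd / (2m)) * sum(W^2)."""
    return lambd / (2.0 * m) * np.sum(np.square(W))

def l2_penalty_grad(W, lambd, m):
    """Analytic gradient of the term above with respect to W."""
    return lambd / m * W

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))       # toy weight matrix
lambd, m, eps = 0.7, 5, 1e-6

# Central finite difference, one entry of W at a time.
numeric_grad = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    numeric_grad[idx] = (l2_penalty(W_plus, lambd, m)
                         - l2_penalty(W_minus, lambd, m)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric_grad - l2_penalty_grad(W, lambd, m)))
print(max_abs_diff)  # should be tiny if the analytic gradient is right
```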

In [None]:
"""
Function:
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""
def backward_propagation_with_regularization(X, Y, cache, lambd):
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y
    #(≈ 1 lines of code)
    # dW3 = 1./m * np.dot(dZ3, A2.T) + ?
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    #(≈ 1 lines of code)
    # dW2 = 1./m * np.dot(dZ2, A1.T) + ?
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    #(≈ 1 lines of code)
    # dW1 = 1./m * np.dot(dZ1, X.T) + ?
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

In [None]:
t_X, t_Y, cache = backward_propagation_with_regularization_test_case()

grads = backward_propagation_with_regularization(t_X, t_Y, cache, lambd = 0.7)
print ("dW1 = \n"+ str(grads["dW1"]))
print ("dW2 = \n"+ str(grads["dW2"]))
print ("dW3 = \n"+ str(grads["dW3"]))
backward_propagation_with_regularization_test(backward_propagation_with_regularization)

Now run the model with L2 regularization $(\lambda = 0.7)$. The `model()` function will call:
- `compute_cost_with_regularization` instead of `compute_cost`
- `backward_propagation_with_regularization` instead of `backward_propagation`

In [None]:
parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Congratulations, the test set accuracy increased to 93%. Let's plot the decision boundary again.

In [None]:
plt.title("Model with L2-regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

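**Observations**:
- $\lambda$ is a hyperparameter that you can tune using a validation (dev) set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it can "oversmooth" the boundary, resulting in a model with high bias.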
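To make the tuning procedure concrete, here is a self-contained sketch of choosing $\lambda$ on a held-out set. It deliberately swaps in a much simpler model, closed-form ridge regression on synthetic data, instead of the notebook's neural network, so that it runs on its own. The data, the candidate values, and the helper `fit_ridge` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y depends on 3 of 20 features, plus noise.
n_train, n_dev, n_features = 30, 200, 20
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
X_train = rng.standard_normal((n_train, n_features))
X_dev = rng.standard_normal((n_dev, n_features))
y_train = X_train @ true_w + 0.5 * rng.standard_normal(n_train)
y_dev = X_dev @ true_w + 0.5 * rng.standard_normal(n_dev)

def fit_ridge(X, y, lambd):
    """Minimize ||Xw - y||^2 + lambd * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(d), X.T @ y)

# Train once per candidate value and score each model on the dev set only.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
dev_errors = {}
for lambd in candidates:
    w = fit_ridge(X_train, y_train, lambd)
    dev_errors[lambd] = float(np.mean((X_dev @ w - y_dev) ** 2))

best_lambd = min(dev_errors, key=dev_errors.get)
print(dev_errors)
print("selected lambda:", best_lambd)
```

The same loop structure would apply to `model(train_X, train_Y, lambd=...)`, provided a separate validation split is available; the test set should not be used for this selection.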

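**What is L2 regularization actually doing?**

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, you drive the weights toward smaller values, which gives a smoother model whose output changes more slowly as the input changes.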
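One way to see why L2 regularization shrinks the weights: a gradient step on the regularized cost is algebraically the same as first multiplying the weights by a factor slightly below 1 and then taking the unregularized step. The sketch below checks this on random toy arrays; `dW_data` is a stand-in for the cross-entropy gradient, and all values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))          # toy weights
dW_data = rng.standard_normal((3, 4))    # stand-in for the unregularized gradient
learning_rate, lambd, m = 0.3, 0.7, 200

# Form 1: gradient step using the regularized gradient dW_data + (lambd / m) * W.
W_step = W - learning_rate * (dW_data + lambd / m * W)

# Form 2: shrink the weights by a constant factor, then take the plain step.
decay = 1.0 - learning_rate * lambd / m
W_decay = decay * W - learning_rate * dW_data

print(np.allclose(W_step, W_decay))  # the two forms agree
print(decay)                         # shrink factor applied at every iteration
```

This per-iteration shrink factor is the reason the technique is also called "weight decay".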

<br>
<font color='blue'>
    
**What you should remember:** the implications of L2-regularization on:
- The cost computation:
    - A regularization term is added to the cost.
- The backpropagation function:
    - There are extra terms in the gradients with respect to weight matrices.
- Weights end up smaller ("weight decay"): 
    - Weights are pushed to smaller values.

<a name='6'></a>
## 6 - Dropout

Finally, **dropout** is a widely used regularization technique in deep learning.
**It randomly shuts down some neurons in each iteration.**

<!--
To understand drop-out, consider this conversation with a friend:
- Friend: "Why do you need all these neurons to train your network and classify images?". 
- You: "Because each neuron contains a weight and can learn specific features/details/shape of an image. The more neurons I have, the more features my model learns!"
- Friend: "I see, but are you sure that your neurons are learning different features and not all the same features?"
- You: "Good point... Neurons in the same layer actually don't talk to each other. It should be definitely possible that they learn the same image features/shapes/forms/details... which would be redundant. There should be a solution."
!--> 


<center>
<video width="620" height="440" src="images/dropout1_kiank.mp4" type="video/mp4" controls>
</video>
</center>
<br>
<caption><center> <u> <b>Figure 2 </b></u>: <b>Drop-out on the second hidden layer.</b> <br> At each iteration, you shut down (= set to zero) each neuron of a layer with probability $1 - keep\_prob$ or keep it with probability $keep\_prob$ (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration. </center></caption>

<center>
<video width="620" height="440" src="images/dropout2_kiank.mp4" type="video/mp4" controls>
</video>
</center>

<caption><center> <u> <b>Figure 3</b> </u>:<b> Drop-out on the first and third hidden layers. </b><br> $1^{st}$ layer: we shut down on average 40% of the neurons.  $3^{rd}$ layer: we shut down on average 20% of the neurons. </center></caption>

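When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration you train a different model that uses only a subset of your neurons.
With dropout, your neurons become less sensitive to the activation of any one specific neuron, because that neuron might be shut down at any time.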

<a name='6-1'></a>
### 6.1 - Forward Propagation with Dropout

<a name='ex-3'></a>
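#### Exercise 3 - forward_propagation_with_dropout

Implement forward propagation with dropout. You are using a 3-layer neural network and will add dropout to the first and second hidden layers. Dropout is not applied to the input layer or the output layer.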

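**Instructions**:
To shut down some neurons in the first and second layers, carry out the following 4 steps:
1. Create a variable $d^{[1]}$ with the same shape as $a^{[1]}$, using `np.random.rand()` to draw random numbers in $[0, 1)$. Because you are using a vectorized implementation, you will create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}] $ with the same dimensions as $A^{[1]}$.
2. Set each entry of $D^{[1]}$ to 1 with probability `keep_prob`, and to 0 otherwise.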

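**Hint:**
With keep_prob = 0.8, you want to keep about 80% of the neurons and shut down the other 20%, so about 80% of the entries of $D^{[1]}$ should be 1 and about 20% should be 0.
The Python code
`X = (X < keep_prob).astype(int)`
has the same effect as the following if-else loop (for the simple case of a one-dimensional array):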
```python
for i, v in enumerate(X):
    if v < keep_prob:
        X[i] = 1
    else:  # v >= keep_prob
        X[i] = 0
```
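Note:
- `X = (X < keep_prob).astype(int)` also works on multi-dimensional arrays, and the result has the same shape as the input.
- `.astype(int)` explicitly converts the Boolean results `True` and `False` to the integers 1 and 0. Python would also convert them implicitly, for example when a Boolean is multiplied by a number.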


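3. Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. You can think of $D^{[1]}$ as a mask: entries multiplied by 1 are kept and entries multiplied by 0 are dropped.
4. Divide $A^{[1]}$ by `keep_prob`. This keeps the expected value of the activations the same as without dropout. This technique is called inverted dropout.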
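The four steps can be sketched on a toy activation matrix as follows. This is an illustration under stated assumptions, not the graded solution: the helper name `inverted_dropout` is hypothetical, and it draws random numbers from NumPy's `Generator` API rather than the seeded `np.random.rand()` that the exercise and its tests expect.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return (A_dropped, mask)."""
    D = rng.random(A.shape)            # Step 1: uniform samples in [0, 1)
    D = (D < keep_prob).astype(int)    # Step 2: 1 with probability keep_prob, else 0
    A = A * D                          # Step 3: zero out the dropped units
    A = A / keep_prob                  # Step 4: rescale the surviving units
    return A, D

rng = np.random.default_rng(0)
A_toy = np.ones((4, 5))                # toy activations: 4 units, 5 examples
A_dropped, D_toy = inverted_dropout(A_toy, keep_prob=0.8, rng=rng)

print(D_toy)       # 0/1 mask with the same shape as A_toy
print(A_dropped)   # kept entries equal 1 / 0.8, dropped entries equal 0
```

Returning the mask alongside the activations mirrors the way the exercise stores `D1` and `D2` in the cache for use in backpropagation.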

In [None]:
"""
Function:
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
        W1 -- weight matrix of shape (20, 2)
        b1 -- bias vector of shape (20, 1)
        W2 -- weight matrix of shape (3, 20)
        b2 -- bias vector of shape (3, 1)
        W3 -- weight matrix of shape (1, 3)
        b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1, number of examples)
    cache -- tuple, information stored for computing the backward propagation
"""
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    np.random.seed(1)
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    #(≈ 4 lines of code)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    # D1 =                       # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    # D1 =                       # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    # A1 =                       # Step 3: shut down some neurons of A1
    # A1 =                       # Step 4: scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    #(≈ 4 lines of code)
    # D2 =                       # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    # D2 =                       # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    # A2 =                       # Step 3: shut down some neurons of A2
    # A2 =                       # Step 4: scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

In [None]:
t_X, parameters = forward_propagation_with_dropout_test_case()

A3, cache = forward_propagation_with_dropout(t_X, parameters, keep_prob=0.7)
print ("A3 = " + str(A3))

forward_propagation_with_dropout_test(forward_propagation_with_dropout)

<a name='6-2'></a>
### 6.2 - Backward Propagation with Dropout

<a name='ex-4'></a>
#### Exercise 4 - backward_propagation_with_dropout
Implement backward propagation with dropout, using the masks $D^{[1]}$ and $D^{[2]}$ stored in the cache.

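**Instructions**:
Backpropagation with dropout takes the following 2 steps:
1. During forward propagation you shut down some neurons by applying the mask $D^{[1]}$ to `A1`. In backpropagation, shut down the same neurons by applying the same mask $D^{[1]}$ to `dA1`.
2. During forward propagation you divided `A1` by `keep_prob`. In backpropagation, divide `dA1` by `keep_prob` as well (interpretation: if $A^{[1]}$ is scaled by a factor of 1/`keep_prob`, then its derivative $dA^{[1]}$ is scaled by the same factor).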
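The two steps can be illustrated on toy arrays. In the sketch below, `dA_upstream` is a made-up gradient flowing back into the dropped-out activations; the variable names and values are assumptions for this example and are separate from the exercise's cache layout.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

# Forward pass on toy activations: build the mask, apply it, rescale.
A = rng.random((3, 4)) + 0.1
D = (rng.random(A.shape) < keep_prob).astype(int)
A_dropped = A * D / keep_prob

# Backward pass: made-up gradient with respect to A_dropped.
dA_upstream = rng.standard_normal(A.shape)
dA = dA_upstream * D        # Step 1: shut down the same units as in the forward pass
dA = dA / keep_prob         # Step 2: apply the same 1 / keep_prob scaling

print(dA)
```

Units that were dropped in the forward pass receive exactly zero gradient, which is why the mask has to be saved rather than resampled.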

In [None]:
"""
Function:
    Implements the backward propagation of our baseline model to which we added dropout.
Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    #(≈ 2 lines of code)
    # dA2 =                # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    # dA2 =                # Step 2: Scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = np.dot(W2.T, dZ2)
    #(≈ 2 lines of code)
    # dA1 =                # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    # dA1 =                # Step 2: Scale the value of neurons that haven't been shut down
    # YOUR CODE STARTS HERE

    # YOUR CODE ENDS HERE
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

In [None]:
t_X, t_Y, cache = backward_propagation_with_dropout_test_case()

gradients = backward_propagation_with_dropout(t_X, t_Y, cache, keep_prob=0.8)

print ("dA1 = \n" + str(gradients["dA1"]))
print ("dA2 = \n" + str(gradients["dA2"]))

backward_propagation_with_dropout_test(backward_propagation_with_dropout)

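Let's now run the model with dropout (`keep_prob = 0.86`), which means that at every iteration each neuron in layers 1 and 2 is shut down with probability 0.14.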

In [None]:
parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Dropout works well: the test accuracy has increased again (to 95%)!
Run the code below to plot the decision boundary.

In [None]:
plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

**Note**:
- A **common mistake** is to use dropout during both training and testing. You should apply dropout (randomly eliminating nodes) only during training.
- Deep learning frameworks such as [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](http://doc.paddlepaddle.org/release_doc/0.9.0/doc/ui/api/trainer_config_helpers/attrs.html), [keras](https://keras.io/layers/core/#dropout), and [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation.
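A minimal sketch of the train/test distinction, assuming a hypothetical `dropout_layer` helper with a `training` flag (this helper is not part of the lab's code, where `predict` simply runs the forward pass without dropout):

```python
import numpy as np

def dropout_layer(A, keep_prob, training, rng=None):
    """Inverted dropout that is active only when training is True."""
    if not training or keep_prob >= 1.0:
        return A                                   # test time: activations pass through
    rng = np.random.default_rng() if rng is None else rng
    D = (rng.random(A.shape) < keep_prob).astype(int)
    return A * D / keep_prob                       # training time: mask and rescale

A_toy = np.arange(6, dtype=float).reshape(2, 3)
train_out = dropout_layer(A_toy, keep_prob=0.5, training=True,
                          rng=np.random.default_rng(0))
test_out = dropout_layer(A_toy, keep_prob=0.5, training=False)

print(train_out)   # some entries zeroed, the rest doubled
print(test_out)    # identical to A_toy
```

Because inverted dropout already rescales during training, no extra scaling is needed at test time.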

<font color='blue'>
    
**What you should remember about dropout:**
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.  
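The claim about the expected value can be checked empirically. The sketch below applies inverted dropout to a large constant matrix for several values of `keep_prob` and compares the resulting mean with the original activation value; the matrix size and the constant 3.0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.full((1000, 1000), 3.0)     # every activation equals 3.0

means = {}
for keep_prob in (0.5, 0.8, 0.86):
    D = (rng.random(A.shape) < keep_prob).astype(int)
    means[keep_prob] = float(np.mean(A * D / keep_prob))

print(means)   # each mean should be close to 3.0, up to sampling noise
```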

<a name='7'></a>
## 7 - Conclusions

**Results of the three models**:

<table> 
    <tr>
        <td>
        <b>model</b>
        </td>
        <td>
        <b>train accuracy</b>
        </td>
        <td>
        <b>test accuracy</b>
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN without regularization
        </td>
        <td>
        95%
        </td>
        <td>
        91.5%
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with L2-regularization
        </td>
        <td>
        94%
        </td>
        <td>
        93%
        </td>
    </tr>
    <tr>
        <td>
        3-layer NN with dropout
        </td>
        <td>
        93%
        </td>
        <td>
        95%
        </td>
    </tr>
</table> 

Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. 

Congratulations on finishing this assignment! And also for revolutionizing French football. :-) 

<font color='blue'>
    
**What we want you to remember from this notebook**:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.