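##### The concept of dropout
Dropout applies the following operations to the input layer or to a hidden layer:

- Randomly select a subset of that layer's outputs as the elements to drop
- Multiply the dropped elements by 0
- Scale up the elements that were not dropped (by 1 / keep probability)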
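
The reason for scaling up the surviving elements can be made explicit with a short derivation. The notation is introduced here only for illustration: $h_i$ is one element of the layer's output, $p$ is the drop probability with $0 \le p < 1$, and $\xi_i$ is the random keep mask. The scale factor $1/(1-p)$ is assumed to be the one used by `scale = 1 / keep_probability` in the implementation.

```latex
h_i' = \frac{\xi_i}{1-p}\, h_i,
\qquad P(\xi_i = 1) = 1 - p, \quad P(\xi_i = 0) = p

\mathbb{E}[h_i'] = \frac{\mathbb{E}[\xi_i]}{1-p}\, h_i
                 = \frac{1-p}{1-p}\, h_i
                 = h_i
```

Under these assumptions the expected value of each output is unchanged, so the next layer sees inputs of the same average magnitude whatever the value of $p$.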

In [1]:
# Implementation
from mxnet import nd

def dropout(X, drop_probability):
    # drop_probability is the probability that each element is dropped
    keep_probability = 1 - drop_probability
    assert 0 <= keep_probability <= 1
    # With a drop probability of 1 (keep probability 0), drop every element
    if keep_probability == 0:
        return X.zeros_like()

    # Keep each element independently with probability keep_probability
    mask = nd.random.uniform(0, 1.0, X.shape, ctx=X.context) < keep_probability
    # Scale up the kept elements by 1 / keep_probability
    scale = 1 / keep_probability
    return mask * X * scale

In [2]:
A = nd.arange(20).reshape((5, 4))
dropout(A, 0.0)


[[  0.   1.   2.   3.]
 [  4.   5.   6.   7.]
 [  8.   9.  10.  11.]
 [ 12.  13.  14.  15.]
 [ 16.  17.  18.  19.]]
<NDArray 5x4 @cpu(0)>

In [3]:
dropout(A, 0.5)


[[  0.   0.   0.   6.]
 [  0.  10.   0.   0.]
 [ 16.  18.  20.   0.]
 [ 24.  26.   0.   0.]
 [  0.  34.   0.   0.]]
<NDArray 5x4 @cpu(0)>

In [4]:
dropout(A, 1.0)


[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
<NDArray 5x4 @cpu(0)>
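
In the three runs above, a drop probability of 0 leaves `A` unchanged, 0.5 zeroes some entries and doubles the rest, and 1 zeroes everything. The sketch below is a pure-Python stand-in for the MXNet `dropout` function (the name `inverted_dropout`, the seed, and the trial count are choices made here for illustration). It averages the output over many random masks to check empirically that the average stays close to the original values.

```python
import random

def inverted_dropout(values, drop_probability, rng):
    """Drop each value with drop_probability and rescale the survivors."""
    keep_probability = 1 - drop_probability
    assert 0 <= keep_probability <= 1
    if keep_probability == 0:
        return [0.0 for _ in values]
    scale = 1 / keep_probability
    return [v * scale if rng.random() < keep_probability else 0.0
            for v in values]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
values = [float(v) for v in range(20)]
trials = 20000

# Average the dropout output over many independent masks.
sums = [0.0] * len(values)
for _ in range(trials):
    for i, out in enumerate(inverted_dropout(values, 0.5, rng)):
        sums[i] += out
averages = [s / trials for s in sums]

# Largest gap between the averaged output and the original value.
max_gap = max(abs(a - v) for a, v in zip(averages, values))
print('max gap between averaged output and input:', max_gap)
```

The gap should shrink as `trials` grows, since each averaged entry is an estimate of the expected output.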

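#### The essence of dropout
Ensemble learning: sample the training set with replacement several times and train a separate classifier on each sample; at test time, combine the outputs of these classifiers into the final classification result.

Dropout simulates ensemble learning.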

![](http://zh.gluon.ai/_images/dropout.png)

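In effect, dropout trains, for each such dataset, a classifier that is a sub-network of the original neural network. Unlike ordinary ensemble learning, all of these sub-network classifiers share the same set of parameters, which is why dropout only simulates ensemble learning.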
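
For comparison, the ordinary ensemble procedure described above (sampling with replacement, training one classifier per sample, combining predictions) might be sketched as follows. Everything in this block is a toy assumption made for illustration: the 1-D dataset, the threshold classifier, the majority vote, and all function names. Note that each classifier here has its own parameter (its threshold), which is exactly what dropout does not have.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples from dataset with replacement."""
    return [rng.choice(dataset) for _ in dataset]

def train_threshold_classifier(sample):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    # Fall back to 0.0 if a class happens to be missing from the sample.
    mean0 = sum(zeros) / len(zeros) if zeros else 0.0
    mean1 = sum(ones) / len(ones) if ones else 0.0
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x > threshold else 0

def ensemble_predict(classifiers, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy data: class 0 lies in [0, 3.8], class 1 lies in [6, 9.8].
dataset = ([(i * 0.2, 0) for i in range(20)]
           + [(6 + i * 0.2, 1) for i in range(20)])

rng = random.Random(0)  # fixed seed, chosen arbitrarily
classifiers = [train_threshold_classifier(bootstrap_sample(dataset, rng))
               for _ in range(5)]

print('prediction for 2.0:', ensemble_predict(classifiers, 2.0))
print('prediction for 8.0:', ensemble_predict(classifiers, 8.0))
```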

In [5]:
# Load the data
import sys
sys.path.append('..')
import utils

batch_size = 256
train_data, test_data = utils.load_data_fashion_mnist(batch_size)

In [6]:
# Define the network parameters

num_inputs = 28 * 28  # each input is a 28 x 28 image
num_outputs = 10  # 10 output classes

num_hidden1 = 256  # hidden layer 1
num_hidden2 = 256  # hidden layer 2
weight_scale = .01

W1 = nd.random_normal(shape=(num_inputs, num_hidden1), scale=weight_scale)
b1 = nd.zeros(num_hidden1)

W2 = nd.random_normal(shape=(num_hidden1, num_hidden2), scale=weight_scale)
b2 = nd.zeros(num_hidden2)

W3 = nd.random_normal(shape=(num_hidden2, num_outputs), scale=weight_scale)
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]

for param in params:
    param.attach_grad()

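Our model chains fully connected layers and activation functions (ReLU), adding a dropout layer after each activation. Each dropout layer can be given its own drop probability. In general, we recommend a smaller drop probability for layers closer to the input.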

In [7]:
# Define the network
drop_prob1 = 0.2
drop_prob2 = 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    # First fully connected layer
    h1 = nd.relu(nd.dot(X, W1) + b1)
    # Dropout layer after the first fully connected layer
    h1 = dropout(h1, drop_prob1)
    # Second fully connected layer
    h2 = nd.relu(nd.dot(h1, W2) + b2)
    # Dropout layer after the second fully connected layer
    h2 = dropout(h2, drop_prob2)
    return nd.dot(h2, W3) + b3
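
As written, `net` calls `dropout` every time it runs, so random masks are also applied when the network is used for evaluation. Dropout is usually switched off at prediction time; because the kept elements are already rescaled during training, the layer can then pass its input through unchanged. The following pure-Python sketch illustrates that idea. The `training` flag and the name `dropout_layer` are hypothetical and are not part of the code above.

```python
import random

def dropout_layer(values, drop_probability, training, rng=random):
    """Apply inverted dropout in training mode; pass values through otherwise."""
    if not training or drop_probability == 0:
        return list(values)
    keep_probability = 1 - drop_probability
    if keep_probability == 0:
        return [0.0 for _ in values]
    scale = 1 / keep_probability
    return [v * scale if rng.random() < keep_probability else 0.0
            for v in values]

hidden = [1.0, 2.0, 3.0, 4.0]  # made-up activations for illustration
train_out = dropout_layer(hidden, 0.5, training=True, rng=random.Random(0))
eval_out = dropout_layer(hidden, 0.5, training=False)

print('training mode:  ', train_out)
print('evaluation mode:', eval_out)
```

In MXNet, one way to get the same effect might be to apply the masks only when `autograd.is_training()` returns `True`; whether that changes the accuracy figures reported in this notebook has not been checked here.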

In [8]:
# Training
from mxnet import autograd
from mxnet import gluon

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

learning_rate = .5

for epoch in range(5):
    train_loss = 0.
    train_acc = 0.
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        utils.SGD(params, learning_rate / batch_size)
        
        train_loss += nd.mean(loss).asscalar()
        train_acc += utils.accuracy(output, label)
    test_acc = utils.evaluate_accuracy(test_data, net)
    print('epoch %d. loss: %f, train acc %f, test acc %f' %(epoch, train_loss / len(train_data), 
                                                            train_acc / len(train_data), test_acc))

epoch 0. loss: 1.145889, train acc 0.552317, test acc 0.722656
epoch 1. loss: 0.592222, train acc 0.779197, test acc 0.803586
epoch 2. loss: 0.501117, train acc 0.814203, test acc 0.804788
epoch 3. loss: 0.457477, train acc 0.834168, test acc 0.830529
epoch 4. loss: 0.422016, train acc 0.845186, test acc 0.838341
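
The training loop passes `learning_rate / batch_size` to `utils.SGD`. The `utils` module is not shown here, so the sketch below only assumes that it performs a plain update of the form `param -= lr * grad`. Calling `backward()` on a vector of per-example losses accumulates the sum of the per-example gradients, so under that assumption dividing the learning rate by the batch size amounts to stepping along the mean gradient. The gradients and parameter values below are made-up numbers for illustration.

```python
def sgd_step(params, grads, lr):
    """Plain SGD update, returning the new parameter values."""
    return [p - lr * g for p, g in zip(params, grads)]

# Hypothetical per-example gradients for a batch of 4 and two parameters.
per_example_grads = [[0.2, -1.0], [0.4, 0.0], [-0.2, 2.0], [0.6, 1.0]]
batch_size = len(per_example_grads)
learning_rate = 0.5
params = [1.0, -1.0]

# Summed gradient (what backward() on a loss vector accumulates)
# and the corresponding mean gradient.
summed = [sum(column) for column in zip(*per_example_grads)]
mean = [s / batch_size for s in summed]

from_sum = sgd_step(params, summed, learning_rate / batch_size)
from_mean = sgd_step(params, mean, learning_rate)

print('step with summed gradient and lr / batch_size:', from_sum)
print('step with mean gradient and lr:               ', from_mean)
```

The two updates agree up to floating-point rounding. The final batch of an epoch may hold fewer than `batch_size` examples, in which case the fixed divisor makes that step slightly smaller than a true mean-gradient step.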
