# 神经网络入门

算法描述

* 训练每一条数据
    * 对每一层,直到最后一层
        * 计算 $ h = \sum_i w_i x_i $
        * 计算输出 $ \hat y = f(h) $
    * 计算 Error term, $\delta = ( y - \hat y)f'(h) $
    * $ \delta = (y - \hat y)f'(h) $
    * 从后向前，对每一层
        * $ \delta_j^h = \sum{W_{jk}\delta_k^o f'(h_j)} $
        * $ \Delta w_{ij} += \delta_j^h x_i $

* 更新 $ w = w + \eta \Delta w / m $。 $\eta$ 是学习率 $m$ 是数据条数。对权重步长做了平均化，防止数据起伏
* 重复 $e$ 代 (epoch)



![](https://s3.cn-north-1.amazonaws.com.cn/u-img/07e338ce-41fa-4b2a-b1b9-5997261c3f58)

这些独立的节点被称作感知器 或者神经元。它们是构成神经网络的基本单元。

Weights（权值）

当输入给到节点，激活函数可以决定节点的输出。因为它决定了实际输出，我们也把层的输出，称作“激活”。

是单位阶跃函数（Heaviside step function）。如果线性组合小于0，函数返回0，如果线性组合等于或者大于0，函数返回1

偏置 bias 

## 逻辑操作激活函数

![](https://s3.cn-north-1.amazonaws.com.cn/u-img/0aa1e0d3-8440-41b7-b327-925472eaf72e)

通过调整w,b 可以使用 w*X + b > 0 实现 AND OR NOT 操作
* AND w1 = 1, w2 = 1, b = -2
* OR  w1 = 1, w2 = 1, b = -2.1
* NOT  w = -1，bias = 0.9

![](https://s3.cn-north-1.amazonaws.com.cn/u-img/4fb8d10d-5f1b-4557-85ad-6421d5eafafe)
* XOR 用上图构建 A NOT; B AND; C OR

## 其他激活函数

其它常见激活函数还有对数几率（又称作 sigmoid），tanh 和 softmax。这节课中我们主要使用 sigmoid 函数:

$$ \sigma(x) = \frac{1}{1+e^{-x}} $$


### 最简单的神经网络

![](https://s3.cn-north-1.amazonaws.com.cn/u-img/e429472f-a8bf-411a-87e5-6abf1223a725)

In [3]:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

output = sigmoid(np.dot(inputs, weights) + bias)

output

0.43290709503454572

## 梯度下降法

[梯度下降法的优化](http://sebastianruder.com/optimizing-gradient-descent/index.html#momentum)

E 错误，使E最小化，m为样本数，$\hat y$ 为预测结果

$$\mathbf{ E = \frac{1}{2m}\sum_\mu(y^\mu-\hat y^\mu)^2 }$$


基本思想：

$$ W_i = W_i + \Delta W_i $$
$$ \Delta W_i = - \eta \frac{\partial E}{\partial W_i} $$

得出(过程详见下文):
$$ \Delta w_i = \eta (y - \hat y)f'(h)x_i $$


### 定义“误差项”（ERROR TERM）$\delta$

\begin{align}
& \LARGE{\delta = (y - \hat y)f'(h) ＝ (y - \hat y)f'(\sum_i w_i x_i)} \\
& \LARGE{w_i = w_i + \eta \delta x_i}
\end{align}

多条数据时

* 训练每一条数据
    * 计算输出 $ \hat y = f(\sum_i w_i x_i) $
    * 计算 Error term, $\delta = ( y - \hat y)f'(\sum_i w_i x_i) $
    * 计算 $\Delta w_i = \Delta w_i + \delta x_i $

* 更新 $ w_i = w_i + \eta \Delta w_i / m $。 $\eta$ 是学习率 $m$ 是数据条数。对权重步长做了平均化，防止数据起伏
* 重复 $e$ 代 (epoch)



### 推导过程：

\begin{align}
\frac{\partial E}{\partial W_i} & = \frac{\partial}{\partial w_i} \frac{1}{2} (y-\hat y)^2 \\
& = (y - \hat y)\frac{\partial}{\partial w_i}(y - \hat y) \\
& = (y - \hat y)\frac{\partial(y - \hat y)}{\partial \hat y}\frac{\partial \hat y}{\partial w_i} \\
& = -(y-\hat y)\frac{\partial \hat y}{\partial w_i}
\end{align}

已知 $ \hat y = f(h)$ 其中 $f$ 是激活函数，$h$ 是线性组合函数。$h= \sum_{i=0}w_i x_i $

\begin{align}
\frac{\partial E}{\partial w_i} & = - (y-\hat y)\frac{\partial \hat y}{\partial w_i} \\
& = - (y - \hat y)\frac{\partial \hat y}{\partial h} \frac{\partial h}{\partial w_i} \\
& = - (y - \hat y)f'(h)\frac{\partial}{\partial w_i}\sum w_i x_i
\end{align}

而：
\begin{align}
& \frac{\partial}{\partial w_i} \sum_i w_i x_i \\
= & \frac{\partial}{\partial w_1}[w_1 x_1 + w_2 x_2 + ... + w_n x_n] \\
= & x_1 + 0 + 0 ...
\end{align}

得出：
$$ \frac{\partial}{\partial w_i}\sum_i w_i x_i = x_i $$

所以：

\begin{align}
\frac{\partial E}{\partial w_i} & = -(y-\hat y)f'(h)x_i \\
\Delta w_i & = \eta(y - \hat y)f'(h)x_i
\end{align}



## sigmoid 函数的导数

$$ \sigma'(x) = \sigma(x)\cdot(1-\sigma(x)) $$

推导过程：

\begin{align}
\sigma'(x) & = \frac{d}{dx}\frac{1}{(1+e^{-x})} \\
& = \frac{d}{dx}(1+e^{-x})^{-1} \\
& = -(1+e^{-x})^{-2} (-e^{-x}) \\
& = \frac{e^{-x}}{1+e^{-x}}^2 \\
& = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} \\
& = \frac{1}{1+e^{-x}}\cdot\frac{(1+e^{-x})-1}{1+e^{-x}} \\
& = \frac{1}{1+e^{-x}}\cdot(1-\frac{1}{1+e^{-x}}) \\
& = \sigma(x)\cdot(1-\sigma(x))
\end{align}


In [4]:
## y-hat = sigmoid(w.x) 梯度下降计算过程

def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
# 激活函数的导数
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

x = np.array([0.1, 0.3])
y = 0.2
weights = np.array([-0.8, 0.5])
learnrate = 0.5

# the linear combination performed (h in f(h) and f'(h))
h = np.dot(x, weights)

# y-hat
nn_output = sigmoid(h)

# output error (y - y-hat)
error = y - nn_output

# output gradient (f'(h))
output_grad = sigmoid_prime(h)

# error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step 
del_w = learnrate * error_term * x

---
## 实现梯度下降

我们拿一个研究生学院录取数据来用梯度下降训练一个网络。数据可以在[这里](http://www.ats.ucla.edu/stat/data/binary.csv)找到。数据有三个输入特征，*GRE分数，GPA，和本科院校排名 rank（从1到4）*。数字1代表最好，数字4代表最差。我们的目标是基于这些特征来预测一个学生能否被研究生院录取。

### 数据清理

#### Dummy Variable

rank 是类别特征，数字并不包含任何相对的值。使用 dummy variables 编码，变成4列，对应列为1，其他列为0.

#### 归一化 

GRE, GPA 这些数据过大，会导致训练速度下降，因为sigmoid函数的导数，在两侧梯度较低，使步长接近于0。过大或过小的数据都容易使$h$落入两侧。因此需要对数据进行归一化，此外，归一化也会使学习率的选择更加统一，基本上选择 0.001 到 0.1 之间的学习率。

标准差公式 $ \sigma = \sqrt{\frac{1}{N}\sum_i(x_i - \mu)^2} $ 其中平均值 $\mu = \frac{\sum_i(x_i)}{N} $

对于有概率事件，标准差公式 $ \sigma = \sqrt{\sum_ip_i(x_i - \mu)^2} $ 其中平均值 $\mu = \sum_ip_ix_i $



![](https://s3.cn-north-1.amazonaws.com.cn/u-img/0c580bc2-b0a9-4952-bfd0-f2ce6093efe8)

In [5]:
# 初始化
import numpy as np
import pandas as pd
# 固定随机种子，便于调试
np.random.seed(42)

In [6]:
# 数据清理代码

admissions = pd.read_csv('nn_data_1.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standarize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std
    
    
# Split off random 10% of the data for testing
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.ix[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

In [7]:
# 梯度下降代码

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.01

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        #       lesson. You can add it if you want, or you can calculate
        #       the h together with the output

        # TODO: Calculate the output
        output = np.dot(x, weights)

        # TODO: Calculate the error
        error = y - sigmoid(output)

        # TODO: Calculate the error term
        error_term = error * sigmoid_prime(output)

        # TODO: Calculate the change in weights for this sample
        #       and add it to the total weight change
        del_w += error_term * x

    # TODO: Update weights using the learning rate and the average change in weights
    weights += learnrate * del_w

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
tes_out = sigmoid(np.dot(features_test, weights))
predictions = tes_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

('Train loss: ', 0.24642503863155604)
('Train loss: ', 0.19713264867549785)
('Train loss: ', 0.19697021249785832)
('Train loss: ', 0.1969622865055684)
('Train loss: ', 0.1969617694823392)
('Train loss: ', 0.19696173349861354)
('Train loss: ', 0.1969617309521564)
('Train loss: ', 0.1969617307711557)
('Train loss: ', 0.19696173075827528)
('Train loss: ', 0.1969617307573584)
Prediction accuracy: 0.725


---

## 多层网络

![](https://s3.cn-north-1.amazonaws.com.cn/u-img/6f15956b-0cf5-4c07-a7b0-db1c8db26653)

$$ h_j = \sum\limits_iw_{ij}x_i $$

$$
 h_j = 
 \begin{bmatrix}
  w_{11} & w_{12} & w_{13} \\
  w_{21} & w_{22} & w_{23}
 \end{bmatrix}
 \times
 \begin{bmatrix}
  x_1 \\
  x_2 \\
  x_3
 \end{bmatrix}
$$

#### 构造一维列向量

In [8]:
# 有时需要构造一维列向量
a = np.array([1, 2, 3])
a.T # 一维行向量转置还是行向量

array([1, 2, 3])

In [9]:
a[:, None]

array([[1],
       [2],
       [3]])

In [10]:
np.array(a, ndmin=2).T

array([[1],
       [2],
       [3]])

#### 计算多重神经网络的输出

In [11]:
# Network size
N_input = 4
N_hidden = 3
N_output = 2

# Make some fake data
X = np.random.randn(4)
X

array([ 1.66376979, -0.73923643, -0.10719448, -0.48621729])

In [12]:
# 初始化权重
weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

weights_input_to_hidden, weights_hidden_to_output

(array([[ 0.15929818, -0.04339416, -0.07326922],
        [-0.05782769, -0.01357174, -0.0793505 ],
        [-0.036845  ,  0.01572019, -0.01154734],
        [ 0.08317086, -0.03193551, -0.02046165]]),
 array([[ 0.02710617,  0.22448066],
        [-0.07876019,  0.19236244],
        [-0.11835732,  0.02289704]]))

In [13]:
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

---

## 反向传播

![](assets/img/backpropagation1.png)

![](assets/img/backpropagation2.png)


### 理解反向传播算法

让我们忘记数学公式和推导过程，尝试以“物理”的方式来理解这一过程。

我们有一些 黄金、白银的购买数据，数据是黄金购买量，白银购买量，总价。我们需要用人脑一步步的猜测黄金、白银价格，因为每笔采购价格不同，所以无法解方程求解。

假设我们是一个对黄金白银价格一无所知的人。最初的时候，我们将黄金、白银设为相同的价格，进行预测，当然会产生误差。

我们发现，一笔采购中，90%是黄金，10%是白银，而误差是10000元，我们认为误差主要来自于黄金，所以我们会主要上调黄金的价格，略微上调白银价格。而另一笔采购中10%是黄金，90%是白银，误差-10000元，我们会主要下调白银价格，略微下调黄金价格。**指定维度权重调整量与指定维度输入值成正比** 

如果误差很小，只有100元我们对所有维度权重的调整也较小，如果误差很大，有10000元，我们对所有维度的权重的调整也会比较大。**所有权重调整量与误差成正比**

另外我们有一个性格因素，就是快速的调整，还是慢慢的调整，这叫做**学习率**也是成正比的。

也就是说：**维度调整量与维度输入值、误差、学习率成正比**，因此：

$$ 指定维度调整量 = 学习率 \cdot 误差项 \cdot 指定维度输入值 $$

因此**误差项是有着“物理意义”的**。

$$ \LARGE{\delta = (y - \hat y)f'(h) } $$

如果，只是单纯的线性组合，则激活函数 $ f(x) = x $ 则 $ f'(h)=1 $，则 $\delta = y - \hat y $

$f'(h)$ 可以看作是误差对激活函数的修正。

对于多层网络，**节点的误差项来自上层节点误差的加权总和**，**上层节点误差也等于本层节点误差的加权总和**

$$ \delta_j^h = \sum{W_{jk}\delta_k^o f'(h_j)} $$

$$ \Delta w_{ij} = \eta \delta_j^h x_i $$


这就是反向传播算法


In [14]:
x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = target - output

# TODO: Calculate error term for output layer
output_error_term = error * output * (1 - output)

# TODO: Calculate error term for hidden layer
hidden_error_term = np.dot(output_error_term, weights_hidden_output) * \
                    hidden_layer_output * (1 - hidden_layer_output)

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)


Change in weights for hidden layer to output layer:
[ 0.00804047  0.00555918]
Change in weights for input layer to hidden layer:
[[  1.77005547e-04  -5.11178506e-04]
 [  3.54011093e-05  -1.02235701e-04]
 [ -7.08022187e-05   2.04471402e-04]]


In [17]:
# 实现完整神经网络

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)

        output = sigmoid(np.dot(hidden_output,
                                weights_hidden_output))

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = y - output

        # TODO: Calculate error term for the output unit
        output_error_term = error * output * (1 - output)

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)

        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)

        # TODO: Update the change in weights
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # TODO: Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(x, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

('Train loss: ', 0.22916864950741755)
('Train loss: ', 0.22908132482042776)
('Train loss: ', 0.2289954496785163)
('Train loss: ', 0.2289109897070428)
('Train loss: ', 0.22882791148045062)
('Train loss: ', 0.2287461824925033)
('Train loss: ', 0.22866577112757014)
('Train loss: ', 0.2285866466329007)
('Train loss: ', 0.22850877909186046)
('Train loss: ', 0.2284321393980804)
Prediction accuracy: 0.750
