In [None]:
%matplotlib inline

# 2.3 Introduction to Neural Networks
The most widely cited definition is the one Kohonen gave in 1988: a neural network is a massively parallel, interconnected network of simple adaptive units, organized so that it can simulate how a biological nervous system reacts to real-world objects.

## Overview
In a biological neural network, each neuron is connected to other neurons. When a neuron is excited, it sends chemicals to the neurons it is connected to, changing their electrical potential. If the potential of a neuron exceeds a threshold, that neuron is activated: it becomes excited in turn and sends chemicals to other neurons.

Deep learning borrows this structure. Each neuron (the simple unit mentioned above) receives inputs x, each passed along a connection with weight w. The total input signal is compared with the neuron's threshold and then passed through an activation function, which determines whether the neuron activates and produces the output y. What we call training is the process of adjusting the weights w.

[Reference](http://www.dkriesel.com/en/science/neural_networks)

The structure of each neuron is as follows:
![](6.png)

[Source](https://becominghuman.ai/from-perceptron-to-deep-neural-nets-504b8ff616e)
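
The neuron described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a reference implementation: the function name `neuron`, the choice of sigmoid as the activation, and all numeric values are assumptions made for the example. The bias `b` plays the role of a negated threshold.

```python
import torch

def neuron(inputs, w, b):
    """Hypothetical single neuron: weighted sum plus bias, then a sigmoid activation."""
    z = torch.dot(w, inputs) + b   # total input signal; b acts as a negated threshold
    return torch.sigmoid(z)        # activation squashes z into (0, 1)

inputs = torch.tensor([1.0, 2.0, 3.0])   # illustrative input values
w = torch.tensor([0.5, -0.25, 0.1])      # illustrative connection weights
b = torch.tensor(-0.3)                   # illustrative bias
y = neuron(inputs, w, b)
print(y)
```

With these values the weighted sum is close to zero, so the output sits near the midpoint of the sigmoid.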





## Representation of Neural Network
We can connect neurons together. Two layers of neurons, an input layer plus an output layer (of M-P neurons), constitute a perceptron.
Connecting multiple layers of functional neurons forms a neural network, and all the layers between the input layer and the output layer are called hidden layers:
![](7.png)
As the figure above shows, there is only one input layer and one output layer, while there can be many hidden layers in between. (A network can also have more than one output layer, as in the classic GoogLeNet, which is described in detail later.)
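
As a rough sketch of such a layered structure, the snippet below stacks fully connected layers with `torch.nn.Sequential`. The layer sizes (4 inputs, two hidden layers of 8 units, 1 output) and the use of ReLU between layers are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 4 input features, two hidden layers of 8 units, 1 output
model = nn.Sequential(
    nn.Linear(4, 8),   # input -> first hidden layer
    nn.ReLU(),
    nn.Linear(8, 8),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(8, 1),   # second hidden layer -> output layer
)

batch = torch.randn(5, 4)   # 5 random examples with 4 features each
out = model(batch)
print(out.shape)            # torch.Size([5, 1])
```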

## Activation function
As noted in the introduction, an excited neuron releases chemical substances, and once the stimulus reaches a certain level the receiving neurons become excited and pass information on to other neurons. The activation function in a neural network plays this role: it decides whether, and how strongly, the computed signal is passed on to the next layer.

### Why activation functions are non-linear
In a neural network, each layer's computation is essentially a matrix multiplication. Without an activation function, the output is a linear combination of the inputs no matter how many layers there are: even thousands of layers collapse into a single matrix multiplication and carry no more expressive power than one layer. An activation function introduces non-linearity, so that the network can approximate arbitrary non-linear functions and can be applied to a wide range of non-linear models.
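
The collapse of stacked linear layers can be checked numerically. The sketch below uses random matrices of arbitrary, illustrative sizes and shows that two linear layers without an activation are equivalent to one linear layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import torch

torch.manual_seed(0)

# Two linear layers with illustrative shapes and no activation in between
W1, b1 = torch.randn(5, 3), torch.randn(5)
W2, b2 = torch.randn(2, 5), torch.randn(2)
x_in = torch.randn(3)

two_layers = W2 @ (W1 @ x_in + b1) + b2

# The same mapping written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x_in + b

print(torch.allclose(two_layers, one_layer, atol=1e-4))
```

Inserting a non-linear function between the two layers is what prevents this algebraic simplification.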

Early neural network research mainly used the sigmoid or tanh function; their output is bounded, which makes it convenient to use as the input to the next layer.
In recent years, ReLU and its improved variants (such as Leaky ReLU, P-ReLU, and R-ReLU) have been widely used in multilayer neural networks because they are cheap to compute and work well.

Here is a summary of the more common activation functions:

In [None]:
# Initialize some information
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
x = torch.linspace(-10,10,60)

### sigmoid function
$a=\frac{1}{1+e^{-z}}$ Derivative: $a^\prime =a(1-a)$

The output of the sigmoid function lies in the open interval (0,1), so it maps any continuous real-valued input to an output between 0 and 1. For a very large negative input the output approaches 0, and for a very large positive input it approaches 1, which gives the function a squashing effect.

In [None]:
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
plt.ylim((0, 1))
sigmod=torch.sigmoid(x)
plt.plot(x.numpy(),sigmod.numpy())

However, sigmoid has several drawbacks. It requires computing an exponential, which is slower than ReLU. Its output is not centered at 0, which makes weight updates less efficient. And once the input moves even a little away from the origin, the gradient of the function becomes very small (almost zero), which hinders weight optimization during backpropagation; this problem is called gradient saturation, or the vanishing gradient. Because of these shortcomings, sigmoid is rarely used now, except in the output layer for binary classification (0, 1).
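
The saturation effect can be seen by asking autograd for the sigmoid gradient at a few points. The sample inputs below are arbitrary and chosen only for illustration; the gradient matches $a(1-a)$ and shrinks quickly as the input moves away from 0.

```python
import torch

# Illustrative inputs at increasing distance from the origin
z = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
a = torch.sigmoid(z)
a.sum().backward()   # the gradient of the sum w.r.t. each z_i is a_i * (1 - a_i)

print(z.grad)                                        # largest at z = 0
print(torch.allclose(z.grad, (a * (1 - a)).detach()))
```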

### tanh function
$a=\frac{e^z-e^{-z}}{e^z+e^{-z}}$ Derivative: $a^\prime =1-a^2$

tanh is the hyperbolic tangent function. Its output lies in the interval (-1,1), and the function is centered at 0.

In [None]:
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
plt.ylim((-1, 1))
tanh=torch.tanh(x)
plt.plot(x.numpy(),tanh.numpy())

As with the sigmoid function, the gradient is still small once the input moves away from the origin. The advantage of tanh is that it is centered at 0, so using it as the activation function also has a normalizing effect (the outputs have a mean close to 0).
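
A small check of the zero-centering claim, assuming a symmetric input grid like the one used for the plots: the mean tanh output is close to 0, while the mean sigmoid output is close to 0.5. The snippet also compares the autograd gradient of tanh with the formula $1-a^2$. The variable names are chosen for this example only.

```python
import torch

grid = torch.linspace(-10, 10, 60)   # symmetric grid, as in the plots

tanh_mean = torch.tanh(grid).mean().item()
sigmoid_mean = torch.sigmoid(grid).mean().item()
print(tanh_mean, sigmoid_mean)       # roughly 0 versus roughly 0.5

# Compare the autograd gradient of tanh with the closed form 1 - a^2
z = grid.clone().requires_grad_(True)
a = torch.tanh(z)
a.sum().backward()
print(torch.allclose(z.grad, 1 - a.detach() ** 2, atol=1e-6))
```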

For typical binary classification problems, the hidden layers used the tanh function and the output layer used the sigmoid function. Since the arrival of ReLU, however, hidden layers almost always use ReLU as the activation function.

### ReLU function
ReLU (Rectified Linear Unit)

$a=\max(0,z)$ The derivative is 1 when $z$ is greater than 0, and 0 when $z$ is less than 0.

In other words:
When z>0, the gradient is always 1, which speeds up gradient-based training of the neural network.
When z<0, however, the gradient is always 0.
ReLU is piecewise linear: in both forward and backward propagation you only need to check whether the input is greater than 0, so it is much faster to compute than sigmoid and tanh.

In [None]:
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
plt.ylim((-3, 10))
relu=F.relu(x)
plt.plot(x.numpy(),relu.numpy())

When the input is negative, ReLU is not activated at all and outputs 0. During backpropagation, the gradient for a negative input is exactly 0, so a unit whose input stays negative stops updating; this is often described as the ReLU "dying", and it is the same kind of vanishing-gradient problem that sigmoid and tanh have. In practice, however, this defect usually has little impact.
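
The gradient behaviour described above can be confirmed with autograd. The four sample inputs are arbitrary illustrative values, two negative and two positive.

```python
import torch
import torch.nn.functional as F

# Two negative and two positive illustrative inputs
z = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)
F.relu(z).sum().backward()

print(z.grad)   # tensor([0., 0., 1., 1.])
```

No gradient flows back through the negative inputs, which is why a unit that only ever sees negative inputs stops learning.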

### Leaky ReLU function
To address the zero gradient of ReLU when z<0, the Leaky ReLU function was introduced; it guarantees that the gradient is still non-zero when z<0.
The negative half of ReLU is set to αz instead of 0, usually with α=0.01: $a=\max(\alpha z,z)$

In [None]:
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
plt.ylim((-3, 10))
l_relu=F.leaky_relu(x,0.1) # 0.1 here is for convenience of display, theoretically it should be 0.01 or even smaller value
plt.plot(x.numpy(),l_relu.numpy())

In theory, Leaky ReLU has all the advantages of ReLU, but in practice it has not been conclusively shown that Leaky ReLU is always better than ReLU.
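
For comparison with ReLU, the sketch below checks the Leaky ReLU gradient with autograd, using the commonly quoted slope α=0.01 and arbitrary sample inputs. For negative inputs the gradient equals α rather than 0.

```python
import torch
import torch.nn.functional as F

alpha = 0.01   # slope for z < 0, the usual default mentioned in the text
z = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)
F.leaky_relu(z, negative_slope=alpha).sum().backward()

print(z.grad)   # alpha for the negative inputs, 1 for the positive inputs
```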

ReLU is still the most commonly used activation function, and it is the recommended first choice for hidden layers.

## A closer look at forward propagation and backpropagation
Finally, we look at forward propagation and backpropagation in a neural network in detail. Here we again use the whiteboard notes written by Andrew Ng (Wu Enda).
![](8.png)
### Forward propagation
For a neural network, the input feature $a^{[0]}$ is our input $x$. It is fed into the first layer, whose trainable parameters are $W^{[1]}$ and $b^{[1]}$. These two values and the computed result $z^{[1]}$ need to be cached, and $z^{[1]}$ is passed through the activation function to produce $a^{[1]}$, the output of the first layer. This value is passed on as the input of the second layer, which uses $W^{[2]}$ and $b^{[2]}$ to compute $z^{[2]}$ and then the second layer's activation $a^{[2]}$.
The following layers proceed in the same way until we finally compute $a^{[L]}$, the output of layer $L$, which is $\hat{y}$, the predicted value of our network. Forward propagation is simply the process by which the input $x$ is turned into $\hat{y}$ through a series of network computations.

The values cached during this process are used in backpropagation.
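
The forward pass with caching can be sketched as follows. This is a simplified illustration under several assumptions: the helper name `forward`, the layer sizes, ReLU in the hidden layer, sigmoid in the output layer, and the one-column-per-example layout are all choices made for this example.

```python
import torch

def forward(x, params):
    """Hypothetical forward pass: returns y_hat and the cached (a_prev, z) of each layer."""
    a, cache = x, []
    for l, (W, b) in enumerate(params, start=1):
        z = W @ a + b                # linear part of layer l
        cache.append((a, z))         # keep the values backpropagation will need
        # Assumed activations: sigmoid on the last layer, ReLU on hidden layers
        a = torch.sigmoid(z) if l == len(params) else torch.relu(z)
    return a, cache

torch.manual_seed(0)
params = [
    (torch.randn(4, 3), torch.zeros(4, 1)),   # W^[1], b^[1]
    (torch.randn(1, 4), torch.zeros(1, 1)),   # W^[2], b^[2]
]
x_in = torch.randn(3, 5)                      # 3 features, 5 examples
y_hat, cache = forward(x_in, params)
print(y_hat.shape, len(cache))
```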


### Backpropagation
The backpropagation step is a series of reverse iterations over the forward pass: gradients are computed backward, layer by layer, in order to optimize the $W$ and $b$ we need to train.
Starting from ${\delta}a^{[l]}$, we differentiate to obtain ${\delta}a^{[l-1]}$, and so on, until we get ${\delta}a^{[2]}$ and ${\delta}a^{[1]}$. Each backpropagation step also outputs ${\delta}W^{[l]}$ and ${\delta}b^{[l]}$. These give us the direction in which to change the weights; next, we update the trained $W$ and $b$ using the learning rate $\alpha$:

$W=W-\alpha{\delta}W $

$b=b-\alpha{\delta}b $

This completes backpropagation.
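
To make the update rule concrete, here is a minimal sketch for a single sigmoid layer trained with binary cross-entropy. The data, the sizes, and the learning rate are illustrative assumptions. For this particular combination the hand-derived gradient is $\delta z = a - y$, and the snippet compares the manual ${\delta}W$ and ${\delta}b$ (named `dW` and `db` in the code) with the values autograd computes.

```python
import torch

torch.manual_seed(0)
m = 8                                    # number of examples (illustrative)
X = torch.randn(3, m)                    # inputs, one column per example
Y = (torch.rand(1, m) > 0.5).float()     # illustrative binary labels
W = torch.randn(1, 3, requires_grad=True)
b = torch.zeros(1, 1, requires_grad=True)

# Forward propagation
Z = W @ X + b
A = torch.sigmoid(Z)
loss = -(Y * torch.log(A) + (1 - Y) * torch.log(1 - A)).mean()

# Backpropagation by hand: sigmoid + cross-entropy gives dZ = A - Y
dZ = (A - Y).detach()
dW = dZ @ X.T / m
db = dZ.sum(dim=1, keepdim=True) / m

# The same gradients from autograd, for comparison
loss.backward()
print(torch.allclose(dW, W.grad, atol=1e-5), torch.allclose(db, b.grad, atol=1e-5))

# Gradient-descent update: W = W - alpha * dW, b = b - alpha * db
alpha = 0.1
with torch.no_grad():
    W -= alpha * dW
    b -= alpha * db
```

In everyday PyTorch code the manual derivation is unnecessary, since `loss.backward()` and an optimizer perform these steps; the manual version is shown only to mirror the formulas above.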