# What is a neural network

A neural network is a collection of neurons organized in layers. Each neuron is a small
computing unit that performs a simple calculation, and together the neurons solve a problem.
There are three types of layers: the input layer, hidden layers, and the output layer.
Each layer contains a number of neurons; the input layer is the exception, since it only holds the input values and performs no computation. Neural networks are loosely modeled on the way the human brain processes information.

<img alt="Diagram showing different types of layers in a neural network" src="images/4-model-1.png" />

## Components of a neural network

- **Activation function** determines whether a neuron should be activated. It is applied to the result of each neuron's computation; if the neuron activates, the input is considered important. There are different kinds of activation functions, and the choice depends on what you want the output to be. Another important role of an activation function is to add non-linearity to the model.
    - _Binary_ sets the output node to 1 if the function result is zero or positive, and to 0 if it is negative. $f(x)= {\small \begin{cases} 0, & \text{if } x < 0\\ 1, & \text{if } x\geq 0\\ \end{cases}}$
    - _Sigmoid_ maps the result to a value between 0 and 1, which can be read as a probability.  $f(x) = {\large \frac{1}{1+e^{-x}}} $
    - _Tanh_ maps the result to a value between -1 and 1.  It is used in classification use cases.  $f(x) = {\large \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}} $
    - _ReLU_ sets the output node to 0 if the function result is negative and keeps the value otherwise.  $f(x)= {\small \begin{cases} 0, & \text{if } x < 0\\ x, & \text{if } x\geq 0\\ \end{cases}}$
- **Weights** determine how strongly each input influences the output, and therefore how close the network's output comes to the expected value. As an input enters the neuron, it is multiplied by a weight value, and the resulting output is either observed or passed to the next layer in the neural network. The weights for all neurons in a layer are organized into one tensor.
- **Bias** is a learnable value added to the weighted sum of the inputs before the activation function is applied. It shifts the point at which the neuron activates, so the neuron can produce a non-zero output even when all of its inputs are zero.
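The four activation functions listed above can be tried on a few sample values. The snippet below is a minimal sketch; the input values are chosen only for illustration, and the binary step is written by hand because it is not taken from a dedicated PyTorch module here.

```python
import torch

# Sample pre-activation values, chosen only for illustration
x = torch.tensor([-2.0, 0.0, 3.0])

binary = (x >= 0).float()    # 1 where x >= 0, otherwise 0
sigmoid = torch.sigmoid(x)   # 1 / (1 + exp(-x)), values strictly between 0 and 1
tanh = torch.tanh(x)         # values strictly between -1 and 1
relu = torch.relu(x)         # 0 for negative inputs, x otherwise

print(binary)   # tensor([0., 1., 1.])
print(relu)     # tensor([0., 0., 3.])
print(sigmoid)
print(tanh)
```

Note that sigmoid and tanh never reach their limits exactly, while the binary step and ReLU produce hard zeros.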

<img alt="Diagram showing neural computation" src="images/4-model-2.png" />

The output $y$ of a neural network layer with weights $W$ and bias $b$ is computed by multiplying the inputs by the weights, summing the results, and adding the bias: $x = \sum{(weights * inputs)} + bias$. The activation function $f$ is then applied to that sum, so $y = f(x)$.
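As a rough illustration of that computation for a single neuron, the sketch below uses made-up inputs, weights, and bias (all hypothetical values) and ReLU as the activation function $f$.

```python
import torch

inputs = torch.tensor([1.0, 2.0, 3.0])      # hypothetical input values
weights = torch.tensor([0.5, 0.25, -0.5])   # hypothetical weights, one per input
bias = torch.tensor(1.0)                    # hypothetical bias

x = torch.sum(weights * inputs) + bias      # weighted sum plus bias
y = torch.relu(x)                           # activation applied to the sum

print(x.item())  # 0.5
print(y.item())  # 0.5
```

Here the weighted sum is 0.5 + 0.5 - 1.5 = -0.5; adding the bias gives 0.5, which ReLU passes through unchanged because it is positive.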

# Build a neural network

Neural networks are composed of layers/modules that perform operations on data. 
The `torch.nn` namespace provides all the building blocks you need to 
build your own neural network. Every module in PyTorch subclasses `nn.Module`. 
A neural network is itself a module that consists of other modules (layers). This nested structure makes it easy to
build and manage complex architectures.

In the following sections, we'll build a neural network to classify images in the FashionMNIST dataset.

In [1]:
%matplotlib inline
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

## Get a hardware device for training

We want to be able to train our model on a hardware accelerator like the GPU, if one is available. Let's check whether 
`torch.cuda` is available; if it is not, we continue to use the CPU.

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

Using cuda device
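Once a device has been selected, tensors (and models) have to be placed on it before they are used together. The following is a small sketch of the two common ways to do that for tensors; the tensor names `a`, `b`, and `c` are illustrative only.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Create a tensor directly on the selected device...
a = torch.ones(2, 3, device=device)
# ...or move an existing CPU tensor there with .to()
b = torch.zeros(2, 3).to(device)

# Both operands live on the same device, so they can be combined
c = a + b
print(c.device.type)
```

On a machine without a GPU this prints `cpu`, and the rest of the code runs unchanged.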


## Define the class

We define our neural network by subclassing `nn.Module`, and 
initialize the neural network layers in `__init__`. Every `nn.Module` subclass implements
the operations on input data in the `forward` method.

Our neural network is composed of the following:
- The input layer with 28x28, or 784, features/pixels.
- The first linear module takes the 784 input features and transforms them into a hidden layer with 512 features.
- The ReLU activation function is applied to the result.
- The second linear module takes the 512 features from the first hidden layer as input and transforms them into the next hidden layer with 512 features.
- The ReLU activation function is applied to the result.
- The third linear module takes the 512 features from the second hidden layer as input and transforms them into the output layer with 10 features, which is the number of classes.
- The ReLU activation function is applied to the result, so the values the model returns are never negative.

In [3]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

We create an instance of `NeuralNetwork`, move it to the `device`, and print 
its structure.


In [5]:
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
)


To use the model, we pass it the input data. This executes the model's `forward` method, along with some background operations such as any registered hooks. For that reason, do not call `model.forward()` directly! Calling the model on a single input image returns a tensor of shape `[1, 10]` holding 10 raw predicted values, one for each class.

We get the prediction probabilities by passing the result through an instance of `nn.Softmax`.

In [6]:
X = torch.rand(1, 28, 28, device=device)
logits = model(X) 
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")

Predicted class: tensor([0], device='cuda:0')
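One way to see why the model should be called as `model(X)` rather than through `model.forward(X)` is to attach a forward hook. The sketch below uses a small stand-in `nn.Linear` layer instead of the tutorial's network, and the hook function `record_call` is a hypothetical helper written for this illustration.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)   # small stand-in module, not the tutorial's network
calls = []                # records the output shape each time the hook fires

def record_call(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output
    calls.append(tuple(output.shape))

handle = layer.register_forward_hook(record_call)

x = torch.rand(1, 4)
layer(x)            # goes through __call__, so the hook runs
layer.forward(x)    # bypasses __call__, so the hook is skipped
handle.remove()

print(calls)  # [(1, 2)]
```

Only one entry is recorded even though the layer's computation ran twice, which is the kind of background work that a direct `forward` call silently skips.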


## Weight and Bias


The `nn.Linear` module randomly initializes the weights and bias for each layer and stores the values internally in tensors.

In [7]:
print(f"First Linear weights: {model.linear_relu_stack[0].weight} \n")

print(f"First Linear biases: {model.linear_relu_stack[0].bias} \n")

First Linear weights: Parameter containing:
tensor([[ 0.0140,  0.0338,  0.0249,  ..., -0.0208, -0.0341,  0.0157],
        [-0.0241, -0.0277, -0.0175,  ..., -0.0084,  0.0227, -0.0085],
        [-0.0157,  0.0120, -0.0311,  ..., -0.0223, -0.0129,  0.0160],
        ...,
        [ 0.0351,  0.0063, -0.0206,  ..., -0.0268, -0.0244,  0.0356],
        [ 0.0018, -0.0008, -0.0207,  ..., -0.0245, -0.0340,  0.0129],
        [ 0.0204,  0.0179, -0.0313,  ..., -0.0181,  0.0178, -0.0175]],
       device='cuda:0', requires_grad=True) 

First Linear biases: Parameter containing:
tensor([ 0.0148, -0.0243, -0.0101, -0.0226, -0.0127, -0.0003,  0.0335,  0.0145,
         0.0327,  0.0309, -0.0350, -0.0297,  0.0004, -0.0180, -0.0281,  0.0251,
         0.0105,  0.0103, -0.0053, -0.0149,  0.0189,  0.0101,  0.0325,  0.0296,
         0.0186,  0.0081, -0.0234, -0.0252, -0.0098, -0.0150,  0.0254, -0.0209,
         0.0346, -0.0043, -0.0031, -0.0221, -0.0257, -0.0237,  0.0318, -0.0109,
         0.0314, -0.0163,  0.012
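The small magnitude of the printed values is not a coincidence. According to the `nn.Linear` documentation, the default initialization draws weights and biases uniformly from a range of plus or minus `1 / sqrt(in_features)`; this default may differ between PyTorch versions, so treat the check below as a sketch under that assumption. It builds a fresh layer of the same size as the first linear layer.

```python
import math
from torch import nn

layer = nn.Linear(in_features=28*28, out_features=512)
bound = 1 / math.sqrt(layer.in_features)   # 1/28, about 0.0357 for 784 inputs
tol = 1e-6                                 # allowance for float32 rounding

print(layer.weight.shape)  # torch.Size([512, 784])
print(layer.bias.shape)    # torch.Size([512])
print(layer.weight.abs().max().item() <= bound + tol)
print(layer.bias.abs().max().item() <= bound + tol)
```

The weight tensor has one row per output feature and one column per input feature, which is why its shape is `[512, 784]` rather than `[784, 512]`.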

## Model layers

Let's break down the layers in the FashionMNIST model. To illustrate, we 
take a sample minibatch of 3 images of size **28x28** and see what happens to it as 
we pass it through the network. 

In [8]:
input_image = torch.rand(3,28,28)
print(input_image.size())

torch.Size([3, 28, 28])


### nn.Flatten

We initialize the `nn.Flatten` layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values (the minibatch dimension, at dim=0, is maintained). Each pixel is passed to the input layer of the neural network.  

<img alt="Diagram showing how pixels in image are flatten" src="images/4-model-3.png" />

In [9]:
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

torch.Size([3, 784])
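With its default arguments, `nn.Flatten` behaves like a reshape that keeps the first dimension and merges the rest. The sketch below compares the two on a random batch; the variable names are illustrative.

```python
import torch
from torch import nn

batch = torch.rand(3, 28, 28)

flat = nn.Flatten()(batch)
reshaped = batch.reshape(batch.size(0), -1)   # keep dim 0, merge the remaining dims

print(flat.shape)                    # torch.Size([3, 784])
print(torch.equal(flat, reshaped))   # True
```

Using the module form is convenient inside `nn.Sequential`, where every step has to be an `nn.Module`.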


### nn.Linear 

The linear layer is a module that applies a linear transformation to the input using its stored weights and biases.  The grayscale value of each pixel in the input layer is connected to the neurons in the hidden layer for calculation.  The calculation used for the transformation is ${{weight * input + bias}} $.


In [10]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 20])
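To make the transformation concrete, the sketch below recomputes a linear layer's output by hand from its stored `weight` and `bias` tensors and compares the result with what the layer returns. The input is random, and the seed is fixed only to make the run repeatable.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=28*28, out_features=20)
flat = torch.rand(3, 28*28)   # stand-in for a flattened minibatch of 3 images

out = layer(flat)
# weight has shape [20, 784], so it is transposed before the matrix product
manual = flat @ layer.weight.T + layer.bias

print(out.shape)  # torch.Size([3, 20])
print(torch.allclose(out, manual, atol=1e-5))
```

The comparison uses `allclose` rather than exact equality because the two computations may round floating-point values slightly differently.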


### nn.ReLU

Non-linear activations are what create the complex mappings between the model's inputs and outputs.
They are applied after linear transformations to introduce *nonlinearity*, helping neural networks
learn a wide variety of phenomena. In this model, we use `nn.ReLU` between our linear layers, but there are other activations that can introduce non-linearity in your model.

The ReLU activation function takes the output of the linear layer calculation and replaces the negative values with zeros.

Linear output: ${ x = {weight * input + bias}} $.  
ReLU:  $f(x)= 
\begin{cases}
    0, & \text{if } x < 0\\
    x, & \text{if } x\geq 0\\
\end{cases}
$

In [11]:
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[-0.2530,  0.4927, -0.4120,  0.4664,  0.1131, -0.1839,  0.0128,  0.2858,
         -0.1110, -0.4612,  0.0566, -0.6316, -0.0975,  0.0737,  0.0561,  0.0087,
         -0.2511,  0.2316,  0.5643, -0.0448],
        [-0.0083,  0.6768, -0.5621,  0.2820,  0.3256, -0.0543,  0.3050,  0.2195,
         -0.1572, -0.7011, -0.0464, -0.4313, -0.5993, -0.1504,  0.4440, -0.0426,
         -0.2414,  0.6542,  0.0531, -0.0562],
        [-0.3602,  0.2505, -0.4097,  0.1183,  0.4356, -0.3263, -0.0397,  0.2752,
         -0.1516, -0.8040,  0.1152, -0.4537, -0.6999,  0.0703,  0.2744,  0.0032,
         -0.2562,  0.4451,  0.3579, -0.0913]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.0000, 0.4927, 0.0000, 0.4664, 0.1131, 0.0000, 0.0128, 0.2858, 0.0000,
         0.0000, 0.0566, 0.0000, 0.0000, 0.0737, 0.0561, 0.0087, 0.0000, 0.2316,
         0.5643, 0.0000],
        [0.0000, 0.6768, 0.0000, 0.2820, 0.3256, 0.0000, 0.3050, 0.2195, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.44
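Because the printed tensors are long, the same effect is easier to follow on a tiny example. The values below are hypothetical linear-layer outputs, and `torch.clamp` is used as a hand-written equivalent of ReLU for comparison.

```python
import torch
from torch import nn

# Hypothetical linear-layer outputs for 2 samples with 3 features each
hidden = torch.tensor([[-0.25, 0.5, 0.0],
                       [1.5, -3.0, 0.125]])

activated = nn.ReLU()(hidden)
clamped = torch.clamp(hidden, min=0)   # same operation written by hand

print(activated)
print(activated.shape == hidden.shape)   # True
print(torch.equal(activated, clamped))   # True
```

ReLU works element by element, so the shape of the tensor is unchanged; only the negative entries are replaced.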

### nn.Sequential

`nn.Sequential` is an ordered 
container of modules. The data is passed through all the modules in the same order as they are defined. You can use
sequential containers to put together a quick network like `seq_modules`.

In [12]:
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(20, 10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)
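A sequential container can also be built from an `OrderedDict`, which gives each module a name. The sketch below uses hypothetical names (`flatten`, `hidden`, `act`, `out`) and checks that calling the container gives the same result as applying its modules one by one.

```python
from collections import OrderedDict

import torch
from torch import nn

named = nn.Sequential(OrderedDict([
    ("flatten", nn.Flatten()),
    ("hidden", nn.Linear(28*28, 20)),
    ("act", nn.ReLU()),
    ("out", nn.Linear(20, 10)),
]))

x = torch.rand(3, 28, 28)

# Apply the modules by hand, in the order they were defined
step = x
for module in named:
    step = module(step)

print(named.hidden)   # access a module by name
print(named[3])       # access a module by position
print(torch.equal(step, named(x)))
```

Named modules make printed model structures and parameter names easier to read than the default numeric indices.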

### nn.Softmax

The last linear layer of the neural network returns `logits`, raw values in $(-\infty, \infty)$, which are passed to the
`nn.Softmax` module. The Softmax activation function converts the outputs of the neural network into probabilities, and it is typically used only on the output layer.  The results are scaled to values in \[0, 1\] representing the model's predicted probabilities for each class. The `dim` parameter indicates the dimension along which the values must sum to 1.  The class with the highest probability is the model's prediction.

<img alt="Diagram shows softmax formula" src="images/4-model-4.png" />

In [None]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
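The softmax computation can be written out by hand as the exponential of each logit divided by the sum of exponentials along `dim`. The sketch below does that for two samples with hypothetical logits and compares the result with `nn.Softmax`.

```python
import torch
from torch import nn

# Hypothetical raw scores: 2 samples, 3 classes
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [-1.0, 3.0, 0.5]])

probs = nn.Softmax(dim=1)(logits)
manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

print(probs.sum(dim=1))                # each row sums to 1 (up to rounding)
print(torch.allclose(probs, manual))   # True
print(probs.argmax(dim=1))             # tensor([0, 1])
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether `argmax` is taken before or after it.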

## Model parameters

Many layers inside a neural network are *parameterized*, i.e. they have associated weights 
and biases that are optimized during training. Subclassing `nn.Module` automatically 
tracks all fields defined inside your model object, and makes all parameters 
accessible using your model's `parameters()` or `named_parameters()` methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.
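The same methods can be used to count how many values the network learns. The sketch below rebuilds the layer stack described earlier as a plain `nn.Sequential` (an assumption made so the snippet is self-contained, which is why the parameter names lack the `linear_relu_stack.` prefix) and sums the number of elements in each parameter.

```python
from torch import nn

# Same layer sizes as the tutorial's network, rebuilt so the snippet stands alone
stack = nn.Sequential(
    nn.Linear(28*28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10), nn.ReLU(),
)

for name, p in stack.named_parameters():
    print(name, tuple(p.shape))

total = sum(p.numel() for p in stack.parameters())
trainable = sum(p.numel() for p in stack.parameters() if p.requires_grad)
print(total)      # 669706
print(trainable)  # 669706
```

Only the linear layers contribute: 784 x 512 + 512, then 512 x 512 + 512, then 512 x 10 + 10 values. The ReLU modules have no parameters.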


In [13]:
print("Model structure: ", model, "\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure:  NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
) 


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[ 0.0140,  0.0338,  0.0249,  ..., -0.0208, -0.0341,  0.0157],
        [-0.0241, -0.0277, -0.0175,  ..., -0.0084,  0.0227, -0.0085]],
       device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([ 0.0148, -0.0243], device='cuda:0', grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[-0.0255,  0.0193,  0.0350,  ...,  0.0002, -0.0421,  0.0360],
        [-0.0381, -0.0297, -0.0005,  ...,  0.0181, -0.0078, -0.0405]],
       device='cu