# 1.neuron

a neuron takes an input $x \in \mathbb{R}^{d}$, multiplies it by weights $w \in \mathbb{R}^{d}$, adds a bias term $b$, and finally applies an activation function $g$.

that is:

$$f(x) = g(w^{T}x + b)$$

it is analogous to the function of a biological neuron.

![jupyter](./neuron.svg)

some useful activation functions:

$$
\begin{equation}
\begin{split}
\text{sigmoid:}\quad &g(z) = \frac{1}{1 + e^{-z}} \\
\text{tanh:}\quad &g(z) = \frac{e^{z}-e^{-z}}{e^{z} + e^{-z}} \\
\text{relu:}\quad &g(z) = \max(z,0) \\
\text{leaky relu:}\quad &g(z) = \max(z, \epsilon z)\ ,\ \epsilon\text{ is a small positive number}\\
\text{identity:}\quad &g(z) = z
\end{split}
\end{equation}
$$
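the activation functions listed above can be sketched in plain python as follows. this is a minimal scalar version for illustration; the default `eps=0.01` for leaky relu is an assumed value, and the naive `sigmoid` can overflow for very negative `z`.

```python
import math

def sigmoid(z):
    # 1 / (1 + e^{-z}); naive form, may overflow for very negative z
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # (e^{z} - e^{-z}) / (e^{z} + e^{-z})
    return math.tanh(z)

def relu(z):
    return max(z, 0.0)

def leaky_relu(z, eps=0.01):
    # eps is the small positive slope; 0.01 is an assumed default
    return max(z, eps * z)

def identity(z):
    return z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z), identity(z))
```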

the forward pass of linear regression is a neuron with the identity activation function.

the forward pass of logistic regression is a neuron with the sigmoid activation function.
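to make the correspondence concrete, here is a small sketch of a single neuron $f(x) = g(w^{T}x + b)$ in plain python. the helper name `neuron` and the numbers for `x`, `w`, `b` are made up for illustration.

```python
import math

def neuron(x, w, b, g):
    # affine part w^T x + b, followed by the activation g
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return g(z)

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# made-up input and parameters
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 1.0

linear_out = neuron(x, w, b, identity)    # linear regression prediction
logistic_out = neuron(x, w, b, sigmoid)   # logistic regression probability
print(linear_out, logistic_out)
```

the same function covers both models; only the activation passed as `g` changes.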

# 2.neural network

building a neural network is analogous to playing with lego bricks: you take individual bricks (neurons) and stack them together to build complex structures.

![jupyter](./mlp.svg)

we use brackets to denote layers. taking the network above as an example:

$[0]$ denotes the input layer, $[1]$ denotes the hidden layer, and $[2]$ denotes the output layer.

$a^{[l]}$ denotes the output of layer $l$, with $a^{[0]} := x$.

$z^{[l]}$ denotes the affine result of layer $l$.

we have:

$$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = g^{[l]}(z^{[l]})$$

where $W^{[l]} \in \mathbb{R}^{d^{[l]} \times d^{[l-1]}}$, $b^{[l]} \in \mathbb{R}^{d^{[l]}}$, and $d^{[l]}$ is the number of neurons in layer $l$.
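the two recurrences translate directly into a loop over layers. below is a minimal sketch in plain python for a hypothetical 2-3-1 network; the helper names `matvec` and `forward`, the weights, and the choice of relu then sigmoid are all assumptions made for illustration.

```python
import math

def matvec(W, a):
    # W is a list of rows with shape d[l] x d[l-1]; a has length d[l-1]
    return [sum(wij * aj for wij, aj in zip(row, a)) for row in W]

def forward(x, params, activations):
    a = x  # a^{[0]} := x
    for (W, b), g in zip(params, activations):
        z = [zi + bi for zi, bi in zip(matvec(W, a), b)]  # z^{[l]}
        a = [g(zi) for zi in z]                            # a^{[l]}
    return a

def relu(z):
    return max(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# made-up parameters: layer 1 is 3 x 2, layer 2 is 1 x 3
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0]]
b2 = [-1.5]

x = [1.0, 2.0]
y_hat = forward(x, [(W1, b1), (W2, b2)], [relu, sigmoid])
print(y_hat)
```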

# 3.weight decay

recall that to mitigate overfitting, we used $l_{2}$ and $l_{1}$ regularization in linear and logistic regression.

weight decay is an alias of $l_{2}$ regularization and generalizes to neural networks. in this setting, we flatten all the $W^{[l]}$ and concatenate them to get $w$.

first add the $l_{2}$ norm penalty:

$$J(w,b) = \sum_{i=1}^{n}l(w, b, x^{(i)}, y^{(i)}) + \frac{\lambda}{2}\left \|  w \right \|^{2} $$

then by gradient descent with learning rate $\eta$, we have:

$$
\begin{equation}
\begin{split}
w:=& w-\eta\frac{\partial}{\partial w}J(w, b) \\
=& w-\eta\frac{\partial}{\partial w}\left(\sum_{i=1}^{n}l(w, b, x^{(i)}, y^{(i)}) + \frac{\lambda}{2}\left \|  w \right \|^{2}\right) \\
=& (1 - \eta\lambda)w - \eta\frac{\partial}{\partial w}\sum_{i=1}^{n}l(w, b, x^{(i)}, y^{(i)})
\end{split}
\end{equation}
$$

multiplying $w$ by $(1 - \eta\lambda)$ at every step is the weight decay.

we usually do not include the bias term in the regularization, and the same holds for weight decay.
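as a quick numerical check of the derivation, the sketch below compares one gradient step on the penalized objective with the "shrink, then step" form. the function names and the numbers (including the loss gradient) are made up for illustration; the bias is left out, as noted above.

```python
def step_penalized(w, grad_loss, eta, lam):
    # the gradient of loss + (lam / 2) * ||w||^2 is grad_loss + lam * w
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad_loss)]

def step_weight_decay(w, grad_loss, eta, lam):
    # shrink w by (1 - eta * lam), then take the plain loss-gradient step
    return [(1.0 - eta * lam) * wi - eta * gi for wi, gi in zip(w, grad_loss)]

w = [1.0, -2.0, 0.5]
grad_loss = [0.1, -0.3, 0.2]   # made-up gradient of the unpenalized loss
eta, lam = 0.1, 0.01

w_penalized = step_penalized(w, grad_loss, eta, lam)
w_decayed = step_weight_decay(w, grad_loss, eta, lam)
print(w_penalized)
print(w_decayed)
```

both updates agree up to floating point rounding, which is why the two names are used interchangeably for plain gradient descent.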

# 4.dropout

to strengthen robustness through perturbation, we can deliberately add perturbation during training; dropout is one such technique.

concretely, we apply the following to the output $a$ of each hidden neuron:

$$
a_{dropout} = 
\begin{cases}
0 &\text{with probability }p \\
\frac{a}{1-p} &\text{otherwise}
\end{cases}
$$

this operation randomly drops a neuron with probability $p$ while keeping the expectation unchanged:

$$E(a_{dropout}) = p \cdot 0 + (1-p)\cdot\frac{a}{1-p} = a$$

this process is depicted below:

![jupyter](./dropout2.svg)

one more thing: we do not use dropout at prediction time.
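the rule above, including the train/predict distinction, can be sketched in plain python as follows. the function name `dropout`, the `training` flag, and the `rng` argument are assumptions for illustration; the code requires $p < 1$.

```python
import random

def dropout(a, p, training=True, rng=random):
    # at prediction time (or with p = 0) the activations pass through unchanged
    if not training or p == 0.0:
        return list(a)
    # zero each activation with probability p, otherwise rescale by 1 / (1 - p)
    return [0.0 if rng.random() < p else ai / (1.0 - p) for ai in a]

rng = random.Random(0)          # fixed seed so the run is repeatable
a = [1.0] * 100_000             # made-up activations, all equal to 1
out = dropout(a, p=0.5, rng=rng)

mean_out = sum(out) / len(out)  # should stay close to the original mean of 1
print(mean_out)
```

each surviving activation is doubled when $p = 0.5$, so the average over many neurons stays near the original value.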

# 5.xavier initialization

to mitigate vanishing and exploding gradients, and to ensure that symmetry is broken, we should initialize the weights carefully.

consider a fully connected layer without bias term and activation function:

$$o_{i} = \sum_{j=1}^{n_{in}}w_{ij}x_{j}$$

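suppose the $w_{ij}$ are drawn from a distribution with mean 0 and variance $\sigma^{2}$, not necessarily gaussian.

suppose the $x_{j}$ are drawn from a distribution with mean 0 and variance $\gamma^{2}$, and all $w_{ij}, x_{j}$ are independent.

then the mean of $o_{i}$ is 0. for the variance, the cross terms $E[w_{ij}x_{j}w_{ik}x_{k}]$ with $j \neq k$ vanish because the variables are independent with zero mean, so: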

$$
\begin{equation}
\begin{split}
Var[o_{i}] =& E[o_{i}^{2}] - (E[o_{i}])^{2}\\
=&\sum_{j=1}^{n_{in}}E[w_{ij}^{2}x_{j}^{2}] \\
=&\sum_{j=1}^{n_{in}}E[w_{ij}^{2}]E[x_{j}^{2}] \\
=&n_{in}\sigma^{2}\gamma^{2}
\end{split}
\end{equation}
$$

to keep the variance fixed across the layer, that is $Var[o_{i}] = \gamma^{2}$, we need to set $n_{in}\sigma^{2}=1$.
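the variance formula can be checked empirically. the sketch below is a small monte carlo experiment in plain python; the gaussian distributions, `n_in = 100`, `gamma = 2.0`, and the sample count are all assumed values chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n_in = 100
sigma = (1.0 / n_in) ** 0.5     # chosen so that n_in * sigma^2 = 1
gamma = 2.0                     # assumed standard deviation of the inputs

def sample_output():
    # one draw of o_i = sum_j w_ij * x_j with fresh, independent w and x
    return sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, gamma) for _ in range(n_in))

outputs = [sample_output() for _ in range(5000)]
mean_o = statistics.fmean(outputs)
var_o = statistics.pvariance(outputs)

# with n_in * sigma^2 = 1, the variance of o_i should be close to gamma^2
print(mean_o, var_o, gamma ** 2)
```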

consider backpropagation,

# 6.backpropagation

recall that in forward propagation, we have:

$$z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = g^{[l]}(z^{[l]})$$

consider an $L$-layer network; the prediction is then obtained through:

$$
\hat{y} = g^{[L]}(W^{[L]}a^{[L-1]} + b^{[L]})
$$

loss function:

$$L(\hat{y},y)$$

for any given layer index $l$, we update layer $l$'s parameters by gradient descent with learning rate $\alpha$:

$$W^{[l]} = W^{[l]} - \alpha\frac{\partial{L}}{\partial{W^{[l]}}}$$

$$b^{[l]} = b^{[l]} - \alpha\frac{\partial{L}}{\partial{b^{[l]}}}$$

to proceed, we must compute the gradient with respect to the parameters: 