![gradient](res/gradient.gif)

### Definitions

**Partial derivatives**

$$ z = \sqrt{1-x^2-y^2}$$

$${\partial z \over \partial x}=-{x\over \sqrt{1-x^2-y^2}}, {\partial z \over \partial y}=-{y\over \sqrt{1-x^2-y^2}}$$

**Gradient**

$$\nabla z = {1\over \sqrt{1-x^2-y^2}}\begin{bmatrix}-x\\-y\end{bmatrix}$$
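As a quick sanity check, the analytic gradient can be compared with central finite differences. This is a minimal sketch; the test point $(0.3, 0.4)$ and the step size `h` are arbitrary choices, not part of the notes.

```python
import math

def z(x, y):
    """Height of the upper unit hemisphere at (x, y)."""
    return math.sqrt(1.0 - x * x - y * y)

def grad_z(x, y):
    """Analytic gradient from the formula above."""
    r = math.sqrt(1.0 - x * x - y * y)
    return (-x / r, -y / r)

def numeric_grad(f, x, y, h=1e-6):
    """Central finite differences in x and y."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return (dfdx, dfdy)

analytic = grad_z(0.3, 0.4)
numeric = numeric_grad(z, 0.3, 0.4)
print(analytic, numeric)
```

The two pairs should agree to many decimal places; a large gap would point to a sign or algebra mistake in the derivative.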

### Why does the gradient point in the steepest direction?

![contours](res/contours.jpg)

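Reason: the gradient is perpendicular to the contour lines.

Check: for the hemisphere above, the contours are circles of constant $x^2+y^2$, and the gradient $\propto(-x,-y)$ is radial, so it runs along the meridians.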
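Both statements can be checked numerically for the hemisphere $z=\sqrt{1-x^2-y^2}$. The sketch below is only an illustration: the point $(0.3, 0.4)$ and the 1-degree scan are arbitrary assumptions. It computes the dot product of the gradient with the contour tangent, then scans directions for the steepest measured slope.

```python
import math

def z(x, y):
    return math.sqrt(1.0 - x * x - y * y)

x0, y0 = 0.3, 0.4
r = z(x0, y0)
gx, gy = -x0 / r, -y0 / r             # analytic gradient at (x0, y0)

# Contours of z are circles x^2 + y^2 = const, so the tangent at (x0, y0) is (-y0, x0).
dot = gx * (-y0) + gy * x0

def slope(angle, h=1e-6):
    """Central-difference slope of z along the unit vector (cos angle, sin angle)."""
    dx, dy = h * math.cos(angle), h * math.sin(angle)
    return (z(x0 + dx, y0 + dy) - z(x0 - dx, y0 - dy)) / (2.0 * h)

# Scan directions in 1-degree steps and keep the one with the steepest ascent.
best_angle = max((math.radians(d) for d in range(360)), key=slope)
grad_angle = math.atan2(gy, gx) % (2.0 * math.pi)
print(dot, math.degrees(best_angle), math.degrees(grad_angle))
```

The dot product should be zero up to rounding, and the steepest scanned direction should match the gradient direction to within the scan resolution.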

### Gradient descent algorithms

Problem: find a minimum of $J(\theta)$

$$\theta_{t+1} = \theta_t -\eta\nabla J(\theta_t)$$
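A minimal sketch of this update rule on a toy objective. The quadratic $J(\theta)=(\theta_0-3)^2+2(\theta_1+1)^2$, the learning rate, and the step count are illustrative assumptions chosen so the minimum $(3,-1)$ is known in advance.

```python
def grad_J(theta):
    """Gradient of the toy objective J = (theta0 - 3)^2 + 2 * (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

def gradient_descent(theta, eta=0.1, steps=200):
    for _ in range(steps):
        g = grad_J(theta)
        # theta_{t+1} = theta_t - eta * grad J(theta_t)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

theta_min = gradient_descent([0.0, 0.0])
print(theta_min)
```

With these values each coordinate's error shrinks by a constant factor per step, so the iterate ends up very close to $(3,-1)$.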

**Batch Gradient Descent**

Each step uses all $N$ training samples. For logistic regression, the loss is the cross-entropy between the label distribution $p_i$ and the prediction $\hat{p}_i$:

$$\hat{p}_i = \begin{bmatrix}\sigma(wx_i+b)\\1-\sigma(wx_i+b)\end{bmatrix}$$
$$p_i = \begin{bmatrix}y_i\\1-y_i\end{bmatrix}$$
$$ J(w,b) = {1\over N}\sum\limits_{i\in \text{batch}} \left[ -y_i\log\sigma(wx_i+b) - (1-y_i)\log\left(1-\sigma(wx_i+b)\right)\right]\equiv {1\over N}\sum\limits_{i\in \text{batch}} f_i(w,b)$$


$$\theta_{t+1} = \theta_t -\eta\nabla \left[{1\over N}\sum\limits_{i\in \text{batch}} f_i(\theta)\right]$$

Pros: stable, consistent convergence

Cons: very expensive per step; can get stuck in a local minimum
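The following sketch runs batch gradient descent on the logistic-regression loss $J(w,b)$ above. The six-point dataset, learning rate, and iteration count are made up for illustration. It uses the identity $\partial f_i/\partial w = (\sigma(wx_i+b)-y_i)\,x_i$, which follows from differentiating $f_i$.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def loss_and_grad(w, b, xs, ys):
    """Mean cross-entropy J(w, b) and its gradient over all samples."""
    n = len(xs)
    J = dw = db = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        J += -(y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        dw += (p - y) * x / n        # d f_i / d w = (sigma - y_i) * x_i
        db += (p - y) / n            # d f_i / d b = (sigma - y_i)
    return J, dw, db

# Hypothetical 1-D data; the two classes overlap, so the optimum is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b, eta = 0.0, 0.0, 0.5
initial_loss, _, _ = loss_and_grad(w, b, xs, ys)
for _ in range(200):
    _, dw, db = loss_and_grad(w, b, xs, ys)   # every step touches all samples
    w, b = w - eta * dw, b - eta * db
final_loss, _, _ = loss_and_grad(w, b, xs, ys)
print(initial_loss, final_loss, w, b)
```

Every iteration loops over the whole dataset, which is the cost noted above.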


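**Stochastic Gradient Descent**

$$\theta_{t+1} \approx \theta_t -\eta\nabla \left[{1\over K}\sum\limits_{i\in \text{mini-batch}} f_i(\theta)\right]$$

Here $K$ is the mini-batch size; the mini-batch average approximates the full-batch gradient.

Pros: cheaper per step; escapes local minima more easily

Cons: convergence fluctuates

Problem: slow near saddle points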
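A minimal mini-batch SGD sketch. It assumes a least-squares toy problem, $f_i(w)=\tfrac12(wx_i-y_i)^2$ with noiseless targets $y_i=2x_i$, rather than the logistic loss above; the batch size, seed, and learning rate are illustrative choices.

```python
import random

xs = [0.1 * k for k in range(1, 21)]
ys = [2.0 * x for x in xs]           # targets generated with the true slope w = 2

def minibatch_grad(w, batch):
    """Gradient of (1/K) * sum_i 0.5 * (w * x_i - y_i)^2 over the sampled indices."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)

rng = random.Random(0)               # fixed seed so the run is repeatable
w, eta, K = 0.0, 0.1, 4
for _ in range(500):
    batch = rng.sample(range(len(xs)), K)    # K distinct sample indices
    w -= eta * minibatch_grad(w, batch)      # only K of the 20 samples are used
print(w)
```

Each step uses only `K` samples, so it is cheaper than a full-batch step. Because the targets here are noiseless, the fluctuation mentioned above vanishes as `w` approaches 2; with noisy data it would persist.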

**Vanilla Momentum**

$$ \begin{align}v_{t} &= \gamma v_{t-1} +\eta \nabla J(\theta_{t})\\
\theta_{t+1} &=\theta_{t} - v_t \end{align}   $$

![momentum](res/momentum.png)

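Pros: inertia speeds up progress

Problem: overshooting

Worked example (constant gradient $\nabla J = k$, starting from $v_{-1}=0$, with $0\le\gamma<1$):

$$v_0=\eta k,\quad v_1=\gamma v_0 +\eta k = \eta k (1 + \gamma),\quad v_2= \eta k (1 +\gamma + \gamma^2),\quad\cdots,\quad v_t \to {\eta k\over 1-\gamma}$$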
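The geometric series in this example can be reproduced in a few lines. The values of `eta`, `gamma`, and `k` below are illustrative assumptions.

```python
eta, gamma, k = 0.1, 0.9, 1.0    # illustrative values; k is the constant gradient

v = 0.0
velocities = []
for t in range(200):
    v = gamma * v + eta * k      # v_t = gamma * v_{t-1} + eta * k
    velocities.append(v)

limit = eta * k / (1.0 - gamma)  # sum of the geometric series eta * k * gamma^j
print(velocities[:3], velocities[-1], limit)
```

With $\gamma=0.9$ the step size grows to about ten times the plain gradient step $\eta k$. That is the acceleration, and also the reason the iterate can overshoot.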

**Nesterov Accelerated Gradient**

Idea: use the "future" gradient, evaluated at the look-ahead position $\theta_t-\gamma v_{t-1}$

$$ \begin{align}v_{t} &= \gamma v_{t-1} +\eta \nabla J(\theta_{t}-\gamma v_{t-1})\\
\theta_{t+1} &=\theta_{t} - v_t \end{align}   $$

Pros: mitigates overshooting
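To see the effect on overshooting, the sketch below runs vanilla momentum and Nesterov's variant side by side. The toy objective $J(\theta)=\tfrac12\theta^2$ (minimum at 0), the start point, and the hyperparameters are assumptions for illustration.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = 0.5 * theta^2."""
    return theta

def run(use_nesterov, theta=1.0, eta=0.1, gamma=0.9, steps=100):
    v, path = 0.0, [theta]
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point theta - gamma * v.
        lookahead = theta - gamma * v if use_nesterov else theta
        v = gamma * v + eta * grad_J(lookahead)
        theta -= v
        path.append(theta)
    return path

momentum_path = run(use_nesterov=False)
nag_path = run(use_nesterov=True)
# Starting at theta = 1, any value below 0 means the iterate ran past the minimum.
print(min(momentum_path), min(nag_path))
```

In this setup both methods cross the minimum, but the look-ahead gradient brakes earlier, so the Nesterov path dips less far below zero.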


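**Adaptive Gradient (Adagrad)**

Problem: every parameter uses the same learning rate, whereas parameters tied to sparse features in unbalanced data should get a higher one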

$$\theta_{t+1,i} = \theta_{t,i} -{\eta\over\sqrt{G_{t,ii}+\epsilon}}\nabla_i J(\theta_t)$$

$G_{t,ii}$ is the sum of squares of the gradients w.r.t. $\theta_i$ up to time $t$.

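Pros: parameters coupled with dense data (larger gradients) get a smaller learning rate than those coupled with sparse data

Cons: the learning rate only ever decreases, so training gets slower and slower; $G_t$ accumulates the entire gradient history and never forgets old gradients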
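A minimal Adagrad sketch that records the per-parameter learning rate $\eta/\sqrt{G_{t,ii}+\epsilon}$ at every step. The two-parameter quadratic and the hyperparameters are illustrative assumptions; the first coordinate is given much larger gradients than the second.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, steps=100):
    G = [0.0] * len(theta)       # running sum of squared gradients (diagonal of G_t)
    rates = []
    for _ in range(steps):
        g = grad_fn(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        step = [eta / math.sqrt(Gi + eps) for Gi in G]    # per-parameter learning rate
        theta = [t - s * gi for t, s, gi in zip(theta, step, g)]
        rates.append(step)
    return theta, rates

def grad_J(theta):
    """Gradient of the toy objective J = 0.5 * (10 * theta0^2 + 0.1 * theta1^2)."""
    return [10.0 * theta[0], 0.1 * theta[1]]

theta, rates = adagrad(grad_J, [1.0, 1.0])
print(rates[0], rates[-1])
```

The coordinate with large gradients gets the smaller rate. Since `G` only grows, neither rate ever increases, which is the drawback listed above.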

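**RMSprop**

With $g_t = \nabla J(\theta_t)$, keep an exponentially decaying average of the squared gradient for each parameter:

$$ E[g^2]_{t,i} = \gamma E[g^2]_{t-1,i} + (1-\gamma)g_{t,i}^2 $$
$$\theta_{t+1,i} = \theta_{t,i} -{\eta\over\sqrt{E[g^2]_{t,i}+\epsilon}} g_{t,i}$$

Pros: the learning rate no longer decays monotonically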
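The sketch below tracks only the effective learning rate $\eta/\sqrt{E[g^2]_t+\epsilon}$ for a single parameter. The gradient history fed to it (large gradients followed by small ones) is a made-up input, and the hyperparameters are illustrative.

```python
import math

def rmsprop_rates(grads, eta=0.01, gamma=0.9, eps=1e-8):
    """Effective learning rate eta / sqrt(E[g^2]_t + eps) for a scalar gradient sequence."""
    avg_sq, rates = 0.0, []
    for g in grads:
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g   # decaying average of g^2
        rates.append(eta / math.sqrt(avg_sq + eps))
    return rates

# Hypothetical gradient history: large gradients first, then small ones.
grads = [1.0] * 50 + [0.1] * 50
rates = rmsprop_rates(grads)
print(rates[49], rates[99])
```

Once the large gradients fade out of the moving average, the rate rises again. A cumulative sum of squared gradients cannot do this.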

**Adam**

$$ m_t = \beta_1m_{t-1} + (1-\beta_1)g_t $$
$$ v_t = \beta_2v_{t-1} + (1-\beta_2)g_t^2 $$
$$ \hat{m}_t = {m_t\over 1-\beta_1^t}, \quad \hat{v}_t = {v_t\over 1-\beta_2^t} $$
$$\theta_{t+1} = \theta_{t} -{\eta\over\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t$$
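A scalar Adam sketch with the bias correction $1-\beta^t$, where $t$ counts steps from 1 and $m_0=v_0=0$. The toy objective $(\theta-3)^2$ and the hyperparameter values are assumptions for illustration.

```python
import math

def adam(grad_fn, theta, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # undo the bias toward the zero start
        v_hat = v / (1.0 - beta2 ** t)
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

one_step = adam(grad_J, theta=0.0, steps=1)
theta = adam(grad_J, theta=0.0)
print(one_step, theta)
```

After the first step $\hat{m}_1=g_1$ and $\hat{v}_1=g_1^2$, so the first update has magnitude close to $\eta$ whatever the gradient scale. Without the correction it would be scaled by $(1-\beta_1)/\sqrt{1-\beta_2}$.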


### Backpropagation

![layers](res/mnist_2layers.png)

forward propagation

$$\begin{align}
z^{(1)}_j &=\sum_ix_iw^{(1)}_{ij}+b^{(1)}_j\\
a^{(1)}_j &=\sigma\left(z^{(1)}_j\right)\\
z^{(2)}_j &=\sum_ia^{(1)}_iw^{(2)}_{ij}+b^{(2)}_j\\
a^{(2)}_j &=\sigma\left(z^{(2)}_j\right)\\
z^{(3)}_j &=\sum_ia^{(2)}_iw^{(3)}_{ij}+b^{(3)}_j\\
\hat{y}_j &={\exp\left(z^{(3)}_j\right)\over \sum\limits_i\exp\left(z^{(3)}_i\right)}\\
J &= -\sum_j\delta_{y_j,1}\ln\hat{y}_j \qquad\text{(single sample)}\\
J &= -{1\over K}\sum_n\sum_j\delta_{y_{nj},1}\ln\hat{y}_{nj} \qquad\text{(mini-batch)}\\
\end{align}$$ 
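The forward pass above can be sketched in plain Python for one sample. The layer widths, random weights, and input vector below are hypothetical and much smaller than an MNIST-sized network; `W[i][j]` follows the $w_{ij}$ index convention of the equations.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def dense(a, W, b):
    """z_j = sum_i a_i * W[i][j] + b_j."""
    return [sum(a[i] * W[i][j] for i in range(len(a))) + b[j] for j in range(len(b))]

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    e = [math.exp(zj - m) for zj in z]
    s = sum(e)
    return [ej / s for ej in e]

def forward(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    a1 = [sigmoid(zj) for zj in dense(x, W1, b1)]
    a2 = [sigmoid(zj) for zj in dense(a1, W2, b2)]
    return softmax(dense(a2, W3, b3))

def cross_entropy(y_hat, label):
    """J = -ln(y_hat[label]) for a one-hot target whose 1 sits at index `label`."""
    return -math.log(y_hat[label])

rng = random.Random(0)
sizes = [3, 4, 4, 2]                             # hypothetical layer widths
params = [
    ([[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)], [0.0] * n_out)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
y_hat = forward([0.5, -0.2, 0.1], params)
loss = cross_entropy(y_hat, label=1)
print(y_hat, loss)
```

The softmax outputs are positive and sum to 1, so the single-sample loss is the negative log of the probability assigned to the true class.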

backward propagation
$$\begin{align}
{\partial J\over\partial \hat{y}_{nj}} &= -{1\over K}\delta_{y_{nj},1}{1\over\hat{y}_{nj}}\\
{\partial J\over\partial z^{(3)}_{ni}} &= \sum_j {\partial J\over\partial \hat{y}_{nj}}{\partial \hat{y}_{nj}\over\partial z^{(3)}_{ni} }= {1\over K}\left(\hat{y}_{ni}-\delta_{y_{ni},1}\right)\\
{\partial J\over\partial w^{(3)}_{ij}} &= \sum_{nk} {\partial J\over\partial z^{(3)}_{nk}}{\partial z^{(3)}_{nk}\over\partial w^{(3)}_{ij}}=
\sum_{n} {\partial J\over\partial z^{(3)}_{nj}}{\partial z^{(3)}_{nj}\over\partial w^{(3)}_{ij}}= \sum_{n}{\partial J\over\partial z^{(3)}_{nj}}a^{(2)}_{ni}\\
{\partial J\over\partial b^{(3)}_{j}} &= \sum_{nk} {\partial J\over\partial z^{(3)}_{nk}}{\partial z^{(3)}_{nk}\over\partial b^{(3)}_{j}}=
\sum_{n} {\partial J\over\partial z^{(3)}_{nj}}{\partial z^{(3)}_{nj}\over\partial b^{(3)}_{j}}= \sum_{n}{\partial J\over\partial z^{(3)}_{nj}}\\
{\partial J\over\partial a^{(2)}_{ni}} &= \sum_{j}{\partial J\over\partial z^{(3)}_{nj}}{\partial z^{(3)}_{nj}\over\partial a^{(2)}_{ni} }=
 \sum_{j}{\partial J\over\partial z^{(3)}_{nj}}w^{(3)}_{ij}\\
 {\partial J\over\partial z^{(2)}_{ni}} &= {\partial J\over\partial a^{(2)}_{ni}} {\partial a^{(2)}_{ni}\over\partial  z^{(2)}_{ni}}= {\partial J\over\partial a^{(2)}_{ni}}\,a^{(2)}_{ni}\left(1-a^{(2)}_{ni}\right)\\
 {\partial J\over\partial w^{(2)}_{ij}} &= \sum_{n}{\partial J\over\partial z^{(2)}_{nj}}a^{(1)}_{ni}\\
 {\partial J\over\partial b^{(2)}_{j}} &= \sum_{n}{\partial J\over\partial z^{(2)}_{nj}}\\
 {\partial J\over\partial a^{(1)}_{ni}} &= \sum_{j}{\partial J\over\partial z^{(2)}_{nj}}{\partial z^{(2)}_{nj}\over\partial a^{(1)}_{ni} }=
 \sum_{j}{\partial J\over\partial z^{(2)}_{nj}}w^{(2)}_{ij}\\
 {\partial J\over\partial z^{(1)}_{ni}} &= {\partial J\over\partial a^{(1)}_{ni}} {\partial a^{(1)}_{ni}\over\partial  z^{(1)}_{ni}}= {\partial J\over\partial a^{(1)}_{ni}}\,a^{(1)}_{ni}\left(1-a^{(1)}_{ni}\right)\\
 {\partial J\over\partial w^{(1)}_{ij}} &= \sum_{n}{\partial J\over\partial z^{(1)}_{nj}}x_{ni}\\
 {\partial J\over\partial b^{(1)}_{j}} &= \sum_{n}{\partial J\over\partial z^{(1)}_{nj}}\\
\end{align}$$ 
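The chain of derivatives above translates into a short loop over the layers, from the last to the first. This is a minimal sketch under stated assumptions: a tiny hypothetical network and a two-sample mini-batch, the softmax and cross-entropy derivative $(\hat{y}_{ni}-\delta_{y_{ni},1})/K$ for one-hot labels, and $\sigma'(z)=a(1-a)$ for the sigmoid. One weight gradient is then compared with a finite difference.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def dense(a, W, b):
    return [sum(a[i] * W[i][j] for i in range(len(a))) + b[j] for j in range(len(b))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_and_grads(params, X, labels):
    """Mean cross-entropy over the batch and (dJ/dW, dJ/db) for each layer."""
    K = len(X)
    grads = [([[0.0] * len(b) for _ in W], [0.0] * len(b)) for W, b in params]
    J = 0.0
    for x, label in zip(X, labels):
        # Forward pass, keeping the input of every layer.
        acts = [x]
        for W, b in params[:-1]:
            acts.append([sigmoid(v) for v in dense(acts[-1], W, b)])
        y_hat = softmax(dense(acts[-1], *params[-1]))
        J += -math.log(y_hat[label]) / K
        # Softmax + cross-entropy: dJ/dz3_i = (y_hat_i - [i == label]) / K.
        dz = [(p - (1.0 if i == label else 0.0)) / K for i, p in enumerate(y_hat)]
        for layer in range(len(params) - 1, -1, -1):
            W, _ = params[layer]
            a_in = acts[layer]
            dW, db = grads[layer]
            for j, d in enumerate(dz):
                db[j] += d                       # dJ/db_j = sum_n dJ/dz_nj
                for i, a in enumerate(a_in):
                    dW[i][j] += d * a            # dJ/dW_ij = sum_n dJ/dz_nj * a_ni
            if layer > 0:
                # dJ/da_i = sum_j dJ/dz_j * W_ij, then dJ/dz_i = dJ/da_i * a_i * (1 - a_i).
                da = [sum(dz[j] * W[i][j] for j in range(len(dz))) for i in range(len(a_in))]
                dz = [d * a * (1.0 - a) for d, a in zip(da, a_in)]
    return J, grads

rng = random.Random(0)
sizes = [3, 4, 4, 2]                             # hypothetical layer widths
params = [
    ([[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)], [0.0] * n_out)
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
X = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
labels = [1, 0]
J, grads = loss_and_grads(params, X, labels)

# Finite-difference check on one first-layer weight, w^(1)_{12} in 0-based indexing.
h = 1e-6
params[0][0][1][2] += h
J_plus, _ = loss_and_grads(params, X, labels)
params[0][0][1][2] -= 2.0 * h
J_minus, _ = loss_and_grads(params, X, labels)
params[0][0][1][2] += h                          # restore the original weight
numeric = (J_plus - J_minus) / (2.0 * h)
print(grads[0][0][1][2], numeric)
```

If the backpropagated value and the finite-difference estimate disagree beyond rounding, one of the chain-rule steps has been implemented incorrectly.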
