# Hessian Matrix

$$H = \nabla^2 J = \begin{bmatrix}
\frac{\partial^2 J}{\partial\theta_1^2}&\cdots&\frac{\partial^2 J}{\partial\theta_1\partial \theta_D}\\
\vdots & \ddots &\vdots\\
\frac{\partial^2 J}{\partial\theta_D\partial\theta_1}&\cdots&\frac{\partial^2 J}{\partial\theta^2_D}
\end{bmatrix}$$

Locally, a function can be approximated by its second-order Taylor approximation around a point $\theta_0$:
$$J(\theta) \approx J(\theta_0) + \nabla J(\theta_0)^T (\theta-\theta_0) + \frac{1}{2}(\theta-\theta_0)^T H(\theta_0)(\theta-\theta_0)$$
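A critical point is a point where the gradient is zero, i.e. $\nabla J(\theta_0) = 0$. Around a critical point $\theta_0$ the linear term vanishes, leaving
$$J(\theta) \approx J(\theta_0) + \frac{1}{2}(\theta-\theta_0)^T H(\theta_0)(\theta-\theta_0)$$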

A lot of important features of the optimization landscape can be characterized by the eigenvalues of the Hessian $H$. 

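Recall that a symmetric matrix $H$ has only real eigenvalues and there is an orthogonal basis of eigenvectors, i.e. a __spectral decomposition__ $H = Q\Lambda Q^T$ where $Q$ is an orthogonal matrix whose columns are the eigenvectors and $\Lambda$ is a diagonal matrix of the eigenvalues. 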

We therefore refer to $H$ as the __curvature__ of the function.   
Suppose you move along a line defined by $\theta + tv$ for some vector $v$.  
Then the second-order Taylor approximation gives 
$$J(\theta + tv) \approx J(\theta) + t\nabla J(\theta)^Tv + \frac{t^2}{2}v^TH(\theta)v$$
Hence, in a direction where $v^THv > 0$, the cost function curves upwards, i.e. has positive curvature. Where $v^THv < 0$, it has negative curvature. 
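
As a rough numerical check of this statement, the sketch below compares $v^THv$ with a finite-difference estimate of the second derivative of $J$ along the line $\theta + tv$. The cost `J`, the point `theta`, and the helper names are made up for illustration.

```python
def J(theta):
    # Hypothetical example cost: curves up along theta[0], down along theta[1].
    return theta[0] ** 2 - theta[1] ** 2

def quad_form(H, v):
    # v^T H v for a matrix stored as nested lists.
    n = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))

def directional_curvature(J, theta, v, t=1e-3):
    # Central finite difference for the second derivative of J along theta + t*v.
    plus = [a + t * b for a, b in zip(theta, v)]
    minus = [a - t * b for a, b in zip(theta, v)]
    return (J(plus) - 2.0 * J(theta) + J(minus)) / t ** 2

H = [[2.0, 0.0], [0.0, -2.0]]  # exact Hessian of the example J
theta = [0.3, -0.7]
for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    print(v, quad_form(H, v), directional_curvature(J, theta, v))
```

The first direction has positive curvature, the second negative, and along $(1, 1)$ the two cancel.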

A matrix $A$ is positive definite if $v^TAv > 0$ for all $v\neq 0$, and positive semidefinite if $v^TAv \geq 0$ for all $v$. 

Equivalently, a matrix is positive definite IFF all its eigenvalues are positive. 

Therefore, for any critical point $\theta_*$, if $H(\theta_*)$ exists and is positive definite, then $\theta_*$ is a local minimum. 
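
A minimal sketch of this eigenvalue test for a $2\times 2$ symmetric Hessian, using the closed-form eigenvalues. The function names and the tolerance `tol` are illustrative assumptions. The "local maximum" and "saddle point" branches apply the same argument to negative and mixed-sign eigenvalues.

```python
import math

def eig_sym_2x2(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first.
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, c, tol=1e-12):
    # Assumes the gradient is already zero at the point being classified.
    lo, hi = eig_sym_2x2(a, b, c)
    if lo > tol:
        return "local minimum"   # positive definite
    if hi < -tol:
        return "local maximum"   # negative definite
    if lo < -tol and hi > tol:
        return "saddle point"    # eigenvalues of both signs
    return "inconclusive"        # some eigenvalue is numerically zero

print(classify_critical_point(2.0, 0.0, 1.0))
print(classify_critical_point(2.0, 0.0, -2.0))
print(classify_critical_point(1.0, 1.0, 1.0))
```

When an eigenvalue is zero the second-order test says nothing, which is why that case is reported separately.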

If $J$ is smooth, then it is convex IFF its $H$ is positive semidefinite everywhere. In the univariate case, $H$ is just the second derivative, so the condition reduces to $J''(\theta) \geq 0$ everywhere.

# Convexity
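The training objective of a network with hidden units cannot be convex because of __permutation symmetries__: we can re-order the hidden units in a way that preserves the function computed by the network. If the cost were convex, the average of two such equivalent parameter settings would have to cost no more than either of them, but averaging permuted weights generally gives a worse network. 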
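
The symmetry can be checked on a toy network. The sketch below assumes a single hidden layer of two tanh units with made-up weights. It swaps the two hidden units and also evaluates the midpoint of the two equivalent parameter settings, which computes a different function.

```python
import math

def forward(W, w_out, x):
    # One hidden layer of tanh units followed by a linear output.
    # W[k] holds the incoming weights of hidden unit k, w_out[k] its outgoing weight.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in W]
    return sum(o * h for o, h in zip(w_out, hidden))

W = [[1.0, -2.0], [0.5, 3.0]]   # hypothetical weights
w_out = [2.0, -1.0]

# Re-order the hidden units: swap the rows of W and the matching outgoing weights.
W_perm = [W[1], W[0]]
w_out_perm = [w_out[1], w_out[0]]

# Midpoint of the two equivalent parameter settings.
W_mid = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(W, W_perm)]
w_out_mid = [(a + b) / 2 for a, b in zip(w_out, w_out_perm)]

x = [0.3, -0.8]
y = forward(W, w_out, x)
y_perm = forward(W_perm, w_out_perm, x)
y_mid = forward(W_mid, w_out_mid, x)
print(y, y_perm, y_mid)
```

The permuted network gives the same output, while the midpoint network has two identical hidden units and a clearly different output.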

# Problems with NN
## Saddle Points 
A saddle point is a point where $\nabla J(\theta) = 0$ and $H(\theta)$ has some positive and some negative eigenvalues, i.e. some directions with positive curvature and some with negative curvature. 

#### Example
Suppose two hidden units have identical incoming and outgoing weights. They then receive identical gradients, so after every GD update they are still identical and can never learn different features.  
Therefore, do not initialize all the weights to 0; instead, assign them random values. 
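
A small sketch of this effect, assuming a toy network with scalar input, two tanh hidden units, squared-error loss, and made-up data. The gradients are written out by hand and all names are illustrative.

```python
import math

def grads(w, v, x, t):
    # Network: y = sum_k v[k] * tanh(w[k] * x), loss = 0.5 * (y - t)^2.
    h = [math.tanh(wk * x) for wk in w]
    err = sum(vk * hk for vk, hk in zip(v, h)) - t
    grad_w = [err * vk * (1.0 - hk * hk) * x for vk, hk in zip(v, h)]
    grad_v = [err * hk for hk in h]
    return grad_w, grad_v

def train(w, v, data, lr=0.1, steps=200):
    # Plain gradient descent, one example at a time.
    w, v = list(w), list(v)
    for _ in range(steps):
        for x, t in data:
            gw, gv = grads(w, v, x, t)
            w = [wk - lr * g for wk, g in zip(w, gw)]
            v = [vk - lr * g for vk, g in zip(v, gv)]
    return w, v

data = [(1.0, 0.5), (-2.0, -0.8), (0.5, 0.3)]  # made-up (input, target) pairs

w_zero, v_zero = train([0.0, 0.0], [0.0, 0.0], data)   # all-zero initialization
w_sym, v_sym = train([0.4, 0.4], [0.7, 0.7], data)     # two identical units
w_rand, v_rand = train([0.4, -0.9], [0.7, 0.2], data)  # distinct initialization
print(w_zero, w_sym, w_rand)
```

In this particular network the all-zero initialization has exactly zero gradient, so the weights never move. The identical units stay identical, and only the distinct initialization lets the two units diverge.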

## Plateaux
A plateau is a flat region of the cost surface. Plateaux arise from $0-1$ loss, hard threshold activations, and logistic activations combined with least-squares (LS) loss.

#### Example 
A unit is __saturated__ when it is in the flat region of its activation function, e.g. at large $|z|$ for logistic units or at negative $z$ for ReLU. 
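
To see how flat these regions are, the sketch below evaluates the derivative of each activation at a few inputs. The helper names are illustrative.

```python
import math

def logistic_grad(z):
    # Derivative of the logistic function: sigma(z) * (1 - sigma(z)).
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU (taking the value 0 at z = 0 by convention).
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 10.0, -10.0):
    print(z, logistic_grad(z), relu_grad(z))
```

The logistic derivative peaks at $0.25$ for $z = 0$ and is tiny at $z = \pm 10$, so almost no gradient flows through a saturated unit.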

## Ill-conditioned curvature
Suppose $H$ has some large positive eigenvalues and some eigenvalues close to 0. Then, GD bounces back and forth in high curvature directions and makes slow progress in low curvature directions. The ideal path would instead follow the floor of the "valley". 

### GD dynamics
Consider a convex quadratic objective $J(\theta) = \frac{1}{2}\theta^TA \theta$ where $A$ is PSD.   
Then, the GD update gives
\begin{align*}
\theta &\leftarrow \theta - \alpha \nabla J(\theta)\\
&= \theta - \alpha A \theta\\
&= (I - \alpha A)\theta
\end{align*}
Solving the recurrence, after $k$ steps we have
$$\theta_k = (I - \alpha A)^k \theta_0$$
We can analyze matrix powers such as $(I-\alpha A)^k \theta_0$ using the spectral decomposition.   
Let $A = Q\Lambda Q^T$ be the spectral decomposition of $A$. 
\begin{align*}
(I - \alpha A)^k \theta_0 &= (I - \alpha Q\Lambda Q^T)^k \theta_0\\
&= [Q(I - \alpha\Lambda)Q^T]^k \theta_0\\
&= Q(I-\alpha \Lambda)^k Q^T\theta_0
\end{align*}
Hence, in the $Q$ basis, each coordinate gets multiplied by $(1-\alpha\lambda_i)^k$ where the $\lambda_i$ are the eigenvalues of $A$. 

Therefore, 
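- $0 < \alpha \lambda_i \leq 1$: the coordinate decays to $0$ at a rate that depends on $\alpha\lambda_i$
- $1 < \alpha \lambda_i \leq 2$: the coordinate oscillates, flipping sign every step (and still shrinking as long as $\alpha\lambda_i < 2$)
- $\alpha\lambda_i > 2$: unstable (diverges)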

Hence, we need to set the learning rate $\alpha < 2/\lambda_{max}$ to prevent instability, where $\lambda_{max}$ is the largest eigenvalue, i.e. the maximum curvature. 
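
The three regimes can be reproduced on a single coordinate in the eigenbasis, where GD reduces to multiplying by $(1 - \alpha\lambda_i)$ each step. This is a minimal sketch with $\lambda_i = 1$ and hypothetical learning rates chosen to land in each regime.

```python
def gd_trajectory(lam, alpha, theta0=1.0, steps=5):
    # GD on J(theta) = 0.5 * lam * theta^2, i.e. one coordinate in the Q basis.
    traj = [theta0]
    for _ in range(steps):
        traj.append(traj[-1] - alpha * lam * traj[-1])
    return traj

t_decay = gd_trajectory(lam=1.0, alpha=0.5)  # alpha * lam = 0.5
t_osc = gd_trajectory(lam=1.0, alpha=1.5)    # alpha * lam = 1.5
t_div = gd_trajectory(lam=1.0, alpha=2.5)    # alpha * lam = 2.5
print(t_decay)
print(t_osc)
print(t_div)
```

The first trajectory shrinks without changing sign, the second alternates in sign while shrinking, and the third alternates in sign while growing.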

This bounds the rate of progress in every other direction:
$$\alpha\lambda_i < \frac{2\lambda_i}{\lambda_{max}}$$
The quantity $\lambda_{max}/\lambda_{min}$ is known as the __condition number__ of $A$. Larger condition numbers imply slower convergence of GD.
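
A rough illustration of the effect of the condition number: the sketch below runs GD on a two-dimensional diagonal quadratic and counts the steps until both coordinates fall below a tolerance. The learning rate $\alpha = 1/\lambda_{max}$ and the tolerance are arbitrary choices for the illustration.

```python
def steps_to_converge(lam_min, lam_max, tol=1e-3, max_steps=10 ** 6):
    # GD on J = 0.5 * (lam_max * theta_1^2 + lam_min * theta_2^2) from (1, 1).
    alpha = 1.0 / lam_max  # assumed step size, safely below 2 / lam_max
    lams = [lam_max, lam_min]
    theta = [1.0, 1.0]
    for k in range(1, max_steps + 1):
        theta = [(1.0 - alpha * lam) * t for lam, t in zip(lams, theta)]
        if max(abs(t) for t in theta) < tol:
            return k
    return None  # did not converge within max_steps

for kappa in (10.0, 100.0, 1000.0):
    print(kappa, steps_to_converge(1.0, kappa))
```

The step count grows roughly in proportion to the condition number, because the slow coordinate shrinks by only a factor of $1 - \lambda_{min}/\lambda_{max}$ per step.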

This analysis generalizes easily to a quadratic not centered at zero, since the gradient descent dynamics are invariant to translation. 
$$J(\theta) = \frac{1}{2}\theta^TA\theta + b^T\theta + c$$
Since a smooth cost function is well approximated by a convex quadratic in the vicinity of a local optimum, this analysis is a good description of the behavior of GD near an optimum. 

#### Solution
Note that this issue is also common when the weights are imbalanced. To avoid these problems, standardize each input dimension to zero mean and unit variance, so that all inputs have a similar scale and mean:
$$\tilde x_j = \frac{x_j - \mu_j}{\sigma_j}$$
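
A minimal sketch of this transformation for a single input dimension. It assumes $\sigma_j$ is the population standard deviation and that the column is not constant. The function name is illustrative.

```python
import math

def standardize(column):
    # Subtract the mean and divide by the (population) standard deviation.
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
    return [(x - mu) / sigma for x in column]

z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```

In practice $\mu_j$ and $\sigma_j$ would be computed on the training set and reused unchanged for test inputs.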
Hidden units may also have non-centered activations. One trick is to replace logistic units with tanh units, whose outputs are centered around zero.   
A recent method called batch normalization explicitly centers each hidden activation. 

### Solution: Momentum 
$$p\leftarrow \mu p - \alpha \frac{\partial J}{\partial \theta}, \theta \leftarrow \theta + p$$
where $\alpha$ is the learning rate and $\mu$ is the damping parameter. We need $\mu < 1$; otherwise the momentum never diminishes. 

Momentum dampens the oscillations. In the low curvature directions, the gradients point in the same direction, allowing the parameters to pick up speed.  
If the gradient is constant, the parameters will reach a terminal velocity of 
$$-\frac{\alpha}{1-\mu}\frac{\partial J}{\partial \theta}$$
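
The terminal velocity can be checked by iterating the momentum update with a fixed gradient. This is a sketch with arbitrary values for the gradient, $\alpha$, and $\mu$.

```python
def momentum_velocity(grad, alpha=0.1, mu=0.9, steps=500):
    # Iterate p <- mu * p - alpha * grad with a constant gradient.
    p = 0.0
    for _ in range(steps):
        p = mu * p - alpha * grad
    return p

grad = 2.0
p_final = momentum_velocity(grad)
p_terminal = -0.1 / (1.0 - 0.9) * grad  # -alpha / (1 - mu) * grad
print(p_final, p_terminal)
```

With $\mu = 0.9$ the effective step is about ten times the plain GD step $-\alpha\,\partial J/\partial\theta$, which is how momentum picks up speed in low curvature directions.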
Momentum sometimes helps a lot, and almost never hurts.

### Solution: RMSprop
A variant of SGD which rescales each coordinate of the gradient to have norm 1 on average by keeping an exponential moving average $s_j$ of the squared gradients. 
$$s_j \leftarrow (1-\gamma)s_j + \gamma \left(\frac{\partial J}{\partial \theta_j}\right)^2$$
$$\theta_j \leftarrow \theta_j - \frac{\alpha}{\sqrt{s_j + \epsilon}}\frac{\partial J}{\partial \theta_j}$$
If the eigenvectors of the Hessian are axis-aligned, then RMSprop can correct for the curvature. 
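
A sketch of one RMSprop step on an axis-aligned quadratic, following the two updates above. The curvatures `lams`, the hyperparameters, and the zero initialization of $s_j$ are assumptions made for the illustration.

```python
import math

def rmsprop_step(theta, s, grad, alpha=0.01, gamma=0.1, eps=1e-8):
    # Update the moving average of squared gradients, then rescale each coordinate.
    s = [(1.0 - gamma) * sj + gamma * gj * gj for sj, gj in zip(s, grad)]
    theta = [tj - alpha / math.sqrt(sj + eps) * gj
             for tj, sj, gj in zip(theta, s, grad)]
    return theta, s

lams = [100.0, 1.0]  # hypothetical axis-aligned curvatures

def grad_J(theta):
    # Gradient of J = 0.5 * sum_j lams[j] * theta[j]^2.
    return [lam * t for lam, t in zip(lams, theta)]

theta0 = [1.0, 1.0]
theta1, s1 = rmsprop_step(theta0, [0.0, 0.0], grad_J(theta0))
rmsprop_moves = [a - b for a, b in zip(theta0, theta1)]
gd_moves = [0.01 * g for g in grad_J(theta0)]  # plain GD with the same alpha
print(rmsprop_moves, gd_moves)
```

Plain GD moves 100 times further along the high curvature axis, while the RMSprop step is nearly the same size along both axes.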

## Mini-batch Training
Each entire pass over the dataset is called an __epoch__.  
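Stochastic gradients computed on larger mini-batches have smaller variance. For a mini-batch of $S$ independently sampled examples,
$$\operatorname{Var}\left(\frac1S \sum_{i=1}^S \frac{\partial \mathcal L^{(i)}}{\partial \theta_j}\right) = \frac{1}{S^2}\operatorname{Var}\left(\sum_{i=1}^S\frac{\partial \mathcal L^{(i)}}{\partial \theta_j}\right) = \frac1S\operatorname{Var}\left(\frac{\partial \mathcal L^{(i)}}{\partial \theta_j}\right)$$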
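
The $1/S$ scaling can be checked by simulation. The sketch below stands in for per-example gradients with independent Gaussian draws, which is an assumption made purely for illustration, and estimates the variance of the mini-batch mean for a few batch sizes.

```python
import random

def batch_mean_variance(S, trials=5000, sigma=2.0, seed=0):
    # Empirical variance of the mean of S independent "per-example gradients".
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, sigma) for _ in range(S)) / S
             for _ in range(trials)]
    m = sum(means) / trials
    return sum((x - m) ** 2 for x in means) / trials

for S in (1, 4, 16):
    print(S, batch_mean_variance(S), 2.0 ** 2 / S)
```

The empirical variance tracks $\sigma^2/S$ closely, so quadrupling the batch size cuts the gradient variance by about a factor of four.
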
### Batch size
 - large batches converge in fewer weight updates because each stochastic gradient is less noisy
 - small batches perform more weight updates per second because each one requires less computation

#### Training time and Parallel Computations
- Small batches: an update with size 10 isn't much more expensive than size 1
- Once the batch size is large enough to saturate the hardware, the cost becomes linear in the batch size. GPUs tend to favor larger batch sizes. 

#### Convergence
- small batches have large gradient noise, so there is a large benefit from increasing the batch size
- with large batches, SGD already approximates the batch gradient descent update, so there is little further benefit from variance reduction. 