## Basic Algorithms
Gradient descent can be accelerated considerably by using stochastic gradient descent.

### Stochastic Gradient Descent
SGD and its variants are the most widely used optimization algorithms for deep learning.

The idea: follow the gradient of randomly selected minibatches downhill.
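As a minimal sketch of this idea, the snippet below fits a one-dimensional linear model by minibatch SGD on squared error. The function name `sgd_linear_fit`, the toy data, and all hyperparameter values are illustrative assumptions, not part of these notes. The learning rate is kept constant only because the toy targets are noiseless.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, steps=2000, seed=0):
    """Fit y ~ w*x + b by minibatch SGD on squared error (toy example)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Sample a minibatch of indices uniformly at random.
        batch = [rng.randrange(n) for _ in range(batch_size)]
        # Average gradient of 0.5 * (w*x + b - y)**2 over the minibatch.
        grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / batch_size
        # Step downhill along the minibatch gradient estimate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(-20, 21)]   # inputs in [-2, 2]
ys = [3.0 * x + 1.0 for x in xs]        # noiseless targets: true w = 3, b = 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update touches only `batch_size` examples, so its cost does not depend on the size of the dataset.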

Because the minibatch gradient estimate stays noisy even at a minimum, it is necessary to gradually decrease the learning rate over time.

-> In practice, it is common to decay the learning rate linearly until iteration τ:

$\epsilon_k = (1-\alpha) \epsilon_0 + \alpha \epsilon_\tau$, with $\alpha = \frac{k}{\tau}$

After iteration τ, it is common to leave $\epsilon$ constant.
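The schedule can be sketched as a small function, taking $\alpha = k/\tau$ as the interpolation weight. The name `linear_decay_lr` and the example values are assumptions chosen for illustration.

```python
def linear_decay_lr(k, eps0, eps_tau, tau):
    """Learning rate at iteration k: linear decay until tau, then constant."""
    if k >= tau:
        return eps_tau                  # hold the final rate after iteration tau
    alpha = k / tau                     # fraction of the decay phase completed
    return (1 - alpha) * eps0 + alpha * eps_tau

# Example: decay from 0.1 to 0.001 (1% of the initial rate) over 100 iterations.
for k in (0, 50, 100, 500):
    print(k, linear_decay_lr(k, eps0=0.1, eps_tau=0.001, tau=100))
```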

**Choosing the learning rate**: monitor learning curves that plot the objective function over time.

**Setting the parameters**:
- τ may be set to the number of iterations required to make a few hundred passes through the training set.
- $\epsilon_\tau$ should be set to roughly $1\%$ of the value of $\epsilon_0$
- $\epsilon_0$
    - Too large: the learning curve will show violent oscillations
    - Too low: learning will be slow, or it may get stuck at a high cost value
    
Properties of SGD:
- Computation time per update does not grow with the number of training examples -> allows convergence even when the number of training examples is very large.

To study the convergence rate -> measure **excess error**: $J(\theta) - \min_\theta J(\theta)$

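Generalization error cannot decrease faster than $\mathcal{O}(\frac{1}{k})$ (Cramér-Rao bound), so it may not be worthwhile to pursue an optimization algorithm that converges faster than $\mathcal{O}(\frac{1}{k})$: faster convergence presumably corresponds to overfitting.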

With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence.

### Momentum
Momentum is designed to accelerate learning by introducing a variable $v$ that plays the role of velocity.

-> SGD with momentum.

**Physics perspective**:

The particle experiences net force $f(t)$:
- One force is proportional to the negative gradient of the cost function: $-\nabla_\theta J(\theta)$
- One force is proportional to $-v(t)$ (viscous drag) -> makes the particle lose its energy over time.
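These notes do not write out the update rule. The sketch below assumes the common form $v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta)$, $\theta \leftarrow \theta + v$, where $\alpha \in [0, 1)$ plays the role of the drag term. The toy quadratic cost, the function names, and the hyperparameter values are illustrative assumptions.

```python
def grad(theta):
    """Gradient of the toy cost J(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = theta
    return (x, 25.0 * y)

def gd_momentum(theta, lr=0.03, momentum=0.9, steps=300):
    """Gradient descent with momentum on the toy cost above."""
    v = (0.0, 0.0)                       # velocity starts at rest
    for _ in range(steps):
        g = grad(theta)
        # Velocity: decayed previous velocity plus a step along -gradient.
        v = tuple(momentum * vi - lr * gi for vi, gi in zip(v, g))
        # Position moves by the velocity, not by the raw gradient step.
        theta = tuple(ti + vi for ti, vi in zip(theta, v))
    return theta

theta = gd_momentum((5.0, 5.0))
print(theta)
```

The full gradient is used here to keep the example deterministic; with SGD, `grad` would be replaced by a minibatch estimate.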

### Nesterov Momentum
The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum, the gradient is evaluated after the current velocity is applied.
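A minimal side-by-side sketch of the two update steps, assuming the velocity update $v \leftarrow \alpha v - \epsilon g$ followed by $\theta \leftarrow \theta + v$. The only difference is the point at which `grad` is called. The one-dimensional toy cost and the function names are assumptions made for illustration.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = 0.5 * theta**2 (illustrative)."""
    return theta

def momentum_step(theta, v, lr, momentum):
    g = grad(theta)                      # gradient at the current point
    v = momentum * v - lr * g
    return theta + v, v

def nesterov_step(theta, v, lr, momentum):
    g = grad(theta + momentum * v)       # gradient at the look-ahead point
    v = momentum * v - lr * g
    return theta + v, v

# One step from the same state shows the two rules diverge.
print(momentum_step(1.0, -0.5, lr=0.1, momentum=0.9))
print(nesterov_step(1.0, -0.5, lr=0.1, momentum=0.9))
```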

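In the convex batch gradient case -> Nesterov momentum improves the convergence rate of the excess error from $\mathcal{O}(1/k)$ to $\mathcal{O}(1/k^2)$.

In the stochastic gradient case -> Nesterov momentum does not improve the rate of convergence.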

## Parameter Initialization Strategies
Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations.

-> Training deep models is strongly affected by the choice of initialization.

The initial parameters need to “break symmetry” between different units.

The goal of having each unit compute a different function motivates random initialization of the parameters. -> draw the weights from a Gaussian or uniform distribution.

Initialization acts loosely like a prior that the final parameters should be close to the initial parameters.

-> For a fully connected layer with $m$ inputs and $n$ outputs, draw from $\mathcal{U}(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}})$ or use **normalized initialization** $W_{i,j} \sim \mathcal{U}\left(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}}\right)$
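Both heuristics can be sketched in a few lines, taking $m$ as the number of inputs and $n$ as the number of outputs of a fully connected layer. The function names and the `m`-rows-by-`n`-columns layout are assumptions made for illustration.

```python
import math
import random

def simple_uniform_init(m, n, seed=0):
    """Sample an m-by-n weight matrix from U(-1/sqrt(m), 1/sqrt(m))."""
    rng = random.Random(seed)
    limit = 1.0 / math.sqrt(m)
    return [[rng.uniform(-limit, limit) for _ in range(n)] for _ in range(m)]

def normalized_init(m, n, seed=0):
    """Sample an m-by-n weight matrix from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (m + n))
    return [[rng.uniform(-limit, limit) for _ in range(n)] for _ in range(m)]

W = normalized_init(3, 5)
for row in W:
    print(row)
```

Because every entry is drawn independently, no two units start with identical weights, which is what breaks the symmetry.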

In practice, we usually need to treat the scale of the weights as a hyperparameter whose optimal value lies somewhere roughly near but not exactly equal to the theoretical predictions.

### Setting biases
Setting the biases to zero is compatible with most weight initialization schemes. 

There are a few situations where we may set some biases to non-zero values:
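- For an output unit, it is often beneficial to initialize the bias so that the output has the right marginal statistics.
- Biases may be chosen to avoid causing too much saturation at initialization (e.g. 0.1 rather than 0 for a ReLU hidden unit).
- Sometimes a unit controls whether other units are able to participate in a function.
    - Ex: the forget gate of an LSTM model, whose bias is commonly initialized to 1.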
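As an illustration of the output-unit case, suppose the output is a softmax over classes and the initial weights are small enough that the bias dominates the output. Setting $b = \log c$ then makes the initial prediction match the class frequencies $c$. The function names and the example frequencies below are assumptions.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def output_bias_from_marginals(class_freqs):
    """Bias b such that softmax(b) equals the given class frequencies.

    Assumes every frequency is positive and that they sum to 1.
    """
    return [math.log(c) for c in class_freqs]

c = [0.7, 0.2, 0.1]                      # hypothetical class frequencies
b = output_bias_from_marginals(c)
print(softmax(b))
```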
    
**Choosing a variance or precision parameter**: We can usually initialize variance or precision parameters to 1 safely.

Another strategy is to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs.