# Optimisers and learning rates


The literature describes a vast number of optimisers, and most of them are available in PyTorch.
Different problems tend to work best with particular optimisers, although recent research suggests that SGD, paired with a suitable architecture, works well most of the time.




### Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.
It adds a fraction $\gamma$ of the previous update vector $u_{t-1}$ to the current gradient step, where $\eta$ is the learning rate:


$$ u_t = \gamma u_{t-1} + \eta \nabla_{\theta}J(\theta) $$

$$\theta \leftarrow \theta - u_{t}$$

$$ u_{0} = 0$$
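The update above can be sketched in a few lines of plain Python. This is a minimal illustration rather than PyTorch's implementation: the toy objective $J(\theta) = \theta^2$, the starting point and the values `lr=0.1` and `gamma=0.9` are assumptions chosen only to make the arithmetic easy to follow.

```python
def momentum_step(theta, u, grad, lr=0.1, gamma=0.9):
    """One momentum update: u_t = gamma * u_{t-1} + lr * grad, then theta <- theta - u_t."""
    u_new = gamma * u + lr * grad
    return theta - u_new, u_new


def minimise_quadratic(steps=300, lr=0.1, gamma=0.9):
    # Toy objective (an assumption for illustration): J(theta) = theta**2, so grad J = 2 * theta.
    theta, u = 5.0, 0.0  # u_0 = 0, as in the equations above
    for _ in range(steps):
        theta, u = momentum_step(theta, u, 2.0 * theta, lr, gamma)
    return theta


theta_final = minimise_quadratic()
```

In PyTorch the same idea is exposed through the `momentum` argument of `torch.optim.SGD`.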


### Adagrad

Vanilla SGD uses the same learning rate for every parameter.

Adagrad is an algorithm for gradient-based optimization that adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data. 
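To make the per-parameter behaviour concrete, here is a small sketch of the Adagrad rule in plain Python. It assumes the common formulation in which each parameter's step is divided by the square root of its accumulated squared gradients. The two-parameter setup (one parameter receiving a gradient at every step, the other only every tenth step) is a made-up example, not something from a real dataset.

```python
import math


def adagrad_step(theta, accum, grad, lr=0.01, eps=1e-10):
    """One Adagrad update over lists of parameters, accumulators and gradients."""
    # Accumulate squared gradients separately for each parameter.
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    # Scale each parameter's step by the root of its own accumulator.
    new_theta = [t - lr * g / (math.sqrt(a) + eps)
                 for t, a, g in zip(theta, new_accum, grad)]
    return new_theta, new_accum


theta, accum = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    # Parameter 0 sees a gradient every step ("frequent feature");
    # parameter 1 only every 10th step ("infrequent feature").
    grad = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, accum = adagrad_step(theta, accum, grad)

# Effective learning rate each parameter would use on its next non-zero gradient.
effective_lr = [0.01 / (math.sqrt(a) + 1e-10) for a in accum]
```

After the loop, the infrequently updated parameter keeps a larger effective learning rate than the frequently updated one, which is the behaviour described above.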
 
 
### Adadelta 


Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size, which in practice is implemented as an exponentially decaying average of squared gradients.
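The difference from Adagrad can be seen in a short plain-Python sketch. It assumes the usual Adadelta formulation with a decay factor `rho` and a small constant `eps`; the constant gradient of 1.0 and the starting value are illustrative assumptions. The point is that Adagrad's sum keeps growing, while Adadelta's decaying average stays bounded.

```python
import math


def adadelta_step(theta, avg_sq_grad, avg_sq_delta, grad, rho=0.9, eps=1e-6, lr=1.0):
    """One Adadelta update for a single scalar parameter."""
    # Decaying average of squared gradients (replaces Adagrad's ever-growing sum).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Step scaled by the ratio of RMS of past updates to RMS of gradients.
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return theta + lr * delta, avg_sq_grad, avg_sq_delta


theta, avg_sq_grad, avg_sq_delta = 5.0, 0.0, 0.0
adagrad_sum = 0.0  # what Adagrad would accumulate for the same gradients
for _ in range(200):
    grad = 1.0  # constant gradient, purely for illustration
    adagrad_sum += grad * grad
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(theta, avg_sq_grad, avg_sq_delta, grad)
```

Here `adagrad_sum` grows linearly with the number of steps, whereas `avg_sq_grad` settles near the squared gradient itself.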


### Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
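The two running averages can be sketched as follows in plain Python. This assumes the standard Adam formulation with bias correction; the toy objective $J(\theta) = \theta^2$, the starting point and the larger `lr=0.01` used in the loop are assumptions for illustration only.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # decaying average of gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from theta = 1.0 on J(theta) = theta**2 (gradient 2 * theta).
theta_first, _, _ = adam_step(1.0, 0.0, 0.0, 2.0, t=1)

# A longer run with a larger learning rate so the toy problem converges quickly.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, 2.0 * theta, t, lr=0.01)
```

Note that on the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.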




#### Choosing the learning rate

Adagrad, Adadelta and Adam in PyTorch each have a default learning rate:

```python
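import torch

params = [torch.nn.Parameter(torch.zeros(2))]  # placeholder parameters, for illustration only
torch.optim.Adam(params, lr=0.001)     # default learning rate
torch.optim.Adagrad(params, lr=0.01)   # default learning rate
torch.optim.Adadelta(params, lr=1.0)   # default learning rate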
```

Older PyTorch releases give SGD no default learning rate, so `lr` must be passed explicitly (newer releases may supply one). Either way, different problems may require different learning rates.

Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than $10^{-6}$.

A traditional default value for the learning rate is 0.01, and this may represent a good starting point.

Personally, I would do two runs, one with a rate around $10^{-4}$ and one with $10^{-2}$, observe the loss function and decide whether I need to move closer to $10^{-4}$ or to $10^{-2}$.
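The two-run comparison can be sketched on a toy problem. The example below is only an illustration under strong assumptions: it uses plain gradient descent on $J(\theta) = \theta^2$ instead of a real network, and the helper name `run_sgd` is hypothetical. With a real model you would record the training loss per epoch in the same way.

```python
def run_sgd(lr, steps=200, theta0=1.0):
    """Gradient descent on the toy loss J(theta) = theta**2; returns the loss history."""
    theta = theta0
    losses = []
    for _ in range(steps):
        losses.append(theta ** 2)   # record the loss before the update
        theta -= lr * 2.0 * theta   # gradient of theta**2 is 2 * theta
    return losses


losses_small = run_sgd(1e-4)  # run with a rate around 10^-4
losses_large = run_sgd(1e-2)  # run with a rate around 10^-2

print(f"lr=1e-4: first loss {losses_small[0]:.4f}, last loss {losses_small[-1]:.4f}")
print(f"lr=1e-2: first loss {losses_large[0]:.4f}, last loss {losses_large[-1]:.4f}")
```

On this toy loss the run with $10^{-4}$ barely moves while the run with $10^{-2}$ makes clear progress, which would suggest moving towards $10^{-2}$. A loss that oscillates or diverges would instead suggest moving towards the smaller rate.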






## Data Augmentation



A very common paradigm in Machine Vision/Deep Learning is what we call data augmentation.
Here, depending on the nature of the problem, we transform the input images and thus augment the training data set.

Augmentation techniques include injecting random noise into the input images to make the network more robust to perturbations. Depending on the nature of the problem, we can also use rotation, translation, mirroring, etc.

Data augmentation is essentially an almost free way to enlarge the training set and allow the network to learn a more robust representation of the input domain, and it is widely used in practice.
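As a rough sketch of the transformations mentioned above, the following plain-Python example treats an image as a nested list of pixel values. The function names, the noise level `sigma=0.05` and the 50% mirroring probability are all assumptions made for illustration.

```python
import random


def mirror(image):
    """Flip an image (list of rows) horizontally."""
    return [row[::-1] for row in image]


def translate_right(image, shift=1, fill=0.0):
    """Shift every row to the right by `shift` pixels, padding with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]


def add_noise(image, sigma=0.05, rng=random):
    """Add Gaussian noise with standard deviation `sigma` to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def augment(image, rng):
    """Randomly mirror the image, then inject noise; returns a new image."""
    out = mirror(image) if rng.random() < 0.5 else image
    return add_noise(out, rng=rng)


rng = random.Random(0)  # fixed seed so the run is repeatable
image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
augmented = [augment(image, rng) for _ in range(4)]  # four new training samples
```

In a PyTorch project these transformations would more likely come from a library such as `torchvision.transforms` (for example `RandomHorizontalFlip`), applied on the fly while loading each batch.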




