Hyperparameters can be divided into two categories:

1. **optimizer hyperparameters** - related to the optimization and training process rather than to the model itself
    - learning rate
    - mini-batch size
    - number of epochs
2. **model hyperparameters** - related to the structure of the model
    - number of layers and hidden units
    - architecture-specific parameters

## Optimizer hyperparameters

### Learning Rate (most important)

The learning rate is the multiplier applied to the gradient when the weights are updated. It controls how large a step we take in the direction that reduces the error.

- Good starting point: 0.01
- Common values: 0.1, 0.01, 0.001, 0.0001, 0.00001
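
To make the role of the multiplier concrete, here is a minimal sketch of plain gradient descent on a toy function. The function `f(w) = (w - 3)^2`, the helper name `gradient_descent`, and the step count are assumptions chosen for illustration, not part of these notes.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimize the toy function f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of f at the current w
        w = w - lr * grad       # the learning rate scales the size of the step
    return w


# Compare a few learning rates; the minimum of f is at w = 3.
for lr in (0.001, 0.01, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: w after 50 steps = {w:.4f}, distance to minimum = {abs(w - 3.0):.4f}")
```

On this toy problem a very small learning rate barely moves the weight in 50 steps, a moderate one gets close to the minimum, and a rate that is too large overshoots further on every step and diverges.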

<img src="part-4_images/simple_lr_example.png" alt="Simple (Ideal) LR example" style="width: 650px;"/>

Learning Rate Decay - a technique that decreases the learning rate by a certain factor as training progresses, so that the steps get smaller as we approach a minimum.
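
One simple form of decay is a step schedule. The sketch below is an illustration only: the name `step_decay` and the defaults (`factor=0.5`, `every=10`) are assumptions, not values recommended by these notes.

```python
def step_decay(initial_lr, epoch, factor=0.5, every=10):
    """Learning rate for a given epoch: multiplied by `factor` once every `every` epochs."""
    return initial_lr * factor ** (epoch // every)


# Print the schedule for a few epochs, starting from a learning rate of 0.1.
for epoch in (0, 9, 10, 20, 30):
    print(f"epoch {epoch:2d}: lr = {step_decay(0.1, epoch)}")
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` provides a schedule of this kind through its `step_size` and `gamma` arguments.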

There are also adaptive learning rate algorithms (such as Adam and Adagrad) that increase or decrease the learning rate automatically depending on how training is going.
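
As a rough illustration of the adaptive idea, here is a minimal single-weight sketch in the style of Adagrad (see the sources below). The helper name `adagrad`, the toy gradient, and the defaults are assumptions made for this example; real implementations keep one accumulator per parameter.

```python
import math


def adagrad(grad_fn, w, lr=0.5, steps=200, eps=1e-8):
    """Adagrad-style update for a single weight.

    The step is divided by the square root of the accumulated squared
    gradients, so the effective learning rate shrinks as training goes on.
    """
    accum = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        accum += g * g                          # running sum of squared gradients
        w -= lr * g / (math.sqrt(accum) + eps)  # scaled step
    return w


# Toy problem (an assumption): minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adagrad(lambda w: 2.0 * (w - 3.0), w=0.0)
print(f"w after training: {w_final:.4f}")
```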

Sources to understand gradient based methods:
- [Adam](https://arxiv.org/pdf/1412.6980.pdf)
- [Intro to gradient based methods](https://tao.lri.fr/tiki-download_wiki_attachment.php?attId=954)
- [Methods for convex optimization](https://ppasupat.github.io/a9online/uploads/proximal_notes.pdf)
- [Adagrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

### Minibatch Size

Mini-batch size - the number of training examples fed to the model in a single training step (one weight update).

Mini-batch sizes are most often powers of two between 1 and 256: 1, 2, 4, ..., 256.
- too small -> training is too slow
- too large -> training is too computationally expensive

In practice, smaller mini-batch sizes give noisier gradient estimates, but that noise can be useful because it can keep training from getting stuck in a local minimum. If we change the batch size, we usually have to adjust the learning rate as well.
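
The trade-off shows up in the number of weight updates per epoch. The sketch below splits a dataset into shuffled mini-batches; the helper name `minibatches` and the dataset of 1000 integers are assumptions for illustration.

```python
import random


def minibatches(examples, batch_size, seed=0):
    """Yield shuffled mini-batches; the last batch may be smaller than batch_size."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Stand-in dataset: 1000 "examples".
data = list(range(1000))
for batch_size in (1, 32, 256):
    n_updates = sum(1 for _ in minibatches(data, batch_size))
    print(f"batch_size={batch_size}: {n_updates} weight updates per epoch")
```

A batch size of 1 means one update per example (many small, noisy steps), while a large batch size means only a handful of updates per epoch, each computed over many examples at once.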

For an in-depth evaluation of different hyperparameter choices, including mini-batch size, see [this paper](https://arxiv.org/pdf/1606.02228.pdf).

### Number of Training Iterations / Epochs

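The metric we need to focus on is the validation error: we should keep training only as long as it keeps decreasing.

The number of training iterations is a hyperparameter we can optimize automatically using a technique called early stopping, which halts training once the validation error has stopped improving.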

PyTorch does not implement early stopping in its core library, but the [Ignite library](https://pytorch.org/ignite/handlers.html) provides an `EarlyStopping` handler.
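
The idea behind early stopping can be sketched without any library. The version below is a minimal, framework-free illustration and not Ignite's implementation: the function name `early_stopping`, the `patience` default, and the validation errors are all made up for the example.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, stopped_at) for a sequence of per-epoch validation errors.

    Training stops once the validation error has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = None
    stopped_at = None
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        stopped_at = epoch
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stopped_at


# Hypothetical validation errors: they improve up to epoch 3, then plateau.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stopped_at = early_stopping(val_errors)
print(f"best epoch: {best_epoch}, training stopped at epoch: {stopped_at}")
```

In a real training loop we would also save the model weights at the best epoch and restore them after stopping.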

## Model hyperparameters

### Number of Hidden Units / Layers

- the hidden units are what give the model its capacity to learn: it needs enough of them to approximate the target function
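
One rough proxy for capacity is the number of trainable parameters. The sketch below counts weights and biases of a fully connected network; the helper name `count_parameters` and the layer sizes (784 inputs, 128 hidden units, 10 outputs) are hypothetical values chosen for illustration.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network.

    `layer_sizes` lists the width of every layer, input and output included.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


one_hidden = count_parameters([784, 128, 10])        # one hidden layer
two_hidden = count_parameters([784, 128, 128, 10])   # two hidden layers
print(f"one hidden layer:  {one_hidden} parameters")
print(f"two hidden layers: {two_hidden} parameters")
```

Adding hidden units or layers increases this count. More parameters give the model more room to fit the function, but also more room to overfit, which is why the validation error remains the metric to watch.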

"in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component for a good recognition system (e.g. on order of 10 learnable layers)." ~ Andrej Karpathy in https://cs231n.github.io/neural-networks-1/

Additional resource: http://www.deeplearningbook.org/contents/ml.html

### RNN Hyperparameters

In practice, LSTMs and GRUs have been shown to perform better than vanilla RNNs. Which of the two performs better, however, remains an open question.

### Hyperparameter resources

- [Practical recommendations for gradient-based training of deep architectures](https://arxiv.org/abs/1206.5533)
- [Deep Learning book - chapter 11.4: Selecting Hyperparameters](http://www.deeplearningbook.org/contents/guidelines.html)
- [Neural Networks and Deep Learning book - Chapter 3: How to choose a neural network's hyper-parameters?](http://neuralnetworksanddeeplearning.com/chap3.html#how_to_choose_a_neural_network's_hyper-parameters)
- [Efficient BackProp (pdf)](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)

#### Specialized resources

- [How to Generate a Good Word Embedding?](https://arxiv.org/abs/1507.05523)
- [Visualizing and Understanding Recurrent Networks](https://arxiv.org/abs/1506.02078)

