# Parameters And Hyperparameters


Model _parameters_ are the values that the neural network adjusts by itself during training, e.g. the neuron weights and biases. On the other hand, _hyperparameters_ are external and set by the person building the neural network; they shape the network and its learning process, e.g. the learning rate or the activation function of each layer. Choosing good hyperparameters usually requires experience and trial and error, as there is no fixed set of rules to follow.

There are two kinds of hyperparameters:

- Hyperparameters related to the network structure: number of layers, neurons per layer, activation function, initial neuron weights... In essence, those that set the structure of the neural network.

- Hyperparameters related to the training algorithm: epochs, batch size, learning rate, momentum... Those that shape the learning process. In this notebook, we will focus on these.
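To make the distinction concrete, here is a minimal sketch in plain Python. The toy data, the single weight `w` and the chosen values are assumptions made for illustration only: the hyperparameters are fixed by us before training, while the parameter is changed by the training loop.

```python
# Hyperparameters: chosen by us before training and never changed by it.
learning_rate = 0.05
epochs = 50

# Toy data following y = 2 * x (illustrative, not taken from the notebook).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Parameter: a single weight, updated by the training loop itself.
w = 0.0

for _ in range(epochs):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad
```

After the loop, `w` ends up very close to 2, the slope of the toy data, although we never set it by hand.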

## Hyperparameters related to the training algorithm

#### Epochs

The number of epochs is the number of times the whole training dataset is passed through the neural network during the learning process. As we have seen in previous notebooks, choosing a suitable value for this hyperparameter is pivotal, as too many epochs might lead to overfitting. On the other hand, too few epochs might leave the model insufficiently trained.


#### Batch Size 

The training dataset can be split into several subsets, called batches; the batch size is the number of samples in each of them. The weights are then updated after each batch instead of after the whole dataset, which usually gives better results with gradient descent. Several factors affect the optimal batch size, but we will not cover them for now.
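As a rough illustration of how epochs and batch size interact, the sketch below splits a toy dataset into batches and counts the resulting weight updates. The helper `iterate_minibatches` and all the values are hypothetical, chosen only for this example.

```python
import math

def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-in for 10 training samples
batch_size = 4
epochs = 3

batches = list(iterate_minibatches(samples, batch_size))

# One weight update per batch, repeated for every epoch.
updates_per_epoch = math.ceil(len(samples) / batch_size)
total_updates = epochs * updates_per_epoch
```

Note that the last batch may be smaller than the others when the dataset size is not a multiple of the batch size.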


#### Learning Rate and Learning Rate Decay

The learning rate is the factor that multiplies the gradient in the gradient descent algorithm. For instance, if the gradient has a value of 1 and the learning rate is 0.1, the point selected to run the next iteration of the algorithm will be at a distance of $0.1 \cdot 1 = 0.1$ from the previous point.
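The update described above can be sketched in a couple of lines. The function name `gradient_descent_step` and the starting point are assumptions made for this example; the gradient and learning rate are the ones from the text.

```python
def gradient_descent_step(point, gradient, learning_rate):
    """Return the next point: move against the gradient, scaled by the learning rate."""
    return point - learning_rate * gradient

# Hypothetical starting point; gradient and learning rate as in the text.
point = 1.0
next_point = gradient_descent_step(point, gradient=1.0, learning_rate=0.1)
step_size = abs(next_point - point)  # approximately 0.1
```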

Setting a high learning rate is convenient when we want to approach the minimum in few steps. However, since the steps are big, the algorithm might jump over the minimum and fail to converge. A low learning rate makes convergence more likely, but it is not the best option either, as it takes many more steps and therefore makes training slower.
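The following sketch compares three learning rates on the toy function $f(x) = x^2$, whose gradient is $2x$ and whose minimum is at $x = 0$. The function, the starting point and the chosen values are assumptions made only to illustrate the trade-off.

```python
def run_gradient_descent(learning_rate, x0=5.0, steps=20):
    """Minimize f(x) = x**2 (gradient 2 * x) and return the final point."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return x

too_low = run_gradient_descent(0.01)   # still far from 0 after 20 steps
suitable = run_gradient_descent(0.4)   # very close to 0
too_high = run_gradient_descent(1.1)   # each step overshoots more: diverges
```

With the low rate the point barely moves, with the high rate each step lands further from the minimum than the previous one, and only the intermediate value gets close to 0 within the 20 steps.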


<img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png" width="400" height="400" />


A common approach to solve this problem is adding a "learning rate decay" hyperparameter. As the name suggests, this hyperparameter decreases the learning rate with each iteration/epoch. This way, we can get close to the minimum quickly and still have enough precision in the later steps for the algorithm to converge.
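There are several decay schedules. As one possible example, the sketch below uses a time-based decay, where the learning rate at a given epoch is the initial rate divided by $1 + \text{decay} \cdot \text{epoch}$. The function name and the values are assumptions for illustration, not necessarily the schedule used by any particular library.

```python
def decayed_learning_rate(initial_lr, decay, epoch):
    """Time-based decay: the learning rate shrinks as the epoch index grows."""
    return initial_lr / (1.0 + decay * epoch)

initial_lr = 0.1  # illustrative starting learning rate
decay = 0.5       # illustrative decay hyperparameter

# Learning rate used in each of the first five epochs.
schedule = [decayed_learning_rate(initial_lr, decay, epoch) for epoch in range(5)]
```

The first epochs take large steps, and the later ones take progressively smaller steps.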


#### Momentum

The image above suggests that gradient descent always leads to a global minimum. However, this is not usually the case: the algorithm can get stuck in a local minimum. There are two common ways to deal with this: we can run the algorithm several times from different starting points, hoping that one of the runs finds the global minimum, or we can use _momentum_. Gradient descent with momentum takes the past gradients into account when computing the next update: it computes an exponentially weighted average of the previous gradients and uses that average instead of the current gradient alone. In principle, this gives the algorithm some 'help' to go past local minima.
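A minimal sketch of this update rule in plain Python is shown below. The function names, the toy objective $f(x) = x^2$ and the default values are assumptions for illustration; library optimizers may scale the terms differently.

```python
def momentum_step(x, velocity, gradient, learning_rate, beta):
    """One update of gradient descent with momentum.

    `velocity` is the exponentially weighted average of past gradients.
    """
    velocity = beta * velocity + (1.0 - beta) * gradient
    x = x - learning_rate * velocity
    return x, velocity

def minimize_with_momentum(grad_fn, x0, learning_rate=0.1, beta=0.9, steps=500):
    """Repeat the momentum update `steps` times, starting from `x0`."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        x, velocity = momentum_step(x, velocity, grad_fn(x), learning_rate, beta)
    return x

# Toy objective f(x) = x**2, whose gradient is 2 * x and whose minimum is at x = 0.
x_final = minimize_with_momentum(lambda x: 2.0 * x, x0=5.0)
```

Because the update follows the averaged gradient, the point keeps moving for a while even where the current gradient is small, which is what can carry it past a shallow local minimum.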

With Keras, it is easy to add momentum to the optimizer. In the following example, we use stochastic gradient descent along with Nesterov momentum (for more information, check https://dominikschmidt.xyz/nesterov-momentum/), which is one of the ways to apply momentum.

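```python
# Assumes Keras is used through TensorFlow
from tensorflow.keras import optimizers

# SGD with Nesterov momentum; the momentum value goes from 0 to 1
sgd = optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
```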



#### Activation Functions


An activation function determines the output of a neuron from the weighted sum of its inputs. In other words, it 'decides' whether, and how strongly, a neuron passes its signal on to the neurons it is connected to in the following layer. There are several activation functions, but the most popular are the following:


- **Linear**: $y = x$ 

The neuron signal is passed on unchanged.




- **Sigmoid**: $ y = \frac{1}{1+e^{-x}} $ 

Its usefulness relies on the fact that its values range from 0 to 1 and that inputs far from zero are mapped very close to these extremes, so it is well suited to binary classification (e.g. to activate or not activate a neuron).



<img src="https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg" width="400" height="400" />



- **Hyperbolic Tangent**: $ y = \tanh(x) = \frac{\sinh(x)}{\cosh(x)} $ 

Similar to the sigmoid, but in this case the range is from -1 to 1, which might be more convenient for some neural networks.


<img src="https://upload.wikimedia.org/wikipedia/commons/8/87/Hyperbolic_Tangent.svg" width="400" height="400" />



- **Softmax function**: $ y_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $

It is a generalization of the sigmoid function (as used in logistic regression) to non-binary classification: it turns a vector of scores into probabilities that add up to 1. It is usually used in the final layer of a neural network, as we have seen when we created our first neural network.



- **Rectified Linear Unit (ReLU) function**: $ y = \max(0, z) $

It only activates a neuron if the input is above a certain threshold. Above the threshold, the output is a linear function of the input; below it, the output is zero. The image below shows the case where the threshold is set to $z=0$:

<img src="https://miro.medium.com/max/357/1*oePAhrm74RNnNEolprmTaQ.png" width="400" height="400" />
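
The activation functions listed above can be written in a few lines of plain Python. This is only a sketch for single values (and a list of values for softmax); deep learning libraries provide their own vectorized and numerically safer versions.

```python
import math

def linear(x):
    return x

def sigmoid(x):
    # Simple form; very large negative inputs would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.sinh(x) / math.cosh(x)

def relu(x):
    # Threshold at 0: negative inputs give 0, positive inputs pass through.
    return max(0.0, x)

def softmax(xs):
    # Subtracting the maximum does not change the result but avoids overflow.
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [value / total for value in exps]

# Three illustrative class scores turned into probabilities that add up to 1.
probabilities = softmax([1.0, 2.0, 3.0])
```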

## References


#### Literature

_Gradient Descent With Momentum_ (https://engmrk.com/gradient-descent-with-momentum/)

_Python Deep Learning_ , by Jordi Torres (https://www.marcombo.com/python-deep-learning-9788426728289/)


#### Images 

Gradient descent: (https://www.jeremyjordan.me/nn-learning-rate/)

Sigmoid function: (https://en.wikipedia.org/wiki/Sigmoid_function)

Hyperbolic tangent (Spanish): (https://es.wikipedia.org/wiki/Tangente_hiperb%C3%B3lica)

ReLU: (https://medium.com/@kanchansarkar/relu-not-a-differentiable-function-why-used-in-gradient-based-optimization-7fef3a4cecec)


