# Training A Neural Network

In this notebook we will briefly go through some of the theoretical aspects of neural network training.

Remember that neural network learning is just the process through which we adjust the weights and biases of the network for a given set of data. In order to make these adjustments, we need to take into account two things: the _loss function_ and the _optimizer_.

The _loss function_ is basically a way of measuring the error between the neural network predictions and the true labels. The value given by the loss function is then used by the _optimizer_, which is in charge of adjusting the neural network parameters in order to minimize the error.

In essence, training a neural network requires three steps:

1. Forward propagation: data is fed into the neural network, passes through all the layers and ends up producing a prediction.
2. Error estimation: we use a loss function to estimate the error and see how good our model is.
3. Backward propagation: the error given by the loss function is propagated backward through the network (i.e. the first neurons to be adjusted are the ones closest to the output), and the optimizer adjusts the parameters in order to minimize it.
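The three steps can be sketched with a toy example. This is only an illustration under stated assumptions: the "network" is a single neuron `y_pred = w * x + b`, the loss is the squared error, and the sample, the initial values and `learning_rate` are made up for this sketch.

```python
# Toy model (an assumption for illustration): one neuron, y_pred = w * x + b
w, b = 0.0, 0.0
x, y_true = 2.0, 4.0      # a single made-up training sample
learning_rate = 0.05      # hypothetical step size

# 1. Forward propagation: compute the prediction
y_pred = w * x + b

# 2. Error estimation: squared error used as the loss function
loss_before = (y_pred - y_true) ** 2

# 3. Backward propagation: derivatives of the loss with respect to each
#    parameter (chain rule), followed by a gradient descent update
grad_w = 2 * (y_pred - y_true) * x
grad_b = 2 * (y_pred - y_true)
w -= learning_rate * grad_w
b -= learning_rate * grad_b

# Repeating the forward pass shows that the error went down
loss_after = (w * x + b - y_true) ** 2
print(loss_before, loss_after)
```

In a real network the same cycle is repeated many times over many samples, and Keras performs all three steps for us inside `model.fit`.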

## Gradient Descent 

We will now cover one of the most popular optimizers in Deep Learning, _Gradient Descent_, and some of its variations (_Batch Gradient Descent_, _Mini Batch Gradient Descent_ and _Stochastic Gradient Descent_).

#### Basic Gradient Descent

The basic idea of this optimizer is to adjust the parameters iteratively so as to minimize the loss function.

Gradient descent uses the first derivative (hence _gradient_) of the loss function with respect to each parameter, which is obtained by chaining the derivatives of each network layer with the chain rule. Applying the chain rule layer by layer is essentially what backpropagation is. Since the gradient at a point ($w$ in the figure below) points in the direction in which the function increases fastest, we must move in the opposite direction. That is, we subtract the gradient multiplied by a factor ($\alpha$ in the figure) from the current point, $w \leftarrow w - \alpha \, \partial L / \partial w$ (where $L$ is the loss), so that the parameters follow the direction of steepest descent, which eventually leads to a (local) minimum (for more information, especially the equations, please have a look at https://en.wikipedia.org/wiki/Gradient_descent).




<img src="https://i.ytimg.com/vi/b4Vyma9wPHo/maxresdefault.jpg" width="400" height="400" />


Bear in mind that the figure shows a simplified case, as the chain rule has not been applied to the gradient.
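As a minimal sketch of the update rule, the snippet below runs gradient descent on a one-dimensional function. The quadratic `loss`, the starting point, the step size `alpha` and the number of steps are all assumptions chosen for illustration; they do not come from a real network.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # First derivative of the loss above
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha=0.1, steps=50):
    # Repeatedly step against the gradient, scaled by alpha
    for _ in range(steps):
        w = w - alpha * gradient(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

Each iteration moves `w` a little closer to the minimum, so after enough steps `w_final` ends up very close to 3.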


#### Batch Gradient Descent

In this variant, each parameter update uses the gradient averaged over the whole dataset, so there is exactly one update per epoch.

#### Stochastic Gradient Descent

In this case each parameter update uses a single sample of the dataset instead of all of them. Updates are much cheaper and far more frequent, which often speeds up training, but the gradient estimate is noisy, and the method is not suitable for all optimization techniques.

#### Mini Batch Gradient Descent

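This technique lies in the middle of the previous two, as it uses the average gradient of a subset of the dataset (a _mini batch_) instead of a single sample or the whole dataset. In practice it is usually the preferred choice, since it balances the stability of the gradient estimate against the cost of each update.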




To choose the batch size in Keras, we can pass the `batch_size` argument to `fit`:

`model.fit(x_train, y_train, epochs=5, batch_size=100)`
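To see how the batch size alone distinguishes the three variants, here is a small pure-Python sketch. It is not how Keras is implemented: the `train` function, the one-weight model `y = w * x`, the made-up data and the hyperparameter values are all assumptions for illustration.

```python
import random

def train(data, batch_size, epochs=100, learning_rate=0.05, seed=0):
    """Fit y = w * x by gradient descent; returns (w, number_of_updates)."""
    rng = random.Random(seed)
    data = list(data)
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of the squared error over the current batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates

# Made-up samples generated with a true weight of 2
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

w_batch, n_batch = train(data, batch_size=len(data))  # batch gradient descent
w_mini, n_mini = train(data, batch_size=2)            # mini batch gradient descent
w_sgd, n_sgd = train(data, batch_size=1)              # stochastic gradient descent
print(n_batch, n_mini, n_sgd)
```

The smaller the batch, the more updates are performed per epoch: one with the full dataset, two with batches of two samples, and four with single samples.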

## Loss functions

In previous notebooks we have already seen one loss function, _categorical_crossentropy_. We used it because we needed a categorical (multi-class) outcome. In this case, the last layer must have as many neurons as there are classes, and we used the _softmax_ activation function so that its outputs form a probability distribution over them. However, if we are dealing with binary classification, we usually use _binary_crossentropy_ as the loss function and _sigmoid_ as the activation function of a single output neuron.

As you can see, depending on the type of problem we are dealing with, one loss function might be more suitable than another. For this reason, we will introduce the different loss functions in the notebooks where they are needed.
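The two pairings mentioned above can be sketched for a single sample in plain Python. This only illustrates the underlying formulas; it is not the Keras implementation (which, among other things, averages over the batch), and the helper functions and input values are hypothetical.

```python
import math

def sigmoid(z):
    # Squashes one raw output into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Turns one raw output per class into probabilities that sum to 1
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, p):
    # y_true is 0 or 1, p is the predicted probability of class 1
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(one_hot, probs):
    # one_hot marks the true class, probs are the predicted probabilities
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Binary case: one output neuron with a sigmoid activation
p = sigmoid(2.0)
bce_good = binary_crossentropy(1, p)  # label agrees with the prediction
bce_bad = binary_crossentropy(0, p)   # label disagrees with the prediction

# Multi-class case: one output neuron per class with softmax
probs = softmax([2.0, 1.0, 0.1])
cce_good = categorical_crossentropy([1, 0, 0], probs)  # most likely class is true
cce_bad = categorical_crossentropy([0, 0, 1], probs)   # least likely class is true
print(bce_good, bce_bad, cce_good, cce_bad)
```

In both cases the loss is small when the model assigns a high probability to the true class and large when it does not.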

## Optimizers 

Keras offers plenty of optimizers and also lets us change their hyperparameters. As an example, have a look at the code below:

```python
from tensorflow.keras.optimizers import RMSprop  # RMSprop is one type of optimizer

# `model` is assumed to be a Keras model built earlier
my_optimizer = RMSprop(learning_rate=0.001)  # set the learning rate to 0.001
model.compile(optimizer=my_optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

When creating the optimizer we passed an argument, the _learning rate_, which we will explain in the following notebooks.

## References

_Python Deep Learning_, by Jordi Torres (https://www.marcombo.com/python-deep-learning-9788426728289/)

Gradient Descent image extracted from _PyTorch Lecture 3: Gradient Descent_ (https://www.youtube.com/watch?v=b4Vyma9wPHo)