<a href="https://colab.research.google.com/github/luigiselmi/dl_tensorflow/blob/main/mathematical_foundations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mathematical foundations
A neural network (NN) is a computational model of a function. It is built from a set of connected units, called neurons, and can learn a mapping between inputs from one domain and outputs in the same or a different domain.

A unit of a NN is represented as a nonlinear function of an affine transformation of a set of inputs

$$z = \sigma\left(b + \sum_i w_i x_i\right)$$

where $x_i$ are the inputs, $w_i$ are the weights (the parameters), $b$ is a bias value, analogous to the intercept of a linear function, $z$ is the output, and $\sigma$ is the activation function that enables the NN to represent nonlinear functions. The units of a NN are organized in layers, each containing a number of units. Each unit of a layer receives its inputs from the units in the previous layer and sends the result of its computation to the units in the next layer. A NN can therefore be represented as a set of nested functions, e.g. for three layers $h$, $g$, $f$

$$\hat y(x, w) = f(g(h(x, w_h), w_g), w_f)$$
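The unit equation and the nesting of layers can be sketched in plain Python. This is only an illustration: the logistic sigmoid is assumed as the activation function $\sigma$, and the layer sizes and parameter values (`w_h`, `b_h`, and so on) are hypothetical numbers chosen for the example.

```python
import math

def sigmoid(a):
    """Logistic function, assumed here as the activation sigma."""
    return 1.0 / (1.0 + math.exp(-a))

def unit(x, w, b):
    """One unit: z = sigma(b + sum_i w_i * x_i)."""
    return sigmoid(b + sum(w_i * x_i for w_i, x_i in zip(w, x)))

def layer(x, weights, biases):
    """A layer is a list of units that all read the same inputs."""
    return [unit(x, w, b) for w, b in zip(weights, biases)]

# Hypothetical parameters for three layers h, g, f (2 inputs -> 2 -> 2 -> 1).
w_h, b_h = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
w_g, b_g = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
w_f, b_f = [[1.0, 1.0]], [0.0]

x = [1.0, 1.0]

# y_hat = f(g(h(x, w_h), w_g), w_f)
y_hat = layer(layer(layer(x, w_h, b_h), w_g, b_g), w_f, b_f)
print(y_hat)
```

Each call to `layer` consumes the output of the previous one, which is the nested-function structure written above.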

The NN can be trained, in a supervised approach, using a set of example pairs of inputs and related outputs. The goal is to learn the parameters of the NN's units so that the error, that is the difference between the network's output $\hat y(x)$ and the real value $y$, is as small as possible. The function that measures this error is called the loss function and can be written, for example, as

$$ℒ = \frac{1}{2}||\hat y(x, w) - y(x)||^2$$

Finding the parameter values that minimize the loss is an optimization problem. It can be solved iteratively by computing the gradient of the loss $\mathcal{L}$ with respect to the parameters $w$ and moving the parameters a small step in the opposite direction

$$w_{t+1} = w_t - \gamma \nabla_w \mathcal{L}$$

where $t$ is the iteration index and $\gamma$ is the learning rate, a small positive step size. This optimization algorithm is called gradient descent. The gradient is computed by backpropagation: starting from the output and going back through each layer of the network, the derivatives at each unit are computed using the chain rule. The operations performed at each unit are described by the NN's computational graph, a directed acyclic graph that is executed and differentiated automatically by deep learning frameworks such as TensorFlow and PyTorch.
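As a minimal sketch of gradient descent with the chain rule, the following code trains a single sigmoid unit on one example. The activation function, the training pair `(x, y)`, the initial parameters, and the learning rate `gamma = 0.5` are all assumptions made for the illustration. A real framework would derive the gradients automatically instead of using the hand-written `gradients` function.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w, b):
    """Output of one unit: y_hat = sigma(b + sum_i w_i * x_i)."""
    return sigmoid(b + sum(w_i * x_i for w_i, x_i in zip(w, x)))

def loss(y_hat, y):
    """Squared error loss: 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def gradients(x, y, w, b):
    """Chain rule: dL/dw_i = dL/dy_hat * dy_hat/da * da/dw_i."""
    y_hat = forward(x, w, b)
    dL_dyhat = y_hat - y                # derivative of the loss
    dyhat_da = y_hat * (1.0 - y_hat)    # derivative of the sigmoid
    delta = dL_dyhat * dyhat_da
    grad_w = [delta * x_i for x_i in x]  # da/dw_i = x_i
    grad_b = delta                       # da/db = 1
    return grad_w, grad_b

# Hypothetical training example and starting point.
x, y = [1.0, 2.0], 1.0
w, b = [0.0, 0.0], 0.0
gamma = 0.5  # learning rate

losses = []
for step in range(100):
    losses.append(loss(forward(x, w, b), y))
    grad_w, grad_b = gradients(x, y, w, b)
    # Gradient descent update: parameter <- parameter - gamma * gradient
    w = [w_i - gamma * g_i for w_i, g_i in zip(w, grad_w)]
    b = b - gamma * grad_b

print(losses[0], losses[-1])
```

The list `losses` records the loss before each update, so comparing its first and last entries shows whether the updates reduced the error.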

## Types of data
A NN can only handle numerical values that can be provided as

* Tabular data
* Images
* Sequences (time series)

## Tensors and tensor operations
In a NN the input data and the network parameters are handled as tensors, that is, multidimensional numerical arrays. The operations that can be performed on a single tensor (unary operations) or on a pair of tensors include

* dot (tensor product)
* addition
* element-wise operations
* broadcasting
* reshaping
* scaling
* rotation
* translation
* derivative
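Most of the operations listed above can be tried out with NumPy arrays, used here as a stand-in for framework tensors. The arrays and the 90-degree rotation matrix are hypothetical values chosen for the example, and the derivative is left out because frameworks compute it through automatic differentiation.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # rank-2 tensor, shape (2, 2)
w = np.array([[0.0, 1.0], [1.0, 0.0]])   # rank-2 tensor, shape (2, 2)
b = np.array([10.0, 20.0])               # rank-1 tensor, shape (2,)

dot = x @ w                  # dot (tensor product)
added = x + w                # addition of tensors with the same shape
elementwise = x * w          # element-wise product
broadcast = x + b            # broadcasting: b is added to every row of x
reshaped = x.reshape(4, 1)   # reshaping: same values, new shape
scaled = 2.0 * x             # scaling by a scalar

# Geometric view: rotate a 2D point by 90 degrees, then translate it.
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])
point = np.array([1.0, 0.0])
rotated = rotation @ point
translated = rotated + np.array([2.0, 3.0])

print(dot)
print(broadcast)
print(translated)
```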

## Loss functions and optimizers
Many other loss functions and optimization algorithms are available. They differ in performance, but all of them are based on computing derivatives.

## Stochastic gradient descent, mini batch, and batch training loops
The network parameters are updated within a training loop, where each iteration consists of a forward pass and a backward pass. The amount of data used in each iteration can be a single data point, extracted randomly from the training set; for this reason the method is called stochastic gradient descent. If the training set is large, stochastic gradient descent may take a long time to process all the data. Another approach is to use a small set of samples, called a mini-batch, and update the parameters using the mean of the gradients computed on those samples. A third approach, called batch training, uses the full training set to compute each parameter update.
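The three variants differ only in the batch size, as the following sketch suggests. It fits a hypothetical one-parameter model $\hat y = w x$ to synthetic data generated from $y = 2x$. The function `train_one_epoch`, the learning rate, and the number of passes over the data are assumptions made for the example.

```python
import random

def train_one_epoch(data, w, gamma, batch_size, rng):
    """One pass over all the training data, split into batches."""
    samples = data[:]
    rng.shuffle(samples)  # random order of the training samples
    updates = 0
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # Mean gradient of 1/2 * (w*x - y)^2 with respect to w over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= gamma * grad
        updates += 1
    return w, updates

# Synthetic training set: y = 2x for x = 0.1, 0.2, ..., 1.0
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(1, 11)]
rng = random.Random(0)

results = {}
for name, batch_size in [("stochastic", 1), ("mini-batch", 5), ("batch", len(data))]:
    w = 0.0
    total_updates = 0
    for _ in range(200):  # 200 passes over the training set
        w, updates = train_one_epoch(data, w, 0.5, batch_size, rng)
        total_updates += updates
    results[name] = (w, total_updates)

for name, (w, total_updates) in results.items():
    print(name, w, total_updates)
```

With a batch size of 1 the parameter is updated once per sample, while with the full training set it is updated once per pass, so the smaller batch sizes perform more updates for the same amount of data.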

## Epochs
A training loop that is performed on all the training data is called an epoch. The training phase of a NN usually consists of several epochs, repeated until the accuracy of the model stops improving.

## NN design, training, and performance evaluation
The implementation of a NN model can be divided into four steps:

1. Design (number of layers, number of units per layer, activation function)
2. Compilation (loss function, optimizer, performance metrics)
3. Fitting (number of epochs, batch size)
4. Evaluation (mean squared error, accuracy)

The evaluation metrics are computed on the validation and test sets to assess how well the model performs on data it was not trained on.
For instance, the mean squared error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)) is a performance metric used in regression tasks. Given $N$ samples in the test set we have

$$MSE = \frac{1}{N} \sum_{i=1}^N (Y_i - \hat Y_i)^2$$

where $Y_i$ are the observations and $\hat Y_i$ the predicted values. [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) is used in classification tasks and is defined as a ratio


$$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{number of classifications}} = \frac{TP + TN}{TP + TN + FP + FN}$$


where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
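Both metrics are simple to compute by hand, as in the sketch below. The helper names `mse` and `accuracy` and the sample values are hypothetical, and the accuracy function assumes a binary classification task with labels 0 and 1.

```python
def mse(y_true, y_pred):
    """Mean squared error over N samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def accuracy(y_true, y_pred):
    """Binary accuracy computed from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    return (tp + tn) / (tp + tn + fp + fn)

# Regression example: only the last prediction is off, by 2.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# Classification example: three of the four predictions are correct.
print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]))  # 0.75
```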
