In [None]:
import sys
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
sys.path.append('/content/drive/MyDrive/finance_course/2022/lesson9')

Mounted at /content/drive


# Non Linear Models

* When the target variable $y$ has a more complicated relationship with the independent variables $X$ and linear models no longer work, we need to move to **non-linear models**.

* If the model is not known, machine learning techniques can be used to infer its characteristics directly from the dataset.

## Machine Learning

### Neural Network Definition
* Artificial Neural Networks (ANN or simply NN) are information processing models inspired by the working principles of the human brain. 
  * **Their most essential property is the ability to learn from sample sets.** 

* The basic units of the ANN architecture are neurons. 

<center>

![Model of an artificial neuron.](https://drive.google.com/uc?id=1sT_uKTvHpG4KJqBICnhYlimAz7UIagsk)

</center>
  
$$ \textrm{Inputs} \rightarrow \Sigma = \sum_{i=1}^{N} x_i w_i + w_0 \rightarrow f(\Sigma) \rightarrow \textrm{Output}$$  

* The *activation function* $f$ is used to add non-linearity to the response of the neuron.
  * There are many different types of activation function:
    * *step function*, which returns just 0 or 1 according to the input value;
    * *sigmoid*, which can be thought of as the continuous version of the step function;
    * Rectified Linear Unit (ReLU);
    * hyperbolic tangent (tanh).

<center>

![](https://drive.google.com/uc?id=1yPgenOKBcnH3B_F1jx1Q6T1Bc249n9ya)

</center>
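As a minimal sketch of the neuron model above, the snippet below computes the weighted sum $\Sigma$ and passes it through each of the listed activation functions. The input values, weights and bias are made up for illustration, and the helper names (`step`, `sigmoid`, `relu`, `neuron_output`) are hypothetical.

```python
import numpy as np

def step(z):
    """Step function: 1 where z >= 0, otherwise 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Smooth version of the step function, with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

def neuron_output(x, w, w0, activation):
    """Weighted sum of the inputs plus the bias w0, passed through the activation."""
    z = np.dot(x, w) + w0
    return activation(z)

x = np.array([0.5, -1.0, 2.0])   # illustrative inputs
w = np.array([0.4, 0.3, -0.2])   # illustrative weights
w0 = 0.1                         # illustrative bias

# The weighted sum is about -0.4, so step and ReLU return 0
# while sigmoid and tanh return values inside their ranges.
for name, f in [("step", step), ("sigmoid", sigmoid), ("relu", relu), ("tanh", np.tanh)]:
    print(name, neuron_output(x, w, w0, f))
```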

### Supervised Training of a Neuron

* When training a neuron we want to teach it to give the "correct" output for a given input (hence the name *supervised*).

1. Inputs from the *training* set are presented to the neuron one after the other together with the target output;
2. the neuron weights are modified in order to make the neuron output as close as possible to the target;
3. when an entire pass through all of the training input vectors is completed (an *epoch*) the neuron has learnt. 
  * In practice we can present the same set to the neuron many times to make it learn better (but not too many times, see **overfitting**).
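The three steps above can be sketched with a single neuron trained by a simple error-correction rule. This is only an illustration under stated assumptions: the neuron has an identity activation, the training set is synthetic (a noiseless linear target), and the learning rate and number of epochs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: the target is a linear function of two inputs.
X = rng.uniform(-1, 1, size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5

w = np.zeros(2)       # neuron weights
w0 = 0.0              # neuron bias
learning_rate = 0.1   # assumed step size

def mean_squared_error(w, w0):
    """Average squared difference between neuron output and target."""
    return np.mean((X @ w + w0 - y) ** 2)

loss_before = mean_squared_error(w, w0)

for epoch in range(20):                      # present the whole set several times
    for i in rng.permutation(len(X)):        # one sample after the other
        error = (X[i] @ w + w0) - y[i]       # neuron output minus target
        w -= learning_rate * error * X[i]    # move the weights to shrink the error
        w0 -= learning_rate * error

loss_after = mean_squared_error(w, w0)
print(loss_before, loss_after, w, w0)
```

With this noiseless target the weights approach the generating values (2, -1) and the bias approaches 0.5.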

* A single neuron is too simple an architecture. The next step is to put several neurons together in *layers*.

### Multilayered Neural Networks

<center>

![A multilayered neural network.](https://drive.google.com/uc?id=1_D3eO0Bb5XwF9SIFbEvsMNNbz_hI3EuX)

</center>

* In a multilayered NN each neuron of the *input layer* feeds every neuron of the next hidden layer, which in turn feeds every neuron of the output layer. 
  * There can be any number of neurons per layer.

### Training a Multilayered Neural Network

* The training of a multilayered NN follows similar steps:
  1. present a training sample to the neural network and compute the network output by calculating the activations of each neuron of each layer;
  2. calculate the **loss** as the difference between the NN prediction and the target output;
  3. "re-adjust" the weights of the network such that the difference with the target output decreases;
  4. continue the process for each input several times (epochs).

<center>

![](https://drive.google.com/uc?id=1M38qS_oDvO45sOA894UqMTyw50K2o5zS)

</center>

* The NN loss is computed by the *loss function*; possible choices are:
  * Mean Absolute Error (MAE): the average of the absolute value of the differences between the predictions and true values. It represents how far off we are on average from the correct value;
  * Root Mean Squared Error (RMSE): the square root of the average of the squared differences between the predictions and true values. It penalizes larger errors more heavily and is commonly used in regression tasks. 
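A small numerical sketch of the two loss functions, using made-up values with a single large error, shows how the squared version weighs that error more heavily. The function names `mae` and `rmse` are illustrative.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |prediction - true value|."""
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared difference."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])   # three exact predictions, one error of 4

print(mae(y_true, y_pred))    # 1.0
print(rmse(y_true, y_pred))   # 2.0
```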

* **Back propagation** is the algorithm used to reduce the loss function:
  * the current loss is "propagated" backwards to previous layers, where it is used to modify the weights.

$$\min_{w} L(w_{11}, w_{12},\ldots) \implies \frac{\partial L}{\partial w_{ij}} = 0$$

<center>

![](https://drive.google.com/uc?id=1NQFCPJomQQD4l1KcK-7iBTLxDkE8DpiG)

</center>
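To make the backward propagation of the loss concrete, here is a minimal NumPy sketch of a network with one hidden layer trained by plain gradient descent. Everything specific in it is an assumption made for illustration: the toy data ($y = x^3$), the 8 tanh hidden neurons, the linear output, the mean squared error loss and the learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data, assumed for illustration only.
x = np.linspace(-1, 1, 40).reshape(-1, 1)
y = x ** 3

# One hidden layer of 8 tanh neurons and a linear output neuron (sizes assumed).
W1 = rng.normal(0, 0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, size=(8, 1))
b2 = np.zeros(1)
lr = 0.05   # assumed learning rate

losses = []
for epoch in range(3000):
    # Forward pass: activations of each layer.
    h = np.tanh(x @ W1 + b1)
    y_hat = h @ W2 + b2
    losses.append(np.mean((y_hat - y) ** 2))

    # Backward pass: propagate the loss gradient from the output to the input side.
    d_out = 2 * (y_hat - y) / len(x)            # dL/dy_hat for the MSE loss
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1 - h ** 2)    # chain rule through tanh
    dW1 = x.T @ d_hidden
    db1 = d_hidden.sum(axis=0)

    # Gradient descent update of all the weights.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```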

* Weights are modified using an *optimization function*, or *optimizer* (we will use *Adam* in the following, but there are others).
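The simplest example of such an optimization function is plain gradient descent, where each weight is moved against the gradient of the loss. The symbol $\eta$ (the *learning rate*, i.e. the step size) is introduced here for illustration and does not appear elsewhere in these notes. Adam refines this rule by keeping running averages of the gradient and of its square, and using them to adapt the step of each weight.

$$w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial L}{\partial w_{ij}}$$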



## Regression and Classification

### Classification 
* Classification is the process of finding a function that splits the dataset into classes based on different parameters. 
  * The goal is to find the mapping function between the input and the **discrete** output ($y$).

* Email spam detection: the model is trained on millions of emails described by different parameters, and whenever it receives a new email it identifies whether it is spam or not.
* Classification algorithms can also be used in speech recognition, car plate identification, etc.

### Regression
* Regression is the process of finding the correlations between dependent and independent variables. 
  * The goal is to find the mapping function from the input variable to the **continuous** output variable.

* Housing price prediction: the input data can be different home features and the output prediction will be a price estimate. 
  * In general, this kind of algorithm can be applied whenever we are dealing with function approximation.


### Technical Note

* Neural network training and testing are performed using $\tt{keras}$ (which is built on top of $\tt{tensorflow}$, an open-source library from Google) and $\tt{scikit-learn}$, which provides many useful utilities for the training.

## Function approximation 

* Let's design an ANN which is capable of learning the functional form underlying a set of data ([function_approx.csv](https://github.com/matteosan1/finance_course/raw/develop/libro/input_files/function_approx.csv)).

* **SPOILER**: the relation between $X$ and $y$ is $f(x) = x^3 +2$.

In [None]:
# load the dataset
import numpy as np
import pandas as pd


In [None]:
# check min and max for input and output


* **When dealing with a multi-input NN it is good practice to rescale each variable so that all the inputs share a uniform scale (usually [0, 1])**. 
* This is done to provide the NN with *normalized* data; in fact, very large or very small numbers can fool the network and lead to unstable results.
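A possible sketch of the normalization and of the train/test split is shown below. To keep it self-contained it does not read the CSV file: it uses a synthetic stand-in generated from $f(x) = x^3 + 2$ on an assumed range $[-2, 2]$, and the variable names (`scaler_x`, `scaler_y`, `train_X`, ...) as well as the 80/20 split are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset (the range of x is an assumption).
x = np.linspace(-2, 2, 200).reshape(-1, 1)
y = x ** 3 + 2

# Separate scalers for input and output, so that predictions can later be
# mapped back to the original scale with scaler_y.inverse_transform.
scaler_x = MinMaxScaler()
scaler_y = MinMaxScaler()
X_scaled = scaler_x.fit_transform(x)   # values now in [0, 1]
y_scaled = scaler_y.fit_transform(y)   # values now in [0, 1]

# Keep 20% of the sample aside for testing.
train_X, test_X, train_y, test_y = train_test_split(
    X_scaled, y_scaled, test_size=0.2, random_state=1)

print(X_scaled.min(), X_scaled.max(), y_scaled.min(), y_scaled.max())
print(train_X.shape, test_X.shape)
```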

In [None]:
# normalize data, split train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


### Neural Network Design

* There is no rule to guide developers in the design of a neural network in terms of number of layers and neurons per layer. 
* The most common strategy is *trial and error*, where you pick the solution giving the best accuracy. 
  * In general a larger number of nodes is better at capturing highly structured data with many features, although it may require a larger training sample to work correctly.
  * **As a rule of thumb, an NN with just one hidden layer, whose number of neurons is the average of the number of inputs and outputs, is sufficient in most cases.** 


* Let's use two hidden layers with 15 and 5 neurons, followed by a single output neuron, all with a *sigmoid* activation function. 
* The $\tt{input\_dim}$ parameter has to be set to 1 since we have just one single input, the $x$ value. 

<img src="https://drive.google.com/uc?id=1VBb5EA8dX9ZeAGD_EkUvbv0pkBXv_TID">

In [None]:
# design the neural network model
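from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# first hidden layer: 15 neurons, one single input (the x value)
model.add(Dense(15, input_dim=1, activation='sigmoid'))
# second hidden layer: 5 neurons
model.add(Dense(5, activation='sigmoid'))
# output layer: one neuron returning the (normalized) prediction
model.add(Dense(1, activation='sigmoid'))
# mean squared error as loss function, Adam as optimizer
model.compile(loss='mse', optimizer='adam')

# train_X and train_y are the normalized training inputs and targets
model.fit(train_X, train_y, epochs=100, batch_size=10, verbose=1)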

In [None]:
# load model trained with 500 epochs
from tensorflow.keras.models import load_model

model500 = load_model("/content/drive/MyDrive/finance_course/2022/lesson9/func_500epochs")


* After the training is completed we can evaluate how good the model is. 
  * **Usually performance is measured using the value of the loss function at the end of the training.**
  * A *perfect* prediction would lead to a loss = 0, so the lower this number the better the agreement. 

* The picture below shows the actual function we want to approximate together with the predictions of our NN after four different numbers of epochs (5, 100, 800, 5000).

<img src="https://drive.google.com/uc?id=1oWfQq6q7PVzrZ979FS_weAJUj3cL7AOu">

* The agreement improves as the number of epochs grows, since the NN has more opportunities to adapt its weights and reduce the loss with respect to the target values. 

### Overfitting (Overtraining)

* Increasing the number of epochs too much may lead to overfitting: 
  * the NN learns the training sample too well, but its performance degrades substantially on an independent sample. 
* For this reason the available sample has to be split in two parts, training and testing (e.g. 80% and 20%): 
  * **training** to set the weights;
  * **testing** to cross-check the performance on an independent sample. 

* To check whether overfitting occurred we can *evaluate* our NN on both the training and the testing samples. 
  * If the losses are comparable the NN is fine; if instead the training loss is much smaller than the testing loss, the NN has overfitted.
  * In this second case, if we need more accuracy, we have to either increase the training sample or change the NN design.
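The comparison between training and testing losses can be sketched as follows. To keep the example light and self-contained it uses scikit-learn's `MLPRegressor` as a stand-in for the Keras model of these notes (with a Keras model the same check would call `model.evaluate` on each sample). The synthetic data, the tanh activation and the number of iterations are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the dataset (the range of x is an assumption).
x = np.linspace(-1, 1, 200).reshape(-1, 1)
y = (x ** 3 + 2).ravel()

# 80% of the sample for training, 20% for testing.
train_X, test_X, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=1)

# Stand-in network with two hidden layers of 15 and 5 neurons.
model = MLPRegressor(hidden_layer_sizes=(15, 5), activation="tanh",
                     solver="adam", max_iter=2000, random_state=0)
model.fit(train_X, train_y)

# Same loss computed on the two independent samples.
train_loss = mean_squared_error(train_y, model.predict(train_X))
test_loss = mean_squared_error(test_y, model.predict(test_X))
print(f"train MSE: {train_loss:.5f}  test MSE: {test_loss:.5f}")
```

A testing loss much larger than the training loss would be the signature of overfitting described above.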







In [None]:
# evaluate on train and test


## A Feature Not a Bug

* If you run the previous example again you will most likely obtain different results.
  * **This is not a bug but a feature of NNs**; let's see what the possible sources of such discrepancies are.

* **Stochastic learning algorithm**: the training algorithm of an NN is stochastic, i.e. its behaviour incorporates elements of randomness (beware that stochastic does not mean learning a random model). 
* This randomness comes from: 
  * the *random initial weights*, which allow the model to try learning from a different starting point in the search space each time; 
  * the *random shuffle of examples during training*, which ensures that each gradient estimate and weight update is slightly different. 

* As a consequence, each time the algorithm is run on the same data it learns a slightly different model which, when evaluated, may show a slightly different performance. 

* You can control the randomness by setting the seed used by the pseudorandom number generator, although this is not a good approach in practice:
  * **there is no best seed for any algorithm**; 
  * it is better to summarize the performance by fitting the model on your dataset multiple times and averaging its predictions.
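The effect of the seed, and the averaging over several fits, can be sketched as below. Again scikit-learn's `MLPRegressor` is used as a lightweight stand-in for the Keras model; its `random_state` parameter fixes both the initial weights and the shuffling. The data, the network size, the number of iterations and the five seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the dataset (the range of x is an assumption).
x = np.linspace(-1, 1, 100).reshape(-1, 1)
y = (x ** 3 + 2).ravel()

def fit_and_predict(seed):
    """Train one network with the given seed and return its predictions."""
    model = MLPRegressor(hidden_layer_sizes=(15, 5), activation="tanh",
                         max_iter=300, random_state=seed)
    model.fit(x, y)
    return model.predict(x)

# Same data and same design, only the seed changes.
predictions = np.array([fit_and_predict(seed) for seed in range(5)])
losses = np.array([mean_squared_error(y, p) for p in predictions])

# Summarize the runs instead of picking a single "lucky" seed.
mean_prediction = predictions.mean(axis=0)
print("loss per seed:", losses)
print(f"mean loss = {losses.mean():.5f}, std = {losses.std():.5f}")
print("loss of the averaged prediction:", mean_squared_error(y, mean_prediction))
```

scikit-learn may warn that the optimizer has not converged within the chosen number of iterations; this does not affect the point of the example, which is the spread between seeds.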
