<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%204%20Cross-validation%20applied%20to%20LASSO%20variable%20selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Artificial Intelligence and Machine Learning for SupTech  
Tutorial 4: Cross-validation applied to LASSO variable selection 

*	Looking closer at cross-validation (CV) and holdouts
*	K-fold, Leave-one-out, stratified CV
*	Splitting your data into training and testing samples
*	How to use CV to tune a LASSO model

<br/>

14 March 2023  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

In [None]:
!gdown 1YO09naIv_Lf4Nne-_zH9oDM0SXfk2VSh

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

For the next few tutorials we will be looking at historical commodity price data. We are going to predict the prices of certain commodities using a LASSO model. First, let us explore the data a bit.

In [None]:
df = pd.read_csv('/content/commodity-prices.csv', sep=';')

In [None]:
df.head()

Try to get a bit familiar with the dataset

We will alter the dataset to have the data in the wide format (column based) and not in the long format (row based)

In [None]:
df = df.pivot_table(columns='Commodity', values='Price index', index='Date', aggfunc='mean')

In [None]:
df.head()
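
As a small illustration of the long-to-wide reshaping, the sketch below applies the same `pivot_table` call to a made-up four-row table. The column names mirror the ones used above; the dates and values are invented for the example.

```python
import pandas as pd

# Toy long-format data: one row per (date, commodity) pair. Values are made up.
long_df = pd.DataFrame({
    'Date': ['2000-01', '2000-01', '2000-02', '2000-02'],
    'Commodity': ['Aluminum', 'Copper', 'Aluminum', 'Copper'],
    'Price index': [100.0, 200.0, 110.0, 190.0],
})

# Wide format: one row per date, one column per commodity.
wide_df = long_df.pivot_table(columns='Commodity', values='Price index',
                              index='Date', aggfunc='mean')
print(wide_df)
```

Each commodity now has its own column, which is the layout a regression model expects for its predictors.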

To keep it simple, we will first drop every column that contains NaN values.

In [None]:
df = df.dropna(axis=1)

The columns seem to have different scales, so let's divide each column by its maximum. For positive prices this puts every value in the (0, 1] range.

In [None]:
df = df/df.max()
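
Dividing by the column maximum sets the largest value of each column to 1 but does not move the smallest value to 0. The sketch below contrasts it with min-max scaling on a made-up two-column frame; which of the two is preferable here is a modelling choice, not something the tutorial prescribes.

```python
import pandas as pd

# Toy data with two columns on different scales (values are made up).
toy = pd.DataFrame({'A': [50.0, 75.0, 100.0], 'B': [2.0, 3.0, 4.0]})

# Division by the maximum, as in the cell above: the largest value becomes 1.
max_scaled = toy / toy.max()

# Min-max scaling: the smallest value becomes 0 and the largest becomes 1.
minmax_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(max_scaled)
print(minmax_scaled)
```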

For the first model we are going to look at Aluminum prices

In [None]:
df['Aluminum'].plot()

We will try to build a model that predicts the Aluminum price from the prices of all commodities one quarter (three months) earlier.

In [None]:
y = df.loc['1980-04':,'Aluminum'] # select the Aluminum price 1 quarter ahead
x = df.loc[:'2015-11', :] # select the predictors from all of the data
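
The two slices above pair each row of predictors with a target observed a fixed number of rows later. The sketch below shows the same alignment on a made-up six-month frame using positional slicing; the lag of 3 rows is an assumption chosen to match the "1 quarter ahead" comment.

```python
import pandas as pd

# Toy monthly data (values are made up).
dates = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06']
toy = pd.DataFrame({'Aluminum': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                    'Copper': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]}, index=dates)

lag = 3  # assumed forecast horizon in rows (months)

y_toy = toy['Aluminum'].iloc[lag:]  # targets start `lag` rows into the data
x_toy = toy.iloc[:-lag]             # predictors stop `lag` rows before the end

# Row i of x_toy is used to predict row i of y_toy.
pairs = pd.DataFrame({'predictor_date': x_toy.index, 'target_date': y_toy.index})
print(pairs)
```

scikit-learn matches predictors and targets by position rather than by index label, so the two objects must have the same length and ordering.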

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Lasso(alpha=0.00000001)

In [None]:
lasso = lasso.fit(x, y)

Let's have a look at our prediction

In [None]:
plot_data = pd.DataFrame()
plot_data['true'] = y
plot_data['predicted'] = lasso.predict(x)
plot_data.plot()

That looks like a suspiciously good prediction: we even perfectly predicted the drop caused by the 2008 financial crisis.

Could this just be because the training data also contains the drop during the financial crisis? Let's make a simple train-test split and find out!

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [None]:
lasso = lasso.fit(x_train, y_train)

In [None]:
plot_data = pd.DataFrame()
plot_data['true'] = y_test
plot_data['predicted'] = lasso.predict(x_test)
plot_data.sort_index().plot()

That still looks like a very good prediction. Would this be the correct way to make your train-test split in this case?

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

It would be better to train on a continuous period that lies further back in the past than the test data.

In [None]:
train_y = df.loc['1980-04':'2006-01','Aluminum'] # training target: Aluminum price one quarter ahead of train_x
train_x = df.loc[:'2005-10', :] # training predictors: all commodities up to 2005-10
test_y = df.loc['2006-01':,'Aluminum'] # test target: Aluminum price from 2006-01 onwards
test_x = df.loc['2005-10':'2015-11', :] # test predictors: all commodities from 2005-10 to 2015-11

In [None]:
lasso = lasso.fit(train_x, train_y)

In [None]:
plot_data = pd.DataFrame()
plot_data['true'] = test_y
plot_data['predicted'] = lasso.predict(test_x)
plot_data.plot()
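
The single chronological split can be extended to several folds. scikit-learn's `TimeSeriesSplit` is one way to do that: every training set lies entirely before its test set. The sketch below uses a made-up index of 12 observations and is not part of the original tutorial flow.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (toy data).
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_toy))

for train_idx, test_idx in splits:
    # The training window grows; the test window always comes after it.
    print('train:', train_idx, 'test:', test_idx)
```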

Let's have a look at the coefficients the model uses

In [None]:
non_zero_coefficients = np.abs(lasso.coef_) > 0 # mask of the coefficients the model actually uses
pd.Series(lasso.coef_[non_zero_coefficients], index=x.columns[non_zero_coefficients])

It looks like the model uses all the variables we have. Can you see why that would be a problem? For instance, look at the 4 Crude Oil indicators. We can calculate the correlation between the crude oil columns to see whether each of them brings something different to the table:

In [None]:
df.filter(like='Crude Oil').corr()

Can you find other columns which might not be good predictors?
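
One possible way to search for redundant columns is to list every pair whose absolute correlation exceeds a threshold. The sketch below does this on synthetic data; the column names and the 0.95 threshold are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic data: 'oil_b' is a near-duplicate of 'oil_a', 'wheat' is unrelated.
toy = pd.DataFrame({
    'oil_a': base,
    'oil_b': base + 0.01 * rng.normal(size=200),
    'wheat': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

high_pairs = pairs[pairs > 0.95]  # assumed threshold
print(high_pairs)
```

Columns that show up in such a list carry almost the same information, so a sparse model only needs one of them.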

We can adjust the regularisation in the Lasso model through the alpha value: a larger alpha reduces the number of variables used in the model. Can you find the optimal alpha value?
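
To see how alpha controls sparsity, the sketch below fits a Lasso for three alpha values on synthetic data in which only two of ten features matter, and counts the non-zero coefficients. The data and the alpha grid are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 10))

# Only the first two features drive the target; the rest are noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.1 * rng.normal(size=100)

n_nonzero = {}
for alpha in [0.001, 0.1, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_toy, y_toy)
    # Count the coefficients that survive the L1 penalty.
    n_nonzero[alpha] = int(np.sum(model.coef_ != 0))

print(n_nonzero)
```

A very small alpha behaves almost like ordinary least squares, while a very large alpha shrinks every coefficient to zero.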

By combining KFold with a Lasso model that has cross-validation built in (LassoCV), we can find the optimal value of alpha based on the folds we use.

In [None]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LassoCV
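
A minimal sketch of how the two imports might be combined, on synthetic data rather than the commodity prices. The alpha grid, the number of folds and the data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 8))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * rng.normal(size=120)

alphas = np.logspace(-4, 0, 30)        # candidate alpha values (assumed grid)
kfold = KFold(n_splits=5, shuffle=False)

# LassoCV fits the model for every alpha on every fold and keeps the best alpha.
kfold_model = LassoCV(cv=kfold, alphas=alphas, max_iter=10_000).fit(X_toy, y_toy)

print('selected alpha:', kfold_model.alpha_)
print('MSE path shape (alphas, folds):', kfold_model.mse_path_.shape)
```

`mse_path_` holds the test error for each alpha on each fold, which is useful for plotting how sensitive the choice of alpha is.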

The same thing can be done using leave-one-out cross-validation

In [None]:
from sklearn.model_selection import LeaveOneOut
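
Leave-one-out can be passed to `LassoCV` through the same `cv` argument. The sketch below uses a small synthetic data set, because leave-one-out needs one fit per observation and becomes slow on large samples. Data and alpha grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 5))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * rng.normal(size=30)

alphas = np.logspace(-3, 0, 20)  # candidate alpha values (assumed grid)

# Each of the 30 observations is used exactly once as a one-element test set.
loo_model = LassoCV(cv=LeaveOneOut(), alphas=alphas, max_iter=10_000).fit(X_toy, y_toy)

print('selected alpha:', loo_model.alpha_)
print('MSE path shape (alphas, folds):', loo_model.mse_path_.shape)
```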

We can also use a stratified KFold, which keeps the proportion of each class the same in every fold

In [None]:
from sklearn.model_selection import StratifiedKFold
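
`StratifiedKFold` expects discrete class labels, so it cannot be applied directly to a continuous target such as a price. One possible workaround, sketched below on synthetic data, is to bin the target into quantiles, stratify on the bins, and pass the resulting index splits to `LassoCV`. The choice of four bins is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * rng.normal(size=100)

# Bin the continuous target into quartiles to obtain class-like labels.
y_bins = pd.qcut(y_toy, q=4, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X_toy, y_bins))  # stratify on the bins, not on y itself

# LassoCV accepts a list of (train, test) index pairs as `cv`.
strat_model = LassoCV(cv=splits, alphas=np.logspace(-3, 0, 20)).fit(X_toy, y_toy)

print('selected alpha:', strat_model.alpha_)
```

Each test fold then contains low, medium and high target values in roughly equal shares.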

Can you show the differences between the three cross-validation methods?
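
One possible starting point, on a made-up data set of 12 observations with imbalanced labels, is to compare how many folds each splitter produces and how the minority class is spread over the test folds. The labels and fold counts are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

X_toy = np.arange(12).reshape(-1, 1)
labels = np.array([0] * 9 + [1] * 3)  # hypothetical imbalanced class labels

splitters = {
    'KFold': KFold(n_splits=3),
    'LeaveOneOut': LeaveOneOut(),
    'StratifiedKFold': StratifiedKFold(n_splits=3),
}

summary = {}
for name, splitter in splitters.items():
    folds = list(splitter.split(X_toy, labels))
    # Number of minority-class observations in each test fold.
    minority_per_fold = [int(labels[test_idx].sum()) for _, test_idx in folds]
    summary[name] = (len(folds), minority_per_fold)
    print(name, summary[name])
```

Unshuffled KFold puts all minority observations in the last fold, leave-one-out tests on a single observation at a time, and the stratified variant spreads the minority class evenly.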