# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

## Cross Validation, Test and Train
In the previous notebooks, you learned how to import data, preprocess data and create custom input and output parameters for your model. You were also introduced to the concept of a pipeline, which we will now use on our GLD dataset.

In this notebook, our primary focus will be to find the best hyperparameters for our linear regression model.

<b>But what are hyperparameters? And how can we find the best set of hyperparameters for our model?</b>

Hyperparameters are parameters that the model cannot estimate itself. These need to be set manually by the user to help in the estimation of the model. We will learn more about hyperparameters and how to evaluate the best performing hyperparameters using cross-validation in this notebook. 

We will also learn how to split our data into train and test datasets. The key steps are:

1. [Import the Data](#import)
2. [Create Input and Output Variables](#xy)
3. [Data Preprocessing and Hyperparameters](#preprocess)
4. [Split Train and Test Data](#split)
5. [Grid Search Cross-Validation](#cross)
6. [Grid Search Performance](#perf)

In [1]:
# For data manipulation
import pandas as pd
import numpy as np

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import cross_val_score

# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore")

<a id='import'></a>
## Import the Data

The input data is stored in `data_preprocess.csv`, which we will load here and store within the variable `gold_prices`.

In [2]:
# Read the data
gold_prices = pd.read_csv(
    '../data_modules/data_preprocess.csv', index_col='Date')

# Print the data
gold_prices.head()

Date,Open,High,Low,Close,S_3,S_15,S_60,Corr,Std_U,Std_D,OD,OL
2013-07-10,121.150002,122.349998,120.589996,120.949997,119.406667,122.149334,132.9,-0.042512,1.199996,0.560006,0.310006,0.529999
2013-07-11,124.260002,124.360001,123.470001,124.239998,120.360001,121.404,132.727334,-0.680947,0.099999,0.790001,3.11,3.310005
2013-07-12,123.519997,124.300003,123.32,124.129997,121.936666,120.980667,132.584667,-0.449336,0.780006,0.199997,-0.740005,-0.720001
2013-07-15,124.080002,124.389999,123.839996,124.18,123.106664,121.016,132.439,0.401985,0.309997,0.240006,0.560005,-0.049995
2013-07-16,124.760002,125.209999,124.330002,124.889999,124.183332,120.958,132.270333,0.561437,0.449997,0.43,0.68,0.580002


<a id='xy'></a>
## Create Input and Output Variables
 
As done previously, we will create an input dataset `X`, which holds the independent variables, and output datasets `yU` and `yD`, which are the dependent variables.

In [3]:
# Independent variables
X = gold_prices[['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr']]

# Dependent variable for upward deviation
yU = gold_prices['Std_U']

# Dependent variable for downward deviation
yD = gold_prices['Std_D']

<a id='preprocess'></a>
## Data Preprocessing and Hyperparameters
Preprocessing the data before feeding it to a machine learning model is essential. Raw data can contain errors, and using it as is can produce inconsistent and erroneous results.
### Pipeline
We have already discussed how to create a pipeline and use it on a sample dataset. We will now implement it on our GLD dataset. 
We are using the following two steps in our pipeline: 
1. Scale the data
2. Fit the data using the linear regression model

In [4]:
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Define pipeline
pipeline = Pipeline(steps)
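
To make the two pipeline steps concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the name `demo_pipeline` are assumptions made for illustration and are not part of the GLD dataset. It shows that calling `fit` on the pipeline first fits the scaler and then the regression on the scaled values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the GLD dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=10.0, size=(50, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + 3.0

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('linear', LinearRegression())])
demo_pipeline.fit(X_demo, y_demo)

# The scaler step stores the column means it learned during fit
scaler_means = demo_pipeline.named_steps['scaler'].mean_

# predict applies the same scaling before the regression step
predictions = demo_pipeline.predict(X_demo)
print(scaler_means.shape)
print(np.allclose(predictions, y_demo))
```

Because the synthetic target is an exact linear function of the inputs, the predictions should match it up to floating-point error.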

### Hyperparameters

As mentioned earlier, there are some parameters that the model itself cannot estimate. These values cannot be learned from the training data but are fixed when we define our model. They play a crucial role in the performance of the system. Such parameters are called hyperparameters.

The number of hyperparameters and the values they can take depend on the kind of ML model we select. Some hyperparameters can take a range of values while others accept only a few discrete options. In the case of plain linear regression, whether to fit an intercept is essentially the only setting worth tuning.

We will tune the `fit_intercept` parameter, which controls whether an intercept is calculated for the model. It is a boolean parameter, so it can only be `False` or `True`; the notebook passes these as 0 and 1. If the grid search selects 0, the model performs better without the intercept. If it selects 1, an intercept needs to be modelled for better results.

In [5]:
# Here we are using intercept as hyperparameter
parameters = {'linear__fit_intercept': [0, 1]}
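
The key `linear__fit_intercept` follows the pipeline naming pattern of step name, double underscore, parameter name. The sketch below, which rebuilds the same two-step pipeline under the assumed name `demo_pipeline`, lists the keys a grid could address and sets one of them directly.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('linear', LinearRegression())])

# Keys follow the pattern <step name>__<parameter name>
tunable = sorted(name for name in demo_pipeline.get_params() if '__' in name)
print(tunable)

# set_params accepts the same keys that a parameter grid uses
demo_pipeline.set_params(linear__fit_intercept=False)
intercept_flag = demo_pipeline.named_steps['linear'].fit_intercept
print(intercept_flag)
```

The exact list of keys depends on the installed scikit-learn version, so it is printed rather than assumed.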

<a id='split'></a>
## Split Train and Test Data
In the train-test split you divide the data into two parts.

If we consider the train-test split as 80%-20%, it means 80% of the original data is the training data and the remaining 20% is the testing data. 

The 80%-20% proportion is a popular way to split the data, but there is no fixed rule that we always have to use it. You can also try other common choices such as 90%-10% or 75%-25%.

In [6]:
# We are using 80%-20% split, therefore splitting ratio will be 0.80
splitting_ratio = .80

# Split the data into two parts
# Use int to ensure that result is of integer data type.
split = int(splitting_ratio*len(gold_prices))

# Define train dataset
X_train = X[:split]
yU_train = yU[:split]
yD_train = yD[:split]

# Define test data
X_test = X[split:]
yU_test = yU[split:]
yD_test = yD[split:]
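
The split above is positional, so with date-ordered data the test set always comes after the training set in time. Here is a small sketch on a hypothetical ten-day frame (`demo` and its columns are made up for illustration) that checks the sizes and the ordering.

```python
import pandas as pd

# Hypothetical ten-day series standing in for the real data
dates = pd.date_range('2020-01-01', periods=10, freq='D')
demo = pd.DataFrame({'feature': range(10), 'target': range(10, 20)},
                    index=dates)

# Same rule as in the notebook: the first 80% of rows form the training set
split_at = int(0.80 * len(demo))
train, test = demo.iloc[:split_at], demo.iloc[split_at:]

print(len(train), len(test))                 # 8 2
print(train.index.max() < test.index.min())  # True
```

Keeping the rows in chronological order, rather than shuffling them, avoids training on observations that come after the test period.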

<a id='cross'></a>
## Grid Search Cross-Validation

Now that we know what hyperparameters are, our goal is to find the hyperparameter values that give the best performance for our model. This is known as hyperparameter tuning.

But the question arises, <b>how to find these best sets of hyperparameters?</b>

With respect to linear regression, we will try to understand whether our model is more accurate with an intercept or without an intercept. For this, we will use the Grid Search technique. Grid Search will take the `fit_intercept` value as 0, then 1, and tell us which value is better by calculating the performance for each of them.    

However, we cannot perform Grid Search on the whole dataset as it can lead to overfitting. An overfitted model will fit training data very well but it will perform poorly on an unseen dataset. To overcome this problem we will use cross-validation. In this, the dataset is divided into training set, validation set and test set. Essentially the model learns on the training set, evaluates on the validation set and then finally tests on the test set. 

We will use `GridSearchCV`, the scikit-learn class that combines grid search with cross-validation. We pass `cv=my_cv`, a `TimeSeriesSplit` with `n_splits=5`, so the grid search averages the performance over five rounds of cross-validation. `TimeSeriesSplit` splits the training data into successive segments, so each round trains on earlier observations and validates on later ones. We are using `GridSearchCV` instead of `RandomizedSearchCV` because the grid has only two candidate values and can be searched exhaustively. We will also use `neg_mean_squared_error` as the scoring metric.

In [7]:
# Use TimeSeriesSplit for cross validation
my_cv = TimeSeriesSplit(n_splits=5)

# Define reg as variable for GridSearch function containing pipeline, hyperparameters
reg = GridSearchCV(pipeline, parameters,
                   scoring='neg_mean_squared_error', cv=my_cv)
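
To see what `TimeSeriesSplit(n_splits=5)` does, the sketch below applies it to a toy array of 12 positions (the array `samples` is an assumption for illustration). Each fold trains on all earlier positions and validates on the next block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)

# Collect (train indices, validation indices) for each of the five folds
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(samples)]

for train_idx, val_idx in folds:
    print(train_idx, val_idx)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [10, 11]
```

The training window grows with every fold, and the validation block never precedes the data the model was trained on.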

We call the `fit` method of the grid search object and pass the `X_train` and `yU_train` datasets.

In [8]:
# Fit the model
reg.fit(X_train, yU_train)

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('linear', LinearRegression())]),
             param_grid={'linear__fit_intercept': [0, 1]},
             scoring='neg_mean_squared_error')

<a id='perf'></a>
## Grid Search Performance 

<a id='best'></a>
### Best Fit Variable 

Next, we will create the best fit variable by reading the `best_params_` attribute. The `best_params_` attribute holds the best parameter set found by the Grid Search method.

In [9]:
# Initialise a best fit variable
best_fit = reg.best_params_

# Print best parameter
print(best_fit)

{'linear__fit_intercept': 1}


We can see that `best_params_` for our model gives `linear__fit_intercept` equal to 1. This means our model will contain an intercept for better performance.
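
Besides `best_params_`, a fitted grid search also keeps the mean score of every candidate in `cv_results_`. The following sketch runs the same kind of search on synthetic data with a non-zero offset; `X_demo`, `y_demo` and `demo_search` are assumed names, and the numbers it prints are unrelated to the GLD results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic target with an offset of 5, so an intercept should help
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 2))
y_demo = 5.0 + X_demo @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=120)

demo_search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('linear', LinearRegression())]),
    {'linear__fit_intercept': [False, True]},
    scoring='neg_mean_squared_error',
    cv=TimeSeriesSplit(n_splits=5))
demo_search.fit(X_demo, y_demo)

# One mean score per candidate; the best candidate has the highest score
mean_scores = demo_search.cv_results_['mean_test_score']
for params, mean_score in zip(demo_search.cv_results_['params'], mean_scores):
    print(params, round(mean_score, 4))
print(demo_search.best_params_)
```

Comparing the two mean scores side by side shows how large the gap between the candidates is, which `best_params_` alone does not reveal.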

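Let us also see how 5 rounds of cross-validation performed on the training data by calculating the negated mean squared error for each round. Note that passing `cv=5` to `cross_val_score` uses five consecutive, unshuffled K-fold splits for this outer loop, while the grid search inside `reg` still uses `TimeSeriesSplit`. We will also see the final score for the `GridSearchCV` method using the `neg_mean_squared_error` scoring method.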

In [10]:
# Find the scores for 5 rounds of cross-validation
CV_scores = cross_val_score(reg, X_train, yU_train, cv=5,
                            scoring='neg_mean_squared_error')

# Print the scores for 5 rounds of cross-validation
print(CV_scores)

# Create a score variable
score = reg.best_score_

# Print the score
print(score)

[-0.55125259 -0.35420839 -0.27102945 -0.14834363 -0.14452836]
-0.24203918499284677


We can see the final scores of our Grid Search technique above. The scores are all negative because we are using `neg_mean_squared_error`, which is the negated MSE. In the `sklearn` library, metrics ending with `_error` or `_loss` return a value that is to be minimised, so the scorer negates the MSE to keep the convention that a higher score is better. Hence, the closer the score is to zero, the better the result.
So, the best mean cross-validated score is about -0.24, which corresponds to an MSE of about 0.24 on the training dataset.
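
A negated MSE can be turned back into an error in the units of the target by flipping the sign and taking the square root. The sketch below does this for the score printed above; the variable names are chosen here for illustration.

```python
import numpy as np

# Best mean cross-validated score printed by the notebook
best_score = -0.24203918499284677

# Flip the sign to recover the MSE, then take the root for the RMSE
mse = -best_score
rmse = np.sqrt(mse)

print(round(mse, 4), round(rmse, 4))
```

The RMSE is expressed in the same units as `Std_U`, which makes it easier to compare with the values in that column.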

## Conclusion
In this notebook, you learned about hyperparameters and how to find optimal parameters for our model using Grid Search and cross-validation. You also learned how to split our dataset into train and test datasets.