# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost when you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or python modules that might be required to run this notebook.

## Linear Regression and Predicting GLD Movement
In the previous notebooks, we have covered how to create input and output datasets, cross-validation, train-test split and pipelines. 

In this notebook, we will learn how to use the linear regression function on our dataset. We will also predict the upward and downward movement of GLD and evaluate the error of the model. The key steps are:

1. [Import the Data](#import)
2. [Data Preprocessing](#preprocess)
3. [Linear Regression](#lreg)
4. [Predict GLD Movement](#pred)

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import metrics

# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore")

<a id='import'></a>
## Import the Data

The dependent (`yU`, `yD`) and independent (`X`) variables for the `train` and `test` datasets are read from CSV files. This data was prepared in the previous section and is available in the downloadable zip folder in the last section of this course.

In [2]:
# Define the path for the data files
path = "../data_modules/"

# Read the train and test datasets
X_train = pd.read_csv(path + "gold_prices_X_train.csv",
                      index_col=0, parse_dates=True)
X_test = pd.read_csv(path + "gold_prices_X_test.csv",
                     index_col=0, parse_dates=True)
yU_train = pd.read_csv(path + "gold_prices_yU_train.csv",
                       index_col=0, parse_dates=True)
yD_train = pd.read_csv(path + "gold_prices_yD_train.csv",
                       index_col=0, parse_dates=True)
yU_test = pd.read_csv(path + "gold_prices_yU_test.csv",
                      index_col=0, parse_dates=True)
yD_test = pd.read_csv(path + "gold_prices_yD_test.csv",
                      index_col=0, parse_dates=True)
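
The `index_col=0, parse_dates=True` arguments used above can be illustrated with a minimal sketch. The CSV text here is a made-up, in-memory stand-in for the course files (the column names and values are placeholders), so it shows only how the first column becomes a datetime index.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a file such as gold_prices_X_test.csv
csv_text = (
    "Date,Open,S_3\n"
    "2019-05-13,122.63,121.18\n"
    "2019-05-14,122.60,121.77\n"
)

# index_col=0 uses the first column as the index;
# parse_dates=True asks pandas to parse that index as dates
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(demo.index).__name__)
print(demo)
```

A datetime index keeps the rows in chronological order, which the time-series cross-validation in the next section relies on.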

<a id='preprocess'></a>
## Data Preprocessing

We will now use a `Pipeline` that chains feature scaling and linear regression. We will treat `fit_intercept` as the hyperparameter to tune. Next, we will use `GridSearchCV` with a time-series cross-validation split to select the better setting and build our model.

In [3]:
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Defining pipeline
pipeline = Pipeline(steps)

# Here we tune fit_intercept (0 = no intercept, 1 = with intercept).
# Newer scikit-learn versions may require the booleans [False, True] instead.
parameters = {'linear__fit_intercept': [0, 1]}

# Using TimeSeriesSplit for cross validation
my_cv = TimeSeriesSplit(n_splits=5)

# Define reg as the GridSearchCV object that wraps the pipeline and the hyperparameter grid
reg = GridSearchCV(pipeline, parameters,
                   scoring='neg_mean_squared_error', cv=my_cv)

# Fit the model
reg.fit(X_train, yU_train)

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('linear', LinearRegression())]),
             param_grid={'linear__fit_intercept': [0, 1]},
             scoring='neg_mean_squared_error')
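
To see what `TimeSeriesSplit(n_splits=5)` does during the grid search, here is a minimal sketch on a toy array of 12 ordered samples (the array is an assumption for illustration, not the GLD data). Each fold trains on an expanding window of earlier rows and validates on the rows that immediately follow, so the model is never scored on data that precedes its training window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for 12 chronologically ordered observations
samples = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=5)
folds = list(splitter.split(samples))

# Print the row positions used for training and validation in each fold
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In every fold the largest training index is smaller than the smallest validation index, which is the property that distinguishes this splitter from a shuffled k-fold split.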

<a id='lreg'></a>
## Linear Regression

We will fit the linear regression model on the training dataset. 

<a id='best'></a>
### Best Fit Variable 

As done earlier, we will create a best fit variable from the `best_params_` attribute of the fitted grid search.

In [4]:
# Extract the best fit_intercept value (as a boolean) from the grid search
best_fit = bool(reg.best_params_['linear__fit_intercept'])

# Print best parameter
print(reg.best_params_)

{'linear__fit_intercept': 1}


We can see that `best_params_` for our model gives `linear__fit_intercept` equal to one (that is, `True`). This means the pipeline with an intercept achieved the lower cross-validated mean squared error, so our model will include an intercept.

We will use the `LinearRegression` class of the scikit-learn library to create a linear regression model. We will pass the argument `fit_intercept=best_fit`, where `best_fit` holds the intercept setting selected by the grid search.

In [5]:
# Linear Regression
reg = LinearRegression(fit_intercept=best_fit)
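
As an alternative to rebuilding the model by hand, a fitted `GridSearchCV` also exposes the refitted winning pipeline through `best_estimator_`. The sketch below is a minimal illustration on synthetic data: `X_demo`, `y_demo`, and the other variable names are hypothetical and are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): a target with a clearly non-zero mean
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=5.0, size=(120, 3))
y_demo = 3.0 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("linear", LinearRegression())])

search = GridSearchCV(pipeline,
                      {"linear__fit_intercept": [False, True]},
                      scoring="neg_mean_squared_error",
                      cv=TimeSeriesSplit(n_splits=5))
search.fit(X_demo, y_demo)

# Pull the single selected value out of the best_params_ dictionary
best_intercept = search.best_params_["linear__fit_intercept"]

# best_estimator_ is the whole pipeline refit on all the data passed to fit
best_model = search.best_estimator_
demo_predictions = best_model.predict(X_demo[-5:])

print(best_intercept, demo_predictions.shape)
```

Using `best_estimator_` keeps the scaler and the regression together, so new data is transformed the same way as the training data before prediction.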

<a id='pred'></a>
## Predict GLD Movement

We will fit the `reg` model on the training data and predict the upward deviation on the test dataset. We store the upward predictions in `yU_predict`.

In [6]:
# Fit the model
reg.fit(X_train, yU_train)

# Predict the upward deviation
yU_predict = reg.predict(X_test)

Similarly, we will fit the model on `X_train` and `yD_train` to predict the downward deviation, and assign the predictions to a variable named `yD_predict`. Calling `fit` again overwrites the coefficients learned for the upward target, which is why `yU_predict` was computed first.

In [7]:
# Fit the model
reg.fit(X_train, yD_train)

# Predict the downward deviation
yD_predict = reg.predict(X_test)
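
Because the same `reg` object is refit for each target, only the most recent fit is available afterwards. If both fitted models are needed later, one option is to keep a separate estimator per target. The following is a minimal sketch on synthetic, noise-free data; every name ending in `_demo`, and the `targets`, `models`, and `predictions` dictionaries, are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features (assumption): 50 training rows and 10 test rows
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(50, 3))
X_test_demo = rng.normal(size=(10, 3))

# Two made-up targets standing in for the upward and downward deviations
targets = {
    "yU": X_train_demo @ np.array([0.5, 0.0, 0.1]) + 0.2,
    "yD": X_train_demo @ np.array([0.0, -0.3, 0.2]) + 0.4,
}

# Fit one independent model per target so that neither fit is overwritten
models = {name: LinearRegression().fit(X_train_demo, y)
          for name, y in targets.items()}

# Predict each target with its own model
predictions = {name: model.predict(X_test_demo)
               for name, model in models.items()}

for name, model in models.items():
    print(name, model.intercept_, model.coef_)
```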

We will add the predicted values to the `X_test` dataset for further analysis.

In [8]:
# Create new columns in X_test
X_test['yU_predict'] = yU_predict
X_test['yD_predict'] = yD_predict

# Print tail of X_test
X_test.tail()

| Date | Open | S_3 | S_15 | S_60 | OD | OL | Corr | yU_predict | yD_predict |
|---|---|---|---|---|---|---|---|---|---|
| 2019-05-08 | 121.540001 | 120.89 | 120.606668 | 122.611834 | 0.520004 | -0.330002 | -0.221595 | 0.501053 | 0.533787 |
| 2019-05-09 | 120.959999 | 120.976667 | 120.633335 | 122.567001 | -0.580002 | -0.049995 | -0.290695 | 0.508194 | 0.526315 |
| 2019-05-10 | 121.410004 | 121.106667 | 120.694668 | 122.522667 | 0.450005 | -0.210007 | -0.280418 | 0.512847 | 0.528816 |
| 2019-05-13 | 122.629997 | 121.18 | 120.765334 | 122.490334 | 1.219993 | -1.199997 | 0.078028 | 0.5046 | 0.530648 |
| 2019-05-14 | 122.599998 | 121.766665 | 120.918667 | 122.467167 | -0.029999 | 0.07 | 0.365089 | 0.473075 | 0.538222 |


Finally, let's see how well our model performed by calculating the mean squared error. This metric will simply give us the average of the square of the differences between the actual values and predicted values. Since this is an error metric, a lower value is desirable.

In [9]:
# Calculate MSE for upward deviations
yU_MSE = metrics.mean_squared_error(yU_test, yU_predict)

# Calculate MSE for downward deviations
yD_MSE = metrics.mean_squared_error(yD_test, yD_predict)

# Print the results
print(yU_MSE, yD_MSE)

0.11543997220418288 0.10119894418942185


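We get an MSE of about 0.115 for the upward deviations and 0.101 for the downward deviations. MSE is expressed in squared units of the target rather than as a percentage, so these values are best judged relative to the scale of the deviations or against a simple baseline, such as always predicting the training mean.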
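
To make the metric concrete, the sketch below computes the MSE by hand, checks it against `metrics.mean_squared_error`, and compares it with a constant baseline. All numbers are made up for illustration (`assumed_train_mean` in particular is a hypothetical value), so the output says nothing about the GLD model itself.

```python
import numpy as np
from sklearn import metrics

# Made-up actual values and model predictions (not taken from the GLD data)
y_actual = np.array([0.5, 0.2, 0.9, 0.4])
y_model = np.array([0.4, 0.3, 0.7, 0.4])

# Hypothetical baseline: always predict an assumed training-set mean
assumed_train_mean = 0.5
y_baseline = np.full_like(y_actual, assumed_train_mean)

# MSE is the average of the squared differences
manual_mse = np.mean((y_actual - y_model) ** 2)
library_mse = metrics.mean_squared_error(y_actual, y_model)
baseline_mse = metrics.mean_squared_error(y_actual, y_baseline)

print(manual_mse, library_mse, baseline_mse)
```

A model whose MSE is not clearly below that of such a constant baseline has learned little beyond the average level of the target.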

## Conclusion
In this notebook, we predicted the upward and downward movement of GLD, represented by `yU_predict` and `yD_predict` respectively, using the `LinearRegression` class. We also learned how to calculate the mean squared error to evaluate our model's performance.