# Notebook Instructions

1. All the <u>code and data files</u> used in this course are available in the downloadable unit of the <u>last section of this course</u>.
2. You can run the notebook document sequentially (one cell at a time) by pressing **Shift + Enter**. 
3. While a cell is running, a [*] is shown on the left. After the cell is run, the output will appear on the next line.

This course is based on specific versions of Python packages. You can find the details of the packages in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>.

## Implementing Trading with Machine Learning Regression: Part 2
In the previous notebook, we covered how to import data and create indicators, and we defined the dependent and independent variables for linear regression. Part 2 of the flowchart below shows the steps involved in implementing the trading strategy.
![image.png](https://d2a032ejo53cab.cloudfront.net/Glossary/bRWziD3a/p2.drawio.png)

In this notebook, you will apply the machine learning regression technique. We will implement a linear regression model on a Gold ETF that predicts the day's high and low from the day's open and the indicators defined in the previous notebook. The key steps are:

1. [Import the Data](#import)
2. [Preprocess the Data](#preprocess)
3. [Grid Search Cross-Validation](#cross)
4. [Split Train and Test Data](#split)
5. [Predict the High-Low Prices](#prediction)


In [1]:
# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Import the libraries
import numpy as np
import pandas as pd

# For Plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore")

<a id='import'></a>
## Import the Data

The input data is stored in `input_parameters.csv`, which we import here as `gold_prices` to make predictions using a pipeline.

In [2]:
# Read the data
gold_prices = pd.read_csv(
    '../data_modules/input_parameters.csv', index_col='Date')

# Printing the data
gold_prices.head()

| Date | Open | High | Low | Close | S_3 | S_15 | S_60 | Corr | Std_U | Std_D | OD | OL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 | NaN | NaN | NaN | NaN | 0.75 | 5.490005 | NaN | NaN |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 | NaN | NaN | NaN | NaN | 0.210007 | 3.139999 | -1.100006 | 3.589996 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 | NaN | NaN | NaN | NaN | 1.139999 | 1.489991 | -1.089996 | 1.009995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 | 132.326665 | NaN | NaN | NaN | 1.190003 | 0.5 | 0.309997 | 1.25 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 | 133.323334 | NaN | NaN | NaN | 0.020004 | 1.399994 | 1.880005 | 1.699997 |


### Checking for NaN Values
Here we will check for NaN values and then drop all rows that contain them using the `dropna` method.

In [3]:
gold_prices.isna().sum()

Open      0
High      0
Low       0
Close     0
S_3       3
S_15     15
S_60     60
Corr     13
Std_U     0
Std_D     0
OD        1
OL        1
dtype: int64

We have 60 NaN values in `S_60`, 15 in `S_15`, 13 in `Corr`, 3 in `S_3`, and 1 each in `OD` and `OL`. Now we will simply drop all rows containing NaN values using `dropna`.

In [4]:
# Dropping all the NaN values
gold_prices.dropna(inplace=True)

# Checking for NaN values
gold_prices.isna().sum()

Open     0
High     0
Low      0
Close    0
S_3      0
S_15     0
S_60     0
Corr     0
Std_U    0
Std_D    0
OD       0
OL       0
dtype: int64

Now our dataframe `gold_prices` is free from NaN values.

In [5]:
# Independent variables
X = gold_prices[['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr']]

# Dependent variable for upward deviation
yU = gold_prices['Std_U']

# Dependent variable for downward deviation
yD = gold_prices['Std_D']

<a id='preprocess'></a>
## Data Preprocessing
Feeding a machine learning model with preprocessed data is essential. Raw data often contains errors and inconsistencies, and using it directly can lead to inconsistent and erroneous results.


### Scaling
Suppose one feature has a variance that is an order of magnitude larger than that of the other features. In that case, it might dominate the objective function and make the estimator unable to learn from the other features correctly. To avoid this, we use `StandardScaler`, which rescales each feature to zero mean and unit variance.
For more details about how scaling works, please refer to [Section 3, Unit 1](https://quantra.quantinsti.com/startCourseDetails?cid=43&section_no=3&unit_no=1&course_type=paid&unit_type=Video).
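
As a minimal sketch of what standard scaling does to a single feature, the snippet below rescales two small, made-up lists of values by hand. The numbers and the helper name `standardize` are illustrative assumptions, not part of the course data. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical feature on a large scale (e.g. a price level)
prices = [120.0, 122.0, 124.0, 126.0, 128.0]
# Hypothetical feature on a small scale (e.g. a correlation)
corrs = [-0.2, -0.1, 0.0, 0.1, 0.2]

scaled_prices = standardize(prices)
scaled_corrs = standardize(corrs)

print(scaled_prices)
print(scaled_corrs)
```

After scaling, both features live on the same scale, so neither dominates the fit merely because of its units.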

### Linear Regression
As discussed in [Section 4, Unit 1](https://quantra.quantinsti.com/startCourseDetails?cid=43&section_no=4&unit_no=1&course_type=paid&unit_type=Video), linear regression uses independent variables to predict a dependent variable using a linear equation. Here we use `X` as the independent variables and `yU` and `yD` as the dependent variables.

## Pipeline

As explained in [Section 3, Unit 7](https://quantra.quantinsti.com/startCourseDetails?cid=43&section_no=3&unit_no=7&course_type=paid&unit_type=Document), we define a list of tuples that specify the machine learning tasks in their order of execution.

Each entry in `steps` is a `(name, transform)` tuple: the name is the label given to the task, and the transform is the object used to perform it. The pipeline then applies the steps sequentially.

Syntax: 
```python
steps = [(name_1,transform_1), (name_2,transform_2),........(name_n,transform_n)]
Pipeline(steps)
```
We are using the following two steps in our pipeline:
1. Scaling the data. 
2. Fitting the data using the linear regression model.
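
The step names are more than labels: scikit-learn exposes each step's parameters under the key `<step name>__<parameter name>`. The following sketch assumes scikit-learn is installed and uses the same two steps as this notebook. The variable name `demo_pipeline` is illustrative. It shows how such a parameter can be looked up and set.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The same two steps that this notebook uses
demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('linear', LinearRegression())])

# Parameters of a step are addressed as '<step name>__<parameter name>'
has_key = 'linear__fit_intercept' in demo_pipeline.get_params()
print(has_key)

# Setting the parameter through the pipeline changes the underlying step
demo_pipeline.set_params(linear__fit_intercept=False)
print(demo_pipeline.named_steps['linear'].fit_intercept)
```

This naming scheme is why the hyperparameter grid in this notebook uses the key `linear__fit_intercept`.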

In [6]:
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Defining pipeline
pipeline = Pipeline(steps)

## Hyperparameters

There are some parameters that the model itself cannot estimate from the data, but we still need to account for them because they can play a crucial role in the model's performance. Such parameters are called hyperparameters. Here we tune only `fit_intercept`, which controls whether the linear regression fits an intercept term, but you can add more hyperparameters to tune this algorithm.
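
To make the role of the intercept concrete, here is a small by-hand sketch for a single feature, using made-up data generated from y = 2x + 5. The helper names `fit_with_intercept`, `fit_without_intercept`, and `sse` are hypothetical. They apply the standard least-squares formulas for one feature rather than calling scikit-learn.

```python
def fit_with_intercept(x, y):
    """Least-squares line y = a + b*x (intercept fitted)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    b = cov / var
    a = y_bar - b * x_bar
    return a, b


def fit_without_intercept(x, y):
    """Least-squares line y = b*x, forced through the origin."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    return 0.0, b


def sse(x, y, a, b):
    """Sum of squared errors of the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data lying exactly on y = 2x + 5
x = [1.0, 2.0, 3.0, 4.0]
y = [2 * xi + 5 for xi in x]

a1, b1 = fit_with_intercept(x, y)
a0, b0 = fit_without_intercept(x, y)
print("with intercept:   ", a1, b1, sse(x, y, a1, b1))
print("without intercept:", a0, b0, sse(x, y, a0, b0))
```

On this data the fit with an intercept recovers the line exactly, while the fit through the origin cannot. The grid search compares the two settings in the same spirit, using cross-validated scores.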

In [7]:
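# Here we use the intercept setting as the hyperparameter (0 = False, 1 = True).
# The key is '<step name>__<parameter name>', so this tunes fit_intercept of the 'linear' step.
# Note: newer scikit-learn releases may require explicit booleans, i.e. [False, True].
parameters = {'linear__fit_intercept': [0, 1]}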

<a id='cross'></a>
## Grid Search Cross-Validation
As described in [Section 3, Unit 14](https://quantra.quantinsti.com/startCourseDetails?cid=43&section_no=3&unit_no=14&course_type=paid&unit_type=Video), cross-validation indicates how the model is likely to perform in a practical situation and helps to tackle overfitting. We will use `GridSearchCV`, scikit-learn's built-in class for hyperparameter search with cross-validation.

We pass `TimeSeriesSplit(n_splits=5)` as `cv`, which implies that the grid search will average the performance results over five rounds of cross-validation. `TimeSeriesSplit` splits the training data into successive segments, so that each validation segment comes after the data used for training. We are using `GridSearchCV` instead of `RandomizedSearchCV` because there are only a few hyperparameter combinations to try.
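
To see how `TimeSeriesSplit` treats time-ordered data, the following sketch reproduces its expanding-window idea by hand on 12 hypothetical observations. The function name `expanding_window_splits` is an assumption made for illustration. It is intended to mirror the default behaviour of `TimeSeriesSplit` (no gap, equal-sized test folds) without calling scikit-learn.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train} test={test}")
```

Every test fold lies strictly after its training fold, so the model is never validated on observations that precede the data it was trained on.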



In [8]:
# Using TimeSeriesSplit for cross validation
my_cv = TimeSeriesSplit(n_splits=5)

# Defining reg as variable for GridSearch function containing pipeline, hyperparameters
reg = GridSearchCV(pipeline, parameters, cv=my_cv)

<a id='split'></a>
## Split Train and Test Data

Now, we will split data into train and test data sets. 

1. The first 70% of the data is used for training and the remaining data for testing.
2. The grid search is then fitted on the training data.

In [9]:
splitting_ratio = 0.70

# Splitting the data into two parts
# int() ensures that the split index is an integer
split = int(splitting_ratio * len(gold_prices))

# Defining train dataset
X_train = X[:split]
yU_train = yU[:split]
yD_train = yD[:split]

# Defining test data (a copy, so that new columns can safely be added to it later)
X_test = X[split:].copy()

<a id='prediction'></a>
## Prediction

We will fit the linear regression model on the training dataset and predict the upward deviation in the test dataset. 


In [10]:
# Fit the model
reg.fit(X_train, yU_train)

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('linear', LinearRegression())]),
             param_grid={'linear__fit_intercept': [0, 1]})

After fitting, the best hyperparameter setting found by the grid search is available in the `best_params_` attribute of `reg`. `best_params_` is a dictionary that maps each tuned hyperparameter to its best value. Here it contains only `linear__fit_intercept`, a boolean setting for which 0 means False and 1 means True, so it tells us whether the model performs better with or without a fitted intercept.

In [11]:
# Print best parameter
print(reg.best_params_)

{'linear__fit_intercept': 1}


We can see that `best_params_` for our model gives `linear__fit_intercept` equal to 1, that is, the model performs better when the intercept is fitted.

Here we predict the upward deviation on the test dataset using the `reg` model and store the result in `yU_predict`.

In [12]:
# Predict the upward deviation
yU_predict = reg.predict(X_test)

Similarly, we will fit the model to predict the downward deviation using `X_train` and `yD_train`. Then, we will print `best_params_` for the fitted model. Finally, we will predict the downward deviation and assign it to a variable named `yD_predict`.

In [13]:
# Fit the model
reg.fit(X_train, yD_train)

# Print best parameter
print(reg.best_params_)

# Predict the downward deviation
yD_predict = reg.predict(X_test)

{'linear__fit_intercept': 1}


Now we will create the `yU_predict` and `yD_predict` columns in `X_test`. The upward and downward deviations are given by:

Upward deviation = High - Open

Downward deviation = Open - Low

Since the high is never below the open and the low is never above it, the upward and downward deviations cannot be negative. So, we replace negative predicted values with zero.


In [14]:
# Create new column in X_test
X_test['yU_predict'] = yU_predict
X_test['yD_predict'] = yD_predict

# Assign zero to all the negative predicted values to take into account real life conditions
X_test.loc[X_test['yU_predict'] < 0, 'yU_predict'] = 0
X_test.loc[X_test['yD_predict'] < 0, 'yD_predict'] = 0

We will use the predicted upside deviation values to calculate the high price and the predicted downside deviation values to calculate the low price.
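
As a worked example of clipping negative deviations and then turning them into price levels, the sketch below uses plain Python. The first set of inputs is taken from the 2019-05-08 row of the test output in this notebook. The negative prediction and the helper name `predicted_range` are made up for illustration.

```python
def predicted_range(open_price, up_dev, down_dev):
    """Return (predicted high, predicted low) from the open and the predicted deviations."""
    up_dev = max(up_dev, 0.0)      # deviations cannot be negative
    down_dev = max(down_dev, 0.0)
    return open_price + up_dev, open_price - down_dev


# Values from the 2019-05-08 row: Open, yU_predict, yD_predict
p_h, p_l = predicted_range(121.540001, 0.521922, 0.538998)
print(round(p_h, 6), round(p_l, 6))

# Hypothetical case: a negative upward prediction is clipped to zero,
# so the predicted high equals the open
p_h_clipped, _ = predicted_range(121.540001, -0.05, 0.538998)
print(p_h_clipped)
```

The first call reproduces the `P_H` and `P_L` values reported for that date in the test output.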

In [15]:
# Add the predicted upward deviation to the open to get the predicted high
X_test['P_H'] = X_test['Open'] + X_test['yU_predict']

# Subtract the predicted downward deviation from the open to get the predicted low
X_test['P_L'] = X_test['Open'] - X_test['yD_predict']

# Print the tail of the X_test dataframe
X_test.tail()

| Date | Open | S_3 | S_15 | S_60 | OD | OL | Corr | yU_predict | yD_predict | P_H | P_L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-05-08 | 121.540001 | 120.89 | 120.606668 | 122.611834 | 0.520004 | 0.330002 | -0.221595 | 0.521922 | 0.538998 | 122.061923 | 121.001003 |
| 2019-05-09 | 120.959999 | 120.976667 | 120.633335 | 122.567001 | -0.580002 | 0.049995 | -0.290695 | 0.526836 | 0.536895 | 121.486835 | 120.423104 |
| 2019-05-10 | 121.410004 | 121.106667 | 120.694668 | 122.522667 | 0.450005 | 0.210007 | -0.280418 | 0.534113 | 0.533184 | 121.944117 | 120.87682 |
| 2019-05-13 | 122.629997 | 121.18 | 120.765334 | 122.490334 | 1.219993 | 1.199997 | 0.078028 | 0.52394 | 0.544198 | 123.153937 | 122.085799 |
| 2019-05-14 | 122.599998 | 121.766665 | 120.918667 | 122.467167 | -0.029999 | -0.07 | 0.365089 | 0.493836 | 0.550817 | 123.093834 | 122.049181 |


Here we add the `Close`, `High`, and `Low` columns from `gold_prices` because we will need them to calculate strategy returns in the following notebook.
We use the `split` index to select only the test part of `gold_prices`.

In [16]:
# Copy columns from gold_prices to X_test
X_test[['Close', 'High', 'Low']] = gold_prices[['Close', 'High', 'Low']][split:]
X_test.tail()

| Date | Open | S_3 | S_15 | S_60 | OD | OL | Corr | yU_predict | yD_predict | P_H | P_L | Close | High | Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-05-08 | 121.540001 | 120.89 | 120.606668 | 122.611834 | 0.520004 | 0.330002 | -0.221595 | 0.521922 | 0.538998 | 122.061923 | 121.001003 | 120.910004 | 121.540001 | 120.769997 |
| 2019-05-09 | 120.959999 | 120.976667 | 120.633335 | 122.567001 | -0.580002 | 0.049995 | -0.290695 | 0.526836 | 0.536895 | 121.486835 | 120.423104 | 121.199997 | 121.620003 | 120.860001 |
| 2019-05-10 | 121.410004 | 121.106667 | 120.694668 | 122.522667 | 0.450005 | 0.210007 | -0.280418 | 0.534113 | 0.533184 | 121.944117 | 120.87682 | 121.43 | 121.730003 | 121.300003 |
| 2019-05-13 | 122.629997 | 121.18 | 120.765334 | 122.490334 | 1.219993 | 1.199997 | 0.078028 | 0.52394 | 0.544198 | 123.153937 | 122.085799 | 122.669998 | 122.849998 | 122.330002 |
| 2019-05-14 | 122.599998 | 121.766665 | 120.918667 | 122.467167 | -0.029999 | -0.07 | 0.365089 | 0.493836 | 0.550817 | 123.093834 | 122.049181 | 122.459999 | 122.660004 | 122.120003 |


## Store the Data in a CSV File
Now we will store our test data for strategy analysis by saving our dataframe to `test_dataset_pred_high_low.csv`.

In [17]:
# Storing the data for the next notebook
X_test[['Close', 'High', 'P_H', 'Low', 'P_L']].to_csv(
    'test_dataset_pred_high_low.csv', index=True)

### Tweak the Code

For further practice, you can tweak the code in the following ways:
1. Use different data sets: backtest and try out the model on different data sets.
2. Features: create your own features using different indicators to improve the prediction accuracy.
3. Try Random Search for hyperparameter selection and compare the results.
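
For the third suggestion, a minimal sketch of a randomized search might look as follows. It assumes scikit-learn and NumPy are installed and uses synthetic data in place of the Gold ETF features. The variable names (`X_demo`, `y_demo`, `demo_pipeline`, `random_search`) are illustrative. With only two candidate values, the random search covers the same settings as `GridSearchCV`; the difference becomes meaningful once the parameter space is larger.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo @ np.array([0.5, -0.2, 0.1]) + 1.0
          + rng.normal(scale=0.1, size=200))

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('linear', LinearRegression())])

# Candidate values to sample from
param_distributions = {'linear__fit_intercept': [False, True]}

random_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions,
    n_iter=2,                        # number of sampled settings
    cv=TimeSeriesSplit(n_splits=5),  # same cross-validation scheme as above
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

To compare with the grid search, inspect `best_params_` and `best_score_` on both fitted objects.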

## Conclusion
In this notebook, we predicted the high and low values, represented by `P_H` and `P_L`, respectively.
The next notebook will generate trading signals using the predicted highs and lows. We will also calculate the strategy returns and generate the performance statistics.