# Notebook Instructions

1. All the <u>code and data files</u> used in this course are available in the downloadable unit of the <u>last section of this course</u>.
2. You can run the notebook document sequentially (one cell at a time) by pressing **Shift + Enter**. 
3. While a cell is running, a [*] is shown on the left. After the cell is run, the output will appear on the next line.

This course is based on specific versions of Python packages. You can find the details of the packages in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>.

# Target and Features

A machine learning algorithm requires a set of features to predict the target variable. 

In this notebook, you will perform the following steps:

1. [Problem Statement](#problem-statement)
2. [Read the Data](#read)
3. [Target Variable](#target)
4. [Features](#features)
5. [Create X and y](#x-and-y)<br>
    a. [Stationarity Check](#stationary)<br>
    b. [Correlation Check](#correlation)<br>

<a id='problem-statement'></a> 
## Problem Statement

Let's say you have decided that you want to trade in J.P. Morgan. Next, you should decide whether to go long in J.P. Morgan at a given point in time. Thus, the problem statement will be:<br>

<b><i>Whether to buy J.P. Morgan's stock at a given time or not?</i></b>

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import talib as ta
from statsmodels.tsa.stattools import adfuller

<a id='read'></a> 
## Read the Data

The 15-minute OHLCV data of J.P. Morgan stock price is stored in a CSV file `JPM.csv` in the `data_modules` directory. The data ranges from January 2017 to  December 2019. You can download this data from the last unit of this course '**Python Codes and Data**'.

To read a CSV file, you can use the `read_csv` method of pandas. The syntax is shown below.

Syntax: 
```python
import pandas as pd
pd.read_csv(filename)
```
**filename:** Complete path of the file and file name in string format.

In [2]:
# The data is stored in the directory 'data_modules'
path = "../data_modules/"

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)

<a id='target'></a> 
## Target Variable

Target variable is what the machine learning model tries to predict in order to solve the problem statement. It is referred to as **y**. 

Going back to our problem statement, **Whether to buy J.P. Morgan's stock or not?**, we will create a column, `signal`. The `signal` column will have two labels, `1` and `0`. Whenever the label is `1`, the model indicates a **buy** signal. And whenever the label is `0`, the model indicates **do not buy**. We will assign `1` to the `signal` column whenever the future returns will be greater than 0.

The `future_returns` can be calculated using the `pct_change` method of `pandas`. The `pct_change` method will calculate the percentage change for the current time period.

Since we want the future returns, we will shift the percentage change for the current period to the previous time period. This can be done using the `shift` method of `pandas`.

Syntax:
```python
DataFrame[column].pct_change().shift(period)
```

Parameters:</br>
1. **column:** The column for which the percentage change is to be calculated.
2. **period:** The period to shift the series. To shift to the current value to the previous time period, the period will be `-1`.

In [3]:
# Create a column 'future_returns' with the calculation of percentage change
data['future_returns'] = data['close'].pct_change().shift(-1)

# Create the signal column
data['signal'] = np.where(data['future_returns'] > 0, 1, 0)

data.head()

Unnamed: 0,open,high,low,close,volume,future_returns,signal
2017-01-03 09:45:00+00:00,87.34,87.75,87.02,87.39,2184761.0,-0.002289,0
2017-01-03 10:00:00+00:00,87.39,87.44,86.95,87.19,1148228.0,0.001262,1
2017-01-03 10:15:00+00:00,87.21,87.41,87.14,87.3,860609.0,0.000916,1
2017-01-03 10:30:00+00:00,87.31,87.38,87.26,87.38,481605.0,-0.002861,0
2017-01-03 10:45:00+00:00,87.37,87.46,87.13,87.13,675950.0,-0.006312,0


<a id='features'></a> 
## Features

In order to predict the `signal`, we will create the input variables for the ML model. These input variables are called features. The features are referred to as **X**. You can create the features in such a way that each feature in your dataset has some predictive power.

We will start by creating the 15-minute, 30-minute and 75-minute prior percentage change columns.

In [4]:
# Create a column 'pct_change' with the 15-minute prior percentage change
data['pct_change'] = data['close'].pct_change()

# Create a column 'pct_change2' with the half an hour prior percentage change
data['pct_change2'] = data['close'].pct_change(2)

# Create a column 'pct_change5' with the 75-minute prior percentage change
data['pct_change5'] = data['close'].pct_change(5)

Next, we will calculate the technical indicators, RSI and ADX. These can be done by using the `RSI` and `ADX` method of the `talib` library.

Syntax:
```python
import talib as ta
ta.RSI(data, timeperiod)
ta.ADX(data_high, data_low, data_open, timeperiod)
```

The parameters above are self-explanatory.

Since there are 6.5 trading hours in a day, and ours is a 15-minutes data, the time period will be 6.5*4.

In [5]:
# Create a column by the name RSI, and assign the RSI values to it
data['rsi'] = ta.RSI(data['close'].values, timeperiod=int(6.5*4))

# Create a column by the name ADX, and assign the ADX values to it
data['adx'] = ta.ADX(data['high'].values, data['low'].values,
                     data['open'].values, timeperiod=int(6.5*4))

We will now create the simple moving average and rolling correlation of the close price. This can be done by using the `mean` and the `corr` method of the pandas library.

Syntax:
```python
DataFrame[column].rolling(window).mean()
DataFrame[column].rolling(window).corr()
```

**column:** The column to perform the operation on.<br>
**window:** The span of the rolling window.

We will calculate the daily moving average and correlation. 

In [6]:
# Create a column by the name sma, and assign the SMA values to it
data['sma'] = data['close'].rolling(window=int(6.5*4)).mean()

# Create a column by the name corr, and assign the correlation values to it
data['corr'] = data['close'].rolling(window=int(6.5*4)).corr(data['sma'])

Let us now calculate the volatility of the stock. This can be done by calculating the rolling standard deviation of the `pct_change` column. 

In [7]:
# 1-day and 2-day volatility
data['volatility'] = data.rolling(
    int(6.5*4), min_periods=int(6.5*4))['pct_change'].std()*100

data['volatility2'] = data.rolling(
    int(6.5*8), min_periods=int(6.5*8))['pct_change'].std()*100

data.tail()

Unnamed: 0,open,high,low,close,volume,future_returns,signal,pct_change,pct_change2,pct_change5,rsi,adx,sma,corr,volatility,volatility2
2019-12-31 15:00:00+00:00,138.72,138.786,138.72,138.755,137277.0,0.000144,1,0.000108,-0.0009,0.000108,48.917663,18.0599,138.707846,-0.233961,0.076398,0.110535
2019-12-31 15:15:00+00:00,138.745,138.787,138.655,138.775,139979.0,0.000973,1,0.000144,0.000252,-0.000547,49.359122,18.104801,138.705885,-0.348533,0.073904,0.11043
2019-12-31 15:30:00+00:00,138.76,138.93,138.704,138.91,144914.0,0.002088,1,0.000973,0.001117,0.00072,52.255662,17.906918,138.707615,-0.53808,0.076218,0.111184
2019-12-31 15:45:00+00:00,138.91,139.22,138.883,139.2,336305.0,0.001365,1,0.002088,0.003063,0.002304,57.665344,17.283036,138.729346,-0.511921,0.078749,0.11485
2019-12-31 16:00:00+00:00,139.195,139.48,139.14,139.39,949197.0,,0,0.001365,0.003455,0.004685,60.699494,16.886426,138.757038,-0.466739,0.082235,0.11596


<a id='x-and-y'></a> 
## Create X and y

Before creating the features (`X`) and target(`y`), we will drop the rows with any missing values.

In [8]:
# Drop the missing values
data.dropna(inplace=True)

Since we have created features using the original columns of the dataset, we will not consider the original columns (`high`, `low`, `open`, `volume`, `close`) in features.

Store the `signal` column in `y` and features in `X`. The columns in the variable `X` will be the input for the ML model and the `signal` column in `y` will be the output that the ML model will predict.

In [9]:
# Target
y = data[['signal']].copy()

# Features
X = data[['pct_change', 'pct_change2', 'pct_change5', 'rsi',
       'adx', 'sma', 'corr', 'volatility', 'volatility2']].copy()

<a id='stationary'></a> 
### Stationarity Check

As you have seen that most ML algorithm requires stationary features, we will drop the non-stationary columns from `X`.

You can use the `adfuller` method from the `statsmodels` library to perform this test in Python, and compare the p-value.
- If the p-value is less than or equal to 0.05, you reject H0.
- If the p-value is greater than 0.05, you fail to reject H0.


To use the `adfuller` method, you need to import it from the `statsmodels` library as shown below:

```python
from statsmodels.tsa.stattools import adfuller
```

The `adfuller` method can be used as shown below:

```python
result = adfuller(X)
```

The p-value can be accessed as `result[1]`.

In [10]:
def stationary(series):
    """Function to check if the series is stationary or not.
    """

    result = adfuller(series)
    if(result[1] < 0.05):
        return 'stationary'
    else:
        return 'not stationary'


# Check for stationarity
for col in X.columns:
    if stationary(data[col]) == 'not stationary':
        print('%s is not stationary. Dropping it.' % col)
        X.drop(columns=[col], axis=1, inplace=True)
    else:
        print('%s is stationary.' % col)

pct_change is stationary.
pct_change2 is stationary.
pct_change5 is stationary.
rsi is stationary.
adx is stationary.
sma is not stationary. Dropping it.
corr is stationary.
volatility is stationary.
volatility2 is stationary.


Thus, you can see that all the columns but `sma` are stationary. The `sma` column is dropped from the dataset.

<a id='correlation'></a> 
### Correlation Check

Let us now check for correlation between the features. 

In [11]:
def get_pair_above_threshold(X, threshold):
    """Function to return the pairs with correlation above threshold.
    """
    # Calculate the correlation matrix
    correl = X.corr()

    # Unstack the matrix
    correl = correl.abs().unstack()

    # Recurring & redundant pair
    pairs_to_drop = set()
    cols = X.corr().columns
    for i in range(0, X.corr().shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))

    # Drop the recurring & redundant pair
    correl = correl.drop(labels=pairs_to_drop).sort_values(ascending=False)

    return correl[correl > threshold].index


print(get_pair_above_threshold(X, 0.7))

MultiIndex([('volatility', 'volatility2')],
           )


In the above output, you can see that the correlation between `volatility` and `volatility2` is above threshold of 0.7. Hence, we should drop any one of the above columns. We will drop the `volatility2` column.

In [12]:
# Drop the highly correlated column
X = X.drop(columns=['volatility2'], axis=1)

### Display the Final Features

In [13]:
X.head()

Unnamed: 0,pct_change,pct_change2,pct_change5,rsi,adx,corr,volatility
2017-01-05 09:45:00+00:00,-0.000115,-0.00092,-0.00092,46.849667,9.932116,0.795243,0.112499
2017-01-05 10:00:00+00:00,-0.003108,-0.003222,-0.00391,42.011647,10.304958,0.702034,0.092029
2017-01-05 10:15:00+00:00,-0.002425,-0.005525,-0.005639,38.772892,10.916288,0.506866,0.10332
2017-01-05 10:30:00+00:00,-0.000347,-0.002771,-0.006785,38.33383,11.716813,0.313286,0.103321
2017-01-05 10:45:00+00:00,-0.003358,-0.003704,-0.009322,34.415816,12.622385,0.073429,0.12208


### Save the Files on Your Disk

The dataframes `X` and `y` has the features and target variables stored. For further use, you can export this into a CSV file using the `to_csv` method of `pandas`. 

Syntax:
```python
DataFrame.to_csv(file_name)
```

The above line will save `DataFrame` as a CSV file with the name `file_name` on your local disk.

<b> The following cell will not run in the browser. Download this notebook and convert the cell to "Code" type.</b>

## Conclusion

As most of the ML models requires stationary features, the final features in the dataframe are stationary. We have also dropped the highly correlated features. In the next section, you will learn about the types of ML algorithms.
<br><br>