# Notebook Instructions

1. All the <u>code and data files</u> used in this course are available in the downloadable unit of the <u>last section of this course</u>.
2. You can run the notebook document sequentially (one cell at a time) by pressing **Shift + Enter**. 
3. While a cell is running, a [*] is shown on the left. After the cell is run, the output will appear on the next line.

This course is based on specific versions of Python packages. You can find the details of the packages in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>.

# ML Classification Model Training and Forecasting

We have learned about the features (`X`), target (`y`), and the train-test split in the previous sections. We will now use the `X_train` and `y_train` to train a machine learning model. The model training is also referred to as "fitting" the model.

![Model Training](https://d2a032ejo53cab.cloudfront.net/Glossary/acvvSItH/1.jpg)<br>

After the model is fit, the `X_test` will be used with the trained machine learning model to get the predicted values (`y_pred`).

![Model Forecasting](https://d2a032ejo53cab.cloudfront.net/Glossary/k1PWJz87/2.jpg)<br>

This notebook is divided into the following parts:

1. [Read the Data](#read)
1. [Select a Classification Model](#model)
1. [Train the Model](#train)
1. [Forecast Data](#forecast)

## Import Libraries

In [1]:
# For data manipulation
import pandas as pd

# Import sklearn's classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

<a id='read'></a> 
## Read the Data
The target (`y`) and features (`X`) for the `train` and `test` datasets are read from the CSV files. This data was prepared in the previous section and can be downloaded from the downloadable zip folder in the last section of this course.

In [2]:
# Define the path for the data files
path = "../data_modules/"

# Read the target and features of the training and testing data
X_train = pd.read_csv(
    path + "JPM_features_training_2017_2019.csv", index_col=0, parse_dates=True)
X_test = pd.read_csv(
    path + "JPM_features_testing_2017_2019.csv", index_col=0, parse_dates=True)
y_train = pd.read_csv(
    path + "JPM_target_training_2017_2019.csv", index_col=0, parse_dates=True)
y_test = pd.read_csv(path + "JPM_target_testing_2017_2019.csv",
                     index_col=0, parse_dates=True)
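
Before training, it can help to confirm that the features and target line up row by row. The following is a minimal sketch on hypothetical stand-in data (`X_demo`, `y_demo`, and their column values are assumptions made for illustration, not the course files); the same checks could be applied to `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; in the notebook, use the X_train and y_train read above
dates = pd.date_range("2019-01-01", periods=5, freq="D")
X_demo = pd.DataFrame(np.arange(10.0).reshape(5, 2),
                      index=dates, columns=["rsi", "adx"])
y_demo = pd.DataFrame({"signal": [0, 1, 0, 1, 1]}, index=dates)

# Features and target should have the same number of rows ...
same_length = len(X_demo) == len(y_demo)

# ... and the same timestamps in the same order
same_index = X_demo.index.equals(y_demo.index)

# Missing values can cause problems for many models, so check for them early
has_missing = bool(X_demo.isna().any().any())

print(same_length, same_index, has_missing)
```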

<a id='model'></a> 
## Select a Classification Model

Now we will select a classification model. For illustration, we will use the `RandomForestClassifier`. Don't worry if you are unfamiliar with this ML model; it is not important to understand how the random forest classifier works at this time, and any other classification model can be used in its place. What is important here is to learn how the training and testing data are used along with the ML model.

The `RandomForestClassifier` model from the `sklearn` package is used to create the classification model. If you are very new to machine learning, you can skip the interpretation of the parameters listed below for now.

Syntax:
```python
RandomForestClassifier(n_estimators, max_features, max_depth, random_state)
```

Parameters:
1. **n_estimators:** The number of trees in the forest.
1. **max_features:** The number of features to consider when looking for the best split.
1. **max_depth:** The maximum depth of a tree.
1. **random_state:** Seed value for the randomised bootstrapping and feature selection. This is set to replicate results for subsequent runs.

Returns:<br>
A `RandomForestClassifier` type object that can be fit on the training data and then used for making forecasts.

We have set the values for the parameters. These are for illustration and can be changed.

In [3]:
# Create the machine learning model
rf_model = RandomForestClassifier(
    n_estimators=3, max_features=3, max_depth=2, random_state=4)
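
As a small sketch of what the model object holds before training, the parameters can be read back with `get_params`. The names `rf_demo` and `lr_demo` are hypothetical and used only for this illustration; the `LogisticRegression` line simply shows that another sklearn classifier could be created in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Same illustrative parameter values as in the cell above
rf_demo = RandomForestClassifier(
    n_estimators=3, max_features=3, max_depth=2, random_state=4)

# get_params returns a dictionary of all parameters, including defaults
params = rf_demo.get_params()
print(params["n_estimators"], params["max_depth"], params["random_state"])

# Any other sklearn classifier could be used in its place (default settings here)
lr_demo = LogisticRegression()
print(type(lr_demo).__name__)
```

Both objects expose the same `fit` and `predict` methods, which is why the rest of the workflow does not depend on the choice of model.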

<a id='train'></a> 
## Train the Model

Now it is time for the model to learn from the `X_train` and `y_train`. We call the `fit` function of the model and pass the `X_train` and `y_train` datasets. 

Syntax:
```python
model.fit(X_train, y_train)
```

Parameters:
1. **model:** The model (RandomForestClassifier) object.
2. **X_train:** The features from the training dataset.
3. **y_train:** The target from the training dataset.

Returns:<br>
The `fit` function trains the model using the data passed to it. The trained model is stored in the model object where the `fit` function was applied.

In [4]:
# Fit the model on the training data
rf_model.fit(X_train, y_train['signal'])

RandomForestClassifier(max_depth=2, max_features=3, n_estimators=3,
                       random_state=4)
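
The sketch below illustrates what `fit` leaves behind, using hypothetical synthetic data (`X_demo`, `y_demo`, and the feature names `f1` to `f3` are assumptions, not the course data). It also shows why a single column such as `y_train['signal']` is passed: sklearn classifiers expect a one-dimensional target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data, used only to keep the example self-contained
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.DataFrame({"signal": (X_demo["f1"] > 0).astype(int)})

demo_model = RandomForestClassifier(
    n_estimators=3, max_features=3, max_depth=2, random_state=4)

# Pass the target as a one-dimensional Series (a single column)
fitted = demo_model.fit(X_demo, y_demo["signal"])

# fit returns the same model object, now holding the trained trees
print(fitted is demo_model)

# Attributes ending in an underscore are created during training
print(demo_model.classes_)           # the class labels seen in the target
print(len(demo_model.estimators_))   # one fitted tree per n_estimators

# Relative importance of each feature, paired with its column name
importances = {col: round(float(imp), 2)
               for col, imp in zip(X_demo.columns, demo_model.feature_importances_)}
print(importances)
```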

<a id='forecast'></a> 
## Forecast Data

The model is now ready to make forecasts. We can now pass the unseen data (`X_test`) to the model and obtain the model predicted values (`y_pred`). To make the forecast, the `predict` function is called and the unseen data is passed as a parameter.

Syntax:
```python
model.predict(X_test)
```

Parameters:
1. **model:** The model (RandomForestClassifier) object.
2. **X_test:** The features from the testing dataset.

Returns:<br>
A `numpy` array of the predicted outputs is obtained.
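
To make the return value concrete, here is a minimal sketch on hypothetical synthetic data (`X_demo`, `y_demo`, and `demo_model` are assumptions for illustration). It shows that `predict` returns one value per input row, and that `head(1)` keeps the input as a one-row table, which is the two-dimensional shape the model expects.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data, used only to keep the example self-contained
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_demo = (X_demo["f1"] > 0).astype(int)

demo_model = RandomForestClassifier(
    n_estimators=3, max_depth=2, random_state=4).fit(X_demo, y_demo)

# Predict for every row: one predicted value per row
all_rows = demo_model.predict(X_demo)

# Predict for a single row: head(1) returns a one-row DataFrame
one_row = demo_model.predict(X_demo.head(1))

print(type(all_rows).__name__, all_rows.shape)
print(one_row.shape)
```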

Let's make one prediction using the model. For illustration, we are using the first data point in the `X_test`.

In [5]:
# Get a sample day of data from X_test
unseen_data_single_day = X_test.head(1)

# Preview the data
unseen_data_single_day

| Unnamed: 0 | pct_change | pct_change2 | pct_change5 | rsi | adx | corr | volatility |
|---|---|---|---|---|---|---|---|
| 2019-05-28 12:00:00+00:00 | 0.0 | -9.1e-05 | 0.001374 | 47.746053 | 26.139722 | -0.515815 | 0.143024 |


The data is for 28th May 2019. Let us pass this to the model and get the prediction.

In [6]:
# Get the prediction of a single day
single_day_prediction = rf_model.predict(unseen_data_single_day)

# Preview the prediction
single_day_prediction

array([0], dtype=int64)

The predicted model output is 0. This means that the model is signaling no position on 28th May 2019.
Let's apply the model to all of the testing dataset.

In [7]:
# Use the model and predict the values for the test data
y_pred = rf_model.predict(X_test)

# Display the first five predictions
print("The first five predicted values", y_pred[:5])

The first five predicted values [0 0 1 0 0]


The model predictions are stored in `y_pred`. 0 means no position and 1 means a long position. With the `y_pred` we can now place trades using an ML model.
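
Since `predict` returns a plain array, the dates are lost. One way to reattach them is to wrap the predictions in a pandas Series that reuses the index of the features. This is a sketch under stated assumptions: the dates and `rsi` values below are hypothetical placeholders, and `y_pred_demo` only mirrors the five values printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical dates and feature values; in the notebook, use X_test and y_pred
dates = pd.date_range("2019-05-28", periods=5, freq="D")
X_demo = pd.DataFrame({"rsi": [47.7, 52.1, 60.3, 44.9, 49.5]}, index=dates)
y_pred_demo = np.array([0, 0, 1, 0, 0])

# Attach the feature index so that each prediction carries its date
signals = pd.Series(y_pred_demo, index=X_demo.index, name="predicted_signal")

# Dates on which the model signals a long position (value 1)
long_days = signals[signals == 1].index

print(signals)
print(long_days)
```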

<b>But how do we know that the ML model predictions are good?</b>

As we can see, the model correctly predicts the first three values of the test data. But how do we know the accuracy of the model predictions for the entire dataset? We need metrics to measure the model performance.
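
As a preview, one commonly used metric is accuracy: the fraction of predictions that match the actual values. The sketch below uses sklearn's `accuracy_score` on hypothetical arrays (`y_true_demo` and `y_pred_demo` are made-up values for illustration, not the course results).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted signals, for illustration only
y_true_demo = np.array([0, 0, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 0])

# Fraction of predictions that match the actual values
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# The same number computed by hand: matches divided by total
manual = (y_true_demo == y_pred_demo).mean()

print(accuracy)  # 4 of 5 predictions match, so 0.8
```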