# Chapter 5
# Training and Forecasting Using a Classification Model
In the previous chapters, we learned about the *features* (`X`), *target* (`y`), and the train-test split.

Now, we have to train the ML algorithm so that it performs well on data it has never seen once it is used in the real world.

We will use `X_train` and `y_train` to train the ML model. Training a model is also referred to as "fitting" the model.

![alt text](assets/graph005.png)

After the model is fit, the `X_test` will be used with the trained machine learning model to get the predicted values (`y_pred`).

![alt text](assets/graph006.png)
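The two diagrams above can be sketched end to end in a few lines. This is a minimal illustration only: the data is synthetic, and the names `X_train_demo`, `y_pred_demo`, and so on are hypothetical stand-ins, not the JPM data used in this chapter. The 80/20 split by row order is likewise an assumption made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical, not the JPM dataset)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] > 0).astype(int)

# Split by row order: the first 80 rows train, the last 20 rows test
X_train_demo, X_test_demo = X.iloc[:80], X.iloc[80:]
y_train_demo, y_test_demo = y.iloc[:80], y.iloc[80:]

# Step 1: fit the model on the training features and target
demo_model = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=4)
demo_model.fit(X_train_demo, y_train_demo)

# Step 2: pass the unseen features to the fitted model to get predictions
y_pred_demo = demo_model.predict(X_test_demo)
print(y_pred_demo.shape)
```

Note that the target of the test set (`y_test_demo`) is never shown to the model; it is only kept aside for judging the predictions later.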



In [1]:
# Import Libraries
# For data manipulation
import pandas as pd

# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

## Read the Data
The target (`y`) and features (`X`) for the `train` and `test` datasets are read from CSV files. Note that this data was prepared in the previous chapters.

In [2]:
# Read the target and features of the training and testing data

X_train = pd.read_csv('/Users/nacho/Documents/GitHub/python_course/books/Machine-Learning-Trading/assets/machine-learning-in-trading-main-main/data_modules/JPM_features_training_2017_2019.csv',
                      index_col=0, parse_dates=True)
X_test  = pd.read_csv('/Users/nacho/Documents/GitHub/python_course/books/Machine-Learning-Trading/assets/machine-learning-in-trading-main-main/data_modules/JPM_features_testing_2017_2019.csv',
                      index_col=0, parse_dates=True)
y_train = pd.read_csv('/Users/nacho/Documents/GitHub/python_course/books/Machine-Learning-Trading/assets/machine-learning-in-trading-main-main/data_modules/JPM_target_training_2017_2019.csv',
                      index_col=0, parse_dates=True)
y_test  = pd.read_csv('/Users/nacho/Documents/GitHub/python_course/books/Machine-Learning-Trading/assets/machine-learning-in-trading-main-main/data_modules/JPM_target_testing_2017_2019.csv',
                      index_col=0, parse_dates=True)
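The two arguments passed to `pd.read_csv` above do the following: `index_col=0` uses the first CSV column as the row index, and `parse_dates=True` asks pandas to parse that index as dates. A small sketch on an in-memory CSV shows the effect. The CSV text and the name `demo` are made up for illustration and do not come from the chapter's data files.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, standing in for one of the feature files
csv_text = """,pct_change,rsi
2019-05-28 12:00:00,0.0,47.7
2019-05-29 12:00:00,0.001,48.2
"""

# First column becomes the index and is parsed as datetimes
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(demo.index))
print(demo.shape)
```

With a datetime index in place, rows can be selected by date, which is convenient for time-ordered market data.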

## Select a Classification Model
We will use the `RandomForestClassifier`. Right now, it is not important to understand how the model works. What is important here is to learn how the training data (`X_train`, `y_train`) and the testing data (`X_test`) are used along with the ML model.

The `RandomForestClassifier` class from the `sklearn` package is used to create the classification model, which is an ensemble of decision trees.

```python
RandomForestClassifier(n_estimators, max_features, max_depth, random_state)
```

Where:
1. `n_estimators` The number of trees in the forest
2. `max_features` The number of features to consider when looking for the best split
3. `max_depth` The maximum depth of a tree
4. `random_state` Seed value for the randomised bootstrapping and feature selection. This is set to replicate results for subsequent runs.

Function returns:
1. A `RandomForestClassifier` type object that can be fit on the training data, and then used for making forecasts.

We have set the values for the parameters. These are for illustration and can be changed.
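To see what `random_state` buys us, here is a minimal sketch that builds two models with identical parameters and the same seed and checks that they produce the same predictions. The data is synthetic, and the names `X_demo`, `y_demo`, `model_a`, and `model_b` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 7 features (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 7))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Same parameter values as used in this chapter
params = dict(n_estimators=3, max_features=3, max_depth=2, random_state=4)

# Two separate models, same parameters, same seed
model_a = RandomForestClassifier(**params).fit(X_demo, y_demo)
model_b = RandomForestClassifier(**params).fit(X_demo, y_demo)

# With a fixed seed, both models make identical predictions
same = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print(same)
```

If `random_state` is left unset, the trees are built from different random samples on each run, so results may differ from one run to the next.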

In [3]:
# Create the machine learning model
rf_model = RandomForestClassifier(
    n_estimators=3, max_features=3, max_depth=2, random_state=4)

### Train the Model
Now it is time for the model to learn from the `X_train` and `y_train`. We call the `fit` function of the model and pass the `X_train` and `y_train` datasets.

```python
model.fit(X_train, y_train)
```

Where:
1. `model` The model (in this case the `RandomForestClassifier`) object
2. `X_train` The features from the training dataset
3. `y_train` The target from the training dataset

The `fit` function trains the model using the data passed to it. The trained model is stored in the model object where the `fit` function was applied.

In [4]:
# Fit the model on the training data
rf_model.fit(X_train, y_train['signal'])
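Two details of the `fit` call are worth a quick check. First, `y_train` was read as a DataFrame, so the cell above selects its `signal` column; scikit-learn expects a one-dimensional target and warns when it is given a column-vector instead. Second, `fit` stores what it learned on the model object itself and returns that same object. The sketch below illustrates the second point on synthetic data; `X_demo`, `y_demo`, and `returned` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=4)
returned = model.fit(X_demo, y_demo)

# fit returns the same object it was called on
print(returned is model)

# After fitting, the learned trees and class labels live on the model
print(len(model.estimators_))
print(model.classes_)
```

Attributes ending in an underscore, such as `estimators_` and `classes_`, only exist after `fit` has been called.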


### Forecast Data
The model is now ready to make forecasts. We can now pass the unseen data (`X_test`) to the model, and obtain the model predicted values (`y_pred`). To make a forecast, the `predict` function is called and the unseen data is passed as a parameter.

```python
model.predict(X_test)
```

Where:
1. `model` The model (`RandomForestClassifier`) object
2. `X_test` The features from the testing dataset

The return is a `numpy` array of the predicted outputs.
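As a quick illustration of the return value, the sketch below fits a small model on synthetic data and inspects what `predict` gives back for many rows and for a single row. The data and the names `all_preds` and `one_pred` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(rng.normal(size=(60, 3)), columns=["f1", "f2", "f3"])
y_demo = (X_demo["f1"] > 0).astype(int)

# Fit on the first 40 rows, keep the last 20 as unseen data
model = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=4)
model.fit(X_demo.iloc[:40], y_demo.iloc[:40])
X_unseen = X_demo.iloc[40:]

# One predicted label per input row
all_preds = model.predict(X_unseen)

# A single row still has to be passed as a one-row DataFrame
one_pred = model.predict(X_unseen.head(1))

print(type(all_preds), all_preds.shape)
print(one_pred.shape)
```

Even for a single row, the result is an array of length one rather than a bare number.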

So, finally let's make one prediction using the model. For illustration, we are using the first data point in the `X_test`.

In [5]:
# Get a sample day of the data from X_test
unseen_data_single_day = X_test.head(1)

# Preview the data
unseen_data_single_day.T

|  | 2019-05-28 12:00:00+00:00 |
| --- | --- |
| pct_change | 0.0 |
| pct_change2 | -9.1e-05 |
| pct_change5 | 0.001374 |
| rsi | 47.746053 |
| adx | 26.139722 |
| corr | -0.515815 |
| volatility | 0.143024 |


This data is for 28th May, 2019. Let us pass this to the model and get the prediction.

In [6]:
# Get the prediction of a single day
single_day_prediction = rf_model.predict(unseen_data_single_day)

# Preview the prediction
single_day_prediction

array([0])

The predicted model output is `0`. This means that the model is signalling to take no position on 28th May, 2019. Let's apply the model to all of the testing dataset.

In [7]:
# Use the model and predict the values for the test data
y_pred = rf_model.predict(X_test)

# Display the first five predictions
print("The first five predicted values", y_pred[:5])

The first five predicted values [0 0 1 0 0]


The model predictions are stored in `y_pred`. `0` means no position and `1` means a long position. With the `y_pred`, we can now place trades using an ML model.

***But how do we know that the ML model predictions are good?***

The model correctly predicts the first three values of the test data. But how do we know the accuracy of the model predictions for the entire dataset?
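One simple way to start answering this is to place the predicted values next to the actual values and count how often they agree. The sketch below does this on made-up numbers: the predicted values mirror the five shown earlier, but the actual values are hypothetical and are not taken from `y_test`.

```python
import pandas as pd

# Hypothetical actual values, for illustration only (not read from y_test)
y_actual_demo = pd.Series([0, 0, 1, 1, 0], name="signal")

# Predicted values, mirroring the first five predictions printed earlier
y_pred_demo = pd.Series([0, 0, 1, 0, 0], name="predicted")

# Side-by-side comparison with a flag for matching rows
comparison = pd.DataFrame({"actual": y_actual_demo, "predicted": y_pred_demo})
comparison["match"] = comparison["actual"] == comparison["predicted"]

# Share of rows where the prediction equals the actual value
share_matching = comparison["match"].mean()
print(comparison)
print(share_matching)
```

A single share of matching rows hides a lot of detail, such as which kind of mistake the model makes.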

We need to learn some metrics for measuring the model performance; this will be covered in the next chapter.
