# Trained Random Forest Model

## What is this notebook?
This notebook discusses what was learned by training a random forest classifier, and the reasoning behind the way it was
trained. We will discuss:
* Pipeline
* Hyperparameter Tuning
* Final Model
* Model Metrics
* Results and Discussion

## Defining the model

### Pipeline

The model was trained using an `sklearn` pipeline. This pipeline was in charge of both preprocessing the data and
fitting the model.

The pipeline proceeded as follows:
1. Apply statistical standardization.
    * PCA assumes zero-mean data and is sensitive to the scale of each feature.
2. Apply PCA.
    *  It was found that 8 principal components explain 95% of the variance in the data. This was a good step to take
    for two reasons:
        1. it reduced the dimensionality of the data from $11 \to 8$.
        2. it helps to reduce noise while retaining the true signal.
3. Feed the data to the model.

This ultimately proved effective.
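
The component count can be checked by looking at the cumulative explained variance after standardization. The sketch below is only an illustration: it runs on a synthetic stand-in with 11 correlated features (`X_demo` and its construction are assumptions, not the wine data), so the number it reports will not necessarily be 8.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical stand-in for the real data: 11 features driven by 5 latent factors plus noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 11))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(500, 11))

# Standardize first, then fit PCA with all components kept.
X_std = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_std)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 3))
print(f"Components needed for 95% of the variance: {n_needed}")
```

Passing a float such as `PCA(n_components=0.95)` asks `sklearn` to make the same selection internally, which can be a convenient alternative to hard-coding the count.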

### Hyperparameter Tuning

During the process of hyperparameter tuning, the following parameters were considered:

* `n_estimators`: the number of decision trees in the final random forest model.
* `max_depth`: the maximum depth of any decision tree in the forest.
* `criterion`: whether to use entropy or Gini impurity.
* `class_weight`: whether to weight each class inversely to its frequency in the full training dataset (`balanced`), or
in each tree's bootstrap sample (`balanced_subsample`).
* `max_features`: the number of features to consider at each split in the tree.
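
To make the `class_weight` option concrete, here is a minimal sketch of how `sklearn` derives `balanced` weights, namely `n_samples / (n_classes * count_per_class)`. The label vector `y_toy` is made up for illustration and is not the wine data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 examples of class 0, 3 of class 1, 1 of class 2.
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_toy)

# Weights as computed by sklearn for class_weight='balanced'.
balanced_weights = compute_class_weight("balanced", classes=classes, y=y_toy)

# The same weights computed by hand: n_samples / (n_classes * count_per_class).
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), np.round(balanced_weights, 3).tolist())))
```

With `balanced_subsample`, the same formula is applied to the bootstrap sample drawn for each tree rather than to the full training set.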

#### Gridsearch

To appropriately tune these hyperparameters, I applied a grid search in the `training_RFC.py` file. Each of these search
spaces was based on intuition gleaned from when I initially tried to determine the effectiveness of each model on
the dataset.

The parameters were searched within the following ranges:

* `n_estimators` $\in \{50, 100, \dots, 500\}$.
* `max_depth` $\in \{64, 67, \dots, 88\} \cup \{+\infty\}$, where $+\infty$ means no depth limit (`None` in `sklearn`).
* `criterion` $\in \{ \text{entropy}, \text{gini} \}$
* `class_weight` $\in \{ \text{balanced_subsample}, \text{balanced} \}$
* `max_features` $\in \{ 2, 3, 4 \} \cup \{ \text{sqrt} \}$, where `sqrt` considers the square root of the total number
of input features at each split.
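
The ranges above can be written as an `sklearn` parameter grid. This is a sketch, not the contents of `training_RFC.py`: it assumes the search ran over the whole pipeline, so each key uses the `classifier__` prefix matching the step name used later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Assumed grid layout; the `classifier__` prefix targets the pipeline's 'classifier' step.
param_grid = {
    "classifier__n_estimators": list(range(50, 501, 50)),      # 50, 100, ..., 500
    "classifier__max_depth": list(range(64, 89, 3)) + [None],  # 64, 67, ..., 88, or no limit
    "classifier__criterion": ["entropy", "gini"],
    "classifier__class_weight": ["balanced_subsample", "balanced"],
    "classifier__max_features": [2, 3, 4, "sqrt"],
}

# Number of distinct parameter combinations in the grid.
n_candidates = len(ParameterGrid(param_grid))
print(f"{n_candidates} candidates, {5 * n_candidates} fits with 5-fold cross validation")
```

One thing worth noting: with 8 PCA components feeding the classifier, `sqrt` should resolve to 2 features per split, so it likely duplicates the explicit value 2 in this grid.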

#### Validation
I applied an 80/20 train/validation split for training the model. During training, I applied 5-fold cross
validation to score each candidate in the grid search.
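
A minimal sketch of this setup is shown below. It is illustrative only: the data comes from `make_classification` rather than the wine dataset, the grid is cut down to four candidates so it runs quickly, and the stratified split is my assumption rather than something taken from `training_RFC.py`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced stand-in data with 11 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=6, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# 80/20 train/validation split (stratification is an assumption).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

pipe = Pipeline(steps=[
    ("standardization", StandardScaler()),
    ("pca", PCA(n_components=8, random_state=0)),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Deliberately tiny grid so the sketch finishes in seconds.
small_grid = {
    "classifier__n_estimators": [20, 40],
    "classifier__max_features": [2, "sqrt"],
}

# 5-fold cross validation on the training portion, scored by weighted F1.
search = GridSearchCV(pipe, small_grid, scoring="f1_weighted", cv=5)
search.fit(X_tr, y_tr)

# Score the refit best pipeline on the held-out 20%.
val_score = search.score(X_val, y_val)
print(search.best_params_)
print(f"Validation weighted F1: {val_score:.3f}")
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, which keeps the held-out fold from leaking into the preprocessing.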


#### Training Process
I decided to optimize the weighted F1 score with the model. The weighted F1 score is an average of the F1 scores
of each class, with each class weighted by its support (its number of true examples), so classes with more examples
carry more weight. Seeing as how the dataset is massively class-imbalanced, this keeps the score representative of the
observed class distribution. Note that it does not give extra emphasis to minority classes; a macro average, which
weights every class equally, would do that. Furthermore, the F1 score provides a nice balance between optimizing the
recall and the precision of each class.
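
To see how the averaging works in `sklearn`, the toy sketch below compares the weighted and macro F1 scores on a made-up two-class problem (the labels are illustrative, not model output). The weighted score is the support-weighted mean of the per-class scores, while the macro score is their plain mean.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 8 examples of the majority class, 2 of the minority class.
y_true_toy = np.array([0] * 8 + [1] * 2)
# One minority example is misclassified as the majority class.
y_pred_toy = np.array([0] * 8 + [0, 1])

per_class = f1_score(y_true_toy, y_pred_toy, average=None)
support = np.bincount(y_true_toy)

weighted_f1 = f1_score(y_true_toy, y_pred_toy, average="weighted")
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

print("per-class F1:", np.round(per_class, 3))
print("weighted F1: ", round(weighted_f1, 3))
print("macro F1:    ", round(macro_f1, 3))
```

Here the majority class scores higher than the minority class, so the weighted average lands above the macro average.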

### Final Model and Performance

#### The best model parameters
It was found that the best model had the parameters:

* `n_estimators`: 300
* `max_depth`: 64
* `criterion`: Gini impurity
* `class_weight`: weights balanced by each bootstrap, i.e., `balanced_subsample`
* `max_features`: 4

#### Building the model

In [1]:
from models.data_loader import DataLoader
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# We learned from `exploring_data.ipynb` that PCA with 8 principal components is optimal.
n_components = 8
train_prop = 0.8

# Random state to allow for this to be deterministic.
random_state = np.random.RandomState(42069)

# Define the model with the optimal parameters.
model_pipeline = Pipeline(steps=[
    ('standardization', StandardScaler()),
    ('pca', PCA(n_components=n_components, random_state=random_state)),
    ('classifier', RandomForestClassifier(
        random_state=random_state,
        n_estimators=300,
        max_depth=64,
        criterion='gini',
        class_weight='balanced_subsample',
        max_features=4
    ))
])

#### Loading the data

In [2]:
# Load the data.
dl = DataLoader('../data/winequality-red.csv', random_state=random_state)

# Load the entire dataset (for a final prediction)
X, y = dl.get_all_data()

# Load the training and testing data.
X_train, X_test, y_train, y_test = dl.train_test_split(test_prop=(1.0-train_prop))

#### Evaluating the model

In [3]:
# Train the model
model_pipeline.fit(X_train, y_train)

# Predict with the model
y_hat = model_pipeline.predict(X_test)

# Determine the unique labels from the model
unique_labels = np.unique(y_train)

# Create a confusion matrix
confusion_mtrx = pd.DataFrame(
    confusion_matrix(y_test, y_hat, labels=unique_labels),
    index=[f'true:{i}' for i in unique_labels],
    columns=[f'pred:{i}' for i in unique_labels]
)

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_mtrx)

# Print the classification report
print("")
print("Classification Report:")
print(classification_report(y_test, y_hat))

print(f"Overall OOS model accuracy: {round(100*sum(y_hat == y_test) / len(y_hat), 4)}%")

print(f"Model Performance on Entire Dataset: {round(100*sum(model_pipeline.predict(X) == y) / len(y), 4)}%")

Confusion Matrix:
        pred:3  pred:4  pred:5  pred:6  pred:7  pred:8
true:3       0       0       2       0       0       0
true:4       0       0       9       3       0       0
true:5       0       0     102      28       2       0
true:6       0       0      21     110       5       0
true:7       0       0       1      14      21       1
true:8       0       0       0       0       0       1

Classification Report:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         2
           4       0.00      0.00      0.00        12
           5       0.76      0.77      0.76       132
           6       0.71      0.81      0.76       136
           7       0.75      0.57      0.65        37
           8       0.50      1.00      0.67         1

    accuracy                           0.73       320
   macro avg       0.45      0.52      0.47       320
weighted avg       0.70      0.73      0.71       320

Overall OOS model accuracy: 73



#### Results and Discussion
The model appears to work reasonably well, reaching $\approx 73\%$ accuracy on the 320 out-of-sample (OOS) data points.
Upon using the model to predict on the entire dataset, it predicted with $\approx 94.6\%$ accuracy. This is not a good
metric to report, since the entire dataset includes the data the model was trained on, but it is nice to know.

The model has a weighted F1 score of $0.71$, with a weighted recall of $0.73$ and a weighted precision of $0.70$.

The model primarily scores the majority classes ($5$ and $6$) well, as would be expected. It never predicted classes
$3$ or $4$, so all 14 test examples from those classes were misclassified and their F1 scores are $0$.
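
As a sanity check, the headline numbers can be re-derived from the confusion matrix printed above without retraining anything. The sketch below expands the matrix back into label vectors; the variable names are mine, and `zero_division=0` is used so that classes with no predictions score $0$ without warnings.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

labels = np.array([3, 4, 5, 6, 7, 8])

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [0, 0,   2,   0,  0, 0],
    [0, 0,   9,   3,  0, 0],
    [0, 0, 102,  28,  2, 0],
    [0, 0,  21, 110,  5, 0],
    [0, 0,   1,  14, 21, 1],
    [0, 0,   0,   0,  0, 1],
])

# Expand each cell count into that many (true, predicted) label pairs.
true_idx, pred_idx = np.indices(cm.shape)
y_true_cm = np.repeat(labels[true_idx.ravel()], cm.ravel())
y_pred_cm = np.repeat(labels[pred_idx.ravel()], cm.ravel())

accuracy = accuracy_score(y_true_cm, y_pred_cm)
weighted_f1_cm = f1_score(y_true_cm, y_pred_cm, average="weighted", zero_division=0)
macro_f1_cm = f1_score(y_true_cm, y_pred_cm, average="macro", zero_division=0)

print(f"accuracy:    {accuracy:.4f}")
print(f"weighted F1: {weighted_f1_cm:.2f}")
print(f"macro F1:    {macro_f1_cm:.2f}")
```

The gap between the weighted and macro averages reflects how much the two unpredicted minority classes pull down an average that treats every class equally.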

The rest of the interpretation's up to you.