# Random Forest

While I'm no wine connoisseur, I do have to question how formulaic it is to find the "best" wine. For example, perhaps
a wine with quality 8 has a particular range of pH and sulfur content. I do not know, but that's what this model should
tell us more about.

#### Goals
1. Determine the effectiveness of a random forest model on the wine dataset
    * Apply a few sets of hand-selected values for each hyper parameter to get a feel for the dataset.
2. Attempt to draw some conclusions regarding whether random forest is the way to go.

There's not really much to say here. I'm expecting $~60\%$ performance like we've been seeing with most of these models.

#### Loading the Data
As before, we're going to load the data and apply PCA with the results learned from `exploring_data.ipynb`. Reducing the
dimension from $11 \to 8$ should help us remove some of the inherent noise in the dataset.

In [1]:
import numpy as np
from models.data_loader import DataLoader
from sklearn.ensemble import RandomForestClassifier
import itertools
import pandas as pd

rs = np.random.RandomState(42069)

# Load the data.
dl = DataLoader('../data/winequality-red.csv', random_state=rs)
dl.apply_pca_to_dataset()

# Apply a Train/Test split.
X_train, X_test, y_train, y_test = dl.train_test_split()

# Obtain the dimension of the data.
_, d = X_train.shape

#### Applying the Random Forest model to the Data
The main hyperparameters to explore are:
1. `n_estimators`: the number of trees in the forest
2. `max_depth`: how many layers deep each tree may go
3. `class_balance`: either `balanced` or `balanced_subsample`. You can read more here: [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

I'm going to explore these values initially:
* `n_estimators` $\in \{ 20, 50, 100, 200, 300 \}$.
* `max_depth`: $\in \{ \mathrm{None}, d, d^2\}$, with $d$-dimensional data. Note that `None` allows for as much
depth as is needed, and is the default value.
* Both `class_balance` options.

This is intentionally sparse, as the goal is not to perform hypertuning, but to evaluate what to expect with the model.

Again, the set of all options being explored here is $\mathrm{n\_estimators} \times \mathrm{max\_depth} \times \mathrm{class\_balance}$.


In [2]:
n_estimators = [20, 50, 100, 200, 300]
max_depths = [None, d, d**2]
class_balances = ['balanced', 'balanced_subsample']


In [3]:
performances = pd.DataFrame(index=['n_estimators', 'max_depth', 'class_balance', 'accuracy'])

for n, max_depth, class_balance in itertools.product(n_estimators, max_depths, class_balances):
    rfc = RandomForestClassifier(n_estimators=n, max_depth=max_depth, class_weight=class_balance, random_state=rs)
    rfc.fit(X_train, y_train)

    accuracy = rfc.score(X_test, y_test)

    performances = performances.append({
        'n_estimators': n,
        'max_depth': max_depth,
        'class_balance': class_balance,
        'accuracy': accuracy
    }, ignore_index=True)

#### Determine the Top-10 Best Models

In [4]:
performances = performances.sort_values('accuracy', ascending=False)
print(performances.head(10))

    accuracy       class_balance max_depth  n_estimators
32  0.706250            balanced        64         300.0
27  0.702083  balanced_subsample        64         200.0
33  0.697917  balanced_subsample        64         300.0
29  0.697917  balanced_subsample      None         300.0
22  0.697917            balanced      None         200.0
17  0.697917  balanced_subsample      None         100.0
20  0.695833            balanced        64         100.0
23  0.695833  balanced_subsample      None         200.0
26  0.691667            balanced        64         200.0
10  0.689583            balanced      None          50.0


#### What was Learned?
* The Random Forest Classifier model performed much better than I anticipated, i.e., about 10% better.
* It appears to favor a larger depth, with many of the best performing models using a `max_depth` of $d^2$ or `None`.
* The class balance method does not appear to have much impact, with half of the models using each method in the top 10.
    * 3 of the 5 top models use `balanced_subsample`, though that's hardly worth drawing conclusions from.
* It would be wise to see how well it performs with other performance measurements, such as AUC, when applying the final
model.
* Using many estimators appears be favorable, albeit more computationally expensive. It would make sense to view the
confusion matrices for these top estimators.