# Finding the Best Random Forest through Random Search

Author: Mohamed Oussama NAJI

Date: Feb 2, 2024

## Introduction

To get the most out of a random forest, we can run a random search over its hyperparameters. The search randomly samples combinations of hyperparameters from a grid, evaluates each one with cross-validation on the training data, and returns the best-performing combination together with a model refit using it.
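
As a rough illustration of the sampling step only, the sketch below draws a few combinations from a small grid with scikit-learn's `ParameterSampler`. The grid `toy_grid` and the choice of 4 draws are made up for this example and are not the values tuned later in the notebook.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical, deliberately tiny grid used only for illustration.
toy_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [3, 5, 10],
    'bootstrap': [True, False],
}

# Draw 4 of the 3 * 3 * 2 = 18 possible combinations at random.
# When every entry is a list, sampling is done without replacement.
sampled = list(ParameterSampler(toy_grid, n_iter=4, random_state=50))

for params in sampled:
    print(params)
```

Each sampled dictionary is one candidate model. A full random search would then score every candidate with cross-validation and keep the best one.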


## Table of Contents
1. [Random Forest Theory Revisited](#random-forest-theory)
2. [Behavioral Risk Factor Surveillance System Dataset](#brfss-dataset)
3. [Data Acquisition](#data-acquisition)
4. [Data Exploration](#data-exploration)
5. [Data Preprocessing](#data-preprocessing)
   - [Label Distribution](#label-distribution)
   - [Label Feature](#label-feature)
   - [Splitting Data into Training and Testing Set](#data-split)
   - [Imputation of Missing Values](#missing-values-imputation)
6. [Random Search for Hyperparameter Tuning](#random-search)
   - [Importing Required Libraries](#importing-libraries)
   - [Setting the Parameter Grid](#parameter-grid)
   - [Creating the Estimator](#creating-estimator)
   - [Creating the Random Search Model](#creating-random-search-model)
   - [Fitting the Model](#fitting-model)
   - [Exploring the Best Parameters](#best-parameters)
7. [Evaluating the Best Model](#evaluating-best-model)
   - [Making Predictions](#making-predictions)
   - [Getting Node Counts and Maximum Depths](#node-counts-max-depths)
   - [Plotting ROC AUC Scores](#plotting-roc-auc)
   - [Evaluating Model Performance](#evaluating-model-performance)
   - [Plotting Confusion Matrix](#plotting-confusion-matrix)
8. [Visualizing a Tree from the Optimized Forest](#visualizing-tree)
9. [Conclusion](#conclusion)

## Random Forest Theory Revisited <a id="random-forest-theory"></a>

Random Forest is a model made up of many decision trees. It uses two key concepts that give it the name "random":
- Random sampling of training data points when building trees (bagging)
- Random subsets of features considered when splitting nodes
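
The two sources of randomness listed above can be sketched with NumPy alone. This is a simplified illustration under assumed toy sizes (`n_samples = 8`, `n_features = 9`), not how scikit-learn implements them internally.

```python
import numpy as np

rng = np.random.default_rng(50)
n_samples, n_features = 8, 9  # toy sizes, chosen only for illustration

# Bagging: each tree trains on n_samples rows drawn WITH replacement,
# so some rows repeat and others are left out ("out-of-bag").
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)
out_of_bag = np.setdiff1d(np.arange(n_samples), bootstrap_rows)

# Feature subsetting: at each split only a random subset of the features
# is considered; sqrt(n_features) is a common choice for classification.
n_candidates = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=n_candidates, replace=False)

print("bootstrap rows:", bootstrap_rows)
print("out-of-bag rows:", out_of_bag)
print("features considered at this split:", split_features)
```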

## Behavioral Risk Factor Surveillance System Dataset <a id="brfss-dataset"></a>

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted by the Centers for Disease Control and Prevention (CDC). It collects data on preventive health practices and risk behaviors linked to chronic diseases, injuries, and preventable infectious diseases in the adult population.


## Data Acquisition <a id="data-acquisition"></a>


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
RSEED = 50

df = pd.read_csv('2015.csv').sample(100000, random_state=RSEED)
df.head()

## Data Exploration <a id="data-exploration"></a>


In [None]:
df = df.select_dtypes('number')
df

## Data Preprocessing <a id="data-preprocessing"></a>


### Label Distribution <a id="label-distribution"></a>


In [None]:
df['_RFHLTH'].value_counts()

### Label Feature <a id="label-feature"></a>

In [None]:
df['_RFHLTH'] = df['_RFHLTH'].replace({2: 0})
df = df.loc[df['_RFHLTH'].isin([0, 1])].copy()
df = df.rename(columns={'_RFHLTH': 'label'})
df['label'].value_counts()

In [None]:
# Remove columns with missing values
df = df.drop(columns=['POORHLTH', 'PHYSHLTH', 'GENHLTH', 'PAINACT2',
                      'QLMENTL2', 'QLSTRES2', 'QLHLTH2', 'HLTHPLN1', 'MENTHLTH'])

### Splitting Data into Training and Testing Set <a id="data-split"></a>


In [None]:
from sklearn.model_selection import train_test_split

labels = np.array(df.pop('label'))

train, test, train_labels, test_labels = train_test_split(df, labels,
                                                          stratify=labels,
                                                          test_size=0.3,
                                                          random_state=RSEED)

### Imputation of Missing Values <a id="missing-values-imputation"></a>


In [None]:
train = train.fillna(train.mean())
test = test.fillna(train.mean())

features = list(train.columns)

# Print both shapes; a bare expression is only displayed when it is last in a cell
print(train.shape)
print(test.shape)
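
The cell above fills missing values in both sets with the training means, which keeps test-set information out of preprocessing. The same idea can be expressed with scikit-learn's `SimpleImputer`; the sketch below uses two small made-up frames (`toy_train`, `toy_test`) rather than the BRFSS data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the notebook's train/test split.
toy_train = pd.DataFrame({'a': [1.0, 3.0, np.nan], 'b': [10.0, np.nan, 30.0]})
toy_test = pd.DataFrame({'a': [np.nan, 100.0], 'b': [np.nan, 5.0]})

imputer = SimpleImputer(strategy='mean')
# Learn the column means from the training rows only...
train_filled = imputer.fit_transform(toy_train)
# ...and reuse those same means on the test rows.
test_filled = imputer.transform(toy_test)

print(imputer.statistics_)  # per-column means learned from toy_train
print(test_filled)
```

Note that the large test value `100.0` has no influence on the value used to fill the missing test entry, because the means come from the training frame alone.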

## Random Search for Hyperparameter Tuning <a id="random-search"></a>


### Importing Required Libraries <a id="importing-libraries"></a>


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

### Setting the Parameter Grid <a id="parameter-grid"></a>


In [None]:
parameter_grid = {
    'n_estimators': np.arange(10, 201, 10),
    'max_depth': np.arange(3, 21, 1),
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', None] + list(np.arange(0.5, 1, 0.1)),
    'max_leaf_nodes': np.arange(10, 51, 1),
    'min_samples_split': [2, 5, 10],
    'bootstrap': [True, False]
}

### Creating the Estimator <a id="creating-estimator"></a>


In [None]:
RSEED = 50
estimator = RandomForestClassifier(random_state=RSEED)

### Creating the Random Search Model <a id="creating-random-search-model"></a>


In [None]:
random_search = RandomizedSearchCV(estimator, parameter_grid,
                                   cv=3, n_iter=10, scoring='roc_auc',
                                   random_state=RSEED, verbose=2)

### Fitting the Model <a id="fitting-model"></a>


In [None]:
random_search.fit(train, train_labels)

### Exploring the Best Parameters <a id="best-parameters"></a>

In [None]:
best_params = random_search.best_params_
print("Best parameters: ")
print(best_params)

best_model = random_search.best_estimator_
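
Beyond `best_params_`, the fitted search object also records the cross-validation score of every sampled combination in `cv_results_`. The sketch below shows one way to rank them. It runs on synthetic data from `make_classification` with a made-up grid (`toy_grid`) so that it is self-contained, and its scores say nothing about the BRFSS model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a small hypothetical grid.
X, y = make_classification(n_samples=300, n_features=8, random_state=50)
toy_grid = {'n_estimators': [5, 10, 20], 'max_depth': [2, 4, 6]}

toy_search = RandomizedSearchCV(RandomForestClassifier(random_state=50),
                                toy_grid, n_iter=4, cv=3,
                                scoring='roc_auc', random_state=50)
toy_search.fit(X, y)

# One row per sampled combination, best cross-validated ROC AUC first.
results = pd.DataFrame(toy_search.cv_results_)
ranked = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(ranked)
print("Best CV ROC AUC:", round(toy_search.best_score_, 3))
```

Looking at the spread of `mean_test_score` gives a sense of how sensitive the model is to the hyperparameters, which a single best value hides.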

## Evaluating the Best Model <a id="evaluating-best-model"></a>

### Making Predictions <a id="making-predictions"></a>

In [None]:
predictions = best_model.predict(test)

### Getting Node Counts and Maximum Depths <a id="node-counts-max-depths"></a>

In [None]:
node_counts = [estimator.tree_.node_count for estimator in best_model.estimators_]
max_depths = [estimator.tree_.max_depth for estimator in best_model.estimators_]

print("Node counts for the trees in the forest:", node_counts)
print("Maximum depths for the trees in the forest:", max_depths)

### Plotting ROC AUC Scores <a id="plotting-roc-auc"></a>


In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

roc_auc_train = roc_auc_score(train_labels, best_model.predict_proba(train)[:, 1])
roc_auc_test = roc_auc_score(test_labels, best_model.predict_proba(test)[:, 1])

fpr, tpr, _ = roc_curve(test_labels, best_model.predict_proba(test)[:, 1])

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Test ROC AUC = {roc_auc_test:.2f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

print(f"ROC AUC Score on Training Data: {roc_auc_train:.2f}")
print(f"ROC AUC Score on Testing Data: {roc_auc_test:.2f}")


### Evaluating Model Performance <a id="evaluating-model-performance"></a>


In [None]:
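from sklearn.metrics import recall_score, precision_score, roc_auc_score

# NOTE: evaluate_model is not defined in this notebook; this is an assumed minimal stand-in.
def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Print recall, precision, and ROC AUC for the train and test sets."""
    for name, y, preds, p in [('Train', train_labels, train_predictions, train_probs),
                              ('Test', test_labels, predictions, probs)]:
        print(f"{name}: recall={recall_score(y, preds):.2f}, "
              f"precision={precision_score(y, preds):.2f}, "
              f"ROC AUC={roc_auc_score(y, p):.2f}")

train_predictions = best_model.predict(train)
train_probs = best_model.predict_proba(train)[:, 1]

predictions = best_model.predict(test)
probs = best_model.predict_proba(test)[:, 1]

evaluate_model(predictions, probs, train_predictions, train_probs)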

### Plotting Confusion Matrix <a id="plotting-confusion-matrix"></a>


In [None]:
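from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(test_labels, predictions)
# Labels are sorted ascending: 0 = poor health (originally coded 2), 1 = good health
classes = ['Poor Health (0)', 'Good Health (1)']

# plot_confusion_matrix is not defined in this notebook, so use scikit-learn's built-in display
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes).plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()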

## Visualizing a Tree from the Optimized Forest <a id="visualizing-tree"></a>


In [None]:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

estimator = best_model.estimators_[1]

# class_names follow ascending label order: 0 = poor health, 1 = good health
export_graphviz(estimator, out_file='tree_from_optimized_forest.dot', rounded=True,
                feature_names=list(train.columns), max_depth=8,
                class_names=['poor health', 'good health'], filled=True)

call(['dot', '-Tpng', 'tree_from_optimized_forest.dot', '-o', 'tree_from_optimized_forest.png', '-Gdpi=200'])
Image('tree_from_optimized_forest.png')

## Conclusion <a id="conclusion"></a>

In this notebook, we explored the process of finding the best random forest model using random search for hyperparameter tuning. We used the Behavioral Risk Factor Surveillance System dataset to predict the overall health of individuals.

The random search process involved defining a parameter grid with various hyperparameter values and randomly sampling from this grid to train and evaluate multiple random forest models. The best model was selected based on the highest ROC AUC score.

We evaluated the best model with recall, precision, ROC AUC, and a confusion matrix on both the training and testing data. Together these show how well the model separates "Good Health" from "Poor Health" individuals, and the gap between the training and testing scores indicates how much it overfits.

We also visualized a tree from the optimized forest to gain insights into the decision-making process of the model.

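Overall, this notebook shows how random search can tune the hyperparameters of a random forest. By sampling hyperparameter combinations at random rather than trying every point in the grid, we can find a well-performing model for our problem and dataset at a fraction of the cost of an exhaustive search. With `n_iter=10` only ten combinations are evaluated, so the result is the best of those ten, not a guaranteed optimum.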