Import libraries

In [None]:
import matplotlib.ticker as mticker
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

In the last class, we used the `scikit-learn` library to fit a `DecisionTreeClassifier` model. The following code block repeats the code needed to do this with the default settings of the `DecisionTreeClassifier`.

In [None]:
data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

training_score = clf.score(X_train, y_train)
testing_score = clf.score(X_test, y_test)

print(f'{training_score = }')
print(f'{testing_score = }')

As we discussed, the default `DecisionTreeClassifier` overfits the data: with no limit on depth, the tree keeps splitting until it fits the training data almost perfectly. We saw that we can mitigate this overfitting by changing the hyperparameters of the classifier. The following code block fits another `DecisionTreeClassifier` with the `max_depth` hyperparameter set to 3.

In [None]:
data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

clf = DecisionTreeClassifier(
    max_depth=3,
    random_state=0,
)
clf.fit(X_train, y_train)

training_score = clf.score(X_train, y_train)
testing_score = clf.score(X_test, y_test)

print(f'{training_score = }')
print(f'{testing_score = }')

The previous code block shows that we can handle overfitting (and other issues) by adjusting the hyperparameters of a model. The question is which hyperparameter values are best, and how we can run experiments to find them efficiently. One approach is to use the `GridSearchCV` class that is available in `sklearn`. It searches over the hyperparameter values you specify for a model and identifies the combination that best achieves a given objective.

The following code block imports the class.

In [None]:
from sklearn.model_selection import GridSearchCV

The following code block demonstrates its use on the data.

In [None]:
%%time

data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

base_clf = DecisionTreeClassifier(
    random_state=0,
)
params = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'class_weight': ['balanced', None],
}
GS = GridSearchCV(
    base_clf,
    param_grid=params,
)
GS.fit(X_train, y_train)
best_clf = GS.best_estimator_

training_score = best_clf.score(X_train, y_train)
testing_score = best_clf.score(X_test, y_test)

print(f'{training_score = }')
print(f'{testing_score = }')

As we can see, `GridSearchCV` is able to identify a variant of the `DecisionTreeClassifier` that offers better accuracy. It does this by running a cross-validation procedure (five-fold by default) on the training data for every combination of the hyperparameter values specified in the `params` dictionary, and then refitting the best combination on the full training set. To get the best hyperparameter values it identified, we can use the `best_params_` attribute of the fitted object.

In [None]:
GS.best_params_
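
To get a feel for how much work the search does, the following sketch counts the candidates in the grid we used for the `DecisionTreeClassifier`. It uses the `ParameterGrid` helper from `sklearn.model_selection`, which the code above does not use directly, and it assumes the default of five cross-validation folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the DecisionTreeClassifier search above
params = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['best', 'random'],
    'class_weight': ['balanced', None],
}

# ParameterGrid enumerates every combination of the listed values
grid = ParameterGrid(params)
n_candidates = len(grid)

# Assumption: cv is left at its default, which is five folds
n_folds = 5
n_fits = n_candidates * n_folds

print(f'{n_candidates = }')
print(f'{n_fits = }')
```

Every value added to a list multiplies the number of candidates, so grids grow quickly. With the default `refit=True`, there is also one final fit of the best candidate on the full training set.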

We can see the detailed results of the cross-validation experiment by accessing the `cv_results_` attribute and using it to create a `pandas` `DataFrame`.

In [None]:
pd.DataFrame(GS.cv_results_).head()
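
The full results table has many columns, so it can help to keep only a few and sort by rank. The following is a minimal sketch of that idea. To keep it self-contained it uses synthetic data from `make_classification` and a smaller grid, both of which are stand-ins rather than the diabetes data and grid used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes data used above)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        'max_depth': [2, 3, 4, 5],
        'criterion': ['gini', 'entropy'],
    },
)
search.fit(X, y)

# Keep the columns that summarize each candidate, best candidates first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)

print(results.head())
```

The `mean_test_score` column is the score averaged over the cross-validation folds, and the value in the top row matches `search.best_score_`.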

Let's try to rerun this experiment using a different classifier, namely, a `RandomForestClassifier`. The following code block imports the `RandomForestClassifier` class from `scikit-learn`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

The following code block runs the experiment.

In [None]:
%%time

data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

base_clf = RandomForestClassifier(
    random_state=0,
)
params = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'class_weight': ['balanced', None],
}
GS = GridSearchCV(
    base_clf,
    param_grid=params,
)
GS.fit(X_train, y_train)
best_clf = GS.best_estimator_

training_score = best_clf.score(X_train, y_train)
testing_score = best_clf.score(X_test, y_test)

print(f'{training_score = }')
print(f'{testing_score = }')

The `RandomForestClassifier` is able to do better than the `DecisionTreeClassifier`. However, it took a bit longer. The data we are working with is rather small, but the longer run time could become a problem with a larger data set. Because `GridSearchCV` runs many independent experiments, we can run them in parallel (using multiple CPU cores) to reduce the run time. This is demonstrated in the following code block, which sets `n_jobs=-1` to use all available cores.

In [None]:
%%time

data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

base_clf = RandomForestClassifier(
    random_state=0,
)
params = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'class_weight': ['balanced', None],
}
GS = GridSearchCV(
    base_clf,
    param_grid=params,
    n_jobs=-1,
)
GS.fit(X_train, y_train)
best_clf = GS.best_estimator_

training_score = best_clf.score(X_train, y_train)
testing_score = best_clf.score(X_test, y_test)

print(f'{training_score = }')
print(f'{testing_score = }')

Let's look at the confusion matrix for the best model identified by the procedure.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))

ConfusionMatrixDisplay.from_estimator(
    best_clf, 
    X_test, 
    y_test,
    ax=ax,
)

plt.show()

If we want to see whether we can affect the likelihood of certain errors, we can specify a different scoring metric using the `scoring` parameter of `GridSearchCV`. The possible scoring metrics are listed at: https://scikit-learn.org/stable/modules/model_evaluation.html. The following code block shows how we can use the `balanced_accuracy` metric.

In [None]:
%%time

data = pd.read_csv('data/diabetes.csv')

target = 'Outcome'
features = [col for col in data.columns if col != target]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], 
    data[target], 
    random_state=12,
)

base_clf = RandomForestClassifier(
    random_state=0,
)
params = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'class_weight': ['balanced', None],
}
GS = GridSearchCV(
    base_clf,
    param_grid=params,
    scoring='balanced_accuracy',
    n_jobs=-1,
)
GS.fit(X_train, y_train)
best_clf = GS.best_estimator_

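# Note: the estimator's score method reports plain accuracy,
# not the balanced accuracy used to select the model
training_score = best_clf.score(X_train, y_train)
testing_score = best_clf.score(X_test, y_test)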

print(f'{training_score = }')
print(f'{testing_score = }')

The resulting confusion matrix shows that this change has no effect in this case, but there are situations where the effects can be dramatic.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))

ConfusionMatrixDisplay.from_estimator(
    best_clf, 
    X_test, 
    y_test,
    ax=ax,
)

plt.show()
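
One thing to keep in mind when changing the scoring metric is that the `score` method of the best estimator still reports plain accuracy, while the `score` method of the fitted `GridSearchCV` object uses the metric given in `scoring`. The following sketch illustrates the difference. It uses synthetic, imbalanced data from `make_classification`, a smaller grid, and fewer trees than above, all of which are assumptions made to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with roughly 80% negatives and 20% positives
X, y = make_classification(
    n_samples=400,
    n_features=8,
    weights=[0.8, 0.2],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=12,
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={'max_depth': [2, 4], 'class_weight': ['balanced', None]},
    scoring='balanced_accuracy',
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)

# Plain accuracy, from the estimator's own score method
test_accuracy = search.best_estimator_.score(X_test, y_test)
# Balanced accuracy, because the search was given scoring='balanced_accuracy'
test_balanced_accuracy = search.score(X_test, y_test)

print(f'{test_accuracy = }')
print(f'{test_balanced_accuracy = }')
```

Balanced accuracy is the average of the recall on each class, so on imbalanced data it penalizes a model that does well only on the majority class.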

**Aside**: Note that the `RandomForestClassifier` does give us class probabilities, so we could define our own prediction threshold (as we did with the logistic regression model) to see if we can reduce false negatives further.

In [None]:
best_clf.predict_proba(X_test)
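
The following is a minimal sketch of applying a custom threshold to these probabilities. It uses synthetic data from `make_classification` rather than the diabetes data, and the threshold of 0.3 is a hypothetical value chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the diabetes data used above)
X, y = make_classification(
    n_samples=400,
    n_features=8,
    weights=[0.7, 0.3],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=12,
)

clf = RandomForestClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class (label 1)
positive_proba = clf.predict_proba(X_test)[:, 1]

# Hypothetical threshold: predict positive more readily than the default
threshold = 0.3
custom_pred = (positive_proba >= threshold).astype(int)
default_pred = clf.predict(X_test)

# Row 1, column 0 of the confusion matrix counts false negatives
fn_default = confusion_matrix(y_test, default_pred)[1, 0]
fn_custom = confusion_matrix(y_test, custom_pred)[1, 0]

print(f'{fn_default = }')
print(f'{fn_custom = }')
```

Lowering the threshold below 0.5 can never increase the number of false negatives, but it usually increases the number of false positives, so the right value depends on which error is more costly.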