Import libraries

In [None]:
import matplotlib.ticker as mticker
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

In the previous class, we looked at our first classification model, a Logistic Regression model. We used the `statsmodels` library to implement the logistic regression model. As a reminder of our process, the following code block repeats the steps for loading the data, splitting the data into training and testing datasets, fitting the logistic regression model, and then generating predictions based on a threshold of 0.50 for the predicted class probability.

In [None]:
# read data
data = pd.read_csv('data/diabetes.csv')

# define target and feature variables
target = 'Outcome'
features = [col for col in data.columns if col != target]

# generate training and testing datasets
train = data.sample(frac=0.75, random_state=42)
test = data[~data.index.isin(train.index)].copy()

# define regression formula
formula = f'{target} ~ {" + ".join(features)}'

# fit regression on the training dataset only
reg = smf.logit(formula, data=train).fit()

# generate predictions (positive class probabilities for this model)
train['positive_probability'] = reg.predict(train)
test['positive_probability'] = reg.predict(test)

# define positive class threshold
threshold = 0.50

# get training dataset accuracy
train_predictions = (train['positive_probability'] >= threshold).astype(int)
train_accuracy = (train_predictions == train[target]).astype(int).mean()
print(f'{train_accuracy = :.2%}')

# get testing dataset accuracy
test_predictions = (test['positive_probability'] >= threshold).astype(int)
test_accuracy = (test_predictions == test[target]).astype(int).mean()
print(f'{test_accuracy = :.2%}')
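
The 0.50 threshold used above is a choice, not a requirement. As a minimal sketch of how that choice affects accuracy, the following code block applies several thresholds to a small set of made-up probabilities and labels. The values and the `accuracy_at` helper are hypothetical and are used only for illustration; they do not come from the diabetes data.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (illustrative values only)
demo_probabilities = np.array([0.10, 0.35, 0.45, 0.55, 0.62, 0.80, 0.91, 0.30])
demo_labels = np.array([0, 0, 1, 0, 1, 1, 1, 0])


def accuracy_at(threshold, probabilities, labels):
    """Share of observations whose thresholded prediction matches the label."""
    predictions = (probabilities >= threshold).astype(int)
    return (predictions == labels).mean()


# compare the accuracy obtained with a few candidate thresholds
for demo_threshold in (0.30, 0.50, 0.70):
    demo_accuracy = accuracy_at(demo_threshold, demo_probabilities, demo_labels)
    print(f'{demo_threshold = :.2f}, {demo_accuracy = :.2%}')
```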

In this notebook, we will explore another library that is very popular for classification tasks, namely, *scikit-learn*. This library is massive (as we will see). Thus, the standard practice is to import modules or classes defined in the library as needed. A nice thing about scikit-learn, which is imported as `sklearn`, is that the developers have done an excellent job in creating a simple and uniform way to work with classification problems. We will demonstrate this using a simple classification method, a Decision Tree. Decision Trees work by learning a tree structure for *splitting* the feature space to generate predictions. If you are interested in additional information, see: https://en.wikipedia.org/wiki/Decision_tree.

The following code block imports two objects from scikit-learn, one that is used to simplify the process of creating training and testing datasets for use with scikit-learn and another for fitting a Decision Tree classifier.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

The following code block shows how we can use the `train_test_split` function to generate training and testing datasets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], random_state=12)

The following code block prints the dimensions of the two datasets.

In [None]:
print(f'{X_train.shape = }')
print(f'{X_test.shape = }')
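
When neither `test_size` nor `train_size` is given, `train_test_split` places 25% of the rows in the testing dataset. The following code block is a small sketch of this on a synthetic stand-in for our data (the `toy` DataFrame and its columns are made up for illustration). It also passes the optional `stratify` argument, which asks for a similar class balance in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 100 rows, 35% positive class
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'x1': rng.normal(size=100),
    'x2': rng.normal(size=100),
    'Outcome': [1] * 35 + [0] * 65,
})

# test_size is left at its default, so 25% of the rows go to the testing set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['x1', 'x2']],
    toy['Outcome'],
    random_state=12,
    stratify=toy['Outcome'],  # keep the class balance similar in both splits
)

print(f'{X_tr.shape = }')
print(f'{X_te.shape = }')
print(f'{y_tr.mean() = :.2f}')
print(f'{y_te.mean() = :.2f}')
```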

The following code block fits a simple Decision Tree classifier using the training dataset.

In [None]:
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

The following code block shows how we can generate predictions using the fitted classifier.

In [None]:
clf.predict(X_train)

The following code block shows how we can generate probabilities for the two possible classes.

In [None]:
clf.predict_proba(X_train)
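
To connect the two methods, the following code block is a minimal sketch showing that the labels from `predict` correspond to the column of `predict_proba` with the largest probability, with the column order given by the `classes_` attribute. It uses synthetic data from `make_classification` rather than our dataset, and the `demo_` names are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

demo_proba = demo_tree.predict_proba(X_demo)  # one column per class, rows sum to 1
demo_labels = demo_tree.predict(X_demo)

print(f'{demo_tree.classes_ = }')  # column order used by predict_proba
print(demo_proba[:5])

# the predicted label is the class whose column holds the largest probability
labels_from_proba = demo_tree.classes_[demo_proba.argmax(axis=1)]
print(f'{np.array_equal(demo_labels, labels_from_proba) = }')
```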

The fitted classifier has a convenient `score` method that allows us to assess accuracy. The following code block prints the score for the training data.

In [None]:
clf.score(X_train, y_train)

The following code block prints the score for the testing data.

In [None]:
clf.score(X_test, y_test)
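
For classifiers, `score` reports accuracy, i.e., the share of observations that are predicted correctly. The following code block is a small sketch that checks this against `accuracy_score` on synthetic data (again, not our dataset; the `demo_` and `Xd_`/`yd_` names are illustrative). It also prints the training score so that the gap between the two datasets is visible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=12)

demo_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

# score() and accuracy_score() should agree on the same data
score_value = demo_tree.score(Xd_test, yd_test)
manual_value = accuracy_score(yd_test, demo_tree.predict(Xd_test))
train_value = demo_tree.score(Xd_train, yd_train)

print(f'{score_value = :.4f}')
print(f'{manual_value = :.4f}')
print(f'{train_value = :.4f}')
```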

A nice thing about decision trees is that we can see exactly how the classifier is making the predictions. To do this, we use the `export_text` function, which is imported in the following code block.

In [None]:
from sklearn.tree import export_text

The following code block uses the `export_text` function to print the fitted Decision Tree.

In [None]:
print(export_text(clf, feature_names=features))
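
A fully grown tree can produce a very long printout. As a sketch of two ways to summarize it, the following code block reports the depth and number of leaves of a fitted tree and then prints only its top levels using the `max_depth` argument of `export_text`. The data is synthetic and the feature names `f0` through `f3` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with placeholder feature names, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0,
)
demo_names = ['f0', 'f1', 'f2', 'f3']

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# summarize the size of the fitted tree
print(f'{demo_tree.get_depth() = }')
print(f'{demo_tree.get_n_leaves() = }')

# print only the top levels; deeper branches are truncated in the report
demo_report = export_text(demo_tree, feature_names=demo_names, max_depth=2)
print(demo_report)
```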

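The previously fit Decision Tree is clearly overfitting the data: it scores much better on the training dataset than on the testing dataset. We can use *hyperparameters* of the Decision Tree classifier to change the behavior of the model and (hopefully) avoid overfitting. The following code block refits the classifier with the `max_depth` hyperparameter set to 3, which limits how deep the tree can grow.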

In [None]:
clf = DecisionTreeClassifier(
    random_state=42,
    max_depth=3,
)
clf.fit(X_train, y_train)

The following code block prints the updated training dataset score.

In [None]:
clf.score(X_train, y_train)

The following code block prints the updated testing dataset score.

In [None]:
clf.score(X_test, y_test)
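
Rather than trying a single value, we can compare several values of `max_depth`. The following code block is a minimal sketch of such a comparison on synthetic, noisy data (not our dataset). The candidate depths are arbitrary choices for illustration, and `None` lets the tree grow without a depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration (not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=12)

depth_results = []
for depth in (1, 2, 3, 5, None):  # None places no limit on the depth
    demo_tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    demo_tree.fit(Xd_train, yd_train)
    depth_results.append({
        'max_depth': depth,
        'train_score': demo_tree.score(Xd_train, yd_train),
        'test_score': demo_tree.score(Xd_test, yd_test),
    })

# print one line per candidate depth
for row in depth_results:
    print(f"max_depth={row['max_depth']}: train={row['train_score']:.2f}, test={row['test_score']:.2f}")
```

A large gap between the training and testing scores for a given depth is a sign of overfitting.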

As we saw earlier, the class probabilities returned by a Decision Tree are coarse: each one is simply the share of training observations of that class in the leaf that an observation falls into. To see the errors that are made with the current classifier, we can plot a *confusion matrix*. The following code block imports a class that simplifies this process.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

The following code block uses the `from_estimator` method of the `ConfusionMatrixDisplay` class, along with the fitted classifier and the testing data, to generate a plot of the confusion matrix.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(4, 4))

ConfusionMatrixDisplay.from_estimator(
    clf, 
    X_test, 
    y_test,
    ax=ax,
)

plt.show()
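
If we want the counts behind the plot as numbers, the `confusion_matrix` function returns them as an array. The following code block is a small sketch on synthetic data (not our dataset) that unpacks the four cells for a binary problem. With labels 0 and 1, rows are the true classes and columns are the predicted classes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=12)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xd_train, yd_train)

# rows are true classes, columns are predicted classes
demo_cm = confusion_matrix(yd_test, demo_tree.predict(Xd_test))
print(demo_cm)

# for binary labels 0/1, ravel() yields the cells in this order
tn, fp, fn, tp = demo_cm.ravel()
print(f'{tn = }, {fp = }, {fn = }, {tp = }')

# accuracy is the share of observations on the diagonal
demo_accuracy = (tn + tp) / demo_cm.sum()
print(f'{demo_accuracy = :.2%}')
```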

Earlier, I mentioned that *scikit-learn* is designed to make it easy to use different classifiers. To switch models, all we have to do is swap out the classifier that is used. As an example, the following code block uses a *K Nearest Neighbors* classifier instead of a Decision Tree.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()

clf.fit(X_train, y_train)

print(f'{clf.score(X_train, y_train) = :.2f}')
print(f'{clf.score(X_test, y_test) = :.2f}')

fig, ax = plt.subplots(1, 1, figsize=(4, 4))

ConfusionMatrixDisplay.from_estimator(
    clf, 
    X_test, 
    y_test,
    ax=ax,
)

plt.show()
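
Because every classifier exposes the same `fit` and `score` methods, we can also compare several of them in a loop. The following code block is a minimal sketch of this idea on synthetic data (not our dataset). The set of candidate classifiers and their settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=12)

# illustrative candidates; each one supports fit() and score()
candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'k nearest neighbors': KNeighborsClassifier(),
    'logistic regression': LogisticRegression(max_iter=1000),
}

test_scores = {}
for name, candidate in candidates.items():
    candidate.fit(Xd_train, yd_train)
    test_scores[name] = candidate.score(Xd_test, yd_test)
    print(f'{name}: test accuracy = {test_scores[name]:.2f}')
```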

## Homework 7

Fit an AdaBoost Classifier for the dataset (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html). Experiment with some of the hyperparameters manually to see if you can improve the fit on the testing dataset. Also, generate a confusion matrix for the best variant of the model that you find and comment on the classifier's performance with respect to false positives and false negatives.