<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/model_benchmarks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# Balanced Target Example

## Import, transform, fit and predict

In [None]:
cd_additional_balanced = pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/CD_additional_balanced.csv')

In [None]:
cd_additional_balanced.y.value_counts()

yes    4640
no     4640
Name: y, dtype: int64

In [None]:
cd_additional_balanced.y.value_counts(normalize=True)

yes    0.5
no     0.5
Name: y, dtype: float64

In [None]:
balanced_y_target = cd_additional_balanced.pop('y')

In [None]:
balanced_y_target = balanced_y_target.eq('yes').mul(1)

In [None]:
cd_additional_balanced_enc = pd.get_dummies(cd_additional_balanced)

## Our real model performance

In [None]:
scores = cross_validate(
    # Entropy-based tree, lightly pruned via cost-complexity pruning.
    DecisionTreeClassifier(criterion='entropy',
                           ccp_alpha=.002,
                           random_state=42),
    cd_additional_balanced_enc,
    balanced_y_target,
    # Integer seed keeps the fold assignment reproducible.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    scoring=['accuracy', 'recall', 'precision'],
    return_train_score=True)
pd.DataFrame(scores)

fold,fit_time,score_time,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.073947,0.008671,0.879121,0.881668,0.952165,0.95894,0.830795,0.83058
1,0.064335,0.008283,0.869382,0.884758,0.910795,0.925315,0.841194,0.855861
2,0.062147,0.02288,0.868089,0.881041,0.932083,0.947641,0.826261,0.836281


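## Random guessing

With two equally frequent classes, uniform random guessing should label about 50% of the rows correctly. No random_state is set on the dummy classifier, so each call to predict draws fresh guesses; this is why the confusion matrix and the classification report below differ slightly.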

In [None]:
dummy_clf = DummyClassifier(strategy="uniform")
dummy_clf.fit(cd_additional_balanced_enc, balanced_y_target)

print(confusion_matrix(y_true=balanced_y_target,y_pred=dummy_clf.predict(cd_additional_balanced_enc)))

[[2291 2349]
 [2252 2388]]


In [None]:
print(metrics.classification_report(balanced_y_target,dummy_clf.predict(cd_additional_balanced_enc)))

              precision    recall  f1-score   support

           0       0.50      0.52      0.51      4640
           1       0.50      0.49      0.50      4640

    accuracy                           0.50      9280
   macro avg       0.50      0.50      0.50      9280
weighted avg       0.50      0.50      0.50      9280
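One way to make the confusion matrix and the report describe the same guesses is to fix a seed and predict once. The sketch below is illustrative only: the stand-in data (`X_demo`, `y_demo`) and `random_state=0` are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-in data: 1,000 rows with balanced 0/1 labels.
X_demo = np.zeros((1000, 1))      # DummyClassifier ignores the feature values
y_demo = np.array([0, 1] * 500)

# random_state=0 is an assumption added here for reproducibility.
uniform_clf = DummyClassifier(strategy="uniform", random_state=0)
uniform_clf.fit(X_demo, y_demo)

# Predict once and reuse the result, so every metric uses the same guesses.
y_guess = uniform_clf.predict(X_demo)

cm = confusion_matrix(y_true=y_demo, y_pred=y_guess)
acc = accuracy_score(y_demo, y_guess)
print(cm)
print(f"accuracy: {acc:.3f}")
```

With the predictions stored in one variable, the diagonal of the confusion matrix and the reported accuracy always agree.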



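## Majority classifier

Accuracy: we expect accuracy approximately equal to the majority class proportion (0.50 here, since the two classes are tied).

Recall: we expect 100% recall for the class that is always predicted and 0% for the other. With a tie, the output below shows class 0 predicted for every row.

Precision: we expect precision for the predicted class to equal its proportion of the data. Precision for the other class is undefined because that class is never predicted, and the report shows it as 0.00.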

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(cd_additional_balanced_enc, balanced_y_target)

print(confusion_matrix(y_true=balanced_y_target,y_pred=dummy_clf.predict(cd_additional_balanced_enc)))

[[4640    0]
 [4640    0]]


In [None]:
print(metrics.classification_report(balanced_y_target,dummy_clf.predict(cd_additional_balanced_enc)))

              precision    recall  f1-score   support

           0       0.50      1.00      0.67      4640
           1       0.00      0.00      0.00      4640

    accuracy                           0.50      9280
   macro avg       0.25      0.50      0.33      9280
weighted avg       0.25      0.50      0.33      9280





# Imbalanced Target Example

## Import, transform, fit and predict

In [None]:
cd_additional_modified = pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/CD_additional_modified.csv')

In [None]:
cd_additional_modified.y.value_counts()

no     3668
yes     451
Name: y, dtype: int64

In [None]:
cd_additional_modified.y.value_counts(normalize=True)

no     0.890507
yes    0.109493
Name: y, dtype: float64

In [None]:
imbalanced_y_target = cd_additional_modified.pop('y')

In [None]:
imbalanced_y_target = imbalanced_y_target.eq('yes').mul(1)

In [None]:
cd_additional_modified_enc = pd.get_dummies(cd_additional_modified)

## Our real model performance

In [None]:
scores = cross_validate(
    DecisionTreeClassifier(ccp_alpha=0.01),  # pruned tree
    cd_additional_modified_enc,
    imbalanced_y_target,
    # Integer seed keeps the fold assignment reproducible.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    scoring=['accuracy', 'recall', 'precision'],
    return_train_score=True)
pd.DataFrame(scores)

fold,fit_time,score_time,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.027924,0.006825,0.909687,0.907138,0.346667,0.345515,0.666667,0.641975
1,0.022855,0.007267,0.907502,0.904224,0.453333,0.428571,0.60177,0.586364
2,0.023028,0.00692,0.898762,0.908594,0.397351,0.456667,0.555556,0.608889


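## Random guessing

We would still expect about 50% accuracy, and about 50% recall for each class, from random guessing. Because the majority class is much larger, far more of the correct guesses are majority-class rows (1804 versus 206 below), and precision for the minority class falls to roughly its prevalence (about 0.11).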

In [None]:
dummy_clf = DummyClassifier(strategy="uniform")
dummy_clf.fit(cd_additional_modified_enc, imbalanced_y_target)

print(confusion_matrix(y_true=imbalanced_y_target,y_pred=dummy_clf.predict(cd_additional_modified_enc)))

[[1804 1864]
 [ 245  206]]


In [None]:
print(metrics.classification_report(imbalanced_y_target,dummy_clf.predict(cd_additional_modified_enc)))

              precision    recall  f1-score   support

           0       0.88      0.48      0.62      3668
           1       0.10      0.48      0.17       451

    accuracy                           0.48      4119
   macro avg       0.49      0.48      0.40      4119
weighted avg       0.80      0.48      0.57      4119
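The expected values for uniform guessing can be worked out from the class counts alone. The sketch below is a back-of-the-envelope calculation, not part of the original notebook; the helper name `expected_uniform_metrics` is hypothetical.

```python
def expected_uniform_metrics(n_neg, n_pos):
    """Expected metrics when each row is labelled 0 or 1 with probability 0.5."""
    # In expectation, half of each class is guessed as positive.
    tp, fn = n_pos / 2, n_pos / 2
    fp, tn = n_neg / 2, n_neg / 2
    return {
        "accuracy": (tp + tn) / (n_neg + n_pos),
        "recall_pos": tp / (tp + fn),
        "precision_pos": tp / (tp + fp),
        "precision_neg": tn / (tn + fn),
    }

# Class counts taken from value_counts() above: 3668 "no", 451 "yes".
expected = expected_uniform_metrics(n_neg=3668, n_pos=451)
for name, value in expected.items():
    print(f"{name}: {value:.3f}")
```

Expected precision for each class equals that class's share of the data, which is why the report shows about 0.88 for class 0 and about 0.10 for class 1 even though recall is near 0.5 for both.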



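## Majority classifier

Accuracy: we expect accuracy approximately equal to the majority class proportion (about 0.89 here).

Recall: we expect 100% recall for the majority class (0) and 0% for the minority class (1).

Precision: we expect precision for the majority class to equal its prevalence (about 0.89). Precision for the minority class is undefined because that class is never predicted, and the report shows it as 0.00.

Compared with our real classifier, accuracy is very similar (0.89 versus about 0.90 to 0.91), but recall and precision for the minority class are far apart: 0 for the majority classifier versus roughly 0.35 to 0.45 recall and 0.56 to 0.67 precision for the tree.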

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(cd_additional_modified_enc, imbalanced_y_target)

print(confusion_matrix(y_true=imbalanced_y_target,y_pred=dummy_clf.predict(cd_additional_modified_enc)))


[[3668    0]
 [ 451    0]]


In [None]:
print(metrics.classification_report(imbalanced_y_target,dummy_clf.predict(cd_additional_modified_enc)))

              precision    recall  f1-score   support

           0       0.89      1.00      0.94      3668
           1       0.00      0.00      0.00       451

    accuracy                           0.89      4119
   macro avg       0.45      0.50      0.47      4119
weighted avg       0.79      0.89      0.84      4119
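The numbers in this report follow directly from the confusion matrix. As a minimal sketch (the helper `metrics_from_confusion` is hypothetical and not part of the original notebook), the metrics for the positive class can be recomputed by hand:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy plus recall and precision for the positive class (label 1)."""
    total = tn + fp + fn + tp
    predicted_pos = tp + fp
    actual_pos = tp + fn
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / actual_pos if actual_pos else 0.0,
        # No positive predictions means precision is 0/0; report 0.0 here,
        # mirroring the value shown in the classification report above.
        "precision": tp / predicted_pos if predicted_pos else 0.0,
    }

# Counts from the confusion matrix above: [[3668, 0], [451, 0]].
majority = metrics_from_confusion(tn=3668, fp=0, fn=451, tp=0)
print(majority)
```

Accuracy comes out at 3668 / 4119, about 0.89, while recall and precision for the "yes" class are both 0.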





# What does overfitting look like? This.

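Note the huge drop-off in metrics from the train sets to the test sets. With ccp_alpha=0 the tree is not pruned: every train metric is 1.0, while test recall and precision fall to about 0.47 to 0.51. Test accuracy stays near 0.88 only because the classes are imbalanced.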

In [None]:
scores = cross_validate(
    DecisionTreeClassifier(ccp_alpha=0),  # no pruning
    cd_additional_modified_enc,
    imbalanced_y_target,
    # Integer seed keeps the fold assignment reproducible.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    scoring=['accuracy', 'recall', 'precision'],
    return_train_score=True)
pd.DataFrame(scores)

fold,fit_time,score_time,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.04521,0.011858,0.885652,1.0,0.466667,1.0,0.47619,1.0
1,0.039759,0.011868,0.884195,1.0,0.513333,1.0,0.472393,1.0
2,0.037385,0.033662,0.883467,1.0,0.503311,1.0,0.47205,1.0


## And this.

In [None]:
scores = cross_validate(
    DecisionTreeClassifier(ccp_alpha=0),  # no pruning
    cd_additional_balanced_enc,
    balanced_y_target,
    # Integer seed keeps the fold assignment reproducible.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    scoring=['accuracy', 'recall', 'precision'],
    return_train_score=True)
pd.DataFrame(scores)

fold,fit_time,score_time,test_accuracy,train_accuracy,test_recall,train_recall,test_precision,train_precision
0,0.093268,0.014024,0.829347,1.0,0.817065,1.0,0.837641,1.0
1,0.07866,0.018628,0.826705,1.0,0.815126,1.0,0.834547,1.0
2,0.083883,0.008708,0.822502,1.0,0.819534,1.0,0.824333,1.0
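The size of the train-to-test drop can be summarised as a single number per metric. The sketch below is one possible summary, using recall scores copied from the tables above; the helper `train_test_gap` is hypothetical.

```python
from statistics import mean

def train_test_gap(train_scores, test_scores):
    """Mean train score minus mean test score across folds."""
    return mean(train_scores) - mean(test_scores)

# Recall per fold, copied from the cross-validation tables above.
unpruned_imbalanced_gap = train_test_gap(
    train_scores=[1.0, 1.0, 1.0],
    test_scores=[0.466667, 0.513333, 0.503311],
)
unpruned_balanced_gap = train_test_gap(
    train_scores=[1.0, 1.0, 1.0],
    test_scores=[0.817065, 0.815126, 0.819534],
)
# Pruned tree (ccp_alpha=0.01) on the imbalanced data, for contrast.
pruned_imbalanced_gap = train_test_gap(
    train_scores=[0.345515, 0.428571, 0.456667],
    test_scores=[0.346667, 0.453333, 0.397351],
)

print(f"unpruned, imbalanced: {unpruned_imbalanced_gap:.3f}")
print(f"unpruned, balanced:   {unpruned_balanced_gap:.3f}")
print(f"pruned, imbalanced:   {pruned_imbalanced_gap:.3f}")
```

A gap near zero, as for the pruned tree, suggests the model generalises about as well as it fits; the large gaps for the unpruned trees are the signature of overfitting.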


# Takeaway

Judging solely by accuracy, we may think that our model is doing well, but even a majority-rule classifier reaches about 89% accuracy on the imbalanced dataset.

Dummy classifiers help us interpret model performance by providing benchmarks based on random or most_frequent predictions. If our model fails to beat a dummy classifier, it has probably not learned much, and we should investigate the cause.
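As a rough sketch of how this comparison could be run in one pass, the example below scores a tree and two dummy baselines with the same folds and metrics. It uses synthetic data from `make_classification` rather than the course datasets; the names `X_syn`, `y_syn`, `candidates`, and `summary`, as well as the seeds, are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an imbalanced dataset (about 90% class 0).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)

scoring = {
    "accuracy": "accuracy",
    "recall": "recall",
    # zero_division=0 reports 0 instead of warning when class 1 is never predicted.
    "precision": make_scorer(precision_score, zero_division=0),
}
candidates = {
    "tree": DecisionTreeClassifier(ccp_alpha=0.01, random_state=42),
    "most_frequent": DummyClassifier(strategy="most_frequent"),
    "uniform": DummyClassifier(strategy="uniform", random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X_syn, y_syn, cv=cv, scoring=scoring)
    # Average each test metric over the folds.
    summary[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, row in summary.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

Reading the rows side by side shows whether the model beats both baselines on recall and precision, not just on accuracy.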