# Estimating Test Metrics for Classification

We have learned several scores (accuracy, precision, recall, F1) for evaluating classification models. So far, we calculated these scores on the training data, that is, the same data that was used to fit the model. When studying regression models, we saw that evaluating a machine learning model on its training data is problematic because the model can achieve a good training score by _overfitting_ to that data. We argued that the goal of a machine learning model should be to achieve a good score on test data. However, true labels are often not available for the test data. Nevertheless, we can use cross-validation on the training data to estimate the test scores. These so-called validation scores can be used to select between models and tune hyperparameters.

The same cross-validation process that we used for regression models can be carried out for classification models. Instead of calculating the *training* accuracy, precision, etc., we estimate the *test* accuracy, precision, etc. using cross-validation. This notebook demonstrates how to carry out this program, but the concepts (and even the code) are essentially identical to what we did for regression models.

First, we define a classifier that we want to evaluate, once again using the breast cancer data.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df_breast = pd.read_csv("http://dlsun.github.io/pods/data/breast-cancer.csv")

X_train = df_breast[["Clump Thickness", "Uniformity of Cell Size"]]
y_train = df_breast["Class"]

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

pipeline.fit(X=X_train, y=y_train)

To estimate test scores using $k$-fold cross-validation, we use the `cross_val_score` function in scikit-learn. For example, to estimate the test accuracy, we do the following:

In [2]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X_train, y_train,
                            cv=10, scoring="accuracy")
cv_scores

array([0.92753623, 0.89855072, 0.92753623, 0.94117647, 0.94117647,
       0.91176471, 0.95588235, 1.        , 0.98529412, 0.97058824])

Compare to the code we used to perform cross-validation for regression; the only change is to `scoring`.

We get 10 accuracy scores, one from each of the $k=10$ folds. It is customary to average these accuracy scores to obtain one overall estimate of the test accuracy.

In [3]:
cv_scores.mean()

0.9459505541346973

This validation accuracy is high, but a little lower than the 95.3% training accuracy that we obtained in the previous lesson. This makes sense because it is typically harder for a model to predict on data it has not seen than on data it was fit to.
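
The fold-by-fold procedure behind these numbers can be sketched in plain Python. This is only an illustration under stated assumptions: the data (`xs`, `ys`) are made up, the "model" is a toy nearest-class-mean rule rather than the $k$-nearest neighbors pipeline, and the helper names (`fold_ranges`, `fit_class_means`, `predict`) are hypothetical. It also uses plain contiguous folds, whereas scikit-learn uses stratified folds when given a classifier and an integer `cv`.

```python
def fold_ranges(n_samples, n_folds):
    """Split indices 0..n_samples-1 into contiguous folds.

    The first n_samples % n_folds folds get one extra sample.
    """
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        stop = start + base + (1 if f < extra else 0)
        folds.append(range(start, stop))
        start = stop
    return folds


def fit_class_means(xs, ys):
    """Toy 'model': remember the mean feature value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(means[label] - x))


# Hypothetical one-feature data set (not the breast cancer data).
xs = [1, 8, 2, 9, 3, 7, 4, 10, 6, 5, 2, 8, 1, 9, 3, 7, 5, 6, 4, 10, 2, 9, 3]
ys = [0 if x <= 5 else 1 for x in xs]
ys[8], ys[18] = 0, 1  # two noisy labels so the problem is not perfectly separable

fold_scores = []
for val_idx in fold_ranges(len(xs), n_folds=5):
    held_out = set(val_idx)
    # Fit only on the samples outside the current fold ...
    train_x = [x for i, x in enumerate(xs) if i not in held_out]
    train_y = [y for i, y in enumerate(ys) if i not in held_out]
    means = fit_class_means(train_x, train_y)
    # ... and score only on the held-out fold.
    correct = sum(predict(means, xs[i]) == ys[i] for i in val_idx)
    fold_scores.append(correct / len(val_idx))

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Each sample is used for validation exactly once, and no score is ever computed on a sample that the model was fit to in that round.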

However, accuracy is still not a great performance measure, even if it is test accuracy (rather than training accuracy). Instead, we focus on precision, recall, and F1 score.

Remember that each class of the target variable has its own precision, recall, and F1 score. Scikit-learn can calculate the precision and recall of a class $c$, but the labels first need to be converted to a binary label: $1$ (or `True`) if the observation is in class $c$ and $0$ (or `False`) otherwise. For example, to calculate the precision for malignant tumors (class 1), we define the new label `is_malignant`.

In [4]:
is_malignant = (y_train == 1)

cv_scores = cross_val_score(pipeline, X_train, is_malignant,
                cv=10, scoring="precision")

cv_scores

array([0.91304348, 0.94736842, 0.91304348, 0.91304348, 0.85714286,
       0.95      , 0.95652174, 1.        , 1.        , 0.95833333])

Once again, we get 10 precision scores, one from each of the $k=10$ folds, which we can average to get an overall estimate of the test precision.

In [5]:
precision_1 = cv_scores.mean()
precision_1

0.9408496785441866

To calculate recall or F1 score for malignancy we just need to change the scoring.

In [6]:
recall_1 = cross_val_score(pipeline, X_train, is_malignant,
                cv=10, scoring="recall").mean()
recall_1

0.9038043478260871

In [7]:
f1score_1 = cross_val_score(pipeline, X_train, is_malignant,
                cv=10, scoring="f1").mean()

f1score_1

0.9197427060207539
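
To make the three scores concrete, here is a minimal sketch that computes them by hand for one class, after converting the labels to "in class $c$ or not" as described above. The function name `one_vs_rest_scores` and the label lists are hypothetical and are not taken from the breast cancer data.

```python
def one_vs_rest_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, treating all other classes as negative."""
    true_pos = [y == positive for y in y_true]
    pred_pos = [y == positive for y in y_pred]

    tp = sum(t and p for t, p in zip(true_pos, pred_pos))      # correctly flagged
    fp = sum(p and not t for t, p in zip(true_pos, pred_pos))  # flagged, but wrong
    fn = sum(t and not p for t, p in zip(true_pos, pred_pos))  # missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical true labels and predictions (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]

# Class 1: 3 true positives, 2 false positives, 1 false negative,
# so precision is 3/5 and recall is 3/4.
print(one_vs_rest_scores(y_true, y_pred, positive=1))

# Class 0: the same predictions give different scores.
print(one_vs_rest_scores(y_true, y_pred, positive=0))
```

The same predictions yield a different precision and recall depending on which class is treated as positive, which is why each class is scored separately.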

Likewise, the validation precision, recall, and F1 score for benign (class 0) tumors are:

In [8]:
is_benign = (y_train == 0)

precision_0 = cross_val_score(pipeline, X_train, is_benign,
                              cv=10, scoring="precision").mean()

recall_0 = cross_val_score(pipeline, X_train, is_benign,
                           cv=10, scoring="recall").mean()

f1score_0 = cross_val_score(pipeline, X_train, is_benign,
                            cv=10, scoring="f1").mean()

precision_0, recall_0, f1score_0

(0.9800264645465042, 0.9482323232323232, 0.9633975996956522)
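
Each `cross_val_score` call above refits the pipeline on every fold, so computing three metrics repeats the same work three times. One possible alternative is scikit-learn's `cross_validate`, which accepts a list of scoring names and returns one array per metric. The sketch below is self-contained, so it uses synthetic data from `make_classification` as a stand-in; `X_demo`, `y_demo`, and `demo_pipeline` are illustrative names, and the resulting scores say nothing about the breast cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, two-class data (an assumption made for this demo).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10)
)

# Binary label for the class of interest, as in the cells above.
is_class_0 = (y_demo == 0).astype(int)

metrics = ["precision", "recall", "f1"]
results = cross_validate(demo_pipeline, X_demo, is_class_0,
                         cv=10, scoring=metrics)

# Each metric is stored under the key "test_<name>", one value per fold.
summary = {name: results[f"test_{name}"].mean() for name in metrics}
summary
```

All three metrics are then computed on the same fitted models and the same folds.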

Remember that each category of the target variable has its own precision, recall, and F1 score. We can average these values over the categories to get a single "model" precision, recall, and F1 score by appending `_macro` to the scoring name.

In [9]:
precision_macro = cross_val_score(pipeline, X_train, is_benign,
                                  cv=10, scoring="precision_macro").mean()

recall_macro = cross_val_score(pipeline, X_train, is_benign,
                               cv=10, scoring="recall_macro").mean()

f1score_macro = cross_val_score(pipeline, X_train, is_benign,
                                cv=10, scoring="f1_macro").mean()

precision_macro, recall_macro, f1score_macro

(0.9456715619079684, 0.9552755819060167, 0.9491032590587702)
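
As a rough illustration of what a macro average means, the sketch below computes the precision of each class by hand and then takes their unweighted mean. The helper `precision_for` and the label lists are hypothetical; the same idea applies to recall and F1.

```python
def precision_for(y_true, y_pred, positive):
    """Fraction of the predictions for class `positive` that are correct."""
    true_labels = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_labels:
        return 0.0
    return sum(t == positive for t in true_labels) / len(true_labels)


# Hypothetical imbalanced data: eight observations of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

per_class = {c: precision_for(y_true, y_pred, c) for c in sorted(set(y_true))}
macro_precision = sum(per_class.values()) / len(per_class)

print(per_class)        # {0: 0.875, 1: 0.5}
print(macro_precision)  # 0.6875
```

Because the average is unweighted, the rare class counts exactly as much as the common one, so a model that does poorly on the rare class gets a visibly lower macro score.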

## Hyperparameter Tuning

Could we do better with a different value of $k$ in our $k$-nearest neighbors model? We can use cross-validation on a grid of $k$ values and pick the one that maximizes some score. Since the F1 score summarizes both precision and recall, we'll use F1 as the score. There are two F1 scores, one for the benign masses and one for the malignant masses; the `_macro` suffix specifies that we take their average.

In [10]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 50)},
    scoring="f1_macro",
    cv=10
)

grid_search.fit(X_train, y_train)
grid_search.best_params_

{'kneighborsclassifier__n_neighbors': 7}
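
Conceptually, a grid search cross-validates every candidate value and keeps the one with the best mean validation score. The sketch below spells that loop out in plain Python under several simplifying assumptions: the one-feature data are made up, the folds are interleaved rather than stratified, the score is accuracy rather than macro F1, and the helpers `knn_predict` and `cv_accuracy` are hypothetical. It is not how `GridSearchCV` is implemented.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k_neighbors, n_folds=5):
    """Mean validation accuracy; fold f holds out indices f, f + n_folds, ..."""
    scores = []
    for f in range(n_folds):
        val_idx = list(range(f, len(xs), n_folds))
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], k_neighbors) == ys[i]
            for i in val_idx
        )
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)


# Hypothetical one-feature data with some overlap between the classes.
xs = [1.0, 1.5, 2.0, 2.2, 2.8, 3.0, 3.5, 3.9, 4.1, 4.4,
      4.6, 5.0, 5.2, 5.8, 6.0, 6.5, 7.0, 7.4, 8.0, 8.5]
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
      0, 1, 1, 1, 0, 1, 1, 1, 1, 1]

grid = [1, 3, 5, 7, 9]  # odd values avoid tied votes between two classes
results = {k: cv_accuracy(xs, ys, k) for k in grid}
best_k = max(results, key=results.get)

print(results)
print(best_k)
```

The dictionary `results` plays the role of `mean_test_score`, and `best_k` plays the role of `best_params_`.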

The grid search identifies $k=7$ as optimal. How does this value compare to other values of $k$?

In [11]:
pd.DataFrame(grid_search.cv_results_).sort_values("rank_test_score").head(10)

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_kneighborsclassifier__n_neighbors` | `params` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `split3_test_score` | `split4_test_score` | `split5_test_score` | `split6_test_score` | `split7_test_score` | `split8_test_score` | `split9_test_score` | `mean_test_score` | `std_test_score` | `rank_test_score` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.001581 | 1.1e-05 | 0.003173 | 6e-05 | 7 | `{'kneighborsclassifier__n_neighbors': 7}` | 0.919336 | 0.917563 | 0.952534 | 0.935606 | 0.984049 | 0.918719 | 0.967803 | 0.984049 | 0.984049 | 0.968372 | 0.953208 | 0.026859 | 1 |
| 7 | 0.001587 | 2.4e-05 | 0.003188 | 3.9e-05 | 8 | `{'kneighborsclassifier__n_neighbors': 8}` | 0.919336 | 0.917563 | 0.952534 | 0.935606 | 0.984049 | 0.918719 | 0.950183 | 1.0 | 0.983744 | 0.968372 | 0.95301 | 0.028687 | 2 |
| 8 | 0.001609 | 5.3e-05 | 0.003193 | 5.2e-05 | 9 | `{'kneighborsclassifier__n_neighbors': 9}` | 0.919336 | 0.917563 | 0.968636 | 0.935606 | 0.937729 | 0.918719 | 0.967803 | 0.984049 | 0.983744 | 0.968372 | 0.950156 | 0.025745 | 3 |
| 4 | 0.001644 | 6.1e-05 | 0.003315 | 0.000214 | 5 | `{'kneighborsclassifier__n_neighbors': 5}` | 0.919336 | 0.917563 | 0.937273 | 0.9343 | 0.984049 | 0.883761 | 0.935606 | 0.984049 | 0.984049 | 0.968372 | 0.944836 | 0.032472 | 4 |
| 5 | 0.001609 | 4.4e-05 | 0.003208 | 7.5e-05 | 6 | `{'kneighborsclassifier__n_neighbors': 6}` | 0.919336 | 0.917563 | 0.936111 | 0.9343 | 0.967803 | 0.827761 | 0.96715 | 1.0 | 0.983744 | 0.968372 | 0.942214 | 0.046212 | 5 |
| 22 | 0.001638 | 7.6e-05 | 0.003323 | 0.000122 | 23 | `{'kneighborsclassifier__n_neighbors': 23}` | 0.919336 | 0.881763 | 0.919336 | 0.9343 | 0.984049 | 0.883761 | 0.950183 | 0.984049 | 0.984049 | 0.967803 | 0.940863 | 0.037591 | 6 |
| 23 | 0.001578 | 1.8e-05 | 0.003228 | 4.2e-05 | 24 | `{'kneighborsclassifier__n_neighbors': 24}` | 0.919336 | 0.881763 | 0.919336 | 0.9343 | 0.984049 | 0.883761 | 0.950183 | 0.984049 | 0.984049 | 0.967803 | 0.940863 | 0.037591 | 6 |
| 10 | 0.00158 | 1.5e-05 | 0.003182 | 4.6e-05 | 11 | `{'kneighborsclassifier__n_neighbors': 11}` | 0.917563 | 0.899903 | 0.92089 | 0.935606 | 0.937729 | 0.883761 | 0.951231 | 1.0 | 0.984049 | 0.967803 | 0.939853 | 0.034789 | 8 |
| 9 | 0.001583 | 2.1e-05 | 0.003178 | 5.9e-05 | 10 | `{'kneighborsclassifier__n_neighbors': 10}` | 0.919336 | 0.881763 | 0.919336 | 0.9343 | 0.937729 | 0.899209 | 0.951231 | 1.0 | 0.983744 | 0.967803 | 0.939445 | 0.035152 | 9 |
| 19 | 0.001581 | 4.6e-05 | 0.003206 | 4.9e-05 | 20 | `{'kneighborsclassifier__n_neighbors': 20}` | 0.919336 | 0.881763 | 0.904167 | 0.9343 | 0.984049 | 0.883761 | 0.950183 | 0.984049 | 0.984049 | 0.967803 | 0.939346 | 0.038719 | 10 |
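
When reading results like these, it can help to set the gap in `mean_test_score` between two settings against the fold-to-fold spread in `std_test_score`. The short sketch below does this for the $k=7$ and $k=10$ rows, with the values copied from the results above; dividing by the standard deviation is just one informal yardstick, not a formal significance test.

```python
# mean_test_score and std_test_score for two rows of the grid search results.
rows = {
    7: {"mean": 0.953208, "std": 0.026859},
    10: {"mean": 0.939445, "std": 0.035152},
}

# Difference in mean macro F1 between k=7 and k=10 (about 0.014).
gap = rows[7]["mean"] - rows[10]["mean"]

# The same gap expressed in units of the k=7 fold-to-fold standard deviation
# (roughly half a standard deviation).
gap_in_stds = gap / rows[7]["std"]

print(round(gap, 6), round(gap_in_stds, 2))
```

A gap that is well within one standard deviation of the per-fold scores suggests that the ranking of nearby $k$ values could easily change with a different split of the data.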


Is $k=7$ better than $k=10$, the value we had used previously? Choosing $k=7$ leads to a higher average F1 score, though the difference is relatively small. What about the precision and recall for malignant masses for $k=7$?

In [12]:
new_precision = cross_val_score(
    grid_search.best_estimator_,
    X_train, is_malignant,
    scoring="precision",
    cv=10).mean()

new_recall = cross_val_score(
    grid_search.best_estimator_,
    X_train, is_malignant,
    scoring="recall",
    cv=10).mean()

precision_1, new_precision

(0.9408496785441866, 0.9339878165312948)

In [13]:
recall_1, new_recall

(0.9038043478260871, 0.9456521739130436)

We see that the new model has a higher recall but a lower precision for malignancy. So depending on the scoring criteria, we can end up with different "best" models.

While we won't cover it now, we'll mention that ensemble methods can also be used to combine predictions from multiple classification models (as we did for regression models).