# Week 6: Ensembles and Evaluation
## Cross Validation
This week, we'll be studying in-depth methods for evaluating classifiers. We'll start by learning about cross validation, confusion matrices, and student's t-test. To study these concepts, we'll be comparing the results of a decision tree and k-nearest neighbors on the hypothyroid dataset we have previously utilized.

In [1]:
import pandas as pd
import numpy as np
# download data
data_url = 'https://raw.githubusercontent.com/cse44648/cse44648/master/datasets/hypothyroid.csv'

data = pd.read_csv(data_url)
features_to_use = ['Age', 'T4U', 'TSH']
X = data.loc[:, features_to_use] # we will only use some features
y = data.iloc[:, -1] # get class
X

Unnamed: 0,Age,T4U,TSH
0,72.0,1.48,30.0
1,15.0,1.13,145.0
2,24.0,1.00,0.0
3,24.0,1.04,430.0
4,77.0,1.28,7.3
...,...,...,...
3158,58.0,0.91,5.8
3159,29.0,1.01,0.8
3160,77.0,0.68,1.2
3161,74.0,0.48,1.3


In [2]:
# start by preprocessing the data
# we'll try dropping NA values first
X = X.dropna()
y = y.loc[X.index] # keep the corresponding classes
print(len(X))
print(len(y))
print(y.value_counts(normalize=True))

2291
2291
Class
negative       0.941074
hypothyroid    0.058926
Name: proportion, dtype: float64


In [3]:
from sklearn.preprocessing import StandardScaler
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=features_to_use)
X.describe()
# note that the mean is very close to 0, but not quite - this is because Python takes shortcuts when rounding data

Unnamed: 0,Age,T4U,TSH
count,2291.0,2291.0,2291.0
mean,1.1630450000000001e-17,-5.458556e-16,-1.240581e-17
std,1.000218,1.000218,1.000218
min,-2.71773,-4.39317,-0.2498511
25%,-0.8333142,-0.5783611,-0.2498511
50%,0.1612387,-0.1347786,-0.220792
75%,0.8417223,0.3531621,-0.1585223
max,2.359724,4.611554,21.75208


Next, let's split our data into training and testing folds using 5-fold validation. In this case, the data set is split into 5 equal-sized partitions. In each fold, 1/5 of the data is used as testing data, and the other 4/5 of the data is used as training data. Each instance will thus be part of the testing set exactly once, and the training set four times.

In [4]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=False) # initialize the KFold object
splits = kfold.split(X) # call the split method on our feature set
for train_idx, test_idx in list(splits):
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]
    print('Train set size: {}; test set size: {}'.format(len(X_train), len(X_test)))

Train set size: 1832; test set size: 459
Train set size: 1833; test set size: 458
Train set size: 1833; test set size: 458
Train set size: 1833; test set size: 458
Train set size: 1833; test set size: 458


Next we will train a KNN and decision tree model for each fold, and compare their average performance across each of the folds.

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

kfold = KFold(n_splits=5, shuffle=True) # initialize the KFold object
splits = kfold.split(X) # call the split method on our feature set

knn = KNeighborsClassifier(n_neighbors=5)
dt = DecisionTreeClassifier()

knn_results = []
dt_results = []

for i, split in enumerate(splits):
    train_idx, test_idx = split
    X_train = X.iloc[train_idx]
    y_train = y.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_test = y.iloc[test_idx]
    
    # each time we call the .fit() function on a model, it overwrites any previous training.
    knn.fit(X_train, y_train)
    dt.fit(X_train, y_train)
    
    # generate predictions and calculate accuracy
    knn_pred = knn.predict(X_test)
    knn_acc = np.sum(knn_pred == y_test) / len(y_test)
    knn_results.append(knn_acc)
    dt_pred = dt.predict(X_test)
    dt_acc = np.sum(dt_pred == y_test) / len(y_test)
    dt_results.append(dt_acc)
    
    print('Test fold {} results:'.format(i + 1))
    print('\tKNN: {:.4f}'.format(knn_acc))
    print('\t DT: {:.4f}'.format(dt_acc))
    
print('Average results:')
print('\tKNN: {:.4f} +- {:.4f}'.format(np.mean(knn_results), np.std(knn_results)))
print('\t DT: {:.4f} +- {:.4f}'.format(np.mean(dt_results), np.std(dt_results)))

Test fold 1 results:
	KNN: 0.9586
	 DT: 0.9521
Test fold 2 results:
	KNN: 0.9563
	 DT: 0.9607
Test fold 3 results:
	KNN: 0.9672
	 DT: 0.9585
Test fold 4 results:
	KNN: 0.9607
	 DT: 0.9498
Test fold 5 results:
	KNN: 0.9607
	 DT: 0.9541
Average results:
	KNN: 0.9607 +- 0.0036
	 DT: 0.9550 +- 0.0040


The average and standard deviation of a classifier across multiple testing sets provides a more robust measure of performance, since the classifier is tested on a variety of samples. Also, the low standard deviation means that there is low variance between the samples (folds). However, rather than just eyeballing these numbers, let's use statistical methods to decide whether the differences are significant.

## Student's T-test
Student's t-test is a statistical hypothesis test that is used to determine whether there is a statistically significant difference between the values of 2 samples. In this case, our 2 samples are the cross validation scores for the KNN and decision tree models. The null hypothesis is that there is no significant difference between the outputs of the two classifiers. Based on the result of the t-test we can either reject the null hypothesis (if the p-value is low enough) or fail to reject it.

In [7]:
from scipy.stats import ttest_ind

sample1 = knn_results
sample2 = dt_results

print(sample1)
print(sample2)

ttest_ind(sample1, sample2)

[0.9694989106753813, 0.9650655021834061, 0.9344978165938864, 0.9541484716157205, 0.9650655021834061]
[0.9520697167755992, 0.9606986899563319, 0.9585152838427947, 0.9497816593886463, 0.9541484716157205]


Ttest_indResult(statistic=0.39383979188172374, pvalue=0.703988507847489)

In this case, the p-value is .49 - well above the typical threshold of 0.05 for significance. Therefore, we fail to reject the null hypothesis - i.e., we say that there is no statistically significant difference between performance of the two classifiers.

Let's try comparing KNN to Naive Bayes.

In [1]:
from sklearn.naive_bayes import GaussianNB
from scipy.stats import ttest_ind

kfold = KFold(n_splits=5, shuffle=True) # initialize the KFold object
splits = kfold.split(X) # call the split method on our feature set

knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()

knn_results = []
nb_results = []

for i, split in enumerate(splits):
    train_idx, test_idx = split
    X_train = X.iloc[train_idx]
    y_train = y.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_test = y.iloc[test_idx]
    
    # each time we call the .fit() function on a model, it overwrites any previous training.
    knn.fit(X_train, y_train)
    nb.fit(X_train, y_train)
    
    # generate predictions and calculate accuracy
    knn_pred = knn.predict(X_test)
    knn_acc = np.sum(knn_pred == y_test) / len(y_test)
    knn_results.append(knn_acc)
    nb_pred = dt.predict(X_test)
    nb_acc = np.sum(nb_pred == y_test) / len(y_test)
    nb_results.append(nb_acc)
    
    print('Test fold {} results:'.format(i + 1))
    print('\tKNN: {:.4f}'.format(knn_acc))
    print('\t NB: {:.4f}'.format(nb_acc))
    
print('Average results:')
print('\tKNN: {:.4f} +- {:.4f}'.format(np.mean(knn_results), np.std(knn_results)))
print('\t NB: {:.4f} +- {:.4f}'.format(np.mean(nb_results), np.std(nb_results)))
print()
print(ttest_ind(knn_results, nb_results))

NameError: name 'KFold' is not defined

In this case, the p-value is well below the 0.05 significance threshold. We can therefore reject the null hypothesis that the results of the classifiers are not statistically significantly different.

## Confusion Matrices
While the t-test is useful for comparing the performance of classifiers, performance as measured by accuracy does not tell us about the type of errors our classifier makes. Above we saw that about 94% of the data has the class "negative" which means that any classifier that simply predicted "hypothyroid" would achieve an accuracy of around 94%. To better understand the types of errors the classifier makes, we can use a confusion matrix. In this example, we will show the confusion matrix from the last testing fold in the previous example.

In [9]:
from sklearn.metrics import confusion_matrix
print('Confusion matrix for KNN')
print(confusion_matrix(y_test, knn_pred))
print('\nConfusion matrix for DT')
print(confusion_matrix(y_test, dt_pred))

Confusion matrix for KNN
[[ 11  14]
 [  8 425]]

Confusion matrix for DT
[[ 16   9]
 [ 14 419]]


In [14]:
from sklearn.metrics import ConfusionMatrixDisplay
# sklearn also includes plot_confusion_matrix, which does the predictions and computes the matrix for us
ConfusionMatrixDisplay(knn, X_test, y_test)
ConfusionMatrixDisplay(dt, X_test, y_test)

TypeError: __init__() takes 2 positional arguments but 4 were given

The confusion matrix is a summary of the types of errors made by a classifier. The y axis represents the true label, and the x axis represents the predicted label. The top left cell represents the number of testing instances that had a true label of "hypothyroid" for which the classifier correctly predicted "hypothyroid" - the true positives. The bottom left cell shows false positives, the top right cell shows false negatives, and the bottom right cell shows true negatives.

In the lecture we will discuss precision, recall, ROC curves, and other tools that measure a classifier based on the types of errors shown in this matrix.