# Evaluating Classifiers

Term 1 2019 - Instructor: Teerapong Leelanupab

Teaching Assistant: Suttida Satjasunsern

***

A key task when applying a classifier is to determine how effective our classifier will be at making predictions. One way to estimate this is to divide the full dataset into two sets using a "hold-out strategy":
1. *Training set*: A set of examples used to build the classification model.
2. *Test set*: A separate set of examples that is withheld from the classifier during training, and is used afterwards to evaluate the model.

### Evaluating Simple Train/Test Splits

To demonstrate how to evaluate classifiers in scikit-learn, we will randomly generate an artificial dataset with 400 examples, each described by 10 features and labelled with one of two classes: Positive (1) and Negative (-1).

In [1]:
from sklearn.datasets import make_hastie_10_2
data, target = make_hastie_10_2(400)

In [2]:
print(data.shape)
data[0,:]

(400, 10)


array([ 0.45936383, -0.01534159, -1.78911457,  0.45812823, -0.03201433,
        1.13804367, -0.5278233 , -1.10860896,  0.22043537,  0.3348967 ])

In [21]:
data[:5]

array([[ 0.45936383, -0.01534159, -1.78911457,  0.45812823, -0.03201433,
         1.13804367, -0.5278233 , -1.10860896,  0.22043537,  0.3348967 ],
       [-0.75795659, -1.34308119,  0.72057333, -0.65958994, -1.20833129,
        -0.45483804, -0.77665157, -0.96312656,  0.2326806 , -0.56812524],
       [ 0.74034954,  0.29369466,  1.08873705, -0.31758993, -1.0471465 ,
        -0.34017487, -0.78892892,  0.67615451, -0.44667061, -0.21387939],
       [-0.97335541,  0.03331356, -1.01965619,  0.90629496, -1.15466746,
         0.79024203, -1.30368906, -1.60406674,  0.65359115, -0.67078714],
       [-0.13840536, -1.99894361, -0.44249836,  0.61801621, -0.33669973,
         0.67166099, -0.70427787,  0.79217237, -0.80378995,  1.59405991]])

Each item in this artificial dataset has a label:

In [3]:
print(target)

[-1. -1. -1.  1.  1. -1. -1.  1. -1.  1.  1. -1. -1. -1.  1.  1. -1. -1.
 -1.  1.  1. -1. -1.  1. -1.  1.  1. -1.  1.  1. -1.  1. -1.  1.  1.  1.
  1. -1. -1.  1.  1.  1.  1. -1.  1. -1.  1.  1.  1.  1. -1. -1. -1.  1.
 -1.  1.  1. -1. -1.  1.  1.  1. -1. -1.  1.  1. -1.  1.  1.  1. -1.  1.
 -1. -1. -1.  1.  1. -1.  1. -1. -1. -1. -1.  1.  1.  1. -1.  1. -1.  1.
  1. -1.  1. -1.  1.  1. -1.  1.  1. -1. -1. -1.  1. -1.  1.  1.  1.  1.
  1.  1. -1.  1. -1. -1. -1.  1. -1.  1.  1. -1. -1.  1. -1.  1.  1. -1.
  1.  1. -1.  1.  1. -1. -1. -1.  1.  1. -1. -1.  1. -1. -1. -1.  1.  1.
 -1.  1. -1. -1.  1. -1. -1.  1.  1.  1.  1.  1. -1. -1. -1. -1.  1. -1.
  1. -1. -1.  1.  1. -1.  1.  1.  1. -1. -1.  1.  1. -1.  1.  1.  1. -1.
  1.  1.  1. -1. -1.  1. -1.  1.  1.  1. -1. -1. -1.  1. -1. -1.  1.  1.
  1. -1.  1.  1.  1. -1.  1.  1. -1.  1.  1.  1.  1.  1. -1. -1. -1. -1.
  1.  1.  1. -1. -1.  1.  1. -1.  1.  1.  1. -1. -1.  1. -1.  1.  1.  1.
  1.  1.  1.  1. -1. -1.  1. -1.  1. -1.  1. -1. -1

We can easily split the complete dataset at random into a training set and a test set. We will specify that 20% (0.2) of the data will be used for the test set.

In [4]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2)

In [23]:
target_test

array([ 1.,  1.,  1.,  1., -1.,  1.,  1.,  1., -1., -1., -1.,  1.,  1.,
        1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., -1.,  1., -1.,
       -1., -1.,  1.,  1.,  1.,  1., -1., -1.,  1., -1., -1.,  1., -1.,
        1.,  1.,  1.,  1., -1.,  1.,  1., -1.,  1., -1., -1., -1., -1.,
        1.,  1.,  1., -1.,  1., -1., -1., -1., -1.,  1., -1., -1., -1.,
       -1.,  1., -1.,  1., -1.,  1., -1.,  1.,  1.,  1.,  1., -1.,  1.,
       -1.,  1.])

In [5]:
print("Training set has %d examples" % data_train.shape[0] )
print("Test set has %d examples" % data_test.shape[0] )

Training set has 320 examples
Test set has 80 examples
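
Because both the data generation and the split above are random, the numbers in this notebook will differ from run to run. A minimal sketch of a reproducible version follows. The seed value `0` is an arbitrary choice for illustration, and `stratify=target` is an optional extra (not used in the original cells) that keeps the class proportions similar in both sets.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

# random_state=0 is an arbitrary seed chosen for this sketch
data, target = make_hastie_10_2(400, random_state=0)

# fixed seed gives the same split every run; stratify keeps class balance
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0, stratify=target
)
print(data_train.shape, data_test.shape)
```

With 400 examples and `test_size=0.2`, this yields 320 training examples and 80 test examples, as before.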


Now we will build a KNN classifier ($k=3$) as we have seen previously. Note that we only use the training data to build the model:

In [6]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, target_train)
print(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')


In [7]:
predicted = model.predict(data_test)
print("Target",target_test)
print("Predictions",predicted)

Target [ 1.  1.  1.  1. -1.  1.  1.  1. -1. -1. -1.  1.  1.  1. -1. -1.  1.  1.
  1.  1.  1.  1.  1. -1.  1. -1. -1. -1.  1.  1.  1.  1. -1. -1.  1. -1.
 -1.  1. -1.  1.  1.  1.  1. -1.  1.  1. -1.  1. -1. -1. -1. -1.  1.  1.
  1. -1.  1. -1. -1. -1. -1.  1. -1. -1. -1. -1.  1. -1.  1. -1.  1. -1.
  1.  1.  1.  1. -1.  1. -1.  1.]
Predictions [-1. -1.  1. -1. -1. -1. -1. -1. -1. -1. -1.  1.  1.  1. -1. -1.  1.  1.
 -1.  1.  1.  1.  1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  1. -1. -1.
 -1.  1. -1. -1.  1. -1. -1. -1.  1. -1. -1. -1. -1. -1. -1. -1.  1. -1.
 -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  1. -1.  1. -1.
 -1.  1. -1. -1. -1.  1. -1. -1.]


Manually comparing the target labels for the test data with our predictions is tedious and can be misleading. Instead, we want to count how often the classifier made each of the following correct/incorrect predictions:
- *True Positives* (TP): examples predicted as ``1`` which are actually ``1``
- *False Positives* (FP): examples predicted as ``1`` which are actually ``-1``
- *True Negatives* (TN): examples predicted as ``-1`` which are actually ``-1``
- *False Negatives* (FN): examples predicted as ``-1`` which are actually ``1``

We can do this by creating a confusion matrix for the results. The result is a NumPy array, with predictions on the columns and actual labels on the rows. Because we pass ``labels=[1,-1]`` in the call below, the positive class comes first and the values correspond to:

    [[TP FN]
     [FP TN]]

A perfect classifier with 100% accuracy would produce a diagonal matrix, with every test example counted in its correct class. In our case, we see that we have many false negatives (i.e. examples predicted as -1 which are actually 1).

In [8]:
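# import the scikit-learn evaluation functions used in this notebook
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score,
                             precision_score, recall_score)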
# build the confusion matrix
cm = confusion_matrix(target_test, predicted,labels=[1,-1])
print(cm)

[[18 27]
 [ 1 34]]
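
To make the four cell definitions concrete, here is a small sketch that counts them by hand with NumPy and compares the result to `confusion_matrix`. The arrays `y_true` and `y_pred` are made-up toy values for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels and predictions, invented for this sketch
y_true = np.array([1, 1, 1, -1, -1, 1, -1, -1])
y_pred = np.array([1, -1, 1, -1, 1, -1, -1, -1])

# count each outcome by combining boolean masks
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == -1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == -1)))
tn = int(np.sum((y_pred == -1) & (y_true == -1)))

# same layout as above: rows are actual labels, columns are predictions
manual = np.array([[tp, fn], [fp, tn]])
cm = confusion_matrix(y_true, y_pred, labels=[1, -1])
print(manual)
print(cm)
```

Both arrays should be identical, which confirms the `[[TP FN] [FP TN]]` layout when `labels=[1, -1]` is used.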


An overall *accuracy* score for the predictions, defined as the fraction of correct predictions, can be calculated as shown below. This will return a value between 0 (every prediction is wrong) and 1 (every prediction is correct):

In [9]:
print("Accuracy = %.2f" % accuracy_score(target_test, predicted) )

Accuracy = 0.65


Measures from information retrieval (search engines) can be used in ML evaluation. Note that these are calculated with respect to a particular class (e.g. the positive class labelled as "1").
- *Precision*: proportion of retrieved results that are relevant = TP/(TP+FP)
- *Recall*: proportion of relevant results that are retrieved = TP/(TP+FN)

In [10]:
# Note that we indicate that we are interested in the Positive class here, which is labelled as "1"
print("Precision (Positive) = %.2f" % precision_score(target_test, predicted, pos_label=1) )
print("Recall (Positive) = %.2f" % recall_score(target_test, predicted, pos_label=1) )

Precision (Positive) = 0.95
Recall (Positive) = 0.40
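
As a cross-check, the same scores can be recomputed directly from the four counts in the confusion matrix printed earlier (TP=18, FN=27, FP=1, TN=34). This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
# counts copied from the confusion matrix output above
tp, fn, fp, tn = 18, 27, 1, 34

accuracy = (tp + tn) / (tp + fn + fp + tn)   # correct / all predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall = tp / (tp + fn)                      # found among actual positives

print("Accuracy = %.2f" % accuracy)
print("Precision (Positive) = %.2f" % precision)
print("Recall (Positive) = %.2f" % recall)
```

The high precision and low recall reflect what the matrix shows: the classifier rarely predicts the positive class, but is almost always right when it does.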


Note that there is often a trade-off between precision and recall. We can combine precision and recall into a single score using the *F1 Measure*, which is the harmonic mean of precision and recall. The F1 Measure reaches its best value at 1 and worst at 0.

    F1 = 2 * (precision * recall) / (precision + recall)

In [11]:
print("F1 (Positive) = %.2f" % f1_score(target_test, predicted, pos_label=1) )

F1 (Positive) = 0.56
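
The sketch below recomputes F1 for the positive class in three equivalent ways, using the counts from the confusion matrix shown earlier: the formula above, the harmonic mean from Python's standard library, and the count-based form 2TP / (2TP + FP + FN). The variable names are illustrative.

```python
from statistics import harmonic_mean

# counts copied from the confusion matrix output above
tp, fn, fp = 18, 27, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_formula = 2 * (precision * recall) / (precision + recall)
f1_harmonic = harmonic_mean([precision, recall])
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_formula, f1_harmonic, f1_counts)
```

All three values agree, and the harmonic mean sits much closer to the lower of the two inputs (the recall) than an arithmetic mean would.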


We can quickly compute a summary of these statistics using scikit-learn's provided convenience function:

In [12]:
print(classification_report(target_test, predicted, target_names=["negative","positive"]))

              precision    recall  f1-score   support

    negative       0.56      0.97      0.71        35
    positive       0.95      0.40      0.56        45

   micro avg       0.65      0.65      0.65        80
   macro avg       0.75      0.69      0.64        80
weighted avg       0.78      0.65      0.63        80



### Cross Validation

A problem with simply randomly splitting a dataset into two sets is that each random split might give different results. Each model is also trained on only part of the dataset, and evaluated on only the remainder. One way to address this is to use *k-fold cross-validation* to evaluate a classifier:
1. Divide the data into k disjoint subsets - “folds” (e.g. k=5).
2. In each of k experiments, hold out one fold for testing and use the remaining k-1 folds for training.
3. Repeat until every fold has been used for testing once, then average the accuracy/error rates.
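
The steps above can be sketched as an explicit loop. This is an illustrative version, not the notebook's own code: it regenerates the dataset with an arbitrary seed, and it uses `StratifiedKFold` with shuffling, which is one reasonable way to form the folds.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# arbitrary seed so the sketch is repeatable
data, target = make_hastie_10_2(400, random_state=0)

k = 5
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in folds.split(data, target):
    # train a fresh model on k-1 folds
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(data[train_idx], target[train_idx])
    # evaluate on the single held-out fold
    predicted = model.predict(data[test_idx])
    fold_scores.append(accuracy_score(target[test_idx], predicted))

print(fold_scores)
print("Mean accuracy = %.2f" % np.mean(fold_scores))
```

Every example is used for testing exactly once, so the averaged score depends less on any single lucky or unlucky split.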

While this is a relatively complex process, scikit-learn allows us to achieve this with a single function call. Let's do a 2-fold cross-validation of the KNN classifier:

In [13]:
# create a single classifier
model = KNeighborsClassifier(n_neighbors=3)

# apply 2-fold cross-validation, measuring accuracy each time
from sklearn.model_selection import cross_val_score
acc_scores = cross_val_score(model, data, target, cv=2, scoring="accuracy")
print(acc_scores)

[0.645 0.625]


Similarly, for 10-fold cross-validation we get an array with 10 accuracy scores, one for each fold:

In [14]:
acc_scores = cross_val_score(model, data, target, cv=10, scoring="accuracy")
print(acc_scores)

[0.625 0.625 0.7   0.525 0.625 0.65  0.7   0.6   0.55  0.625]


Calculate the average accuracy across all folds:

In [15]:
print("KNN: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

KNN: Mean cross-validation accuracy = 0.62


We can use this approach to compare different classifiers on the same data, such as a logistic regression classifier or a Support Vector Machine (SVM) classifier.

In [16]:
from sklearn import linear_model
model = linear_model.LogisticRegression(solver='liblinear')
acc_scores = cross_val_score(model, data, target, cv=10, scoring="accuracy")
print("Logistic Regression: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

Logistic Regression: Mean cross-validation accuracy = 0.47


In [17]:
from sklearn.svm import SVC
model = SVC(gamma='scale')
acc_scores = cross_val_score(model, data, target, cv=10, scoring="accuracy")
print("SVM: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

SVM: Mean cross-validation accuracy = 0.91
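
The three comparisons can also be run in one loop, reporting the spread across folds alongside the mean. This is a sketch under assumptions: the dataset is regenerated with an arbitrary seed, so the exact numbers will not match the outputs above, and the `models` and `results` dictionaries are names chosen for illustration.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# arbitrary seed so the sketch is repeatable
data, target = make_hastie_10_2(400, random_state=0)

# same three classifiers and settings as in the cells above
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "SVM": SVC(gamma="scale"),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, data, target, cv=10, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("%s: mean accuracy = %.2f (std %.2f)" % (name, scores.mean(), scores.std()))
```

A likely reason for the near-chance logistic regression score is that, according to the scikit-learn documentation, `make_hastie_10_2` assigns labels based on whether the sum of squared feature values exceeds a threshold. That boundary is not linear in the raw features, so a linear model struggles while an SVM with its default RBF kernel can fit it.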
