#### Model Evaluation & Selection

- Accuracy is widely used, but many other evaluation measures are possible:
    - User satisfaction (web search)
    - Amount of revenue (e-commerce)
    - Increase in patient survival rates (medical)
- Accuracy with imbalanced classes:
    - Suppose you have two classes:
        - Relevant (R): the positive class
        - Not_relevant (N): the negative class
    - Out of 1000 randomly selected items, on average:
        - One item is relevant and has the R label
        - The rest of the items (999 of them) are not relevant and are labelled N.
    - Recall that:

    $$Accuracy=\frac{\#Correct\ predictions}{\#total\ instances}$$

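    - Consider a dummy classifier that ignores the features and always predicts the majority class N. Assuming a test set of 1000 instances with this class distribution, what would this dummy classifier's accuracy be?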
        Answer:  
        
        $$Accuracy_{Dummy}=\frac{999}{1000}=99.9\%$$
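The arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch: the label lists and the helper name `accuracy` are illustrative and not part of the notebook.

```python
# Hypothetical test set: 1 relevant item (R) and 999 non-relevant items (N).
y_true = ["R"] + ["N"] * 999

# A dummy classifier that ignores its input and always predicts N.
y_pred = ["N"] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

dummy_accuracy = accuracy(y_true, y_pred)

# Number of relevant items the dummy classifier actually finds.
relevant_found = sum(t == "R" and p == "R" for t, p in zip(y_true, y_pred))

print(f"Accuracy: {dummy_accuracy:.1%}")
print(f"Relevant items found: {relevant_found}")
```

The accuracy is very high even though the classifier never finds the single relevant item, which is why accuracy alone is misleading with imbalanced classes.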

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

dataset = load_digits()
X, y = dataset.data, dataset.target

for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name,class_count)



0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180


In [2]:
# Creating a dataset with imbalanced binary classes:  
# Negative class (0) is 'not digit 1' 
# Positive class (1) is 'digit 1'
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0

print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])

Original labels:	 [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
New binary labels:	 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]


In [3]:
np.bincount(y_binary_imbalanced)    # Negative class (0) is the most frequent class

array([1615,  182])

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9955555555555555

### Dummy Classifiers

DummyClassifier is a classifier that makes predictions using simple rules that ignore the input features. It is useful as a baseline for comparison against actual classifiers, especially with imbalanced classes.

- Dummy classifiers serve as a sanity check on your classifier's performance
- They provide a null metric (e.g. null accuracy) baseline
- Dummy classifiers should not be used for real problems
- Some commonly used settings for the strategy parameter of DummyClassifier in scikit-learn:
    - most_frequent: predicts the most frequent label in the training set
    - stratified: generates random predictions based on the training set class distribution
    - uniform: generates predictions uniformly at random
    - constant: always predicts a constant label provided by the user
        A major motivation of this method is F1-scoring when the positive class is in the minority
- The regression counterpart, DummyRegressor, has its own strategy settings:
    - mean: predicts the mean of the training targets
    - median: predicts the median of the training targets
    - quantile: predicts a user-provided quantile of the training targets
    - constant: predicts a constant value provided by the user
- What if my classifier accuracy is close to the null accuracy baseline? This could be a sign of:
    - Ineffective, erroneous or missing features
    - Poor choice of kernel or hyperparameter
    - Large class imbalance
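A null accuracy baseline can also be computed by hand, which makes explicit what the `most_frequent` strategy amounts to. The sketch below is plain Python. The helper name `null_accuracy` and the toy training labels are assumptions for illustration. The test labels reuse the class counts of this notebook's test split (407 negative, 43 positive), and the sketch assumes the majority class is the same in the training set.

```python
from collections import Counter

def null_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)

# Toy training labels (hypothetical); only the majority class matters here.
train_labels = [0] * 9 + [1]

# Test labels with the class counts of this notebook's test split.
test_labels = [0] * 407 + [1] * 43

majority_label, baseline = null_accuracy(train_labels, test_labels)
print(f"Majority class: {majority_label}, null accuracy: {baseline:.4f}")
```

A classifier whose accuracy is close to this value has learned little beyond the class frequencies.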

In [5]:
from sklearn.dummy import DummyClassifier

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)

y_dummy_predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [6]:
dummy_majority.score(X_test, y_test)

0.9044444444444445

In [7]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9777777777777777

#### Binary Prediction Outcomes 

- TP = true positive
- FP = false positive (Type I error)
- TN = true negative
- FN = false negative (Type II error)
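The four outcomes can be counted directly from paired true and predicted labels. This is a minimal plain-Python sketch; the function name `binary_outcomes` and the eight example labels are hypothetical.

```python
def binary_outcomes(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative (Type I error)
        else:
            if truth == positive:
                fn += 1   # predicted negative, actually positive (Type II error)
            else:
                tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels and predictions for eight instances.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

outcomes = binary_outcomes(y_true, y_pred)
print(outcomes)
```

For labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class.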


#### Binary (two-class) confusion matrix:

In [8]:
from sklearn.metrics import confusion_matrix

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)

print('Most frequent class (dummy classifier)\n', confusion)

Most frequent class (dummy classifier)
 [[407   0]
 [ 43   0]]


In [9]:
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)

print('Random class-proportional prediction (dummy classifier)\n', confusion)

Random class-proportional prediction (dummy classifier)
 [[361  46]
 [ 40   3]]


In [10]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (linear kernel, C=1)\n', confusion)

Support vector machine classifier (linear kernel, C=1)
 [[402   5]
 [  5  38]]


In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)

print('Logistic regression classifier (default settings)\n', confusion)

Logistic regression classifier (default settings)
 [[401   6]
 [  8  35]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)

print('Decision tree classifier (max_depth = 2)\n', confusion)

Decision tree classifier (max_depth = 2)
 [[400   7]
 [ 17  26]]


#### Confusion Matrices & basic Evaluation Metrics

- Accuracy: for what fraction of all instances is the classifier's prediction correct?

$$ Accuracy= \frac{TN+TP}{TN+TP+FN+FP}$$

- Classification error (1 - accuracy): for what fraction of all instances is the classifier's prediction incorrect?

$$ ClassificationError = \frac{FP+FN}{TN+TP+FN+FP}$$

- Recall, or True Positive Rate (TPR), also known as sensitivity or probability of detection: what fraction of all positive instances does the classifier correctly identify as positive?

$$Recall = \frac{TP}{TP+FN}$$

- Precision: what fraction of positive predictions are correct?

$$Precision=\frac{TP}{TP+FP}$$
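As a worked example, the formulas above can be applied to the decision tree confusion matrix printed earlier, `[[400, 7], [17, 26]]`, read as `[[TN, FP], [FN, TP]]`. This is a plain-Python sketch; the variable names are illustrative.

```python
# Counts taken from the decision tree confusion matrix (max_depth = 2):
# rows are true classes, columns are predicted classes.
tn, fp = 400, 7
fn, tp = 17, 26

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
classification_error = (fp + fn) / total   # equals 1 - accuracy
recall = tp / (tp + fn)                    # true positive rate
precision = tp / (tp + fp)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Error:     {classification_error:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

Accuracy is high because the 400 true negatives dominate the total, while recall shows that 17 of the 43 positive instances are missed.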

- How to decide what metric to apply:
     - Is it more important to avoid false positives, or false negatives?
     - Precision is used as the metric when our objective is to minimize false positives
     - Recall is used when the objective is to minimize false negatives

- False positive rate (FPR), equal to 1 - specificity: what fraction of all negative instances does the classifier incorrectly identify as positive?

$$FPR=\frac{FP}{TN+FP}$$
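The false positive rate and specificity (the true negative rate) share the same denominator and sum to 1. A small sketch using the decision tree counts from this notebook (TN = 400, FP = 7); the variable names are illustrative.

```python
tn, fp = 400, 7   # true negatives and false positives for the decision tree

false_positive_rate = fp / (tn + fp)
specificity = tn / (tn + fp)   # true negative rate

print(f"FPR:         {false_positive_rate:.4f}")
print(f"Specificity: {specificity:.4f}")
```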

- There is often a tradeoff between precision and recall

     - Recall-oriented machine learning tasks:
          - Search and information extraction in legal discovery
          - Tumor detection
          - Often paired with a human expert to filter out false positives
     - Precision-oriented machine learning tasks:
          - Search engine ranking, query suggestion
          - Document classification
          - Many customer-facing tasks (users remember failures)

- F1-score: combining precision & recall into a single number:

$$F_1=2 \times \frac{Precision \times Recall}{Precision+Recall}=\frac{2TP}{2TP+FN+FP}$$

$$F_{\beta}=(1+\beta^2)\times\frac{Precision \times Recall}{\beta^2 \times Precision + Recall}
 = \frac{(1+\beta^2)TP}{(1+\beta^2)TP+\beta^2 FN + FP}$$

 - $\beta$ allows adjustment of the metric to control the emphasis on recall vs precision
     - Precision-oriented users: $\beta=0.5$ (false positives hurt performance more than false negatives)
     - Recall-oriented users: $\beta=2$ (false negatives hurt performance more than false positives)
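The effect of $\beta$ can be seen by evaluating the F-beta score both from confusion matrix counts and from precision and recall. This is a plain-Python sketch; the helper names `f_beta` and `f_beta_from_pr` are illustrative, and the counts come from the decision tree confusion matrix in this notebook (TP = 26, FP = 7, FN = 17). In the count-based form, FN is weighted by $\beta^2$.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score computed directly from confusion matrix counts."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def f_beta_from_pr(precision, recall, beta=1.0):
    """Same score, computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Counts from the decision tree confusion matrix: TP=26, FP=7, FN=17.
tp, fp, fn = 26, 7, 17
precision = tp / (tp + fp)
recall = tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    from_counts = f_beta(tp, fp, fn, beta)
    from_pr = f_beta_from_pr(precision, recall, beta)
    print(f"beta={beta}: {from_counts:.3f} (counts), {from_pr:.3f} (precision/recall)")
```

Because precision exceeds recall for this classifier, the score is highest at $\beta=0.5$ and lowest at $\beta=2$. scikit-learn offers `fbeta_score` in `sklearn.metrics` for the same purpose.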

In [13]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall) 
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))

Accuracy: 0.95
Precision: 0.79
Recall: 0.60
F1: 0.68


In [14]:
# Combined report with all above metrics
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))

              precision    recall  f1-score   support

       not 1       0.96      0.98      0.97       407
           1       0.79      0.60      0.68        43

    accuracy                           0.95       450
   macro avg       0.87      0.79      0.83       450
weighted avg       0.94      0.95      0.94       450



In [15]:
print('Random class-proportional (dummy)\n', 
      classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print('SVM\n', 
      classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print('Logistic regression\n', 
      classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n', 
      classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))

Random class-proportional (dummy)
               precision    recall  f1-score   support

       not 1       0.90      0.89      0.89       407
           1       0.06      0.07      0.07        43

    accuracy                           0.81       450
   macro avg       0.48      0.48      0.48       450
weighted avg       0.82      0.81      0.81       450

SVM
               precision    recall  f1-score   support

       not 1       0.99      0.99      0.99       407
           1       0.88      0.88      0.88        43

    accuracy                           0.98       450
   macro avg       0.94      0.94      0.94       450
weighted avg       0.98      0.98      0.98       450

Logistic regression
               precision    recall  f1-score   support

       not 1       0.98      0.99      0.98       407
           1       0.85      0.81      0.83        43

    accuracy                           0.97       450
   macro avg       0.92      0.90      0.91       450
weighted avg 

#### Classifier Decision Functions

- Each classifier score value per test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).

- Choosing a fixed decision threshold gives a classification rule.
- By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

# show the decision_function scores for first 20 instances
y_score_list

[(0, -29.828775799085854),
 (0, -19.382829508812186),
 (0, -29.19857279390273),
 (0, -21.746339374763863),
 (0, -22.642366278691963),
 (0, -11.805890930252485),
 (1, 6.496003146234195),
 (0, -23.35464428931671),
 (0, -27.544002175175844),
 (0, -26.888208464659),
 (0, -31.863103822353274),
 (0, -22.486056923282575),
 (0, -25.31804291703262),
 (0, -13.384496389620132),
 (0, -13.565660577691062),
 (0, -13.308326294818325),
 (1, 12.181016309684122),
 (0, -34.36239978496709),
 (0, -13.231563758676153),
 (0, -29.594006620707106)]
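To illustrate how a decision threshold turns scores into a classification rule, the sketch below sweeps a few thresholds over the 20 (label, score) pairs listed above, with scores rounded to two decimals. The helper name `precision_recall_at` and the chosen thresholds are assumptions for illustration.

```python
# (true label, decision_function score) pairs from the output above,
# rounded to two decimals.
scored = [
    (0, -29.83), (0, -19.38), (0, -29.20), (0, -21.75), (0, -22.64),
    (0, -11.81), (1, 6.50), (0, -23.35), (0, -27.54), (0, -26.89),
    (0, -31.86), (0, -22.49), (0, -25.32), (0, -13.38), (0, -13.57),
    (0, -13.31), (1, 12.18), (0, -34.36), (0, -13.23), (0, -29.59),
]

def precision_recall_at(threshold, scored):
    """Predict positive when score >= threshold, then compute precision and recall."""
    tp = sum(1 for label, score in scored if score >= threshold and label == 1)
    fp = sum(1 for label, score in scored if score >= threshold and label == 0)
    fn = sum(1 for label, score in scored if score < threshold and label == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

for threshold in (-20, -12, 0, 10):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:>4}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold keeps recall at 1.0 but admits false positives, while raising it to 10 keeps precision at 1.0 but misses one of the two positive instances.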

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list

[(0, 1.1105222607334101e-13),
 (0, 3.8207102104876396e-09),
 (0, 2.0855516800910965e-13),
 (0, 3.594883393191682e-10),
 (0, 1.467389029470996e-10),
 (0, 7.460423389729177e-06),
 (1, 0.9984928147919592),
 (0, 7.197917395426831e-11),
 (0, 1.0909173423170056e-12),
 (0, 2.1018389484343245e-12),
 (0, 1.4522113730930986e-14),
 (0, 1.7156534123072526e-10),
 (0, 1.0104473018087254e-11),
 (0, 1.5388149843410323e-06),
 (0, 1.2838311537693563e-06),
 (0, 1.6606060108726858e-06),
 (1, 0.9999948731618639),
 (0, 1.1928872272560515e-15),
 (0, 1.7930982798298833e-06),
 (0, 1.4043851456038233e-13)]
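For a binary logistic regression model, the positive-class probability is the logistic (sigmoid) function of the decision score, which is why large negative scores map to probabilities near 0. The sketch below checks this against two (score, probability) pairs taken from the outputs above; the helper name `sigmoid` is illustrative.

```python
import math

def sigmoid(score):
    """Logistic function, written to avoid overflow for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

# (decision_function score, predict_proba value) pairs from the outputs above.
pairs = [
    (6.496003146234195, 0.9984928147919592),
    (-11.805890930252485, 7.460423389729177e-06),
]

for score, reported in pairs:
    print(f"score={score:8.3f}  sigmoid={sigmoid(score):.6e}  reported={reported:.6e}")
```

A threshold of 0 on the decision score therefore corresponds to a threshold of 0.5 on the predicted probability.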