# Model Evaluation

In [50]:
%matplotlib notebook
import warnings; warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Learning objectives
- Understand why **accuracy** only gives a partial picture of a classifier's performance.
- Understand the motivation and definition of important **evaluation metrics** in machine learning.
- Learn how to use a variety of evaluation metrics to **evaluate supervised** machine learning models.
- Learn about choosing the right metric for **selecting between models** or for doing **parameter tuning**.

## Represent / Train / Evaluate / Refine Cycle
![cycle.png](attachment:cycle.png)

## Evaluation Goal
- Different applications have very **different goals**
- **Accuracy** is widely used, but many others are possible, e.g.:
    - User satisfaction (Web search)
    - Amount of revenue (e-commerce)
    - Increase in patient survival rates (medical)
- It's very important to choose evaluation methods that **match the goal** of your application.
- Compute your selected evaluation metric for multiple **different models**.
- Then select the model with the **'best'** value of the evaluation metric.

## Evaluation for Classification

### Classification Accuracy and Error Rate
- Classification accuracy is the fraction of correctly classified examples among all examples evaluated (for an honest estimate, these should be test examples rather than training examples):

    $$\text{Accuracy} = \frac{\text{No. correctly classified examples}}{\text{No. all examples}}$$

    It is often also referred to as the recognition rate.

- Error rate (or misclassification rate) is the complement of accuracy:

    $$\text{Error rate} = 1 - \text{Accuracy}$$

### Accuracy with imbalanced classes
With imbalanced classes, classification accuracy can be misleading!
Example:

In [3]:
from sklearn.datasets import load_digits

dataset = load_digits()
X, y = dataset.data, dataset.target

for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name,class_count)

0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180


Creating a dataset with imbalanced binary classes:  
- Negative class (0) is 'not digit 1'
- Positive class (1) is 'digit 1'

In [4]:
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0

print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])

Original labels:	 [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
New binary labels:	 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]


In [5]:
np.bincount(y_binary_imbalanced)    # Negative class (0) is the most frequent class

array([1615,  182], dtype=int64)

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9088888888888889

### Dummy Classifiers

Dummy classifiers completely ignore the input features! They predict using only the label distribution seen in training (or a fixed rule).

In [7]:
from sklearn.dummy import DummyClassifier

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)

y_dummy_predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [8]:
dummy_majority.score(X_test, y_test)

0.9044444444444445

- Dummy classifiers serve as a sanity check on your classifier's performance.
- They provide a null metric (e.g. null accuracy) baseline.
- Dummy classifiers should not be used for real problems.

Some commonly-used settings for the strategy parameter for DummyClassifier in scikit-learn:
- most_frequent : predicts the most frequent label in the training set.
- stratified : random predictions based on training set class distribution.
- uniform : generates predictions uniformly at random.
- constant : always predicts a constant label provided by the user.
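As a rough illustration of the null accuracy that the most_frequent strategy yields, the sketch below rebuilds a label vector from class counts and computes the share of the majority class. The counts (407 negatives, 43 positives) are the test-set counts reported in this notebook; the variable names are chosen here for illustration only.

```python
import numpy as np

# Label vector rebuilt from the test-set class counts reported in this notebook
# (407 examples of 'not digit 1', 43 examples of 'digit 1').
y_test_example = np.array([0] * 407 + [1] * 43)

# Always predicting the majority class is correct exactly as often as that class occurs.
counts = np.bincount(y_test_example)
null_accuracy = counts.max() / counts.sum()

print('Class counts:', counts)
print('Null accuracy: {:.4f}'.format(null_accuracy))
```

Any real classifier should be compared against this number before its accuracy is taken as evidence of learning.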

***What if my classifier accuracy is close to the null accuracy baseline?***<br>
This could be a sign of:
- Ineffective, erroneous or missing **features**
- Poor choice of kernel or **hyperparameter**
- Large class **imbalance**

In [9]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)

0.9777777777777777

### Confusion matrices
![cm.PNG](attachment:cm.PNG)

![accuracy.PNG](attachment:accuracy.PNG)

#### Binary (two-class) confusion matrix

In [10]:
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """Pretty-print a confusion matrix."""
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print("    " + empty_cell, end=" ")
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print(cell, end=" ")
        print()

In [11]:
from sklearn.metrics import confusion_matrix

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)

confusion = confusion_matrix(y_test, y_majority_predicted)

print('Most frequent class (dummy classifier)\n')
print_cm(confusion, ['Not one', 'One'])

Most frequent class (dummy classifier)

            Not one     One 
    Not one   407.0     0.0 
        One    43.0     0.0 


In [12]:
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)

print('Random class-proportional prediction (dummy classifier)\n')
print_cm(confusion, ['Not one', 'One'])

Random class-proportional prediction (dummy classifier)

            Not one     One 
    Not one   367.0    40.0 
        One    39.0     4.0 


In [13]:
svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (RBF kernel, C=1)\n')
print_cm(confusion, ['Not one', 'One'])

Support vector machine classifier (RBF kernel, C=1)

            Not one     One 
    Not one   407.0     0.0 
        One    41.0     2.0 


In [14]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (linear kernel, C=1)\n')
print_cm(confusion, ['Not one', 'One'])

Support vector machine classifier (linear kernel, C=1)

            Not one     One 
    Not one   402.0     5.0 
        One     5.0    38.0 


In [15]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)

print('Logistic regression classifier (default settings)\n')
print_cm(confusion, ['Not one', 'One'])

Logistic regression classifier (default settings)

            Not one     One 
    Not one   401.0     6.0 
        One     6.0    37.0 


In [16]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)

print('Decision tree classifier (max_depth = 2)\n')
print_cm(confusion, ['Not one', 'One'])

Decision tree classifier (max_depth = 2)

            Not one     One 
    Not one   400.0     7.0 
        One    17.0    26.0 


### Evaluation metrics for binary classification
![egg.PNG](attachment:egg.PNG)

#### Precision
![precision.PNG](attachment:precision.PNG)
- Measures how reliable the positive predictions are.
- What fraction of the examples predicted as positive are actually positive?

#### Recall
![recall.PNG](attachment:recall.PNG)
- What fraction of the actual positive examples did the classifier find?
- Also known as True Positive Rate (TPR) and Sensitivity.

#### F1
![f1.PNG](attachment:f1.PNG)

There is also **specificity**:
![specificity.PNG](attachment:specificity.PNG)
- True negative rate
- The ability to correctly classify the negative class!
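To make the definitions above concrete, here is a minimal sketch that computes each metric directly from the four cells of a binary confusion matrix. The counts are taken from the decision tree confusion matrix printed earlier in this notebook; the variable names are illustrative.

```python
# Cells of the decision tree confusion matrix (max_depth = 2) shown earlier:
# rows are true classes, columns are predicted classes.
tn, fp = 400, 7    # true 'Not one': predicted 'Not one' / predicted 'One'
fn, tp = 17, 26    # true 'One':     predicted 'Not one' / predicted 'One'

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of the predicted positives, how many are correct
recall = tp / (tp + fn)             # of the actual positives, how many were found
specificity = tn / (tn + fp)        # of the actual negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

for name, value in [('Accuracy', accuracy), ('Precision', precision),
                    ('Recall', recall), ('Specificity', specificity), ('F1', f1)]:
    print('{}: {:.2f}'.format(name, value))
```

These hand-computed values should agree with the scikit-learn metric functions applied to the same predictions.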

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))

Accuracy: 0.95
Precision: 0.79
Recall: 0.60
F1: 0.68


In [18]:
# Combined report with all above metrics
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))

              precision    recall  f1-score   support

       not 1       0.96      0.98      0.97       407
           1       0.79      0.60      0.68        43

   micro avg       0.95      0.95      0.95       450
   macro avg       0.87      0.79      0.83       450
weighted avg       0.94      0.95      0.94       450



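- **"micro"** gives each sample-class pair an equal contribution to the overall metric (except as a result of sample weights). Rather than summing the metric per class, it sums the numerators and denominators that make up the per-class metrics and computes one overall quotient.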
- **"macro"** simply calculates the mean of the binary metrics, giving equal weight to each class. <br>
     In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
- **"weighted"** accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.<br>
(See the slides from *Applied Machine Learning in Python* for more details.)
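The three averaging schemes can be sketched by hand for precision, using the decision tree confusion matrix from this notebook. This is only an illustration of the arithmetic, with names chosen here; it is not how scikit-learn is implemented internally.

```python
import numpy as np

# Decision tree confusion matrix: rows are true classes, columns are predicted classes.
cm = np.array([[400, 7],
               [17, 26]])

true_positives = np.diag(cm)             # correct predictions per class
predicted_per_class = cm.sum(axis=0)     # column sums: how often each class was predicted
support = cm.sum(axis=1)                 # row sums: how often each class truly occurs

precision_per_class = true_positives / predicted_per_class

macro_precision = precision_per_class.mean()                            # every class counts equally
weighted_precision = np.average(precision_per_class, weights=support)   # classes weighted by support
micro_precision = true_positives.sum() / predicted_per_class.sum()      # pool the counts, then divide

print('Per-class precision:', np.round(precision_per_class, 2))
print('macro: {:.2f}  weighted: {:.2f}  micro: {:.2f}'.format(
    macro_precision, weighted_precision, micro_precision))
```

For single-label classification the micro average equals the overall accuracy, which is why the three micro columns in the report are identical.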

In [19]:
print('Random class-proportional (dummy)\n', 
      classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print("---------------------------------------------------------")
print('SVM\n', 
      classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print("---------------------------------------------------------")
print('Logistic regression\n', 
      classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print("---------------------------------------------------------")
print('Decision tree\n', 
      classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))
print("---------------------------------------------------------")

Random class-proportional (dummy)
               precision    recall  f1-score   support

       not 1       0.90      0.90      0.90       407
           1       0.09      0.09      0.09        43

   micro avg       0.82      0.82      0.82       450
   macro avg       0.50      0.50      0.50       450
weighted avg       0.83      0.82      0.83       450

---------------------------------------------------------
SVM
               precision    recall  f1-score   support

       not 1       0.99      0.99      0.99       407
           1       0.88      0.88      0.88        43

   micro avg       0.98      0.98      0.98       450
   macro avg       0.94      0.94      0.94       450
weighted avg       0.98      0.98      0.98       450

---------------------------------------------------------
Logistic regression
               precision    recall  f1-score   support

       not 1       0.99      0.99      0.99       407
           1       0.86      0.86      0.86        43

   mi

### Trading Off Precision and Recall
The trade-off can be controlled by, for example, varying the decision **threshold** of a classifier.

- Suppose we want to predict only if very **confident**
    - Avoid false positives.
    - Higher threshold, higher precision, lower recall.

- Suppose we want to **avoid missing** positive examples (e.g. highly contagious disease)
    - Avoid false negatives.
    - Lower threshold, lower precision, higher recall.
    

- **Recall-oriented machine learning tasks**:
    - Search and information extraction in legal discovery
    - Tumor detection
    - Often paired with a human expert to filter out false positives
- **Precision-oriented machine learning tasks**:
    - Search engine ranking, query suggestion
    - Document classification
    - Many customer-facing tasks (users remember failures!)

### Decision functions
- The **classifier score** for each test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).
- Choosing a fixed decision **threshold** gives a classification rule.
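The effect of moving the threshold can be sketched with a handful of made-up scores. The labels and scores below are hypothetical and are not produced by any model in this notebook; `precision_recall_at` is a helper defined here for illustration.

```python
import numpy as np

# Hypothetical true labels and decision scores (illustration only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([-3.2, -2.5, -1.8, -0.9, -0.4, 0.3, -0.2, 1.1, 0.8, 2.4])

def precision_recall_at(threshold):
    """Classify as positive when score >= threshold, then compute precision and recall."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

thresholds_to_try = [-1.0, 0.0, 1.0, 2.0]
results = [precision_recall_at(t) for t in thresholds_to_try]
for t, (p, r) in zip(thresholds_to_try, results):
    print('threshold = {:5.1f}  precision = {:.2f}  recall = {:.2f}'.format(t, p, r))
```

Raising the threshold can never increase recall, because fewer examples are labelled positive; precision usually rises but is not guaranteed to do so at every step.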

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

# show the decision_function scores for first 20 instances
y_score_list

[(0, -23.177190659057413),
 (0, -13.541499275924934),
 (0, -21.722931392118863),
 (0, -18.907331592939087),
 (0, -19.735710477993003),
 (0, -9.749881951500523),
 (1, 5.234934901586263),
 (0, -19.307654905146972),
 (0, -25.101179084160105),
 (0, -21.827293939603223),
 (0, -24.151385315482127),
 (0, -19.57697045677537),
 (0, -22.574536340741492),
 (0, -10.823178014466372),
 (0, -11.91199599279311),
 (0, -10.979093931056003),
 (1, 11.206094761125225),
 (0, -27.64601060941326),
 (0, -12.859207350646066),
 (0, -25.848766509204186)]

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list

[(0, 8.595556329003577e-11),
 (0, 1.3152279108008344e-06),
 (0, 3.6800249698884413e-10),
 (0, 6.146816084115751e-09),
 (0, 2.6846633981073727e-09),
 (0, 5.8298146862571315e-05),
 (1, 0.994701057113338),
 (0, 4.119002109861467e-09),
 (0, 1.2551523237968057e-11),
 (0, 3.315329415875665e-10),
 (0, 3.24479024110104e-11),
 (0, 3.1465146408233984e-09),
 (0, 1.570375227232466e-10),
 (0, 1.9931723568128993e-05),
 (0, 6.709388778888975e-06),
 (0, 1.7054252267886335e-05),
 (1, 0.9999864090761522),
 (0, 9.851222897501183e-13),
 (0, 2.6020526264294253e-06),
 (0, 5.943240995520533e-12)]

### Precision-recall curves
- Uses the **decision function's results**
- By **sweeping the decision threshold** through the entire range of possible score values, we get a series of classification outcomes that form a **curve**

In [23]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
# index of the closest threshold value to zero (the one we used) (argmin returns the indices of the minimum values)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')  # reuse the current axes instead of adding a new one
plt.show()


- X-axis: Precision, Y-axis: Recall
- Top right corner: The “ideal” point
    - Precision = 1.0
    - Recall = 1.0
- **“Steepness”** (having an almost vertical slope) of P-R curves is important:
    - *Maximize* precision
    - while *maximizing* recall

### ROC curves, Area-Under-Curve (AUC)

In [24]:
from sklearn.metrics import roc_curve, auc

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')  # reuse the current axes instead of adding a new one
plt.show()

- X-axis: False Positive Rate, Y-axis: True Positive Rate
- Top left corner: The “ideal” point
    - False positive rate of zero
    - True positive rate of one
- **“Steepness”** of ROC curves is important:
    - *Maximize* the true positive rate
    - while *minimizing* the false positive rate

In [25]:
from matplotlib import cm

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
for g in [0.01, 0.1, 0.20, 1]:
    svm = SVC(gamma=g).fit(X_train, y_train)
    y_score_svm = svm.decision_function(X_test)
    fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)
    roc_auc_svm = auc(fpr_svm, tpr_svm)
    accuracy_svm = svm.score(X_test, y_test)
    print("gamma = {:.2f}  accuracy = {:.2f}   AUC = {:.2f}".format(g, accuracy_svm, 
                                                                    roc_auc_svm))
    plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7, 
             label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')
plt.legend(loc="lower right", fontsize=11)
plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)
plt.gca().set_aspect('equal')  # reuse the current axes instead of adding a new one

plt.show()

gamma = 0.01  accuracy = 0.91   AUC = 1.00
gamma = 0.10  accuracy = 0.90   AUC = 0.98
gamma = 0.20  accuracy = 0.90   AUC = 0.66
gamma = 1.00  accuracy = 0.90   AUC = 0.50

#### Summarizing an ROC curve in one number: Area Under the Curve (AUC)
AUC = 0.5 corresponds to random guessing (the practical worst case); AUC = 1 is the best possible value.
- AUC can be interpreted as:
    1.   The total area under the ROC curve.
    2.  The probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example.
- Advantages:
    - Gives a single number for easy comparison.
    - Does not require specifying a decision threshold.
- Drawbacks:
    - As with other single-number metrics, AUC loses information, e.g. about tradeoffs and the shape of the ROC curve.
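The second interpretation above (the probability of ranking a random positive above a random negative) can be checked numerically. The sketch below uses hypothetical labels and scores, counts the correctly ordered positive/negative pairs, and compares the result with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores, chosen only for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70, 0.90])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; a tie counts as half a win.
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = wins / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true, scores)

print('Pairwise ranking estimate: {:.4f}'.format(pairwise_auc))
print('roc_auc_score:             {:.4f}'.format(sklearn_auc))
```

Here 10 of the 12 positive/negative pairs are ordered correctly, so both computations give 10/12.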

### Evaluation measures for multi-class classification
Multi-class evaluation is an extension of the binary case.
- A collection of true vs predicted binary outcomes, one per class
- Confusion matrices are especially useful
- Classification report
- Overall evaluation metrics are averages across classes

#### Multi-class confusion matrix

In [26]:
dataset = load_digits()
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)


svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
# make confusion matrix
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
# put it in data frame to visualize
df_cm = pd.DataFrame(confusion_mc, 
                     index = [i for i in range(0,10)], columns = [i for i in range(0,10)])

plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                       svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')


svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],
                  columns = [i for i in range(0,10)])

plt.figure(figsize = (5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM RBF Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                    svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label');

#### Multi-class classification report

In [27]:
# svm_predicted_mc holds the predictions of the RBF-kernel SVM fitted in the previous cell
print(classification_report(y_test_mc, svm_predicted_mc))

              precision    recall  f1-score   support

           0       1.00      0.65      0.79        37
           1       1.00      0.23      0.38        43
           2       1.00      0.39      0.56        44
           3       1.00      0.93      0.97        45
           4       0.14      1.00      0.25        38
           5       1.00      0.33      0.50        48
           6       1.00      0.54      0.70        52
           7       1.00      0.35      0.52        48
           8       1.00      0.02      0.04        48
           9       1.00      0.55      0.71        47

   micro avg       0.49      0.49      0.49       450
   macro avg       0.91      0.50      0.54       450
weighted avg       0.93      0.49      0.54       450



### Regression evaluation metrics
- Typically r2_score is enough.
    - Reminder: it measures how well future instances are likely to be predicted.
    - The best possible score is 1.0.
    - A constant prediction of the mean target value scores 0.0; a model that does worse than that scores below 0.
- Alternative metrics include:
    - mean_absolute_error (mean absolute difference between target and predicted values)
    - mean_squared_error (mean squared difference between target and predicted values)
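The three regression metrics listed above can be written out in a few lines of NumPy. The target and prediction values below are made up for illustration, and `r2` is a helper defined here rather than the scikit-learn function.

```python
import numpy as np

# Hypothetical targets and predictions (illustration only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
r2_model = r2(y_true, y_pred)
# Predicting the mean of the evaluated targets for every example gives R^2 = 0.
r2_mean = r2(y_true, np.full_like(y_true, y_true.mean()))

print('MAE: {:.3f}  MSE: {:.3f}'.format(mae, mse))
print('R^2 (model): {:.3f}  R^2 (mean prediction): {:.3f}'.format(r2_model, r2_mean))
```

In practice a dummy regressor predicts the mean of the training targets, so its R^2 on a separate test set is close to 0 but can be slightly negative.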

***Dummy regressors***
strategy parameter options:
- mean : predicts the mean of the training targets.
- median : predicts the median of the training targets.
- quantile : predicts a user-provided quantile of the training target values (e.g. value at the 75th percentile)
- constant : predicts a constant user-provided value.

In [28]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor

diabetes = datasets.load_diabetes()

X = diabetes.data[:, None, 6]
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)

y_predict = lm.predict(X_test)
y_predict_dummy_mean = lm_dummy_mean.predict(X_test)

print('Linear model, coefficients: ', lm.coef_)
print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test, 
                                                                     y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))
print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))

# Plot outputs
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_predict, color='green', linewidth=2)
plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed', 
         linewidth=2, label = 'dummy')

plt.show()

Linear model, coefficients:  [-698.80206267]
Mean squared error (dummy): 4965.13
Mean squared error (linear model): 4646.74
r2_score (dummy): -0.00
r2_score (linear model): 0.06

### Model selection using evaluation metrics

- **Train/test on same data**
    - Single metric.
    - Typically overfits and likely won't generalize well to new data.
    - But can serve as a **sanity check**: low accuracy on the training set may indicate an implementation problem.
- **Single train/test split**
    - Single metric.
    - Speed and simplicity.
    - In most problems we do not have the **luxury** of a large data set.
    - Splitting the data set into training and test sets can leave us with **too few examples for training, for testing, or for both**.
    - The test set clearly contains **useful information** for learning, yet it is **ignored** during training when the error rate is estimated from a single split.
- **K-fold cross-validation**
    - K train-test splits.
    - Average metric over all splits.
    - **Every example is used** for testing at some stage, and the problem of an *unfortunate split* is avoided.
    - Remember to **shuffle**.
    - Can be combined with parameter grid search: GridSearchCV (the default was cv = 3 in older scikit-learn releases; since version 0.22 it is cv = 5).

![cv.PNG](attachment:cv.PNG)
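As a minimal sketch of combining cross-validation with a parameter grid search, the example below tunes the `gamma` parameter of an RBF-kernel SVM on the binary 'digit 1' problem, using AUC as the selection metric. The grid values and the choice of `cv=3` are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
y_binary = (y == 1).astype(int)   # positive class: 'digit 1'

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {'gamma': [0.0001, 0.001, 0.01]}

# Each candidate is scored by its mean AUC over 3 cross-validation folds.
grid = GridSearchCV(SVC(kernel='rbf'), param_grid=param_grid, scoring='roc_auc', cv=3)
grid.fit(X, y_binary)

print('Best parameters:', grid.best_params_)
print('Best mean cross-validated AUC: {:.3f}'.format(grid.best_score_))
```

Changing the `scoring` argument (for example to 'recall' or 'precision') can select a different parameter value, which is why the metric should match the goal of the application.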

#### Cross-validation example

In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

dataset = load_digits()
# again, making this a binary problem with 'digit 1' as positive class 
# and 'not 1' as negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)

# accuracy is the default scoring metric
scores = cross_val_score(clf, X, y, cv=5)
print('Cross-validation (accuracy)', scores)

# use AUC as scoring metric
scores = cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc')
print('Cross-validation (AUC)', scores)

# use recall as scoring metric
scores = cross_val_score(clf, X, y, cv=5, scoring = 'recall')
print('Cross-validation (recall)', scores)

Cross-validation (accuracy) [0.91944444 0.98611111 0.97214485 0.97493036 0.96935933]
Cross-validation (AUC) [0.9641871  0.9976571  0.99372205 0.99699002 0.98675611]
Cross-validation (recall) [0.81081081 0.89189189 0.83333333 0.83333333 0.83333333]


In [30]:
print("Average cross-validation score (recall) : {:.2f}".format(scores.mean()))

Average cross-validation score (recall) : 0.84


#### Other variations are possible:
- ***stratified cross validation***
    - The folds are made by preserving the percentage of samples for each class.
    - *from sklearn.model_selection import StratifiedKFold*
    ![stratified-cv.PNG](attachment:stratified-cv.PNG)
    
- ***Leave-one-out cross-validation:***
    - Extreme case of k-fold cross validation.
    - With 𝑁 examples, perform 𝑁 experiments, each testing on a single held-out example.
    - The resulting error rate estimate is an almost unbiased estimate of the classifier's true error rate.
    - Can be prohibitively slow on large datasets.
    - *from sklearn.model_selection import LeaveOneOut*
    

- other variations: ***ShuffleSplit*** and ***GroupKFold***
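To illustrate what stratification does, the sketch below splits a small, deliberately imbalanced label vector with `StratifiedKFold` and counts the positives that land in each test fold. The data are synthetic placeholders, and shuffling is enabled explicitly because it is off by default.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 90 negatives, 10 positives (illustration only).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # placeholder features; only the labels matter for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
positives_per_fold = []
for train_idx, test_idx in skf.split(X, y):
    fold_sizes.append(len(test_idx))
    positives_per_fold.append(int(y[test_idx].sum()))

print('Test fold sizes:        ', fold_sizes)
print('Positives per test fold:', positives_per_fold)
```

The same splitter object can be passed as the `cv` argument of `cross_val_score` in place of an integer.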