
# Cross-validation techniques

For our experiments we will use the *Pima-Indians-diabetes* dataset. We will load the dataset and split it into data (features) and labels.

In [None]:
from IPython.display import Image
from IPython.display import display

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# inline plotting instead of popping out
%matplotlib inline


# make local helper modules importable (e.g., plot_decision_regions())
import os, sys
module_path = os.path.abspath(os.path.join('.'))
sys.path.append(module_path)


# Load CSV using Pandas from URL

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
print(data.shape)

# separate array into input and output components
# X: Features Y: Labels

y = data['class']
X = data.drop('class', axis=1)

X.shape, y.shape

(768, 9)


((768, 8), (768,))

## Hold out cross validation

![alt text](https://www.kdnuggets.com/wp-content/uploads/dataiku-holdout-strategy.jpg)

Step 1: We split the data and the labels into a training set (70%) and a test set (30%).

In [None]:
from sklearn.model_selection import train_test_split
# hold out testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((537, 8), (231, 8))

Step 1 (continued): Next, we split the training set again into 70% and 30%; the 30% part will be used for validation.

In [None]:
# hold out validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

X_train.shape, X_val.shape

((375, 8), (162, 8))

Step 2: For different hyperparameter values (n_neighbors = 1, 15, 50) we train a k-nearest neighbors classifier on the training set, evaluate it on the validation set, and print the results.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

best_k, best_score = -1, -1
clfs = {}

# hyperparameter tuning
for k in [1, 15, 50]: 
    pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=k)]])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_val)
    score = accuracy_score(y_val, y_pred)
    print('[{}-NN]\nValidation accuracy: {}'.format(k, score))
    if score > best_score:
        best_k, best_score = k, score
    clfs[k] = pipe
    
# performance reporting
y_pred= clfs[best_k].predict(X_test)
print('\nTest accuracy: %.2f (n_neighbors=%d selected by the holdout method)' % 
      (accuracy_score(y_test, y_pred), best_k))

[1-NN]
Validation accuracy: 0.7407407407407407
[15-NN]
Validation accuracy: 0.7654320987654321
[50-NN]
Validation accuracy: 0.7469135802469136

Test accuracy: 0.75 (n_neighbors=15 selected by the holdout method)


Step 3: We select the best result and apply it to the test data (test set).

In [None]:
# performance reporting
y_pred= clfs[best_k].predict(X_test)

print('Test accuracy: %.2f (n_neighbors=%d selected by the holdout method)' % 
      (accuracy_score(y_test, y_pred), best_k))

Test accuracy: 0.75 (n_neighbors=15 selected by the holdout method)


A significant drawback of the holdout method is that its performance estimate is sensitive to the random splits, which can lead to suboptimal hyperparameters being selected. If we set n_neighbors = 15 manually, we may find that it leads to better results. In this run the holdout method had already selected n_neighbors = 15, so the manual choice gives the same test accuracy.

In [None]:
k=15
y_pred= clfs[k].predict(X_test)
print('Test accuracy: %.2f (n_neighbors=15 selected manually)' % 
      (accuracy_score(y_test, y_pred)))

Test accuracy: 0.75 (n_neighbors=15 selected manually)
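
The sensitivity of the holdout method to the random split can be illustrated with a small sketch. It uses a synthetic dataset from `make_classification` as a stand-in for the diabetes data (an assumption, so that the example runs offline), and the helper `holdout_select` is a hypothetical name introduced only for this illustration. It repeats the train/validation split with several seeds and reports which `n_neighbors` value each split would select.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: same size, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     n_informative=4, random_state=0)

def holdout_select(X, y, seed, candidates=(1, 15, 50)):
    """Return (best_k, validation accuracies) for one random 70/30 split."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=seed)
    scores = {}
    for k in candidates:
        pipe = Pipeline([('sc', StandardScaler()),
                         ('clf', KNeighborsClassifier(n_neighbors=k))])
        pipe.fit(X_tr, y_tr)
        scores[k] = pipe.score(X_val, y_val)  # accuracy on the validation part
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Repeat the holdout selection with different random splits.
selections = {}
for seed in range(5):
    best_k, scores = holdout_select(X_demo, y_demo, seed)
    selections[seed] = best_k
    print(seed, best_k, {k: round(v, 3) for k, v in scores.items()})
```

If the selected value changes from seed to seed, a single holdout split is not a stable basis for choosing the hyperparameter, which is the motivation for K-fold cross validation.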


## K-fold Cross Validation

![alt text](https://www.kdnuggets.com/wp-content/uploads/dataiku-kfold-strategy.jpg)


Step 1: We split the data and the labels into a training set (70%) and a test set (30%).

In [None]:
# hold out testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((537, 8), (231, 8))

Step 2: For different hyperparameter values (n_neighbors = 1, 15, 50) we train a k-nearest neighbors classifier, using the training set together with k-fold cross validation.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

best_k, best_score = -1, -1
clfs = {}

for k in [1, 15, 50]: # experiment different hyperparameter
    pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=k)]])
    pipe.fit(X_train, y_train)
    # K-Fold CV
    scores = cross_val_score(pipe, X_train, y_train, cv=10)
    print('[%d-NN]\nValidation accuracy: %.3f %s' % (k, scores.mean(), scores))
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()
    clfs[k] = pipe

[1-NN]
Validation accuracy: 0.678 [0.62962963 0.72222222 0.68518519 0.7962963  0.64814815 0.66666667
 0.51851852 0.77358491 0.66037736 0.67924528]
[15-NN]
Validation accuracy: 0.734 [0.72222222 0.72222222 0.75925926 0.72222222 0.81481481 0.74074074
 0.64814815 0.64150943 0.75471698 0.81132075]
[50-NN]
Validation accuracy: 0.743 [0.7037037  0.75925926 0.77777778 0.72222222 0.83333333 0.77777778
 0.64814815 0.66037736 0.75471698 0.79245283]
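
To make explicit what `cross_val_score` does with `cv=10`, the following sketch runs the folds by hand. It assumes synthetic data from `make_classification` (not the diabetes data) and a fixed `n_neighbors=15`. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so that the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', KNeighborsClassifier(n_neighbors=15))])

# Manual 10-fold loop: fit a fresh copy on 9 folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(pipe)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the same per-fold accuracies.
library_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(manual_scores.mean(), library_scores.mean())
```

Because each fold is fitted on a clone of the pipeline, the scaler statistics are computed only from that fold's training part, and the pipeline does not need to be fitted before it is passed to `cross_val_score`.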


Step 3: 10-fold cross validation selected the best classifier (n_neighbors = 50), as expected.
Once we have chosen the hyperparameter values, we retrain the model on the full training set and obtain a final performance estimate on the test set.

In [None]:
best_clf = clfs[best_k]
best_clf.fit(X_train, y_train)

# performance reporting
y_pred = best_clf.predict(X_test)
print('Test accuracy: %.2f (n_neighbors=%d selected by 10-fold CV)' % 
      (accuracy_score(y_test, y_pred), best_k))

Test accuracy: 0.77 (n_neighbors=50 selected by 10-fold CV)


Typically we set K = 10 in most applications and K = 5 for larger datasets. When we have very small datasets we set K = N, which corresponds to **leave-one-out cross validation**.
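
A minimal sketch of the K = N case, assuming a small synthetic dataset of 40 samples and an arbitrary `n_neighbors=5` (both are illustrative choices, not values from this notebook). `LeaveOneOut` produces one fold per sample, so each fold score is either 0 or 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (assumption) to keep the N model fits cheap.
X_small, y_small = make_classification(n_samples=40, n_features=8, random_state=0)
pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', KNeighborsClassifier(n_neighbors=5))])

# One fold per sample: train on N-1 samples, test on the single one left out.
loo_scores = cross_val_score(pipe, X_small, y_small, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```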

## Classification metrics

### Accuracy

$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$


In [None]:
from sklearn.metrics import accuracy_score
y_true = [0, 1, 2, 0, 1, 1]
y_pred = [0, 2, 1, 0, 0, 1]
accuracy_score(y_true, y_pred), accuracy_score(y_true, y_pred, normalize=False)

(0.5, 3)
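
The accuracy formula can be checked by hand with NumPy. This sketch reuses the same `y_true` and `y_pred` as the cell above; the indicator function becomes an element-wise comparison.

```python
import numpy as np

y_true = np.array([0, 1, 2, 0, 1, 1])
y_pred = np.array([0, 2, 1, 0, 0, 1])

# 1(y_hat_i = y_i) for every sample
matches = (y_true == y_pred)

n_correct = int(matches.sum())   # what normalize=False returns
accuracy = matches.mean()        # fraction of correct predictions
print(accuracy, n_correct)       # 0.5 3
```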

For binary problems, we can extract the following counts: **true negatives, false positives, false negatives** and **true positives**.

In [None]:
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tn, fp, fn, tp

(2, 1, 2, 3)
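
From these four counts the usual binary metrics follow directly. The sketch below recomputes the counts for the same labels and derives accuracy, precision and recall with their standard definitions.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (3 + 2) / 8 = 0.625
precision = tp / (tp + fp)                   # 3 / 4 = 0.75
recall = tp / (tp + fn)                      # 3 / 5 = 0.6
print(accuracy, precision, recall)
```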

### Jaccard Index

In [None]:
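# NOTE: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23.
# Newer versions provide jaccard_score instead, which uses different averaging semantics.
from sklearn.metrics import jaccard_similarity_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

jaccard_similarity_score(y_true, y_pred), jaccard_similarity_score(y_true, y_pred, normalize=False)

# Returns the average Jaccard similarity coefficient over the samples.
# For single-label multiclass input each sample scores 1 if the labels match, else 0,
# so here 2 of the 4 samples match: 2/4 = 0.5 (the same value as the accuracy).

# If normalize=False, returns the sum of the per-sample coefficients (here 2).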



(0.5, 2)
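
On scikit-learn 0.23 or newer the cell above fails at import time. A possible replacement, sketched under the assumption that the goal is to reproduce the same numbers: for single-label multiclass input the old function behaved like `accuracy_score`, while the newer `jaccard_score` computes the Jaccard index per class and needs an `average` argument.

```python
from sklearn.metrics import accuracy_score, jaccard_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

# Same values the legacy function returned for this multiclass example.
legacy_like = accuracy_score(y_true, y_pred)                         # 0.5
legacy_like_count = accuracy_score(y_true, y_pred, normalize=False)  # 2

# Per-class Jaccard index TP / (TP + FP + FN), for classes 0, 1, 2, 3.
per_class = jaccard_score(y_true, y_pred, average=None)
# Unweighted mean of the per-class values.
macro = jaccard_score(y_true, y_pred, average='macro')
print(legacy_like, legacy_like_count, per_class, macro)
```

Classes 0 and 3 are predicted perfectly (index 1.0) and classes 1 and 2 are swapped (index 0.0), so the macro average happens to be 0.5 here as well. In general the two quantities differ.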

Computing the **Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 2, 0, 1, 1]
y_pred = [0, 2, 1, 0, 0, 1]

confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [1, 1, 1],
       [0, 1, 0]])

### Classification report

Displays a table with the **precision**, **recall** and **f1-score** for each class in the dataset.
We can also look at each metric separately.

In [None]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
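
Each metric in the report can also be obtained on its own. This sketch uses the same labels as the report above and passes `average=None` to get one value per class.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

# average=None returns one value per class (class 0, class 1, class 2).
precision = precision_score(y_true, y_pred, average=None)
recall = recall_score(y_true, y_pred, average=None)
f1 = f1_score(y_true, y_pred, average=None)

# Unweighted mean over classes, the "macro avg" row of the report.
macro_f1 = f1_score(y_true, y_pred, average='macro')
print(precision, recall, f1, macro_f1)
```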



### AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve


In [None]:
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

roc_auc_score(y_true, y_scores)

0.75
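
One way to read this value: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below checks this interpretation on the same four samples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_scores[y_true == 1]   # scores of the positive samples
neg = y_scores[y_true == 0]   # scores of the negative samples

# Score difference for every (positive, negative) pair.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_scores))  # 0.75 0.75
```

Three of the four pairs are ranked correctly; only the pair (0.35, 0.4) is inverted, which gives 3/4.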

## Regression metrics

$\text{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|$

In [None]:
# return loss : float or ndarray of floats => A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.


from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_absolute_error(y_true, y_pred)



0.5

In [None]:
y_true = [[0, 2], [-1, 1], [7, -6]]
y_pred = [[0, 1], [-1, 1], [8, -5]]


mean_absolute_error(y_true, y_pred)


# multioutput: defines how the errors of multiple outputs are aggregated.
# An array-like value defines the weights used to average the errors.

# 'raw_values': returns the full set of errors, one per output.
# 'uniform_average': the errors of all outputs are averaged with uniform weight.


mean_absolute_error(y_true, y_pred, multioutput='raw_values')

#mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7])

array([0.33333333, 0.66666667])
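
The three `multioutput` modes can be reproduced by hand, which may help to see how they relate. This sketch uses the same two-output example and plain NumPy; the weights `[0.3, 0.7]` are the ones from the commented-out call above.

```python
import numpy as np

y_true = np.array([[0, 2], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 1], [-1, 1], [8, -5]], dtype=float)

# 'raw_values': one MAE per output column.
per_output = np.abs(y_true - y_pred).mean(axis=0)       # [1/3, 2/3]

# 'uniform_average' (the default): plain mean of the per-output errors.
uniform = per_output.mean()                             # 0.5

# Array-like weights: weighted mean of the per-output errors.
weighted = np.average(per_output, weights=[0.3, 0.7])   # 0.3/3 + 0.7*2/3
print(per_output, uniform, weighted)
```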

$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2$

In [None]:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)



0.375
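
As a check of the formula, the same value follows from the per-sample errors. The sketch also prints the absolute errors next to the squared ones, for comparison with the MAE of the same example.

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

errors = y_true - y_pred          # [0.5, -0.5, 0.0, -1.0]
abs_errors = np.abs(errors)       # [0.5, 0.5, 0.0, 1.0]
sq_errors = errors ** 2           # [0.25, 0.25, 0.0, 1.0]

mae = abs_errors.mean()           # 2.0 / 4 = 0.5
mse = sq_errors.mean()            # 1.5 / 4 = 0.375
print(mae, mse)
```

The single error of magnitude 1 contributes half of the MAE total but two thirds of the MSE total, since squaring gives larger errors more weight.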

In [None]:
y_true = [[0.5, 1],[-1, 1],[7, -6]]
y_pred = [[0, 2],[-1, 2],[8, -5]]
mean_squared_error(y_true, y_pred)  

mean_squared_error(y_true, y_pred, multioutput='raw_values')


#mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])

array([0.41666667, 1.        ])

$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_{\text{samples}} - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \bar{y})^2}$

In [None]:
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred)  

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2_score(y_true, y_pred, multioutput='variance_weighted')


y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
r2_score(y_true, y_pred, multioutput='uniform_average')


r2_score(y_true, y_pred, multioutput='raw_values')


r2_score(y_true, y_pred, multioutput=[0.3, 0.7])

0.9253456221198156
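
The $R^2$ formula can be evaluated step by step for the single-output example. This sketch computes the residual and total sums of squares with NumPy and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Numerator: residual sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)          # 1.5
# Denominator: total sum of squares around the mean of y_true (2.875).
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 29.1875

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```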