# Methodologies to validate a classifier 

Fitting the parameters of a model and then evaluating it on the same data that were used for training is a methodological error.

A model that simply repeats the labels of the samples it saw during training would score perfectly, but what about data it has not seen? Would its output still be useful? This situation is known as overfitting.

To avoid this problem and get a better idea of how the system (classifier or regressor) behaves on unseen data, we partition the data into two subsets: X_train and X_test. The system is trained on X_train and evaluated on X_test.
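A minimal sketch of such a hold-out split is shown here. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the data file used later in this notebook, and uses `train_test_split` with an arbitrary 30% test fraction and seed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data replaces the notebook's local data file
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; the seed only makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# The training accuracy tends to be optimistic; the test accuracy uses unseen samples
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Comparing the two printed values gives a first impression of how much the training score overstates performance on new data.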


There is still a risk of overfitting on the test set, because the hyperparameters can be tweaked until the estimator performs optimally on it. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
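The three-way protocol can be sketched as follows. This is only an illustration under assumptions: the bundled Iris data, a roughly 60/20/20 split obtained with two calls to `train_test_split`, and a small hypothetical grid of `n_neighbors` values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First set the test set aside, then split the remainder into train / validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Choose the hyperparameter using the validation set only
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    val_acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# The test set is used once, after the choice has been made
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_clf.score(X_test, y_test)
print("selected k:", best_k, "validation acc:", best_val_acc, "test acc:", test_acc)
```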

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.


A solution to this problem is a procedure called cross-validation (CV for short). 

Several cross-validation strategies exist; we will use the tools available in scikit-learn for this purpose.
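As a quick preview, scikit-learn's `cross_val_score` helper runs the whole split / fit / score cycle in one call. The sketch below assumes the bundled Iris data and a 3-nearest-neighbours classifier; the sections that follow write the same loop out by hand to show what happens in each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

clf = KNeighborsClassifier(n_neighbors=3)
# With an integer cv and a classifier, stratified folds are used;
# one accuracy value per fold is returned
scores = cross_val_score(clf, X, y, cv=5)
print("fold accuracies:", scores)
print("mean = %f, std = %f" % (scores.mean(), scores.std()))
```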

## Using the Iris dataset


In [1]:
#load tools
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [5]:
##Load iris
###################################################################
def loadIris():
    #path to the data file (comma-separated; the last column holds the class label)
    file_name='data_lb1/iris2.data'
    data=np.loadtxt(file_name,delimiter=',') 
    #features: every column except the last; labels: the last column
    X=data[:,:-1]
    Y=data[:,-1]

    
    return X,Y.astype(int)

In [6]:
#Verify data
X,Y=loadIris()
print(X.shape)
print(Y.shape)

(150, 4)
(150,)


In [7]:
Y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

## K-fold
KFold divides the samples into $k$ groups of equal size (if possible), called folds. If $k=n$, this is equivalent to the Leave One Out strategy. The prediction function is learned using $k-1$ folds, and the fold left out is used for testing.
<img src="img_ex/kfolds.png" width="450">
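The fold structure can be inspected on a toy array. The following sketch (the array `X_toy` is a made-up example) prints the index sets for $k=3$ and checks that $k=n$ produces the same test folds as `LeaveOneOut`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(12).reshape(6, 2)  # 6 toy samples, 2 features

# k = 3: every sample lands in the test fold exactly once
for train_index, test_index in KFold(n_splits=3).split(X_toy):
    print("TRAIN:", train_index, "TEST:", test_index)

# k = n: each test fold holds a single sample, as in Leave One Out
kf_tests = [test.tolist() for _, test in KFold(n_splits=len(X_toy)).split(X_toy)]
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
print("same test folds:", kf_tests == loo_tests)
```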

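There is a problem when K-fold is applied directly to the Iris data as loaded above: the samples are sorted by class, so most test folds contain a single class.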

In [8]:
from sklearn.model_selection import KFold


X,Y=loadIris()
print(X.shape)
print(Y.shape)


#10 fold

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        print("test labels: ", y_test)
        #print("train labels: ", y_train)
        
 

(150, 4)
(150,)
test labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test labels:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
test labels:  [0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]
test labels:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
test labels:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
test labels:  [1 1 1 1 1 1 1 1 1 1 2 2 2 2 2]
test labels:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
test labels:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
test labels:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]



To solve the problem, we can shuffle the dataset before applying the algorithm above. We use a random permutation of the indices as follows:

In [9]:
#values from 10 to 19
x=np.arange(10,20)
print("values of x:          ", x)
#shuffle: draw a random permutation of the indices 0..9
y=np.random.permutation(10)
print("new indices for x:    ", y)

#use the new indices
x=x[y]
print("values of x, shuffled:", x)

values of x:           [10 11 12 13 14 15 16 17 18 19]
new indices for x:     [9 3 0 2 5 6 8 7 1 4]
values of x, shuffled: [19 13 10 12 15 16 18 17 11 14]


In [10]:
#Do the same for the IRIS dataset
ind=np.random.permutation(Y.size)

X=X[ind,:]
Y=Y[ind]

In [11]:

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        print("test labels: ", y_test)

test labels:  [0 1 1 0 1 2 2 2 0 2 1 0 1 0 2]
test labels:  [1 0 0 0 2 2 1 1 0 0 0 1 0 1 0]
test labels:  [2 1 1 1 0 2 1 2 2 0 1 1 0 2 0]
test labels:  [1 1 0 2 0 2 2 2 1 2 2 1 0 0 1]
test labels:  [2 1 0 0 2 0 2 1 2 0 0 1 1 1 1]
test labels:  [2 2 0 1 1 2 2 2 2 0 2 2 0 2 2]
test labels:  [0 1 2 0 1 2 1 0 1 1 1 2 1 1 1]
test labels:  [0 2 1 1 0 1 0 2 2 2 0 0 2 1 0]
test labels:  [1 0 0 2 1 1 0 0 2 2 2 2 0 2 2]
test labels:  [0 1 0 1 2 1 0 1 0 2 0 1 0 2 0]


We can now train and compare two classifiers:

1. Logistic regression
2. KNN

In [12]:
#two lists store the accuracy (% of correctly classified samples) of
#the two classifiers, one list per classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

#number of folds
k=10 

acc1 = []
acc2 = []

kf = KFold(n_splits=k)

for train_index, test_index in kf.split(X):
        
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #CLF 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #Clf2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #Evaluate
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)
                
acc1=np.array(acc1)
acc2=np.array(acc2)


print("Logistic regression: average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : average = %f, std = %f"% (acc2.mean(), acc2.std()))

Logistic regression: average = 94.666667, std = 4.988877
KNN                : average = 96.000000, std = 5.333333


In [14]:
acc2

array([ 86.66666667, 100.        ,  93.33333333,  93.33333333,
       100.        , 100.        ,  93.33333333,  93.33333333,
       100.        , 100.        ])

## Leave One Out (LOO)

LeaveOneOut (or LOO) is a simple cross-validation scheme. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for $n$ samples, we have $n$ different training sets and $n$ different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set:

In [13]:
from sklearn.model_selection import LeaveOneOut
acc1 = []
acc2 = []

X,Y=loadIris()
print(X.shape)
print(Y.shape)

#index generator
loo = LeaveOneOut()

for train, test in loo.split(X):
        X_train, X_test = X[train], X[test]
        y_train, y_test = Y[train], Y[test]
        #Clf 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #eval
        yp1 = clf1.predict(X_test)
        acc1.append(yp1==y_test)
        
        #clf2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #eval
        yp2 = clf2.predict(X_test)
        acc2.append(yp2==y_test)
                
acc1=np.array(acc1).sum()/len(acc1)*100
acc2=np.array(acc2).sum()/len(acc2)*100



print("Logistic regression: accuracy = ", acc1)
print("KNN                : accuracy = ", acc2)

(150, 4)
(150,)
Logistic regression: accuracy =  94.0
KNN                : accuracy =  96.0


Note that the accuracy is very close to the one obtained with K-fold (identical for KNN in the runs shown). However, a per-fold average or standard deviation is not informative here: each iteration tests a single sample, so the per-fold accuracy is 100% if that sample is classified correctly and 0% if it is misclassified.
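The all-or-nothing nature of the per-fold scores can be checked directly. This sketch assumes the bundled Iris data and passes a `LeaveOneOut` splitter to `cross_val_score`, which returns one score per left-out sample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fit per sample: each score is the accuracy on a single left-out sample
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("number of folds:", scores.size)
print("distinct per-fold scores:", np.unique(scores))
print("LOO accuracy: %.1f%%" % (scores.mean() * 100))
```

Only the mean over all folds is a meaningful summary in this setting.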

## Random cross-validation = Shuffle & Split

The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.

It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo random number generator.


ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)

- n_splits: number of splits
- test_size: proportion of the data to use for testing (0.3 gives a 70% / 30% train / test split)

Here is a visualization of the cross-validation behavior. Note that ShuffleSplit is not affected by classes or groups.

<img src="img_ex/random.png" width="500">
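A small sketch on a made-up 10-sample array illustrates the behavior: every split draws its test set independently, so, unlike K-fold, a given sample may be tested several times or not at all.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(20).reshape(10, 2)  # 10 toy samples

ss = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)
test_sets = []
for train_index, test_index in ss.split(X_toy):
    test_sets.append(set(test_index.tolist()))
    print("TRAIN:", train_index, "TEST:", test_index)

# Count how often each sample ended up in a test set across the 4 splits
times_tested = np.zeros(len(X_toy), dtype=int)
for s in test_sets:
    times_tested[list(s)] += 1
print("times each sample was tested:", times_tested)
```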

In [20]:
from sklearn.model_selection import ShuffleSplit

acc1 = []
acc2 = []

X,Y=loadIris()
print(X.shape)
print(Y.shape)


#
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
for train_index, test_index in ss.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #clf1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #clf2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #Evaluate
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)

               
acc1=np.array(acc1)
acc2=np.array(acc2)

print("Logistic regression: Average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : Average = %f, std = %f"% (acc2.mean(), acc2.std()))    
del X, Y

(150, 4)
(150,)
Logistic regression: Average = 93.333333, std = 2.981424
KNN                : Average = 96.000000, std = 3.265986


# Example


In [17]:
def loadDiab():
    #path to the data file (comma-separated; the last column holds the class label)
    file_name='data_lb1/diabetes.data'
    data=np.loadtxt(file_name,delimiter=',') 
    x=data[:,:-1]
    y=data[:,-1]

    
    return x,y.astype(int)

X,Y=loadDiab()
print(X.shape)
print(Y.shape)

(768, 8)
(768,)


## K-fold

In [18]:
#first apply a random permutation to the data; the new indices are called ind
ind=np.random.permutation(Y.size)

#reorder the data following the random permutation
X=X[ind,:]
Y=Y[ind]

#number of folds
k=10 

acc1 = []
acc2 = []

kf = KFold(n_splits=k)

for train_index, test_index in kf.split(X):
        
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #classifier 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate clf1 and store the result in acc1
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #classifier 2
        clf2 = KNeighborsClassifier(n_neighbors=9)
        clf2.fit(X_train, y_train)
        
        #evaluate clf2 and store the result in acc2
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)
                
acc1=np.array(acc1)
acc2=np.array(acc2)

print("Performance comparison of the two classifiers:\n")        
print("Logistic regression: average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : average = %f, std = %f"% (acc2.mean(), acc2.std()))

Performance comparison of the two classifiers:

Logistic regression: average = 77.084757, std = 3.131867
KNN                : average = 73.171565, std = 5.515538


## Leave One Out (LOO)


In [28]:
X,Y=loadDiab()
print(X.shape)
print(Y.shape)
acc1 = []
acc2 = []


#create the index generator
loo = LeaveOneOut()

for train, test in loo.split(X):
        X_train, X_test = X[train], X[test]
        y_train, y_test = Y[train], Y[test]
        #classifier 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate clf1 and store the result in acc1; with a single test sample there is
        #no need to sum or normalize inside the loop
        yp1 = clf1.predict(X_test)
        acc1.append(yp1==y_test)
        
        #classifier 2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #evaluate clf2 and store the result in acc2
        yp2 = clf2.predict(X_test)
        acc2.append(yp2==y_test)
                
acc1=np.array(acc1).sum()/len(acc1)*100
acc2=np.array(acc2).sum()/len(acc2)*100


print("Performance comparison of the two classifiers:\n")        
print("Logistic regression: Accuracy = ", acc1)
print("KNN                : Accuracy = ", acc2)

(768, 8)
(768,)
Performance comparison of the two classifiers:

Logistic regression: Accuracy =  77.60416666666666
KNN                : Accuracy =  69.40104166666666


## Random cross-validation = Shuffle & Split


In [21]:
X,Y=loadDiab()
print(X.shape)
print(Y.shape)

acc1 = []
acc2 = []

#index generator
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
for train_index, test_index in ss.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #classifier 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate clf1 and store the result in acc1
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #classifier 2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #evaluate clf2 and store the result in acc2
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)

               
acc1=np.array(acc1)
acc2=np.array(acc2)

print("Performance comparison of the two classifiers:\n")        
print("Logistic regression: average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : average = %f, std = %f"% (acc2.mean(), acc2.std()))    


(768, 8)
(768,)
Performance comparison of the two classifiers:

Logistic regression: average = 77.922078, std = 2.306998
KNN                : average = 69.826840, std = 1.764302


## Cross-validation iterators with stratification based on class labels
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling, as implemented in StratifiedKFold and StratifiedShuffleSplit, to ensure that relative class frequencies are approximately preserved in each train and validation fold.




### Stratified k-fold

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
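The effect of stratification is easiest to see on a deliberately imbalanced toy problem. The labels below are hypothetical (90 negatives followed by 10 positives); the sketch counts the positive samples that fall into each test fold with plain `KFold` and with `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((y_toy.size, 1))  # features do not influence the splits

def positives_per_test_fold(splitter):
    """Count the positive samples in each test fold produced by `splitter`."""
    return [int(y_toy[test].sum()) for _, test in splitter.split(X_toy, y_toy)]

plain = positives_per_test_fold(KFold(n_splits=5))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=5))
print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", stratified)
```

With the unshuffled plain K-fold, all positives end up in the last fold, while the stratified version spreads them evenly.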

In [30]:
#Stratified k-fold

from sklearn.model_selection import StratifiedKFold, KFold

X,Y=loadDiab()
print(X.shape)
print(Y.shape)


acc1 = []
acc2 = []

#create the stratified index generator
skf = StratifiedKFold(n_splits=10)
   
for train_index, test_index in skf.split(X,Y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #classifier 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate clf1 and store the result in acc1
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #classifier 2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #evaluate clf2 and store the result in acc2
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)
                
acc1=np.array(acc1)
acc2=np.array(acc2)

print("Performance comparison of the two classifiers:\n")        
print("Logistic regression: average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : average = %f, std = %f"% (acc2.mean(), acc2.std()))
    

(768, 8)
(768,)
Performance comparison of the two classifiers:

Logistic regression: average = 77.347915, std = 3.574822
KNN                : average = 70.305878, std = 3.763358


### Stratified Shuffle Split

StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e., it creates splits that preserve the same percentage of each target class as in the complete set.
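A short check of this property, again on hypothetical imbalanced labels (10% positives), compares the positive fraction of each random test set with that of the full set. The `test_size` and `random_state` values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((y_toy.size, 1))  # features do not influence the splits

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
test_fractions = []
for train_index, test_index in sss.split(X_toy, y_toy):
    # Fraction of positive samples in this random test set
    test_fractions.append(float(y_toy[test_index].mean()))

print("positive fraction in the full set:", y_toy.mean())
print("positive fraction per test set:   ", test_fractions)
```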

In [22]:
#Stratified ShuffleSplit 
from sklearn.model_selection import StratifiedShuffleSplit

X,Y=loadDiab()
print(X.shape)
print(Y.shape)


acc1 = []
acc2 = []

#create the stratified index generator
sss = StratifiedShuffleSplit(n_splits=10)
   
for train_index, test_index in sss.split(X,Y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        #classifier 1
        clf1 = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')
        clf1.fit(X_train, y_train)
        #evaluate clf1 and store the result in acc1
        yp1 = clf1.predict(X_test)
        acc1.append(np.sum(yp1==y_test)/y_test.size*100)
        
        #classifier 2
        clf2 = KNeighborsClassifier(n_neighbors=3)
        clf2.fit(X_train, y_train)
        
        #evaluate clf2 and store the result in acc2
        yp2 = clf2.predict(X_test)
        acc2.append(np.sum(yp2==y_test)/y_test.size*100)
                
acc1=np.array(acc1)
acc2=np.array(acc2)

print("Performance comparison of the two classifiers:\n")        
print("Logistic regression: average = %f, std = %f"% (acc1.mean(), acc1.std()))
print("KNN                : average = %f, std = %f"% (acc2.mean(), acc2.std()))
    

(768, 8)
(768,)
Performance comparison of the two classifiers:

Logistic regression: average = 76.233766, std = 3.535236
KNN                : average = 69.350649, std = 3.396545
