# Cross Validation

<img src="gridsearch.png"/>

<p>When evaluating different settings ("hyperparameters") for estimators, such as n_neighbors for the K-Nearest Neighbors classifier, which must be set manually, there is a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can "leak" into the model, and evaluation metrics no longer report on generalization performance. To solve this problem, another part of the dataset can be held out as a so-called "validation set".</p>

<p>Training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.</p>

<p>However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.</p>
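<p>The three-way partition described above can be sketched by calling train_test_split twice. This is only an illustration: the split sizes (30 test samples, 30 validation samples), the fixed random_state, and the use of stratify are assumptions, not choices made in this notebook.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out the final test set (30 samples; the size is an assumption).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y
)

# Step 2: split what is left into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, random_state=0, stratify=y_rest
)

# Only 90 of the 150 samples remain for fitting the model.
print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

<p>Only 90 of the 150 iris samples are left for training, which is the cost the paragraph above refers to.</p>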

<p>A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:</p>

<img src="kfold.png">

<ul>
<li>A model is trained using k-1 of the folds as training data;</li>
<li>the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).</li>
</ul>
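<p>The two steps above can be sketched by hand with NumPy index arrays. This is a minimal illustration, not the scikit-learn implementation: k = 5, the seed, and n_neighbors = 3 are arbitrary assumptions, and the indices are shuffled first because the iris samples are ordered by class.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k = 5                                   # number of folds (an arbitrary choice)
rng = np.random.default_rng(0)          # fixed seed so the split is repeatable
indices = rng.permutation(len(X))       # shuffle: iris is sorted by class
folds = np.array_split(indices, k)      # k index groups of roughly equal size

fold_scores = []
for i in range(k):
    # Fold i is held out for validation; the other k-1 folds form the training data.
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The reported performance is the average over the k folds.
print("mean accuracy over folds:", np.mean(fold_scores))
```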

In [1]:
import numpy as np 
import pandas as pd
import os

In [2]:
from sklearn.datasets import load_iris
iris=load_iris()

X=iris.data
y=iris.target

In [4]:
# Min-max scale to [0, 1] using the global min and max of X (one scale shared by all features)
X=(X-np.min(X))/(np.max(X)-np.min(X))

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(X,y,test_size=0.3) 

In [6]:
from sklearn.neighbors  import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors=3)

In [7]:
from sklearn.model_selection import cross_val_score
accuracies=cross_val_score(estimator=knn,X=x_train,y=y_train,cv=10)
accuracies

array([0.83333333, 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ])

In [8]:
print("average accuracy :",np.mean(accuracies))
print("std of accuracy :",np.std(accuracies))

average accuracy : 0.9833333333333334
std of accuracy : 0.04999999999999999


In [9]:
knn.fit(x_train,y_train)
print("test accuracy :",knn.score(x_test,y_test))

test accuracy : 0.9555555555555556


## GridSearch

In [10]:
from sklearn.model_selection import GridSearchCV

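grid ={"n_neighbors":np.arange(1,50)}  # candidate values of k: 1 to 49
knn= KNeighborsClassifier()
knn_cv=GridSearchCV(knn,grid,cv=10)  # 10-fold CV for each candidate value
knn_cv.fit(X,y)  # note: the search runs on the full dataset, not only the training split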

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [11]:
print("tuned hyperparameter K:",knn_cv.best_params_)
print("best accuracy for the tuned parameter (best score):",knn_cv.best_score_)

tuned hyperparameter K: {'n_neighbors': 13}
best accuracy for the tuned parameter (best score): 0.98
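<p>To keep the test set out of the hyperparameter search, as the introduction recommends, the grid search can be run on the training split only and the refitted best model scored once on the held-out test set. The following is a sketch under assumptions: the variable names, the fixed random_state, and the stratified split are illustrative and will not reproduce the numbers printed above.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Search over k using 10-fold CV on the training data only.
param_grid = {"n_neighbors": np.arange(1, 50)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
search.fit(X_train, y_train)

print("best k:", search.best_params_["n_neighbors"])
print("mean CV accuracy for best k:", search.best_score_)

# With refit=True (the default), search.score uses the best model
# refitted on the whole training split.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

<p>Here best_score_ is a cross-validated estimate from the training data, while the final number comes from samples the search never saw.</p>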


# What is K-Fold Cross Validation

<img src="kfoldpic.png">

<p>In K-Fold CV, a given data set is split into K sections (folds), and each fold is used as the testing set exactly once. Let's take the scenario of 5-Fold cross validation (K=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train it. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set.</p>
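<p>The 5-fold scenario can be made concrete on a toy array of 10 samples, where each fold holds 2 consecutive indices. The toy array is an assumption chosen only to keep the printed indices short.</p>

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)          # 10 toy samples, indices 0..9
kf = KFold(n_splits=5)        # no shuffling: folds are consecutive blocks

splits = list(kf.split(data))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"iteration {i}: test={test_idx.tolist()} train={train_idx.tolist()}")

# iteration 1: test=[0, 1] train=[2, 3, 4, 5, 6, 7, 8, 9]
# iteration 2: test=[2, 3] train=[0, 1, 4, 5, 6, 7, 8, 9]
# ... and so on until indices 8 and 9 have served as the test fold.
```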

In [12]:
from sklearn.model_selection import KFold
scores=[]
# random_state only applies when shuffle=True; recent scikit-learn versions raise an error if it is set with shuffle=False
kFold=KFold(n_splits=10,shuffle=False)
for train_index,test_index in kFold.split(X):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    
    # Fit on the k-1 training folds and score on the held-out fold
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
    
    

Train Index:  [ 15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50
  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149] 

Test Index:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Train Index:  [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  30  31  32
  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50
  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 

In [13]:
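# Note: the loop above has already fitted and scored every fold; this cell refits on the last fold
# and appends its score a second time, so scores now holds 11 values instead of 10.
knn.fit(X_train,y_train)
scores.append(knn.score(X_test,y_test))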

In [14]:
print(np.mean(scores))

0.9393939393939394


In [15]:
cross_val_score(knn, X, y, cv=10)

array([1.        , 0.93333333, 1.        , 1.        , 0.86666667,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])
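<p>One reason these scores differ from the manual KFold loop: when cv is an integer and the estimator is a classifier, cross_val_score splits with StratifiedKFold, which keeps the class proportions in every fold, rather than with plain KFold. The sketch below checks that equivalence. It uses the unscaled iris data and n_neighbors = 3 as assumptions, so its numbers are not expected to match the output above.</p>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# cv=10 with a classifier is shorthand for an unshuffled StratifiedKFold.
default_scores = cross_val_score(knn, X, y, cv=10)
stratified_scores = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=10))

# Plain KFold takes consecutive blocks, so on class-sorted data such as iris
# each test fold contains samples from a single class.
plain_scores = cross_val_score(knn, X, y, cv=KFold(n_splits=10))

print("same as StratifiedKFold:", np.allclose(default_scores, stratified_scores))
print("stratified mean:", default_scores.mean())
print("plain KFold mean:", plain_scores.mean())
```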