<h2> ======================================================</h2>
 <h1>MA477 - Theory and Applications of Data Science</h1> 
  <h1>Lesson 8: Cross-Validation </h1> 
 
 <h4>Dr. Valmir Bucaj</h4>
 United States Military Academy, West Point 
AY20-2
<h2>======================================================</h2>

<h2>Lecture Outline</h2>

<ul>
    <li>What is Cross-Validation?</li>
    <li> Validation Set Method</li>
    <li>Leave-One-Out Cross-Validation (LOOCV)</li>
    <li>$k-$Fold Cross-Validation</li>
    <li>Bias-Variance Trade-Off for k-Fold CV</li>
    <li>Implementing Cross-Validation with Python</li>
 
    
</ul>

<h3> What is Cross-Validation?</h3>

Cross-Validation (CV for short) is a <i>resampling method</i> most commonly used for <i>model assessment</i>; that is, to evaluate a model's performance by estimating the test error associated with the respective machine-learning method.

For example, to gain an idea of the variability of our model, we may repeatedly draw different samples from the training data, fit the machine-learning model to each sample, and compute some metric to examine the extent to which the fitted models differ. This kind of insight cannot be gained if we fit our model to the training data only once.
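The resampling idea described above can be sketched in a few lines. This is only an illustration on synthetic data (the data-generating line `y = 2x + noise`, the subsample size, and the number of repetitions are all assumptions), using a straight-line fit as the "model" and the fitted slope as the metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumed for illustration): y = 2x + noise
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(0, 3, size=100)

slopes = []
for _ in range(200):
    # Draw a random subsample of the training data (without replacement)
    idx = rng.choice(len(x), size=50, replace=False)
    # Fit a straight line to the subsample and record its slope
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)

slopes = np.array(slopes)
print("mean slope across resamples:", slopes.mean())
print("std of slope across resamples:", slopes.std())
```

The standard deviation of the slopes gives a rough sense of how much the fitted model changes from sample to sample, which a single fit cannot reveal.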

<h3>Validation Set Method</h3>

Recall that when assessing the performance of a machine-learning model we are interested in how well the model makes predictions on new data, previously unseen by the model. In other words, we want to estimate the <i>test error rate</i>.

The <i>validation set approach</i> is most appropriate when we have a large dataset and can afford to split it into a <i>training set</i> (used to train our model) and a <i>test set</i> or <i>hold-out set</i>, which has not been seen by the model before and is used to estimate test performance, measured for example by the $R^2$ score or the $MSE$ in the regression setting.

There are two points one has to keep in mind when using the <i> validation set approach</i>:
<ul>
    <li> The estimate of the test error rate obtained from the validation set may have high variance, since it depends on which points happen to fall in the training set and which in the validation set</li>
    <li> It may overestimate the test error rate, because the model is trained on only part of the data and machine-learning algorithms tend to perform better with larger training sets</li>
    </ul>
    
We have already implemented the validation set approach when we discussed KNN Regressor and Linear Regression.
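To see the first point in action, the following sketch repeats the validation set approach with ten different random splits and records the validation $MSE$ each time. The data are synthetic and the coefficients, noise level, and seeds are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

val_mse = []
for seed in range(10):
    # A different random 80/20 split for every seed
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

val_mse = np.array(val_mse)
print(val_mse.round(3))
print("spread (max - min):", val_mse.max() - val_mse.min())
```

The same model and the same data give a different estimate for every split; the spread is the variability the first bullet refers to.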

<h3> Leave-One-Out Cross-Validation</h3>

LOOCV is very similar to the <i>validation set approach</i> in the sense that it also involves splitting the dataset in two parts. Despite the similarities, it attempts to overcome the two drawbacks that the validation set approach has, namely, the high variance due to the random split into training and test sets, and the potential to overestimate the test error. 

Suppose we have $n$ data points $(x_1,y_1),\dots, (x_n,y_n)$. LOOCV splits the dataset into a single-element validation set and a training set which contains the rest of the data. Specifically, on the first iteration, only the data point $(x_1,y_1)$ is designated as the validation set and the remaining $n-1$ points $(x_2,y_2),\dots, (x_n,y_n)$ are used to train the model. Once the model has been trained and a prediction $\hat{y_1}$ has been made for the excluded observation $x_1$, one computes $MSE_1=(y_1-\hat{y_1})^2$ to obtain an estimate of the test $MSE$.

This estimate is poor, as it depends on a single point and thus suffers from high variance. To get around this drawback, we repeat the process by iteratively designating each of $(x_i,y_i)$ as the validation point and using the remaining $n-1$ points $(x_1,y_1),\dots,(x_{i-1},y_{i-1}),(x_{i+1},y_{i+1}),\dots,(x_n,y_n)$ to train the model. After the model has been trained, a prediction $\hat{y_i}$ is made for the point $x_i$, which the model has not seen, and we compute $MSE_i=(y_i-\hat{y_i})^2$. Averaging these over the $n$ points gives a more robust estimate of the test $MSE$:

$$CV_{(n)}=\frac{1}{n}\sum_{i=1}^nMSE_i$$

Let's discuss how LOOCV gets around the two drawbacks that the validation set approach suffers from. First, because there is no random split of the dataset, $CV_{(n)}$ will always be the same regardless of how many times the procedure is run. Second, because each fit uses $n-1$ of the $n$ points, that is, almost the entire original dataset, it has less tendency to overestimate the test error rate than the validation set approach.

However, LOOCV does suffer from drawbacks. Perhaps the major one is that it is computationally costly, since the model must be fit $n$ times; this is especially burdensome if $n$ is large and the machine-learning models we are fitting are complex and slow to train. This leads us to the next validation method.
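As a minimal sketch of $CV_{(n)}$, the code below uses scikit-learn's `LeaveOneOut` splitter with `cross_val_score` on a small synthetic dataset (the data and the helper name `loocv_mse` are assumptions for illustration). Since scikit-learn scorers follow a "greater is better" convention, the MSE scorer returns negated values, which we flip back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Small synthetic dataset (assumed for illustration)
X_demo = rng.normal(size=(40, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, size=40)

def loocv_mse(X, y):
    """Return (CV_(n), number of model fits) for linear regression."""
    scores = cross_val_score(
        LinearRegression(), X, y,
        cv=LeaveOneOut(),                   # one validation point per fit
        scoring="neg_mean_squared_error",   # negated MSE_i for each point
    )
    return -scores.mean(), len(scores)

cv_n, n_fits = loocv_mse(X_demo, y_demo)
cv_n_again, _ = loocv_mse(X_demo, y_demo)

print("number of fits:", n_fits)
print("CV_(n):", cv_n)
print("same value on a second run:", np.isclose(cv_n, cv_n_again))
```

The model is fit once per data point, and repeating the procedure reproduces the same estimate because no random split is involved.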

<h3>k-Fold Cross-Validation</h3>

$k-$fold CV is essentially a generalization of LOOCV. In this case the dataset is randomly split into $k$ subsets of approximately equal size. Then, iteratively, one of these subsets (folds) is designated as the validation set and the remaining $k-1$ subsets (folds) are used to fit the model. Once the model is fit, predictions are made on the designated validation set and the mean squared error $MSE_i\approx \frac{1}{\lfloor n/k\rfloor}\sum_{j=1}^{\lfloor n/k\rfloor}(y_{ij}-\hat{y_{ij}})^2$ is computed for $i=1,2,\dots,k$, where $y_{ij}$ denotes the $j$-th response in the $i$-th fold and $\hat{y_{ij}}$ its prediction. Finally, the $k-$fold CV estimate of the test $MSE$ is the average of these values:

$$CV_{(k)}=\frac{1}{k}\sum_{i=1}^kMSE_i$$

As we mentioned, when $k=n$, $k-$fold CV is simply LOOCV, since each fold then contains a single point.

In contrast to LOOCV, $k-$fold CV is computationally less expensive because instead of fitting the model $n$ times, it only requires fitting it $k$ times, where $k$ is typically taken to be $3$, $5$ or $10$.
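The $k-$fold procedure can also be written out by hand. The following is a sketch only, using NumPy and an ordinary least-squares fit on synthetic data; the function name `kfold_cv_mse` and the data are hypothetical and not part of any library.

```python
import numpy as np

def kfold_cv_mse(X, y, k, seed=0):
    """k-fold CV estimate of the test MSE for a least-squares linear fit."""
    n = len(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)            # shuffle before splitting
    folds = np.array_split(indices, k)      # k folds of roughly equal size
    A = np.column_stack([np.ones(n), X])    # add an intercept column
    fold_mse = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds
        beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        # MSE_i on the held-out fold
        residuals = y[val] - A[val] @ beta
        fold_mse.append(np.mean(residuals ** 2))
    # CV_(k): average of the k fold-wise MSEs
    return float(np.mean(fold_mse))

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 1.0, size=60)

cv_5 = kfold_cv_mse(X_demo, y_demo, k=5)
cv_loo_a = kfold_cv_mse(X_demo, y_demo, k=len(y_demo), seed=0)
cv_loo_b = kfold_cv_mse(X_demo, y_demo, k=len(y_demo), seed=123)

print("5-fold CV estimate:", cv_5)
print("k = n estimates agree across seeds:", np.isclose(cv_loo_a, cv_loo_b))
```

Setting `k` equal to the number of observations puts exactly one point in each fold, so the shuffle no longer affects the result and the estimate coincides with $CV_{(n)}$.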

<h2>Implementing Cross-Validation in Python</h2>

Go ahead and import the standard libraries and the `Boston` dataset.

In [499]:
import pandas as pd
import numpy as np


In [500]:
import matplotlib.pyplot as plt
import seaborn as sns

In [501]:
df=pd.read_excel("Boston_Dataset.xlsx", index_col=[0])

In [502]:
df.head()

Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT,Price
0,0.00632,No,6.575,4.09,1,296,15.3,4.98,24.0
1,0.02731,No,6.421,4.9671,2,242,17.8,9.14,21.6
2,0.02729,No,7.185,4.9671,2,242,17.8,4.03,34.7
3,0.03237,No,6.998,6.0622,3,222,18.7,2.94,33.4
4,0.06905,No,7.147,6.0622,3,222,18.7,5.33,36.2


<h3> Data Pre-processing</h3>

<font size=4 color='red'>Exercise</font>

  Scale the data, but not the response variable, `Price`.

In [503]:
from sklearn.preprocessing import StandardScaler

In [504]:
X=df.drop('Price',axis=1)
y=df['Price']

In [505]:
X.head()

Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT
0,0.00632,No,6.575,4.09,1,296,15.3,4.98
1,0.02731,No,6.421,4.9671,2,242,17.8,9.14
2,0.02729,No,7.185,4.9671,2,242,17.8,4.03
3,0.03237,No,6.998,6.0622,3,222,18.7,2.94
4,0.06905,No,7.147,6.0622,3,222,18.7,5.33


In [506]:
X['CHAS']=X['CHAS'].apply(lambda x: 1 if x=='Yes' else 0)

In [507]:
scaler=StandardScaler()

In [508]:
scaled=scaler.fit_transform(X)

In [509]:
X_sc=pd.DataFrame(scaled,columns=X.columns)

In [510]:
X_sc.head()

Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT
0,-0.419782,-0.272599,0.413672,0.140214,-0.982843,-0.666608,-1.459,-1.075562
1,-0.417339,-0.272599,0.194274,0.55716,-0.867883,-0.987329,-0.303094,-0.492439
2,-0.417342,-0.272599,1.282714,0.55716,-0.867883,-0.987329,-0.303094,-1.208727
3,-0.41675,-0.272599,1.016303,1.077737,-0.752922,-1.106115,0.113032,-1.361517
4,-0.412482,-0.272599,1.228577,1.077737,-0.752922,-1.106115,0.113032,-1.026501


One way to implement CV is to use the `cross_validate()` function. Below we import this function along with KNN Regressor and Linear Regression, which we will use to fit our data.

In [488]:
from sklearn.model_selection import cross_validate, cross_val_score, ShuffleSplit,  KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

In [511]:
lg=LinearRegression()
knn=KNeighborsRegressor(n_neighbors=50)


When implementing CV we need to provide an estimator (e.g. Linear Regression, KNN, etc.). Below, we first implement CV with LinearRegression as the estimator, then do the same using KNN Regressor and compare the results. We also need to provide a way to score the predictions, such as the R2 score or the MSE.

One very important thing to keep in mind is that when applying CV you need to make sure you are shuffling the indices that are used to split the dataset. We will demonstrate this concept below so that it is clear what we mean.

In [513]:
cv_lg=cross_validate(lg,X_sc,y,cv=5,scoring={'r2','neg_mean_squared_error'},return_estimator=True,return_train_score=True)
cv_knn=cross_validate(knn,X_sc,y,cv=3,scoring={'r2'},return_estimator=True,return_train_score=True)

In [514]:
cv_lg.keys()

dict_keys(['fit_time', 'score_time', 'estimator', 'test_r2', 'train_r2', 'test_neg_mean_squared_error', 'train_neg_mean_squared_error'])
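Note the key `test_neg_mean_squared_error`: scikit-learn scorers follow a "greater is better" convention, so the MSE is reported with its sign flipped. A small sketch of turning these values back into fold-wise $MSE_i$ and the $k-$fold estimate $CV_{(k)}$ is given below; it uses synthetic data (an assumption, since the Boston file is not bundled here) and hypothetical variable names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=100)

cv_demo = cross_validate(
    LinearRegression(), X_demo, y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=["r2", "neg_mean_squared_error"],
)

# Flip the sign to recover the MSE on each validation fold
fold_mse = -cv_demo["test_neg_mean_squared_error"]
# Average over the folds to get the k-fold CV estimate of the test MSE
cv_k = fold_mse.mean()

print("MSE per fold:", fold_mse.round(3))
print("CV_(k):", cv_k)
```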

In [517]:
cv_lg['test_r2']

array([ 0.71913999,  0.68726767,  0.63332009,  0.15154911, -0.28798347])

In [518]:
cv_lg['train_r2']

array([0.70972238, 0.70431563, 0.64587289, 0.80614078, 0.71961333])

In [493]:
cv_knn['test_r2']

array([0.64817529, 0.57005844, 0.31596906])

In [494]:
cv_knn['estimator']

(KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform'),
 KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform'),
 KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform'))

<font color='red' size=4>STOP & Reflect</font>

As we can see, the $R^2$ scores on some of the folds are very low, and one is even negative. We worked with this same dataset last lecture and were getting better results. This is a great opportunity to stop and reflect on what could possibly be going on.

<h4> Shuffling First</h4>

Now, let's demonstrate the significant change in the computed metrics once we shuffle the indices first and then carry out the split. Think about why this is important and why you should always do this first.

Below we will demonstrate two different ways we may achieve this.

In [554]:
kf=KFold(n_splits=10,random_state=4,shuffle=True)
ss=ShuffleSplit(n_splits=2,test_size=0.2, random_state=1)
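The two splitters behave differently: `KFold` with `shuffle=True` partitions the shuffled indices, so every observation lands in exactly one validation fold, whereas `ShuffleSplit` draws an independent random train/test split on each iteration, so validation sets may overlap and some observations may never be held out. The toy sketch below (ten indices, with `_demo` names that are hypothetical) makes this visible.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

data = np.arange(10)  # ten toy observations, identified by their index

kf_demo = KFold(n_splits=5, shuffle=True, random_state=4)
ss_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)

# Collect the validation indices produced by each splitter
kf_test = [test for _, test in kf_demo.split(data)]
ss_test = [test for _, test in ss_demo.split(data)]

print("KFold validation folds:       ", [t.tolist() for t in kf_test])
print("ShuffleSplit validation sets: ", [t.tolist() for t in ss_test])

# With KFold the validation folds together cover every index exactly once
kf_all = np.sort(np.concatenate(kf_test))
print("KFold covers all indices once:", np.array_equal(kf_all, data))
```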

In [536]:
cv_lg=cross_validate(lg,X_sc,y,cv=kf,scoring={'r2'},return_estimator=True,return_train_score=True)
cv_knn=cross_validate(knn,X_sc,y,cv=ss,scoring={'r2'},return_estimator=True,return_train_score=True)

In [539]:
cv_lg['test_r2']

array([0.75841073, 0.63908976, 0.67856122, 0.72150264, 0.65296235,
       0.76868821, 0.74182405, 0.73488563, 0.54513532, 0.67925968])

In [538]:
cv_knn['test_r2'].mean()

0.6810116370146042
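One possible reason shuffling matters so much is that the rows of a dataset may be stored in a meaningful order. Passing an integer such as `cv=5` for a regressor uses `KFold` without shuffling, so each fold is a contiguous block of rows. The sketch below illustrates the effect on synthetic data that we deliberately sort by the response; whether the Boston file is ordered in a comparable way is an assumption we do not verify here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=200)

# Deliberately store the rows in a meaningful order: sorted by the response
order = np.argsort(y_demo)
X_sorted, y_sorted = X_demo[order], y_demo[order]

# Contiguous folds: each validation fold covers a narrow slice of the response
r2_ordered = cross_val_score(
    LinearRegression(), X_sorted, y_sorted,
    cv=KFold(n_splits=5), scoring="r2",
)

# Shuffled folds: each validation fold resembles the whole dataset
r2_shuffled = cross_val_score(
    LinearRegression(), X_sorted, y_sorted,
    cv=KFold(n_splits=5, shuffle=True, random_state=4), scoring="r2",
)

print("R2 without shuffling:", r2_ordered.round(3))
print("R2 with shuffling:   ", r2_shuffled.round(3))
```

Without shuffling, each validation fold is unrepresentative of the data the model was trained on, and the fold-wise $R^2$ scores deteriorate accordingly.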

In [527]:
for train, test in kf.split(X_sc):
    print('Train:',train)
    print("Test",test)

Train: [  0   2   3   4   5   6   7   8   9  10  11  13  15  16  17  19  21  22
  23  25  26  27  29  30  31  33  34  35  36  37  38  39  42  43  44  45
  46  47  48  49  50  51  52  53  54  56  57  58  59  60  61  62  63  64
  68  70  71  72  73  74  75  76  78  79  80  81  82  83  85  86  87  88
  90  91  92  93  94  95  96  97  98  99 101 102 104 105 106 107 108 110
 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
 130 131 132 133 134 135 136 137 138 139 140 142 143 144 145 147 148 149
 151 153 154 155 156 157 158 159 160 162 163 164 165 166 167 168 169 170
 172 174 177 179 180 181 182 184 185 186 187 188 189 190 191 195 196 197
 199 201 202 204 205 206 207 208 209 210 211 213 214 215 216 217 218 219
 220 223 224 225 227 229 231 232 233 234 235 237 238 239 241 242 243 244
 245 246 247 248 249 250 251 252 253 254 255 257 260 261 262 263 265 267
 269 270 271 272 273 274 275 276 277 278 280 282 283 284 286 287 288 290
 292 293 294 295 296 298 299 302 303 304 305

<font color='red' size='5'>Exercise</font>

Load the `diabetes` dataset from `sklearn.datasets`. Designate $20\%$ of the data as `Validation` data. Use the remaining data to train the model using either KNN Regressor or Linear Regression to predict the disease progression one year after baseline. Once you have trained the model, compute the test R2 score using the Validation set to get an assessment of the model's performance.

In [540]:
from sklearn.datasets import load_diabetes

In [541]:
diabet=load_diabetes()

In [542]:
print(diabet.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

In [545]:
diabet.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

In [549]:
diabet.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [550]:
df=pd.DataFrame(diabet['data'],columns=diabet['feature_names'])

In [551]:
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


In [552]:
df['target']=diabet['target']

In [553]:
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


In [561]:
ss=ShuffleSplit(n_splits=1,test_size=0.2, random_state=1)

In [562]:
for train, test in ss.split(df):
    print('Train:',train)
    print("Test:",test)

Train: [438 232  80  46 381 224  85 338  81 400 369 233 260 228 185 122  11 117
 139 218  93 420 168 324  41 180 162  95 106  92  66 370  23 327  13 326
  61 189 271 340  39 191 320  98 332 307 328 275 414 127  82 403 173  27
  89 339  73  69 179  91 363 161 330 125  59 120  12 337 204 441 187 157
 295 374 349 365 172 409 325  88 131 124 274  14 346 201 123 138 111  51
 112   9 287  16 229   0 305 105 236 426 261 372  70 244  38 150 342 163
 167 247 223 412 415 145 238 303 401  42 160 354 355 306 245 434 225 361
 312 418 323 368 147 250 100  34 188 110 175  53 227 135 411 154  19 142
 272 158 300  44 298 268 113 296 211 108 378 169  79 416 388  84   8  32
  99 382 360 350 222  28 322  55 419 341  48 362 309  33  35  63 270 234
 432  45 177 219 343  21 248 199 137  24 184 205 134 116 212 397 376 291
 231 217  56 345 344 407 181  97 114 278 118 329 170 379  54 176 194 198
 182 358 103 220 130 406  60  94 193 140 148 202 152 423  10 269  96 210
 240  57 301 334  36 394  20 392  75 200  77

In [563]:
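# train_test_split was not imported earlier in this notebook
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows (features and target together) as validation data
X,X_test=train_test_split(df,test_size=0.2,random_state=1)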

In [564]:
X.shape

(353, 11)

In [565]:
X_test.shape

(89, 11)

In [566]:
X_test.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
246,0.041708,-0.044642,-0.032073,-0.061904,0.079612,0.050982,0.056003,-0.009972,0.045066,-0.059067,78.0
425,-0.078165,-0.044642,-0.040696,-0.081414,-0.100638,-0.112795,0.022869,-0.076395,-0.020289,-0.050783,152.0
293,-0.0709,-0.044642,0.092953,0.012691,0.020446,0.042527,0.000779,0.00036,-0.054544,-0.001078,200.0
31,-0.023677,-0.044642,-0.065486,-0.081414,-0.03872,-0.05361,0.059685,-0.076395,-0.037128,-0.042499,59.0
359,0.038076,0.05068,0.00565,0.032201,0.006687,0.017475,-0.024993,0.034309,0.014823,0.061054,311.0


In [567]:
X.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
438,-0.005515,0.05068,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018118,0.044485,104.0
232,0.012648,0.05068,0.000261,-0.011409,0.03971,0.057245,-0.039719,0.056081,0.024053,0.032059,259.0
80,0.070769,-0.044642,0.012117,0.04253,0.071357,0.053487,0.052322,-0.002592,0.025393,-0.00522,143.0
46,-0.05637,-0.044642,-0.011595,-0.033214,-0.046975,-0.04766,0.00446,-0.039493,-0.007979,-0.088062,190.0
381,-0.0709,0.05068,-0.089197,-0.074528,-0.042848,-0.025739,-0.032356,-0.002592,-0.012908,-0.054925,104.0
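The remaining steps of the exercise could be carried out along the following lines. This is one possible sketch, not the only solution: it rebuilds the data frame so that it runs on its own, uses Linear Regression as the estimator, and introduces the names `train_df` and `val_df` purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Rebuild the data frame: ten features plus the target column
diabetes = load_diabetes()
frame = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
frame["target"] = diabetes["target"]

# Hold out 20% of the rows as validation data
train_df, val_df = train_test_split(frame, test_size=0.2, random_state=1)

# Separate the features from the response in both parts
X_train, y_train = train_df.drop("target", axis=1), train_df["target"]
X_val, y_val = val_df.drop("target", axis=1), val_df["target"]

# Fit on the training part only, then score on the held-out part
model = LinearRegression().fit(X_train, y_train)
val_r2 = r2_score(y_val, model.predict(X_val))

print("training rows:", train_df.shape, "validation rows:", val_df.shape)
print("validation R2:", val_r2)
```

Because this is a single random split, the score depends on `random_state`; repeating it with other seeds, or switching to $k-$fold CV, gives a sense of how stable the estimate is.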
