<h2> ======================================================</h2>
 <h1>MA477 - Theory and Applications of Data Science</h1> 
  <h1>Lesson 8: Cross-Validation </h1> 
 
 <h4>Dr. Valmir Bucaj</h4>
 United States Military Academy, West Point 
AY20-2
<h2>======================================================</h2>

<h2>Lecture Outline</h2>

<ul>
    <li>What is Cross-Validation?</li>
    <li> Validation Set Method</li>
    <li>Leave-One-Out Cross-Validation (LOOCV)</li>
    <li>$k-$Fold Cross-Validation</li>
    <li>Bias-Variance Trade-Off for k-Fold CV</li>
    <li>Implementing Cross-Validation with Python</li>
 
    
</ul>

<h3> What is Cross-Validation?</h3>

Cross-Validation (CV for short) is a <i>resampling method</i> most commonly used for <i>model assessment</i>; that is, to evaluate a model's performance by estimating the test error associated with a given machine-learning method.

For example, to gain an idea of the variability of our model, we may repeatedly draw different samples from the training data, fit the machine-learning model to each sample, and compute some metric to examine the extent to which the fitted models differ. This kind of insight cannot be gained if we fit our model to the training data only once.
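
The resampling idea described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (a hypothetical linear relationship with slope $3$ plus noise), and the names `X_demo`, `y_demo` and `slopes` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 3x + noise (not the lecture's dataset)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=200)

slopes = []
for _ in range(50):
    # Draw a random subsample (without replacement) of half the data
    idx = rng.choice(len(X_demo), size=100, replace=False)
    model = LinearRegression().fit(X_demo[idx], y_demo[idx])
    slopes.append(model.coef_[0])

slopes = np.array(slopes)
print("mean fitted slope:", slopes.mean())
print("std of fitted slope across resamples:", slopes.std())
```

The spread of the fitted slopes gives a rough picture of how much the model changes from sample to sample, which a single fit cannot reveal.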

<h3>Validation Set Method</h3>

Recall that when assessing the performance of a machine-learning model we are interested in how well the model makes predictions on new data, previously unseen by the model. In other words, we want to estimate the <i>test error rate</i>.

The <i>validation set approach</i> is most appropriate when we have a large dataset and can afford to split it into a <i>training set</i> (used to train our model) and a <i>test set</i> or <i>hold-out set</i>, which has not been seen by the model and is used to estimate the <i>test error rate</i>. In the regression setting this is typically measured by the $MSE$, often reported together with the $R^2$ score.

There are two points one has to keep in mind when using the <i> validation set approach</i>:
<ul>
    <li> The estimate of the test error rate obtained from the validation set may have a high variance, since it depends on which points happen to be included in the training set</li>
    <li> It may overestimate the test error rate, because the model is trained on only part of the data and machine-learning algorithms tend to perform better with larger training sets</li>
    </ul>
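
The first point can be illustrated with a short sketch. It assumes synthetic data generated with `make_regression` (not the lecture's dataset) and simply repeats the validation set approach with ten different random splits; the names `X_demo`, `y_demo` and `val_mse` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression data
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=20.0, random_state=0)

val_mse = []
for seed in range(10):
    # A different random 80/20 split for every seed
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))

val_mse = np.array(val_mse)
print(val_mse.round(1))
print("min:", val_mse.min(), "max:", val_mse.max())
```

The same model and the same data give a different validation $MSE$ for each split, which is the variability the first bullet refers to.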
    
We have already implemented the validation set approach when we discussed KNN Regressor and Linear Regression.

<h3> Leave-One-Out Cross-Validation</h3>

LOOCV is very similar to the <i>validation set approach</i> in the sense that it also involves splitting the dataset in two parts. Despite the similarities, it attempts to overcome the two drawbacks that the validation set approach has, namely, the high variance due to the random split into training and test sets, and the potential to overestimate the test error. 

Suppose we have $n$ data points $(x_1,y_1),\dots, (x_n,y_n)$. LOOCV splits the dataset into a single-element validation set and a training set which contains the rest of the data. Specifically, on the first iteration, only the data point $(x_1,y_1)$ is designated as the validation set and the remaining $n-1$ points $(x_2,y_2),\dots, (x_n,y_n)$ are used to train the model. Once the model has been trained and a prediction $\hat{y_1}$ is made for the excluded observation $x_1$, one computes $MSE_1=(y_1-\hat{y_1})^2$ to obtain an estimate of the test $MSE$.

For obvious reasons, this estimate is poor, as it depends on a single point and thus suffers from high variance. To get around this drawback, we repeat the process by iteratively designating each of $(x_i,y_i)$ as the validation point and using the remaining $n-1$ points $(x_1,y_1),\dots,(x_{i-1},y_{i-1}),(x_{i+1},y_{i+1}),\dots,(x_n,y_n)$ to train the model. After the model has been trained, a prediction $\hat{y_i}$ is made for the point $x_i$, which has not been seen by the model, and we compute $MSE_i=(y_i-\hat{y_i})^2$. Averaging these over the $n$ points gives a more robust estimate of the test $MSE$:

$$CV_{(n)}=\frac{1}{n}\sum_{i=1}^nMSE_i$$
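
The formula for $CV_{(n)}$ can be checked with a minimal sketch. It assumes synthetic data from `make_regression` and a linear regression model, computes the average by an explicit loop, and compares it with scikit-learn's `LeaveOneOut` splitter; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical synthetic data, kept small because LOOCV fits n models
X_demo, y_demo = make_regression(n_samples=40, n_features=3,
                                 noise=10.0, random_state=0)
n = len(y_demo)

# Manual LOOCV: leave out point i, train on the other n-1 points
sq_errors = np.empty(n)
for i in range(n):
    train = np.delete(np.arange(n), i)
    model = LinearRegression().fit(X_demo[train], y_demo[train])
    y_hat_i = model.predict(X_demo[i:i + 1])[0]
    sq_errors[i] = (y_demo[i] - y_hat_i) ** 2   # MSE_i

cv_n_manual = sq_errors.mean()                  # CV_(n)

# Same quantity via scikit-learn (scores are negated MSE values)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
cv_n_sklearn = -scores.mean()

print(cv_n_manual, cv_n_sklearn)
```

Because there is no random split involved, rerunning this code always gives the same value.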

Let's discuss how LOOCV gets around the two drawbacks that the validation set approach suffers from. First, because there is no random split of the dataset, $CV_{(n)}$ will always be the same regardless of how many times the procedure is run. Second, because each fit uses $n-1$ of the $n$ observations, that is, almost the entire original dataset, it has less tendency to overestimate the test error rate than the validation set approach.

However, LOOCV does suffer from drawbacks. Perhaps the major one is that it is computationally costly, since the model has to be fit $n$ times; this is especially a problem if $n$ is large and the machine-learning models we are fitting are complex and slow to train. This leads us to the next validation method.

<h3>k-Fold Cross-Validation</h3>

$k-$fold CV is essentially a generalization of LOOCV. In this case the dataset is randomly split into $k$ subsets of approximately equal size. Then, iteratively, one of these subsets (folds) is designated as the validation set and the remaining $k-1$ subsets (folds) are used to fit the model. Once the model is fit, predictions are made on the designated validation set and the mean squared error $MSE_i\approx \frac{1}{\lfloor\frac{n}{k}\rfloor}\sum_{j=1}^{\lfloor\frac{n}{k}\rfloor}(y_{ij}-\hat{y_{ij}})^2$ is computed for $i=1,2,\dots,k$, where $(x_{ij},y_{ij})$ denotes the $j$-th observation in the $i$-th fold. Finally, the $k-$fold CV estimate of the test $MSE$ is the average of these values:

$$CV_{(k)}=\frac{1}{k}\sum_{i=1}^kMSE_i$$
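
As a rough sketch of the computation, the per-fold $MSE_i$ values and their average $CV_{(k)}$ can be obtained by looping over the splits produced by `KFold`. The data are synthetic and the names (`X_demo`, `kf_demo`, `fold_mse`) are hypothetical; $n=103$ is chosen so that the folds cannot all have the same size.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data with n = 103 observations
X_demo, y_demo = make_regression(n_samples=103, n_features=3,
                                 noise=10.0, random_state=0)
k = 5
kf_demo = KFold(n_splits=k, shuffle=True, random_state=0)

fold_sizes, fold_mse = [], []
for train_idx, val_idx in kf_demo.split(X_demo):
    # Fit on the k-1 training folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    residuals = y_demo[val_idx] - model.predict(X_demo[val_idx])
    fold_sizes.append(len(val_idx))
    fold_mse.append(np.mean(residuals ** 2))    # MSE_i

cv_k = np.mean(fold_mse)                        # CV_(k)

# Same estimate via cross_val_score with the same splitter
cv_k_sklearn = -cross_val_score(LinearRegression(), X_demo, y_demo,
                                cv=kf_demo,
                                scoring="neg_mean_squared_error").mean()

print(fold_sizes)   # folds of "approximately equal size"
print(cv_k, cv_k_sklearn)
```

Since the splitter has a fixed integer `random_state`, both computations see exactly the same folds and agree.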

As we mentioned, $k-$fold CV generalizes LOOCV: when $k=n$, each fold contains a single observation and $k-$fold CV is simply LOOCV.

In contrast to LOOCV, $k-$fold CV is computationally less expensive because instead of fitting the model $n$ times, it only requires fitting it $k$ times, where $k$ is typically taken to be $3$, $5$ or $10$.
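
The difference in cost can be made concrete by counting the number of model fits each scheme requires. The sketch below uses a tiny hypothetical array of $12$ observations and also checks that `KFold` with as many folds as observations produces the same validation points as `LeaveOneOut`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy data: 12 observations, one feature
X_demo = np.arange(12).reshape(-1, 1)
n = len(X_demo)

# Number of train/validation splits = number of model fits
n_fits_loocv = LeaveOneOut().get_n_splits(X_demo)
n_fits_5fold = KFold(n_splits=5).get_n_splits(X_demo)
print(n_fits_loocv, n_fits_5fold)   # 12 5

# With n folds (and no shuffling) every fold is a single point, as in LOOCV
loo_test = [test.tolist() for _, test in LeaveOneOut().split(X_demo)]
kfold_n_test = [test.tolist() for _, test in KFold(n_splits=n).split(X_demo)]
print(loo_test == kfold_n_test)     # True
```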

<h2>Implementing Cross-Validation in Python</h2>

Go ahead and import the standard libraries and the `Boston` dataset.

In [2]:
import pandas as pd
import numpy as np


In [49]:
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
df=pd.read_excel("Boston_Dataset.xlsx")

In [51]:
df.head()

Unnamed: 0.1,Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT,Price
0,0,0.00632,No,6.575,4.09,1,296,15.3,4.98,24.0
1,1,0.02731,No,6.421,4.9671,2,242,17.8,9.14,21.6
2,2,0.02729,No,7.185,4.9671,2,242,17.8,4.03,34.7
3,3,0.03237,No,6.998,6.0622,3,222,18.7,2.94,33.4
4,4,0.06905,No,7.147,6.0622,3,222,18.7,5.33,36.2


<h3> Data Pre-processing</h3>

<font size=4 color='red'>Exercise</font>

  Scale the data, but not the response variable, `Price`.

In [13]:
from sklearn.preprocessing import StandardScaler

In [52]:
X=df.drop('Price',axis=1)
y=df['Price']

In [57]:
X.head()

Unnamed: 0.1,Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT
0,0,0.00632,No,6.575,4.09,1,296,15.3,4.98
1,1,0.02731,No,6.421,4.9671,2,242,17.8,9.14
2,2,0.02729,No,7.185,4.9671,2,242,17.8,4.03
3,3,0.03237,No,6.998,6.0622,3,222,18.7,2.94
4,4,0.06905,No,7.147,6.0622,3,222,18.7,5.33


In [59]:
X['CHAS']=X['CHAS'].apply(lambda x: 1 if x=='Yes' else 0)

In [60]:
scaler=StandardScaler()

In [61]:
scaled=scaler.fit_transform(X)

In [62]:
X_sc=pd.DataFrame(scaled,columns=X.columns)

In [63]:
X_sc.head()

Unnamed: 0.1,Unnamed: 0,CRIM,CHAS,RM,DIS,RAD,TAX,PTRATIO,LSTAT
0,-1.728631,-0.419782,-0.272599,0.413672,0.140214,-0.982843,-0.666608,-1.459,-1.075562
1,-1.721785,-0.417339,-0.272599,0.194274,0.55716,-0.867883,-0.987329,-0.303094,-0.492439
2,-1.714939,-0.417342,-0.272599,1.282714,0.55716,-0.867883,-0.987329,-0.303094,-1.208727
3,-1.708093,-0.41675,-0.272599,1.016303,1.077737,-0.752922,-1.106115,0.113032,-1.361517
4,-1.701247,-0.412482,-0.272599,1.228577,1.077737,-0.752922,-1.106115,0.113032,-1.026501
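
A side note on scaling and cross-validation: when the scaler is fit on the full dataset, as above, the observations that later serve as validation folds also contribute to the scaler's mean and standard deviation. One common alternative, sketched here on synthetic data with hypothetical names, is to wrap the scaler and the model in a `Pipeline` so that the scaler is refit on the training folds only. This is an optional variation, not the approach used in the rest of this lesson.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the real features
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# The scaler is fit inside each CV iteration, on the training folds only
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo, scoring="r2")
print(scores)
print("mean R^2:", scores.mean())
```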


One way to implement CV is to use the `cross_validate()` function. Below we import this function along with KNN Regressor and Linear Regression, which we will use to fit our data.

In [303]:
from sklearn.model_selection import cross_validate, cross_val_score, ShuffleSplit,  KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

In [453]:
lg=LinearRegression()
knn=KNeighborsRegressor(n_neighbors=10)


When implementing CV we need to provide an estimator (e.g. Linear Regression, KNN, etc.). Below, we first implement CV with LinearRegression as the estimator, then do the same using KNN Regressor and compare the results. We also need to provide a metric for scoring the predictions, such as the $R^2$ score, MSE, etc.

One very important thing to keep in mind is that when applying CV you need to make sure the indices used to split the dataset are shuffled. We will demonstrate this below so that it is clear what we mean.

In [408]:
# scoring is passed as a list of metric names; the result keys are then 'test_r2' and 'train_r2'
cv_lg=cross_validate(lg,X_sc,y,cv=3,scoring=['r2'],return_estimator=True,return_train_score=True)
cv_knn=cross_validate(knn,X_sc,y,cv=3,scoring=['r2'],return_estimator=True,return_train_score=True)

In [409]:
cv_lg.keys()

dict_keys(['fit_time', 'score_time', 'estimator', 'test_r2', 'train_r2'])
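
The dictionary returned by `cross_validate()` holds one array per key, with one entry per fold. A convenient way to inspect it, sketched here on synthetic data with two metrics, is to collect the score arrays into a DataFrame. The names `cv_demo` and `results` are hypothetical, and `'neg_mean_squared_error'` is scikit-learn's negated MSE scorer.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Hypothetical synthetic data
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=15.0, random_state=0)

cv_demo = cross_validate(LinearRegression(), X_demo, y_demo, cv=3,
                         scoring=["r2", "neg_mean_squared_error"],
                         return_train_score=True)

# Keep only the score arrays (one row per fold)
results = pd.DataFrame({key: val for key, val in cv_demo.items()
                        if key.startswith(("test_", "train_"))})
# scikit-learn reports MSE negated so that larger is always better
results["test_mse"] = -results["test_neg_mean_squared_error"]

print(results[["train_r2", "test_r2", "test_mse"]])
```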

In [410]:
cv_lg['test_r2']

array([ 0.34337723,  0.46592017, -0.28248462])

In [411]:
cv_knn['test_r2']

array([0.37601945, 0.13763878, 0.0729882 ])

In [412]:
cv_knn['estimator']

(KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform'),
 KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform'),
 KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform'))

<font color='red' size=4>STOP & Reflect</font>

As we can see, the $R^2$ scores are very low. We worked with this same dataset last lecture and were getting better results. This is a great opportunity to stop and reflect on what could possibly be going on.

<h4> Shuffling First</h4>

Now, let's demonstrate the significant change in the computed metrics once we shuffle the indices first and then carry out the split. Think about why this is important and why you should always do this first.

Below we define two different splitters that achieve this: `KFold` with `shuffle=True` and `ShuffleSplit`.

In [454]:
kf=KFold(n_splits=10,random_state=2,shuffle=True)
ss=ShuffleSplit(n_splits=10,test_size=0.2, random_state=2)
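
The two splitters are not interchangeable. As a small sketch on a hypothetical array of $20$ observations: `KFold` partitions the data, so every observation lands in exactly one validation fold, whereas `ShuffleSplit` draws each validation set independently, so an observation may be validated several times or not at all.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Hypothetical toy data: 20 observations
X_demo = np.arange(20).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)
ss_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2)

# Collect all validation indices produced by each splitter
kf_test = np.concatenate([test for _, test in kf_demo.split(X_demo)])
ss_test = np.concatenate([test for _, test in ss_demo.split(X_demo)])

# KFold: each index appears exactly once across the validation folds
print(sorted(kf_test.tolist()))
# ShuffleSplit: validation sets are drawn independently and may overlap
print(len(np.unique(ss_test)), "distinct indices among", len(ss_test))
```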

In [458]:
# Same call as before, but with the shuffled splitter `ss` instead of cv=3
cv_lg=cross_validate(lg,X_sc,y,cv=ss,scoring=['r2'],return_estimator=True,return_train_score=True)
cv_knn=cross_validate(knn,X_sc,y,cv=ss,scoring=['r2'],return_estimator=True,return_train_score=True)

In [459]:
cv_lg['test_r2'].mean()

0.6811674465843506

In [460]:
cv_knn['test_r2'].mean()

0.7525084661567873

In [423]:
# for train, test in kf.split(X_sc):
#     print('Train:',train)
#     print("Test",test)
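
To see the mechanism behind the effect of shuffling, here is a sketch on synthetic data that is deliberately stored in sorted order (an assumption made for illustration; nothing here uses the Boston data). Without shuffling, `KFold` takes contiguous blocks of rows as validation folds, so the model is asked to predict in a region it never saw during training.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical data stored in sorted order: y = x^2 + noise
x = np.linspace(0, 10, 300)
X_demo = x.reshape(-1, 1)
y_demo = x ** 2 + rng.normal(scale=1.0, size=x.size)

plain = KFold(n_splits=3)                                   # no shuffling
shuffled = KFold(n_splits=3, shuffle=True, random_state=2)  # shuffle first

# First validation fold of each splitter
first_plain_test = next(plain.split(X_demo))[1]
first_shuffled_test = next(shuffled.split(X_demo))[1]
print(first_plain_test[:5], first_plain_test[-5:])  # a contiguous block of rows
print(np.sort(first_shuffled_test)[:5])             # spread over the dataset

knn_demo = KNeighborsRegressor(n_neighbors=10)
r2_plain = cross_val_score(knn_demo, X_demo, y_demo, cv=plain, scoring="r2")
r2_shuffled = cross_val_score(knn_demo, X_demo, y_demo, cv=shuffled, scoring="r2")
print("unshuffled R^2 per fold:", r2_plain)
print("shuffled R^2 per fold:  ", r2_shuffled)
```

Passing an integer such as `cv=3` for a regressor uses `KFold` without shuffling, which is why the ordering of the rows matters in that case.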

<font color='red' size='5'>Exercise</font>

Load the `diabetes` dataset from `sklearn.datasets`. Designate $20\%$ of the data as `Validation` data. Use the remaining data to train a model, using either KNN Regressor or Linear Regression, to predict the disease progression one year after baseline. Once you have trained the model, compute the test $R^2$ score on the Validation set to assess the model's performance.

In [475]:
from sklearn.datasets import load_diabetes

In [476]:
diabet=load_diabetes()

In [477]:
print(diabet.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra