# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with those of ordinary holdout validation

## Let's get started

This time, include only the variables that were previously selected using recursive feature elimination. The preprocessing code is provided below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
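The cell above applies two rescalings after a log transform: min-max scaling and standardization. As a quick check of what each one does, here is a minimal sketch on a toy array. The helper names `min_max_scale` and `standardize` and the toy values are assumptions for illustration and are not part of the lab. Like the cell above, the standardization uses the population variance (NumPy's default).

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    """Center values at 0 and scale to unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical toy data: powers of two become evenly spaced after the log
toy = np.array([1.0, 2.0, 4.0, 8.0])
scaled = min_max_scale(np.log(toy))
zscores = standardize(np.log(toy))

print(scaled)                         # values between 0 and 1
print(zscores.mean(), zscores.std())  # approximately 0 and 1
```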

In [2]:
X = boston_features
y = pd.DataFrame(boston.target, columns = ['price'])

In [3]:
y.head()

|   | price |
|---|-------|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


## Train test split

Perform a train-test split that holds out 20% of the data as the test set.

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
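Conceptually, a holdout split shuffles the row indices once and cuts them into two parts. The sketch below illustrates that idea with NumPy only; it is not scikit-learn's implementation, and `holdout_split` and the seed value are hypothetical. Rounding the test share up is an assumption, chosen so that 506 rows give the 404 training rows seen in the residual output further down. Because the split is random, results change from run to run unless the seed is fixed (`train_test_split` accepts a `random_state` argument for the same purpose).

```python
import math
import numpy as np

def holdout_split(n_rows, test_size, seed):
    """Return (train_idx, test_idx) after one shuffled cut of the row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test share up
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(n_rows=506, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 404 102
```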

In [6]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

Fit the model on the training set, then use it to make predictions for the test set.

In [7]:
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Calculate the residuals and the mean squared error for both the training and the test set.

In [9]:
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

In [12]:
train_residuals = y_hat_train - y_train
test_residuals = y_hat_test - y_test
print("Train Residuals: ",train_residuals)
print("Test Residuals: ",test_residuals)

Train Residuals:           price
268  -0.708100
300   5.223714
434   3.002658
171   4.392264
494  -4.625898
392   1.632405
181  -7.792319
5    -0.832033
322   2.784904
378   2.841320
419   5.593921
149  -0.103396
451   2.157569
438  -0.874081
16   -1.652887
333   2.038847
95    1.158497
325   1.927740
64   -8.719828
18   -4.427139
503   6.316918
377   7.113619
122  -0.111506
308   8.693959
395   6.360179
444   0.782248
0     7.286569
482   2.663732
105  -0.424301
152   4.145980
..         ...
502   2.674234
7    -9.578920
483  -2.246878
103   0.448468
252   0.003087
463   0.425929
380  -0.929346
291  -1.779214
278  -0.005461
37    2.120263
399   6.815201
505  11.778540
484  -3.351377
233  -9.231564
280  -4.628854
84    0.044841
120  -0.636188
353  -0.434635
48   -3.472392
435  -1.018545
366  -5.145465
237   2.806795
12   -3.242264
330   1.086268
472  -2.968036
110  -1.999083
86   -1.583775
460   0.379317
21   -3.034437
401  10.215086

[404 rows x 1 columns]
Test Residuals:           pr

In [13]:
from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)

Train Mean Squared Error: 16.433966136332202
Test Mean Squared Error: 17.481245322183785
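The mean squared error is simply the average of the squared residuals, so the two quantities computed above are directly linked. A minimal sketch on made-up numbers (the toy values are assumptions, not lab data):

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# Same sign convention as the lab: prediction minus observed value
residuals = y_pred - y_true

# Squaring removes the sign, so either convention gives the same MSE
mse = np.mean(residuals ** 2)
print(residuals, mse)
```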


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the number of rows is not divisible by k, make the first few folds one row larger than the later ones.

We want the folds to be a list of subsets of data!

In [15]:
import numpy as np

In [16]:
def kfolds(data, k):
    # np.array_split handles uneven splits: when len(data) is not divisible
    # by k, the first len(data) % k folds each get one extra row
    return np.array_split(data, k)
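`np.array_split` takes care of uneven fold sizes internally. To make the "first few folds are one larger" rule explicit, here is a sketch of the same logic written by hand for a plain Python sequence. The name `kfolds_manual` is hypothetical, and this is an illustration rather than a replacement for the function above.

```python
def kfolds_manual(items, k):
    """Split a sequence into k consecutive folds; early folds absorb leftovers."""
    base_size, leftovers = divmod(len(items), k)
    folds = []
    start = 0
    for fold_number in range(k):
        # The first `leftovers` folds each get one extra element
        size = base_size + (1 if fold_number < leftovers else 0)
        folds.append(items[start:start + size])
        start += size
    return folds

# 506 rows, as in the Boston data: 506 = 5 * 101 + 1
fold_sizes = [len(fold) for fold in kfolds_manual(list(range(506)), 5)]
print(fold_sizes)  # [102, 101, 101, 101, 101]
```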

### Apply it to the Boston Housing Data

In [17]:
# Make sure to concatenate the data again
data = pd.concat([X,y], axis =1)

In [18]:
data.head()

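|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 | 36.2 |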


In [20]:
data = data.drop(['CRIM','ZN','INDUS','NOX','AGE','RAD','TAX','PTRATIO'], axis =1)

In [21]:
data.head()

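|   | CHAS | RM | DIS | B | LSTAT | price |
|---|------|----|-----|---|-------|-------|
| 0 | 0.0 | 6.575 | 0.542096 | 1.0 | -1.27526 | 24.0 |
| 1 | 0.0 | 6.421 | 0.623954 | 1.0 | -0.263711 | 21.6 |
| 2 | 0.0 | 7.185 | 0.623954 | 0.989737 | -1.627858 | 34.7 |
| 3 | 0.0 | 6.998 | 0.707895 | 0.994276 | -2.153192 | 33.4 |
| 4 | 0.0 | 7.147 | 0.707895 | 1.0 | -1.162114 | 36.2 |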


In [22]:
data_folds = kfolds(data,5)

In [23]:
len(data_folds)

5

In [24]:
for folds in data_folds:
    print(len(folds))

102
101
101
101
101


### Perform a linear regression for each fold, and calculate the training and test error

For each fold in turn, hold that fold out as the test set, fit a linear regression on the remaining folds, and compute the training and test mean squared error.

In [26]:
n=0
train_list = []
for counts,fold in enumerate(data_folds):
    if counts != n:
        train_list.append(fold)
df_train = pd.concat(train_list)
df_test = data_folds[n]
    

In [47]:
test_errs = []
train_errs = []
k=5

for n in range(k):
    # Split in train and test for the fold
    train_list = []
    for counts,fold in enumerate(data_folds):
        if counts != n:
            train_list.append(fold)
    train = pd.concat(train_list)
    test = data_folds[n]
    # Fit a linear regression model
    linreg.fit(train[['CHAS','RM','DIS','B','LSTAT']],train['price'])
    #Evaluate Train and Test Errors
    y_hat_train = linreg.predict(train[['CHAS','RM','DIS','B','LSTAT']])
    train_errs.append(mean_squared_error(train['price'],y_hat_train))
    y_hat_test = linreg.predict(test[['CHAS','RM','DIS','B','LSTAT']])
    test_errs.append(mean_squared_error(test['price'],y_hat_test))
print(train_errs)
print(test_errs)

[24.195577370388616, 23.032087348477972, 19.745072857978982, 15.31710138425119, 22.32997280754659]
[13.405144922008187, 17.44401680068205, 37.03271139002074, 58.279543847842305, 26.097988757148393]
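One way to sanity-check a hand-written cross-validation loop like the one above is to run the same procedure on noiseless synthetic data, where a correct loop should report errors of essentially zero in every fold. The sketch below does this with NumPy only. It fits ordinary least squares with `np.linalg.lstsq` instead of `LinearRegression`, and `cross_validate_ols` and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np

def cross_validate_ols(X, y, k):
    """Return per-fold (train_errs, test_errs) for ordinary least squares."""
    index_folds = np.array_split(np.arange(len(X)), k)
    train_errs, test_errs = [], []
    for n, test_idx in enumerate(index_folds):
        # Every fold except fold n forms the training set
        train_idx = np.concatenate(
            [fold for i, fold in enumerate(index_folds) if i != n]
        )
        # Prepend a column of ones so the fit includes an intercept
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        train_errs.append(np.mean((A_train @ coef - y[train_idx]) ** 2))
        test_errs.append(np.mean((A_test @ coef - y[test_idx]) ** 2))
    return train_errs, test_errs

# Hypothetical noiseless data: y is an exact linear function of X
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 1.0 + X_toy @ np.array([2.0, -3.0])

train_errs, test_errs = cross_validate_ols(X_toy, y_toy, k=5)
print(train_errs)
print(test_errs)
```

On real data such as the Boston set, the per-fold errors will of course not vanish; the point of the toy run is only to confirm that the fold bookkeeping is right.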


In [28]:
df_train.head()

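|     | CHAS | RM | DIS | B | LSTAT | price |
|-----|------|----|-----|---|-------|-------|
| 102 | 0.0 | 6.405 | 0.369415 | 0.17772 | -0.012136 | 18.6 |
| 103 | 0.0 | 6.137 | 0.369415 | 0.993873 | 0.378596 | 19.3 |
| 104 | 0.0 | 6.167 | 0.321174 | 0.989384 | 0.235 | 20.1 |
| 105 | 0.0 | 5.851 | 0.262627 | 0.992814 | 0.71727 | 19.5 |
| 106 | 0.0 | 5.836 | 0.282946 | 0.996898 | 0.925237 | 19.5 |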


## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [32]:
from sklearn.model_selection import cross_val_score

In [35]:
cv_5_results = cross_val_score(linreg,X,y,cv=5, scoring = "neg_mean_squared_error")

Next, calculate the mean of the MSE over the 5 cross-validations and compare and contrast with the result from the train-test-split case.


In [36]:
cv_5_results

array([-13.04498602, -14.64079171, -24.85806428, -55.30023395,
       -19.21846011])
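With `scoring="neg_mean_squared_error"`, scikit-learn returns the MSE with its sign flipped so that larger scores are always better, which is why every value above is negative. A minimal sketch of the averaging step, using the scores and the holdout test MSE copied from the outputs in this lab (the variable names `fold_mse`, `mean_cv_mse`, and `holdout_test_mse` are assumptions):

```python
import numpy as np

# Scores copied from the cross_val_score output above
cv_5_results = np.array([-13.04498602, -14.64079171, -24.85806428,
                         -55.30023395, -19.21846011])

# Flip the sign to recover ordinary (positive) mean squared errors
fold_mse = -cv_5_results
mean_cv_mse = fold_mse.mean()

# Test MSE copied from the single train-test split earlier in the lab
holdout_test_mse = 17.481245322183785

print("Per-fold MSE:", fold_mse)
print("Mean CV MSE:", mean_cv_mse)
print("Holdout test MSE:", holdout_test_mse)
```

The spread across folds shows how much a single holdout estimate can depend on which rows happen to land in the test set; the cross-validated mean averages over that variation.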

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn.