# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
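# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release
from sklearn.datasets import load_boston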

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling of B and log(DIS)
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization of log(LSTAT)
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
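
The two transformations used in the preprocessing cell can be illustrated on a tiny array. This is only a sketch: the `values` array is hypothetical toy data, not taken from the housing dataset.

```python
import numpy as np

# Hypothetical toy values, not taken from the Boston data
values = np.array([1.0, 2.0, 4.0, 8.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1
minmax = (values - values.min()) / (values.max() - values.min())

# Standardization subtracts the mean and divides by the standard deviation
# (np.std uses the population formula, matching np.sqrt(np.var(...)))
standardized = (values - values.mean()) / values.std()

print(minmax)
print(standardized.mean(), standardized.std())
```

After min-max scaling every value lies in the interval [0, 1]; after standardization the values have mean 0 and standard deviation 1, up to floating-point error.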

In [2]:
X = boston_features
y = boston.target

## Train test split

Perform a train-test split with a test set size of 0.20 (20% of the data).

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .2)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print(train_mse, test_mse)

16.730341668656664 16.269507198316624


Fit the model and use it to make predictions on the training and test sets.

Calculate the residuals and the mean squared error.

In [9]:
train_residual = y_hat_train - y_train
test_residual = y_hat_test - y_test
print(train_residual, test_residual)
print(train_mse, test_mse)

[ 3.09380115e+00  1.36037484e+00  1.21133801e+00 -3.13061013e+00
  2.73566848e+00 -1.71536207e-01  1.83885583e+00  1.27095350e+00
 -6.14907699e+00 -1.89763453e-01  5.73846069e+00  1.84188051e-01
  4.01221217e-01  1.27059947e+00  9.73884324e-01  4.76114634e+00
  6.72987823e-01  6.11724109e+00  1.03868218e+00 -1.68573949e+00
 -1.48991106e+00  3.41621887e+00  3.61817171e+00 -3.05618673e+00
  1.29381774e+00  1.08536472e+00  1.77824536e+00  1.88257125e+00
  5.74107121e-01 -9.48939230e+00 -7.16576803e-02  3.36993677e+00
  4.14067987e+00 -2.26706563e+01 -3.51404311e+00  1.25352358e+00
 -2.06016755e+00  8.72624320e+00 -5.22415141e+00 -3.11428298e+00
 -1.60647304e+00 -2.87757244e+00 -2.14409431e-01  5.54157888e-02
  3.25950516e+00  1.83444886e+00 -4.13581562e+00  7.42650004e-01
  2.84636094e+00 -3.38743996e-01  4.57100665e+00 -3.46680342e+00
  2.12446307e+00 -6.64243715e+00 -5.35704789e+00  6.12649053e-01
 -1.10902700e+00  6.45905759e+00  1.15768784e+00 -7.66438657e+00
  2.79265755e+00  7.18915
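
As a quick check of how residuals and the mean squared error relate, the MSE is the mean of the squared residuals. The sketch below uses hypothetical `y_true` and `y_pred` arrays invented for illustration; they are not values from the lab.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0])

# Same sign convention as the lab: prediction minus actual value
residuals = y_pred - y_true

# The MSE is the mean of the squared residuals
mse = np.mean(residuals ** 2)

print(residuals)
print(mse)  # 0.375
```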

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!

In [18]:
def kfolds(data, k):
    # Force data as pandas dataframe
    df = pd.DataFrame(data)
    n = len(df)
    size = n//k
    remainder = n % k
    # Compute the fold sizes: the first `remainder` folds get one extra row
    grouping = []
    for i in range(k):
        if remainder > 0:
            grouping.append(size + 1)
            remainder -= 1
        else:
            grouping.append(size)
    # Slice the dataframe into consecutive folds using the cumulative sizes
    groups = [df.iloc[0:grouping[0]]]
    fold_idx = 1
    while fold_idx < k:
        start = sum(grouping[0:fold_idx])
        end = sum(grouping[0:fold_idx + 1])
        groups.append(df.iloc[start:end])
        fold_idx += 1
    return groups
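
One way to sanity-check the fold sizes is to compute them separately and compare against `np.array_split`, which follows the same "first few pieces are one larger" convention. The helper `fold_sizes` below is a hypothetical name introduced for this sketch, and the numbers 13 and 5 are arbitrary.

```python
import numpy as np

def fold_sizes(n, k):
    """Return k fold sizes summing to n, with the first n % k folds one larger."""
    size, remainder = divmod(n, k)
    return [size + 1 if i < remainder else size for i in range(k)]

sizes = fold_sizes(13, 5)
print(sizes)  # [3, 3, 3, 2, 2]

# np.array_split uses the same convention, so it can serve as a cross-check
reference = [len(part) for part in np.array_split(np.arange(13), 5)]
print(reference == sizes)  # True
```

The same comparison can be run against the lengths of the folds returned by `kfolds` for any dataset size.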


### Apply it to the Boston Housing Data

In [28]:
# Make sure to concatenate the data again
boston_target = pd.DataFrame(data = boston.target, columns = ['price'])
boston_df = pd.concat([boston_target, boston_features], axis = 1)
boston_folds = kfolds(boston_df, 5)
boston_folds[0]

|  | price | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.000000 | -1.275260 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.000000 | -0.263711 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.000000 | -1.162114 |
| 5 | 28.7 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.992990 | -1.200048 |
| 6 | 22.9 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 0.671500 | 5.0 | 311.0 | 15.2 | 0.996722 | 0.248456 |
| 7 | 27.1 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 0.700059 | 5.0 | 311.0 | 15.2 | 1.000000 | 0.968416 |
| 8 | 16.5 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 0.709276 | 5.0 | 311.0 | 15.2 | 0.974104 | 1.712312 |
| 9 | 18.9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 0.743201 | 5.0 | 311.0 | 15.2 | 0.974305 | 0.779802 |


### Perform a linear regression for each fold, and calculate the training and test error

For each fold, fit a linear regression on the remaining folds, then calculate the training and test mean squared error.

In [30]:
test_errs = []
train_errs = []
k = 5

for n in range(k):
    # Hold out fold n for testing and train on the remaining folds
    train = pd.concat([fold for i, fold in enumerate(boston_folds) if i != n])
    test = boston_folds[n]
    X_train = train.drop(['price'], axis = 1)
    X_test = test.drop(['price'], axis = 1)
    y_train = train['price']
    y_test = test['price']
    # Fit a linear regression model
    linreg.fit(X_train, y_train)
    # Evaluate train and test errors (one MSE value per fold)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    train_errs.append(mean_squared_error(y_train, y_hat_train))
    test_errs.append(mean_squared_error(y_test, y_hat_test))
print(train_errs)
print(test_errs)

[17.91857867030195, 17.361583277713564, 15.543427264673557, 11.040378388195778, 17.23404426556592]
[13.04498602423651, 14.640791709688838, 24.858064276132193, 55.30023394856501, 19.218460110193845]


## Cross-Validation using Scikit-Learn

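This was a bit of work! Now, let's perform 5-fold cross-validation with scikit-learn's `cross_val_score` to get the mean squared error. Note that the `neg_mean_squared_error` scorer returns negated MSEs, so the reported values are negative. Have a look at the five individual MSEs and explain what's going on.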

In [34]:
new_x = boston_df.drop(['price'], axis =1)
new_y = boston_df['price']
cv_5_results = cross_val_score(linreg, new_x, new_y, cv = 5, scoring = 'neg_mean_squared_error')
cv_5_results

array([-13.04498602, -14.64079171, -24.85806428, -55.30023395,
       -19.21846011])
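
In magnitude, the five scores above match the test errors from the hand-built folds. A likely explanation is that, for a regressor, an integer `cv` makes `cross_val_score` use `KFold` without shuffling, which produces the same contiguous folds. The sketch below illustrates this on synthetic data (an assumption standing in for the housing data) by comparing `cross_val_score` with a manual loop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, a hypothetical stand-in for the housing data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

model = LinearRegression()
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

# Manual loop over the same contiguous, unshuffled folds
manual_mse = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fitted = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat = fitted.predict(X[test_idx])
    manual_mse.append(mean_squared_error(y[test_idx], y_hat))

print(-neg_mse)              # flip the sign to get positive MSEs per fold
print(np.array(manual_mse))  # should agree with the line above
```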

Next, calculate the mean of the MSE over the 5 folds and compare and contrast it with the result from the train-test split.

In [35]:
np.mean(cv_5_results)

-25.41250721376328
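
One possible reason for a gap between the cross-validated MSE and the holdout MSE is row ordering: `train_test_split` shuffles rows by default, whereas an integer `cv` uses contiguous, unshuffled folds. The sketch below is an illustration only; it uses synthetic data deliberately sorted by the target (a hypothetical setup) and compares unshuffled folds with folds from `KFold(shuffle=True)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data sorted by the target, a hypothetical stand-in for ordered rows
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=100)
order = np.argsort(y)
X, y = X[order], y[order]

model = LinearRegression()
scoring = "neg_mean_squared_error"

# Contiguous folds, as with cv = 5
plain = -cross_val_score(model, X, y, cv=KFold(n_splits=5), scoring=scoring)

# Shuffled folds; random_state is fixed only to make the sketch repeatable
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = -cross_val_score(model, X, y, cv=shuffled_cv, scoring=scoring)

print(plain.round(2), plain.mean().round(2))
print(shuffled.round(2), shuffled.mean().round(2))
```

Whether shuffling is appropriate depends on the data; for example, it is usually avoided when rows have a meaningful time order.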

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn.