# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice cross-validation, first by building k-fold splitting by hand and then by using scikit-learn.


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

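This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below. Note that `load_boston` was removed in scikit-learn 1.2, so this cell needs an earlier version to run.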

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
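The two transformations above can be checked on a toy array. This is a minimal sketch with made-up values (the array `x` is an assumption, not Boston data): min-max scaling maps the values onto the interval [0, 1], while standardization gives them mean 0 and standard deviation 1. `np.var` defaults to the population variance, so `np.sqrt(np.var(...))` in the cell above is equivalent to `.std()` here.

```python
import numpy as np

# Toy values, assumed for illustration only
x = np.array([1.0, 2.0, 4.0, 8.0])
log_x = np.log(x)

# Min-max scaling: smallest value becomes 0, largest becomes 1
minmax = (log_x - log_x.min()) / (log_x.max() - log_x.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (log_x - log_x.mean()) / log_x.std()

print(minmax)
print(standardized.mean(), standardized.std())
```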

In [2]:
boston_target = pd.DataFrame(boston.target, columns = ['price'])
boston_df = pd.concat([boston_target, boston_features], axis=1)
X = boston_df.drop(['price'], axis=1)
y = boston_df['price']

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)

Fit the model on the training set, then use it to make predictions for both the training and test sets.

In [5]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)

Calculate the mean squared error (the mean of the squared residuals) for the training and test sets.

In [6]:
from sklearn.metrics import mean_squared_error
train_mse = mean_squared_error(y_train, y_hat_train)
test_mse = mean_squared_error(y_test, y_hat_test)
print('Train Mean Squared Error:', train_mse)
print('Test Mean Squared Error:', test_mse)

Train Mean Squared Error: 17.18840416445454
Test Mean Squared Error: 14.378218463353283
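A single holdout split depends on which rows happen to land in the test set, which is one reason the test error above can come out lower than the training error. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the choice of seeds are assumptions, not part of the lab) showing how the holdout test MSE moves when only the `random_state` of the split changes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

holdout_mses = []
for seed in range(5):
    # Same data and model each time; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print(holdout_mses)
```

Cross-validation reduces this dependence on one particular split by averaging the error over several of them.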


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one larger than the later ones.

We want the folds to be a list of subsets of data!
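The fold-size rule can be worked out on its own before slicing any data. Here is a small sketch using a hypothetical helper, `fold_sizes` (not part of the lab): integer division gives the base fold size, and the remainder says how many of the first folds get one extra row.

```python
def fold_sizes(n, k):
    """Return the sizes of k folds for n rows; the first n % k folds get one extra row."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

# 506 rows (the size of the Boston data) into 5 folds: 506 = 5 * 101 + 1
print(fold_sizes(506, 5))  # [102, 101, 101, 101, 101]

# 13 rows into 5 folds: 13 = 5 * 2 + 3
print(fold_sizes(13, 5))   # [3, 3, 3, 2, 2]
```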

In [7]:
boston_df

| | price | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.000000 | -1.275260 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.000000 | -0.263711 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.000000 | -1.162114 |
| 5 | 28.7 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.430 | 58.7 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.992990 | -1.200048 |
| 6 | 22.9 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 0.671500 | 5.0 | 311.0 | 15.2 | 0.996722 | 0.248456 |
| 7 | 27.1 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 0.700059 | 5.0 | 311.0 | 15.2 | 1.000000 | 0.968416 |
| 8 | 16.5 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 0.709276 | 5.0 | 311.0 | 15.2 | 0.974104 | 1.712312 |
| 9 | 18.9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 0.743201 | 5.0 | 311.0 | 15.2 | 0.974305 | 0.779802 |


In [8]:
def kfolds(data, k):
    # Force data into a pandas DataFrame so that .iloc slicing always works
    df = pd.DataFrame(data)
    n = len(df)
    group_size = n // k
    remainder = n % k
    # The first `remainder` folds get one extra row to account for leftovers
    group_counts = [group_size + 1 if i < remainder else group_size
                    for i in range(k)]
    groups = []
    start = 0
    for count in group_counts:
        # Slice the next contiguous block of rows as one fold
        groups.append(df.iloc[start:start + count])
        start += count

    return groups

### Apply it to the Boston Housing Data

In [9]:
# Make sure to concatenate the data again
results = kfolds(boston_df, 5)
results

[     price     CRIM    ZN  INDUS  CHAS    NOX     RM    AGE       DIS  RAD  \
 0     24.0  0.00632  18.0   2.31   0.0  0.538  6.575   65.2  0.542096  1.0   
 1     21.6  0.02731   0.0   7.07   0.0  0.469  6.421   78.9  0.623954  2.0   
 2     34.7  0.02729   0.0   7.07   0.0  0.469  7.185   61.1  0.623954  2.0   
 3     33.4  0.03237   0.0   2.18   0.0  0.458  6.998   45.8  0.707895  3.0   
 4     36.2  0.06905   0.0   2.18   0.0  0.458  7.147   54.2  0.707895  3.0   
 5     28.7  0.02985   0.0   2.18   0.0  0.458  6.430   58.7  0.707895  3.0   
 6     22.9  0.08829  12.5   7.87   0.0  0.524  6.012   66.6  0.671500  5.0   
 7     27.1  0.14455  12.5   7.87   0.0  0.524  6.172   96.1  0.700059  5.0   
 8     16.5  0.21124  12.5   7.87   0.0  0.524  5.631  100.0  0.709276  5.0   
 9     18.9  0.17004  12.5   7.87   0.0  0.524  6.004   85.9  0.743201  5.0   
 10    15.0  0.22489  12.5   7.87   0.0  0.524  6.377   94.3  0.727217  5.0   
 11    18.9  0.11747  12.5   7.87   0.0  0.524  6.00

### Perform a linear regression for each fold, and calculate the training and test error

For each fold, train a linear regression on the other k-1 folds, use the held-out fold as the test set, and record the training and test mean squared errors.

In [26]:
test_errs = []
train_errs = []
k = 5
for n in range(k):
    # Hold out fold n for testing; train on the remaining k-1 folds
    train = pd.concat([fold for i, fold in enumerate(results) if i != n])
    test = results[n]
    X_train = train.drop(['price'], axis=1)
    X_test = test.drop(['price'], axis=1)
    y_train = train['price']
    y_test = test['price']
    # Fit a linear regression model on the training folds
    linreg.fit(X_train, y_train)
    # Evaluate train and test errors (MSE is the mean of the squared residuals)
    y_hat_train = linreg.predict(X_train)
    y_hat_test = linreg.predict(X_test)
    train_errs.append(mean_squared_error(y_train, y_hat_train))
    test_errs.append(mean_squared_error(y_test, y_hat_test))
print(train_errs)
print(test_errs)

[17.9185670542463, 17.3577081046629, 15.545678258525871, 11.03762238964458, 17.23404426556592]
[13.016192102045745, 14.62832183142464, 24.81432997168215, 55.241077726377355, 19.022337999169658]
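The per-fold errors are usually summarized by their mean, which is the cross-validation estimate of the error. The sketch below copies the values printed above into two lists (the `_printed` names are ours, used so the notebook's own variables are not overwritten) and computes the mean and spread.

```python
import numpy as np

# Values copied from the output printed above
train_errs_printed = [17.9185670542463, 17.3577081046629, 15.545678258525871,
                      11.03762238964458, 17.23404426556592]
test_errs_printed = [13.016192102045745, 14.62832183142464, 24.81432997168215,
                     55.241077726377355, 19.022337999169658]

# Cross-validation estimate: the mean of the per-fold errors
cv_train_mse = np.mean(train_errs_printed)
cv_test_mse = np.mean(test_errs_printed)

# The standard deviation shows how much the test error varies between folds
cv_test_std = np.std(test_errs_printed)

print('Mean train MSE:', cv_train_mse)
print('Mean test MSE:', cv_test_mse)
print('Std of test MSE:', cv_test_std)
```

The fourth fold has a much larger test error than the others. Because `kfolds` slices contiguous blocks of rows without shuffling, one plausible explanation is that the rows in that block differ systematically from the rest of the data.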


## Cross-Validation using Scikit-Learn

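This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on. Note that the `neg_mean_squared_error` scoring returns the negative of each MSE, because scikit-learn scorers follow the convention that higher is better.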

In [29]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv_5_results = cross_val_score(linreg, boston_df.drop(['price'], axis = 1), boston_df['price'], cv=5, scoring="neg_mean_squared_error")

Next, calculate the mean of the MSE over the 5 folds, and compare and contrast it with the result from the train-test split.

In [31]:
cv_5_results, np.mean(cv_5_results)

(array([-13.0161921 , -14.62832183, -24.81432997, -55.24107773,
        -19.022338  ]), -25.344451926139907)
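Up to the sign, these five values equal the test errors from the hand-built folds. For a regressor, `cv=5` uses `KFold(n_splits=5)` without shuffling, which produces the same contiguous folds as `kfolds`. The following minimal sketch checks this equivalence on synthetic data (`X_demo` and `y_demo` are assumptions, not the Boston data) and flips the sign to recover ordinary MSEs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption); 103 rows so folds are uneven
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(103, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=103)

# scikit-learn returns negated MSEs, so flip the sign to get ordinary MSEs
neg_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -neg_scores

# The same folds built explicitly with an unshuffled KFold
manual_mse = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fitted = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    preds = fitted.predict(X_demo[test_idx])
    manual_mse.append(mean_squared_error(y_demo[test_idx], preds))

print(mse_per_fold.mean())
print(np.allclose(mse_per_fold, manual_mse))
```

If the row order of the data carries structure, passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` is one way to get folds that are more alike.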

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn.