# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Compare the results with normal holdout validation
- Apply 5-fold cross-validation for regression

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. The preprocessing code is included below.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# Min-max scaling of B and of log-transformed DIS
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# Standardization of log-transformed LSTAT
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
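
The two transformations used above can also be written as small reusable helpers. The sketch below is illustrative only: the helper names `min_max_scale` and `standardize` and the toy column are assumptions, not part of the lab's code.

```python
import numpy as np
import pandas as pd

def min_max_scale(col):
    """Rescale a numeric Series to the [0, 1] range."""
    return (col - col.min()) / (col.max() - col.min())

def standardize(col):
    """Center a numeric Series at 0 with unit (population) standard deviation."""
    # ddof=0 matches np.var's default, as used in the cell above
    return (col - col.mean()) / col.std(ddof=0)

# Hypothetical toy column, used only to illustrate the two transformations
toy = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])
log_toy = np.log(toy)

scaled = min_max_scale(log_toy)        # values between 0 and 1
standardized = standardize(log_toy)    # mean 0, standard deviation 1
print(scaled.tolist())
print(standardized.round(3).tolist())
```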

In [3]:
boston['target']

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [4]:
X = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
y = pd.DataFrame(boston.target,columns = ['target'])
type(X)

pandas.core.frame.DataFrame

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
linreg = LinearRegression()

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
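
Because the split is random, the holdout error changes from run to run unless a seed is fixed. A minimal sketch on synthetic data follows; the `X_demo` and `y_demo` names, the column labels, and the seed value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the Boston data (506)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(506, 5)), columns=list("abcde"))
y_demo = pd.Series(rng.normal(size=506), name="target")

# random_state fixes the shuffle, so the same rows land in the test set each run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```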



Fit the model and use it to make predictions on the training and test sets.

In [7]:
linreg.fit(X_train, y_train)
y_h_train = linreg.predict(X_train)
y_h_test = linreg.predict(X_test)


Calculate the residuals and the mean squared error

In [52]:
mean_squared_error(y_test, y_h_test)

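ValueError: Found input variables with inconsistent numbers of samples: [102, 42]

Note: the two sizes in this error show that `y_test` holds 102 rows while `y_h_test` holds only 42 predictions, so `y_h_test` was most likely overwritten by another cell that ran earlier in the session (this cell is numbered 52). Re-running the notebook from top to bottom should restore matching shapes.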
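
The residuals are the differences between observed and predicted targets, and the mean squared error is the average of their squares. The sketch below shows this on synthetic data; all names and the data-generating coefficients are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical coefficients and noise level)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)

# Residuals: observed minus predicted
residuals_train = y_tr - model.predict(X_tr)
residuals_test = y_te - model.predict(X_te)

# MSE computed by hand from the residuals
mse_train = np.mean(residuals_train ** 2)
mse_test = np.mean(residuals_test ** 2)

# The manual value should agree with scikit-learn's helper
print(mse_test, mean_squared_error(y_te, model.predict(X_te)))
```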

## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the size of the full dataset is not divisible by k, make the first few folds one element larger than the later ones.

We want the folds to be a list of subsets of data!

In [9]:
def kfolds(data, k):
    # Force data to be a pandas DataFrame
    data = pd.DataFrame(data)
    # np.array_split handles leftovers: the first (len(data) % k) folds get one extra row
    return np.array_split(data,k)
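
To see the "first few folds one larger" rule spelled out, here is a sketch of the same split done by hand with positional slicing. The name `kfolds_manual` and the toy DataFrame are assumptions for illustration; this is one possible implementation, not the lab's reference solution.

```python
import numpy as np
import pandas as pd

def kfolds_manual(data, k):
    """Split data into k contiguous folds; the first (n % k) folds get one extra row."""
    data = pd.DataFrame(data)
    base, extra = divmod(len(data), k)
    folds = []
    start = 0
    for i in range(k):
        # Folds with index below `extra` absorb one leftover row each
        size = base + (1 if i < extra else 0)
        folds.append(data.iloc[start:start + size])
        start += size
    return folds

# Toy frame with 506 rows, the same length as the Boston data
toy = pd.DataFrame({"x": np.arange(506)})
fold_sizes = [len(f) for f in kfolds_manual(toy, 6)]
print(fold_sizes)
```

With 506 rows and 6 folds, 506 = 6 × 84 + 2, so the first two folds hold 85 rows and the remaining four hold 84.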

### Apply it to the Boston Housing Data

In [10]:
# Make sure to concatenate the data again
b_data = pd.concat([X.reset_index(drop=True),y], axis=1)

In [11]:
b_data.columns

Index(['CHAS', 'RM', 'DIS', 'B', 'LSTAT', 'target'], dtype='object')

### Perform a linear regression for each fold, and calculate the training and test error

Treat each fold in turn as the test set: fit a linear regression on the remaining folds, then calculate the training and test error.

In [41]:
folds = kfolds(b_data, 6)

In [19]:
def mse(residual_col):
    residual_col = pd.Series(residual_col)
    return np.mean(residual_col.astype(float).map(lambda x: x**2))

In [44]:
test_errs = []
train_errs = []
k=6

for n in range(k):
    # Split in train and test for the fold
    train = pd.concat([folds[i] for i in range(k) if i != n])
    test = folds[n]
    # Fit a linear regression model
    linreg.fit(train[X.columns], train[y.columns])
    y_h_train = linreg.predict(train[X.columns])
    y_h_test = linreg.predict(test[X.columns])
    # Evaluate train and test errors
    train_err = y_h_train-train[y.columns]
    test_err = y_h_test - test[y.columns]
    train_errs.append(np.mean(train_err**2))
    test_errs.append(np.mean(test_err**2))

print(train_errs)
print(test_errs)

[target    12.27084
dtype: float64, target    13.477562
dtype: float64, target    12.591612
dtype: float64, target    10.868045
dtype: float64, target    9.599414
dtype: float64, target    11.434433
dtype: float64]
[target    10.251374
dtype: float64, target    4.351297
dtype: float64, target    9.296298
dtype: float64, target    18.949694
dtype: float64, target    26.759849
dtype: float64, target    14.850669
dtype: float64]
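
The printed lists above contain one-element Series because the targets are selected as a DataFrame. A variant that collects plain floats and averages them across folds might look like the sketch below. It runs on synthetic data, and every name in it (`demo`, `features`, `train_mse`, `test_mse`) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a known linear signal plus noise
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.normal(size=(120, 3)), columns=["x1", "x2", "x3"])
demo["target"] = 2 * demo["x1"] - demo["x2"] + rng.normal(scale=0.5, size=120)

features, k = ["x1", "x2", "x3"], 6
# Split row positions (not the DataFrame itself) into k contiguous folds
fold_indices = np.array_split(np.arange(len(demo)), k)

train_mse, test_mse = [], []
for n in range(k):
    test = demo.iloc[fold_indices[n]]
    train = demo.drop(demo.index[fold_indices[n]])
    model = LinearRegression().fit(train[features], train["target"])
    # mean_squared_error returns a plain float for 1-D targets
    train_mse.append(mean_squared_error(train["target"], model.predict(train[features])))
    test_mse.append(mean_squared_error(test["target"], model.predict(test[features])))

print(np.mean(train_mse), np.mean(test_mse))
```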


## Cross-Validation using Scikit-Learn

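This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on. Note that with `scoring='neg_mean_squared_error'`, scikit-learn returns the negated MSE for each fold (so that higher scores are always better), which is why the values below are negative.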

In [51]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

f_fold = cross_val_score(linreg, X, y, cv=5, scoring='neg_mean_squared_error')
f_fold

array([-13.40514492, -17.4440168 , -37.03271139, -58.27954385,
       -26.09798876])
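
Passing an integer such as `cv=5` with a regressor uses contiguous, unshuffled folds, so if the rows are ordered in some meaningful way the per-fold errors can differ a lot. One option, sketched here on synthetic data, is to pass a `KFold` object with `shuffle=True`. The data, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical coefficients and noise level)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.4, size=150)

# Shuffle rows before splitting; random_state makes the folds reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

# Flip the sign to recover ordinary (positive) MSE values
mse_per_fold = -neg_mse
print(mse_per_fold)
```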

Next, calculate the mean of the MSE over the 5 folds, and compare and contrast it with the result from the train-test split.
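
As a starting point, the cross-validated MSE can be obtained by averaging the five scores and flipping the sign. The sketch below copies the values printed above into an array so that it runs on its own. In the notebook, `f_fold` already holds them.

```python
import numpy as np

# Scores copied from the cross_val_score output shown above
f_fold = np.array([-13.40514492, -17.4440168, -37.03271139,
                   -58.27954385, -26.09798876])

# Negate because scikit-learn reports negative MSE
cv_mse = -np.mean(f_fold)
print(round(cv_mse, 2))  # 30.45
```

The spread between the best fold (about 13.4) and the worst (about 58.3) is itself informative: a single holdout split could land anywhere in that range, which is why the averaged estimate is usually preferred.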

## Summary

Congratulations! You have now practiced k-fold cross-validation!