# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Apply 5-fold cross-validation for regression
- Compare the results with normal holdout validation

## Let's get started

This time, let's only include the variables that were previously selected using recursive feature elimination. We included the code to preprocess below.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
b = boston_features["B"]
logdis = np.log(boston_features["DIS"])
loglstat = np.log(boston_features["LSTAT"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

#standardization
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
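
The two transformations used above can also be written as small helper functions. This is only an illustrative sketch: `min_max_scale` and `standardize` are hypothetical names, and the input values are made up rather than taken from the Boston data. The standardization uses the population standard deviation, which matches the default behavior of `np.var`.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Center values at 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

toy = [2.0, 4.0, 6.0, 8.0]  # made-up numbers, not Boston data
scaled = min_max_scale(toy)
z_scores = standardize(toy)
print(scaled)
print(z_scores)
```

Min-max scaling bounds a feature to the interval from 0 to 1, while standardization gives it mean 0 and standard deviation 1 without bounding it.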

In [3]:
boston_features.head()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 0.542096 | 1.0 | 296.0 | 15.3 | 1.0 | -1.27526 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 0.623954 | 2.0 | 242.0 | 17.8 | 1.0 | -0.263711 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 0.623954 | 2.0 | 242.0 | 17.8 | 0.989737 | -1.627858 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 0.707895 | 3.0 | 222.0 | 18.7 | 0.994276 | -2.153192 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 0.707895 | 3.0 | 222.0 | 18.7 | 1.0 | -1.162114 |


In [11]:
boston_features.shape[1]

13

In [5]:
X = boston_features[['CHAS', 'RM', 'DIS', 'B', 'LSTAT']]
y = pd.DataFrame(boston.target, columns=['price'])

## Train test split

Perform a train-test split, holding out 20% of the data as the test set.

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(X_train), len(X_test), len(y_train), len(y_test))

404 102 404 102
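
The sizes printed above follow from the 506 rows in the data: the test share is rounded up, so 0.2 × 506 = 101.2 becomes 102 test rows and the remaining 404 are used for training. The sketch below illustrates the shuffle-then-split idea in plain Python. `holdout_split` is a hypothetical helper written for this illustration, not scikit-learn's implementation.

```python
import math
import random

def holdout_split(n_samples, test_size, seed=None):
    """Shuffle row indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # round the test set size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(506, 0.2, seed=42)
print(len(train_idx), len(test_idx))  # 404 102
```

Fixing the seed makes the split repeatable. `train_test_split` offers the same through its `random_state` argument; without it, each run produces a different split and therefore a slightly different test error.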


Fit the model on the training set, then use it to make predictions for the test set.

In [8]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression().fit(X_train, y_train)

# y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)


Calculate the residuals and the mean squared error

In [9]:
from sklearn.metrics import mean_squared_error

test_res = y_hat_test - y_test
mse_test = mean_squared_error(y_test, y_hat_test)
print(mse_test)

17.061226246037492
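
To make the metric concrete, here is a minimal sketch of what the mean squared error computes: the average of the squared residuals. The function name `mean_squared_error_sketch` and the numbers are made up for illustration; in practice `sklearn.metrics.mean_squared_error` does this for you.

```python
def mean_squared_error_sketch(y_true, y_pred):
    """Average of the squared residuals (prediction minus actual value)."""
    residuals = [pred - actual for actual, pred in zip(y_true, y_pred)]
    return sum(r ** 2 for r in residuals) / len(residuals)

y_true = [3.0, 5.0, 7.0]  # made-up actual values
y_pred = [2.0, 5.0, 9.0]  # made-up predictions
mse = mean_squared_error_sketch(y_true, y_pred)
print(mse)
```

Because the residuals are squared, their sign does not matter, and large errors weigh more heavily than small ones.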


## Cross-Validation: let's build it from scratch!

### Create a cross-validation function

Write a function `kfolds` that splits a dataset into k evenly sized pieces.
If the number of rows is not divisible by k, make the first few folds one row larger than the later ones.

We want the folds to be a list of subsets of data!
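
Before splitting any data, it can help to work out the fold sizes on their own. The sketch below uses a hypothetical helper, `fold_sizes`, which is not part of the lab's solution. It applies the rule described above to the 506 rows of the Boston data (404 + 102 in the split earlier).

```python
def fold_sizes(n_rows, k):
    """Sizes of k folds, with the first n_rows % k folds one row larger."""
    base, leftovers = divmod(n_rows, k)
    return [base + 1 if i < leftovers else base for i in range(k)]

sizes = fold_sizes(506, 5)
print(sizes)  # [102, 101, 101, 101, 101]
```

The sizes always add up to the number of rows, which is a quick check that no observation is dropped or used twice.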

In [17]:
def kfolds(data, k):
    # Force data as pandas DataFrame
    df = pd.DataFrame(data)
    num = df.shape[0]  # number of rows (observations) to split
    
    fold_size = num // k  # base number of rows per fold
    leftovers = num % k   # the first `leftovers` folds get one extra row
    
    folds = []
    count = 0
    
    for n in range(1, k+1):
        if n <= leftovers:
            # add 1 to fold size to account for leftovers
            fold = df.iloc[count : (count + fold_size + 1)]
            folds.append(fold)
            count += fold_size + 1
        else:
            fold = df.iloc[count : (count + fold_size)]
            folds.append(fold)
            count += fold_size
             
    return folds

### Apply it to the Boston Housing Data

In [32]:
# Make sure to concatenate the data 
# (boston_features and boston.target)
housing_data = pd.concat([X, y], axis=1)

In [33]:
housing_folds = kfolds(housing_data, 5)
# Print the number of rows in each fold
print([len(fold) for fold in housing_folds])

[102, 101, 101, 101, 101]


### Perform a linear regression for each fold, and calculate the training and test error

Hold out each fold in turn as the test set, fit a linear regression on the remaining folds, and calculate the training and test error.
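
The rotation of folds can be sketched separately from the regression itself. In the illustration below, `train_test_rotations` is a hypothetical helper and the folds are short lists of made-up row indices rather than DataFrames.

```python
def train_test_rotations(folds):
    """Yield (train, test) pairs, holding out each fold exactly once."""
    for n, test in enumerate(folds):
        # Training data is every row from all folds except fold n
        train = [row for i, fold in enumerate(folds) if i != n for row in fold]
        yield train, test

toy_folds = [[0, 1], [2, 3], [4, 5]]  # made-up row indices in 3 folds
pairs = list(train_test_rotations(toy_folds))
for train, test in pairs:
    print("train:", train, "test:", test)
```

Every row appears in a test set exactly once, so each observation contributes to the test error of one fold only.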

In [29]:
test_errs = []
train_errs = []
k = 5

for n in range(k):
    # Hold out fold n as the test set and train on the remaining folds
    train = pd.concat([f for i, f in enumerate(housing_folds) if i != n])
    test = housing_folds[n]
    
    # Fit a linear regression model
    linreg.fit(train[X.columns], train[y.columns])
    
    # Evaluate train and test errors
    y_hat_train = linreg.predict(train[X.columns])
    y_hat_test = linreg.predict(test[X.columns])
    
    # Residuals: predictions minus the actual target values
    train_res = y_hat_train - train[y.columns]
    test_res = y_hat_test - test[y.columns]
    
    # Mean squared error for this fold
    train_errs.append(np.mean(train_res.values**2))
    test_errs.append(np.mean(test_res.values**2))

print(train_errs)
print(test_errs)

## Cross-Validation using Scikit-Learn

This was a bit of work! Now, let's perform 5-fold cross-validation to get the mean squared error through scikit-learn. Let's have a look at the five individual MSEs and explain what's going on.

In [28]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

result = cross_val_score(linreg, X, y, cv=5, scoring="neg_mean_squared_error")
print(result)

[-13.40514492 -17.4440168  -37.03271139 -58.27954385 -26.09798876]


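Next, calculate the mean of the MSE over the 5 folds, then compare and contrast it with the result from the train-test split. Note that with `scoring="neg_mean_squared_error"` scikit-learn reports negated MSEs (so that higher scores are always better), so flip the sign first.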
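
One way to do this is sketched below. To keep the snippet self-contained, the scores are copied by hand from the `cross_val_score` output printed above; in the notebook you would work with `result` directly.

```python
# Scores copied from the cross_val_score output above (negated MSEs)
neg_mse_scores = [-13.40514492, -17.4440168, -37.03271139, -58.27954385, -26.09798876]

fold_mses = [-score for score in neg_mse_scores]  # flip the sign to get MSEs
cv_mse = sum(fold_mses) / len(fold_mses)
print(round(cv_mse, 2))  # 30.45
```

The cross-validated MSE of about 30.45 is noticeably higher than the 17.06 from the single train-test split. One plausible reason is that `cv=5` splits the rows in their original order without shuffling, so the folds may differ systematically from each other, whereas `train_test_split` shuffles the rows first. The holdout figure also comes from a single random split, so it can change from run to run.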

## Summary

Congratulations! You have now practiced k-fold cross-validation, both from scratch and with scikit-learn!