This lab on Cross-Validation is a Python adaptation of pp. 190-194 of "An Introduction to Statistical Learning
with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Written
by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

# 5.3.1 The Validation Set Approach

In [80]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In this section, we'll explore the use of the validation set approach in order to estimate the
test error rates that result from fitting various linear models on the ${\tt Auto}$ data set.

In [81]:
df1 = pd.read_csv('Auto.csv', na_values='?').dropna()
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 396
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null int64
acceleration    392 non-null float64
year            392 non-null int64
origin          392 non-null int64
name            392 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 30.6+ KB


We begin by using the DataFrame's ${\tt sample()}$ method to split the set of observations
into two halves, by selecting a random subset of 196 observations out of
the original 392 observations. We refer to these observations as the training
set; the remaining 196 observations form the validation (test) set.

We'll use the ${\tt random\_state}$ parameter to set a seed for the random number
generator that ${\tt sample()}$ uses, so that you'll obtain the same split, and therefore the same
results, as those shown below. It is generally a good idea to set a random seed when performing
an analysis such as cross-validation that contains an element of randomness, so that the results
obtained can be reproduced precisely at a later time.

In [84]:
train = df1.sample(196, random_state = 1)
# The test set is every row whose index label was not drawn into train
test = df1.drop(train.index)
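
A quick way to check a split like this one is to confirm that the two halves are disjoint and together cover every row. The sketch below does that on a synthetic frame (the name `toy` and its coefficients are made up for illustration and are not the Auto data), building the hold-out half by dropping the sampled index labels.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Auto data: 392 rows with gaps in the index,
# similar to what is left after dropna() removes a few rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"horsepower": rng.uniform(50, 200, size=392)},
                   index=np.arange(392) * 2)
toy["mpg"] = 40 - 0.15 * toy["horsepower"] + rng.normal(0, 3, size=392)

train_toy = toy.sample(196, random_state=1)
test_toy = toy.drop(train_toy.index)   # every row that was not sampled

# The two halves should share no index labels and together cover all rows.
overlap = train_toy.index.intersection(test_toy.index)
print(len(train_toy), len(test_toy), len(overlap))
```

Because the check works on index labels rather than row positions, it still holds when the index has gaps.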

We then use ${\tt sm.OLS.from\_formula()}$ to fit a linear regression to predict ${\tt mpg}$ from ${\tt horsepower}$ using only
the observations corresponding to the training set.

In [85]:
lm = sm.OLS.from_formula('mpg~horsepower', train)
result = lm.fit()

We now use the ${\tt predict()}$ function to estimate the response for the test
observations, and we use some ${\tt numpy}$ functions to calculate the MSE.
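
As a reminder of what that calculation does, the MSE is the average of the squared differences between observed and predicted responses. Here is a minimal sketch on three made-up values; the helper name `mse` is hypothetical and is not part of the lab.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# The errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
print(mse([3, 5, 7], [2, 5, 9]))   # (1 + 0 + 4) / 3
```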

In [86]:
pred = result.predict(test)

MSE = np.mean(np.square(np.subtract(test["mpg"], pred)))
    
print(MSE)

23.361902892587235


Therefore, the estimated test MSE for the linear regression fit is 23.36. We
can use the ${\tt np.power()}$ function inside the formula to add polynomial terms
and estimate the test error for the quadratic and cubic regressions.
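
The formula strings for these fits follow a simple pattern: one ${\tt np.power()}$ term per degree, joined with `+`. A small helper can make that pattern explicit; `poly_formula` below is a hypothetical name used only for illustration.

```python
def poly_formula(response, predictor, degree):
    """Build a formula string with powers 1..degree of a single predictor."""
    terms = [f"np.power({predictor}, {d})" for d in range(1, degree + 1)]
    return f"{response} ~ " + " + ".join(terms)

for degree in (1, 2, 3):
    print(poly_formula("mpg", "horsepower", degree))
```

Assuming ${\tt numpy}$ is imported as `np` in the calling cell, a string built this way should be usable as the first argument to ${\tt sm.OLS.from\_formula()}$ in place of the hand-assembled one.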

In [88]:
lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)
print(np.mean(np.square(np.subtract(test["mpg"], lm2.fit().predict(test)))))

lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)
print(np.mean(np.square(np.subtract(test["mpg"], lm3.fit().predict(test)))))

20.252690858350192
20.325609366115582


These error rates are 20.25 and 20.33, respectively. If we choose a different
training set instead, then we will obtain somewhat different errors on the
validation set. We can test this out by setting a different random seed:

In [89]:
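train = df1.sample(196, random_state = 2)
# Note: test is not recomputed in this cell, so it is still the hold-out half from
# random_state = 1 and overlaps the new training set. For a clean split, also run
# test = df1.drop(train.index); the outputs printed below were produced without that step.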

lm = sm.OLS.from_formula('mpg~horsepower', train)
print(np.mean(np.square(np.subtract(test["mpg"], lm.fit().predict(test)))))

lm2 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2]]), train)
print(np.mean(np.square(np.subtract(test["mpg"], lm2.fit().predict(test)))))

lm3 = sm.OLS.from_formula('mpg~' + '+'.join(['np.power(horsepower,' + str(i) + ')' for i in [1,2,3]]), train)
print(np.mean(np.square(np.subtract(test["mpg"], lm3.fit().predict(test)))))

23.214272449679587
19.525710963117103
19.667628097077426


Using this split of the observations into a training set and a validation
set, we find that the validation set error rates for the models with linear,
quadratic, and cubic terms are 23.21, 19.53, and 19.67, respectively.

These results are consistent with our previous findings: a model that
predicts ${\tt mpg}$ using a quadratic function of ${\tt horsepower}$ performs better than
a model that involves only a linear function of ${\tt horsepower}$, and there is
little evidence in favor of a model that uses a cubic function of ${\tt horsepower}$.
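
To get a feel for how much a validation-set estimate can move from one split to the next, the experiment can be repeated over several seeds. The sketch below does this on synthetic data with a curved relationship (the coefficients are made up and are not fitted to the Auto data) and uses ${\tt np.polyfit}$ as a stand-in for the ${\tt statsmodels}$ fits, so its numbers are not comparable to the ones above.

```python
import numpy as np

# Synthetic data with a curved relationship, standing in for mpg vs. horsepower.
rng = np.random.default_rng(0)
x = rng.uniform(50, 200, size=392)
y = 70 - 0.6 * x + 0.002 * x**2 + rng.normal(0, 4, size=392)
xs = (x - x.mean()) / x.std()   # standardize so the polynomial fit is well conditioned

def validation_mse(seed, degree):
    """Fit a polynomial of the given degree on a random half, score it on the other half."""
    perm = np.random.default_rng(seed).permutation(len(xs))
    train_idx, test_idx = perm[:196], perm[196:]
    coefs = np.polyfit(xs[train_idx], y[train_idx], degree)
    pred = np.polyval(coefs, xs[test_idx])
    return np.mean((y[test_idx] - pred) ** 2)

# Rows are seeds 0..9, columns are polynomial degrees 1, 2, 3.
results = np.array([[validation_mse(seed, d) for d in (1, 2, 3)]
                    for seed in range(10)])
print(results.round(2))
print("std across seeds, per degree:", results.std(axis=0).round(2))
```

Each row is one random split, so reading down a column shows how much the estimate for a single model varies with the choice of training set.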

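In [108]:
from sklearn.model_selection import LeaveOneOut

# Each split holds out exactly one observation and trains on all the others
loo = LeaveOneOut()
for train_index, test_index in loo.split(df1):
    # split() yields row positions, so use iloc (df1's index labels have gaps)
    train_loo = df1.iloc[train_index]
    test_loo = df1.iloc[test_index]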
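
Leave-one-out cross-validation, which the cell above begins to set up, can also be written by hand: refit the model once per observation and average the squared errors on the held-out points. The sketch below does this with plain ${\tt numpy}$ on synthetic data (all names and values are illustrative assumptions) and compares the result with the leverage-based identity that holds for least squares fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1, 1, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column

# Brute force: refit n times, each time predicting the single held-out point.
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sq_errors.append((y[i] - X[i] @ beta) ** 2)
loocv_loop = np.mean(sq_errors)

# Shortcut for least squares: one fit on all the data plus the leverages h_i.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_full
leverage = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)   # diagonal of the hat matrix
loocv_shortcut = np.mean((resid / (1 - leverage)) ** 2)

print(loocv_loop, loocv_shortcut)
```

The two numbers should agree up to floating-point error, which makes the shortcut a useful check on a hand-written loop.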