# Estimating Test Error

The purpose of building a machine learning model is to make decisions/predictions for _future_ data. Therefore, we are less interested in minimizing **training error** than we are in minimizing **test error**.

To estimate test error, we split our data into two parts: a **training set**, which will be used to _fit_ the model, and a **test set**, which will be used to _evaluate_ the model.

In [1]:
import pandas as pd
data = pd.read_csv("/data/harris.csv")

**Exercise 1**

First, split the Harris Bank data set into training and test sets. Then, estimate the test error of each of the following models:

- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper}$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2 + \beta_3 \tt{Exper}^3$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2 + \beta_3 \tt{Exper}^3 + \beta_4 \tt{Exper}^4$

Based on your analysis, which of the models do you prefer?

In [2]:
# randomly split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[["Exper"]], data["Bsal"], test_size=10)


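# YOUR CODE HERE.
import numpy as np
from sklearn.linear_model import LinearRegression

# fit the linear model on the training set only
model1 = LinearRegression()
model1.fit(X_train, y_train)

# estimate the test error as the root mean squared error (RMSE) on the test set
np.sqrt(((y_test - model1.predict(X_test)) ** 2).mean())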

740.74801649742642
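The cell above handles only the linear model. One way the higher-degree models could be evaluated is sketched below. It runs on made-up synthetic data rather than the Harris Bank file, so the numbers it prints say nothing about the real data set. The use of `PolynomialFeatures` in a pipeline, the fixed `random_state`, and the names `exper`, `bsal`, and `test_rmse` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical values, not the Harris Bank data).
rng = np.random.default_rng(0)
exper = rng.uniform(0, 30, size=93)
bsal = 5000 + 40 * exper + rng.normal(0, 600, size=93)
X = exper.reshape(-1, 1)

# Hold out 10 observations as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, bsal, test_size=10, random_state=0)

test_rmse = {}
for degree in range(1, 5):
    # PolynomialFeatures builds the columns x, x^2, ..., x^degree.
    # include_bias=False because LinearRegression fits the intercept itself.
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    test_rmse[degree] = np.sqrt((residuals ** 2).mean())

for degree, rmse in test_rmse.items():
    print(degree, rmse)
```

All four models are scored on the same held-out test set, so their test errors are directly comparable.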

## $K$-Fold Cross Validation

The problem with estimating test error from a single train-test split is that the variability in the estimate can be quite high. Depending on which random split we get, the estimate of the test error could be very different.
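To make this variability concrete, the sketch below repeats the train-test split 20 times with different random seeds and records the test RMSE each time. The data are synthetic (hypothetical values), and the number of repetitions and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(93, 1))
y = 5000 + 40 * X[:, 0] + rng.normal(0, 600, size=93)

rmses = []
for seed in range(20):
    # A different random_state gives a different random split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=10, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    rmses.append(np.sqrt((residuals ** 2).mean()))

rmses = np.array(rmses)
print("min: ", rmses.min())
print("max: ", rmses.max())
print("std: ", rmses.std())
```

The gap between the smallest and largest estimate shows how much a single split can mislead, especially when the test set is as small as 10 observations.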

One way to remedy this problem is to use _several_ train-test splits. Each train-test split produces a slightly different estimate of the test error; we can combine them into a single estimate by averaging.

The standard way to obtain several train-test splits is to first divide the data set into $K$ parts, or **folds**. Each fold serves as the test set exactly once, with the remaining $K-1$ folds used for training. This procedure is known as **$K$-fold cross-validation**. A schematic for cross-validation when $K=5$ is shown below.
![5-fold cross-validation](5foldcv.png)

Notice that this procedure gives us $K=5$ separate estimates of the test error, which we can then average to produce a single estimate.
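The procedure described above can be sketched from scratch in a few lines. This is a minimal illustration on synthetic data, not a reference implementation. The helper name `k_fold_mse`, the choice to shuffle the rows before splitting, and the use of mean squared error as the error measure are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def k_fold_mse(model, X, y, k=5, seed=0):
    """Return the k per-fold test MSEs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        # Every fold except fold i is used for training.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        residuals = y[test_idx] - model.predict(X[test_idx])
        errors.append((residuals ** 2).mean())
    return np.array(errors)

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(1)
X = rng.uniform(0, 30, size=(93, 1))
y = 5000 + 40 * X[:, 0] + rng.normal(0, 600, size=93)

fold_errors = k_fold_mse(LinearRegression(), X, y, k=5)
print(fold_errors)         # one estimate per fold
print(fold_errors.mean())  # the combined estimate
```

Every observation appears in exactly one test fold, so each data point contributes to the error estimate once.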

It is not hard to implement cross-validation from scratch. (You should try it!) However, scikit-learn provides a convenient function, 

`cross_val_score(model, X, y, scoring, cv)`, 

that will divide `X` and `y` into `cv` folds, fit `model` to the data with each fold held out in turn, and calculate the `scoring` metric between the predicted and actual $y$ values in the held-out fold. It returns one score per fold. You can read more about `cross_val_score` in the scikit-learn [documentation on cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html).
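A short usage sketch on synthetic data (hypothetical values) follows. Two details are worth noting. First, scikit-learn scorers follow a "higher is better" convention, so the mean squared error is exposed as `"neg_mean_squared_error"` and the returned scores are negative. Second, passing an integer as `cv` for a regression model splits the rows in their existing order without shuffling. Passing a `KFold` object with `shuffle=True`, as done here, is one way to randomize the folds. The variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(93, 1))
y = 5000 + 40 * X[:, 0] + rng.normal(0, 600, size=93)

# Five shuffled folds; random_state makes the split reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=cv)

cv_mse = -scores.mean()    # flip the sign to recover the MSE
cv_rmse = np.sqrt(cv_mse)  # same units as y

print(scores)
print(cv_mse, cv_rmse)
```

Taking the square root puts the estimate in the same units as the response, which makes it easier to compare with an RMSE computed from a single train-test split.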

**Exercise 2**

Repeat Exercise 1, estimating the test error of each of the following models

- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper}$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2 + \beta_3 \tt{Exper}^3$
- $f(\tt{Exper}) = \beta_0 + \beta_1 \tt{Exper} + \beta_2 \tt{Exper}^2 + \beta_3 \tt{Exper}^3 + \beta_4 \tt{Exper}^4$

but use cross-validation instead of a single train-test split.

In [3]:
from sklearn.model_selection import cross_val_score

# YOUR CODE HERE.
model = LinearRegression()

# Estimate the test MSE of polynomial models of degree 1 through 6
# using 10-fold cross-validation.
features = ['Exper']
for degree in range(1, 7):
    if degree > 1:
        # add the next power of Exper as a new feature column
        data[f'Exper{degree}'] = data['Exper']**degree
        features.append(f'Exper{degree}')
    # scikit-learn reports the negated MSE, so flip the sign
    scores = cross_val_score(model, data[features], data['Bsal'], cv=10,
                             scoring="neg_mean_squared_error")
    print(-scores.mean())

518189.584546
480787.701237
478182.043105
431805.790591
431141.035327
435167.95771


In [7]:
data = pd.read_csv("/data/automobiles.csv")

FileNotFoundError: File b'/data/automobiles.csv' does not exist