# The Validation Set Approach

In [1]:
library(ISLR2)

We begin by using the `set.seed()` function to set a _seed_ for `R`'s random number generator. This ensures that the results obtained can be reproduced precisely at a later time.

In [2]:
set.seed(1)
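As a quick illustration of what the seed buys us, the sketch below draws the same random sample twice, resetting the seed before each draw. The variable names `first_draw` and `second_draw` are made up for this example.

```r
# Re-seeding the generator before each draw reproduces the same sample.
set.seed(1)
first_draw <- sample(392, 5)

set.seed(1)
second_draw <- sample(392, 5)

# TRUE: both draws started from the same generator state.
identical(first_draw, second_draw)
```

Without the second call to `set.seed()`, the second draw would continue from wherever the generator left off and would generally differ.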

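The `sample()` function is used to split the observations into two halves by selecting a random subset of 196 observations out of the original 392. The selected observations form the training set, and the remaining ones form the validation set.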

In [3]:
# select a random set of 196 observations out of 392.
train <- sample(392, 196)
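To make the split explicit, here is a minimal sketch that also builds the validation indices. The names `n` and `validation` are introduced only for this illustration; the lab itself never stores the validation indices and relies on negative indexing instead.

```r
set.seed(1)
n <- 392
train <- sample(n, n / 2)

# Indices that were not drawn for training form the validation set.
validation <- setdiff(seq_len(n), train)

length(train)                         # 196
length(validation)                    # 196
length(intersect(train, validation))  # 0: the two halves do not overlap

# Negative indexing drops the training positions and yields the same set.
identical(seq_len(n)[-train], validation)
```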

We use the `subset` option in `lm()` to fit a linear regression using only the observations corresponding to the training set.

In [4]:
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
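The `subset` argument should be equivalent to handing `lm()` only the training rows. The sketch below checks this on `mtcars`, a data set bundled with `R` that is used here purely as a stand-in for `Auto` so that the example runs without extra packages (`hp` plays the role of `horsepower`). The names `fit_subset` and `fit_rows` are illustrative.

```r
# mtcars (bundled with R) stands in for Auto; hp plays the role of horsepower.
set.seed(1)
train <- sample(nrow(mtcars), nrow(mtcars) / 2)

# Fit on the training rows via the subset argument...
fit_subset <- lm(mpg ~ hp, data = mtcars, subset = train)

# ...or by passing only the training rows as the data.
fit_rows <- lm(mpg ~ hp, data = mtcars[train, ])

# Both fits use the same observations, so the coefficients agree.
all.equal(coef(fit_subset), coef(fit_rows))
```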

We now use the `predict()` function to estimate the response for all 392 observations, and use the `mean()` function to calculate the MSE of the 196 observations in the validation set. Note that the `-train` index below selects only the observations that are not in the training set.

In [5]:
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
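The same quantity can be computed without `attach()` by referring to the response through the data frame and predicting only for the held-out rows. The sketch below is one way to do that, again using the bundled `mtcars` data as a stand-in for `Auto` (an assumption made so the example is self-contained); `held_out`, `val_mse`, and `lab_style` are illustrative names.

```r
# mtcars (bundled with R) stands in for Auto; hp plays the role of horsepower.
set.seed(1)
n <- nrow(mtcars)
train <- sample(n, n / 2)

fit <- lm(mpg ~ hp, data = mtcars, subset = train)

# Predict only for the held-out rows, then average the squared errors.
held_out <- mtcars[-train, ]
val_mse <- mean((held_out$mpg - predict(fit, newdata = held_out))^2)

# Same quantity in the style used above: predict everywhere, then drop training rows.
lab_style <- mean((mtcars$mpg - predict(fit, mtcars))[-train]^2)

all.equal(val_mse, lab_style)
```

Both forms give the same validation MSE; the first simply avoids relying on variables placed on the search path.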

We can use the `poly()` function to estimate the test error for the quadratic and cubic regressions.

In [7]:
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)

In [8]:
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
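A side note on `poly()`: by default it builds orthogonal polynomial terms rather than raw powers. The two parameterizations span the same model, so predictions, and therefore the validation MSE, should match. The sketch below checks this on the bundled `mtcars` data, used only as a stand-in for `Auto`; all object names are illustrative.

```r
# mtcars (bundled with R) stands in for Auto; hp plays the role of horsepower.
set.seed(1)
train <- sample(nrow(mtcars), nrow(mtcars) / 2)

# Quadratic fit with orthogonal polynomial terms (the default for poly()).
fit_orth <- lm(mpg ~ poly(hp, 2), data = mtcars, subset = train)

# Quadratic fit with raw powers written out explicitly.
fit_raw <- lm(mpg ~ hp + I(hp^2), data = mtcars, subset = train)

mse_orth <- mean((mtcars$mpg - predict(fit_orth, mtcars))[-train]^2)
mse_raw <- mean((mtcars$mpg - predict(fit_raw, mtcars))[-train]^2)

# The coefficients differ between the two fits, but the validation MSE does not.
all.equal(mse_orth, mse_raw)
```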

The validation set MSEs for the quadratic and cubic regressions are lower than that of the linear regression model. However, the cubic regression is not really better than the quadratic regression. Let's see what we get if we choose a different training set.

In [9]:
set.seed(2)
train <- sample(392, 196)
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((mpg - predict(lm.fit, Auto))[-train]^2)

In [10]:
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)

In [11]:
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)

This is consistent with the previous findings: the quadratic and cubic regressions have lower validation set MSEs than the linear regression, but the cubic regression is not really better than the quadratic regression.
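Repeating the split by hand quickly becomes tedious. The sketch below wraps the computation in a small helper and tabulates the validation MSE over several seeds and polynomial degrees. The helper `validation_mse()` and the objects `seeds`, `degrees`, and `results` are hypothetical names introduced here, and the bundled `mtcars` data stands in for `Auto` so the example runs without extra packages.

```r
# Hypothetical helper: validation set MSE for a polynomial fit of a given degree.
# mtcars (bundled with R) stands in for Auto; hp plays the role of horsepower.
validation_mse <- function(data, degree, train) {
  fit <- lm(mpg ~ poly(hp, degree), data = data, subset = train)
  mean((data$mpg - predict(fit, data))[-train]^2)
}

seeds <- 1:5
degrees <- 1:3

# One column per seed (i.e. per random split), one row per polynomial degree.
results <- sapply(seeds, function(s) {
  set.seed(s)
  train <- sample(nrow(mtcars), nrow(mtcars) / 2)
  sapply(degrees, function(d) validation_mse(mtcars, d, train))
})
dimnames(results) <- list(paste0("degree_", degrees), paste0("seed_", seeds))

round(results, 2)
```

Reading across a row shows how much the estimate for a single model moves from one split to the next, which is the variability the two hand-run splits above hint at.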