# Chapter 5 - Resampling Methods  
Once we have fitted a model to a set of observations, we want to estimate how well it will perform on new data. Three approaches are discussed in the book: the validation set approach, cross-validation and the bootstrap.

The validation set approach consists of splitting the available data set into two parts: the training set, used to fit the model, and the validation set, used to test its performance. The drawback of this approach is that the model is trained on only part of the data, so it tends to perform worse than a model trained on the full data set would, and the validation error tends to overestimate the test error.

Cross-validation addresses this problem by splitting the observations into k subsets (folds), using one fold to test the model trained on the other k - 1 folds, repeating the procedure k times so that each fold is held out once, and averaging the k error estimates. The parameter k is usually set to 5 or 10, and setting k = n gives leave-one-out cross-validation. Because each model is trained on a fraction (k - 1)/k of the data, the loss of training data that affects the validation set approach has much less impact.
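
As a rough illustration of the procedure described above, the sketch below runs 5-fold cross-validation by hand for a simple linear regression. The data frame `sim` and the variable names are made up for this example, and the error measure is the MSE rather than a classification error rate.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

k <- 5
# Assign each observation to one of k folds at random
fold <- sample(rep(1:k, length.out = n))

cv.mse <- numeric(k)
for (j in 1:k) {
    fit <- lm(y ~ x, data = sim[fold != j, ])          # train on the other k - 1 folds
    pred <- predict(fit, newdata = sim[fold == j, ])   # predict the held-out fold
    cv.mse[j] <- mean((sim$y[fold == j] - pred)^2)     # MSE on the held-out fold
}
cv.estimate <- mean(cv.mse)  # average of the k fold errors
cv.estimate
```

For generalized linear models the `cv.glm()` function in the 'boot' package automates the same loop.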

The bootstrap is a technique that can be used, even with small data sets, to estimate the accuracy of an estimate, for example its standard error. It is similar to cross-validation in that it reuses the available observations, but the bootstrap samples are created by drawing observations at random, with replacement, from the available observations.
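
A minimal sketch of the idea, using only base R and a made-up sample `x` (not data from the book): each bootstrap sample has the same size as the original and is drawn with replacement, and the spread of the statistic across samples serves as an estimate of its standard error.

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # hypothetical small sample

B <- 1000                           # number of bootstrap samples
boot.medians <- numeric(B)
for (b in 1:B) {
    idx <- sample(length(x), replace = TRUE)  # draw n indices with replacement
    boot.medians[b] <- median(x[idx])         # statistic on the bootstrap sample
}
se.median <- sd(boot.medians)       # bootstrap estimate of the standard error
se.median
```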

### 5.R Review Questions

In [1]:
load("data/5.R.RData")
dim(Xy); names(Xy)

The standard error for $\beta_1$ reported by `summary(fit)` is 0.02593.

In [10]:
fit <- lm(y ~ X1 + X2, data = Xy)
summary(fit)


Call:
lm(formula = y ~ X1 + X2, data = Xy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.44171 -0.25468 -0.01736  0.33081  1.45860 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.26583    0.01988  13.372  < 2e-16 ***
X1           0.14533    0.02593   5.604 2.71e-08 ***
X2           0.31337    0.02923  10.722  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5451 on 997 degrees of freedom
Multiple R-squared:  0.1171,	Adjusted R-squared:  0.1154 
F-statistic: 66.14 on 2 and 997 DF,  p-value: < 2.2e-16


In [11]:
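beta1.fn <- function(data, index) {
    X1 <- data$X1[index]  # resampled X1 (not used by lm() below)
    X2 <- data$X2[index]  # resampled X2 (not used by lm() below)
    # lm() takes y, X1 and X2 from `data`, i.e. the full data set, so `index` has no effect;
    # fitting on data[index, ] instead would use the resampled rows
    fit <- lm(y ~ X1 + X2, data = data) 
    coef(fit)[2]  # coefficient of X1
}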

In [12]:
library(boot)
boot(Xy, beta1.fn, R = 100)


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Xy, statistic = beta1.fn, R = 100)


Bootstrap Statistics :
     original  bias    std. error
t1* 0.1453263       0           0
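
The zero bias and zero standard error in this output follow from how `beta1.fn` is written: `X1` and `X2` are subset by `index` but never used, and `lm()` is fitted on the full data, so every replicate returns the same coefficient. The sketch below shows one way to make the statistic depend on the resampled rows. It uses a simulated stand-in for `Xy`, and the names `sim` and `beta1.resample.fn` are hypothetical, so its numbers say nothing about the review question itself.

```r
library(boot)

# Simulated stand-in for Xy (hypothetical; the real data come from 5.R.RData)
set.seed(1)
n <- 200
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$y <- 0.3 + 0.15 * sim$X1 + 0.3 * sim$X2 + rnorm(n, sd = 0.5)

beta1.resample.fn <- function(data, index) {
    fit <- lm(y ~ X1 + X2, data = data[index, ])  # refit on the resampled rows
    coef(fit)[2]                                   # coefficient of X1
}

b <- boot(sim, beta1.resample.fn, R = 100)
b
```

With this version the replicates in `b$t` differ from one another, so the reported standard error is no longer zero.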

### 5.3.1 The Validation Set Approach
The validation set approach is used to estimate the mean squared error (MSE)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2$$

where $y_i$ is an observed response and $\hat{f}(x_i)$ is the corresponding prediction from the fit. The set of available observations is split at random into two subsets. One subset is used for training and the other, the validation set, is used to test the quality of the fit by computing the MSE on unseen data. In the equation, n is the size of the validation set. In this example we use the Auto data set from the ISLR package.

In [2]:
library(ISLR)
summary(Auto); dim(Auto)

      mpg          cylinders      displacement     horsepower      weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22   Min.   :1613  
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   90     : 20   1st Qu.:2223  
 Median :23.00   Median :4.000   Median :146.0   88     : 19   Median :2800  
 Mean   :23.52   Mean   :5.458   Mean   :193.5   110    : 18   Mean   :2970  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17   3rd Qu.:3609  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14   Max.   :5140  
                                                 (Other):287                 
  acceleration        year           origin                  name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   ford pinto    :  6  
 1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000   amc matador   :  5  
 Median :15.50   Median :76.00   Median :1.000   ford maverick :  5  
 Mean   :15.56   Mean   :75.99   Mean   :1.574   toyota corolla:  5  
 3rd Qu.:17.10   3rd Qu.:7

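We fit a linear model using a random subset of the observations, then predict mpg for the observations that were not used for training and compute the MSE on them. In this notebook session the cell below stops with the error shown after it: the message reports that horsepower is a factor, which matches the summary above, where horsepower is listed by level counts rather than quantiles. In a session where horsepower is numeric the same code runs, as it does in R and RStudio with the script chapter5.R in this same folder.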

In [5]:
set.seed(1)
train <- sample(nrow(Auto), 196) # draw 196 of the 397 row indices at random for training
Auto.train <- Auto[train, ]
Auto.val <- Auto[-train, ]
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train) # fit the model on the training rows
lm.val <- predict(lm.fit, Auto.val) # predict mpg for the held-out observations
mpg.val <- Auto$mpg[-train] # observed mpg for the held-out observations
square_residuals <- (mpg.val - lm.val)^2 # squared prediction errors on the validation set
mse <- mean(square_residuals) # validation-set MSE
#mse

ERROR: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor horsepower has new levels 102, 107, 108, 132, 133, 138, 152, 190, 198, 200, 208, 210, 215, 220, 53, 54, 61, 64, 82, 91, 93
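
The error says that `horsepower` is a factor and that the validation rows contain levels that never occurred in the training rows, which `predict()` cannot handle. A possible workaround, sketched here on a made-up data frame `cars` rather than on Auto, is to convert the factor to numeric before splitting. The conversion goes through `as.character()` so that the labels, not the internal level codes, become the numbers, and any label that is not a number becomes `NA`.

```r
# Hypothetical data frame in which a numeric predictor was read in as a factor
set.seed(1)
n <- 200
hp <- round(runif(n, 50, 200))
cars <- data.frame(mpg = 40 - 0.15 * hp + rnorm(n, sd = 3),
                   horsepower = factor(hp))

# Convert via character so the labels, not the level codes, become numbers
cars$horsepower <- as.numeric(as.character(cars$horsepower))
cars <- na.omit(cars)  # drop rows whose label was not a number, if any

train <- sample(nrow(cars), nrow(cars) / 2)                # random half for training
fit <- lm(mpg ~ horsepower, data = cars, subset = train)   # fit on the training rows
pred <- predict(fit, cars[-train, ])                       # predict the held-out rows
val.mse <- mean((cars$mpg[-train] - pred)^2)               # validation-set MSE
val.mse
```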


### 5.3.4 The Bootstrap
The bootstrap is a technique to obtain new samples from a given data set by drawing observations at random, with replacement, from the original data set. As an example, suppose we have two stocks with returns X and Y, and we want to invest a fraction $\alpha$ of our money in X and the rest in Y in such a way that the variance of the portfolio, which represents its risk, is minimized:

$$Var(\alpha X + (1 - \alpha) Y)$$

The value of $\alpha$ that minimizes this variance is

$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2 \sigma_{XY}}$$
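
To see where this expression comes from, write $\sigma_X^2 = Var(X)$, $\sigma_Y^2 = Var(Y)$ and $\sigma_{XY} = Cov(X, Y)$ and expand the variance:

$$Var(\alpha X + (1 - \alpha) Y) = \alpha^2 \sigma_X^2 + (1 - \alpha)^2 \sigma_Y^2 + 2 \alpha (1 - \alpha) \sigma_{XY}$$

Setting the derivative with respect to $\alpha$ to zero gives

$$2 \alpha \sigma_X^2 - 2 (1 - \alpha) \sigma_Y^2 + 2 (1 - 2 \alpha) \sigma_{XY} = 0$$

and collecting the terms in $\alpha$ yields $\alpha (\sigma_X^2 + \sigma_Y^2 - 2 \sigma_{XY}) = \sigma_Y^2 - \sigma_{XY}$, which is the formula above.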

In practice $\sigma_X^2, \sigma_Y^2$ and $\sigma_{XY}$ are unknown and must be estimated from the data. We can use the bootstrap to draw many samples from the original data set and so assess the accuracy of the resulting estimate of $\alpha$.

In [15]:
summary(Portfolio); dim(Portfolio)

       X                  Y           
 Min.   :-2.43276   Min.   :-2.72528  
 1st Qu.:-0.88847   1st Qu.:-0.88572  
 Median :-0.26889   Median :-0.22871  
 Mean   :-0.07713   Mean   :-0.09694  
 3rd Qu.: 0.55809   3rd Qu.: 0.80671  
 Max.   : 2.46034   Max.   : 2.56599  

We define `alpha.fn`, a function that takes as input the data (the stock returns) and a vector of indices selecting the observations to use, and returns the estimate of $\alpha$ computed from those observations.

In [14]:
alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    return ((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X,Y)))
}

We use the boot() function from the 'boot' package to create random samples with replacement from the original data set and to estimate $\alpha$ and its standard error.

In [18]:
set.seed(1)
boot(Portfolio, alpha.fn, R = 1000)


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5758321 -0.001596422  0.09376093
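
The three columns of this table can be recovered from the object that boot() returns. The sketch below illustrates this on simulated returns (the data frame `returns` is made up, not the Portfolio data): `t0` holds the statistic on the original data and `t` holds the bootstrap replicates.

```r
library(boot)

# Hypothetical returns for two correlated assets (not the Portfolio data)
set.seed(1)
n <- 100
X <- rnorm(n)
returns <- data.frame(X = X, Y = 0.5 * X + rnorm(n))

alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

b <- boot(returns, alpha.fn, R = 1000)

alpha.hat <- b$t0                # statistic on the original data ("original")
alpha.bias <- mean(b$t) - b$t0   # mean of the replicates minus the original ("bias")
alpha.se <- sd(b$t)              # standard deviation of the replicates ("std. error")
c(original = alpha.hat, bias = alpha.bias, std.error = alpha.se)
```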