# STK4030 Final project

## Setup

In [8]:
library(caret)
set.seed(4030)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘caret’

The following object is masked _by_ ‘.GlobalEnv’:

    RMSE



## Helper functions

In [4]:
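AllColumnsExcept = function(data, col.names) {
    # Keep every column whose name is not listed in col.names;
    # drop = FALSE keeps a data frame even when a single column remains
    columns = !(names(data) %in% col.names)
    return(data[, columns, drop = FALSE])
}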

In [5]:
SortData = function(data) {
    # Sort rows by the absolute value of the first column, largest first
    data.as.df = as.data.frame(data)
    sorted.data = data.as.df[order(-abs(data.as.df[, 1])), , drop = FALSE]
    return(sorted.data)
}

## Exercise 1

### Load data

In [6]:
load("bostonhousing.rdata")

training.data = data[data$train == TRUE, ]
training.data = AllColumnsExcept(training.data, 'train')

test.data = data[data$train == FALSE, ]
test.data = AllColumnsExcept(test.data, 'train')

### 1.1 Linear regression

*Estimate a linear Gaussian regression model including all 14 independent variables by (ordinary) least squares (OLS) on the training set.*

In [7]:
lgr.model = lm(y ~ ., training.data)

*Report the estimated coefficients.*

In [8]:
lgr.model$coefficients

*Which covariates have the strongest association with y? In particular, the study focused on the effect of air pollution, measured through the concentrations of nitrogen oxide pollutants (nox) and particulate (part).*

#### Correlations

In [9]:
correlations = cor(training.data, training.data['y'], method = "pearson")
SortData(correlations)

| Variable | Correlation with y |
|---|---|
| y | 1.0 |
| lstat | -0.84057651 |
| rm | 0.62856668 |
| tax | -0.5687532 |
| indus | -0.55433246 |
| nox | -0.52532741 |
| crim | -0.52429359 |
| ptratio | -0.50639657 |
| rad | -0.49478292 |
| age | -0.48137089 |


#### Coefficients on a standardized regression model

In [10]:
scaled.training.data = lapply(training.data, scale)
scaled.model = lm(y ~ ., scaled.training.data)
SortData(scaled.model$coefficients)

| Variable | Standardized coefficient |
|---|---|
| lstat | -0.6245181 |
| rad | 0.3162886 |
| dis | -0.309536 |
| tax | -0.2538477 |
| crim | -0.2029852 |
| nox | -0.197765 |
| ptratio | -0.1907376 |
| rm | 0.1197933 |
| bk | -0.1135118 |
| zn | 0.08091898 |

#### Incremental/partial R²

Gain in R² when each variable is added to the model last.
TODO
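
One possible way to compute this gain, sketched on stand-in data: refit the model once without each predictor and subtract the reduced R² from the full R². `PartialR2` is a hypothetical helper, and `mtcars` with response `mpg` is an assumption replacing `training.data` and `y`.

```r
# Hypothetical helper: gain in R^2 when each predictor enters the model last
PartialR2 = function(data, response) {
  full.formula = reformulate(".", response = response)
  full.r2 = summary(lm(full.formula, data))$r.squared
  predictors = setdiff(names(data), response)
  gains = sapply(predictors, function(predictor) {
    # Refit without this predictor and measure the loss in R^2
    reduced.data = data[, names(data) != predictor, drop = FALSE]
    full.r2 - summary(lm(full.formula, reduced.data))$r.squared
  })
  sort(gains, decreasing = TRUE)
}

# Assumption: mtcars stands in for the housing data
example.data = mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
partial.r2 = PartialR2(example.data, "mpg")
print(partial.r2)
```

The same quantities can be read off `drop1()`: each "Sum of Sq" entry divided by the total sum of squares equals the corresponding gain.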

*Do they have any effect on the house price? If yes, which kind of effect?*

- **part** has a (statistically insignificant) negative effect on housing prices
- **nox** has a (statistically significant) negative effect on housing prices

### 1.2 Evaluation

*The model above can be also used to predict the price for the other tracts (test set).*

*Compute the prediction error on the test data.*

In [11]:
RMSE = function(model, data) {
  predictions = predict(model, data)
  prediction.error = sqrt(mean((predictions - data$y)^2))
  return(prediction.error)
}

In [12]:
RMSE(lgr.model, test.data)

*Moreover, derive two reduced models by applying a backward elimination procedure with AIC and α = 0.05 as stopping criteria, respectively. For both models, report the estimated coefficients and the prediction error estimated on the test data. Comment the results.*

#### Backward elimination with AIC as stopping criterion

In [13]:
aic.elimination.model = lm(y ~ ., training.data)
aic.elimination.model = step(aic.elimination.model, direction = "backward")

Start:  AIC=-882.01
y ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + 
    ptratio + bk + lstat + part

          Df Sum of Sq     RSS     AIC
- age      1    0.0006  6.8805 -883.99
- part     1    0.0252  6.9051 -883.09
- indus    1    0.0358  6.9156 -882.70
<none>                  6.8799 -882.01
- zn       1    0.1113  6.9911 -879.95
- chas     1    0.2300  7.1098 -875.69
- tax      1    0.2736  7.1535 -874.14
- rm       1    0.3140  7.1939 -872.72
- nox      1    0.3304  7.2103 -872.14
- bk       1    0.4368  7.3166 -868.44
- rad      1    0.5016  7.3814 -866.21
- ptratio  1    0.7161  7.5960 -858.96
- crim     1    0.9588  7.8387 -851.00
- dis      1    1.0300  7.9099 -848.71
- lstat    1    5.3725 12.2524 -738.00

Step:  AIC=-883.99
y ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + ptratio + 
    bk + lstat + part

          Df Sum of Sq     RSS     AIC
- part     1    0.0263  6.9068 -885.02
- indus    1    0.0357  6.9162 -884.68
<none>                  6

In [14]:
aic.elimination.model$coefficients

In [15]:
RMSE(aic.elimination.model, test.data)
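
The AIC column printed by `step()` comes from `extractAIC()`, which for a linear model is n·log(RSS/n) + 2·edf. It differs from `AIC()` by an additive constant, so the ranking of models is the same. The sketch below reproduces the number by hand, assuming `mtcars` as stand-in data.

```r
# Assumption: mtcars stands in for training.data
fit = lm(mpg ~ wt + hp + qsec, mtcars)

n = nrow(mtcars)
rss = sum(residuals(fit)^2)
edf = length(coef(fit))  # number of estimated regression coefficients

# AIC as used by step(): n * log(RSS / n) + 2 * edf
manual.aic = n * log(rss / n) + 2 * edf
step.aic = extractAIC(fit)[2]

print(c(manual = manual.aic, step = step.aic))
```

A predictor is dropped whenever removing it lowers this value, which is why rows above `<none>` in the table are candidates for elimination.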

#### Backward elimination with α = 0.05 as stopping criterion

In [16]:
PValues = function(model) {
  return(summary(model)$coefficients[-1, 4])
}

In [17]:
alpha.elimination.model = NULL
alpha.elimination.data = training.data

repeat {
  alpha.elimination.model = lm(y ~ ., alpha.elimination.data)
  p.values = PValues(alpha.elimination.model)

  # Stop once every remaining predictor is significant at the 5% level
  if (all(p.values <= 0.05)) {
    break
  }

  # Otherwise drop the predictor with the largest p-value and refit
  predictor.with.highest.p.value = names(p.values)[which.max(p.values)]
  alpha.elimination.data = AllColumnsExcept(alpha.elimination.data, predictor.with.highest.p.value)
}

In [18]:
alpha.elimination.model$coefficients

In [19]:
RMSE(alpha.elimination.model, test.data)

### 1.3 Principal Component Regression

*Estimate a principal component regression model, selecting the number of components by 10-fold cross-validation.*

In [20]:
RotateData = function(pca, data, response.variable = 'y') {
  rotated.data = as.data.frame(predict(pca, data))
  rotated.data[response.variable] = data[response.variable]
  return(rotated.data)
}

In [21]:
pca = prcomp(AllColumnsExcept(training.data, 'y'), scale = TRUE, center = TRUE)

rotated.training.data = RotateData(pca, training.data)
rotated.test.data = RotateData(pca, test.data)

In [78]:
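PCRModel = function(data, method, number) {
    model = train(
        y ~ .,
        data = data,
        preProc = c("center", "scale"),
        method = "pcr",
        # tuneLength (train has no tuneControl argument) sets how many values
        # of ncomp are tried; the number of predictors lets caret consider
        # as many component counts as it supports
        tuneLength = ncol(data) - 1,
        trControl = trainControl(
            method = method,
            number = number
        ))
    model$num.of.components = model$bestTune$ncomp
    return(model)
}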

In [79]:
pcr.cv.model = PCRModel(training.data, 'cv', 10)

In [77]:
RMSE(pcr.cv.model, test.data)

*How many components have been selected?*

In [62]:
pcr.cv.model$num.of.components

In [22]:
RMSE(lm(y ~ . - PC14, rotated.training.data), rotated.test.data)

*What does it mean?*

TODO
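
To make explicit what the cross-validated choice of components involves, here is a base-R sketch that does not rely on caret: for every fold, the PCA is fitted on the training part only, and the held-out error is recorded for each number of components. The data (`mtcars`), the choice of 5 folds and all variable names are assumptions for illustration.

```r
set.seed(1)

# Assumption: mtcars stands in for the housing data, mpg for y
x = as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y = mtcars$mpg

k = 5  # number of folds (the exercise uses 10; 5 suits this small data set)
folds = sample(rep(1:k, length.out = nrow(x)))
max.comp = ncol(x)
cv.sse = matrix(0, nrow = k, ncol = max.comp)

for (fold in 1:k) {
  held.out = folds == fold
  # Fit the rotation on the training part only to avoid leaking test data
  pca = prcomp(x[!held.out, ], center = TRUE, scale. = TRUE)
  train.scores = pca$x
  test.scores = predict(pca, x[held.out, , drop = FALSE])
  for (m in 1:max.comp) {
    # Regress y on the first m principal components (plus intercept)
    fit = lm.fit(cbind(1, train.scores[, 1:m, drop = FALSE]), y[!held.out])
    predictions = cbind(1, test.scores[, 1:m, drop = FALSE]) %*% fit$coefficients
    cv.sse[fold, m] = sum((y[held.out] - predictions)^2)
  }
}

# Cross-validated RMSE for each number of components, and the minimizer
cv.rmse = sqrt(colSums(cv.sse) / nrow(x))
best.ncomp = which.min(cv.rmse)
print(round(cv.rmse, 3))
print(best.ncomp)
```

If the selected number equals the number of predictors, PCR coincides with OLS on the standardized predictors, since the full set of components spans the same space.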

### 1.4 PCR using .632 bootstrap

*Repeat the procedure to choose the number of components by using the .632 bootstrap procedure.*

In [41]:
pcr.bootstrap.model = PCRModel(training.data, 'boot632', 100)

In [42]:
RMSE(pcr.bootstrap.model, test.data)

*Does the number of selected components change?*

In [43]:
pcr.bootstrap.model$num.of.components

*Report the estimate of the prediction error for each possible number of components.*
    
See above.
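
For reference, the .632 estimate combines the optimistic apparent (training) error with the pessimistic out-of-bag error: 0.368 · apparent + 0.632 · out-of-bag. The base-R sketch below is a simplified version that averages the out-of-bag error per bootstrap sample. It assumes `mtcars` as stand-in data and a plain linear model rather than PCR, and `B = 200` is an arbitrary choice.

```r
set.seed(1)

# Assumption: mtcars stands in for the housing data, mpg for y
example.data = mtcars[, c("mpg", "wt", "hp")]
n = nrow(example.data)
B = 200  # number of bootstrap resamples (arbitrary for this sketch)

# Apparent error: fit and evaluate on the same observations
apparent.mse = mean(residuals(lm(mpg ~ ., example.data))^2)

# Out-of-bag error: evaluate each bootstrap fit on the rows it did not see
oob.mse = sapply(seq_len(B), function(b) {
  in.bag = sample(n, replace = TRUE)
  out.of.bag = setdiff(seq_len(n), in.bag)
  if (length(out.of.bag) == 0) return(NA)
  fit = lm(mpg ~ ., example.data[in.bag, ])
  mean((example.data$mpg[out.of.bag] - predict(fit, example.data[out.of.bag, ]))^2)
})
mean.oob.mse = mean(oob.mse, na.rm = TRUE)

# .632 combination of the two error estimates
mse.632 = 0.368 * apparent.mse + 0.632 * mean.oob.mse
rmse.632 = sqrt(mse.632)
print(c(apparent = sqrt(apparent.mse), oob = sqrt(mean.oob.mse), boot632 = rmse.632))
```

Repeating this for each candidate number of components gives one error estimate per model size, from which the minimizer is selected.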

### 1.5 Ridge regression

*Estimate the regression model by ridge regression, where the optimal tuning parameter λ is chosen by 10-fold cross-validation.*

**TODO**: What are reasonable lambda values?

In [157]:
ridge.model = train(
  y ~ .,
  data = training.data,
  method = "glmnet",
  tuneGrid = expand.grid(
    alpha = 0,
    lambda = 10^seq(-5, 5, by = .1)
  ),
  trControl = trainControl(
    method = 'cv',
    number = 10
  )
)

Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-13

“There were missing values in resampled performance measures.”

*Report the estimated coefficients, the obtained value of lambda and the prediction error computed on the test data.*

In [176]:
coef(ridge.model$finalModel, ridge.model$bestTune$lambda)

15 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  3.8771124229
crim        -0.0080244317
zn           0.0008041282
indus       -0.0002841339
chas         0.1359930995
nox         -0.3888605767
rm           0.0965529950
age         -0.0004318776
dis         -0.0441903838
rad          0.0061303460
tax         -0.0002416576
ptratio     -0.0298469164
bk          -0.3716559594
lstat       -0.0309399065
part        -0.0080482342

In [175]:
ridge.model$bestTune$lambda

In [174]:
RMSE(ridge.model, test.data)
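
To illustrate what λ does, the sketch below computes ridge coefficients in closed form, (XᵀX + λI)⁻¹Xᵀy on standardized predictors, for a few values of λ. `RidgeCoefficients` is a hypothetical helper and `mtcars` is stand-in data. glmnet parameterizes the penalty differently, so these λ values are not directly comparable to `bestTune$lambda` above.

```r
# Hypothetical helper: closed-form ridge solution on standardized predictors
RidgeCoefficients = function(x, y, lambda) {
  x = scale(x)
  y = y - mean(y)
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))[, 1]
}

# Assumption: mtcars stands in for the housing data, mpg for y
x = as.matrix(mtcars[, c("wt", "hp", "qsec", "drat")])
y = mtcars$mpg

# Coefficients along a grid of penalties; each column is one value of lambda
lambdas = 10^seq(-2, 4, by = 1)
ridge.path = sapply(lambdas, function(lambda) RidgeCoefficients(x, y, lambda))
colnames(ridge.path) = paste0("lambda=", lambdas)

# The Euclidean norm of the coefficient vector shrinks as lambda grows
coefficient.norms = sqrt(colSums(ridge.path^2))
print(round(ridge.path, 3))
print(round(coefficient.norms, 3))
```

With λ = 0 the solution is the OLS fit; for very large λ all coefficients approach zero, which is consistent with the warning about missing resampled performance measures at the upper end of a wide grid.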

### 1.6 Lasso regression

*Repeat the same procedure by using lasso and component-wise L2Boost. Use 10-fold cross-validation to find the optimal value for λ (lasso) and mstop (L2Boost), while set the boosting step size ν equal to 0.1.*

#### Lasso

In [177]:
lasso.model = train(
  y ~ .,
  data = training.data,
  method = "glmnet",
  tuneGrid = expand.grid(
    alpha = 1,
    lambda = 10^seq(-5, 5, by = .1)
  ),
  trControl = trainControl(
    method = 'cv',
    number = 10
  )
)

“There were missing values in resampled performance measures.”

In [178]:
coef(lasso.model$finalModel, lasso.model$bestTune$lambda)

15 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  4.3260709940
crim        -0.0089050445
zn           0.0012568908
indus        0.0020492554
chas         0.1261989179
nox         -0.5861530862
rm           0.0714581102
age          .           
dis         -0.0588804573
rad          0.0120748001
tax         -0.0004775532
ptratio     -0.0340596693
bk          -0.3885042779
lstat       -0.0363073007
part        -0.0078546974

In [179]:
lasso.model$bestTune$lambda

In [180]:
RMSE(lasso.model, test.data)
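
The exercise also mentions component-wise L2Boost. A minimal base-R sketch of the algorithm with step size ν = 0.1 follows: at each iteration, every standardized predictor is fitted on its own to the current residuals, and only the best one has its coefficient updated by ν times its least-squares estimate. `L2Boost` is a hypothetical helper, `mtcars` is stand-in data, and the cross-validated choice of mstop is left out.

```r
# Hypothetical helper: component-wise L2Boosting with step size nu
L2Boost = function(x, y, mstop, nu = 0.1) {
  x = scale(x)  # standardize so the predictors are comparable
  intercept = mean(y)
  beta = setNames(numeric(ncol(x)), colnames(x))
  residual = y - intercept
  rss.path = numeric(mstop)
  for (m in seq_len(mstop)) {
    # Least-squares coefficient of each single predictor on the residuals
    ls.coef = as.vector(crossprod(x, residual)) / colSums(x^2)
    rss = sapply(seq_len(ncol(x)), function(j) sum((residual - ls.coef[j] * x[, j])^2))
    best = which.min(rss)
    # Update only the best component, shrunk by nu
    beta[best] = beta[best] + nu * ls.coef[best]
    residual = residual - nu * ls.coef[best] * x[, best]
    rss.path[m] = sum(residual^2)
  }
  list(intercept = intercept, beta = beta, rss.path = rss.path)
}

# Assumption: mtcars stands in for the housing data, mpg for y
x = as.matrix(mtcars[, c("wt", "hp", "qsec", "drat")])
y = mtcars$mpg
boost.fit = L2Boost(x, y, mstop = 200)
print(round(boost.fit$beta, 3))
```

Stopping early (small mstop) leaves some coefficients at exactly zero, which gives a variable-selection effect similar to the lasso; mstop would be tuned by 10-fold cross-validation in the same way as λ.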

### 1.7 Variable transformations

*It has been argued that the predictors rm and dis do not have a linear effect on the outcome. Substitute the former with its cube and the latter with its inverse (dis⁻¹) in the first model (OLS) and refit the model.*

In [181]:
TransformData = function(data) {
  transformed.data = data
  # The exercise asks for the cube of rm and the inverse of dis
  transformed.data$rm.cubed = data$rm^3
  transformed.data$dis.inverse = data$dis^(-1)
  transformed.data = AllColumnsExcept(transformed.data, c('rm', 'dis'))
  return(transformed.data)
}

In [182]:
transformed.training.data = TransformData(training.data)
transformed.test.data = TransformData(test.data)

In [183]:
transformed.model = lm(y ~ ., transformed.training.data)

*Report the estimated coefficients.*

In [184]:
transformed.model$coefficients

*Compute the prediction error on the test set and compare the result with that obtained at point 1.*

In [185]:
RMSE(transformed.model, transformed.test.data)

In [186]:
RMSE(lgr.model, test.data)

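The test-set prediction error of the transformed model is lower than that of the original OLS model from 1.1, so the transformations improved the predictions.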

TODO: Comment/test significance of difference