---
# Computational Statistics project: 
### Comparing Lasso and Random Forest for variable selection
---

In [1]:
# Imports
library(MASS) 
library(stats) #for fast toeplitz matrix

#Import auxiliary functions
source("auxiliary_functions.R", local=FALSE)

In [2]:
set.seed(456)

---
## 1. Introduction
---

---
## 2. Variable Selection
---

### 2.1 LASSO

### 2.2 RF

### 2.3 Measures

---
## 3. Simulation Study
---

### 3.1 Set-up

We stipulate a sparse, linear data-generating process (DGP). Because the DGP is linear, direct comparisons of the prediction error of the LASSO and the RF are not very informative. What interests us instead is their relative performance across varying levels of data quality.

The model emulates a frequently used DGP popularized by Belloni et al. 2011: the regression model is of the form

$Y = X'\beta_0 + \varepsilon$

The design parameters of interest are:
 - ratio of n observations to p predictors
 - number of non-zero coefficients
 - size of the coefficients
 - signal-to-noise ratio (SNR)
 - distribution of the error terms (?)
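One way to organize these design parameters is a full factorial grid, with one simulation scenario per row. The sketch below is only an illustration: the column names (`n_over_p`, `s`, `b`) and all numeric levels are assumptions, not settings taken from this project.

```r
# Hypothetical design grid; every level below is an illustrative assumption.
design <- expand.grid(
  n_over_p = c(0.5, 1, 2),  # ratio of observations to predictors
  s        = c(5, 10),      # number of non-zero coefficients
  b        = c(0.5, 1),     # size of the non-zero coefficients
  SNR      = c(0.5, 1, 2),  # signal-to-noise ratio
  KEEP.OUT.ATTRS = FALSE
)

# Translate the ratio into a sample size for a fixed number of predictors
p <- 100
design$n <- as.integer(round(design$n_over_p * p))

nrow(design)  # number of scenarios in the grid
head(design)
```

Each row could then be passed to the simulation function in turn, so that every scenario is run with the same code path.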

In [3]:
beta = beta_1(100,5)
df <- simulate(n=100, p=100, rho=0.5, beta=beta, SNR = 1)$df
head(df)

,Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,⋯,X91,X92,X93,X94,X95,X96,X97,X98,X99,X100
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-0.122382006,0.56317,0.3704573,-0.0545722,-0.0822142,0.1531361,1.22995708,0.8436492,-1.06057199,-0.62170175,⋯,0.4170016,0.07188499,1.7569537,-1.2357546,-0.7573094,-1.0863422,-0.5950666,-1.72412345,-0.8902131,-1.16959916
2,-0.782831144,0.5133443,0.0912919,-0.06212029,1.8397683,0.688943,0.02892623,-0.4635067,2.30278351,1.638343192,⋯,1.202369,-1.16942288,0.5023184,-2.0363256,-0.3961266,-0.3435142,0.5424981,-0.3745105,0.2322417,0.08819934
3,0.004918806,-0.3958803,-0.1989434,0.93221793,0.4113333,1.0661119,-0.30410812,0.6746924,1.77901175,0.007572251,⋯,-0.1094645,-0.72973882,0.2452632,-0.2997512,1.233906,0.3824722,0.7566942,-0.68263108,-1.2042095,-1.07676222
4,-1.894713333,-1.8701442,-0.3995208,0.39619932,-0.8112229,-0.7181958,-1.04029536,-0.9069355,-0.43193891,0.356338875,⋯,2.2169479,2.78971876,0.2669565,0.3444761,-0.8653121,0.4045451,-0.4956664,-1.2783843,0.3607651,0.99387953
5,-3.046231083,0.6665568,1.0816778,-0.66746471,-0.8412897,-0.6636646,-0.70527654,-0.9006999,-0.07185995,0.477724391,⋯,0.472265,1.24388102,1.3614714,1.7944987,0.6752546,0.6167135,0.2548338,1.86769023,1.5189985,0.73643199
6,-1.501009056,0.6743191,-0.644686,-0.7983457,0.2206195,1.9161518,0.03111727,-0.7132421,-0.96433065,-1.340048222,⋯,0.7566137,-1.82486251,-1.4517218,-0.3292665,-0.2277125,-1.4206689,-0.5062045,0.07974973,0.8197603,-0.28410004


**Base the correlation coefficient $\rho$ and the sample size on the application data**

What is the signal-to-noise ratio here? $SNR = \frac{Var(X'\beta_0)}{\sigma^2}$
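If the predictors have covariance matrix $\Sigma$, then $Var(X'\beta_0) = \beta_0'\Sigma\beta_0$, so a target SNR fixes the noise level at $\sigma = \sqrt{\beta_0'\Sigma\beta_0 / SNR}$. The following minimal sketch works through this for a small example. The values of `rho`, `beta0` and `SNR` are illustrative assumptions, and the Toeplitz form of $\Sigma$ is the one used by the simulation function in this notebook.

```r
# Illustrative values (assumptions, not the notebook's settings)
rho   <- 0.5
beta0 <- c(1, 1, 0)
SNR   <- 1

# Toeplitz covariance with entries rho^|i - j|
Sigma <- toeplitz(rho^(0:(length(beta0) - 1)))

# Population signal variance: Var(X' beta0) = beta0' Sigma beta0
signal_var <- as.numeric(crossprod(beta0, Sigma %*% beta0))

# Noise standard deviation implied by the target SNR
sigma <- sqrt(signal_var / SNR)

c(signal_var = signal_var, sigma = sigma)
```

Here the signal variance is $1 + 1 + 2 \cdot 0.5 = 3$, so an SNR of 1 requires $\sigma = \sqrt{3}$.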

In [22]:
matrix(NaN, ncol=1)

0
""


In [46]:
simulate <- function(n,    # number of observations
                     p,    # number of covariates
                     rho,  # correlation parameter: Cor(X_i, X_j) = rho^|i-j|
                     beta, # vector of true coefficients (length p)
                     SNR   # desired signal-to-noise ratio
){
  # Fail early instead of silently returning NULL
  if (length(beta) != p){
    stop("Length of beta must equal p")
  }

  # Mean of explanatory variables
  mu <- rep(0, p) # all covariates have population mean zero

  # Variance-covariance matrix
  ### Note: the matrix only depends on p and rho
  toep <- rho^(0:(p-1))   # geometric series 1, rho, rho^2, ...
  Sigma <- toeplitz(toep) # Toeplitz matrix with entries rho^|i-j|

  # Explanatory variables
  X <- MASS::mvrnorm(n, mu, Sigma)

  # Set the noise level from the population signal variance beta' Sigma beta
  # (i.e. the sample variance on an infinitely large test set)
  var_mu <- as.numeric(t(beta) %*% Sigma %*% beta)
  sigma <- sqrt(var_mu/SNR)

  # Generate response variable
  Y <- as.numeric(X %*% beta + rnorm(n)*sigma)

  # Collect data and the implied noise standard deviation
  df <- data.frame(Y, X)
  return(list("df" = df, "sigma" = sigma))
}
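To make the covariance structure concrete, the sketch below builds the Toeplitz matrix for a small case and checks it against an entry-by-entry construction of $\rho^{|i-j|}$. The values of `p` and `rho` are illustrative assumptions chosen to keep the matrix readable.

```r
p   <- 4    # illustrative dimension (assumption)
rho <- 0.5  # illustrative correlation parameter (assumption)

# Covariance matrix as constructed in the simulation function
Sigma <- toeplitz(rho^(0:(p - 1)))

# Same matrix built entry by entry: Sigma[i, j] = rho^|i - j|
Sigma_check <- rho^abs(outer(1:p, 1:p, "-"))

all.equal(Sigma, Sigma_check)
Sigma
```

Neighbouring predictors are correlated with strength `rho`, and the correlation decays geometrically with the distance between their indices.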

In [57]:
beta = beta_1(10,5)
df <- simulate(n=100, p=10, rho=0.5, beta=beta, SNR = 1)$df
head(df)

,Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,3.104494411,0.3812724,0.1655085,1.0202853,0.579985,0.8404607,0.2883465,0.3780697,0.31273187,1.1412918,0.4468788
2,0.006126285,0.6167204,0.2686988,0.2619658,0.8981043,-0.1682565,-1.14249695,-0.820353,-0.2420147,-0.1913897,-1.3175664
3,-5.116991072,-0.8144163,-0.9008645,-1.9297572,-1.0541327,-1.1009114,-0.9659049,0.2355541,0.55199905,1.0473717,-1.2402259
4,1.16649185,-0.6025551,0.2195556,0.1435758,0.1700888,1.4984011,0.96469378,2.506206,0.8430907,-0.3804716,0.3748146
5,-1.048299099,0.1335362,1.1482252,0.6305707,-0.2958468,-1.1228392,-1.56732163,-0.2237386,-0.07384976,0.8563595,0.2030239
6,-7.010889129,-1.0089412,-0.6133604,-2.4162544,-2.1024918,-1.10781,-0.08249021,-1.0546264,-0.6551639,-1.3670825,0.3066135


In [54]:
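# Monte Carlo check: does the empirical SNR match the target on average?
beta = beta_1(10,5)
n_rep = 1000
container = numeric(n_rep) # empirical SNR per replication
for (i in 1:n_rep){
  sim <- simulate(n=100, p=10, rho=0.9, beta=beta, SNR = 2)
  X = data.matrix(sim$df[,-1])
  signal = as.numeric(X %*% beta) # noiseless part of Y
  container[i] = var(signal)/(sim$sigma^2)
}
mean(container) # should be close to the target SNR of 2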

### 3.2 Case 1 - Baseline

We first consider the case without special collinearity among the predictors with non-zero coefficients.

---
## 4. Application
---

---
# Reference Section
---
* Belloni et al. 2011