You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your Gradescope account.

Grading scheme: 10 points per question, total of 50.
Due date: 11:59 PM March 3, 2016 (Thursday evening).


## Motivation for the problems

In the first three questions, we consider a few common "algorithms"
or "workflows" that might be thought of as "data snooping". The
objective of these simulations is to get a sense of how data snooping affects inference.

The last two questions are related to penalized regression and comparing the ridge and LASSO estimators.

# Question 1

In this problem, we will look at a simple form of data snooping and try to make
some adjustments. We will be simulating data here, so that we know the answers.

1. Write a function, `f1`, that takes as arguments a design matrix $X_{n \times p}$ and a response vector $Y_{n \times 1}$,
and runs forward stepwise regression using AIC as the criterion, taking at most one step and returning the resulting `lm` object. What are the possible return values?

2. Modify the function so that when a variable has been chosen it returns the $p$-value from the `summary` table, otherwise return `NA`. Call this new function `f2`.

3. Call this function many times, with `X=matrix(rnorm(n*p), n, p); Y=rnorm(n)` for $n=100, p=20$, collecting the $p$-values that are not `NA`. Plot the distribution function of the $p$-values using `ecdf`. Do they look like $p$-values from a uniform distribution? What can you conclude about the $p$-values from this algorithm?

4. Modify your function again to randomly choose 50% of the data to use with `step`. When at least one variable has been chosen, use that variable and the remaining 50% of the data to compute an appropriate $p$-value. Call this new function `f3`.

5. Call this new function many times, with `X=matrix(rnorm(n*p), n, p); Y=rnorm(n)` for $n=100, p=20$, collecting the $p$-values that are not `NA`. Plot the distribution function of the $p$-values using `ecdf`. Do they look like $p$-values from a uniform distribution? What can you conclude about the $p$-values from this algorithm?
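
The uniformity checks in parts 3 and 5 can be sketched as follows. This is only a minimal illustration: the helper name `check_uniform` is hypothetical, and the stand-in $p$-values come from a two-sample $t$-test with no selection step, not from `f2` or `f3`.

```r
# Hypothetical helper (not part of the assignment): plot the empirical CDF of a
# vector of p-values against the Uniform(0, 1) CDF, ignoring NA entries.
check_uniform <- function(pvals) {
  pvals <- pvals[!is.na(pvals)]
  plot(ecdf(pvals), main = "ECDF of p-values",
       xlab = "p-value", ylab = "proportion <= p")
  abline(0, 1, lty = 2, col = "red")  # the Uniform(0, 1) CDF is the diagonal
  # Largest vertical gap between the ECDF and the diagonal
  # (the Kolmogorov-Smirnov distance), returned invisibly.
  invisible(unname(ks.test(pvals, "punif")$statistic))
}

set.seed(1)
# Stand-in p-values from a test with no selection step:
# a two-sample t-test when both samples are N(0, 1).
P <- replicate(500, t.test(rnorm(30), rnorm(30))$p.value)
D <- check_uniform(c(P, NA))  # the NA is dropped inside the helper
```

Replacing `P` with the $p$-values collected from your own function should show whether the curve stays close to the diagonal.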




# Question 2

In the setting of Question 1 above, let's consider the function `f2` and the case $p=1$ and $n$ really large
so that the relevant $T$ distribution is basically $N(0,1)$.

1. Modify the function `f2` so that it returns the $T$ statistic for the model rather than
the $p$-value, still returning `NA` when no step was taken. Call the function `f4`.

2. In what situations does your function return a $T$ statistic rather than `NA`? Can you
express this event in terms of the $T$ statistic (at least approximately)?

3. What does part 2. say about the distribution of the $T$ statistics you observe when calling this function? Can you modify `f4` so that it returns a $p$-value that is approximately
uniform when `X=rnorm(n); Y=rnorm(n)`? Call this function `f5`.

4. Call the function `f5` many times with `X=rnorm(n); Y=rnorm(n)`. Verify that you have found a $p$-value that has the correct behaviour when $X$ is independent of $Y$.
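
As a generic illustration of conditioning on a selection event, the sketch below assumes that a standard normal statistic $Z$ is only reported when $|Z| > c$. The threshold `c0 = 1.5` is a hypothetical value chosen for the example, not the threshold implied by AIC; working that out is part 2.

```r
# Assumption for illustration only: the statistic is reported when |Z| > c0.
c0 <- 1.5
set.seed(2)
Z <- rnorm(20000)
Zsel <- Z[abs(Z) > c0]             # what we observe after selection

# Naive two-sided p-value, ignoring the selection step
naive <- 2 * pnorm(-abs(Zsel))

# P(|Z| > |z| given |Z| > c0) = 2 * pnorm(-|z|) / (2 * pnorm(-c0))
adjusted <- pnorm(-abs(Zsel)) / pnorm(-c0)

plot(ecdf(naive), main = "Naive vs. adjusted p-values", xlab = "p-value")
plot(ecdf(adjusted), add = TRUE, col = "blue")
abline(0, 1, lty = 2, col = "red")
```

The naive $p$-values can never exceed `2 * pnorm(-c0)`, while the adjusted ones are spread over the whole unit interval.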


# Question 3 

In the midterm, we considered a regression problem something along the lines of
$$
Y = X\beta + W\gamma + Z\delta + \epsilon
$$
where $X_{n \times k}$, $W_{n \times 1}$ and $Z_{n \times 2}$.

We had at least two null hypotheses floating around:
$$
H_{0,Z}: \delta_1=\delta_2 = 0
$$
and
$$
H_{0,W}: \gamma = 0.
$$

For the purposes of this problem, we want specific matrices so take `X=diabetes$x` where `diabetes`
can be found in the `lars` package. Generate `W` and `Z` as follows.

```r
library(lars)
data(diabetes)
X = diabetes$x
Y = diabetes$y
set.seed(0)
W = rnorm(nrow(X))
Z = matrix(rnorm(2*nrow(X)), nrow(X), 2)
Z[,1] = Z[,1] + 0.3 * W
Z[,2] = Z[,2] - 0.1 * W
```

1. When $H_{0,Z}$ is true, there are two different $T$ statistics we can use to test $H_{0,W}$. Compute them and their corresponding $p$-values.

2. Having fixed the design `cbind(X,W,Z)`, write a function that takes an argument `Y`,
fits a linear regression model with the above design matrix (including an intercept) and 
returns these two $T$ statistics and corresponding $p$-values. Call this function `g1`.

3. Using the coefficients of `lm(Y~X)` and the estimated $\sigma^2$, write a function that generates $Y$ with these coefficients and noise level under the scenario when $H_{0,W}$ and $H_{0,Z}$ are true.

4. Using the data generating model of 3., call `g1` many times and look at the corresponding
$p$-values. Do they seem to be uniform?

5. Modify the function `g1` so that it first runs the $F$ test to test $H_{0,Z}$. When
the decision is not to reject $H_{0,Z}$ at a 5% level, the function should return the two different $p$-values for testing $H_{0,W}$, otherwise return `NA`. Call this function `g2`. Repeat 4. 

6. There is also an $F$ test to test 
$$
H_{0,W} \cap H_{0,Z}: \delta_1=\delta_2=\gamma=0.
$$
Modify function `g2` to return the corresponding $p$-value when the  decision is not to reject $H_{0,Z}$ at a 5% level. Call this function `g3`. Repeat 4.

7. Repeat parts 5. and 6. when the decision about $H_{0,Z}$ was to reject it at a 5% level.
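
The mechanics of extracting a $T$ statistic and an $F$ test $p$-value from fitted models can be sketched as below. To keep the example self-contained, a small simulated matrix stands in for `diabetes$x`; the sizes and coefficients are assumptions made for illustration.

```r
set.seed(0)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)   # stand-in for diabetes$x (assumption)
W <- rnorm(n)
Z <- matrix(rnorm(2 * n), n, 2)
Z[, 1] <- Z[, 1] + 0.3 * W
Z[, 2] <- Z[, 2] - 0.1 * W
# Both nulls hold here: gamma = 0 and delta = 0.
Y <- drop(X %*% c(1, -1, 0.5)) + rnorm(n)

full    <- lm(Y ~ X + W + Z)      # intercept included by default
reduced <- lm(Y ~ X + W)          # the same model without Z

# T statistic and p-value for W in the full model
t_W <- summary(full)$coefficients["W", c("t value", "Pr(>|t|)")]

# F test of H_{0,Z}: compare the nested models
F_tab <- anova(reduced, full)
p_F <- F_tab[2, "Pr(>F)"]
```

Wrapping these lines in a function of `Y` gives the general shape that `g1`, `g2` and `g3` could take.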


# Question 4

In this problem, we will build a regression model using the LASSO and cross-validation.

The features can be found [here](http://stats203.stanford.edu/data/X_3TC.csv)
and the response [here](http://stats203.stanford.edu/data/Y_3TC.csv). The variable names are
[here](http://stats203.stanford.edu/data/muts_3TC.csv).

1. Load the features and the response into `R` as the variables `X` and `Y`, respectively. (Make sure they are matrices by calling: `X=as.matrix(X); Y=as.matrix(Y)`.)

2. Choose a model using `cv.glmnet` in the package `glmnet`. The returned object will have
components `lambda.min` and `lambda.1se`. The exact coefficients
at a specific value of `lambda`, say `l`, can be computed as
`G=glmnet(X,Y); coef(G, s=l, exact=TRUE)`. Find the model chosen at `lambda.min`. How does the model
chosen with this value of `lambda` compare to the model chosen in [class](http://nbviewer.jupyter.org/url/web.stanford.edu/class/stats203/notebooks/Data%20snooping.ipynb)? 

3. Randomly split the data, using 80% of the data with `cv.glmnet` to choose `lambda`. 
Find the variables chosen by the `lambda.min` rule. 
Fit a regression model using the
remaining 20% of the data and report confidence intervals for the coefficients
of all the selected variables. Repeat the procedure many times. Do the same variables
always get selected? Even when exactly the same variables are chosen, are the $p$-values the same?

4. To ensure we all have a consistent set of answers, run `set.seed(1)` and repeat 3. once. Report
the variables and confidence intervals for the coefficient of each variable selected.
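
One possible shape for the split-then-refit procedure in part 3 is sketched below. It uses simulated stand-in data so that it runs without downloading the 3TC files; the dimensions, column names `V1`, ..., `V10` and coefficients are assumptions made for illustration. It also reads coefficients with `s = "lambda.min"` rather than `exact=TRUE`, since recent versions of `glmnet` appear to require the original `x` and `y` to be passed again when `exact=TRUE` is used.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)                   # stand-in features
colnames(X) <- paste0("V", 1:p)
Y <- drop(X[, 1:3] %*% c(2, -2, 1.5)) + rnorm(n)  # stand-in response

# Use a random 80% of the rows to choose lambda and select variables.
train <- sample(n, size = 0.8 * n)
cvfit <- cv.glmnet(X[train, ], Y[train])
b <- as.matrix(coef(cvfit, s = "lambda.min"))     # (p + 1) x 1, intercept first
selected <- rownames(b)[-1][b[-1, 1] != 0]

# Refit by least squares on the held-out 20%, using only the selected variables.
holdout <- data.frame(Y = Y[-train], X[-train, selected, drop = FALSE])
refit <- lm(Y ~ ., data = holdout)
ci <- confint(refit)                              # 95% intervals by default
```

Because `sample` and the cross-validation folds are both random, repeating these lines without resetting the seed will generally give different splits and possibly different selected sets.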

# Question 5

Consider the ridge regression algorithm:
$$
\hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 + \frac{\lambda}{2} \|\beta\|^2_2.
$$

Suppose 
$$
Y|X=X\beta^* + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I).
$$

1. As a function of $\lambda$, compute
$$
MSE(\lambda) = E(\|\beta^*-\hat{\beta}_{\lambda}\|^2_2).
$$

2. Write a function to generate $Y$ where $\beta^*$ are the OLS coefficients for the full model in the 3TC data in Question 4, and $\sigma^2$ is the OLS estimate of $\sigma^2$ from the full model.
Compute
$$
\lambda^* = \text{argmin}_{\lambda} MSE(\lambda).
$$
(This will be easiest to do numerically by plotting $MSE$ as a function of $\lambda$. You might try to get an expression for $\frac{d}{d\lambda}MSE(\lambda)$ but it will be difficult to solve exactly, though it will be expressible in terms of the eigenvalues of $X^TX$, $\lambda$, $\beta^*$ and $\sigma^2$.)

3. For these values of
$\beta^*, \sigma^2$, how much of an improvement in $MSE(\lambda^*)$ do you see over the OLS estimators (i.e. $\lambda=0$)? Do your simulations confirm this much of an improvement?

4. Use the data generating function from 2., but this time estimate $\beta^*$ using the LASSO at `lambda.min`, chosen by 5-fold cross-validation on 100% of the data (i.e. do not split the data before finding `lambda.min`). Numerically, how does the $MSE$ of this procedure compare to the best ridge estimator, i.e. the ridge estimator with the optimal value $\lambda^*$?
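
For the simulation checks in this question, a Monte Carlo estimate of $MSE(\lambda)$ could be sketched as follows. The ridge solution used is the closed-form minimizer of the objective above (no intercept, no standardization); the helper names `ridge` and `mc_mse`, the design and the coefficients are hypothetical stand-ins, not the 3TC values.

```r
# Closed-form minimizer of (1/2)||Y - X b||^2 + (lambda/2)||b||^2:
# solves (X'X + lambda I) b = X'Y.
ridge <- function(X, Y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y)))
}

# Monte Carlo estimate of E||beta_star - beta_hat_lambda||^2 for a fixed design.
mc_mse <- function(X, beta_star, sigma, lambda, nsim = 500) {
  mu <- drop(X %*% beta_star)
  mean(replicate(nsim, {
    Y <- mu + sigma * rnorm(nrow(X))
    sum((ridge(X, Y, lambda) - beta_star)^2)
  }))
}

set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)    # hypothetical design
beta_star <- c(1, 0.5, 0, 0, -0.5)   # hypothetical coefficients
lambdas <- c(0, 1, 5, 20)
mse_hat <- sapply(lambdas, function(l) mc_mse(X, beta_star, sigma = 2, lambda = l))
plot(lambdas, mse_hat, type = "b", xlab = "lambda", ylab = "estimated MSE")
```

Overlaying your analytic $MSE(\lambda)$ curve on such a plot is one way to check the formula from part 1 against simulation.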
