# Week 6: High-Dimensional Methods and Confidence Intervals

The purpose of this week's problem set is to become familiar with inference based on high-dimensional methods. We again focus on Lasso-based methods and again use the <tt>housing.csv</tt> dataset (see the previous problem set for details on the data). Note that the focus has shifted from prediction (of house prices) to inference (about the drivers of house prices).

We first read the data into Python and remove observations with missing values.

In [1]:
# Read data
import pandas as pd
housing = pd.read_csv("housing.csv")
print("The number of rows and columns are {} and also called shape of the matrix".format(housing.shape)) # data dimensions
print("Column names are \n {}".format(housing.columns))
print(housing.head()) # first observations
print(housing.tail()) # last observations
print(housing.dtypes) # data types
housing = housing.dropna() # drop observations with a missing bedroom count

The number of rows and columns are (20640, 10) and also called shape of the matrix
Column names are 
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  


We model house prices (<tt>median_house_value</tt>) using a linear (in the parameters) model of the basic regressors (minus the categorical variable <tt>ocean_proximity</tt>),

$$
\underbrace{\mathtt{median\,house\,value}}_{=Y}=\alpha\times\underbrace{\mathtt{median\,income}}_{=D} + Z'\gamma + \varepsilon,\quad\mathrm{E}[\varepsilon|D,Z]=0.
$$

Here we focus on constructing a confidence interval for the coefficient on <tt>median_income</tt> after having used the Lasso. In doing so, we treat both <tt>median_income</tt> and the remaining ($p=7$) controls as exogenous. Moreover, we augment the above model with a second linear model,

$$
\mathtt{median\,income}=Z'\psi + \nu,\quad\mathrm{E}[\nu|Z]=0,
$$

now for <tt>median_income</tt>.

(One would be hard pressed to claim that median income *causes* house price movements. This is only an exercise in the mechanics.)

In [2]:
y = housing.median_house_value
d = housing.median_income
Z = housing.drop(["median_house_value","median_income","ocean_proximity"],axis=1)

# Exercises
Complete the following exercises using only the eight basic regressors.

### Question 1 
Lasso <tt>median_house_value</tt> on the controls, Z, using the (feasible) Bickel-Ritov-Tsybakov (BRT) penalty level. [Hint: Don't forget to standardize.]

\begin{align}
    \hat{\lambda}^{BRT} &= \frac{2 c \sigma}{\sqrt{N}} \Phi^{-1}\left(1-\frac{\alpha}{2 p}\right) \sqrt{\max_{1 \leq j \leq p} \frac{1}{N} \sum_{i=1}^N X_{ij}^2} \\
    &= \frac{2 c \sigma}{\sqrt{N}} \Phi^{-1}\left(1-\frac{\alpha}{2 p}\right)
\end{align}

The second equality holds only if the regressors are standardized, in which case $\frac{1}{N}\sum_{i=1}^N X_{ij}^2=1$ for every $j$ and the square-root term equals 1. In this formula $\alpha$ denotes the significance level, not the coefficient on <tt>median_income</tt>.
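
As a rough check of the claim above, the following sketch computes both versions of the penalty on synthetic data. The data, the seed, and the variable names are made up for illustration and are not the housing data; `sigma` is taken to be the standard deviation of the outcome, which is the choice the notebook's `BRT` function makes.

```python
import numpy as np
from scipy.stats import norm

# Synthetic stand-in data (hypothetical; not the housing data)
rng = np.random.default_rng(0)
N, p = 500, 7
X = rng.normal(loc=3.0, scale=[1, 2, 3, 4, 5, 6, 7], size=(N, p))
y = X[:, 0] + rng.normal(size=N)

# Column-wise standardization (population std, ddof=0)
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)

# The term under the square root: max_j (1/N) * sum_i X_ij^2
max_second_moment = np.max(np.mean(X_tilde**2, axis=0))

c, alpha = 1.1, 0.05
sigma = np.std(y)  # stand-in for the unknown error standard deviation
base = 2 * c * sigma / np.sqrt(N) * norm.ppf(1 - alpha / (2 * p))
lam_full = base * np.sqrt(max_second_moment)  # first line of the formula
lam_short = base                              # second line of the formula

print(max_second_moment, lam_full, lam_short)
```

With `ddof=0` the square-root term equals 1 up to floating-point error. With pandas' default `ddof=1` it equals $(N-1)/N$ before taking the square root, which is negligibly different for large $N$.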

In [3]:
# Remember to save the residuals from the Lasso regression (easy way: sklearn's Lasso has a predict method)
# - construct the residuals so that they are aligned with the post-Lasso method
# You could challenge yourself by writing functions for standardizing and/or the BRT penalty

import numpy as np

X=np.column_stack((d,Z))

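def standardize(X):
    # Standardize column by column (axis=0). Without axis=0, a NumPy array
    # such as X would be centered and scaled by its global mean and std.
    return (X - X.mean(axis=0)) / X.std(axis=0)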

X_stan = standardize(X)
Z_stan = standardize(Z)
d_stan = standardize(d)
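
A small sketch of why the axis matters when standardizing. The toy array `A` is made up for illustration. For a NumPy array, `.mean()` and `.std()` without an axis argument pool all entries, whereas pandas objects work column by column by default (and use `ddof=1` for the standard deviation).

```python
import numpy as np
import pandas as pd

# Toy array with two columns on very different scales (hypothetical data)
A = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

global_std = (A - A.mean()) / A.std()                # one mean/std for the whole array
colwise_std = (A - A.mean(axis=0)) / A.std(axis=0)   # one mean/std per column

print(global_std.mean(axis=0))                       # columns are not centered
print(colwise_std.mean(axis=0), colwise_std.std(axis=0))

# pandas standardizes column-wise by default and uses ddof=1 in std
df = pd.DataFrame(A, columns=["a", "b"])
df_std = (df - df.mean()) / df.std()
print(df_std.std(ddof=1).to_numpy())
```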


In [4]:
from scipy.stats import norm
from sklearn.linear_model import Lasso


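def BRT(X_tilde,y):
    """Feasible BRT penalty level, scaled for sklearn's Lasso (assumes standardized X_tilde)."""
    sigma = np.std(y)  # std of the outcome, used in place of the unknown error std
    (N,p) = X_tilde.shape
    c = 1.1        # slack constant
    alpha = 0.05   # significance level

    penalty_BRT= ((2*c*sigma)/np.sqrt(N))*(norm.ppf(1-alpha/(2*p)))

    # sklearn's Lasso minimizes (1/(2N))*RSS + alpha*||b||_1, so the penalty is halved
    penalty_BRT = penalty_BRT/2

    return penalty_BRT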

penalty_BRTyz= BRT(X_tilde=Z_stan, y=y)

print("lambda_BRT =",penalty_BRTyz.round(2))

# Lasso on median house value
fit_BRTyz= Lasso(alpha=penalty_BRTyz)
fit_BRTyz.fit(Z_stan,y)
preds_yz = fit_BRTyz.predict(Z_stan)

# save residuals
rezys = y-preds_yz

lambda_BRT = 2389.6


You should get lambda_BRT = 2389.6
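
The division by 2 inside `BRT` is consistent with sklearn's `Lasso`, which minimizes $\frac{1}{2N}\lVert y-Xb\rVert_2^2+\alpha\lVert b\rVert_1$. If the penalty level $\lambda$ is defined for the objective $\frac{1}{N}\lVert y-Xb\rVert_2^2+\lambda\lVert b\rVert_1$ (an assumption about the convention used in the course), then the matching sklearn argument is $\alpha=\lambda/2$. The sketch below checks sklearn's scaling in the one-regressor case, where the Lasso solution is a soft-thresholded OLS coefficient. The data and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic single-regressor data (hypothetical)
rng = np.random.default_rng(1)
N = 200
x = rng.normal(size=N)
x = (x - x.mean()) / x.std()      # now (1/N) * sum(x**2) == 1
y = 0.8 * x + rng.normal(size=N)
y = y - y.mean()                  # centered, so no intercept is needed

alpha = 0.3
fit = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10).fit(x.reshape(-1, 1), y)

# Closed form under sklearn's objective: soft-threshold the OLS coefficient at alpha
b_ols = x @ y / N
b_soft = np.sign(b_ols) * max(abs(b_ols) - alpha, 0.0)

print(fit.coef_[0], b_soft)
```

The two printed numbers agree, which confirms that sklearn's `alpha` acts as the threshold on the coefficient scale when the regressor is standardized.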

### Question 2
Lasso <tt>median_income</tt> on the controls using the (feasible) BRT penalty level. 

In [8]:
# Lasso on median income
penalty_BRTdz= BRT(X_tilde=Z_stan, y=d)

fit_BRTdz= Lasso(alpha=penalty_BRTdz)
fit_BRTdz.fit(Z_stan, d)
preds_dz = fit_BRTdz.predict(Z_stan)
coefs_BRTdz = fit_BRTdz.coef_
# save residuals
resdz= d-preds_dz

In [9]:
np.count_nonzero(coefs_BRTdz)

6

### Question 3: Post Partialling Out Lasso vs OLS
Calculate the implied partialling-out Lasso estimate $\breve{\alpha}$ and compare with the OLS estimate $\widehat{\alpha}^{\mathtt{LS}}$ (obtained by regressing Y on a constant, D and controls, Z).
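
The comparison with OLS rests on the Frisch-Waugh-Lovell theorem: when the partialling out is itself done by OLS, the residual-on-residual coefficient equals the coefficient on $D$ from the full regression exactly. With Lasso residuals the two need only agree approximately. The sketch below illustrates the exact OLS case on simulated data; the data-generating process and the helper `ols_residuals` are hypothetical.

```python
import numpy as np

# Simulated data (hypothetical): y depends on d and controls Z
rng = np.random.default_rng(2)
N = 1000
Z = rng.normal(size=(N, 3))
d = Z @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=N)
y = 2.0 * d + Z @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=N)

def ols_residuals(target, regressors):
    """Residuals from OLS of target on a constant and regressors."""
    W = np.column_stack((np.ones(len(target)), regressors))
    coef, *_ = np.linalg.lstsq(W, target, rcond=None)
    return target - W @ coef

# Partial Z out of both y and d, then regress residual on residual
res_y = ols_residuals(y, Z)
res_d = ols_residuals(d, Z)
alpha_partial = (res_d @ res_y) / (res_d @ res_d)

# Coefficient on d from the full regression of y on (1, d, Z)
W_full = np.column_stack((np.ones(N), d, Z))
alpha_full = np.linalg.lstsq(W_full, y, rcond=None)[0][1]

print(alpha_partial, alpha_full)
```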

In [6]:
Z.shape

(20433, 7)

In [7]:
#Use your previously saved residuals
#If you want - you can compare with OLS (why is this possible?)
from numpy import linalg as la

denom = np.sum(resdz**2)
num = rezys@resdz
POL = num/denom

print("POL = ",POL.round(2))

(N,p) = X_stan.shape
ones = np.ones(shape=(N,1))
xx = np.hstack((ones,X))# <--- add a constant to the regressors, X=(d,Z)
yy = np.array(y).reshape(-1,1) # reshape y to 2-dim so can use matmul
LS = la.inv(xx.T@xx)@xx.T@y

# save residuals, will need them later
res_ols = y-xx@LS
print("LS = ",LS[1].round(2))

POL =  41220.04
LS =  40297.52


You should get

POL =  41220.0

LS =  [40297.52]

### Question 4: Variance of Partialling Out Lasso
Calculate the implied variance estimate $\breve{\sigma}^2$. 
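
Reading the estimator off the code cell for this question, and writing $\hat{r}_i$ for the residuals from the Lasso of $Y$ on $Z$ (`rezys`) and $\hat{\nu}_i$ for the residuals from the Lasso of $D$ on $Z$ (`resdz`), the quantity computed there appears to be the following. The symbol $\hat{r}_i$ is introduced here only for illustration.

$$
\breve{\sigma}^2=\frac{\frac{1}{N}\sum_{i=1}^N \hat{r}_i^{2}\,\hat{\nu}_i^{2}}{\left(\frac{1}{N}\sum_{i=1}^N \hat{\nu}_i^{2}\right)^{2}}.
$$

This is on the scale of $\sqrt{N}(\breve{\alpha}-\alpha)$, so the corresponding standard error of $\breve{\alpha}$ is $\breve{\sigma}/\sqrt{N}$.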

In [8]:
penalty_BRTyx= BRT(X_tilde=X_stan, y=y)

print("lambda_BRT =",penalty_BRTyx.round(2))

clf_yx = Lasso(alpha=penalty_BRTyx)

clf_yx.fit(X_stan, y)

preds_yx = clf_yx.predict(X_stan)

res_yx = y-preds_yx

lambda_BRT = 2428.92


In [9]:
#Use your previously saved residuals

(N,p) = X_stan.shape

num = ((rezys**2)@(resdz**2))/N      # (1/N) * sum of products of squared residuals
denom = (np.sum(resdz**2)/N)**2      # ((1/N) * sum of squared d-residuals)^2
sigma2_POL = num/denom
print("sigma2_POL = ",sigma2_POL.round(2))

sigma2_POL =  14072884375.56


You should get sigma2_POL =  14072882476.19

### Question 5: Confidence Interval of Partialling Out Lasso
Construct a two-sided 95 pct. confidence interval (CI) for $\alpha$, which is asymptotically valid even in the high-dimensional regime. Compare with the "standard" CI implied by OLS (presuming conditionally homoskedastic errors). 
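
A minimal sketch of a normal-approximation confidence interval, assuming the variance estimate is on the scale of $\sqrt{N}(\hat{\alpha}-\alpha)$ so that the standard error is $\sqrt{\hat{\sigma}^2/N}$. The helper `normal_ci` and the numbers passed to it are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def normal_ci(estimate, sigma2, n, level=0.95):
    """Two-sided CI: estimate +/- z_{1-(1-level)/2} * sqrt(sigma2 / n)."""
    q = norm.ppf(1 - (1 - level) / 2)      # upper normal quantile, about 1.96 for 95%
    half_width = q * np.sqrt(sigma2 / n)
    return estimate - half_width, estimate + half_width

# Hypothetical inputs, chosen only to illustrate the call
lo, hi = normal_ci(estimate=1.5, sigma2=4.0, n=400)
print(lo, hi)
```

Using the upper quantile keeps the lower bound first without relying on the sign of the quantile.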

In [15]:
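xi=0.025  # half of the 5% significance level

q = norm.ppf(xi)  # lower-tail normal quantile (negative), so POL + q*se/sqrt(N) is the lower bound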
se_POL = np.sqrt(sigma2_POL)
CI_POL =  (POL+q*se_POL/np.sqrt(N), POL-q*se_POL/np.sqrt(N))
print("CI_POL = ",CI_POL)

CI_POL =  (39593.46573280547, 42846.61191423362)


You should get CI_POL = (39593.47, 42846.61)

In [20]:
# compute OLS standard errors
SSR = res_ols.T@res_ols
N = xx.shape[0]
K = xx.shape[1]
sigma2_ols = SSR/(N-K)
var = sigma2_ols*la.inv(xx.T@xx)
se_ols = np.sqrt(var.diagonal()).reshape(-1,1)

# pull out relevant coefficient and se estimates
# X=np.hstack((ones,d,Z)) --> we are interested in d
se_ols_d = se_ols[1]
LS_d = LS[1]

# construct CI
q = norm.ppf(xi)
CI_OLS =  (LS_d+q*se_ols_d, LS_d-q*se_ols_d) 
print("CI_OLS = ",CI_OLS)

CI_OLS =  (array([39636.60780279]), array([40958.43562672]))


You should get CI_OLS =  ([39636.61], [40958.44])

### Question 6: Post Double Lasso
Construct a two-sided 95 pct. CI using *double Lasso* $\check{\alpha}$ instead. 

In [None]:
# update BRT penalty
penalty_BRTyx= # fill in

print("lambda_BRT =",penalty_BRTyx.round(2))

# Lasso on median house value
fit_BRTyx= # fill in
coefs=fit_BRTyx.coef_

# save residuals
resyxz= # fill in

In [None]:
#Use your previously saved residuals
#If you want - you can compare with OLS (why is this possible?)

denom = # fill in
num = # fill in

PDL=num/denom
print("PDL = ",PDL.round(2))

You should get PDL =  40883.0

In [None]:
# variance
resyzz= # fill in
num = # fill in
denom = # fill in
sigma2_PDL = num/denom
print("sigma2_PDL = ",sigma2_PDL.round(2))

You should get sigma2_PDL =  4219298356.25

In [None]:
# conf interval
q = # fill in
se_PDL = # fill in

CI_PDL= # fill in
print("CI_PDL = ",CI_PDL)

You should get CI_PDL = (39992.36, 41773.64)

### Question 7: Extensions

To make the estimation problem more challenging:

Repeat Exercises 1--5/6 after adding all control quadratics ($Z_1^2,\dotsc,Z_p^2$) and first-order interactions ($Z_1Z_2,Z_1Z_3,\dotsc,Z_{p-1}Z_{p}$). [Hints: Use <tt>sklearn.preprocessing.PolynomialFeatures</tt> for the transformation. Your optimizer may fail to converge; if so, consider increasing the maximum number of iterations via the Lasso option <tt>max_iter=</tt>[your number].]

Optional variations:
* Repeat Exercises 1--5/6/7 using the Belloni-Chen-Chernozhukov-Hansen (BCCH) penalty level for each Lasso (which may be justified without any independence/homoskedasticity assumptions).
* Repeat Exercises 1--5/6/7 using cross-validation (CV) for each Lasso.
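
For the cross-validation variant, one possible route is sklearn's `LassoCV`, which selects the penalty over a grid by K-fold cross-validation. The sketch below uses simulated data, and the choices `cv=5` and `max_iter=10_000` are illustrative assumptions rather than recommendations from the problem set.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated data (hypothetical): only the first two controls matter
rng = np.random.default_rng(3)
N, p = 300, 10
Z = rng.normal(size=(N, p))
y = 1.5 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(size=N)

# Standardize the controls column by column
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# 5-fold CV over sklearn's default penalty grid
fit_cv = LassoCV(cv=5, max_iter=10_000).fit(Z_std, y)
res_cv = y - fit_cv.predict(Z_std)   # residuals, as needed for partialling out

print(fit_cv.alpha_, np.count_nonzero(fit_cv.coef_))
```

The selected penalty is stored in `alpha_` and is already on sklearn's scale, so no division by 2 is needed.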

Don't include the bias (constant) column: set <tt>include_bias=False</tt> in <tt>sklearn.preprocessing.PolynomialFeatures</tt>.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = # fill in
Z_2 = # fill in # <---- returns Z with interactions
Z_2_stan = standardize(Z_2) # standardize regressors

In [None]:
# POL Coefficient Estimate

# fill in

You should get POL_2 =  41243.6 and LS =  [39512.12] if using BRT penalties

In [None]:
# POL Variance Estimate

# fill in

You should get sigma2_POL_2 =  15263885977.92 if using BRT penalties

In [None]:
# POL Confidence Interval

# fill in

You should get CI_POL_2=(39549.56, 42937.63) if using BRT penalties