# Week 6: High-Dimensional Methods and Confidence Intervals

The purpose of this week's problem set is to become familiar with inference based on high-dimensional methods. Our focus is again on methods based on the Lasso, and we again use the <tt>housing.csv</tt> dataset. (See the previous problem set for details on the data.) Note that our focus has shifted from prediction (of house prices) to inference (on the drivers of house prices).

We first read the data into Python and drop observations with missing values.

In [1]:
# Read data
import pandas as pd
housing = pd.read_csv("housing.csv")
print("The number of rows and columns is {}, also called the shape of the matrix".format(housing.shape)) # data dimensions
print("Column names are \n {}".format(housing.columns))
print(housing.head()) # first observations
print(housing.tail()) # last observations
print(housing.dtypes) # data types
housing = housing.dropna() # drop observations with a missing bedroom count

The number of rows and columns is (20640, 10), also called the shape of the matrix
Column names are 
 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  

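Before dropping rows, it can help to see which columns actually contain missing values. A minimal sketch on a small made-up DataFrame (the values below are invented for illustration and are not taken from <tt>housing.csv</tt>):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing bedroom count
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
toy_clean = toy.dropna()               # drop every row with at least one missing value

print(missing_per_column)
print(toy_clean.shape)
```

The same two calls can be applied to <tt>housing</tt> to confirm which column drives the dropped rows.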

We model house prices (<tt>median_house_value</tt>) using a linear (in the parameters) model of the basic regressors (minus the categorical variable <tt>ocean_proximity</tt>),

$$
\underbrace{\mathtt{median\,house\,value}}_{=Y}=\alpha\times\underbrace{\mathtt{median\,income}}_{=D} + Z'\gamma + \varepsilon,\quad\mathrm{E}[\varepsilon|D,Z]=0.
$$

Here we focus on constructing a confidence interval for the coefficient on <tt>median_income</tt> after having used the Lasso. In doing so, we treat both <tt>median_income</tt> and the remaining ($p=7$) controls as exogenous. Moreover, we augment the above model with another linear model

$$
\mathtt{median\,income}=Z'\psi + \nu,\quad\mathrm{E}[\nu|Z]=0,
$$

now for <tt>median_income</tt>.

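Taken together, the two equations suggest why a regression of residuals on residuals can recover $\alpha$. A short derivation under the stated exogeneity assumptions, where the symbol $\rho$ for the residualized outcome is notation introduced here for illustration. Taking conditional expectations of the first equation given $Z$ and inserting the second equation gives

$$
\mathrm{E}[Y|Z]=\alpha Z'\psi + Z'\gamma,
\qquad
\rho := Y-\mathrm{E}[Y|Z]=\alpha\nu+\varepsilon.
$$

Since $\nu=D-Z'\psi$ is a function of $(D,Z)$ and $\mathrm{E}[\varepsilon|D,Z]=0$, we have $\mathrm{E}[\nu\varepsilon]=0$, and therefore (provided $\mathrm{E}[\nu^{2}]>0$)

$$
\alpha=\frac{\mathrm{E}[\nu\rho]}{\mathrm{E}[\nu^{2}]}.
$$
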
(One would be hard pressed to claim that median income *causes* house price movements. This is only an exercise in the mechanics.)

In [2]:
y = housing.median_house_value
d = housing.median_income
Z = housing.drop(["median_house_value","median_income","ocean_proximity"],axis=1)

# Exercises
Complete the following exercises using only the eight basic regressors.

### Question 1
Lasso <tt>median_house_value</tt> on the controls, Z, using the (feasible) Bickel-Ritov-Tsybakov (BRT) penalty level. [Hint: Don't forget to standardize.]

In [3]:
# Remember to save the residuals from the Lasso regression
# (easy way: Lasso from sklearn.linear_model has a predict method).
# Construct the residuals so that they are aligned with the post-Lasso method.
# You could challenge yourself by writing functions for standardizing and/or the BRT penalty.

import numpy as np

X=np.column_stack((d,Z))

def standardize(X):
    X_mean = np.mean(X,axis=0)
    X_std = np.std(X,axis=0)
    X_stan=(X-X_mean)/X_std
    return X_stan

X_stan = standardize(X)
Z_stan = standardize(Z)
d_stan = standardize(d)

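A quick sanity check of what standardizing does, sketched on simulated data (the array below is made up): after the transformation every column should have mean zero and mean square one.

```python
import numpy as np

def standardize(X):
    # Subtract the column means and divide by the column standard deviations (ddof=0)
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # simulated regressors

X_demo_stan = standardize(X_demo)
col_means = X_demo_stan.mean(axis=0)                 # should be numerically zero
col_mean_squares = (X_demo_stan ** 2).mean(axis=0)   # should be one

print(col_means.round(6))
print(col_mean_squares.round(6))
```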

In [4]:
from scipy.stats import norm
from sklearn.linear_model import Lasso


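def BRT(X_tilde,y):
    # Feasible BRT penalty level, with sigma estimated by the sample std of y
    (N,p)=X_tilde.shape
    sigma = np.std(y)
    c = 1.1
    alpha = 0.05

    # The regressors are standardized, so each column has mean square 1.
    # NB: the usual factor 2 is dropped because sklearn's Lasso objective
    # divides the sum of squared residuals by 2N rather than N.
    penalty_BRT= (sigma * c)/np.sqrt(N)*norm.ppf(1-alpha/(2*p))

    return penalty_BRT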

penalty_BRTyz=BRT(Z_stan,y)

print("lambda_BRT =",penalty_BRTyz.round(2))

# Lasso on median house value
fit_BRTyz=Lasso(alpha=penalty_BRTyz).fit(Z_stan,y) #,fit_intercept = False

# save residuals
resyz=y-fit_BRTyz.predict(Z_stan) 

lambda_BRT = 2389.6

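To see how the penalty level scales with the number of regressors, here is a minimal sketch of the same formula. It assumes the sklearn convention in which the squared-error term is divided by $2N$, and it uses the standard library's <tt>NormalDist</tt> in place of <tt>norm.ppf</tt>; the helper name <tt>brt_penalty</tt> and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def brt_penalty(sigma, N, p, c=1.1, alpha=0.05):
    # c * sigma / sqrt(N) * Phi^{-1}(1 - alpha / (2p)), i.e. the penalty on
    # sklearn's scale, where the sum of squared residuals is divided by 2N
    return c * sigma / sqrt(N) * NormalDist().inv_cdf(1 - alpha / (2 * p))

sigma, N = 1.0, 20000                   # made-up noise level and sample size
lam_7 = brt_penalty(sigma, N, p=7)      # seven controls
lam_35 = brt_penalty(sigma, N, p=35)    # a hypothetical larger set of 35 regressors

print(round(lam_35 / lam_7, 3))
```

Only the Gaussian quantile depends on $p$, so going from 7 to 35 regressors raises the penalty by a bit under 20 percent in this formula.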

### Question 2
Lasso <tt>median_income</tt> on the controls using the (feasible) BRT penalty level. 

In [5]:
#remember to save the residuals

# Lasso on median income
penalty_BRTdz=BRT(Z_stan,d)

fit_BRTdz=Lasso(alpha=penalty_BRTdz).fit(Z_stan,d) #,fit_intercept = False

# save residuals
resdz=d-fit_BRTdz.predict(Z_stan)
print("lambda_BRT =",penalty_BRTdz.round(2))


lambda_BRT = 0.04


### Question 3: Partialling Out Lasso vs OLS
Calculate the implied partialling-out Lasso estimate $\breve{\alpha}$ and compare with the OLS estimate $\widehat{\alpha}^{\mathtt{LS}}$.

In [6]:
#Use your previously saved residuals
#If you want - you can compare with OLS (why is this possible?)
from numpy import linalg as la
def POL_ols(x,y):
    denom = np.sum(x**2)
    num = np.sum(x*y)
    return num/denom

POL=POL_ols(resdz,resyz)
print("POL = ",POL.round(2))

N=y.shape[0]
xx=np.column_stack((np.ones(N),X)).reshape(-1,1+X.shape[1])
yy=np.array(y).reshape(-1,1)
LS=la.inv(xx.T@xx)@xx.T@yy
res_ols=yy-xx@LS
print("LS = ",LS[1].round(2))

POL =  41220.0
LS =  [40297.52]

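One way to think about the comparison with OLS: if both partialling-out steps are done by OLS instead of Lasso, the residual-on-residual slope coincides exactly with the coefficient from the full regression (the Frisch-Waugh-Lovell result). A minimal sketch on simulated data, with all names and parameter values made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 500, 3
Z_sim = rng.normal(size=(N, p))                         # simulated controls
d_sim = Z_sim @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=N)
y_sim = 2.0 * d_sim + Z_sim @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=N)

def ols_residuals(W, v):
    # Residuals from an OLS regression of v on W (W includes a constant)
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

W = np.column_stack((np.ones(N), Z_sim))
res_d = ols_residuals(W, d_sim)
res_y = ols_residuals(W, y_sim)
alpha_partial = np.sum(res_d * res_y) / np.sum(res_d ** 2)

# Full OLS of y on (1, d, Z); the coefficient on d sits in position 1
full = np.column_stack((np.ones(N), d_sim, Z_sim))
alpha_full = np.linalg.lstsq(full, y_sim, rcond=None)[0][1]

print(np.isclose(alpha_partial, alpha_full))
```

With Lasso in the first steps the two numbers need not agree exactly, since the Lasso residuals differ from the OLS residuals.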

### Question 4: Variance of Partialling Out Lasso
Calculate the implied variance estimate $\breve{\sigma}^2$. 

In [7]:
#Use your previously saved residuals
N = resyz.shape[0]
num = np.sum(resdz**2*resyz**2)/N
denom = (np.sum(resdz**2)/N)**2
sigma2_POL = num/denom
print("sigma2_POL = ",sigma2_POL.round(2))

sigma2_POL =  14072882476.19


### Question 5: Confidence Interval of Partialling Out Lasso
Construct a two-sided 95 pct. confidence interval (CI) for $\alpha$, which is asymptotically valid even in the high-dimensional regime. Compare with the "standard" CI implied by OLS (presuming conditionally homoskedastic errors). 

In [8]:
q = norm.ppf(1-0.025)
se_POL = np.sqrt(sigma2_POL/N)
CI_POL = (((POL-q*se_POL).round(2),(POL+q*se_POL).round(2)))
print("CI_POL = ",CI_POL)

CI_POL =  (39593.43, 42846.58)


In [9]:
q = norm.ppf(1-0.025)
SSR = np.sum(res_ols ** 2)
N = y.shape[0]
K = xx.shape[1]
sigma2_ols = SSR/(N-K)
var = sigma2_ols*la.inv(xx.T@xx)
se_ols = np.sqrt(np.diagonal(var)).reshape(-1, 1)
se_ols_d=se_ols[1]
LS_d=LS[1]
CI_OLS =  (((LS_d-q*se_ols_d).round(2),(LS_d+q*se_ols_d).round(2)))
print("CI_OLS = ",CI_OLS)

CI_OLS =  (array([39636.61]), array([40958.44]))

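For reference, a sketch of a heteroskedasticity-robust interval for a residual-on-residual estimate, on simulated data. It is written under the assumption that the variance formula uses the second-stage residual $\hat\varepsilon=\hat\rho-\breve{\alpha}\hat\nu$; all variable names and parameter values are made up, and the first-step residuals are simply simulated rather than estimated.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
N = 2000
nu_hat = rng.normal(size=N)                          # stand-in for residuals of d on Z
eps = rng.normal(size=N) * (0.5 + np.abs(nu_hat))    # heteroskedastic error
rho_hat = 1.5 * nu_hat + eps                         # stand-in for residuals of y on Z

alpha_hat = np.sum(nu_hat * rho_hat) / np.sum(nu_hat ** 2)
eps_hat = rho_hat - alpha_hat * nu_hat               # second-stage residual (assumption)

# Sandwich-type variance: mean(nu^2 * eps^2) / mean(nu^2)^2
sigma2_hat = np.mean(nu_hat ** 2 * eps_hat ** 2) / np.mean(nu_hat ** 2) ** 2
se = np.sqrt(sigma2_hat / N)

q = NormalDist().inv_cdf(0.975)                      # two-sided 95 pct. critical value
ci = (alpha_hat - q * se, alpha_hat + q * se)
print(alpha_hat, ci)
```

Because the interval is built from products of squared residuals, it remains usable when the error variance depends on the regressors, which the homoskedastic OLS formula does not allow for.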

### Question 6: Post Double Lasso
Construct a two-sided 95 pct. CI using *double Lasso* $\check{\alpha}$ instead. 

In [10]:
penalty_BRTyx=BRT(X_stan,y)

print("lambda_BRT =",penalty_BRTyx.round(2))

# Lasso on median house value
fit_BRTyx=Lasso(alpha=penalty_BRTyx).fit(X_stan,y) 
coefs=fit_BRTyx.coef_
print(coefs[0].round(2))
# save residuals
resyxz=y-fit_BRTyx.predict(X_stan) + d_stan*coefs[0]

lambda_BRT = 2428.92
73438.2

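The residual construction above adds the <tt>median_income</tt> term back to the Lasso residual, so that only the part explained by the controls (and the intercept) is removed. A sketch checking this identity on simulated data, assuming sklearn's <tt>Lasso</tt> with its default intercept; the data and the penalty value 0.1 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
N = 300
Z_sim = rng.normal(size=(N, 4))                  # simulated controls
d_sim = Z_sim[:, 0] + rng.normal(size=N)         # simulated regressor of interest
y_sim = 2.0 * d_sim + Z_sim[:, 1] + rng.normal(size=N)

X_sim = np.column_stack((d_sim, Z_sim))          # d first, then the controls
fit = Lasso(alpha=0.1).fit(X_sim, y_sim)

# Lasso residual with the d-term added back
res_a = y_sim - fit.predict(X_sim) + d_sim * fit.coef_[0]
# Same object computed directly: y minus intercept minus the controls' contribution
res_b = y_sim - fit.intercept_ - Z_sim @ fit.coef_[1:]

print(np.allclose(res_a, res_b))
```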

In [31]:
# Use the previously saved residuals:
# resdz  = residuals from the Lasso of d on Z
# resyxz = y minus the controls' part of the Lasso fit of y on (d, Z)
def PDL_ols(resdz,resyxz,d):
    denom = np.sum(resdz*d)
    num = np.sum(resdz*resyxz)
    return num/denom

PDL=PDL_ols(resdz,resyxz,d)
print("PDL = ",PDL.round(2))

PDL =  40883.0


In [28]:
# variance: based on the full residual from the Lasso of y on (d, Z)
resyzz=y-fit_BRTyx.predict(X_stan)
N = resyzz.shape[0]
num = np.sum(resdz**2*resyzz**2)/N
denom = (np.sum(resdz**2)/N)**2
sigma2_PDL = num/denom
print("sigma2_PDL = ",sigma2_PDL.round(2))

sigma2_PDL =  4219298356.25


In [30]:
# conf interval
q=norm.ppf(1-0.025)
se_PDL=np.sqrt(sigma2_PDL/N)

CI_PDL=(((PDL-q*se_PDL).round(2),(PDL+q*se_PDL).round(2)))
print("CI_PDL = ",CI_PDL)

CI_PDL =  (39992.36, 41773.64)


### Question 7: Extensions

To make the estimation problem more challenging:

Repeat Exercises 1--5/6 after adding all control quadratics ($Z_1^2,\dotsc,Z_p^2$) and first-order interactions ($Z_1Z_2,Z_1Z_3,\dotsc,Z_{p-1}Z_{p}$). [Hints: Use <tt>sklearn.preprocessing.PolynomialFeatures</tt> for simple transformation. Your optimizer may not converge. Consider increasing the maximum number of iterations using the Lasso option <tt>max_iter=</tt>[your number].]

Optional variations:
* Repeat Exercises 1--5/6/7 using the Belloni-Chen-Chernozhukov-Hansen (BCCH) penalty level for each Lasso (which may be justified without any independence/homoskedasticity assumptions).
* Repeat Exercises 1--5/6/7 using cross-validation (CV) for each Lasso.

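For the cross-validation variation, one possible route is sklearn's <tt>LassoCV</tt>, which picks the penalty by K-fold CV over a grid. A minimal sketch on simulated data; the 5 folds, the iteration cap and the data-generating process are arbitrary choices for illustration, not part of the exercise:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
N, p = 400, 10
Z_sim = rng.normal(size=(N, p))                      # simulated controls, roughly standardized
y_sim = 3.0 * Z_sim[:, 0] - 2.0 * Z_sim[:, 1] + rng.normal(size=N)

fit_cv = LassoCV(cv=5, max_iter=10000).fit(Z_sim, y_sim)
penalty_cv = fit_cv.alpha_                           # CV-selected penalty on sklearn's scale
res_cv = y_sim - fit_cv.predict(Z_sim)               # residuals for the later steps

print(penalty_cv, np.count_nonzero(fit_cv.coef_))
```
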
In [14]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2, include_bias=False)
Z_2=poly.fit_transform(Z)
Z_2_stan=standardize(Z_2)

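To see what the degree-2 expansion produces, here is a small sketch on a made-up array with seven columns (matching $p=7$): the output should contain the 7 original columns, 7 squares and 21 pairwise interactions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
Z_demo = rng.normal(size=(10, 7))                # 7 made-up controls

poly_demo = PolynomialFeatures(2, include_bias=False)
Z_demo_2 = poly_demo.fit_transform(Z_demo)

n_cols = Z_demo_2.shape[1]                       # 7 + 7 + 21
print(n_cols)
```

A larger number of columns also feeds into the penalty level through the $p$ in the Gaussian quantile.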


In [15]:
penalty_BRTyz_2=BRT(Z_2_stan,y)

print("lambda_BRT =",penalty_BRTyz_2.round(2))

# Lasso on median house value
fit_BRTyz_2=Lasso(alpha=penalty_BRTyz_2,max_iter=10000).fit(Z_2_stan,y) #,fit_intercept = False

# save residuals
resyz_2=y-fit_BRTyz_2.predict(Z_2_stan) 

lambda_BRT = 2832.6


In [16]:

# Lasso on median income
penalty_BRTdz_2=BRT(Z_2_stan,d)

fit_BRTdz_2=Lasso(alpha=penalty_BRTdz_2).fit(Z_2_stan,d) #,fit_intercept = False

# save residuals
resdz_2=d-fit_BRTdz_2.predict(Z_2_stan)

In [17]:
# Reuse POL_ols (defined in Question 3) with the residuals from the augmented controls

POL_2=POL_ols(resdz_2,resyz_2)
print("POL_2 = ",POL_2.round(2))

N=y.shape[0]
xx_2=np.column_stack((np.ones(N),d,Z_2)).reshape(-1,2+Z_2.shape[1])
yy=np.array(y).reshape(-1,1)
LS_2=la.inv(xx_2.T@xx_2)@xx_2.T@yy
print("LS = ",LS_2[1].round(2))

POL_2 =  41243.6
LS =  [39512.12]


In [18]:
# Use the residuals saved from the Lasso fits on the augmented controls
N = resyz_2.shape[0]
num = np.sum(resdz_2**2*resyz_2**2)/N
denom = (np.sum(resdz_2**2)/N)**2
sigma2_POL_2 = num/denom
print("sigma2_POL_2 = ",sigma2_POL_2.round(2))

sigma2_POL_2 =  15263885977.92


In [19]:
q=norm.ppf(1-0.025)
se_POL_2=np.sqrt(sigma2_POL_2/N)
CI_POL_2=(((POL_2-q*se_POL_2).round(2),(POL_2+q*se_POL_2).round(2)))
CI_POL_2

(39549.59, 42937.6)
