In [1]:
from os import rename
from os.path import split, join
from glob import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from itertools import product
from IPython.display import display, HTML
pd.options.display.max_columns = 100
pd.options.display.max_rows = 10
sns.set_context("paper", font_scale=1.4)

# FHHPS Empirical Application

This notebook reports the results of applying our method to Allcott's dataset, which contains a bit over 10k observations.

## Tuning parameter selection

Using the same notation for the tuning parameters as in `simulation_figures.ipynb`, we drew over 150k tuning parameter configurations from the grid below.

+ $c_{shocks} \in [0.01, 5]$
+ $c_{output1\_step1} \in [0.01, 5]$
+ $c_{output1\_step2} \in [0.01, 5]$
+ $c_{output2} \in [0.01, 5]$
+ $c_{censor1} \in [0.01, 2]$
+ $c_{censor2} \in [0.01, 2]$
+ kernel $\in$ {gaussian, neighbor}
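
As a rough illustration of what drawing configurations from these ranges could look like, here is a minimal sketch. It assumes independent uniform draws over each interval, which the notebook does not state; the names `bounds`, `kernels` and `draw_tuning_parameters` are hypothetical, and the kernel labels are the ones that appear in the results table.

```python
import numpy as np
import pandas as pd

# Bounds copied from the list above; the dictionary keys are illustrative names.
bounds = {
    "c_shocks": (0.01, 5.0),
    "c_output1_step1": (0.01, 5.0),
    "c_output1_step2": (0.01, 5.0),
    "c_output2": (0.01, 5.0),
    "c_censor1": (0.01, 2.0),
    "c_censor2": (0.01, 2.0),
}
# Kernel labels as they appear in the results table.
kernels = ["gaussian", "neighbor"]


def draw_tuning_parameters(n_draws, seed=0):
    """Draw n_draws configurations, uniformly and independently (an assumption)."""
    rng = np.random.default_rng(seed)
    draws = {
        name: rng.uniform(lo, hi, size=n_draws)
        for name, (lo, hi) in bounds.items()
    }
    draws["kernel"] = rng.choice(kernels, size=n_draws)
    return pd.DataFrame(draws)


grid = draw_tuning_parameters(1000)
print(grid.head())
```

Each row of `grid` is one candidate configuration; the estimator would then be run once per row.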

Note that the asymptotic results in the paper do not provide explicit guidance on how to choose these parameters.

With a lot of data, these choices matter less. But as we will see below, at 10k observations they still matter a great deal.

## Reading in

This section reads in the results and can be skipped.

In [2]:
# Read every results file and stack them into a single DataFrame
dfs = []
for file in glob(join("empirical_out", "*")):
    df_tmp = pd.read_csv(file, header=None)
    dfs.append(df_tmp)

# Drop rows with missing values
df = pd.concat(dfs).dropna()

mean_names = ["EA", "EB", "EC"]
cov_names = ["VarA", "VarB", "VarC", "CovAB", "CovAC", "CovBC"]
param_names = ["shock_bw1_const", "output_bw1_const_step1", "output_bw1_const_step2", "output_bw2_const"]
pretty_param_names = ["$c_{shock}$", "$c_{y,1}^{(1)}$", "$c_{y,1}^{(2)}$", "$c_{y,2}$"]

params = ['n', 'kernel1', 'kernel2', 
      'output_bw1_const_step1', 'output_bw1_const_step2', 'output_bw2_const',
      'output_bw1_alpha', 'output_bw2_alpha', 
      'shock_bw1_const', 'shock_bw2_const', 'shock_bw1_alpha', 'shock_bw2_alpha', 
      'censor1_const', 'censor2_const']
others = ['mean_valid', 'cov_valid',"time"]
cols = params + others
df.columns = cols + mean_names + cov_names
df = df.drop_duplicates(params)

In [45]:
f"We tried {len(df)} different tuning parameter combinations"

'We tried 157369 different tuning parameter combinations'

In [46]:
df.head()

Unnamed: 0,n,kernel1,kernel2,output_bw1_const_step1,output_bw1_const_step2,output_bw2_const,output_bw1_alpha,output_bw2_alpha,shock_bw1_const,shock_bw2_const,shock_bw1_alpha,shock_bw2_alpha,censor1_const,censor2_const,mean_valid,cov_valid,time,EA,EB,EC,VarA,VarB,VarC,CovAB,CovAC,CovBC
0,2501,gaussian,gaussian,0.75,0.25,0.1,0.1,0.1,0.75,0.1,0.166667,0.166667,1.5,1.5,0.451819,0.055578,25.141415,0.985419,-0.023741,0.009764,6727758.0,32250.268114,822.41618,-469162.867752,74737.166589,-5177.213606
1,2501,neighbor,neighbor,0.75,0.1,0.25,0.1,0.1,0.25,1.0,0.166667,0.166667,1.5,1.0,0.451819,0.062775,23.314401,4.836907,-0.296089,-0.009239,-2393.513,-2.774495,-2.61001,85.926687,62.777299,-1.961132
2,2501,gaussian,gaussian,0.1,1.0,0.5,0.1,0.1,0.5,0.5,0.166667,0.166667,1.0,1.0,0.55058,0.062775,20.970403,0.650364,-0.049692,0.058165,519.0798,1.477454,0.874862,-23.958718,-4.807919,-0.435598
3,2501,gaussian,gaussian,1.0,0.1,0.25,0.1,0.1,1.0,0.1,0.166667,0.166667,1.0,0.5,0.55058,0.07477,20.694484,1.16695,-0.018846,-0.009699,6131.287,24.548959,5.366953,-352.301503,-27.421778,-3.019102
4,2501,gaussian,neighbor,0.5,0.25,1.0,0.1,0.1,0.1,0.1,0.166667,0.166667,0.5,1.0,0.716114,0.062775,24.839138,2.387369,-0.09152,-0.037496,-629.0749,4.172742,-2.227558,-15.618102,48.981569,-1.74864


## Results

The estimates are **extremely** sensitive to tuning parameters. For example, here is the range of intercept estimates.

In [48]:
df["EA"].agg(["min", "max"])

min    -5.212288
max    30.609138
Name: EA, dtype: float64

And here's the range of the variance of the first slope, for another example.

In [49]:
df["VarB"].agg(["min", "max"])

min       -36.497302
max    929455.158319
Name: VarB, dtype: float64
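
Both ranges above can be produced for every estimate in one pass. The sketch below uses a small made-up frame called `toy` as a stand-in for `df` (the numbers are illustrative only, not results) and adds a `spread` column, which is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for `df`; the numbers are made up for illustration only.
toy = pd.DataFrame({
    "EA": [0.9, 4.8, -5.2],
    "VarB": [32250.3, -2.8, 1.5],
})

# One row per estimate, with its minimum, maximum and their difference.
ranges = toy.agg(["min", "max"]).T
ranges["spread"] = ranges["max"] - ranges["min"]
print(ranges)
```

In this notebook the same call would be applied to `df[mean_names + cov_names]`.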

## Restricting the grid of tuning parameters

Some choices of tuning parameters will give entirely unreasonable estimates.

So let's say that we want to constrain our tuning parameter selection by discarding any configuration that produces mathematically impossible numbers, such as:

+ Negative variance estimates
+ Correlations larger than 1 in absolute value

Also, let's discard configurations that produce estimates that don't make economic sense. We'll restrict to parameters that give us:

+ $E[A_{1}] > 0$
+ $E[B_{1}] > -0.5$
+ $E[C_{1}] > -0.5$
+ $Var[A_{1}] < 50$
+ $|Cov[A_{1}, B_{1}]| < 20$
+ $|Cov[A_{1}, C_{1}]| < 20$
+ $|Corr[A_{1}, B_{1}]| < 1$ (similar for other correlations)
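
The correlation restriction can be checked without computing correlations explicitly. Assuming both variance estimates are strictly positive (which discarding negative variances guarantees), the bound on the correlation is equivalent to a bound on the covariance:

$$
|Corr[A_{1}, B_{1}]| = \frac{|Cov[A_{1}, B_{1}]|}{\sqrt{Var[A_{1}] \, Var[B_{1}]}} < 1
\quad \Longleftrightarrow \quad
|Cov[A_{1}, B_{1}]| < \sqrt{Var[A_{1}] \, Var[B_{1}]}
$$

The same equivalence holds for the pairs $(A_{1}, C_{1})$ and $(B_{1}, C_{1})$.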


If we impose all of these restrictions, what are we left with?

In [40]:
df_good = df[
    
    # Variances are positive
    (df["VarA"] > 0) & (df["VarB"] > 0)  & (df["VarC"] > 0) 
    
    # Var[A] and the covariances involving A have reasonable magnitude
    & (df["VarA"] < 50) & (np.abs(df["CovAB"]) < 20)  & (np.abs(df["CovAC"]) < 20) 

    # Correlations are at most 1 in absolute value
    & (np.abs(df["CovAC"]) < np.sqrt(df["VarA"]*df["VarC"]))
    & (np.abs(df["CovAB"]) < np.sqrt(df["VarA"]*df["VarB"]))
    & (np.abs(df["CovBC"]) < np.sqrt(df["VarB"]*df["VarC"]))
    
    # Sensible average values
    & (df["EA"] > 0) & (df["EB"] > -.5) & (df["EC"] > -.5)
    
].drop_duplicates()



The restricted dataset contains 19 configurations, out of the 150k+ we started with.

In [51]:
df_good

Unnamed: 0,n,kernel1,kernel2,output_bw1_const_step1,output_bw1_const_step2,output_bw2_const,output_bw1_alpha,output_bw2_alpha,shock_bw1_const,shock_bw2_const,shock_bw1_alpha,shock_bw2_alpha,censor1_const,censor2_const,mean_valid,cov_valid,time,EA,EB,EC,VarA,VarB,VarC,CovAB,CovAC,CovBC
55,2501,gaussian,gaussian,1.000,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,1.5,0.5,0.451819,0.07477,25.132809,1.272406,-0.036109,0.000985,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
64,2501,gaussian,gaussian,0.875,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,0.5,0.5,0.716114,0.07477,24.695361,1.394153,-0.018713,-0.020046,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
85,2501,gaussian,gaussian,0.875,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,1.5,0.5,0.451819,0.07477,24.435665,1.177487,-0.033649,0.006002,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
61,2501,gaussian,gaussian,1.000,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,0.5,0.5,0.716114,0.07477,21.699437,1.448984,-0.022993,-0.020893,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
9,2501,gaussian,gaussian,1.000,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,1.0,0.5,0.550580,0.07477,20.338830,1.307912,-0.030484,-0.007320,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47,2501,gaussian,gaussian,0.875,0.875,0.875,0.1,0.1,0.25,0.1,0.166667,0.166667,1.5,0.5,0.451819,0.07477,24.428672,1.177487,-0.033649,0.006002,24.659055,0.692620,0.490231,-3.608274,-3.346084,-0.206289
22,2501,gaussian,gaussian,0.100,0.750,0.750,0.1,0.1,0.25,0.1,0.166667,0.166667,1.5,0.5,0.451819,0.07477,23.023035,0.420905,-0.081227,0.105117,49.474183,0.760835,0.531906,-4.745937,-4.299000,-0.175283
73,2501,gaussian,gaussian,0.875,0.875,0.875,0.1,0.1,0.25,0.1,0.166667,0.166667,1.0,0.5,0.550580,0.07477,24.798212,1.238486,-0.028584,-0.003584,24.659055,0.692620,0.490231,-3.608274,-3.346084,-0.206289
59,2501,gaussian,gaussian,0.750,0.875,0.875,0.1,0.1,0.25,0.1,0.166667,0.166667,1.0,0.5,0.550580,0.07477,24.929169,1.195603,-0.028582,0.000106,24.659055,0.692620,0.490231,-3.608274,-3.346084,-0.206289


**Conclusion: Even after restricting to 'sensible' tuning parameters, we still observe a lot of variation in our point estimates**

In [52]:
df_good[["EA", "EB", "EC"]].describe()

Unnamed: 0,EA,EB,EC
count,19.0,19.0,19.0
mean,1.210181,-0.031883,0.003004
std,0.250137,0.015239,0.031059
min,0.420905,-0.081227,-0.020893
25%,1.177487,-0.033649,-0.020046
50%,1.272406,-0.030484,-0.003584
75%,1.367115,-0.022993,0.006002
max,1.448984,-0.015508,0.105117


In [53]:
df_good[["VarA", "VarB", "VarC", "CovAB", "CovAC", "CovBC"]].describe()

Unnamed: 0,VarA,VarB,VarC,CovAB,CovAC,CovBC
count,19.0,19.0,19.0,19.0,19.0,19.0
mean,37.719649,0.728523,0.512166,-4.207044,-3.847619,-0.18997
std,12.729892,0.034993,0.021379,0.583609,0.488836,0.015906
min,24.659055,0.69262,0.490231,-4.745937,-4.299,-0.206289
25%,24.659055,0.69262,0.490231,-4.745937,-4.299,-0.206289
50%,49.474183,0.760835,0.531906,-4.745937,-4.299,-0.175283
75%,49.474183,0.760835,0.531906,-3.608274,-3.346084,-0.175283
max,49.474183,0.760835,0.531906,-3.608274,-3.346084,-0.175283


As the summaries above show, estimates of (e.g.) $Var[A_{1}]$ range from about 24.7 to about 49.5, depending on the parameters.
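
One way to put a number on this spread, which is not part of the original analysis, is the ratio of the largest to the smallest estimate. The sketch below copies four of the rows displayed in the restricted table (indices 55, 47, 22 and 73) into a hypothetical frame called `sample`.

```python
import pandas as pd

# Four rows copied from the restricted table above, for illustration.
sample = pd.DataFrame(
    {
        "EA": [1.272406, 1.177487, 0.420905, 1.238486],
        "VarA": [49.474183, 24.659055, 49.474183, 24.659055],
    },
    index=[55, 47, 22, 73],
)

# Ratio of the largest to the smallest estimate in each column.
max_over_min = sample.max() / sample.min()
print(max_over_min.round(2))
```

The ratio is only meaningful for columns whose estimates are all strictly positive, so it would not apply to `EB` or `EC`, whose estimates change sign across configurations.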

## Can we get 'better' estimates?

If we don't "like" these estimates, then by fiddling with the tuning parameters we can get "better" estimates.

For example, now that we have a better idea about which parameters yield more sensible estimates, we can keep searching on a finer grid in that region until we get estimates that make economic sense to us.

But would we want to do that? (No)