# Problem Set 2

We might need to install some packages first.

$\newcommand\ci{\perp\mkern-10mu\perp}$
$\newcommand{\nci}{\not\!\perp\!\!\!\perp}$
$\newcommand{\E}{\mathbb{E}}$


In [1]:
#!pip install linearmodels

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

Data preparation

In [3]:
data = pd.read_stata('problemset2.dta')
data['eligible'] = data['eligible'].astype(float)

Y = data['earnings']
X = data['veteran']
X = sm.add_constant(X)

## Exercise 1

Regression of future earnings on the veteran-status dummy

In [4]:
results = sm.OLS(Y,X).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               earnings   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     5.359
Date:                Fri, 05 Feb 2021   Prob (F-statistic):             0.0207
Time:                        12:34:25   Log-Likelihood:                -31364.
No. Observations:                3552   AIC:                         6.273e+04
Df Residuals:                    3550   BIC:                         6.275e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.109e+04     30.570    362.703      0.0

Let $Y_{0i}$ and $Y_{1i}$ be the potential future earnings of subject $i$ had they not been a veteran and had they been a veteran, respectively. These random variables are not independent of veteran status, $Y_{1i}, Y_{0i} \nci T_i$, because people could volunteer to go to war. Interestingly, this regression has a positive slope on veteran status.
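The dependence between potential outcomes and $T_i$ can be made explicit with the standard decomposition of a naive comparison of means. This is a textbook identity written in the notation above, not something estimated from this data set:

$$
\begin{align}
\E[Y_i|T_i=1] - \E[Y_i|T_i=0] & = \underbrace{\E[Y_{1i} - Y_{0i}|T_i=1]}_{\text{effect on veterans}} + \underbrace{\E[Y_{0i}|T_i=1] - \E[Y_{0i}|T_i=0]}_{\text{selection bias}}
\end{align}
$$

With a single binary regressor, the OLS slope equals the left-hand side, so it recovers a causal effect only when the selection-bias term is zero.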

The paper states that "men with relatively few civilian opportunities are probably more likely to enlist." This hypothesis would explain a positive slope if these men actually benefited from becoming veterans and made up a sizable share of the population. If that were so, we should see consistently positive slopes for non-white men in the paper, which is not the case.

What is striking is that veterans who were not eligible actually have the highest average earnings. This could suggest another interpretation: committed people go to war, and committed people perform better at work.

In [5]:
pd.crosstab(data.eligible, data.veteran, margins=True, margins_name="Total", values=data['earnings'], aggfunc='mean')

| eligible (rows), veteran (columns) | 0.0 | 1.0 | Total |
|---|---|---|---|
| 0.0 | 11086.916016 | 11401.948242 | 11140.44043 |
| 1.0 | 11089.912109 | 10955.398438 | 11064.75293 |
| Total | 11087.813477 | 11256.926758 | 11117.428711 |


The table below shows that a sizable share of people who were not eligible did become veterans, which breaks the independence condition. These people could be described as committed.

In [6]:
pd.crosstab(data.eligible, data.veteran, margins=True, margins_name="Total", values=1, aggfunc='sum')/len(data)

| eligible (rows), veteran (columns) | 0.0 | 1.0 | Total |
|---|---|---|---|
| 0.0 | 0.577703 | 0.118243 | 0.695946 |
| 1.0 | 0.247185 | 0.056869 | 0.304054 |
| Total | 0.824887 | 0.175113 | 1.0 |


Two-Stage Least Squares

In [7]:
from linearmodels.iv import IV2SLS

res_second = IV2SLS(Y, X['const'], X.veteran, data.eligible).fit(cov_type='unadjusted')
print(res_second)

                          IV-2SLS Estimation Summary                          
Dep. Variable:               earnings   R-squared:                     -1.1072
Estimator:                    IV-2SLS   Adj. R-squared:                -1.1078
No. Observations:                3552   F-statistic:                    0.7456
Date:                Fri, Feb 05 2021   P-value (F-stat)                0.3879
Time:                        12:34:27   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const       1.189e+04     896.76     13.260     0.0000   1.013e+04   1.365e+04
veteran       -4417.4     5115.9    -0.8635     0.38

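We can see that the results differ markedly, especially in sign: the OLS slope is positive, while the IV estimate is negative, although with a t-statistic of -0.86 it is not statistically distinguishable from zero. When we use eligibility as an instrumental variable, we capture the effect of a randomly assigned treatment rather than that of self-selected service.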

## Exercise 2

The effect disappears because eligibility is randomly assigned. By instrumenting the veteran-status dummy with eligibility, we rule out selection bias.

Let $Z_i$ be an instrumental variable. The conditions necessary for identification are:

- $Z_i$ is randomly assigned. This is the case, since eligibility was assigned through a lottery.
- $Z_i$ satisfies the exclusion restriction. That is, eligibility affects $Y_i$ only through veteran status and has no direct effect on earnings.
- $Z_i$ affects the endogenous regressor, i.e. veteran status. Again, this is the case: some people would not have become veterans had they not been eligible.
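The relevance condition can be checked directly from the cell counts reported in the `groupby` output of Exercise 7. The sketch below rebuilds the individual-level `eligible` and `veteran` vectors from those four counts (so it does not need the `.dta` file) and computes the first-stage slope. The first-stage F statistic is derived from the correlation, which is valid here only because there is a single instrument and no other covariates. A common rule of thumb treats an F below about 10 as a sign of a weak instrument, so the printed value is worth inspecting.

```python
import numpy as np

# Cell counts keyed by (eligible, veteran), copied from the groupby output in Exercise 7.
counts = {(0, 0): 2052, (0, 1): 420, (1, 0): 878, (1, 1): 202}

# Rebuild individual-level Z (eligible) and D (veteran) vectors from the counts.
Z = np.concatenate([np.full(n, z) for (z, d), n in counts.items()]).astype(float)
D = np.concatenate([np.full(n, d) for (z, d), n in counts.items()]).astype(float)

# First-stage slope: Cov(D, Z) / Var(Z). For a binary Z this equals the
# difference in enlistment rates between eligible and non-eligible men.
alpha_z = np.cov(D, Z, ddof=1)[0, 1] / np.var(Z, ddof=1)
rate_diff = D[Z == 1].mean() - D[Z == 0].mean()

# With one instrument and no covariates, F = r^2 (n - 2) / (1 - r^2).
r = np.corrcoef(D, Z)[0, 1]
n_obs = len(Z)
f_stat = r**2 * (n_obs - 2) / (1 - r**2)

print(f"alpha_Z = {alpha_z:.6f}, rate difference = {rate_diff:.6f}")
print(f"corr(D, Z) = {r:.6f}, first-stage F = {f_stat:.3f}")
```

The correlation printed here should agree with the `data[['eligible', 'veteran']].corr()` output at the end of the notebook.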

## Exercise 3


### Structural Form

 $$ Y_i = \beta_0 + \beta_DD_i + \epsilon_i^D$$
 
### Reduced Form

$$ Y_i = \rho_0 + \rho_ZZ_i + \epsilon_i^Z$$


### First Stage Form

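$$ D_i = \alpha_0 + \alpha_ZZ_i + \eta_i$$

The first-stage error is written $\eta_i$ to distinguish it from the reduced-form error $\epsilon_i^Z$.


### Analytical derivation of the estimator

Substituting the first stage into the structural form gives

$$
\begin{align}
Y_i & = \beta_0 + \beta_DD_i + \epsilon_i^D \\
    & = \beta_0 + \beta_D(\alpha_0 + \alpha_ZZ_i + \eta_i) + \epsilon_i^D \\
    & = (\beta_0 + \beta_D\alpha_0) + \beta_D\alpha_ZZ_i + (\beta_D\eta_i + \epsilon_i^D) \\
    & = \rho_0 + \rho_ZZ_i + \epsilon_i^Z
\end{align}
$$

so that $\rho_0 = \beta_0 + \beta_D\alpha_0$ and $\epsilon_i^Z = \beta_D\eta_i + \epsilon_i^D$.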

If we equate coefficients we get

$$ \rho_Z = \beta_D\alpha_Z $$

Therefore, provided $\alpha_Z \neq 0$ (the instrument is relevant),

$$ \beta_D = \frac{\rho_Z}{\alpha_Z} $$
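The ratio $\rho_Z / \alpha_Z$ can be illustrated on simulated data. The sketch below is not based on the problem-set data: the data-generating process, the parameter values (`beta_0`, `beta_d`), and the unobserved confounder `u` are all assumptions chosen for illustration. It estimates the reduced form and first stage by OLS and compares their ratio with the naive OLS slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process (all parameter values are assumptions).
beta_0, beta_d = 1.0, 2.0
z = rng.integers(0, 2, size=n).astype(float)      # randomly assigned binary instrument
u = rng.normal(size=n)                             # unobserved confounder
d = (0.5 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(float)  # endogenous treatment
y = beta_0 + beta_d * d + u + rng.normal(size=n)   # structural equation


def slope(outcome, regressor):
    """OLS slope of outcome on a single regressor (intercept included)."""
    return np.cov(outcome, regressor)[0, 1] / np.var(regressor, ddof=1)


rho_z = slope(y, z)           # reduced form: Y on Z
alpha_z = slope(d, z)         # first stage: D on Z
beta_ils = rho_z / alpha_z    # indirect least squares estimate of beta_D
beta_ols = slope(y, d)        # naive OLS, biased because d depends on u

print(f"OLS: {beta_ols:.3f}, ILS: {beta_ils:.3f}, true: {beta_d}")
```

In this simulated setting the treatment effect is constant, so the ratio targets `beta_d` itself, while the OLS slope also absorbs the effect of `u`.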



## Exercise 4

Recall that, 

$$ \rho_Z = \frac{Cov(Y_i, Z_i)}{Var(Z_i)} $$
$$ \alpha_Z = \frac{Cov(D_i, Z_i)}{Var(Z_i)} $$

therefore, 

$$ \beta_D = \frac{\rho_Z}{\alpha_Z} = \frac{Cov(Y_i, Z_i)}{Cov(D_i, Z_i)}$$


We can express the numerator as, 

$$
\begin{align}
Cov(Y_i, Z_i) & = \E[Y_iZ_i]-\E[Y_i]\E[Z_i] \\
              & = \E[Y_i|Z_i=1]P(Z_i =1) -\{ \E [Y_i|Z_i =1]P(Z_i=1) + \E[Y_i|Z_i=0](1- P(Z_i=1))\}P(Z_i=1) \\
              &= \{\E[Y_i|Z_i = 1] - \E[Y_i|Z_i = 0] \}P(Z_i=1)(1-P(Z_i=1))
\end{align}
$$

... and the denominator as, 


$$
Cov(D_i, Z_i)  = \{ \E[D_i|Z_i=1]-\E[D_i|Z_i=0]\}P(Z_i=1)(1-P(Z_i=1)) 
$$

Thus

$$
    \beta_D = 
    \frac{\{\E[Y_i|Z_i = 1] - \E[Y_i|Z_i = 0] \}P(Z_i=1)(1-P(Z_i=1))}{\{\E[D_i|Z_i=1]-\E[D_i|Z_i=0]\}P(Z_i=1)(1-P(Z_i=1)) } = 
    \frac{\E[Y_i|Z_i = 1] - \E[Y_i|Z_i = 0]}{\E[D_i|Z_i=1]-\E[D_i|Z_i=0]}
$$
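This Wald form can be evaluated by hand from the summaries already printed in the notebook. The sketch below uses only the cell means of earnings (the first crosstab in Exercise 1) and the cell counts (the `groupby` output in Exercise 7). The helper `cond_means` is a hypothetical name introduced here. Because the printed means are rounded, the result should come out close to, but not exactly equal to, the `veteran` coefficient reported by `IV2SLS`.

```python
# Cell-level summaries keyed by (eligible, veteran), copied from the notebook outputs.
mean_earnings = {(0, 0): 11086.916016, (0, 1): 11401.948242,
                 (1, 0): 11089.912109, (1, 1): 10955.398438}
counts = {(0, 0): 2052, (0, 1): 420, (1, 0): 878, (1, 1): 202}


def cond_means(z):
    """Return (E[Y | Z=z], E[D | Z=z]) computed from the cell-level summaries."""
    n_z = counts[(z, 0)] + counts[(z, 1)]
    y_bar = sum(mean_earnings[(z, d)] * counts[(z, d)] for d in (0, 1)) / n_z
    d_bar = counts[(z, 1)] / n_z
    return y_bar, d_bar


y1, d1 = cond_means(1)
y0, d0 = cond_means(0)

reduced_form = y1 - y0      # numerator: difference in mean earnings by eligibility
first_stage = d1 - d0       # denominator: difference in veteran share by eligibility
wald = reduced_form / first_stage

print(f"reduced form: {reduced_form:.2f}, first stage: {first_stage:.4f}, Wald: {wald:.1f}")
```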





## Exercise 5

The crucial assumption violated is the exclusion restriction, that is, the requirement that the IV does not directly affect $Y_i$.

## Exercise 6

We need the __monotonicity condition__ to hold: anyone who would take the treatment when assigned to the control group would also take it when assigned to the treatment group, i.e. there are no defiers. Here, we assume that people who became veterans without being eligible would also have become veterans had they been eligible.

This is plausible given three points.
- People's willingness to become a veteran is not reduced by becoming eligible. Seems reasonable.
- Non-eligible people who became veterans would also have passed the medical and psychological tests had they been eligible. Seems reasonable.
- Once eligible, people still had to be called up to join. It could be that some people who would have become veterans had they not been eligible would, once eligible, simply wait to be called and never receive the call. We must assume that this effect is small.
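If monotonicity and random assignment both hold, the population splits into always-takers, never-takers, and compliers, and the share of each can be read off the cell counts from the `groupby` output in Exercise 7. The sketch below is a back-of-the-envelope calculation under those two assumptions. The variable names are introduced here for illustration.

```python
# Cell counts keyed by (eligible, veteran), copied from the groupby output in Exercise 7.
counts = {(0, 0): 2052, (0, 1): 420, (1, 0): 878, (1, 1): 202}

n_not_eligible = counts[(0, 0)] + counts[(0, 1)]
n_eligible = counts[(1, 0)] + counts[(1, 1)]

# Assuming random assignment and monotonicity (no defiers):
always_takers = counts[(0, 1)] / n_not_eligible   # P(D=1 | Z=0): serve even when not eligible
never_takers = counts[(1, 0)] / n_eligible        # P(D=0 | Z=1): do not serve even when eligible
compliers = 1.0 - always_takers - never_takers    # serve only if eligible

print(f"always-takers: {always_takers:.4f}")
print(f"never-takers:  {never_takers:.4f}")
print(f"compliers:     {compliers:.4f}")
```

The complier share equals the first-stage difference $\E[D_i|Z_i=1]-\E[D_i|Z_i=0]$, so the IV estimate describes only this subgroup.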

## Exercise 7

In [8]:
data.groupby(['eligible', 'veteran']).count()

| eligible | veteran | byr | year | earnings |
|---|---|---|---|---|
| 0.0 | 0.0 | 2052 | 2052 | 2052 |
| 0.0 | 1.0 | 420 | 420 | 420 |
| 1.0 | 0.0 | 878 | 878 | 878 |
| 1.0 | 1.0 | 202 | 202 | 202 |


In [9]:
pd.crosstab(data.eligible, data.veteran, margins=True, margins_name="Total", values=1, aggfunc='sum')/len(data)

| eligible (rows), veteran (columns) | 0.0 | 1.0 | Total |
|---|---|---|---|
| 0.0 | 0.577703 | 0.118243 | 0.695946 |
| 1.0 | 0.247185 | 0.056869 | 0.304054 |
| Total | 0.824887 | 0.175113 | 1.0 |


In [10]:
0.175113*0.695946

0.12186919189799998

In [11]:
data[['eligible', 'veteran']].corr()

| | eligible | veteran |
|---|---|---|
| eligible | 1.0 | 0.020738 |
| veteran | 0.020738 | 1.0 |
