# Exercise Set 4: controlling for observable factors and causal trees

In this exercise set we try out different matching techniques and an implementation of causal trees.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 
import seaborn as sns
from scipy.stats import ttest_rel, ttest_ind

%matplotlib inline

<br>

## 4.1 Survival and passenger class

We revisit a classic dataset: Titanic. We want to analyze whether passengers in first class had a higher survival probability.

The code below loads the dataset. 

In [43]:
df = sns.load_dataset('titanic').dropna(subset=['age'])

X = pd.get_dummies(df.drop(['pclass','class', 'alive','survived'],axis=1), drop_first=True).astype('float')
D = (df['pclass'] < 3).rename('high_class')
y = (df['alive']=='yes').astype('float')
sum(D)

359

> **Ex. 4.1.1:** Compute the ATE of not travelling on a 3rd class ticket, assuming the CIA holds.

In [44]:
avg_treated = y[D].mean()
avg_untreated = y[~D].mean()

print(f'ATE assuming randomization of D: {avg_treated - avg_untreated}')

ATE assuming randomization of D: 0.33159402095021384
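The difference in means above has no measure of uncertainty attached. As a minimal sketch, assuming treatment is as good as randomly assigned, the estimate and a standard error that allows unequal variances in the two groups could be computed as below. The helper `diff_in_means` and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

def diff_in_means(y, d):
    """Difference in mean outcomes between treated and untreated units,
    with a standard error that allows unequal group variances."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    y1, y0 = y[d], y[~d]
    estimate = y1.mean() - y0.mean()
    std_err = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return estimate, std_err

# Toy data (made up for illustration, not the Titanic sample)
toy_y = pd.Series([1, 1, 0, 1, 0, 0, 1, 0], dtype=float)
toy_d = pd.Series([True, True, True, True, False, False, False, False])

est, se = diff_in_means(toy_y, toy_d)
print(f'Difference in means: {est:.3f} (s.e. {se:.3f})')
```

On the notebook data the call would be `diff_in_means(y, D)`.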


> **Ex. 4.1.2:** Compute the share of males, the proportion travelling alone, and the mean age, by treatment status. Then modify the code below to try out coarsened exact matching on `exact_cols = ['age_group', 'alone', 'sex_male']` with age bins of 2, 5, 10 and 15 years. 
>
> Comment on the result. Does coarsened exact matching seem like a feasible approach in this setting?

```python
age_diff =  2
X['age_group'] =  (X.age // age_diff)
match_count = \
    pd.DataFrame({'treat':X[D].groupby(exact_cols).size(), 
                  'control':X[~D].groupby(exact_cols).size()})
n_obs_matched = int(match_count.dropna().sum().sum())
```

In [45]:
match_count

| age_group | alone | sex_male | treat | control |
|---|---|---|---|---|
| 0.0 | 0.0 | 0.0 | 12.0 | 24.0 |
| 0.0 | 0.0 | 1.0 | 12.0 | 26.0 |
| 0.0 | 1.0 | 0.0 | NaN | 3.0 |
| 0.0 | 1.0 | 1.0 | NaN | 1.0 |
| 1.0 | 0.0 | 0.0 | 37.0 | 22.0 |
| 1.0 | 0.0 | 1.0 | 23.0 | 26.0 |
| 1.0 | 1.0 | 0.0 | 22.0 | 27.0 |
| 1.0 | 1.0 | 1.0 | 37.0 | 112.0 |
| 2.0 | 0.0 | 0.0 | 29.0 | 14.0 |
| 2.0 | 0.0 | 1.0 | 28.0 | 14.0 |


In [46]:
#age_diff =  2
exact_cols = ['age_group', 'alone', 'sex_male']

for age_diff in (2,5,10,15):
    X['age_group'] =  (X.age // age_diff)
    match_count = \
        pd.DataFrame({'treat':X[D].groupby(exact_cols).size(), 
                      'control':X[~D].groupby(exact_cols).size()})           
    n_obs_matched = int(match_count.dropna().sum().sum())
    print(f'Matched {n_obs_matched} with a maximum age difference of {age_diff}')

Matched 611 with a maximum age difference of 2
Matched 662 with a maximum age difference of 5
Matched 679 with a maximum age difference of 10
Matched 687 with a maximum age difference of 15
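The first part of Ex. 4.1.2 asks for covariate means by treatment status. A minimal sketch of such a balance table is shown below on made-up toy data; the names `toy_X`, `toy_D` and `group` are hypothetical stand-ins for the notebook's `X` and `D`.

```python
import pandas as pd

# Toy covariates and treatment indicator (made up for illustration)
toy_X = pd.DataFrame({
    'sex_male': [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    'alone':    [1.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    'age':      [22.0, 38.0, 54.0, 20.0, 27.0, 35.0],
})
toy_D = pd.Series([False, True, True, False, False, True], name='high_class')

# Label the groups explicitly so the rows of the table are easy to read
group = toy_D.map({True: 'treated', False: 'control'})

# Means of 0/1 dummies are shares; the mean of age is the average age
balance = toy_X.groupby(group)[['sex_male', 'alone', 'age']].mean()
print(balance)
```

Large gaps between the two rows indicate covariates on which the treated and control groups are unbalanced.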


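Continue with an age bin width of 5 years. The loop above leaves `X['age_group']` computed with a width of 15, so recompute it with `age_diff = 5` first.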

> **Ex. 4.1.3:** Compute the average treatment effect by using (coarsened) exact matching on `age` (i.e. on `age_group`). 
>
> Comment on the result. How do the group treatment effects compare to the ATE you found in 4.1.1?

In [47]:
from sklearn.neighbors import RadiusNeighborsRegressor

# 1. Fit RadiusNeighborsRegressor(radius = 0) to the treated 
# individuals. This model will predict the average Y(1) of 
# people with an exact match on covariates.
# Use this model to predict counterfactual Y(1) on the control group.
# Units without an exact match get a NaN prediction.
y_treated, y_control = y[D], y[~D]
X_treated, X_control = X[D], X[~D]

model_tgroup = RadiusNeighborsRegressor(radius = 0)
model_tgroup.fit(X_treated[exact_cols], y_treated)
imputed_Y1 = model_tgroup.predict(X_control[exact_cols])

# Reverse and repeat for the control group
model_cgroup = RadiusNeighborsRegressor(radius = 0)
model_cgroup.fit(X_control[exact_cols], y_control)
imputed_Y0 = model_cgroup.predict(X_treated[exact_cols])


ITE_treated = y_treated - imputed_Y0
ITE_control = imputed_Y1 - y_control

ITE = np.r_[ITE_treated, ITE_control]
ITE[~np.isnan(ITE)].mean()



0.2959026678780339
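To see what the radius-0 regressors do, the same kind of estimate can be written with plain group means: within each covariate cell, the effect is the treated mean minus the control mean, and cells are weighted by their number of observations. The sketch below is an illustration on made-up data, not the notebook's own method; `exact_matching_ate` and the toy frame are hypothetical, and the treatment column is assumed to be coded 0/1.

```python
import pandas as pd

def exact_matching_ate(df, outcome, treat, cells):
    """ATE over matched units: average of cell-level differences in means,
    weighted by the number of observations in each matched cell."""
    grouped = df.groupby(cells + [treat])[outcome].agg(['mean', 'size'])
    means = grouped['mean'].unstack(treat)   # columns: 0 (control), 1 (treated)
    sizes = grouped['size'].unstack(treat)
    matched = means.notna().all(axis=1)      # cells containing both groups
    cell_effect = means.loc[matched, 1] - means.loc[matched, 0]
    cell_weight = sizes.loc[matched].sum(axis=1)
    return float((cell_effect * cell_weight).sum() / cell_weight.sum())

# Toy data (made up): the last row has no treated match and is dropped
toy = pd.DataFrame({
    'age_group': [0, 0, 0, 1, 1, 1, 2],
    'sex_male':  [0, 0, 0, 1, 1, 1, 1],
    'd':         [1, 0, 0, 1, 1, 0, 0],
    'y':         [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
})
ate_matched = exact_matching_ate(toy, 'y', 'd', ['age_group', 'sex_male'])
print(ate_matched)
```

Averaging individual effects over matched units, as in the cell above, gives the same number, because the mean individual effect within a cell equals the cell's difference in means.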

> **Ex. 4.1.4:** Estimate a logistic regression model for predicting the passenger class variable (i.e. `D`, the treatment indicator).

In [105]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler        # scales variables to be mean=0,sd=1
from sklearn.model_selection import GridSearchCV,RepeatedStratifiedKFold,cross_validate

## Set random state
random_state = 1337

## Instantiate the pipeline steps
std_scaler   = StandardScaler()
lr_reg       = LogisticRegression(class_weight='balanced', solver = 'liblinear')


## Create pipeline
lr = Pipeline([('scale', std_scaler ),
               ('clf', lr_reg)])


## Set up grid search over C (the inverse of the regularization strength lambda)
param_grid_lr          = {'clf__C':np.logspace(-4,4,20)}


cv_inner_fold    = 10 ## Inner CV to tune hyper parameters
cv_outer_splits  = 10 ## Outer CV to estimate the generalization error 
cv_outer_repeats = 2  ## Repetitions of the outer CV (with the same training data vs test data ratio)

cv_outer_folds        = RepeatedStratifiedKFold(n_splits=cv_outer_splits,
                                               n_repeats=cv_outer_repeats,
                                               random_state=random_state
                                               )
lr_cv_grid            = GridSearchCV(estimator=lr,
                                    param_grid=param_grid_lr,
                                    cv=cv_inner_fold 
                                    )
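## The model predicts the treatment indicator D, not the outcome y
D_target              = D.astype(int) ## 0/1 coding of the treatment indicator

## Single CV
lr_cv                 = lr_cv_grid.fit(X,D_target) 
optimal_lambda        = lr_cv.best_params_ ## Holds the optimal C, i.e. the inverse of lambda


## Nested CV (only used if one wants to investigate generalization error)
lr_cv_nested          = cross_validate(estimator=lr_cv_grid,
                                     X=X,
                                     y=D_target,
                                     scoring=['accuracy','f1'],
                                     cv=cv_outer_folds,
                                    return_estimator = True) 

## Get lr predictions 
lr.set_params(**optimal_lambda) ## Set the optimal C within the pipeline
lr.fit(X,D_target) ## Re-run pipeline with updated parameters
D_pred  = lr.predict(X) ## Predict through the pipeline so that X is scaled first
p_score = lr.predict_proba(X)[:,1] ## Estimated propensity scores P(D=1|X)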

> **Ex. 4.1.5:** What other models might we have chosen? 

*Any other classifier that outputs class probabilities, e.g. k-nearest neighbors or a deep neural network.*

> **Ex. 4.1.6:** What is the overlap of predicted probabilities? What happens if you estimate the model without `fare` and `deck`? Comment on the result.
>
> Why do `fare` and `deck` matter a lot in this setting? Try to draw a causal diagram that might illuminate your discussion.

*They matter a lot because the passengers on the upper decks had easier access to the lifeboats. Fares are higher for the upper-class decks, hence fares also matter.*
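One way to inspect the overlap asked about in Ex. 4.1.6 is to summarize the estimated propensity scores by treatment group. The sketch below is only an illustration: `overlap_summary` is a hypothetical helper, the scores are made up, and the bounds 0.2 and 0.8 are borrowed from Ex. 4.1.7.

```python
import numpy as np
import pandas as pd

def overlap_summary(p_score, d, lower=0.2, upper=0.8):
    """Range of propensity scores and share inside [lower, upper], by group."""
    frame = pd.DataFrame({'p': np.asarray(p_score, dtype=float),
                          'd': np.asarray(d, dtype=int)})
    frame['in_support'] = frame['p'].between(lower, upper)
    return frame.groupby('d').agg(min_p=('p', 'min'),
                                  max_p=('p', 'max'),
                                  share_in_support=('in_support', 'mean'))

# Toy propensity scores (made up): controls score low, treated score high
toy_p = [0.10, 0.30, 0.50, 0.90, 0.95, 0.60]
toy_d = [0, 0, 0, 1, 1, 1]
summary = overlap_summary(toy_p, toy_d)
print(summary)
```

If the score ranges of the two groups barely intersect, few units have comparable counterparts, and matching on the propensity score relies on very few matches.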

> **Ex. 4.1.7:** Use a 5-nearest-neighbors matching in propensity space to compute the average treatment effect. Bootstrap the 95 pct. confidence interval of the ATE. What happens if you select only propensity score values with high common support, i.e. between 0.2 and 0.8?

In [None]:
# Your answer here
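A possible starting point for Ex. 4.1.7 is sketched below. It is not the official solution and makes several assumptions: the helpers `knn_propensity_ate` and `bootstrap_ci` are hypothetical, the data are simulated with a true effect of 1, and the bootstrap holds the propensity scores fixed instead of re-estimating them in every resample.

```python
import numpy as np

def knn_propensity_ate(y, d, p_score, k=5):
    """ATE from k-nearest-neighbor matching on the propensity score."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    p = np.asarray(p_score, dtype=float)
    y1, p1 = y[d], p[d]
    y0, p0 = y[~d], p[~d]
    # Indices of the k treated units closest to each control unit, and vice versa
    nn_treated = np.argsort(np.abs(p0[:, None] - p1[None, :]), axis=1)[:, :k]
    nn_control = np.argsort(np.abs(p1[:, None] - p0[None, :]), axis=1)[:, :k]
    imputed_y1 = y1[nn_treated].mean(axis=1)   # counterfactual Y(1) for controls
    imputed_y0 = y0[nn_control].mean(axis=1)   # counterfactual Y(0) for treated
    ite = np.r_[y1 - imputed_y0, imputed_y1 - y0]
    return float(ite.mean())

def bootstrap_ci(y, d, p_score, k=5, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval; propensity scores are kept fixed."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    p = np.asarray(p_score, dtype=float)
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        if d[idx].sum() < k or (~d[idx]).sum() < k:
            continue  # skip resamples with too few units in one group
        draws.append(knn_propensity_ate(y[idx], d[idx], p[idx], k))
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# Simulated data (made up): one confounder x, true treatment effect of 1
rng = np.random.default_rng(1)
x = rng.normal(size=600)
p_true = 1 / (1 + np.exp(-x))
d = rng.random(600) < p_true
y = 1.0 * d + x + rng.normal(scale=0.5, size=600)

ate = knn_propensity_ate(y, d, p_true)
ci_low, ci_high = bootstrap_ci(y, d, p_true)

# Restrict to the region of common support used in the exercise
keep = (p_true >= 0.2) & (p_true <= 0.8)
ate_trimmed = knn_propensity_ate(y[keep], d[keep], p_true[keep])
print(ate, (ci_low, ci_high), ate_trimmed)
```

On the Titanic data, the simulated `y`, `d` and `p_true` would be replaced by the outcome, the treatment indicator and the propensity scores estimated in Ex. 4.1.4.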

> **Ex. 4.1.8:** (BONUS) How might we improve on the approach above?

In [None]:
# Your answer here

## 4.2 Honest trees

In this problem we will try to implement and understand some of the ideas used in [Athey, Imbens (2015)](https://www.pnas.org/content/pnas/113/27/7353.full.pdf) to develop _honest inference_ in decision tree models. The paper begins by covering honesty in the setting of population averages and of conditional-mean estimation, so you will need to look at the second half of the paper to get an impression of its use for treatment-effect estimation.

> **Ex. 4.2.1:** What does it mean that a tree is _honest?_ In particular what are the implications in terms of 
> * The intuition for why honesty is required in order to get good local treatment effect estimates?
> * The practical implementation of the DT algorithm?

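*A tree is honest when one sample is used to choose the partition (the splits) and a separate, independent sample is used to estimate the treatment effect within each leaf. This prevents data leakage between partitioning and estimation: splits that were chosen because they fit noise in one sample do not bias the leaf estimates from the other. In practice, the training data is split in two before the tree is grown.*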
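The intuition can be illustrated with a small simulation in the simpler setting of leaf means. The outcome below is pure noise, so the true difference between any two leaves is zero. A split chosen to maximize the gap on the training half makes that half look heterogeneous, while an independent estimation half does not share this bias. This is a sketch under simplifying assumptions: a single split, simulated data, and hypothetical helpers `best_gap_split` and `leaf_gap`.

```python
import numpy as np

def best_gap_split(x, y, min_leaf=20):
    """Threshold on x that maximizes the absolute gap between the leaf means."""
    best_thr, best_gap = None, -np.inf
    for thr in np.unique(x):
        left = x <= thr
        if left.sum() < min_leaf or (~left).sum() < min_leaf:
            continue  # require a minimum number of observations per leaf
        gap = abs(y[left].mean() - y[~left].mean())
        if gap > best_gap:
            best_thr, best_gap = thr, gap
    return best_thr

def leaf_gap(x, y, thr):
    """Difference between the right-leaf and left-leaf mean of y."""
    left = x <= thr
    return y[~left].mean() - y[left].mean()

rng = np.random.default_rng(0)
adaptive_gaps, honest_gaps = [], []
for _ in range(200):
    x = rng.uniform(size=200)
    y = rng.normal(size=200)             # pure noise: the true gap is zero
    x_tr, y_tr = x[:100], y[:100]        # sample used to choose the split
    x_est, y_est = x[100:], y[100:]      # independent estimation sample
    thr = best_gap_split(x_tr, y_tr)
    adaptive_gaps.append(abs(leaf_gap(x_tr, y_tr, thr)))
    honest_gaps.append(abs(leaf_gap(x_est, y_est, thr)))

print(f'Mean absolute gap, same sample:       {np.mean(adaptive_gaps):.3f}')
print(f'Mean absolute gap, estimation sample: {np.mean(honest_gaps):.3f}')
```

The gap measured on the sample that chose the split is systematically larger than the gap measured on the held-out estimation sample, even though no real heterogeneity exists.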

> **Ex. 4.2.2:** Use the `load_42_data` function to load the Boston house-price dataset. Split your dataset in two: a 50% test set and a 50% train set, using `sklearn.model_selection.train_test_split`. 

In [111]:
def load_42_data():
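    # NOTE: load_boston was removed in scikit-learn 1.2, so this import
    # requires an older scikit-learn version.
    from sklearn.datasets import load_boston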
    df = load_boston()
    df = pd.DataFrame(np.c_[df['data'], df['target']], columns = list(df['feature_names']) + ['y'])
    return df

In [120]:
df = load_42_data()
y  = df['y']
X  = df.drop('y',axis=1)

In [126]:
# Your answer here
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.5,train_size=0.5)

> **Ex. 4.2.3:** Identify the column and value in `X_train` that minimizes the (cross split weighted) sum of squared errors in the training data. Split the test data according to this value and report the mean and standard deviation of `y` in both subsamples for both the train and test data.
>
> Comment on your results. How different are the two subsamples from the overall mean and standard deviation?

In [128]:
# Your answer here
#X_train
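One way to approach Ex. 4.2.3 is a brute-force search over every column and every observed value, keeping the split with the lowest summed squared error across the two leaves. The sketch below is a hypothetical illustration on made-up data (the helpers `sse` and `best_sse_split` are not part of the notebook), and it assumes splits of the form `x <= threshold`.

```python
import numpy as np
import pandas as pd

def sse(values):
    """Sum of squared deviations from the mean."""
    return float(((values - values.mean()) ** 2).sum())

def best_sse_split(X, y):
    """Return (column, threshold, total SSE) of the split `x <= threshold`
    that minimizes the summed SSE of the two resulting leaves."""
    best = (None, None, np.inf)
    for col in X.columns:
        # Dropping the largest value keeps both leaves non-empty
        for thr in np.unique(X[col])[:-1]:
            left = X[col] <= thr
            total = sse(y[left]) + sse(y[~left])
            if total < best[2]:
                best = (col, thr, total)
    return best

# Toy data (made up): column 'a' separates the outcome perfectly at a <= 3
toy_X = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                      'b': [6, 1, 5, 2, 4, 3]})
toy_y = pd.Series([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

col, thr, total = best_sse_split(toy_X, toy_y)

# Report mean and standard deviation of y in each leaf
leaf = np.where(toy_X[col] <= thr, 'left', 'right')
leaf_stats = toy_y.groupby(leaf).agg(['mean', 'std'])
print(col, thr, total)
print(leaf_stats)
```

With the notebook data, the split would be found on `X_train` and `y_train`, and the same `col` and `thr` would then be applied to `X_test` and `y_test` to compare the leaf statistics.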

> **Ex. 4.2.4:** Redo your analysis from 4.2.3, but this time split the data into a 66% train dataset and a 33% test dataset. Split the train data once more 50/50 to get a train and an estimation dataset. 
>
> Focus only on one of the subsets (i.e. either the left or right leaf). 
>
> Report the same statistics as before, but for the train, estimation and test samples. This time, show your results as density plots graphing 5,000 bootstrap replications of the whole procedure. If your PC is slow, you might need to reduce the number of replications to 1,000.


In [None]:
# Your answer here