In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Marginal structural models (MSM), frequentist approach
## IPTW based MSM ad modum Andrew Heiss
This is a Python conversion of Andrew Heiss' blog post on IPTW based MSMs: https://www.andrewheiss.com/blog/2020/12/03/ipw-tscs-msm/
Andrew's generated data will be used. The dataset simulates the effects of a 6-hour working day policy (binary treatment) and the number of vacation days (continuous treatment) on happiness in several (fake) countries. The underlying DAG is this: https://www.andrewheiss.com/blog/2020/12/03/ipw-tscs-msm/index_files/figure-html/dag-complex-1.png

### Load data and a stupid preprocessing step
After loading the data, a preprocessing step filters out countries that never adopt the new policy. This avoids numerical problems in the treatment models when a country's treatment never changes. Modeling the IPW with a zero-inflated model instead of plain logistic regression might help here.

In [2]:
data = pd.read_csv('https://www.andrewheiss.com/blog/2020/12/03/ipw-tscs-msm/happiness_data.csv')
policy_data = data[data["country"].isin(data[data["policy"] == 1].country)].reset_index(drop=True).copy()
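The filter above can be tried on a tiny, made-up frame to see which countries survive. This is only a sketch: the `toy` data and the `adopters` and `kept` names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy panel: A adopts the policy, B never does, C is always treated
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "policy":  [0,   1,   0,   0,   1,   1],
})

# Countries with at least one treated row
adopters = toy.loc[toy["policy"] == 1, "country"].unique()

# Same idiom as the notebook cell: keep only rows from those countries
kept = toy[toy["country"].isin(adopters)].reset_index(drop=True)
print(sorted(kept["country"].unique()))
```

Note that this idiom only drops never-treated countries such as B. A country that is treated in every period (C here) also has an unchanging treatment but is kept.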

## Binary treatment MSM/IPTW
### Underlying question
What is the effect of implementing the 6-hour working day on happiness?
### IPTW for binary treatment

$\text{unstabilized binary IPW}_{it} = \prod_{t=1}^{t} \frac{1}{P(X_{it} | \bar{X}_{i,t-1}, Y_{i,t-1}, C_{it}, V_{i})}$, where

 - ${i}$ is the individual country
 - ${t}$ is the timestep
 - ${X_{it}}$ is the observed treatment assignment for ${i}$ at ${t}$
 - $\bar{X}_{i,t-1}$ are the observed treatment assignments for ${i}$ <b>up until</b> ${t-1}$
 - ${Y}_{i,t-1}$ is the observed outcome for ${i}$ at ${t-1}$
 - $C_{it}$ are the time-varying confounders for ${i}$ at ${t}$
 - $V_{i}$ are the time-invariant (constant) confounders for ${i}$.


<b>Read as:</b> "inverse probability of treatment given all previous treatment assignments, the outcome of interest at the previous timestep, the current time varying confounders and the constant confounders"

However, these weights need to be stabilized. This is done by modifying the numerator:

$\text{stabilized binary IPW}_{it} = \prod_{t=1}^{t} \frac{P(X_{it} | \bar{X}_{i,t-1}, V_{i})}{P(X_{it} | \bar{X}_{i,t-1}, Y_{i,t-1}, C_{it}, V_{i})}$
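
To make the cumulative product concrete, here is a small sketch with made-up probabilities. The `toy` frame and the `p_num` and `p_den` columns are hypothetical; they stand in for the numerator and denominator probabilities of the treatment each country actually received at each timestep.

```python
import pandas as pd

# Hypothetical probabilities of the observed treatment, three timesteps per country
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "p_num": [0.5, 0.8, 0.9, 0.5, 0.6, 0.7],  # P(X_it | treatment history, V_i)
    "p_den": [0.4, 0.8, 0.6, 0.5, 0.3, 0.7],  # P(X_it | treatment history, Y_i,t-1, C_it, V_i)
})

# Unstabilized weight: running product of 1 / denominator within each country
toy["unstabilized_ipw"] = (1 / toy["p_den"]).groupby(toy["country"]).cumprod()

# Stabilized weight: running product of numerator / denominator within each country
toy["stabilized_ipw"] = (toy["p_num"] / toy["p_den"]).groupby(toy["country"]).cumprod()

print(toy)
```

For country A the stabilized weights are 1.25, 1.25 and 1.875, while the unstabilized ones grow to about 5.21 by the third timestep, which illustrates why the stabilized version is preferred.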


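### Recipe for getting the IPTW
1. Define the propensity function $e(W)$
2. Fit two logistic regressions for the treatment: a numerator model on the lagged treatment and country, and a denominator model that adds the time-varying confounders and the lagged outcome
3. For each row, take the predicted probability of the treatment actually observed: $e(W)$ if treated, $1 - e(W)$ otherwise
4. Divide the numerator probability by the denominator probability to get the instantaneous weight
5. Take the cumulative product of the instantaneous weights within each country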


In [17]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

#create new dataframe
policy_data_weighted = policy_data.copy()

#raw propensity scores: the numerator model uses treatment history only,
#the denominator model adds the confounders and the lagged outcome
#(the Binomial family uses a logit link by default)
model_num = smf.glm("policy ~ lag_policy + country", data=policy_data, family=sm.families.Binomial())
model_den = smf.glm("policy ~ log_gdp_cap + democracy + corruption + lag_happiness_policy + lag_policy + country", data=policy_data, family=sm.families.Binomial())
#fitted values are the predicted probabilities of policy == 1
policy_data_weighted["propensity_num"] = model_num.fit().fittedvalues
policy_data_weighted["propensity_den"] = model_den.fit().fittedvalues

#calculate instantaneous ipw
policy_data_weighted["propensity_num_outcome"] = np.where(policy_data_weighted["policy"]==1, policy_data_weighted["propensity_num"], 1-policy_data_weighted["propensity_num"])
policy_data_weighted["propensity_den_outcome"] = np.where(policy_data_weighted["policy"]==1, policy_data_weighted["propensity_den"], 1-policy_data_weighted["propensity_den"])
policy_data_weighted["instant_ipw"] = policy_data_weighted["propensity_num_outcome"] / policy_data_weighted["propensity_den_outcome"]

#calculate actual ipw (cumulative product within each country)
policy_data_weighted["ipw"] = policy_data_weighted.groupby("country").instant_ipw.cumprod()

### Estimation of ATE
With the IPTW calculated, we can estimate the ATE. Here I'll use a simple GLM, whereas Heiss uses a mixed effects model. The latter does not work here because the mixed effects model in statsmodels does not accept weights. Even so, the estimated effect of the policy on happiness (about 7.8) is close to the true value of 7.6!


In [21]:
policy_ate_model = smf.glm("happiness_policy ~ policy + lag_policy",
                           data = policy_data_weighted,
                           freq_weights=policy_data_weighted["ipw"])

In [22]:
policy_ate_model.fit().summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | happiness_policy | No. Observations: | 1380.0 |
| Model: | GLM | Df Residuals: | 2014.91 |
| Model Family: | Gaussian | Df Model: | 2.0 |
| Link Function: | identity | Scale: | 80.223 |
| Method: | IRLS | Log-Likelihood: | -7285.9 |
| Date: | Wed, 09 Nov 2022 | Deviance: | 161640.0 |
| Time: | 17:13:22 | Pearson chi2: | 162000.0 |
| No. Iterations: | 3 | Pseudo R-squ. (CS): | 0.2054 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 48.5687 | 0.471 | 103.069 | 0.000 | 47.645 | 49.492 |
| policy | 7.8076 | 0.771 | 10.128 | 0.000 | 6.297 | 9.319 |
| lag_policy | 1.5761 | 0.654 | 2.410 | 0.016 | 0.294 | 2.858 |
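
The summary reports a nonrobust covariance and 2014.91 residual degrees of freedom for 1380 observations, because `freq_weights` treats each weight as a count of repeated observations. One possible alternative, not used in the original post, is weighted least squares with standard errors clustered by country. The sketch below uses purely synthetic data (the `toy` frame, its random weights and its effect size are made up) just to show the calls; the point estimates match the `freq_weights` GLM and only the standard errors differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for policy_data_weighted (assumption: not the real data)
rng = np.random.default_rng(0)
n_countries, n_years = 30, 10
toy = pd.DataFrame({
    "country": np.repeat(np.arange(n_countries), n_years),
    "policy": rng.integers(0, 2, n_countries * n_years),
})
toy["lag_policy"] = toy.groupby("country")["policy"].shift(1).fillna(0)
toy["ipw"] = rng.uniform(0.5, 2.0, len(toy))  # made-up weights
toy["happiness_policy"] = 50 + 7.6 * toy["policy"] + rng.normal(0, 5, len(toy))

formula = "happiness_policy ~ policy + lag_policy"

# Same specification as the notebook: Gaussian GLM with frequency weights
glm_fit = smf.glm(formula, data=toy, freq_weights=toy["ipw"]).fit()

# Weighted least squares with cluster-robust standard errors by country
robust_fit = smf.wls(formula, data=toy, weights=toy["ipw"]).fit(
    cov_type="cluster",
    cov_kwds={"groups": toy["country"].to_numpy()},
)

# Coefficients agree; the standard errors are computed differently
print(pd.DataFrame({
    "glm_coef": glm_fit.params,
    "wls_coef": robust_fit.params,
    "glm_se": glm_fit.bse,
    "cluster_se": robust_fit.bse,
}))
```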
