# Difference-in-differences demo

This example assumes you have 80 census tracts: 40 in a control group and 40 in a treatment group. For each tract, you have calculated the median rent per square foot at two time points: pre-treatment (2014) and post-treatment (2020). The values used here are randomly generated to stand in for real observations.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(0)  # for reproducibility

## Set up fake random observations

Let's say there are 40 tracts each in treatment and control.

In [2]:
n = 40
tract_ids = np.arange(n * 2)
treat_tract_ids = tract_ids[:n]
cntrl_tract_ids = tract_ids[n:]

In [3]:
# random data across 4 groups: treatment/control and pre/post event
treat_pre = np.random.normal(loc=1.2, scale=0.2, size=n)
treat_pst = np.random.normal(loc=2.0, scale=0.2, size=n)
cntrl_pre = np.random.normal(loc=1.0, scale=0.2, size=n)
cntrl_pst = np.random.normal(loc=1.5, scale=0.2, size=n)
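
The four `loc` values above fix the true effect built into the simulation, which is a useful benchmark for the estimates that follow. As a short worked calculation (the $\mu$ symbols are notation introduced here, not part of the original notebook):

$$
\underbrace{(\mu_{\text{treat,post}} - \mu_{\text{treat,pre}})}_{2.0 - 1.2 = 0.8} - \underbrace{(\mu_{\text{cntrl,post}} - \mu_{\text{cntrl,pre}})}_{1.5 - 1.0 = 0.5} = 0.3
$$

With only 40 draws per cell and a standard deviation of 0.2, any estimate from the simulated sample should be close to 0.3 but will not match it exactly.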

## Assemble dataset from the observations

Let `time` = 0 if 2014 and 1 if 2020.

Let `group` = 0 if control and 1 if treatment.

In [4]:
df_treat_pre = pd.DataFrame(data={'tract_id': treat_tract_ids,
                                  'tract_median_rent_sqft': treat_pre,
                                  'time': 0,
                                  'group': 1})
df_treat_pst = pd.DataFrame(data={'tract_id': treat_tract_ids,
                                  'tract_median_rent_sqft': treat_pst,
                                  'time': 1,
                                  'group': 1})
df_cntrl_pre = pd.DataFrame(data={'tract_id': cntrl_tract_ids,
                                  'tract_median_rent_sqft': cntrl_pre,
                                  'time': 0,
                                  'group': 0})
df_cntrl_pst = pd.DataFrame(data={'tract_id': cntrl_tract_ids,
                                  'tract_median_rent_sqft': cntrl_pst,
                                  'time': 1,
                                  'group': 0})
df = pd.concat([df_treat_pre, df_treat_pst, df_cntrl_pre, df_cntrl_pst]).reset_index(drop=True)

In [5]:
# create our key dummy variable: 1 if in the treatment group AND post-event, otherwise 0
df['post_treatment'] = df['time'] * df['group']

# add a couple of random covariates
df['num_bedrooms'] = np.random.normal(loc=2, scale=0.3, size=len(df))
df['dist_to_transit'] = np.random.normal(loc=500, scale=200, size=len(df))

In [6]:
# show a random sample of the assembled dataset
print(df.shape)
df.sample(10)

(160, 7)


|     | tract_id | tract_median_rent_sqft | time | group | post_treatment | num_bedrooms | dist_to_transit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 74 | 34 | 2.080468 | 1 | 1 | 1 | 1.653145 | 441.632527 |
| 54 | 14 | 1.994364 | 1 | 1 | 1 | 1.716666 | 509.898996 |
| 137 | 57 | 1.45834 | 1 | 0 | 0 | 2.174886 | 59.711743 |
| 62 | 22 | 1.837371 | 1 | 1 | 1 | 1.860921 | 48.887154 |
| 147 | 67 | 1.723403 | 1 | 0 | 0 | 1.595985 | 341.376527 |
| 108 | 68 | 1.384588 | 0 | 0 | 0 | 1.906734 | 386.137589 |
| 30 | 30 | 1.230989 | 0 | 1 | 0 | 1.612143 | 347.171215 |
| 152 | 72 | 1.351049 | 1 | 0 | 0 | 1.661952 | 444.865893 |
| 138 | 58 | 1.579201 | 1 | 0 | 0 | 1.880165 | 539.860039 |
| 34 | 34 | 1.130418 | 0 | 1 | 0 | 2.156983 | 369.54128 |


## Regression analysis

**The estimated DiD effect is the coefficient on the `post_treatment` variable.**
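
Written out, the model being estimated is the standard two-group, two-period specification. The subscripts ($i$ for tract, $t$ for period) and the $\beta$ labels are notation assumed here for illustration:

$$
y_{it} = \beta_0 + \beta_1\,\text{time}_t + \beta_2\,\text{group}_i + \beta_3\,(\text{time}_t \times \text{group}_i) + \varepsilon_{it}
$$

Here $\beta_0$ is the control group's pre-period mean, $\beta_1$ is the change over time in the control group, $\beta_2$ is the pre-period gap between treatment and control, and $\beta_3$ is the additional change in the treatment group, i.e. the DiD effect. The product $\text{time}_t \times \text{group}_i$ is the `post_treatment` column.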

In [7]:
# choose a response and predictors
response = 'tract_median_rent_sqft'
predictors = ['time', 'group', 'post_treatment']

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

                              OLS Regression Results                              
Dep. Variable:     tract_median_rent_sqft   R-squared:                       0.724
Model:                                OLS   Adj. R-squared:                  0.719
Method:                     Least Squares   F-statistic:                     136.7
Date:                    Thu, 18 Jun 2020   Prob (F-statistic):           1.85e-43
Time:                            21:28:40   Log-Likelihood:                 36.101
No. Observations:                     160   AIC:                            -64.20
Df Residuals:                         156   BIC:                            -51.90
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
cons

In [8]:
# you can also get the same value just by subtracting group means,
# i.e., by taking the "difference in differences"
is_post = df['time'].astype(bool)
is_treat = df['group'].astype(bool)
col = 'tract_median_rent_sqft'
treat_change = df.loc[is_post & is_treat, col].mean() - df.loc[~is_post & is_treat, col].mean()
cntrl_change = df.loc[is_post & ~is_treat, col].mean() - df.loc[~is_post & ~is_treat, col].mean()
treat_change - cntrl_change

0.2512124249367014
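
The equivalence between the interaction coefficient and the difference of cell means can also be checked with the statsmodels formula interface. This is a minimal sketch on freshly generated stand-in data: the `demo` frame, the `rent` column, and the `means` lookup are hypothetical names used only for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data laid out like the notebook's df:
# 40 observations in each of the four group/time cells.
rng = np.random.default_rng(0)
n = 40
demo = pd.DataFrame({
    'time': np.repeat([0, 1, 0, 1], n),
    'group': np.repeat([1, 1, 0, 0], n),
})

# Cell means keyed by (group, time), mirroring the simulation parameters.
means = {(1, 0): 1.2, (1, 1): 2.0, (0, 0): 1.0, (0, 1): 1.5}
loc = [means[(g, t)] for g, t in zip(demo['group'], demo['time'])]
demo['rent'] = rng.normal(loc=loc, scale=0.2)

# 'time * group' expands to time + group + time:group,
# so the interaction term plays the role of post_treatment.
fit = smf.ols('rent ~ time * group', data=demo).fit()
did_ols = fit.params['time:group']

# The same quantity computed directly from the four cell means.
cell = demo.groupby(['group', 'time'])['rent'].mean()
did_means = ((cell.loc[(1, 1)] - cell.loc[(1, 0)])
             - (cell.loc[(0, 1)] - cell.loc[(0, 0)]))

print(did_ols, did_means)
```

The two printed values agree up to floating-point error because the model has one parameter per cell, so the fitted values are exactly the cell means.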

## Regression with covariates

The regression framework is useful because it lets you include covariates and calculate standard errors.

Here we include covariates in the model. The DiD estimate changes only very slightly, because the covariates were generated at random, independently of the response.

In [9]:
# choose a response and predictors
response = 'tract_median_rent_sqft'
covariates = ['num_bedrooms', 'dist_to_transit']
predictors = ['time', 'group', 'post_treatment'] + covariates

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a multiple linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

                              OLS Regression Results                              
Dep. Variable:     tract_median_rent_sqft   R-squared:                       0.725
Model:                                OLS   Adj. R-squared:                  0.716
Method:                     Least Squares   F-statistic:                     81.27
Date:                    Thu, 18 Jun 2020   Prob (F-statistic):           2.08e-41
Time:                            21:28:40   Log-Likelihood:                 36.309
No. Observations:                     160   AIC:                            -60.62
Df Residuals:                         154   BIC:                            -42.17
Df Model:                               5                                         
Covariance Type:                nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
co