# Difference-in-differences

Difference-in-differences (DiD) is a quasi-experimental statistical method, common in econometrics, that uses data from an observational study to approximate an experimental design with treatment and control groups observed before and after some "treatment" of interest. It estimates the *effect* of a predictor (the treatment) on the response as the change in the treatment group minus the change in the control group. DiD can help mitigate selection bias, but it can still suffer from endogeneity problems such as a confounding omitted variable or simultaneity. The usual OLS assumptions apply.

This example assumes you have 80 census tracts: 40 in a control group and 40 in a treatment group. A new policy was implemented in the treatment group in 2017 and we want to test its effect on rents. For each tract, we have median rent/sqft at two time points: pre-treatment (2014) and post-treatment (2020). We assign randomized values to generate data accordingly.

We do not control for spatial diffusion of the policy effect in this example. After working through it, set up the DiD with controls for spatial diffusion as an individual exercise.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(0)  # for reproducibility

## Set up fake random observations

Let's say there are 40 tracts apiece in the treatment and control groups.

In [None]:
n = 40
tract_ids = np.arange(n * 2)
treat_tract_ids = tract_ids[:n]
cntrl_tract_ids = tract_ids[n:]

In [None]:
# random data across 4 groups: treatment/control and pre/post event
treat_pre = np.random.normal(loc=1.2, scale=0.2, size=n)
treat_pst = np.random.normal(loc=2.0, scale=0.2, size=n)
cntrl_pre = np.random.normal(loc=1.0, scale=0.2, size=n)
cntrl_pst = np.random.normal(loc=1.5, scale=0.2, size=n)

## Assemble dataset from the observations

Create 4 DataFrames representing treatment/pre, treatment/post, control/pre, control/post, and then concatenate them into a single DataFrame containing all the data.

Let `time = 0` if 2014 (pre-event) and 1 if 2020 (post-event).

Let `group = 0` if control and 1 if treatment.

In [None]:
df_treat_pre = pd.DataFrame(data={'tract_id': treat_tract_ids,
                                  'tract_median_rent_sqft': treat_pre,
                                  'time': 0,
                                  'group': 1})
df_treat_pst = pd.DataFrame(data={'tract_id': treat_tract_ids,
                                  'tract_median_rent_sqft': treat_pst,
                                  'time': 1,
                                  'group': 1})
df_cntrl_pre = pd.DataFrame(data={'tract_id': cntrl_tract_ids,
                                  'tract_median_rent_sqft': cntrl_pre,
                                  'time': 0,
                                  'group': 0})
df_cntrl_pst = pd.DataFrame(data={'tract_id': cntrl_tract_ids,
                                  'tract_median_rent_sqft': cntrl_pst,
                                  'time': 1,
                                  'group': 0})
df = pd.concat([df_treat_pre, df_treat_pst, df_cntrl_pre, df_cntrl_pst]).reset_index(drop=True)

In [None]:
# create our DiD interaction dummy variable of interest:
# 1 if in the treatment group AND post-event, otherwise 0
df['post_treatment'] = df['time'] * df['group']

# add a couple of random covariates, drawn independently of the response
df['num_bedrooms'] = np.random.normal(loc=2, scale=0.3, size=len(df))
df['dist_to_transit'] = np.random.normal(loc=500, scale=200, size=len(df))

In [None]:
# show a random sample of the assembled dataset
print(df.shape)
df.sample(10)

## Regression analysis

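The estimated DiD effect is the coefficient on the `post_treatment` variable. Given the means used to simulate the data, the true effect is (2.0 - 1.2) - (1.5 - 1.0) = 0.3, so the estimate should land close to that value.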
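
As a sketch of why that coefficient is the one of interest, the model estimated in this section can be written as follows. The $\beta$ and $\varepsilon$ symbols and the $i$ (tract) and $t$ (period) subscripts are notation introduced here for illustration; the variable names match the columns defined earlier.

$$
\text{rent}_{it} = \beta_0 + \beta_1\,\text{time}_{t} + \beta_2\,\text{group}_{i} + \beta_3\,(\text{time}_{t} \times \text{group}_{i}) + \varepsilon_{it}
$$

Under this specification the expected response in each of the four cells is $\beta_0$ (control, pre), $\beta_0 + \beta_1$ (control, post), $\beta_0 + \beta_2$ (treatment, pre), and $\beta_0 + \beta_1 + \beta_2 + \beta_3$ (treatment, post). Taking the difference of the two before/after differences leaves only the interaction coefficient:

$$
\big[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2)\big] - \big[(\beta_0 + \beta_1) - \beta_0\big] = \beta_3
$$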

In [None]:
# choose a response and predictors
response = 'tract_median_rent_sqft'
predictors = ['time', 'group', 'post_treatment']

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**Compare the output of the cell above to that of the cell below.** Without other covariates, the regression model above is equivalent to the DiD estimator below, which computes the same value (the coefficient on `post_treatment`) simply by subtracting group means, that is, by calculating the "difference in differences".

In [None]:
# slice the dataset up into the 4 groups (pre/post, treatment/control)
pre_treat = ~df['time'].astype(bool) & df['group'].astype(bool)
pst_treat = df['time'].astype(bool) & df['group'].astype(bool)
pre_cntrl = ~df['time'].astype(bool) & ~df['group'].astype(bool)
pst_cntrl = df['time'].astype(bool) & ~df['group'].astype(bool)

# then subtract their means
col = df['tract_median_rent_sqft']
did = (col[pst_treat].mean() - col[pre_treat].mean()) - (col[pst_cntrl].mean() - col[pre_cntrl].mean())
round(did, 4)
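
The equivalence between the two approaches can also be checked on a small hand-made dataset where the answer is easy to verify by eye. The following is a minimal sketch, not part of the original analysis: the `toy` DataFrame and its `rent` values are hypothetical, and it uses `np.linalg.lstsq` in place of statsmodels purely to keep the example short.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 2 groups x 2 periods, 3 observations per cell
toy = pd.DataFrame({
    'group': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'time':  [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    'rent':  [1.0, 1.1, 0.9, 1.5, 1.6, 1.4, 1.2, 1.3, 1.1, 2.0, 2.1, 1.9],
})

# 2x2 table of cell means: rows are group, columns are time
means = toy.pivot_table(index='group', columns='time', values='rent', aggfunc='mean')

# difference in differences computed directly from the cell means
did_means = (means.loc[1, 1] - means.loc[1, 0]) - (means.loc[0, 1] - means.loc[0, 0])

# same quantity from least squares; columns are const, time, group, time*group
X = np.column_stack([np.ones(len(toy)), toy['time'], toy['group'], toy['time'] * toy['group']])
beta, *_ = np.linalg.lstsq(X, toy['rent'].to_numpy(), rcond=None)
did_ols = beta[3]  # coefficient on the interaction term

print(round(did_means, 4), round(did_ols, 4))
```

Here the cell means are 1.0, 1.5, 1.2, and 2.0, so both numbers should agree at (2.0 - 1.2) - (1.5 - 1.0) = 0.3 up to floating-point error.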

## DiD regression with covariates

The regression framework is useful for DiD because it lets you include *covariates* and calculate *standard errors*. Adding covariates can 1) increase the power of the statistical tests and 2) improve identification by controlling for other time-varying factors that differ between the treatment and control groups and would otherwise violate the "common trend" assumption.

Here we include covariates in the model: the DiD estimate changes only slightly, because in this example these covariates are uncorrelated with the response.

In [None]:
# choose a response and predictors
response = 'tract_median_rent_sqft'
covariates = ['num_bedrooms', 'dist_to_transit']
predictors = ['time', 'group', 'post_treatment'] + covariates

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response] + predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a multiple linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())