# Analyze Causal Effect Using Diff-in-Diff Model

https://towardsdatascience.com/analyze-causal-effect-using-diff-in-diff-model-85e07b17e7b7

Card and Krueger (1994) estimate the causal effect of an increase in the state minimum wage on employment. On April 1, 1992, New Jersey raised its state minimum wage from $4.25 to $5.05, while the minimum wage in Pennsylvania stayed at $4.25. Card and Krueger surveyed fast food restaurants in New Jersey in February 1992 and again in November 1992. They also collected data from fast food restaurants in neighboring eastern Pennsylvania.
I downloaded the data file public.dat used by Card and Krueger (1994) from the MIT Economics website and computed the simple DiD estimate of the effect of the NJ minimum wage increase in Python. Essentially, I compare the change in employment in NJ with the change in employment in PA over the period from February to November. The code follows:

In [39]:
import os
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

In [20]:
df = pd.read_csv("mini-wage.csv")

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410 entries, 0 to 409
Data columns (total 46 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SHEET     410 non-null    int64  
 1   CHAINr    410 non-null    int64  
 2   CO_OWNED  410 non-null    int64  
 3   STATE     410 non-null    int64  
 4   SOUTHJ    410 non-null    int64  
 5   CENTRALJ  410 non-null    int64  
 6   NORTHJ    410 non-null    int64  
 7   PA1       410 non-null    int64  
 8   PA2       410 non-null    int64  
 9   SHORE     410 non-null    int64  
 10  NCALLS    410 non-null    int64  
 11  EMPFT     410 non-null    object 
 12  EMPPT     410 non-null    object 
 13  NMGRS     410 non-null    object 
 14  WAGE_ST   410 non-null    object 
 15  INCTIME   410 non-null    object 
 16  FIRSTINC  410 non-null    object 
 17  BONUS     410 non-null    int64  
 18  PCTAFF    410 non-null    object 
 19  MEAL      410 non-null    int64  
 20  OPEN      410 non-null    float6

In [23]:
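# data cleaning: '.' marks a missing value in the raw file
df = df.replace('.', np.nan)
df = df.apply(pd.to_numeric)

# total employment in full-time equivalents: a part-time worker counts as half
df['EMPTOT'] = df['EMPPT']*0.5 + df['EMPFT'] + df['NMGRS']
df['EMPTOT2'] = df['EMPPT2']*0.5 + df['EMPFT2'] + df['NMGRS2']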
df = df[['STATE','EMPTOT','EMPTOT2']]
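
The cleaning step above can be illustrated on a small frame. This is only a sketch: the `toy` frame and its values are made up, and only the column names mirror the survey file. It shows why the total ends up with fewer non-null rows than the raw columns: a store missing any one of the three components gets a missing total.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) using '.' as the missing-value marker.
toy = pd.DataFrame({
    "EMPFT": ["30", ".", "10"],
    "EMPPT": ["15", "20", "."],
    "NMGRS": ["3", "2", "4"],
})

# Turn '.' into NaN, then convert every column to a numeric dtype.
toy = toy.replace(".", np.nan).apply(pd.to_numeric)

# Full-time equivalents: part-time workers count as half a worker.
toy["EMPTOT"] = toy["EMPPT"] * 0.5 + toy["EMPFT"] + toy["NMGRS"]

print(toy)
print("rows with a usable total:", toy["EMPTOT"].notna().sum())
```

NaN propagates through the sum, so only the first toy row has a usable total.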

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410 entries, 0 to 409
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   STATE    410 non-null    int64  
 1   EMPTOT   398 non-null    float64
 2   EMPTOT2  396 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 9.7 KB


In [25]:
df.head()

   STATE  EMPTOT  EMPTOT2
0      0   40.50     24.0
1      0   13.75     11.5
2      0    8.50     10.5
3      0   34.00     20.0
4      0   24.00     35.5


In [13]:
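# create panel dataset for regression: stack the two survey waves
# .copy() avoids pandas' SettingWithCopyWarning when adding the 't' column
df_1 = df[['STATE', 'EMPTOT']].copy()
df_1['t'] = 0   # wave 1: before the increase
df_2 = df[['STATE', 'EMPTOT2']].copy()
df_2['t'] = 1   # wave 2: after the increase
df_2.columns = ['STATE', 'EMPTOT', 't']
df_reg = pd.concat([df_1, df_2], axis=0)
# interaction term: 1 only for NJ stores in the second wave
df_reg['dt'] = np.where((df_reg['STATE'] == 1) & (df_reg['t'] == 1), 1, 0)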
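
The same wide-to-long stacking can also be sketched with `DataFrame.melt`. This is an alternative to the cell above, not the approach it uses; the `wide` frame holds made-up values, and the `wave` and `EMP` names are placeholders introduced only for this example.

```python
import pandas as pd

# Toy wide frame with made-up values: one row per store, one column per wave.
wide = pd.DataFrame({
    "STATE":   [0, 0, 1, 1],
    "EMPTOT":  [20.0, 30.0, 18.0, 22.0],
    "EMPTOT2": [19.0, 27.0, 21.0, 23.0],
})

# Stack the two employment columns into one, remembering the source column.
long = wide.melt(id_vars="STATE", value_vars=["EMPTOT", "EMPTOT2"],
                 var_name="wave", value_name="EMP")

# t = 1 for the second wave; dt = 1 only for treated stores in the second wave.
long["t"] = (long["wave"] == "EMPTOT2").astype(int)
long["dt"] = long["STATE"] * long["t"]
long = long.drop(columns="wave").rename(columns={"EMP": "EMPTOT"})

print(long)
```

Because `STATE` and `t` are both 0/1 indicators, their product is the same interaction term that `np.where` builds.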

In [31]:
df_reg.head()

   STATE  EMPTOT  t  dt
0      0   40.50  0   0
1      0   13.75  0   0
2      0    8.50  0   0
3      0   34.00  0   0
4      0   24.00  0   0


In [36]:
# check by calculating the mean for each group directly
print("PA employment before:", round(df[(df.STATE == 0)]['EMPTOT'].mean(),2))
print("PA employment after:", round(df[(df.STATE == 0)]['EMPTOT2'].mean(),2))
print("NJ employment before:", round(df[(df.STATE == 1)]['EMPTOT'].mean(),2))
print("NJ employment after:", round(df[(df.STATE == 1)]['EMPTOT2'].mean(),2))
pa_dif = round(df[(df.STATE == 0)]['EMPTOT2'].mean(),2) - round(df[(df.STATE == 0)]['EMPTOT'].mean(),2)
nj_dif = round(df[(df.STATE == 1)]['EMPTOT2'].mean(),2) - round(df[(df.STATE == 1)]['EMPTOT'].mean(),2)

did = nj_dif - pa_dif
print("Diff in Diff in mean employment:",round(did,2))

PA employment before: 23.33
PA employment after: 21.17
NJ employment before: 20.44
NJ employment after: 21.03
Diff in Diff in mean employment: 2.75
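
The check above is the difference-in-differences estimator written out with group means. As a sketch in notation that is not in the original (here $\bar{Y}$ stands for mean `EMPTOT` in a state and survey wave, and $\hat{\delta}$ for the estimate):

```latex
\hat{\delta}_{\mathrm{DiD}}
  = \left(\bar{Y}_{\mathrm{NJ},\,\mathrm{after}} - \bar{Y}_{\mathrm{NJ},\,\mathrm{before}}\right)
  - \left(\bar{Y}_{\mathrm{PA},\,\mathrm{after}} - \bar{Y}_{\mathrm{PA},\,\mathrm{before}}\right)
```

The first bracket is `nj_dif` and the second is `pa_dif`. Subtracting the PA change removes whatever moved employment in both states over the period, under the assumption that NJ would have followed the same trend as PA without the wage increase.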


In [41]:
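# regression: the coefficient on dt is the DiD estimate
# cov_type='HC1' requests heteroskedasticity-robust standard errors
result = ols('EMPTOT ~ STATE + t + dt', data=df_reg).fit(cov_type='HC1')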
print(result.summary())



                            OLS Regression Results                            
Dep. Variable:                 EMPTOT   R-squared:                       0.007
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     1.404
Date:                Tue, 01 Feb 2022   Prob (F-statistic):              0.240
Time:                        09:08:36   Log-Likelihood:                -2904.2
No. Observations:                 794   AIC:                             5816.
Df Residuals:                     790   BIC:                             5835.
Df Model:                           3                                         
Covariance Type:                  HC1                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.3312      1.346     17.337      0.0
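
In the regression above, the coefficient on `dt` plays the role of the DiD estimate, because a model with `STATE`, `t`, and their interaction fits each of the four group means exactly. A minimal sketch of that equivalence follows. It uses synthetic data, not the Card and Krueger file: the `sim` frame, the cell means, and the built-in effect of 3.0 are all made-up assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic panel (made-up numbers): 50 observations per state/wave cell.
rng = np.random.default_rng(0)
rows = []
for state in (0, 1):
    for t in (0, 1):
        # Hypothetical cell mean with a treatment effect of 3.0 built in.
        mean = 20.0 + 2.0 * state - 1.0 * t + 3.0 * state * t
        for y in rng.normal(mean, 5.0, size=50):
            rows.append({"STATE": state, "t": t, "EMPTOT": y})
sim = pd.DataFrame(rows)
sim["dt"] = sim["STATE"] * sim["t"]

# Same specification as the regression cell above.
fit = ols("EMPTOT ~ STATE + t + dt", data=sim).fit(cov_type="HC1")

# DiD computed directly from the four cell means.
cell = sim.groupby(["STATE", "t"])["EMPTOT"].mean()
did_means = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])

print("dt coefficient:", fit.params["dt"])
print("DiD from means:", did_means)
```

The two printed numbers agree up to floating-point error. The regression form additionally yields a standard error for the estimate, which the comparison of means alone does not.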