## Difference-in-differences (DiD) in Economics

Difference-in-differences (DiD) is a quasi-experimental statistical method widely used in econometrics and quantitative research. It relies on observational data. Quasi-experimental methods estimate causal effects under a set of identifying assumptions, and they are especially useful when randomization is not possible.

In this notebook, I try to perform DiD analysis using a fictional dataset provided during the lecture. This notebook is based on the lectures on "Quasi-Experiments in Development Economics" by Prof. Dr. Sebastian Vollmer at the University of Göttingen.

https://flexnow2.uni-goettingen.de/modulbeschreibungen/66723.pdf

DiD requires observations on multiple units over multiple time periods, which is why it is typically applied to panel data. In this notebook, I implement DiD on several datasets.

### 1. First DiD Analysis

This analysis is based on a dataset of 7 countries.

In [1]:
# library
library(foreign)

library(tidyverse)
data1 <- read.dta("http://dss.princeton.edu/training/Panel101.dta")

# Check out the data
dim(data1)

# check out in a bit detail
glimpse(data1)


"package 'tidyverse' was built under R version 4.1.3"
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.2 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.6     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.3     v forcats 0.5.1

"package 'readr' was built under R version 4.1.3"
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Rows: 70
Columns: 9
$ country <fct> A, A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, B, B, B, B, C,~
$ year    <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 19~
$ y       <dbl> 1342787840, -1899660544, -11234363, 2645775360, 3008334848, 32~
$ y_bin   <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,~
$ x1      <dbl> 0.27790365, 0.32068470, 0.36346573, 0.24614404, 0.42462304, 0.~
$ x2      <dbl> -1.1079559, -0.9487200, -0.7894840, -0.8855330, -0.7297683, -0~
$ x3      <dbl> 0.28255358, 0.49253848, 0.70252335, -0.09439092, 0.94613063, 1~
$ opinion <fct> Str agree, Disag, Disag, Disag, Disag, Str agree, Disag, Str a~
$ op      <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~


In [2]:
head(data1)

| | country | year | y | y_bin | x1 | x2 | x3 | opinion | op |
|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` |
| 1 | A | 1990 | 1342787840 | 1 | 0.2779036 | -1.1079559 | 0.28255358 | Str agree | 1 |
| 2 | A | 1991 | -1899660544 | 0 | 0.3206847 | -0.94872 | 0.49253848 | Disag | 0 |
| 3 | A | 1992 | -11234363 | 0 | 0.3634657 | -0.789484 | 0.70252335 | Disag | 0 |
| 4 | A | 1993 | 2645775360 | 1 | 0.246144 | -0.885533 | -0.09439092 | Disag | 0 |
| 5 | A | 1994 | 3008334848 | 1 | 0.424623 | -0.7297683 | 0.94613063 | Disag | 0 |
| 6 | A | 1995 | 3229574144 | 1 | 0.4772141 | -0.723246 | 1.02968037 | Str agree | 1 |


The data above consist of 7 countries, named A to G, each observed for 10 years from 1990 to 1999. Thus, the dataset has dimensions *70 × 9*.

*Suppose some kind of intervention began in the year 1994, for countries E, F, G. However, the remaining countries A, B, C, and D didn't receive any kind of intervention.*

In [3]:
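attach(data1) # makes columns such as y available by name in the lm() calls below

# Note on naming: D_i is the timing dummy and Post_i is the treatment-group dummy
D_i <- ifelse(data1$year >= 1994, 1, 0) # timing dummy: 1 from 1994 onward

Post_i <- ifelse(data1$country %in% c("E", "F", "G"), 1, 0) # treatment dummy: 1 for E, F, G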
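
A quick way to check that the two dummies are coded as intended is to cross-tabulate them. The sketch below rebuilds only the country/year skeleton of the panel in a hypothetical object `panel` (it is not the lecture data), so that it runs on its own. On the real data, the same `table()` call could be applied to `D_i` and `Post_i` directly.

```r
# Hypothetical skeleton with the same layout as data1: 7 countries x 10 years
panel <- expand.grid(year = 1990:1999, country = LETTERS[1:7],
                     stringsAsFactors = FALSE)

# Same coding as in the notebook: D_i marks the period, Post_i marks the group
panel$D_i    <- ifelse(panel$year >= 1994, 1, 0)
panel$Post_i <- ifelse(panel$country %in% c("E", "F", "G"), 1, 0)

# Rows: timing dummy, columns: treatment-group dummy
cell_counts <- table(D_i = panel$D_i, Post_i = panel$Post_i)
print(cell_counts)
```

All four cells should be non-empty. With this layout they hold 16, 12, 24, and 18 observations (untreated before, treated before, untreated after, treated after).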



#### Relevant DiD Regression Equation

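$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 Post_i + \beta_3 (D_i \times Post_i) + \epsilon_{it}$

Following the variable names used in the code, $D_i$ is the timing dummy (1 from 1994 onward) and $Post_i$ is the treatment dummy (1 for countries E, F, and G).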

$\rightarrow$ For untreated countries before 1994: $Y_{it} = \beta_0 + \epsilon_{it}$ ...(1)

$\rightarrow$ For untreated countries from 1994 onward: $Y_{it} = \beta_0 + \beta_1 + \epsilon_{it}$ ...(2)

$\rightarrow$ For treated countries before 1994: $Y_{it} = \beta_0 + \beta_2 + \epsilon_{it}$ ...(3)

$\rightarrow$ For treated countries from 1994 onward: $Y_{it} = \beta_0 + \beta_1 + \beta_2 + \beta_3 + \epsilon_{it}$ ...(4)

Taking expectations, so that the error terms drop out, the difference-in-differences estimate is the change for the treated group minus the change for the untreated group:

$DiD = [(4) - (3)] - [(2) - (1)]$

$DiD = (\beta_1 + \beta_3) - \beta_1$

$\therefore DiD = \beta_3$
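
The derivation can be checked numerically. The sketch below uses simulated data (the object `sim`, the variable names `after` and `treated`, and all coefficient values are made up for illustration) and compares the difference of the four cell means with the interaction coefficient from the corresponding regression.

```r
set.seed(1)

# Hypothetical panel: 7 units, 10 periods, assumed interaction effect of -2
sim <- expand.grid(year = 1990:1999, country = LETTERS[1:7],
                   stringsAsFactors = FALSE)
sim$after   <- as.numeric(sim$year >= 1994)                   # timing dummy
sim$treated <- as.numeric(sim$country %in% c("E", "F", "G"))  # group dummy
sim$y <- 1 + 2 * sim$after + 1.5 * sim$treated -
  2 * sim$after * sim$treated + rnorm(nrow(sim))

# Four cell means, corresponding to equations (1)-(4)
m <- tapply(sim$y, list(after = sim$after, treated = sim$treated), mean)

# [(4) - (3)] - [(2) - (1)]
did_means <- (m["1", "1"] - m["0", "1"]) - (m["1", "0"] - m["0", "0"])

# Interaction coefficient from the regression with both dummies and their product
did_reg <- coef(lm(y ~ after * treated, data = sim))[["after:treated"]]

print(c(from_means = did_means, from_regression = did_reg))
```

Because the regression contains a separate parameter for each of the four cells, the two numbers agree up to floating-point error.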



$\rightarrow$ The difference-in-differences estimate is thus the coefficient on the interaction term. The following steps walk through the regression. There are two main ways to run it, which are explained below:

##### Procedure 1. Setting up an interaction term and running the regression

In [5]:
# Interaction term between time dummy and treatment dummy:
interaction <- D_i * Post_i

# Regress
r1 <- lm(y ~ D_i + Post_i + interaction)
summary(r1)


Call:
lm(formula = y ~ D_i + Post_i + interaction)

Residuals:
       Min         1Q     Median         3Q        Max 
-9.768e+09 -1.623e+09  1.167e+08  1.393e+09  6.807e+09 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  3.581e+08  7.382e+08   0.485   0.6292  
D_i          2.289e+09  9.530e+08   2.402   0.0191 *
Post_i       1.776e+09  1.128e+09   1.575   0.1200  
interaction -2.520e+09  1.456e+09  -1.731   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.953e+09 on 66 degrees of freedom
Multiple R-squared:  0.08273,	Adjusted R-squared:  0.04104 
F-statistic: 1.984 on 3 and 66 DF,  p-value: 0.1249


The DiD estimate is the coefficient on the interaction term (about -2.520e+09), which is large and negative. It is statistically significant only at the 10% level (p = 0.0882). The negative sign, that is $\beta_3 < 0$, implies that the change in Y from before to after 1994 was smaller for the treated countries (E, F, and G) than for the untreated group.
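
To read off the estimate programmatically rather than from the printed summary, the interaction row can be pulled from the coefficient table together with a confidence interval. The sketch below is a minimal illustration on simulated data (the object `sim` and its coefficient values are assumptions, not the lecture data); a 90% interval is used because it corresponds to a test at the 10% level.

```r
set.seed(42)

# Hypothetical data with the same layout and dummy names as in the notebook
sim <- expand.grid(year = 1990:1999, country = LETTERS[1:7],
                   stringsAsFactors = FALSE)
sim$D_i    <- as.numeric(sim$year >= 1994)                   # timing dummy
sim$Post_i <- as.numeric(sim$country %in% c("E", "F", "G"))  # treatment dummy
sim$y <- 1 + 2 * sim$D_i + 1.5 * sim$Post_i -
  2 * sim$D_i * sim$Post_i + rnorm(nrow(sim), sd = 3)

fit <- lm(y ~ D_i * Post_i, data = sim)

# Row of the coefficient table for the interaction (the DiD estimate)
did_row <- coef(summary(fit))["D_i:Post_i", ]

# 90% confidence interval, matching a two-sided test at the 10% level
did_ci <- confint(fit, parm = "D_i:Post_i", level = 0.90)

print(did_row)
print(did_ci)
```

The interval excludes zero exactly when the p-value in `did_row` is below 0.10.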

##### Procedure 2. Using the `*` operator in the formula

In [6]:
# Regress y on both dummies and their interaction; D_i * Post_i expands to D_i + Post_i + D_i:Post_i
r2 <- lm(y ~ D_i * Post_i)
summary(r2)


Call:
lm(formula = y ~ D_i * Post_i)

Residuals:
       Min         1Q     Median         3Q        Max 
-9.768e+09 -1.623e+09  1.167e+08  1.393e+09  6.807e+09 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  3.581e+08  7.382e+08   0.485   0.6292  
D_i          2.289e+09  9.530e+08   2.402   0.0191 *
Post_i       1.776e+09  1.128e+09   1.575   0.1200  
D_i:Post_i  -2.520e+09  1.456e+09  -1.731   0.0882 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.953e+09 on 66 degrees of freedom
Multiple R-squared:  0.08273,	Adjusted R-squared:  0.04104 
F-statistic: 1.984 on 3 and 66 DF,  p-value: 0.1249


Both procedures yield the same results and therefore lead to the same conclusion.
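
The equivalence of the two procedures can also be confirmed in code instead of by comparing the printed summaries. The following sketch uses simulated data (the object `sim` and its coefficient values are hypothetical) and checks that the manually built interaction term and the `*` formula give identical coefficients.

```r
set.seed(7)

# Hypothetical data with the same dummy coding as in the notebook
sim <- expand.grid(year = 1990:1999, country = LETTERS[1:7],
                   stringsAsFactors = FALSE)
sim$D_i    <- as.numeric(sim$year >= 1994)                   # timing dummy
sim$Post_i <- as.numeric(sim$country %in% c("E", "F", "G"))  # treatment dummy
sim$y <- 1 + 2 * sim$D_i + 1.5 * sim$Post_i -
  2 * sim$D_i * sim$Post_i + rnorm(nrow(sim))

# Procedure 1: interaction term built by hand
sim$interaction <- sim$D_i * sim$Post_i
fit_manual <- lm(y ~ D_i + Post_i + interaction, data = sim)

# Procedure 2: interaction generated by the formula operator
fit_formula <- lm(y ~ D_i * Post_i, data = sim)

# Compare coefficients, ignoring the different term names
same <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(same)
```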

#### Time trend plotting