# Difference in Differences
## Intuition
Difference-in-differences (DiD) follows a simple pattern: it compares the before-and-after change in a group that received a treatment with the before-and-after change in a group that did not.

Suppose we want to estimate the effect of a minimum wage law. We take the change from before to after the law in the treated group, subtract the change over the same period in an untreated group, and test whether the remaining difference is statistically significant. The key assumption in difference-in-differences is parallel trends: without the treatment, the treated group would have followed the same trend as the control group.

What's unique about these approaches isn't the effect that they're trying to identify. What's unique is that they structure the effect in such a way that **we can add additional factors to try to explain away time-variant differences.**

## Explanation

|               | Treatment | Control |
| ------------- |:---------:|:-----:|
| Before (Pre)  | $T_B$     | $C_B$ |
| After (Post)  | $T_A$     | $C_A$ |

We can use the four cells of this table to compute the difference in differences:

$$
(T_A - T_B) - (C_A - C_B)
$$
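
As a quick worked example, the formula can be evaluated directly from four group means. The sketch below uses the rounded means from the cross-tab in the Example section; the function name `did_estimate` is an illustrative choice, not part of any library.

```python
def did_estimate(t_before, t_after, c_before, c_after):
    """Return (T_A - T_B) - (C_A - C_B) from four group means."""
    treated_change = t_after - t_before   # T_A - T_B
    control_change = c_after - c_before   # C_A - C_B
    return treated_change - control_change


# Rounded group means taken from the cross-tab in the Example section.
effect = did_estimate(t_before=19.875, t_after=20.497,
                      c_before=20.301, c_after=18.343)
print(round(effect, 3))  # 0.622 - (-1.958) = 2.58
```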

Or, we can represent it through a regression:

### Regression form
$$
y = \beta_0 + \beta_1 TRT + \beta_2 POST + \beta_3 TRT \cdot POST + \epsilon
$$

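Here $TRT$ is 1 for the treatment group and 0 for the control group, $POST$ is 1 for the after period and 0 for the before period, and $\beta_3$ is the difference-in-differences estimate. The advantage of a regression is that we can add additional covariates.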
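
To illustrate how a covariate enters the regression, here is a minimal sketch on simulated data. The covariate `x`, the coefficient values, and the noise level are all made up for the illustration, and the fit uses NumPy's least-squares solver rather than any particular econometrics package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Group and period indicators (hypothetical data, not the study's).
trt = rng.integers(0, 2, size=n)
post = rng.integers(0, 2, size=n)
x = rng.normal(size=n)            # hypothetical extra covariate

true_effect = 2.0                 # assumed treatment effect (beta_3)
y = (20 + 1.0 * trt - 1.5 * post + true_effect * trt * post
     + 3.0 * x + rng.normal(size=n))

# Design matrix: const, TRT, POST, TRT*POST, plus the covariate.
X = np.column_stack([np.ones(n), trt, post, trt * post, x])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["const", "trt", "post", "trt*post", "x"]
print(dict(zip(names, betas.round(2))))
```

The coefficient on `trt*post` should land close to the assumed effect of 2.0. The covariate simply becomes one more column of the design matrix, and the interpretation of the interaction term is unchanged.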

## Example

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

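Let's take a simple example where we recreate the New Jersey minimum wage study with simulated data, using New Jersey (NJ) as the treatment group and Pennsylvania (PA) as the control group.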

In [2]:
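def gen_employees(n=1000, min_employees=4, std=2, effect=2):
    """Simulate employee counts for n restaurants before and after the policy change."""

    # generate n restaurants which have minimum wage employees (mean 20, sd 10)
    X = np.random.randn(n) * 10 + 20
    X = X.round()
    # floor the head count at min_employees
    X[X < min_employees] = min_employees

    # create a pre and post group: post adds rounded noise plus the average effect
    pre = X
    post = X + (np.random.randn(n) * std).round() + effect
    return pre, post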



# create pre/post data for both states
np.random.seed(123)
nj_pre, nj_post = gen_employees(std=2, effect=0.6)
pa_pre, pa_post = gen_employees(std=3, effect=-2.16)

# create a dataframe with all results
df = pd.concat([pd.DataFrame(nj_pre,  columns=["val"]).assign(post=0, trt=1),
                pd.DataFrame(nj_post, columns=["val"]).assign(post=1, trt=1),
                pd.DataFrame(pa_pre,  columns=["val"]).assign(post=0, trt=0),
                pd.DataFrame(pa_post, columns=["val"]).assign(post=1, trt=0)])

df['trt*post'] = df.post * df.trt
df['const'] = 1

Now let's look at a simple cross-tab of the group means:

In [3]:
# mean of val for each (trt, post) cell, with post spread across the columns
cross_tab = df.groupby(['trt', 'post'])[['val']].mean().unstack()
cross_tab['diff'] = cross_tab[('val', 1)] - cross_tab[('val', 0)]
cross_tab.columns = ['Before', 'After', 'Difference']
cross_tab.index = ['PA', 'NJ']

print("=" * 78)
print("Cross-tab of average value by Treatment and Post\n")
print(cross_tab.T)
print("=" * 78)

Cross-tab of average value by Treatment and Post

                PA      NJ
Before      20.301  19.875
After       18.343  20.497
Difference  -1.958   0.622


Now, let's run a difference-in-differences regression to show that it reproduces the difference in differences from the cross-tab.

In [5]:
Xvars = ['const', 'trt', 'post', 'trt*post']
X = df[Xvars].to_numpy()
y  = df.val.to_numpy()

# Total sum of squares
ybar = np.mean(y)
ydist = y - ybar

# Estimates
betas = np.linalg.inv(X.T @ X) @ X.T @ y
yhat = X @ betas

# Errors
residuals = y - yhat
tss = ydist.T @ ydist
rss = residuals.T @ residuals
se_beta = (rss / (X.shape[0] - X.shape[1]) * np.linalg.inv(X.T @ X).diagonal()) ** .5
rsquared = 1 - rss / tss

# Results
summary_beta = pd.DataFrame([betas, se_beta],
                            columns=['const', 'trt', 'post', 'trt*post'],
                            index=['coef', 'std err']).T

summary_beta['t'] = summary_beta['coef'] / summary_beta['std err']

Let's look at the results

In [6]:
# Summary
print("="*78)
print("R2:\t", round(rsquared, 4))
print("="*78)
print()
print("="*78)
print("Beta Estimates:")
print(summary_beta)
print("="*78)

R2:	 0.0077

Beta Estimates:
            coef   std err          t
const     20.301  0.304098  66.758113
trt       -0.426  0.430059  -0.990561
post      -1.958  0.430059  -4.552860
trt*post   2.580  0.608196   4.242055


Now, let's interpret what these variables mean.

1. const: Control group mean before the treatment ($C_B$)
2. trt: Baseline difference between the treatment and control groups before the treatment ($T_B - C_B$)
3. post: Change in the control group from before to after ($C_A - C_B$)
4. trt\*post: Effect of the treatment, $(T_A - T_B) - (C_A - C_B)$
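
Because the model has one parameter for each of the four cells, every coefficient is an exact combination of cell means. The sketch below checks this on a small simulated data set (the cell means and noise level are arbitrary assumptions) by fitting the same four-column regression with NumPy and comparing each coefficient with the matching combination of sample means.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cell = 50  # observations per (trt, post) cell; arbitrary

# One block of rows per cell, with arbitrary assumed cell means.
trt = np.repeat([0, 0, 1, 1], n_cell)
post = np.repeat([0, 1, 0, 1], n_cell)
y = np.repeat([20.0, 18.0, 19.5, 20.5], n_cell) + rng.normal(size=4 * n_cell)

X = np.column_stack([np.ones_like(y), trt, post, trt * post])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample means of the four cells.
c_b = y[(trt == 0) & (post == 0)].mean()
c_a = y[(trt == 0) & (post == 1)].mean()
t_b = y[(trt == 1) & (post == 0)].mean()
t_a = y[(trt == 1) & (post == 1)].mean()

# const, trt, post, trt*post expressed through the cell means.
expected = np.array([c_b, t_b - c_b, c_a - c_b, (t_a - t_b) - (c_a - c_b)])
print(np.allclose(betas, expected))  # True: the model is saturated
```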

What's **useful** is that we have a $t$-test on the difference between the two groups. This means we can actually make statements about the significance of the difference.
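
One way to turn the $t$-statistic into a significance statement is a two-sided p-value. The sketch below plugs in the `trt*post` $t$-value from the regression output above and uses the normal approximation to the $t$-distribution. That approximation is an assumption, though it seems reasonable here because the regression has 4,000 observations and only four parameters.

```python
import math

t_stat = 4.242055  # t-value for trt*post from the regression output

# Two-sided p-value under the normal approximation: P(|Z| > |t|).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"p = {p_value:.1e}")
print("significant at the 1% level:", p_value < 0.01)
```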

What's **interesting** is that the $R^2$ has little relevance in this regression. In fact, it's so low that one might think our regression doesn't have any validity. But let's think about what we're actually trying to understand: how the average outcome differs once we hold time and group constant. So essentially, we're seeing how much of the variation in $y$ a handful of ones and zeroes can explain. It *should* be very low.

What's **important** to remember is that our $y$ variable is randomly generated, with most of its variation coming from restaurant-to-restaurant noise, and our $X$ matrix is simply four columns of ones and zeroes. In this example, $R^2$ isn't instructive at all.
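
A small simulation makes the point concrete. The numbers below (cell means loosely echoing the cross-tab, a noise standard deviation of 10, and 4,000 observations per cell) are assumptions chosen for illustration, not a rerun of the example above. The sketch computes $R^2$ and the $t$-statistic on the interaction term with the same classical OLS formulas used earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cell = 4000  # observations per cell; assumed

trt = np.repeat([0, 0, 1, 1], n_cell)
post = np.repeat([0, 1, 0, 1], n_cell)
# Assumed cell means plus large unit-level noise (sd = 10).
cell_means = np.repeat([20.3, 18.3, 19.9, 20.5], n_cell)
y = cell_means + rng.normal(scale=10, size=4 * n_cell)

X = np.column_stack([np.ones_like(y), trt, post, trt * post])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ betas
rss = residuals @ residuals
tss = ((y - y.mean()) ** 2).sum()
r2 = 1 - rss / tss

# Standard error of each coefficient (classical OLS formula).
sigma2 = rss / (X.shape[0] - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t_interaction = betas[3] / se[3]

print(f"R2 = {r2:.4f}, t(trt*post) = {t_interaction:.2f}")
```

Under these assumptions the group indicators explain only a sliver of the total variance, yet the interaction term is estimated precisely enough to be clearly distinguishable from zero.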