# Basic Difference-in-Differences with diff-diff

This notebook demonstrates how to use the `diff-diff` library for basic 2x2 Difference-in-Differences (DiD) analysis. We'll cover:

1. Generating sample data
2. Setting up a basic DiD estimation and interpreting the results
3. Using both column-name and formula interfaces
4. Adding covariates
5. Using fixed effects
6. Two-Way Fixed Effects (TWFE)
7. Cluster-robust and wild bootstrap inference
8. Exporting results

In [None]:
import numpy as np
import pandas as pd
from diff_diff import DifferenceInDifferences, TwoWayFixedEffects
from diff_diff.prep import generate_did_data

## 1. Generate Sample Data

The `generate_did_data` function creates synthetic panel data with a known treatment effect, which is useful for learning and testing.

In [None]:
# Generate synthetic DiD data with known ATT of 5.0
data = generate_did_data(
    n_units=100,
    n_periods=2,
    treatment_effect=5.0,
    treatment_fraction=0.5,
    treatment_period=1,  # Period 1 is post-treatment (periods are 0 and 1)
    noise_sd=1.0,
    seed=42
)

print(f"Dataset shape: {data.shape}")
data.head(10)

In [None]:
# Examine the data structure
print("Treatment and time distribution:")
print(data.groupby(['treated', 'post']).size().unstack(fill_value=0))
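
To make the structure of such data concrete, here is a minimal sketch of a generator with the same parameter names as the call above. It is a hypothetical stand-in written with NumPy only, not the library's implementation: the function name `simulate_did_panel`, the baseline level of 10, the time trend of 0.5 and the unit-effect distribution are all assumptions made for illustration.

```python
import numpy as np

def simulate_did_panel(n_units=100, n_periods=2, treatment_effect=5.0,
                       treatment_fraction=0.5, treatment_period=1,
                       noise_sd=1.0, seed=42):
    """Hypothetical DiD data generator (illustrative, not diff-diff's code)."""
    rng = np.random.default_rng(seed)
    n_treated = int(round(n_units * treatment_fraction))

    # Long format: one row per unit-period pair
    unit = np.repeat(np.arange(n_units), n_periods)
    period = np.tile(np.arange(n_periods), n_units)
    treated = (unit < n_treated).astype(int)          # first units are treated
    post = (period >= treatment_period).astype(int)   # post-treatment indicator

    # Assumed outcome model: level + unit effect + common trend + effect + noise
    unit_effect = rng.normal(0.0, 1.0, n_units)[unit]
    outcome = (10.0 + unit_effect + 0.5 * period
               + treatment_effect * treated * post
               + rng.normal(0.0, noise_sd, unit.size))
    return {"unit": unit, "period": period, "treated": treated,
            "post": post, "outcome": outcome}

panel = simulate_did_panel()
print("rows:", len(panel["outcome"]))
# Cell counts in the order (control, pre), (control, post), (treated, pre), (treated, post)
print("cell counts:", np.bincount(2 * panel["treated"] + panel["post"]))
```

The treatment effect enters only through the `treated * post` product, which is what lets a DiD estimator recover it.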

## 2. Basic DiD Estimation

The `DifferenceInDifferences` estimator provides an sklearn-like interface with a `fit()` method.

In [None]:
# Create the estimator
did = DifferenceInDifferences()

# Fit using column names
results = did.fit(
    data,
    outcome="outcome",
    treatment="treated",
    time="post"
)

# Print the summary
print(results.summary())
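
In the 2x2 case without covariates, the DiD estimate is a double difference of four cell means, so it can be cross-checked by hand. The sketch below does this with NumPy on freshly simulated data; the data-generating values are assumptions chosen for illustration, and it does not call `diff-diff`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced 2x2 design: 100 observations in each (treated, post) cell
treated = np.repeat([0, 1], 200)
post = np.tile([0, 1], 200)
true_att = 5.0
y = (10.0 + 2.0 * treated + 1.5 * post
     + true_att * treated * post
     + rng.normal(0.0, 1.0, treated.size))

def cell_mean(t, p):
    """Mean outcome in the cell with treated == t and post == p."""
    return y[(treated == t) & (post == p)].mean()

change_treated = cell_mean(1, 1) - cell_mean(1, 0)   # treated: post minus pre
change_control = cell_mean(0, 1) - cell_mean(0, 0)   # control: post minus pre
did = change_treated - change_control                # difference of the differences

print(f"Change in treated group: {change_treated:.4f}")
print(f"Change in control group: {change_control:.4f}")
print(f"DiD estimate:            {did:.4f}")
```

The group-level difference (2.0) and the common time trend (1.5) both cancel in the double difference, leaving an estimate of the treatment effect.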

### Understanding the Results

The key results are:
- **ATT (Average Treatment Effect on the Treated)**: The estimated causal effect of the treatment
- **SE**: Standard error of the estimate
- **t-stat**: T-statistic for testing H0: ATT = 0
- **p-value**: Two-sided p-value
- **95% CI**: Confidence interval for the ATT
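
To show how these quantities relate to one another, the following sketch fits the interaction regression with plain NumPy and derives each number from the previous one. It assumes classical (homoskedastic) standard errors and a normal approximation for the p-value and interval, which may differ from what `diff-diff` reports by default.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
treated = np.repeat([0, 1], 200)
post = np.tile([0, 1], 200)
y = (10.0 + 2.0 * treated + 1.5 * post
     + 5.0 * treated * post + rng.normal(0.0, 1.0, 400))

# Design matrix: intercept, treated, post, treated x post
X = np.column_stack([np.ones(400), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, k = X.shape
sigma2 = resid @ resid / (n - k)            # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)       # classical OLS covariance

att = beta[3]                               # coefficient on the interaction
se = math.sqrt(cov[3, 3])
t_stat = att / se                           # tests H0: ATT = 0
p_value = math.erfc(abs(t_stat) / math.sqrt(2))   # two-sided, normal approximation
ci = (att - 1.96 * se, att + 1.96 * se)     # approximate 95% interval

print(f"ATT: {att:.4f}  SE: {se:.4f}  t: {t_stat:.2f}")
print(f"p-value: {p_value:.4g}  95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```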

In [None]:
# Access individual components
print(f"Estimated ATT: {results.att:.4f}")
print("True ATT: 5.0")
print(f"Standard Error: {results.se:.4f}")
print(f"95% CI: [{results.conf_int[0]:.4f}, {results.conf_int[1]:.4f}]")
print(f"P-value: {results.p_value:.4f}")
print(f"Is significant at 5% level: {results.is_significant}")
print(f"Significance stars: {results.significance_stars}")

## 3. Using the Formula Interface

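For those familiar with R, `diff-diff` supports a formula interface similar to R's notation. In R, `treated * post` is shorthand for both main effects plus their interaction (`treated + post + treated:post`); the interaction term carries the DiD estimate.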

In [None]:
# Using formula interface (R-style)
did_formula = DifferenceInDifferences()
results_formula = did_formula.fit(
    data,
    formula="outcome ~ treated * post"
)

print(results_formula.summary())

In [None]:
# Verify both methods give the same result
print(f"Column-name ATT: {results.att:.6f}")
print(f"Formula ATT: {results_formula.att:.6f}")
print(f"Difference: {abs(results.att - results_formula.att):.2e}")
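
The R convention behind the `*` operator can be illustrated with a toy expander. This is a deliberately minimal sketch in pure Python; the function `expand_formula` is hypothetical and says nothing about how `diff-diff` parses formulas internally.

```python
def expand_formula(formula):
    """Toy expander for R-style formulas such as 'y ~ a * b + c'.

    Handles only '+' and two-way '*' terms; illustrative, not a full parser.
    """
    lhs, rhs = (side.strip() for side in formula.split("~"))
    terms = []
    for chunk in rhs.split("+"):
        factors = [f.strip() for f in chunk.split("*")]
        if len(factors) == 1:
            new_terms = factors
        elif len(factors) == 2:
            a, b = factors
            new_terms = [a, b, f"{a}:{b}"]   # a * b  ->  a + b + a:b
        else:
            raise ValueError("only two-way products are handled in this sketch")
        for term in new_terms:
            if term not in terms:            # keep each term once, in order
                terms.append(term)
    return lhs, terms

print(expand_formula("outcome ~ treated * post"))
# ('outcome', ['treated', 'post', 'treated:post'])
```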

## 4. Adding Covariates

You can include additional control variables to improve precision and reduce bias from observed confounders.

In [None]:
# Add two synthetic covariates (pure noise here, unrelated to the outcome)
np.random.seed(42)
data['size'] = np.random.normal(100, 20, len(data))
data['age'] = np.random.normal(10, 3, len(data))

# Fit with covariates
did_cov = DifferenceInDifferences()
results_cov = did_cov.fit(
    data,
    outcome="outcome",
    treatment="treated",
    time="post",
    covariates=["size", "age"]
)

print(results_cov.summary())

In [None]:
# All coefficient estimates are available
print("All coefficients:")
for name, coef in results_cov.coefficients.items():
    print(f"  {name}: {coef:.4f}")
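
The covariates in the cell above are drawn independently of the outcome, so they should not be expected to change the estimate much. The sketch below illustrates the precision gain when a covariate does predict the outcome. It uses plain NumPy with classical standard errors, and the data-generating process (including the 0.1 coefficient on `size`) is an assumption made for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    cov = (resid @ resid / (n - k)) * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
treated = np.repeat([0, 1], 200)
post = np.tile([0, 1], 200)
size = rng.normal(100.0, 20.0, 400)          # covariate that drives the outcome
y = (10.0 + 0.1 * size + 2.0 * treated + 1.5 * post
     + 5.0 * treated * post + rng.normal(0.0, 1.0, 400))

base = np.column_stack([np.ones(400), treated, post, treated * post])
with_cov = np.column_stack([base, size])

beta_base, se_base = ols(base, y)
beta_cov, se_cov = ols(with_cov, y)

# Index 3 is the treated x post interaction in both design matrices
print(f"Without covariate: ATT = {beta_base[3]:.4f}, SE = {se_base[3]:.4f}")
print(f"With covariate:    ATT = {beta_cov[3]:.4f}, SE = {se_cov[3]:.4f}")
```

Because `size` is independent of treatment status, both specifications target the same ATT; the covariate only absorbs residual variance and shrinks the standard error.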

## 5. Fixed Effects

Fixed effects control for time-invariant unobserved heterogeneity. `diff-diff` supports two approaches:

1. **Dummy variables** (`fixed_effects`): Creates indicator variables for each level
2. **Within-transformation** (`absorb`): Demeans data by group (more efficient for high-dimensional FE)
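
Both approaches give the same coefficient on the regressor of interest (the Frisch-Waugh-Lovell result), which the following NumPy sketch checks numerically. The simulated data and the generic regressor `x` are assumptions for illustration; the code does not use `diff-diff`.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 50, 4
unit = np.repeat(np.arange(n_units), n_periods)

# Regressor correlated with the unit effect, so ignoring the FE would bias OLS
alpha = rng.normal(0.0, 2.0, n_units)
x = 0.5 * alpha[unit] + rng.normal(0.0, 1.0, unit.size)
y = alpha[unit] + 1.5 * x + rng.normal(0.0, 0.5, unit.size)

# Approach 1: one indicator column per unit (no separate intercept)
dummies = (unit[:, None] == np.arange(n_units)).astype(float)
X_dummy = np.column_stack([x, dummies])
beta_dummy = np.linalg.lstsq(X_dummy, y, rcond=None)[0][0]

# Approach 2: subtract unit means from x and y, then regress
def demean_by(values, groups):
    """Subtract the group mean from each value."""
    means = np.bincount(groups, weights=values) / np.bincount(groups)
    return values - means[groups]

x_w, y_w = demean_by(x, unit), demean_by(y, unit)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(f"Dummy-variable estimate: {beta_dummy:.6f}")
print(f"Within estimate:         {beta_within:.6f}")
```

The dummy approach builds a design matrix with one column per unit, while the within approach never materializes those columns, which is why it scales better with many fixed-effect levels.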

In [None]:
# Generate data with more structure
np.random.seed(42)
n_units = 50
n_periods = 4

panel_data = []
for unit in range(n_units):
    is_treated = unit < n_units // 2
    state = unit % 5  # 5 states
    unit_effect = np.random.normal(0, 2)
    
    for period in range(n_periods):
        post = 1 if period >= 2 else 0
        y = 10.0 + unit_effect + period * 0.5 + state * 1.5
        if is_treated and post:
            y += 4.0  # True ATT = 4.0
        y += np.random.normal(0, 0.5)
        
        panel_data.append({
            'unit': unit,
            'state': f'state_{state}',
            'period': period,
            'treated': int(is_treated),
            'post': post,
            'outcome': y
        })

panel_df = pd.DataFrame(panel_data)
print(f"Panel data: {panel_df.shape[0]} observations")
panel_df.head()

In [None]:
# Using fixed effects with dummy variables
did_fe = DifferenceInDifferences()
results_fe = did_fe.fit(
    panel_df,
    outcome="outcome",
    treatment="treated",
    time="post",
    fixed_effects=["state"]
)

print(results_fe.summary())

In [None]:
# Using absorbed fixed effects (within-transformation)
# This is more efficient for high-dimensional fixed effects
did_absorb = DifferenceInDifferences()
results_absorb = did_absorb.fit(
    panel_df,
    outcome="outcome",
    treatment="treated",
    time="post",
    absorb=["unit"]  # Absorb unit fixed effects
)

print(results_absorb.summary())

## 6. Two-Way Fixed Effects (TWFE)

For panel data, the `TwoWayFixedEffects` estimator automatically includes both unit and time fixed effects using within-transformation.

In [None]:
# Two-Way Fixed Effects estimator
twfe = TwoWayFixedEffects()
results_twfe = twfe.fit(
    panel_df,
    outcome="outcome",
    treatment="treated",
    time="period",  # Use actual time periods
    unit="unit"
)

print(results_twfe.summary())
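
For a balanced panel, the two-way within-transformation can be written as a single double-demeaning step. The sketch below applies it with NumPy and checks that, with one common treatment date, the TWFE coefficient equals the double difference of cell means. The simulated panel mirrors the one above but is generated independently, and the one-step formula is exact only for balanced panels.

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods = 50, 4
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated = (unit < n_units // 2).astype(float)
post = (period >= 2).astype(float)
d = treated * post                                   # treatment indicator
y = (10.0 + rng.normal(0.0, 2.0, n_units)[unit] + 0.5 * period
     + 4.0 * d + rng.normal(0.0, 0.5, unit.size))

def group_mean(values, groups):
    """Broadcast each group's mean back to the rows of that group."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]

def double_demean(v):
    """Remove unit and period means (valid for balanced panels)."""
    return v - group_mean(v, unit) - group_mean(v, period) + v.mean()

y_dd, d_dd = double_demean(y), double_demean(d)
beta_twfe = (d_dd @ y_dd) / (d_dd @ d_dd)

def cell(t, p):
    return y[(treated == t) & (post == p)].mean()

did_means = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

print(f"TWFE estimate:        {beta_twfe:.6f}")
print(f"Double difference:    {did_means:.6f}")
```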

## 7. Robust Inference

### Cluster-Robust Standard Errors

When observations are correlated within clusters (e.g., units over time), use cluster-robust standard errors.

In [None]:
# Create clustered data
np.random.seed(42)
n_clusters = 20
obs_per_cluster = 10

clustered_data = []
for cluster in range(n_clusters):
    is_treated = cluster < n_clusters // 2
    cluster_effect = np.random.normal(0, 2)
    
    for obs in range(obs_per_cluster):
        for period in [0, 1]:
            y = 10.0 + cluster_effect
            if period == 1:
                y += 3.0
            if is_treated and period == 1:
                y += 2.5  # True ATT = 2.5
            y += np.random.normal(0, 0.5)
            
            clustered_data.append({
                'cluster': cluster,
                'obs': obs,
                'period': period,
                'treated': int(is_treated),
                'post': period,
                'outcome': y
            })

clustered_df = pd.DataFrame(clustered_data)
print(f"Clustered data: {clustered_df.shape[0]} observations in {n_clusters} clusters")

In [None]:
# Compare standard errors: robust vs cluster-robust
did_robust = DifferenceInDifferences(robust=True)
did_cluster = DifferenceInDifferences(cluster="cluster")

results_robust = did_robust.fit(
    clustered_df,
    outcome="outcome",
    treatment="treated",
    time="post"
)

results_cluster = did_cluster.fit(
    clustered_df,
    outcome="outcome",
    treatment="treated",
    time="post"
)

print(f"ATT (both methods): {results_robust.att:.4f}")
print(f"Robust SE (HC1): {results_robust.se:.4f}")
print(f"Cluster-robust SE: {results_cluster.se:.4f}")
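# Note: the ratio may be below 1 here, because the simulated cluster effects are
# constant across periods and largely difference out of the DiD estimate.
print(f"\nRatio of cluster-robust SE to HC1 SE: {results_cluster.se / results_robust.se:.2f}")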
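
Both estimators are sandwich formulas that differ only in how the residual scores are summed. The sketch below computes HC1 and a cluster-robust (CR1-style) variance by hand with NumPy. The small-sample corrections used here are common choices but an assumption about what `diff-diff` applies, and the data use cluster-by-period shocks (unlike the cell above) so that clustering matters for the DiD coefficient.

```python
import numpy as np

def sandwich_se(X, resid, clusters=None):
    """HC1 standard errors, or cluster-robust ones if clusters are given."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    scores = X * resid[:, None]                      # per-observation scores
    if clusters is None:
        meat = scores.T @ scores * n / (n - k)       # HC1 correction
    else:
        groups, idx = np.unique(clusters, return_inverse=True)
        g = len(groups)
        summed = np.zeros((g, k))
        np.add.at(summed, idx, scores)               # sum scores within cluster
        meat = summed.T @ summed * (g / (g - 1)) * ((n - 1) / (n - k))
    return np.sqrt(np.diag(bread @ meat @ bread))

rng = np.random.default_rng(5)
n_clusters, obs_per_cell = 20, 10
cluster = np.repeat(np.arange(n_clusters), 2 * obs_per_cell)
post = np.tile(np.repeat([0, 1], obs_per_cell), n_clusters)
treated = (cluster < n_clusters // 2).astype(int)

# Shocks shared by all observations in the same cluster and period
shock = rng.normal(0.0, 1.0, (n_clusters, 2))[cluster, post]
y = (10.0 + 3.0 * post + 2.5 * treated * post
     + shock + rng.normal(0.0, 0.5, cluster.size))

X = np.column_stack([np.ones(cluster.size), treated, post, treated * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

se_hc1 = sandwich_se(X, resid)[3]
se_cluster = sandwich_se(X, resid, cluster)[3]
print(f"HC1 SE:            {se_hc1:.4f}")
print(f"Cluster-robust SE: {se_cluster:.4f}")
```

Whether clustering raises or lowers the standard error depends on the within-cluster correlation structure: shocks that differ by period inflate it, while cluster effects that are constant over time cancel in the pre/post difference.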

### Wild Cluster Bootstrap

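With few clusters (fewer than about 50), analytical cluster-robust standard errors can be unreliable. The wild cluster bootstrap is a common alternative for inference in that setting.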

In [None]:
# Wild cluster bootstrap inference
did_bootstrap = DifferenceInDifferences(
    cluster="cluster",
    inference="wild_bootstrap",
    n_bootstrap=999,
    bootstrap_weights="rademacher",
    seed=42
)

results_bootstrap = did_bootstrap.fit(
    clustered_df,
    outcome="outcome",
    treatment="treated",
    time="post"
)

print(results_bootstrap.summary())
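
The procedure can be sketched in a few steps: impose the null hypothesis, flip the sign of each cluster's residuals with Rademacher weights, refit, and compare the observed t-statistic with the bootstrap distribution. The NumPy code below is a minimal version of the restricted wild cluster bootstrap-t under those assumptions; the helper names are hypothetical and the details may differ from the `diff-diff` implementation.

```python
import numpy as np

def fit_ols(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

def cluster_se(X, resid, idx, n_groups):
    """Cluster-robust standard errors with a CR1-style correction."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    summed = np.zeros((n_groups, k))
    np.add.at(summed, idx, X * resid[:, None])
    adj = (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))
    return np.sqrt(np.diag(bread @ (adj * summed.T @ summed) @ bread))

def wild_cluster_bootstrap_p(X, y, cluster, coef, n_boot=999, seed=42):
    """Two-sided bootstrap p-value for H0: beta[coef] = 0."""
    rng = np.random.default_rng(seed)
    groups, idx = np.unique(cluster, return_inverse=True)
    g = len(groups)

    beta, resid = fit_ols(X, y)
    t_obs = beta[coef] / cluster_se(X, resid, idx, g)[coef]

    # Restricted model: drop the tested column, which imposes the null
    X_r = np.delete(X, coef, axis=1)
    beta_r, resid_r = fit_ols(X_r, y)
    fitted_r = X_r @ beta_r

    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=g)          # one Rademacher draw per cluster
        y_star = fitted_r + w[idx] * resid_r
        b_star, r_star = fit_ols(X, y_star)
        t_boot[b] = b_star[coef] / cluster_se(X, r_star, idx, g)[coef]
    return t_obs, float(np.mean(np.abs(t_boot) >= abs(t_obs)))

rng = np.random.default_rng(6)
n_clusters, obs_per_cell = 20, 10
cluster = np.repeat(np.arange(n_clusters), 2 * obs_per_cell)
post = np.tile(np.repeat([0, 1], obs_per_cell), n_clusters)
treated = (cluster < n_clusters // 2).astype(int)
y = (10.0 + rng.normal(0.0, 2.0, n_clusters)[cluster] + 3.0 * post
     + 2.5 * treated * post + rng.normal(0.0, 0.5, cluster.size))
X = np.column_stack([np.ones(cluster.size), treated, post, treated * post])

t_obs, p_boot = wild_cluster_bootstrap_p(X, y, cluster, coef=3)
print(f"Observed t-statistic: {t_obs:.2f}")
print(f"Bootstrap p-value:    {p_boot:.4f}")
```

Drawing one weight per cluster, rather than per observation, preserves the within-cluster dependence of the residuals in every bootstrap sample.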

In [None]:
# Compare inference methods
# The label column is 28 wide so the longest label (27 characters) stays aligned
print("Comparison of inference methods:")
print(f"{'Method':<28} {'SE':>10} {'p-value':>10} {'95% CI':>22}")
print("-" * 73)
for label, res in [("Cluster-robust (analytical)", results_cluster),
                   ("Wild cluster bootstrap", results_bootstrap)]:
    ci = f"[{res.conf_int[0]:.4f}, {res.conf_int[1]:.4f}]"
    print(f"{label:<28} {res.se:>10.4f} {res.p_value:>10.4f} {ci:>22}")

## 8. Exporting Results

Results can be exported to various formats for reporting.

In [None]:
# Export to dictionary
result_dict = results.to_dict()
print("As dictionary:")
for key, value in result_dict.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Export to DataFrame (useful for combining multiple estimates)
result_df = results.to_dataframe()
print("\nAs DataFrame:")
result_df

## Summary

In this notebook, we covered:

- **Basic DiD estimation** with both column-name and formula interfaces
- **Adding covariates** to control for observed confounders
- **Fixed effects** using dummy variables or within-transformation
- **Two-Way Fixed Effects** for panel data
- **Cluster-robust standard errors** for correlated observations
- **Wild cluster bootstrap** for inference with few clusters
- **Exporting results** to dictionaries and DataFrames

For more advanced topics, see the other example notebooks:
- `02_staggered_did.ipynb` - Staggered adoption with Callaway-Sant'Anna
- `03_synthetic_did.ipynb` - Synthetic Difference-in-Differences
- `04_parallel_trends.ipynb` - Testing and diagnostics