# Implementing Regression Discontinuity Design in Python
### by [Jason DeBacker](http://jasondebacker.com), October 2021

This notebook provides an example of an regression discontinuity design in Python.

## Example: Lee (*Journal of Econometrics*, 2008)

[Lee (2008)](https://www.princeton.edu/~davidlee/wp/RDrand.pdf) seeks to answer an empirical question - how big is the incumbency advantage in the U.S. House? - and provide a methodological contribution - can we use regression discontinuity design to identify causal effects if the running variable is endogenous?

Economists and political scientists have studied the incumbency advantage for decades.  The main empirical challenge here is obvious: incumbents are from a selected sample - they were elected to office in the first place!  So simply looking at the rate at which incumbents win elections doesn't tell you anything about the causal impact of incumbency on elections.  Rather, it could just be evidence that good candidates do better than poorer candidates.  There have been a number of interesting approaches to tackle this problem, including [Levitt (*Journal of Political Economy*, 1994)](http://pricetheory.uchicago.edu/levitt/Papers/LevittUsingRepeatChallengers1994.pdf) who uses elections where the challenging candidate is the same.

Lee's important idea: let's look at close elections.  These could have gone either way, so the candidates who were running were of similar quality, but one just one.  And that narrow victor of the election at time $t$ now has the advantage of incumbency at $t+1$.  So we should be able to identify the incumbency advantage by looking at how candidates who narrowly won the election at time $t$ perform at time $t+1$.  More specifically, Lee (2008) looks at the incumbency advantage accruing to the party (and not to a specific candidate).  But the idea is the same (and in fact, since most incumbents run again, the effects on the party and the individual and quantitatively similar as well).

We'll replicate Lee's analysis here.  To begin, I've downloaded the Election Table data from the the ICPSR's [Database of United States Congressional Historical Statistics](http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/3371?q=congressional+election) as a Stata datafile, renamed `Election_Data.dta` and placed it in `./data/`.

With these data we won't have all the elections considered by Lee (2008) and so won't be able to replicate his analysis exactly, but these are close enough.

We'll need to do a bit of cleaning to these data.  Let's load the libraries we neeed and do the cleaning first.

### Cleaning data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels.formula.api as smf
from rdd import rdd

In [2]:
# read in data from ./data/Election_Data.dta
df = pd.read_stata("./data/Election_Data.dta", convert_categoricals=False)
# make all variable names lower case
df.columns = df.columns.str.lower()

# clean election data
# keep only years 1946+ and House general elections
elec_data = df[(df.year >= 1946) & (df.electype == "G") & (df.office == 3)]
# keep only columns will use
elec_data = elec_data[["cand_id","year","congress", "state",
                       "district", "candpct", "party", "outcome",
                       "incumb"]]

### Setup data for analysis

Create dataset on candidates:
1. Number of election wins prior to year t
2. Number of elections campaigned in prior to year t

In [3]:
# sort by year and candidate
elec_data.sort_values(by=['year', 'cand_id'], inplace=True)
# create indicator if ran in election
elec_data["ran_for_office"] = 1

# create electoral experience by summing previous times ran for office
elec_data['elec_experience'] = elec_data.groupby(['cand_id'])['ran_for_office'].cumsum() - 1  # minus 1 so don't count current election
# create political experience by summing previous times won office
elec_data.sort_values(by=['year', 'cand_id'], inplace=True)
elec_data['pol_experience'] = elec_data.groupby(['cand_id'])['outcome'].cumsum()
elec_data['pol_experience'] = elec_data['pol_experience'] - elec_data['outcome']  # so don't count current election outcome

rd_data = elec_data.copy()

### More data prep

Here, we keep just the top vote getting candidates, reshape the data for analysis, and create lags of relevant variables.

In [4]:
# rank by percent of vote share (with in each race)
rd_data['ranking'] = elec_data.groupby(
    ['congress', 'state', 'district'])['candpct'].rank(method='first', ascending=False)

# keep only top 2 per race
rd_data = rd_data[rd_data.ranking < 3]

# drop cand_id
rd_data.drop(columns='cand_id', inplace=True)

# reshape data to wide - so one row per race
rd_data.set_index(['congress', 'year', 'state', 'district', 'ranking'], inplace=True)
rd_wide = rd_data.unstack(level=-1).rename_axis((None,None), axis=1)
#reset MultiIndex in columns with list comprehension
rd_wide.columns = [col[0] + '_' + str(int(col[1])) for col in rd_wide.columns]
rd_wide['dem_win'] = ((rd_wide['party_1'] == 100) & (rd_wide['outcome_1'] == 1)).astype('int')
rd_wide['dem_share'] = (
    (rd_wide['party_1'] == 100) * rd_wide['candpct_1'] +
    (rd_wide['party_2'] == 100) * rd_wide['candpct_2'])

# Create variable for Democratic margin of victory.  This will be running variable in RDD
# set to missing if dem if isn't one of two parties in top 2
rd_wide['margin'] = ((rd_wide['party_1'] == 100) * (rd_wide['candpct_1'] - rd_wide['candpct_2']) +
                     (rd_wide['party_2'] == 100) * (rd_wide['candpct_2'] - rd_wide['candpct_1']))

# Create variables for Democratic candidate experience
rd_wide['dem_elec_exp'] = (((rd_wide['party_1'] == 100) * rd_wide['elec_experience_1']) +
                           ((rd_wide['party_2'] == 100) * rd_wide['elec_experience_2']))
rd_wide['dem_pol_exp'] = (((rd_wide['party_1'] == 100) * rd_wide['pol_experience_1']) +
                           ((rd_wide['party_2'] == 100) * rd_wide['pol_experience_2']))

# drop obs with missing values for margin
rd_wide.drop(rd_wide[(rd_wide['margin'] == np.nan) | (rd_wide['margin'] == "")].index, inplace=True)
# drop those where margin is 0, but candidate shares are above 51 pct
# I don't know what such a case could be but there are about 400 in the data
rd_wide.drop(rd_wide[(rd_wide['margin'] == 0.0) & (rd_wide['candpct_1'] > 51)].index, inplace=True)

# Create leads and lags of vote share and wins
rd_wide['dem_win_m1'] = rd_wide.groupby(['state', 'district'])['dem_win'].shift(1)
rd_wide['dem_win_p1'] = rd_wide.groupby(['state', 'district'])['dem_win'].shift(-1)
# create leads for dem win and dem vote share
rd_wide['dem_share_m1'] = rd_wide.groupby(['state', 'district'])['dem_share'].shift(1)
rd_wide['dem_share_p1'] = rd_wide.groupby(['state', 'district'])['dem_share'].shift(-1)


# drop if missing values for leads (e.g. if year > 1986)
rd_wide = rd_wide[rd_wide.index.get_level_values(1) < 1987]
# drop if dem share >= 100
rd_wide = rd_wide[rd_wide.dem_share < 100]
rd_wide['gt0'] = (rd_wide['margin'] > 51).astype('int')

In [52]:
# create bins for running variable (margin)
rd_wide['margin_bin'] = pd.cut(rd_wide['margin'],
                               np.arange(start=-100, stop=102, step=2),
                               right=True,
                               labels=False, include_lowest=False)

# collapse data so have means by bin
binned_data = rd_wide.groupby('margin_bin').mean().reset_index()

binned_data = binned_data[binned_data['margin_bin'] < 100] # first one outliers?
# create variable for over 50%
binned_data['gt0'] = binned_data['margin_bin'] > 51

In [53]:
# Plot Democratic share at t+1 over Dem margin at t
px.scatter(binned_data, x='margin', y='dem_share_p1',
           title='Visual Evidence of a Discontinuity in Vote Share',
           labels={'margin':"Democrat's margin of victory at time t",
                   'dem_share_p1': "Democrat's vote share at time t+1"},
           template="plotly_white")

In [54]:
# Plot Democratic prob of victory at t+1 over Dem margin at t
px.scatter(binned_data, x='margin', y='dem_win_p1',
           title='Visual Evidence of a Discontinuity in Vote Share',
           labels={'margin':"Democrat's margin of victory at time t",
                   'dem_win_p1': "Democrat's probability of victory time t+1"},
           template="plotly_white")

### Regression Discontinuity Design

The two plots above give us a pretty clear picture of the incumbency advantage.  RDD is built around this discontinuity at zero and we can essentailly measure the effect here by the jump in the level of the point on the scatter plot at the cut point (i.e., a margin of victory of 0).

But if we want to be more precise, find the confidence intervals around this estimate, and control for covariates, we want to use regression analysis here.

Some notation to help describe what we are going to do:
* $Y$ is our outcome variable of interest (e.g., the probability a democrat wins the election at $t+1$). $Y(0)$ represents the outcome with no treatment and $Y(1)$ represents the outcome with treatment.
* $X$ is our running variable.  That is, this is the variable in which there is a discontinuity in the treatment (e.g., the margin of victory for a Democrat in election $t$)
* $c$ is the cut-point.  This is the point of the discontinuity in $X$ (e.g., just winning a margin of victory of 0)

In the data, we never observe $E[Y(0)|X=c]$, that is there are no units at the cutoff that don't get the treatment, but in principle it can be approximated arbitrarily well by $E[Y(0)|X=c-\varepsilon]$.  Therefore we estimate: $E[Y|X=c+\varepsilon]-E[Y|X=c-\varepsilon]$

We now need a functional form for $E[Y|X]$.  Suppose that $Y=\alpha+\tau T+ \beta f(X) +\eta$ and that $f(X)$ is a smooth function of $X$. A flexible way to do this is to modeling $f(X)$ with a pth-order polynomial in this way leads to 
$$Y=\alpha + \beta_{01}X + \beta_{02}X^{2}+...+\beta_{0p}X^{p}+\tau T + \beta_{1}TX + \beta_{2}TX^{2} + ... + \beta_{p}TX^{p}+\eta$$

We still have some choices here.  First off, what order is the polynomial we use?  Second, and relatedly, how far from the cut point do we look at the data?  

As with everything, this is a trade off.  If you look closer to the cut point, the data is more likely to be well approximated by a linear function, so you can use a lower order polynomial.  If you consider data farther from the cut point, a higher order polynomial may be necessary to fit the data better.  The distance from the cutpoint that you use in estimating the model is termed the "bandwidth" of the estimator.

There are a few formal ways to find that you have an appropriately flexible polynomial.  A couple of these are:
1. Use the Akaike information criterion (AIC) for model selection: $AIC = N ln(\hat{\sigma}^{2}) + 2p$, where $\hat{\sigma}^{2}$ is the mean squared error of the regression and $p$ is the number of model parameters (want to pick model with lowest AIC - i.e., lowest info loss)
2. Select a natural set of bins (as you would for an RD graph) and add bin dummies to the model and test their joint significance. Add higher order terms to the polynomial until the bin dummies are no longer jointly significant. (FYI, This also turns out to be a test for the presence of discontinuities in the regression function at points other then the cutoff, which you'll want to do anyway.)

There are also ways to determine that the bandwidth chosen is appropriate.  Again, this will depend on the functional form of your model.  There are two general methods for selecting bandwidth:
1. Ad hoc (e.g., elections between 48-52\% are "close") or substantively derived (Data driven)
2. Optimal bandwidth methods (Imbens and Kalyanaraman (*The Review of Economic Studies*, 2012))

A rule of thumb seems to be to use a 4th degree polynomial for $f(X)$.  But here, to make it simpler, let's start by estimating the RDD on our data that we just plotted using a 2nd degree polynomial and the full bandwidth (i.e., all the data).

In [28]:
# In this RD use polynomial of order 2 that varies on either side of cut point
# Define the model
rdd1 = smf.ols(formula='dem_share_p1 ~ margin + np.power(margin, 2) + gt0 + gt0*margin + gt0*np.power(margin, 2)', data=rd_wide)
# Estimate the model
print(rdd1.fit().summary())

                            OLS Regression Results                            
Dep. Variable:           dem_share_p1   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                     2651.
Date:                Thu, 07 Oct 2021   Prob (F-statistic):               0.00
Time:                        22:20:04   Log-Likelihood:                -27980.
No. Observations:                7382   AIC:                         5.597e+04
Df Residuals:                    7376   BIC:                         5.601e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 

The coefficient on the dummy variable for the cut point is the causal effect of treatment.  In this case this is the coefficient on the `gt0` variable, which denotes a more than 50% of the vote share.  We can do the same for the probability of victory.

In [29]:
# In this RD use polynomial of order 2 that varies on either side of cut point
# Define the model
rd_wide['gt0'] = (rd_wide['margin'] > 0).astype('int')
rdd2 = smf.ols(
    formula='dem_win_p1 ~ margin + np.power(margin, 2) + gt0 + gt0*margin + gt0*np.power(margin, 2)',
    data=rd_wide)
# Estimate the model
print(rdd2.fit().summary())

                            OLS Regression Results                            
Dep. Variable:             dem_win_p1   R-squared:                       0.613
Model:                            OLS   Adj. R-squared:                  0.613
Method:                 Least Squares   F-statistic:                     2424.
Date:                Thu, 07 Oct 2021   Prob (F-statistic):               0.00
Time:                        22:20:29   Log-Likelihood:                -1890.5
No. Observations:                7663   AIC:                             3793.
Df Residuals:                    7657   BIC:                             3835.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 

We can also use `ggplot` to plot our fitted values along with the data.

In [12]:
# get predicted values and bin them so can put on same plot
pred_rdd2 = rdd2.fit().predict(rd_wide)
new_rd_wide = rd_wide.copy()
new_rd_wide['pred_val'] = pred_rdd2
binned_data_rdd2 = new_rd_wide.groupby('margin_bin').mean().reset_index()

fig = px.scatter(binned_data, x='margin', y='dem_win_p1',
           title='The Incumbency Advantage',
           labels={'margin':"Democrat's margin of victory at time t",
                   'dem_win_p1': "Democrat's probability of victory time t+1"},
           template="plotly_white")
fig.add_scatter(x=binned_data['margin'], y=binned_data_rdd2['pred_val'],
               name='Fitted Values')
fig.show()

### The Role of Covariates

Notice that we didn't include any other controls in the models above, such as candidate experience.  In principle, covariates are not needed for identification in RD, but they can help reduce sampling variability in the estimator and improve precision if they are correlated with the potential outcomes.  Note that this is a standard argument which also supports inclusion of covariates in analyses of randomized trials.

Adding covariates should not affect the point estimate of the effect (very much). If it does, there is likely a problem.  The wider the bandwidth the more important it may be to include covariates -  including additional covariates may eliminate some bias that is the result of the inclusion of these additional observations far from the cutoff point.

The first and most important point is that the presence of these covariates rarely changes the identification strategy. Typically, the conditional distribution of the covariates $Z$ given $X$ is continuous at $x = c$. If such discontinuities in other covariates are found, the justification of the identification strategy may be questionable. If the conditional distribution of $Z$ given $X$ is continuous at $x = c$, then including Z in the regression. 

### Threats to the Validity of the RDD

It is impossible to test the continuity assumption directly, but we can test some implications of it.  Namely, all observed predetermined characteristics should have identical distributions on either side of the cutoff, in the limit, as we approach smaller and smaller bandwidths. That is, there should be no discontinuities in the observables.

Again there is an analogy to an experiment: we cannot test whether unobserved characteristics are balanced, but we can test the observables. Rejection calls the randomization into question.

A subtle point in the RD context is that a finding a discontinuity in observable covariates indicates a violation of the continuity assumption, not a violation of unconfoundedness, which is satisfied by definition.

To make sure that our RDD is valid, we'll want to have evidence that there are not discontinuities in the covariates at the cut point - or with any pre-determined variables  E.g., we can test for discontinuities in the Democrates vote share at t-1.

In [13]:
# Now test if other covariates smooth through cut point
px.scatter(binned_data, x='margin', y='dem_win_m1',
           title='Validity Test - Pre-determined Variables',
           labels={'margin':"Democrat's margin of victory at time t",
                   'dem_win_p1': "Democrat's margin of victory at time t-1"},
           template="plotly_white")

### The rdd Package

For RDD in Python, on might consider the [`rdd` package](https://pypi.org/project/rdd/).  It supports a number of fucntions useful for RDD estimators, such as computing optimal bandwith (Imbens and Kalyanaraman (2012)).

As an example, let's use the `rdd` package to find the RD estimate of the incumbency advantage on our full data:

In [35]:
# find IK optimal bandwidth
bandwidth_opt = rdd.optimal_bandwidth(rd_wide['dem_win_p1'], rd_wide['margin'], cut=0)
# do restrict data to within opt bandwith -- seems too small here
# use bandwidth of 20
data_rdd = rdd.truncated_data(rd_wide, 'margin', 20, cut=0)
# estimate model
# In way written, using local linear regression
model = rdd.rdd(data_rdd, 'margin', 'dem_win_p1', cut=0)
print(model.fit().summary())

Estimation Equation:	 dem_win_p1 ~ TREATED + margin
                            WLS Regression Results                            
Dep. Variable:             dem_win_p1   R-squared:                       0.403
Model:                            WLS   Adj. R-squared:                  0.403
Method:                 Least Squares   F-statistic:                     1088.
Date:                Thu, 07 Oct 2021   Prob (F-statistic):               0.00
Time:                        22:32:21   Log-Likelihood:                -1507.2
No. Observations:                3226   AIC:                             3020.
Df Residuals:                    3223   BIC:                             3039.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------


In [36]:
print('Optimal bandwidth = ', bandwidth_opt)

Optimal bandwidth =  2.5209883840880702


### Summary

Lee (2008) finds incumbency advantage of 7-8% of vote share (if party won last time, this is effect), which translates to 35\% increase in prob winning.  We find effects of similar magnitude with these data - about 4-5% and 40%, repsectively.  Of course a weakness of the design (and any non-structural approach) is that it doesn't tell us anything about what drives incumbency advantage.

Also, recall that one of the main points of this paper is that RD is value even if the running variable can be endogenous - as long as it can not be perfectly chosen.  The means we can use RDD not just with elections, but also things like thresholds for test scores, etc.
