# Regression Discontinuity and the Price Effects of Stock Market Indexing

**Replication of Chang, Hong, and Liskovich (2015)**

*The Review of Financial Studies, 28(1), 212–246*

---

This notebook replicates the main findings from Chang et al. (2015), who use a fuzzy regression discontinuity design to estimate the causal price effects of Russell index membership. The key results to replicate are:

1. **Addition effect** (~5%): Stocks moving from the Russell 1000 to the Russell 2000 experience a positive June return discontinuity
2. **Deletion effect** (~5.4%): Stocks moving from the Russell 2000 to the Russell 1000 experience a negative June return discontinuity
3. **Validity tests**: Pre-determined firm characteristics are smooth across the cutoff
4. **Time trends**: Demand for index stocks has become more elastic over time, so the price impact of index membership has declined

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.api import OLS, add_constant

from auxiliary.data_processing import (
    compute_market_cap_rankings,
    identify_index_switchers,
    merge_crsp_compustat,
    construct_outcome_variables,
)
from auxiliary.estimation import fuzzy_rd_estimate, fuzzy_rd_time_trend
from auxiliary.plotting import (
    plot_rd_discontinuity,
    plot_market_cap_continuity,
    plot_time_trends,
)

%matplotlib inline
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("muted")

SAMPLE_START = 1996
SAMPLE_END = 2012
EXTENSION_END = 2024  # extended sample for the passive-investing analysis
BANDWIDTH = 100
CUTOFF = 1000

## 2. Data Acquisition

Data is sourced from WRDS (Wharton Research Data Services):

- **CRSP**: Stock prices, returns, shares outstanding, trading volume
- **Compustat**: Quarterly shares outstanding (CSHOQ), earnings report dates (RDQ), firm fundamentals
- **Russell**: Annual constituent lists for Russell 1000 and Russell 2000 (1996–2012)

In [None]:
# ---------------------------------------------------------------------------
# Load raw datasets from data/
# crsp_daily is 58M rows (~679 MB compressed); deferred until Section 7
# when volume ratio and comovement variables are constructed.
# ---------------------------------------------------------------------------

print("Loading CRSP monthly...")
crsp_monthly_raw = pd.read_csv("data/crsp_monthly.csv.gz")
print(f"  {len(crsp_monthly_raw):,} rows  |  columns: {crsp_monthly_raw.columns.tolist()}")

print("Loading Compustat quarterly...")
compustat_quarterly_raw = pd.read_csv("data/compustat_quarterly.csv.gz")
print(f"  {len(compustat_quarterly_raw):,} rows")

print("Loading Compustat annual...")
compustat_annual = pd.read_csv("data/compustat_annual.csv.gz")
print(f"  {len(compustat_annual):,} rows")

print("Loading CCM link table...")
ccm_link_raw = pd.read_csv("data/crsp_compustat_link.csv.gz")
print(f"  {len(ccm_link_raw):,} rows")

print("Loading Russell 2000 daily returns...")
russell2000_daily = pd.read_csv("data/russell2000_daily.csv.gz", parse_dates=["date"])
print(f"  {len(russell2000_daily):,} rows")

# ---------------------------------------------------------------------------
# Pre-process: clean, filter, and build auxiliary lookup tables.
# merge_crsp_compustat():
#   - Filters CCM link to valid primary links (LINKTYPE LC/LU, LINKPRIM P/C)
#   - Takes abs(PRC) in CRSP monthly
#   - Keeps standard industrial consolidated USD records in Compustat
#   - Pre-computes filing availability dates from RDQ / SEC deadline rules
#   - Builds a (PERMNO, YYYYMM) → CFACSHR lookup for split adjustments
# ---------------------------------------------------------------------------
print("\nPre-processing and linking datasets...")
data = merge_crsp_compustat(crsp_monthly_raw, compustat_quarterly_raw, ccm_link_raw)

print("Done.")
print(f"  CRSP monthly (cleaned):        {len(data['crsp_monthly']):,} rows")
print(f"  Compustat quarterly (filtered): {len(data['compustat_quarterly']):,} rows")
print(f"  CCM link (valid primary):       {len(data['ccm_link']):,} rows")

## 3. Constructing End-of-May Rankings

Following Chang et al. (2015, Section 1.1), we reconstruct the market capitalization rankings that determine index membership:

1. Use end-of-May closing prices from CRSP
2. Determine the most recent publicly available quarterly shares outstanding from Compustat (CSHOQ), using RDQ to establish timing
3. Adjust for corporate distributions between fiscal quarter-end and May 31 using CRSP's share adjustment factors (FACSHR; the implementation below works with the cumulative factor, CFACSHR)
4. Rank all firms by end-of-May market capitalization
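
The arithmetic behind these four steps can be sketched on a toy table. Everything here is hypothetical: the column names (`shares_mil`, `split_factor`), the values, and the simple multiplicative adjustment are assumptions for illustration, and this is not the `compute_market_cap_rankings` implementation.

```python
import pandas as pd

# Hypothetical end-of-May snapshot: one row per stock.
toy = pd.DataFrame({
    "permno": [10001, 10002, 10003, 10004],
    "prc": [25.0, -40.0, 10.0, 55.0],          # end-of-May price; sign is stripped below
    "shares_mil": [80.0, 30.0, 150.0, 20.0],   # latest publicly available shares, millions
    "split_factor": [1.0, 2.0, 1.0, 1.0],      # assumed adjustment from quarter-end to May 31
})

# Step 3: bring quarter-end shares forward to May 31.
toy["shares_adj"] = toy["shares_mil"] * toy["split_factor"]

# Step 4: market cap in $ millions, then rank (1 = largest).
toy["market_cap"] = toy["prc"].abs() * toy["shares_adj"]
toy["rank"] = toy["market_cap"].rank(ascending=False, method="first").astype(int)
toy = toy.sort_values("rank").reset_index(drop=True)

print(toy[["permno", "market_cap", "rank"]])
```

In this toy example the stock with the 2-for-1 adjustment (`10002`) moves from the smallest share count to the largest market cap, which is why the timing of the share data matters near the cutoff.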

In [None]:
# ---------------------------------------------------------------------------
# Compute end-of-May market cap rankings for every year 1996–2024.
#
# For each year, compute_market_cap_rankings():
#   1. Selects eligible CRSP monthly observations (last trading day of May,
#      SHRCD in {10,11}, EXCHCD in {1,2,3}, price ≥ $1)
#   2. Attaches GVKEYs via CCM links active on May 31
#   3. Selects the most recent Compustat CSHOQ available before May 31
#      (using actual RDQ or estimated SEC filing deadlines)
#   4. Adjusts Compustat shares for splits via CFACSHR ratio
#   5. Takes max(CRSP SHROUT, adjusted Compustat shares)
#   6. Ranks all eligible stocks by market cap (descending)
# ---------------------------------------------------------------------------
all_rankings = {}
for year in range(SAMPLE_START, EXTENSION_END + 1):
    all_rankings[year] = compute_market_cap_rankings(data, year)
    n = len(all_rankings[year])
    r1000 = all_rankings[year].query("rank == 1000")
    cap = r1000["market_cap"].iloc[0] / 1000 if len(r1000) else float("nan")  # billions
    print(f"{year}: {n:5d} stocks ranked  |  rank-1000 market cap = ${cap:.2f}B")

print(f"\nRankings computed for {len(all_rankings)} years ({SAMPLE_START}–{EXTENSION_END}).")

In [None]:
# ---------------------------------------------------------------------------
# Verification summary
#
# The rank-1000 market cap target ($1.3–2.5B) was calibrated against the
# 1996–2012 replication sample.  Note:
#   • Early years (1996–97) and post-crash years (2002–03, 2009) naturally
#     fall slightly below $1.3B as overall market caps were depressed.
#   • Post-2018 years exceed $2.5B due to secular market appreciation.
# Both are expected and do not indicate an error in the construction.
# ---------------------------------------------------------------------------
summary = pd.DataFrame([
    {
        "year": yr,
        "n_stocks": len(df),
        "rank1000_mktcap_bn": (
            df.query("rank == 1000")["market_cap"].iloc[0] / 1000
            if (df["rank"] == 1000).any() else float("nan")
        ),
    }
    for yr, df in sorted(all_rankings.items())
])

# Replication period: 1996–2012
rep = summary[summary["year"].between(SAMPLE_START, SAMPLE_END)]
in_range_rep = rep["rank1000_mktcap_bn"].between(1.3, 2.5).mean()

print("=== Replication period (1996–2012) ===")
print(f"  Median stocks in eligible universe:  {rep['n_stocks'].median():.0f}")
print(f"  Rank-1000 market cap range:         ${rep['rank1000_mktcap_bn'].min():.2f}B – ${rep['rank1000_mktcap_bn'].max():.2f}B")
outside_rep = rep.loc[~rep["rank1000_mktcap_bn"].between(1.3, 2.5), "year"].tolist()
print(f"  Years with rank-1000 in $1.3–2.5B: {in_range_rep:.0%}  (years outside: {outside_rep})\n")

print("=== Full sample (1996–2024) ===")
in_range_all = summary["rank1000_mktcap_bn"].between(1.3, 2.5).mean()
print(f"  Median stocks in eligible universe:  {summary['n_stocks'].median():.0f}")
print(f"  Rank-1000 market cap range:         ${summary['rank1000_mktcap_bn'].min():.2f}B – ${summary['rank1000_mktcap_bn'].max():.2f}B")
print(f"  Years with rank-1000 in $1.3–2.5B: {in_range_all:.0%}  (post-2018 higher due to market growth)\n")

print(summary[["year", "n_stocks", "rank1000_mktcap_bn"]].to_string(index=False))

## 4. Continuity of Market Capitalizations (Figure 1)

The validity of the RD design relies on the smoothness of market capitalization across the cutoff. If firms could precisely manipulate which side of the cutoff they fall on, the quasi-random assignment assumption would be violated.
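
One way to eyeball this smoothness is to average log market cap within rank bins on each side of the cutoff. The sketch below uses synthetic data; the data-generating process, the bin width, and the helper `binned_means` are assumptions for illustration and are not part of `auxiliary.plotting`.

```python
import numpy as np
import pandas as pd

CUTOFF, BANDWIDTH, BIN_WIDTH = 1000, 100, 10

# Synthetic ranks and a log market cap that declines smoothly in rank.
rng = np.random.default_rng(0)
ranks = np.arange(CUTOFF - BANDWIDTH, CUTOFF + BANDWIDTH)
log_cap = 7.5 - 0.001 * (ranks - CUTOFF) + rng.normal(0, 0.01, ranks.size)
df = pd.DataFrame({"rank_centered": ranks - CUTOFF, "log_cap": log_cap})


def binned_means(frame, x, y, width):
    """Average y within equal-width bins of x; bins never straddle zero."""
    bins = np.floor(frame[x] / width).astype(int) * width
    return frame.groupby(bins)[y].mean()


means = binned_means(df, "rank_centered", "log_cap", BIN_WIDTH)

# Gap between the two bins adjacent to the cutoff. With no discontinuity it
# should be comparable to the gaps between other neighbouring bins.
gap = means.loc[0] - means.loc[-BIN_WIDTH]
print(means.round(3))
print(f"gap at cutoff: {gap:.4f}")
```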

In [None]:
# TODO: Replicate Figure 1 — Market cap continuity around cutoff
# fig = plot_market_cap_continuity(rankings_pooled)
# fig.savefig("files/figure1_market_cap_continuity.png", dpi=150, bbox_inches="tight")

## 5. First-Stage Regressions (Table 3)

The fuzzy RD first stage estimates the relationship between the instrument $\tau$ (indicator for crossing the cutoff based on end-of-May rank) and actual index membership $D$:

$$D_{it} = \alpha_{0l} + \alpha_{1l}(r_{it} - c) + \tau_{it}[\alpha_{0r} + \alpha_{1r}(r_{it} - c)] + \varepsilon_{it}$$

The coefficient $\alpha_{0r}$ is the jump in the probability of index membership at the cutoff. It measures how well our predicted rankings identify actual index switches.
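
The first-stage equation is an ordinary least squares regression with separate intercepts and slopes on each side of the cutoff. A minimal sketch on synthetic data, using only NumPy, follows. The compliance jump of 0.8, the sample size, and the homoskedastic standard errors are assumptions for illustration; this is not the `fuzzy_rd_estimate` implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, bandwidth = 4000, 100

# Centered rank (r - c) and instrument tau = 1 if the rank crossed the cutoff.
x = rng.integers(-bandwidth, bandwidth, size=n).astype(float)
tau = (x >= 0).astype(float)

# Actual membership D: crossing raises the membership probability by 0.8 (assumed).
p = 0.1 + 0.8 * tau
D = (rng.random(n) < p).astype(float)

# Design matrix: [1, (r - c), tau, tau * (r - c)].
X = np.column_stack([np.ones(n), x, tau, tau * x])
coef, *_ = np.linalg.lstsq(X, D, rcond=None)

# Homoskedastic standard errors, for illustration only.
resid = D - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

alpha_0r, t_stat = coef[2], coef[2] / se[2]
print(f"alpha_0r = {alpha_0r:.3f} (t = {t_stat:.1f})")
```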

In [None]:
# TODO: Replicate Table 3 — First-stage regressions
# Expected results:
# Addition (pre-banding):  α_0r = 0.785 (t = 31.50), R² = 0.863
# Addition (post-banding): α_0r = 0.820 (t = 12.98), R² = 0.845
# Deletion (pre-banding):  α_0r = 0.705 (t = 29.15), R² = 0.817
# Deletion (post-banding): α_0r = 0.759 (t = 20.90), R² = 0.878

## 6. Main Results: Returns Fuzzy RD (Table 4, Figure 4)

The second stage estimates the causal effect of Russell 2000 membership on returns, with $D_{it}$ instrumented by $\tau_{it}$:

$$Y_{it} = \beta_{0l} + \beta_{1l}(r_{it} - c) + D_{it}[\beta_{0r} + \beta_{1r}(r_{it} - c)] + \nu_{it}$$

The coefficient $\beta_{0r}$ is the estimated addition or deletion effect.
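
One standard way to estimate this equation is two-stage least squares, using $\tau_{it}$ and $\tau_{it}(r_{it} - c)$ as instruments for $D_{it}$ and $D_{it}(r_{it} - c)$. The sketch below runs it on synthetic data with a known effect of 0.05. The data-generating process (including the unobserved confounder `u`) is an assumption for illustration, and the code is not the `fuzzy_rd_estimate` implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, bandwidth, true_effect = 20000, 100, 0.05

x = rng.integers(-bandwidth, bandwidth, size=n).astype(float)  # r - c
tau = (x >= 0).astype(float)                                    # instrument
u = rng.normal(0, 1, n)                                         # unobserved confounder

# Membership depends on tau and on u, so D is endogenous.
D = ((0.8 * tau + 0.1 + 0.05 * u) > rng.random(n)).astype(float)
y = 0.01 + 0.0001 * x + true_effect * D + 0.02 * u + rng.normal(0, 0.05, n)

X = np.column_stack([np.ones(n), x, D, D * x])       # regressors
Z = np.column_stack([np.ones(n), x, tau, tau * x])   # instruments

# 2SLS: project X on Z, then regress y on the fitted values.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
beta_0r = beta[2]
print(f"beta_0r = {beta_0r:.3f} (true effect = {true_effect})")
```

The point estimates from this manual two-step procedure are the 2SLS estimates, but standard errors from the second regression would be wrong, so inference needs a dedicated IV routine or the correct variance formula.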

In [None]:
# TODO: Replicate Table 4 — Returns fuzzy RD
# Expected results:
# Addition effect (June): β_0r = 0.050 (t = 2.65)
# Deletion effect (June): β_0r = 0.054 (t = 3.00)

In [None]:
# TODO: Replicate Figure 4 — June returns scatter with RD fit
# for bin_width in [2, 5]:
#     fig = plot_rd_discontinuity(addition_df, "june_return", "rank_centered",
#                                  bin_width=bin_width, title=f"Addition effect; bin width = {bin_width}")
#     fig.savefig(f"files/figure4_addition_bw{bin_width}.png", dpi=150, bbox_inches="tight")

## 7. Trading Volume and Institutional Ownership (Table 5)

Addition to the Russell 2000 should lead to elevated trading volume in June as index funds rebalance. However, if institutions with different index preferences trade with each other, the *level* of institutional ownership may not change significantly.

In [None]:
# TODO: Replicate Table 5 — VR and IO fuzzy RD
# Expected results:
# Addition VR (June):  β_0r = 0.478 (t = 3.14)
# Addition IO:         β_0r = 0.031 (t = 0.77, not significant)
# Deletion VR (June):  β_0r = -0.263 (t = -2.74)
# Deletion IO:         β_0r = -0.063 (t = -1.69, not significant)

## 8. Validity Tests (Table 6)

Following Lee and Lemieux (2010), we verify that pre-determined firm characteristics are smooth across the cutoff. This is crucial to the assumption of local randomization. We test for discontinuities in: market capitalization, repurchase activity, ROE, ROA, EPS, total assets, interest coverage ratio, and cash-to-asset ratio.
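
A reduced-form version of this check regresses each pre-determined covariate on the cutoff indicator with separate linear trends on each side, then inspects the t-statistic of the jump. The sketch below uses synthetic covariates with no true discontinuity. The helper `rd_jump`, the variable names, and the homoskedastic standard errors are assumptions for illustration; the specification used for Table 6 may differ.

```python
import numpy as np


def rd_jump(y, x):
    """OLS jump at x = 0 with separate linear trends on each side.

    Returns (estimate, t-statistic) using homoskedastic standard errors.
    """
    side = (x >= 0).astype(float)
    X = np.column_stack([np.ones_like(x), x, side, side * x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef[2], coef[2] / se[2]


rng = np.random.default_rng(3)
n = 3000
x = rng.integers(-100, 100, size=n).astype(float)  # centered rank

# Synthetic pre-determined covariates that vary smoothly with rank.
covariates = {
    "roa": 0.05 - 0.0001 * x + rng.normal(0, 0.03, n),
    "log_assets": 7.0 - 0.002 * x + rng.normal(0, 0.5, n),
    "cash_to_assets": 0.15 + rng.normal(0, 0.08, n),
}

results = {name: rd_jump(y, x) for name, y in covariates.items()}
for name, (est, t) in results.items():
    print(f"{name:15s} jump = {est:+.4f}  t = {t:+.2f}")
```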

In [None]:
# TODO: Replicate Table 6 — Validity checks
# Expected: No statistically significant discontinuities in any
# pre-determined variable for either addition or deletion samples

## 9. Time Trends in Indexing Effects (Tables 7–8, Figure 5)

Even as passive indexing has grown dramatically, the price impact of index membership has *fallen* over time. This suggests that arbitrage capacity has grown faster than indexing demand, making demand curves more elastic.
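
A time trend in the effect can be captured by interacting the membership indicator with a linear time variable $t$. The sketch below illustrates the idea with a sharp design on synthetic data in which the jump shrinks each year. The regressor set, the parameter values, and the sharp assignment are assumptions for illustration; the specification in `fuzzy_rd_time_trend` may differ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, bandwidth, n_years = 30000, 100, 17   # 17 sample years, e.g. 1996 to 2012

x = rng.integers(-bandwidth, bandwidth, size=n).astype(float)  # r - c
D = (x >= 0).astype(float)                                      # sharp assignment, for simplicity
t = rng.integers(0, n_years, size=n).astype(float)              # years since sample start

# Assumed data-generating process: a jump of 0.08 that shrinks by 0.004 per year.
b0, b2 = 0.08, -0.004
y = 0.01 + 0.0001 * x + D * (b0 + b2 * t) + rng.normal(0, 0.05, n)

# Regressors: [1, (r - c), D, D * (r - c), t, D * t]; the D * t term carries the trend.
X = np.column_stack([np.ones(n), x, D, D * x, t, D * t])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
base_effect, trend = coef[2], coef[5]
print(f"base effect = {base_effect:.3f}, trend per year = {trend:.4f}")
```

A negative coefficient on the interaction corresponds to a price impact that declines over the sample, which is the pattern the time-trend tables test for.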

In [None]:
# TODO: Replicate Tables 7-8 — Time trend regressions
# Expected key results (addition, Table 7):
# Returns/%Demand (base):     β_0r = 5.856 (t = 2.51)
# Returns/%Demand × t:        β_2r = -0.403 (t = -2.46)
# VR (base):                  β_0r = 0.329 (t = 2.00)
# VR × t:                     β_2r = 0.023 (t = 2.50)
# SR × t:                     β_2r = 0.002 (t = 2.18), shorting increases over time

In [None]:
# TODO: Replicate Figure 5 — Rolling RD estimates over time
# fig = plot_time_trends(rolling_estimates, outcome="price_impact")
# fig.savefig("files/figure5_time_trends.png", dpi=150, bbox_inches="tight")

## 10. Summary of Replication Results

| Result | Original | Replicated | Match? |
|--------|----------|------------|--------|
| Addition effect (June return) | 5.0% (t=2.65) | — | — |
| Deletion effect (June return) | 5.4% (t=3.00) | — | — |
| First stage F (addition, pre-banding) | 1,876 | — | — |
| Volume ratio (addition, June) | 0.478 (t=3.14) | — | — |
| Price elasticity (full sample) | −1.5 | — | — |
| Time trend in price impact | Declining (t=−2.46) | — | — |
| Validity tests (8 variables) | All insignificant | — | — |

## References

- Chang, Y.-C., Hong, H., & Liskovich, I. (2015). Regression Discontinuity and the Price Effects of Stock Market Indexing. *The Review of Financial Studies*, 28(1), 212–246.
- Lee, D. S., & Lemieux, T. (2010). Regression Discontinuity Designs in Economics. *Journal of Economic Literature*, 48(2), 281–355.
- Hahn, J., Todd, P., & van der Klaauw, W. (2001). Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design. *Econometrica*, 69(1), 201–209.