Setup
-----

In [1]:
import pandas as pd

# Load data
fowler_data = pd.read_stata('Pol_Analysis_replication.dta')
fowler_cites = pd.read_csv('Pol_Analysis_ussc_cites.csv')

First, count the distinct cases in the dataset. According to the paper, there should be 26,681, and the count matches.

In [2]:
# Get distinct cases in dataset
fowler_cases = fowler_data.drop_duplicates('lexid')
print(len(fowler_cases))

26681
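
As a cross-check on the count above, the same number can be obtained without building an intermediate frame. The sketch below uses a small hypothetical table (the `lexid` values and the `row` column are made up for illustration, not taken from the replication file) and assumes only that an ID may appear on more than one row.

```python
import pandas as pd

# Hypothetical stand-in for the replication data: IDs may repeat across rows
data = pd.DataFrame({
    'lexid': ['A', 'A', 'B', 'C', 'C', 'C'],
    'row':   [1, 2, 3, 4, 5, 6],
})

# Count distinct IDs directly (nunique ignores missing values by default)
n_distinct = data['lexid'].nunique()

# How many rows each ID contributes, to see where the repeats come from
rows_per_case = data['lexid'].value_counts()

print(n_distinct)
print(rows_per_case)
```

If the two counts ever disagree, the likely cause is missing IDs: `drop_duplicates` keeps one row for a missing value, while `nunique` skips it.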


Next, calculate how many citation dyads are included in the replication data.

In [3]:
print(len(fowler_cites))

202167


But what happens when we drop dyads in which either case is NOT in the master set of 26,681 cases?

In [4]:
# Keep only dyads whose citing and cited cases both appear in the master set
fowler_cites_cleaned = fowler_cites[fowler_cites['citing_case'].isin(fowler_data['lexid'])]
fowler_cites_cleaned = fowler_cites_cleaned[fowler_cites_cleaned['cited_case'].isin(fowler_data['lexid'])]
print(len(fowler_cites_cleaned))

182040


First question:
-----------------
Why does the citation data include ~20,000 dyads (202,167 - 182,040 = 20,127) in which at least one case is missing from the 26,681-case master set?

For example, these cases are cited but they are not included in the 26,681 master set:

In [5]:
fowler_cites[~fowler_cites['cited_case'].isin(fowler_data['lexid'])]

Unnamed: 0,cited_case,citing_case
19040,1886 U.S. LEXIS 1920,1887 U.S. LEXIS 2151
21875,1886 U.S. LEXIS 1920,1890 U.S. LEXIS 1921
23687,1886 U.S. LEXIS 1920,1891 U.S. LEXIS 2363
24355,1891 U.S. LEXIS 2471,1891 U.S. LEXIS 2486
28023,1886 U.S. LEXIS 1920,1893 U.S. LEXIS 2360
...,...,...
201872,1992 U.S. LEXIS 6805,2005 U.S. LEXIS 4817
201908,1995 U.S. LEXIS 3181,2005 U.S. LEXIS 4842
201969,2000 U.S. LEXIS 1540,2005 U.S. LEXIS 5011
201984,2001 U.S. LEXIS 641,2005 U.S. LEXIS 5014


Likewise, these cases are recorded as citing, but they are not included in the 26,681 master set:

In [6]:
fowler_cites[~fowler_cites['citing_case'].isin(fowler_data['lexid'])]

Unnamed: 0,cited_case,citing_case
4080,1957 U.S. LEXIS 1627,1858 U.S. LEXIS 1794
17816,1886 U.S. LEXIS 1956,1886 U.S. LEXIS 1957
18062,1887 U.S. LEXIS 2054,1886 U.S. LEXIS 2228
18063,1885 U.S. LEXIS 1839,1886 U.S. LEXIS 2236
19351,1887 U.S. LEXIS 2035,1887 U.S. LEXIS 2530
...,...,...
202162,2005 U.S. LEXIS 628,2005 U.S. LEXIS 998
202163,2005 U.S. LEXIS 628,2005 U.S. LEXIS 999
202164,1973 U.S. LEXIS 154,2006 U.S. LEXIS 1816
202165,1973 U.S. LEXIS 154,2006 U.S. LEXIS 4895
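
To see where the missing cases sit, it may help to split the dropped dyads by which side is absent from the master set. The following is a minimal sketch on hypothetical toy data (the single-letter IDs are placeholders, not real LEXIS IDs); with the real files, `cases` and `cites` would be `fowler_data` and `fowler_cites`.

```python
import pandas as pd

# Hypothetical stand-ins for the master case list and the citation dyads
cases = pd.DataFrame({'lexid': ['A', 'B', 'C']})
cites = pd.DataFrame({
    'cited_case':  ['A', 'A', 'X', 'Y', 'B'],
    'citing_case': ['B', 'Z', 'C', 'Z', 'C'],
})

# For each dyad, flag whether each endpoint appears in the master set
cited_known = cites['cited_case'].isin(cases['lexid'])
citing_known = cites['citing_case'].isin(cases['lexid'])

# Count dyads in each of the four possible combinations
summary = {
    'both_known':          int((cited_known & citing_known).sum()),
    'only_cited_missing':  int((~cited_known & citing_known).sum()),
    'only_citing_missing': int((cited_known & ~citing_known).sum()),
    'both_missing':        int((~cited_known & ~citing_known).sum()),
}
print(summary)
```

The four counts always sum to the total number of dyads, and `both_known` corresponds to the size of the cleaned set.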


Second question:
---------------

Why are some dyads simply duplicated?

In [7]:
fowler_cites[fowler_cites.duplicated(keep=False)]

Unnamed: 0,cited_case,citing_case
86,1807 U.S. LEXIS 382,1810 U.S. LEXIS 343
87,1807 U.S. LEXIS 382,1810 U.S. LEXIS 343
158,1796 U.S. LEXIS 399,1816 U.S. LEXIS 327
160,1796 U.S. LEXIS 399,1816 U.S. LEXIS 327
444,1803 U.S. LEXIS 352,1821 U.S. LEXIS 362
...,...,...
201764,1987 U.S. LEXIS 2980,2005 U.S. LEXIS 4342
201766,1994 U.S. LEXIS 4826,2005 U.S. LEXIS 4342
201881,1995 U.S. LEXIS 4069,2005 U.S. LEXIS 4839
201882,1995 U.S. LEXIS 4069,2005 U.S. LEXIS 4839
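
One way to size the duplication is to count how often each repeated pair occurs and how many rows remain after keeping one copy of each. This is a sketch on hypothetical toy data (placeholder IDs); it restricts the comparison to the two case columns, which is an assumption about what should count as a duplicate.

```python
import pandas as pd

# Hypothetical citation dyads with some repeated pairs
cites = pd.DataFrame({
    'cited_case':  ['A', 'A', 'B', 'B', 'B', 'C'],
    'citing_case': ['D', 'D', 'E', 'E', 'E', 'F'],
})
pair = ['cited_case', 'citing_case']

# Every row that belongs to a repeated pair (all copies are flagged)
dup_rows = cites[cites.duplicated(subset=pair, keep=False)]

# Number of occurrences of each repeated pair
dup_counts = dup_rows.groupby(pair).size()

# Keep only the first copy of each dyad
deduped = cites.drop_duplicates(subset=pair, keep='first')

print(dup_counts)
print(len(cites), '->', len(deduped))
```

Whether dropping the extra copies is appropriate depends on whether a repeated row encodes multiple citations within one opinion or is a data-entry artifact, which the data alone does not settle.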


Third question:
---------------

Why do some dyads record a case citing a *future* case, i.e. one whose LEXIS ID carries a later year?

In [8]:
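# The first four characters of each LEXIS ID are the year
fowler_cites['citing_year'] = fowler_cites.citing_case.str[:4]
fowler_cites['cited_year'] = fowler_cites.cited_case.str[:4]
# Years are compared as strings, which orders four-digit years correctly
fowler_cites[fowler_cites['citing_year'] < fowler_cites['cited_year']]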

Unnamed: 0,cited_case,citing_case,citing_year,cited_year
0,1796 U.S. LEXIS 409,1792 U.S. LEXIS 587,1792,1796
2,1796 U.S. LEXIS 409,1793 U.S. LEXIS 247,1793,1796
3,1796 U.S. LEXIS 409,1793 U.S. LEXIS 248,1793,1796
6,1798 U.S. LEXIS 145,1793 U.S. LEXIS 249,1793,1798
7,1950 U.S. LEXIS 2624,1793 U.S. LEXIS 249,1793,1950
...,...,...,...,...
169469,1988 U.S. LEXIS 2733,1983 U.S. LEXIS 83,1983,1988
170947,1985 U.S. LEXIS 17,1984 U.S. LEXIS 19,1984,1985
171099,1985 U.S. LEXIS 121,1984 U.S. LEXIS 2776,1984,1985
174241,1990 U.S. LEXIS 2294,1985 U.S. LEXIS 89,1985,1990
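
Converting the years to integers makes it possible to measure how far forward each of these citations reaches. The sketch below uses three dyads copied from the outputs above and assumes every ID begins with a four-digit year; the helper `lexis_year` and the `year_gap` column are names introduced here for illustration.

```python
import pandas as pd

# Three dyads taken from the outputs above
cites = pd.DataFrame({
    'cited_case':  ['1796 U.S. LEXIS 409', '1950 U.S. LEXIS 2624', '1803 U.S. LEXIS 352'],
    'citing_case': ['1792 U.S. LEXIS 587', '1793 U.S. LEXIS 249', '1821 U.S. LEXIS 362'],
})

def lexis_year(ids: pd.Series) -> pd.Series:
    """Return the leading four characters of each ID as an integer year.

    Assumes every ID starts with a four-digit year.
    """
    return ids.str[:4].astype(int)

cites['cited_year'] = lexis_year(cites['cited_case'])
cites['citing_year'] = lexis_year(cites['citing_case'])

# Negative gap: the citing case is dated before the case it cites
cites['year_gap'] = cites['citing_year'] - cites['cited_year']
forward = cites[cites['year_gap'] < 0]

print(forward[['cited_case', 'citing_case', 'year_gap']])
```

A year-level comparison cannot detect forward citations between two cases decided in the same year, so a count obtained this way is a lower bound.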
