### Sourcing SCOTUS opinions from Harvard's [Caselaw Access Project (CAP)](https://case.law/)

Goal: retrieve all opinions written by the Supreme Court for a specified year range.

SCOTUS denies cert in thousands of cases every year, so we can't just grab all SCOTUS documents from CAP for a specified year. We need docket numbers for the cases that were granted cert and argued before the Court. Here, we source those docket numbers from the [Super-SCOTUS dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/POWQIT) [[paper](https://aclanthology.org/2023.nllp-1.20/)].

1. Get docket numbers for the years 1986-2019 from Super-SCOTUS.
2. For each year, request a small sample of cases (14 per year) from CAP. (Waiting on unmetered API access before pulling the full set.)
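
The two steps can be sketched without touching the network. This is a minimal illustration, assuming records shaped like the Super-SCOTUS entries loaded in this notebook (a `year` field and an `id` such as `"1986_84-2022"`). The helper names `group_dockets_by_year` and `sample_per_year` are hypothetical, and every record except the first is made up for the example.

```python
from collections import defaultdict

def group_dockets_by_year(records):
    """Map year -> list of docket numbers, given records with "year" and "id"."""
    by_year = defaultdict(list)
    for record in records:
        # Assumes ids look like "<year>_<docket number>"; split on the first
        # underscore only so the docket number is kept whole.
        docket_number = record["id"].split("_", 1)[1]
        by_year[record["year"]].append(docket_number)
    return by_year

def sample_per_year(by_year, first_year, last_year, n):
    """Take the first n docket numbers for each year in [first_year, last_year]."""
    sample = []
    for year in range(first_year, last_year + 1):
        sample.extend(by_year.get(str(year), [])[:n])
    return sample

# Made-up records for illustration (only the first id appears in this notebook).
records = [
    {"id": "1986_84-2022", "year": "1986"},
    {"id": "1986_85-1", "year": "1986"},
    {"id": "1987_86-5", "year": "1987"},
]
by_year = group_dockets_by_year(records)
sample = sample_per_year(by_year, 1986, 1987, n=1)
print(sample)
```

Taking the first `n` entries per year is deterministic but not random; a random sample would need `random.sample` with a fixed seed instead.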

**Case issues**
- [Board of Education v. Tom F.](https://cite.case.law/us/552/1/): One justice recused and the Court split 4-4, producing a roughly two-sentence per curiam opinion that affirmed the lower court by default. For an unknown reason, this case can't be found by docket number (via the web interface or the API).
- [Atlantic Sounding Co. v. Townsend](https://cite.case.law/us/557/404/): Classified as 11th Circuit instead of SCOTUS, so the `court_id=9009` filter returns 0 results for this case's docket number.
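
One possible way to cope with cases like these is to retry a failed lookup without the court filter and record whatever still comes back empty. The sketch below is an assumption, not something this notebook does: `lookup_with_fallback` is a hypothetical helper, `search` is an injected callable standing in for the API request, and the docket numbers and IDs are made up. Whether dropping the court filter actually surfaces the Atlantic Sounding case has not been verified here.

```python
SCOTUS_COURT_ID = 9009  # court_id used in this notebook's CAP queries

def lookup_with_fallback(docket_number, search):
    """Try the SCOTUS-filtered search first, then retry without the court filter.

    `search(docket_number, court_id)` is a hypothetical callable that returns a
    dict with "count" and "results", like the list responses used in this notebook.
    Returns (response, status), where status is "scotus", "unfiltered", or "missing".
    """
    response = search(docket_number, SCOTUS_COURT_ID)
    if response["count"] > 0:
        return response, "scotus"
    # Retry with no court filter, in case the case was filed under another court.
    response = search(docket_number, None)
    if response["count"] > 0:
        return response, "unfiltered"
    return response, "missing"

# Fake backend with made-up docket numbers and ids, standing in for the API.
FAKE_INDEX = {
    ("00-1", SCOTUS_COURT_ID): [{"id": 1}],
    ("00-2", None): [{"id": 2}],
}

def fake_search(docket_number, court_id):
    results = FAKE_INDEX.get((docket_number, court_id), [])
    return {"count": len(results), "results": results}

statuses = {d: lookup_with_fallback(d, fake_search)[1] for d in ["00-1", "00-2", "00-3"]}
print(statuses)
```

Unfiltered matches would still need a manual check, since other courts can reuse the same docket number.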

In [None]:
%cd -q ../..

import os
from collections import defaultdict

import jsonlines
import requests

In [None]:
docket_nums_by_year = defaultdict(list)
with jsonlines.open("data/super_scotus/1986_to_2019.jsonl", "r") as f:
    for case in f:
        # Example case id: "1986_84-2022"
        year = case["year"]
        docket_number = case["id"][5:]
        docket_nums_by_year[year].append(docket_number)

In [None]:
# Take the first 14 docket numbers from each year
fourteen_from_each = []
for year in range(1986, 2020):
    fourteen_from_each.extend(docket_nums_by_year[str(year)][:14])

In [None]:
CAP_TOKEN = os.environ["CAP_TOKEN"]
CAP_API = "https://api.case.law/v1/cases"
HEADERS = {"Authorization": f"Token {CAP_TOKEN}"}

def case_json_by_id(case_id):
    # Fetch a single case, including the full case body
    return requests.get(f"{CAP_API}/{case_id}", params={"full_case": "true"}, headers=HEADERS).json()

def case_by_docket_number(docket_number):
    # Search SCOTUS (court_id 9009) by docket number; `params` URL-encodes the value
    params = {"court_id": 9009, "docket_number": docket_number}
    return requests.get(CAP_API, params=params, headers=HEADERS).json()

def longest_casebody_in_results(json_response):
    # Return the id of the result with the highest word count ("" if there are no results)
    max_word_count = 0
    case_id_to_return = ""
    for case in json_response["results"]:
        word_count = case["analysis"]["word_count"]
        if word_count > max_word_count:
            max_word_count = word_count
            case_id_to_return = case["id"]
    return case_id_to_return

In [None]:
cases = []
data_report = []
for docket_number in fourteen_from_each:
    api_response = case_by_docket_number(docket_number)
    num_results = api_response["count"]
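    if num_results == 0:
        # TODO: Error handling. For now, record the miss so it can be reviewed later
        data_report.append({"docket_number": docket_number, "num_results": 0, "case_id": None})
    else:
        # Make note of the count and which doc we ended up choosing
        case_id = longest_casebody_in_results(api_response)
        data_report.append({"docket_number": docket_number, "num_results": num_results, "case_id": case_id})
        case_json = case_json_by_id(case_id)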
        # TODO: Decide if the whole json is necessary to save or if we can just get the opinions
        # TODO: Count number of opinions and add
        cases.append(case_json)

In [None]:
target = "data/harvard_cap/14_cases_from_1986_to_2019.jsonl"
with jsonlines.open(target, "w") as f:
    f.write_all(cases)
    print(f"{len(cases)} cases written to {target}")
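
The TODOs about keeping only the opinions and counting them could be approached along these lines. This is a sketch under a loud assumption: that a full-case response nests opinions under `casebody` → `data` → `opinions`, each with a `type` field. That layout should be checked against a real response before relying on it. `summarize_opinions` is a hypothetical helper and the example case is made up.

```python
def summarize_opinions(case_json):
    """Count opinions in a case dict and list their types.

    ASSUMPTION: opinions live at case_json["casebody"]["data"]["opinions"].
    Missing or differently shaped fields are treated as "no opinions".
    """
    casebody = case_json.get("casebody") or {}
    data = casebody.get("data")
    if not isinstance(data, dict):
        data = {}
    opinions = data.get("opinions") or []
    return {
        "id": case_json.get("id"),
        "num_opinions": len(opinions),
        "opinion_types": [opinion.get("type") for opinion in opinions],
    }

# Made-up case for illustration only.
fake_case = {
    "id": 123,
    "casebody": {
        "data": {
            "opinions": [
                {"type": "majority", "text": "..."},
                {"type": "dissent", "text": "..."},
            ]
        }
    },
}
summary = summarize_opinions(fake_case)
print(summary)
```

A summary like this could be appended to `data_report` for each case, leaving the decision about saving the whole JSON for later.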