### Sourcing SCOTUS opinions from Harvard's [Caselaw Access Project (CAP)](https://case.law/)

Goal: retrieve all opinions written by the Supreme Court for a specified year range.

SCOTUS denies thousands of petitions every year, and each denial gets its own document, so we can't just grab all SCOTUS documents from CAP for a specified year. Instead, we need the docket numbers of the cases that were granted cert and argued before the Court. Here, we source those docket numbers from the [Super-SCOTUS dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/POWQIT) [[paper](https://aclanthology.org/2023.nllp-1.20/)].

1. Get docket numbers for the years 1986-2019 from Super-SCOTUS.
2. For each year, request a small sample (~15 cases) from CAP. (Waiting on unmetered API access before pulling the full set.)

**Case issues**
- [Board of Education v. Tom F.](https://cite.case.law/us/552/1/): There was a recusal and the Court split 4-4, leading to a ~2-sentence per curiam opinion saying the lower court was affirmed by default. For some reason, this case can't be found by docket number (via web or API).
- [Atlantic Sounding Co. v. Townsend](https://cite.case.law/us/557/404/): Classified as 11th Circuit instead of SCOTUS, so the 9009 court filter returns 0 results for this case's docket number.
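
One way to work around both issues is a small manual override table that is consulted when a docket-number search comes back empty. The sketch below is only an illustration: `MANUAL_OVERRIDES` and `resolve_case` are hypothetical names (not part of `scotus_metalang` or CAP), and the docket numbers are assumed from the published opinions and should be checked against the Super-SCOTUS data before use.

```python
# Hypothetical override table: docket number -> cite.case.law URL from the list above.
# The docket numbers are assumptions; verify them against the Super-SCOTUS dataset.
MANUAL_OVERRIDES = {
    "06-637": "https://cite.case.law/us/552/1/",    # Board of Education v. Tom F.
    "08-214": "https://cite.case.law/us/557/404/",  # Atlantic Sounding Co. v. Townsend
}


def resolve_case(docket_number, search_results):
    """Decide where a case should come from.

    Returns ("search", first_result) when the docket search found something,
    ("override", url) when only the manual table knows the case,
    and ("missing", None) otherwise.
    """
    if search_results:
        return ("search", search_results[0])
    if docket_number in MANUAL_OVERRIDES:
        return ("override", MANUAL_OVERRIDES[docket_number])
    return ("missing", None)


# An empty result list stands in for a docket search that returned 0 cases.
print(resolve_case("08-214", []))
```

Keeping the exceptions in one table makes it easy to see which cases bypassed the normal docket-number lookup.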

In [None]:
%cd -q ../..

import csv
import os
from collections import defaultdict
from pathlib import Path

import jsonlines

from scotus_metalang import cap

In [None]:
docket_nums_by_year = defaultdict(list)
with jsonlines.open("data/super_scotus/1986_to_2019.jsonl", "r") as f:
    for case in f:
        # Example case id: "1986_84-2022"
        year = case["year"]
        docket_number = case["id"][5:]
        docket_nums_by_year[year].append(docket_number)

# Take up to 14 docket numbers from each year
fourteen_from_each = []
for year in range(1986, 2020):
    fourteen_from_each += docket_nums_by_year[year][:14]
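
The slice `case["id"][5:]` relies on the year prefix being exactly four digits followed by an underscore. A minimal sketch of a more explicit alternative, assuming every id follows the `"1986_84-2022"` pattern shown in the comment above; `parse_case_id` is a hypothetical helper, not part of `scotus_metalang`:

```python
def parse_case_id(case_id):
    """Split an id like "1986_84-2022" into (year, docket_number)."""
    # Split on the first underscore only, so the docket number is kept whole
    year_text, docket_number = case_id.split("_", 1)
    return int(year_text), docket_number


year, docket_number = parse_case_id("1986_84-2022")
print(year, docket_number)
```

Splitting on the underscore fails loudly (with a `ValueError`) on an id that lacks the separator, rather than silently returning a truncated docket number.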

In [None]:
log_dir = Path("data/logs")
log_dir.mkdir(parents=True, exist_ok=True)
log_path = log_dir / "cap_scraping_log.tsv"
log_exists = os.path.exists(log_path)

header = ["docket_number", "status", "cases_returned", "case_id_selected", "num_opinions", "authors"]
# "a+" creates the file if it is missing (unlike "r+") and appends every write
with open(log_path, "a+", newline="") as f:
    f.seek(0)  # "a+" opens at the end of the file; rewind to read existing rows
    # Without explicit fieldnames, DictReader consumes the header row itself
    reader = csv.DictReader(f, delimiter="\t")
    # Add each TSV row to a dict indexed by docket number
    log = {}
    for line in reader:
        # Docket numbers such as "84-2022" are not integers, so keep them as strings
        docket_number = line.pop("docket_number")
        log[docket_number] = line
    log_writer = csv.DictWriter(f, header, delimiter="\t")
    if not log_exists:
        log_writer.writeheader()
    for docket_number in fourteen_from_each[:5]:  # Sample here to limit API usage while tinkering
        cap.save_opinions_by_docket_number(docket_number, log, log_writer)
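
The log presumably exists so that reruns can skip docket numbers that were already requested. The implementation of `cap.save_opinions_by_docket_number` is not shown here, so the following is only a sketch of that resume-from-log pattern under stated assumptions: `load_log`, `process_pending`, the trimmed two-column `HEADER`, and the `fetch` callback are all hypothetical stand-ins, and no CAP request is made.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["docket_number", "status"]  # trimmed, hypothetical subset of the log columns


def load_log(path):
    """Return previously logged rows keyed by docket number (empty if no log yet)."""
    if not path.exists():
        return {}
    with open(path, newline="") as f:
        return {row["docket_number"]: row for row in csv.DictReader(f, delimiter="\t")}


def process_pending(path, docket_numbers, fetch):
    """Call fetch() only for docket numbers missing from the log; log each result."""
    log = load_log(path)
    is_new = not path.exists()
    processed = []
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, HEADER, delimiter="\t")
        if is_new:
            writer.writeheader()
        for docket_number in docket_numbers:
            if docket_number in log:
                continue  # already handled in an earlier run
            status = fetch(docket_number)
            writer.writerow({"docket_number": docket_number, "status": status})
            processed.append(docket_number)
    return processed


# Demo with a stand-in fetch function and a temporary log file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "log.tsv"
    first_run = process_pending(demo_path, ["84-2022", "85-1"], lambda d: "ok")
    second_run = process_pending(demo_path, ["84-2022", "85-1", "85-2"], lambda d: "ok")
    print(first_run, second_run)
```

Reading the log in one pass and appending in another keeps the file handling simple, at the cost of opening the file twice per run.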