# Caselaw Access Project
The Caselaw Access Project contains 360 years of US caselaw and is maintained by Harvard Law School. It can be accessed [here](https://case.law/).

The data used in this project is a subsample: only the latest caselaw volume found for each state.

## Import libraries

In [1]:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time

## Data
Map each state to the abbreviation used in its `static.case.law` URL

In [2]:
state_addverbs = {
    'Alabama': 'ala',
    'Alaska': 'alaska',
    'Arizona': 'ariz',
    'Arkansas': 'ark',
    'California': 'cal',
    'Colorado': 'colo',
    'Connecticut': 'conn',
    'Delaware': 'del',
    'Florida': 'fla',
    'Georgia': 'ga',
    'Hawaii': 'haw',
    'Idaho': 'idaho',
    'Illinois': 'ill',
    'Indiana': 'ind',
    'Iowa': 'iowa',
    'Kansas': 'kan',
    'Kentucky': 'ky',
    'Louisiana': 'la',
    'Maine': 'me',
    'Maryland': 'md',
    'Massachusetts': 'mass',
    'Michigan': 'mich',
    'Minnesota': 'minn',
    'Mississippi': 'miss',
    'Missouri': 'mo',
    'Montana': 'mont',
    'Nebraska': 'neb',
    'Nevada': 'nev',
    'New Hampshire': 'nh',
    'New Jersey': 'nj',
    'New Mexico': 'nm',
    'New York': 'ny',
    'North Carolina': 'nc',
    'North Dakota': 'nd',
    'Ohio': 'ohio',
    'Oklahoma': 'okla',
    'Oregon': 'or',
    'Pennsylvania': 'pa',
    'Rhode Island': 'ri',
    'South Carolina': 'sc',
    'South Dakota': 'sd',
    'Tennessee': 'tenn',
    'Texas': 'tex',
    'Utah': 'utah',
    'Vermont': 'vt',
    'Virginia': 'va',
    'Washington': 'wash',
    'West Virginia': 'w-va',
    'Wisconsin': 'wis',
    'Wyoming': 'wyo'
}

Function for getting the latest volume number

In [3]:
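def get_latest_volume(url: str) -> int:
    """Return the volume number of the last PDF linked on the listing page."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Walk the links in reverse so the last PDF on the page is found first.
    for link in reversed(soup.find_all('a')):
        href = link.get('href') or ''  # some anchors may have no href
        match = re.search(r'/(\d+)\.pdf$', href)
        if match:
            return int(match.group(1))
    raise ValueError(f"No volume PDF link found at {url}")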
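
The function above takes the last PDF link on the page, so it relies on the listing being sorted by volume number. A minimal sketch of an alternative that does not depend on link order is to collect every volume number and take the maximum. The helper name `latest_volume_from_html` and the sample HTML are hypothetical, for illustration only; the real page markup may differ.

```python
import re
from typing import Optional


def latest_volume_from_html(html: str) -> Optional[int]:
    """Return the highest volume number among hrefs ending in '<digits>.pdf'."""
    # Capture the digits right before ".pdf" inside each quoted href value.
    pattern = r'href=["\'][^"\']*?(\d+)\.pdf["\']'
    volumes = [int(number) for number in re.findall(pattern, html)]
    return max(volumes) if volumes else None


# Made-up listing with the links deliberately out of order.
sample = (
    '<a href="/ala/9.pdf">9</a>'
    '<a href="/ala/295.pdf">295</a>'
    '<a href="/ala/10.pdf">10</a>'
)
print(latest_volume_from_html(sample))  # prints 295
```

Comparing the numbers as integers avoids the string-ordering trap where `"9"` sorts after `"295"`.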

Get the latest caselaw PDF URL and volume metadata for each state

In [4]:
df = []
base_url = "https://static.case.law"
for state, addverb in state_addverbs.items():
    state_url = f"{base_url}/{addverb}/"
    latest_volume = get_latest_volume(state_url)
    pdf_url = f"{state_url}{latest_volume}.pdf"
    metadata_url = f"{state_url}{latest_volume}/VolumeMetadata.json"

    metadata = requests.get(metadata_url)
    metadata.raise_for_status()
    metadata = metadata.json()
    
    term = f"{metadata['start_year']}-{metadata['end_year']}"
    jurisdictions = ",".join([j["name_long"] for j in metadata["jurisdictions"]])

    df.append({
        "state": state,
        "volume_number": latest_volume,
        "pdf_url": pdf_url,
        "term": term,
        "jurisdictions": jurisdictions,
    })
    print(f"Fetched latest caselaw document(s) for {state}")
    time.sleep(0.5)

Fetched latest caselaw document(s) for Alabama
Fetched latest caselaw document(s) for Alaska
Fetched latest caselaw document(s) for Arizona
Fetched latest caselaw document(s) for Arkansas
Fetched latest caselaw document(s) for California
Fetched latest caselaw document(s) for Colorado
Fetched latest caselaw document(s) for Connecticut
Fetched latest caselaw document(s) for Delaware
Fetched latest caselaw document(s) for Florida
Fetched latest caselaw document(s) for Georgia
Fetched latest caselaw document(s) for Hawaii
Fetched latest caselaw document(s) for Idaho
Fetched latest caselaw document(s) for Illinois
Fetched latest caselaw document(s) for Indiana
Fetched latest caselaw document(s) for Iowa
Fetched latest caselaw document(s) for Kansas
Fetched latest caselaw document(s) for Kentucky
Fetched latest caselaw document(s) for Louisiana
Fetched latest caselaw document(s) for Maine
Fetched latest caselaw document(s) for Maryland
Fetched latest caselaw document(s) for Massachusetts
Fe
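
The loop above indexes the metadata directly, so a volume missing `start_year`, `end_year`, or `jurisdictions` would raise `KeyError` and stop the whole run. A minimal sketch of a more tolerant record builder follows. The helper name `build_record` and the `"?"` placeholder are hypothetical, and only the field names already used in the loop are assumed to exist in `VolumeMetadata.json`.

```python
from typing import Any, Dict


def build_record(state: str, volume: int, pdf_url: str,
                 metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Build one dataframe row, tolerating missing metadata fields."""
    start = metadata.get("start_year", "?")
    end = metadata.get("end_year", "?")
    names = [j.get("name_long", "") for j in metadata.get("jurisdictions", [])]
    return {
        "state": state,
        "volume_number": volume,
        "pdf_url": pdf_url,
        "term": f"{start}-{end}",
        "jurisdictions": ",".join(name for name in names if name),
    }


# Sample metadata shaped like the fields the loop reads (values for illustration).
sample_metadata = {
    "start_year": 1975,
    "end_year": 1975,
    "jurisdictions": [{"name_long": "Alabama"}],
}
record = build_record(
    "Alabama", 295, "https://static.case.law/ala/295.pdf", sample_metadata
)
print(record["term"], record["jurisdictions"])  # prints 1975-1975 Alabama
```

Keeping the row construction in a pure function also makes it testable without any network access.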

Convert the list of records into a `pd.DataFrame`

In [5]:
df = pd.DataFrame(df)
df.head()

|   | state | volume_number | pdf_url | term | jurisdictions |
|---|---|---|---|---|---|
| 0 | Alabama | 295 | https://static.case.law/ala/295.pdf | 1975-1975 | Alabama |
| 1 | Alaska | 17 | https://static.case.law/alaska/17.pdf | 1957-1958 | Alaska |
| 2 | Arizona | 242 | https://static.case.law/ariz/242.pdf | 2017-2017 | Arizona |
| 3 | Arkansas | 375 | https://static.case.law/ark/375.pdf | 2008-2009 | Arkansas |
| 4 | California | 220 | https://static.case.law/cal/220.pdf | 1934-1934 | California |


Save the data

In [6]:
save_path = "./data/caselaw_data.csv"
df.to_csv(save_path, index=False)
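
`DataFrame.to_csv` does not create missing directories, so the save above fails if `./data` does not exist yet. A small sketch that creates the parent directory first is shown below. The helper name `ensure_parent_dir` is hypothetical, and the demo writes into a temporary directory so that it has no lasting side effects.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directory of `path` if needed and return a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demo in a throwaway directory; the nested "data" folder is created on demand.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = ensure_parent_dir(f"{tmp}/data/caselaw_data.csv")
    print(demo_path.parent.is_dir())  # prints True
```

In the notebook this would be `save_path = ensure_parent_dir("./data/caselaw_data.csv")`, followed by the existing `df.to_csv(save_path, index=False)` call.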