# About this Legal Data Collection and Preparation Notebook

This notebook forms the foundation of the **NM Law Data Pipeline**, a modular system for collecting and preparing legal documents from the State of New Mexico. The goal is to support downstream tasks like topic modeling, retrieval-augmented generation (RAG), and legal knowledge graph construction by providing clean, structured inputs from raw public sources.

---

## Purpose

The legal domain is defined by dense, interdependent, and jurisdiction-specific documents such as statutes, constitutions, and case law. Most of these documents are available online only in unstructured formats that modern machine learning and retrieval systems cannot use directly.

This notebook serves to:
- **Collect legal texts** from publicly accessible sources (primarily [Justia](https://law.justia.com/new-mexico/)),
- **Clean and organize** them into machine-readable structures (e.g., CSVs),
- **Standardize formats** across different legal types (statutes, constitutional provisions, court opinions),
- **Prepare inputs** for legal topic discovery, model evaluation, and vector-based semantic search.

---

## Structure

The notebook is organized into two primary phases:

### 1. **Data Collection**
Using the `NewMexicoScraper` class, the notebook scrapes:
- **Statutes** – Codified laws by year.
- **Constitution** – Foundational articles and sections of NM law.
- **Court of Appeals Decisions** – Case opinions from the intermediate appellate court.
- **Supreme Court Decisions** – Precedential decisions from the state’s highest court.

The scraper is equipped to:
- Avoid duplicate downloads,
- Follow legal citation patterns and structure,
- Send custom HTTP headers and pause between requests for responsible scraping.

### 2. **Data Formatting**
After collection, documents are parsed from their raw formats (usually HTML or JSON) into consistent, flat tabular CSVs using the `JSONToCSVProcessor`. This includes:
- Mapping metadata and content fields,
- Flattening hierarchical legal structures (e.g., nested statute titles or court metadata),
- Saving results for downstream benchmarking, topic modeling, and retrieval evaluation.
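
The flattening step can be illustrated with a minimal sketch. It assumes nested dictionaries and dotted column names; the `flatten` helper and the sample record are hypothetical and do not reflect the actual `JSONToCSVProcessor` internals or the real scraped schema.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Collapse nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structures, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical record with placeholder values; the real schema may differ.
record = {
    "title": "Example title",
    "section": {"number": "0-0-0", "heading": "Example heading"},
    "text": "Example body text.",
}

row = flatten(record)

# Write the flat row as CSV (to an in-memory buffer for illustration).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Dotted keys keep the original hierarchy recoverable from the column names, which can help when the CSVs are later joined or re-nested.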

---

# Data Collection
This section initializes and runs the web scraper that collects legal documents from public sources such as Justia. The scraper collects statutes, constitutional provisions, and court opinions for New Mexico law. Each step below is modular, so it can be reused and rerun for different document types and years.

In [None]:
import os
from nm_scraper import NewMexicoScraper

## Collection Definitions

Before scraping begins, this section sets up the core configuration in the following cell:
- Base URL: The root of the legal document source (e.g., Justia’s New Mexico law portal).
- Output Directory: The location where downloaded or parsed files will be saved.
- URL Cache File: A local record of visited URLs to avoid redundant downloads.
- HTTP Headers: Custom headers (e.g., user-agent) for polite and reliable web scraping.
- Year Selection: For statutes, the user must specify which year’s legal codes to retrieve, because Justia archives statutes by year. The year is set in the Statutes Collection cell rather than in the configuration cell.
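
One way a URL cache file like this could work is sketched below. The helpers `load_visited` and `mark_visited` are hypothetical names introduced for illustration; how `NewMexicoScraper` actually reads and writes its cache is not shown in this notebook.

```python
import os


def load_visited(cache_path):
    """Return the set of URLs recorded in the cache file (empty if missing)."""
    if not os.path.exists(cache_path):
        return set()
    with open(cache_path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def mark_visited(cache_path, url, visited):
    """Record a URL in the file and the in-memory set; return False if already seen."""
    if url in visited:
        return False
    parent = os.path.dirname(cache_path)
    if parent:
        # Create the output directory on first use
        os.makedirs(parent, exist_ok=True)
    with open(cache_path, "a", encoding="utf-8") as fh:
        fh.write(url + "\n")
    visited.add(url)
    return True
```

A scraper built this way would call `load_visited` once at startup and skip any URL for which `mark_visited` returns `False`. Appending one URL per line keeps the cache readable and lets an interrupted run resume where it stopped.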

In [None]:
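BASE_URL = "https://law.justia.com/new-mexico/"  # root of the legal document source
OUTPUT = "output"  # directory for downloaded and parsed files
# Placeholder user-agent; replace it with one that identifies your scraper
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; YourScraper/1.0)"}
VISITED_URLS = os.path.join(OUTPUT, "saved_urls.txt")  # record of visited URLs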

### Create the scraping class instance using the definitions
The next cell instantiates the `NewMexicoScraper` class with the parameters defined above. This object wraps all scraping logic, including request throttling, error handling, and file I/O. It downloads every document type the same way and writes each type into its own directory.
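
Request throttling and error handling of the kind described above could look roughly like the following sketch. The `polite_fetch` function, its parameters, and the linear back-off are assumptions made for illustration; they are not the `NewMexicoScraper` implementation.

```python
import time


def polite_fetch(fetch, url, delay=1.0, retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and retrying on errors.

    `fetch` is any callable that returns the page body or raises an exception.
    `sleep` is injectable so the pauses can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        # Linear back-off: wait 1x, 2x, 3x ... the base delay
        sleep(delay * attempt)
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

In practice `fetch` would wrap an HTTP client call that sends the configured headers. Pausing before every attempt, including the first, keeps the request rate bounded even when every page succeeds.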

In [None]:
scraper = NewMexicoScraper(BASE_URL, OUTPUT, HEADERS, VISITED_URLS)

## Statutes Collection

This section uses the scraper instance to collect New Mexico statutes. These laws are organized by year and structured hierarchically. The user must set the year explicitly so that only the relevant, up-to-date legal codes are included in the dataset.

In [None]:
# Specify the year to collect 
statutes_year = "2024"  # Change this to the year you want to scrape

# overwrite existing files if the documents already exist
scraper.scrape_laws_by_year(statutes_year, overwrite=True)

## Constitution Collection

This step scrapes the constitutional provisions of the State of New Mexico. They are updated less frequently than statutes or case law but are foundational to interpreting both. The scraper navigates to the relevant section of the legal portal and downloads each article and section in turn.

In [None]:
# overwrite existing files if the documents already exist
scraper.scrape_constitution(overwrite=True)

## Cases Collection

This step downloads court opinions from two courts:
- New Mexico Court of Appeals
- New Mexico Supreme Court

The user must provide a range or list of decision years. Each case is parsed from the corresponding court’s yearly listings and saved to disk. This step is essential for downstream tasks like precedent prediction, timeline modeling, and historical legal analysis.
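
Because the years can arrive as a range or as a mixed list, a small validation step may help before a long scrape starts. The `normalize_years` helper below is a hypothetical sketch and is not part of `nm_scraper`.

```python
def normalize_years(years):
    """Turn a range or list of years (ints or strings) into sorted, unique 4-digit strings."""
    cleaned = set()
    for year in years:
        text = str(year).strip()
        if not (text.isdigit() and len(text) == 4):
            raise ValueError(f"not a four-digit year: {year!r}")
        cleaned.add(text)
    # Four-digit strings sort chronologically under plain string ordering
    return sorted(cleaned)


print(normalize_years(range(1960, 1963)))
print(normalize_years(["2024", 2023, "2024"]))
```

Rejecting malformed years up front avoids discovering a typo only after many requests have already been made.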

In [None]:
# Specify the years to collect (1960 through 2025; the range end is exclusive)
decision_years = [str(y) for y in range(1960, 2026)]
# Overwrite existing files if the documents already exist
scraper.scrape_all_supreme_court(decision_years, overwrite=True)
scraper.scrape_all_court_of_appeals(decision_years, overwrite=True)

# Data Formatting
After collection, legal text data often contains extra formatting, footers, and irrelevant metadata. This section converts the scraped data into structured CSV files that align with the pipeline’s data schema. This prepares the documents for evaluation, topic modeling, and knowledge graph construction.

In [1]:
from format_csv import JSONToCSVProcessor

## Format Legal Documents into CSVs

This section consolidates the formatting of all legal document types (statutes, constitution, Court of Appeals cases, and Supreme Court cases) into a single processing loop using the `JSONToCSVProcessor` class.

Each document type has its own format and processing logic:
- **Constitution**: Structured hierarchically and processed with `processor_type=2` to account for its article/section format.
- **Statutes**: Parsed with `processor_type=1`, structured by titles and sections, and converted into machine-readable CSVs.
- **Court of Appeals Cases**: Judicial opinions from the appellate court are parsed with `processor_type=1` to extract metadata and ruling text.
- **Supreme Court Cases**: Processed like the Court of Appeals cases, with `processor_type=1`, to extract key fields and structure the full legal opinion text.

All documents are saved into their respective CSVs:
- `CONSTITUTION.csv`
- `STATUTE.csv`
- `APPEALS.csv`
- `SUPREME.csv`

The resulting dataframes are stored in memory under the following names:
- `constitution_df`
- `statute_df`
- `appeals_df`
- `supreme_df`

This modular loop ensures consistency in formatting while reducing code duplication and simplifying future extensions or batch runs.


In [None]:
# Define all sources and configurations in a list of dictionaries.
# Replace each "... PATH" placeholder with the location of the scraped files for that type.
formats = [
    {"name": "constitution", "path": "CONSTITUTION PATH", "csv": "CONSTITUTION.csv", "type": 2},
    {"name": "statute", "path": "STATUTE PATH", "csv": "STATUTE.csv", "type": 1},
    {"name": "appeals", "path": "APPEALS PATH", "csv": "APPEALS.csv", "type": 1},
    {"name": "supreme", "path": "SUPREME PATH", "csv": "SUPREME.csv", "type": 1},
]
dataframes = {}

# Loop through each data type and process
for item in formats:
    processor = JSONToCSVProcessor(root_path=item["path"], output_path=item["csv"])
    df = processor.process_json_files_to_csv(processor_type=item["type"])
    dataframes[f"{item['name']}_df"] = df

constitution_df = dataframes["constitution_df"]
statute_df = dataframes["statute_df"]
appeals_df = dataframes["appeals_df"]
supreme_df = dataframes["supreme_df"]
