# How to use `json_to_csv`

## What this does

`json_to_csv` converts **ADAH legal case JSON** into CSVs ready for **Neo4j Aura Data Importer**.

It expects **ADAH-style objects**. Each case has at least:

* `ID` (int-like)
* `opinion` (string)
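
As a rough illustration, a minimal input file could be built as follows. Only the key names follow the ADAH layout described in this document; every value (IDs, names, the file name `toy_adah.json`) is made up, and the file is written to a temporary directory so the sketch has no side effects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy records; only the key names follow the ADAH layout.
toy_cases = [
    {
        "ID": 101,
        "opinion": "The judgment is affirmed.",
        "case_name": "Doe v. Roe",
        "Citing_Relationship": "adah",
        "cite_to": [],
    },
    {
        "ID": "102",  # int-like strings are accepted as IDs
        "opinion": "We follow Doe v. Roe.",
        "case_name": "Poe v. Moe",
        "Citing_Relationship": "citing_adah",
        "cite_to": [101],
    },
]

# The converter expects a JSON *array* at the top level.
toy_path = Path(tempfile.mkdtemp()) / "toy_adah.json"
toy_path.write_text(json.dumps(toy_cases, indent=2), encoding="utf-8")
```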

You can choose how to store opinion text:

* **`opinion_storage="chunks"` (default):** sentence-aware chunking into `OpinionChunk` rows (linked back to `Case`).
* **`opinion_storage="inline"`:** full opinion text stored as a single property on the `Case` row (no chunk rows are produced).

---

## Current behavior & data shaping

* **Scope: ADAH only**

  * The converter checks that the JSON array contains at least one ADAH-style case (an object with both `ID` and `opinion` keys); objects without both keys are skipped.
  * If no ADAH cases are found, it raises a `ValueError`.

* **ADAH field handling**

  * `ID` → written to **`id`** in `cases.csv`.
  * `date_filed` → written to **`decision_date`**.
  * `docket_number` or **`docker_number`** (ADAH typo) → written to **`docket_number`**.
  * `case_full_name` is preserved on ADAH rows; literal `"NULL"` becomes `"N/A"`.

* **Unified citation column**

  * ADAH `citation` can be a string or list.
  * All values are joined into **`citation_pipe`** (joined with `" | "`).

* **Stable ADAH `court_id`**

  * `court_id` is a deterministic integer based on ADAH court name + abbreviation:

    ```text
    court_id = stable_crc32("adah-court:{court_name}|{court_abbrev}")
    ```

* **Court level inference (`court_level`)**

  * For each case, the court’s level is inferred from **`jurisdiction_inferred`**.
  * Levels are integers **1–5** (1 = highest, 5 = lowest).
  * Unknown or unmatched jurisdictions default to **5**.
  * If the same court appears with multiple levels, the **smallest** level seen is kept.

* **Jurisdictions output**

  * Jurisdictions are built from ADAH `jurisdiction_inferred`.
  * Final `jurisdictions.csv` is in **minimal ADAH form** with columns:

    * `id` (synthetic, deterministic int per jurisdiction name)
    * `jurisdiction_name`

* **Optional CourtListener URL**

  * Set **`include_url=True`** to add **`court_listener_url`** to `cases.csv`.
  * URL pattern: `https://www.courtlistener.com/opinion/{ID}/{slug}/`
    * `slug` is taken from the tail of `absolute_url` in the ADAH JSON.
  * `cites_to.csv` **always** includes CourtListener URLs for source/target cases, independent of `include_url`.

* **Filtering behavior**

  * By default, the converter keeps:

    * cases with `Citing_Relationship = "adah"` (the ADAH seed cases), and  
    * cases with `Citing_Relationship = "citing_adah"` (cases that cite ADAH).

  * **`subset`** (model-dev sampler):

    * When `subset` is set:
      1. Select up to `subset` ADAH cases where `Citing_Relationship = "adah"`.
      2. Then add **all cases that cite any of those ADAH cases** (via `cite_to`).
    * This gives a smaller, but still connected, ADAH subgraph.

  * **`only_case_ids=[...]`**:

    * Optional hard filter applied after the above logic.
    * Accepts strings or ints; duplicates are allowed.
    * Internally converted to a set of int IDs (values that cannot be parsed as integers are skipped) and intersected with the allowed cases.

* **Opinion handling and chunking**

  * Opinions are cleaned before storage:

    * Control characters removed.
    * Hyphenation across line breaks healed.
    * Standalone line-number lines dropped.
    * Whitespace collapsed to single spaces.

  * **Chunked mode (`opinion_storage="chunks"`):**
    * The ADAH `opinion` string is compacted and then split into sentence-aware chunks of about `semantic_chunk_size` characters.
    * Each chunk becomes an **OpinionChunk** row with fields:
      * `id` (string `"{case_id}:0:{chunk_index}"`)
      * `case_id`
      * `chunk_index`
      * `opinion_type` (from `opinion_type`)
      * `opinion_author` (from `judge` + optional `panel_names`)
      * `text` (cleaned chunk text)

  * **Inline mode (`opinion_storage="inline"`):**
    * The full compacted opinion goes into `cases.csv` under `opinion_property` (default `opinion_text`).
    * No opinion chunk files are produced.

  * **`max_opinion_chars`**:
    * If set and the (inline) opinion string is longer than this limit, it is truncated to that many characters.

* **Opinion splits**

  * When `opinion_splits > 1` and `opinion_storage="chunks"`:
    * Opinion chunks are split into several folders named `Opinion Chunks/split_XX`.
    * Splits are **case-aligned** (a case never gets split across multiple files).
    * The function tries to balance the number of chunks per split.
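
The `court_id` rule from the list above can be reproduced outside the converter, for example to look up a court's ID from its name. This sketch assumes the same computation as the `_stable_id` helper in the source (CRC32 of the UTF-8 string, masked to 31 bits); the function name `adah_court_id` and the sample court are hypothetical.

```python
import zlib

def adah_court_id(court_name: str, court_abbrev: str) -> int:
    """Deterministic, non-negative 31-bit ID from court name + abbreviation."""
    key = f"adah-court:{court_name.strip()}|{court_abbrev.strip()}"
    return zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF

# The same inputs always give the same ID, across runs and machines.
first = adah_court_id("Supreme Court of Ohio", "Ohio")
second = adah_court_id("Supreme Court of Ohio", "Ohio")
print(first == second, 0 <= first < 2**31)
```

Because the ID is derived from both name and abbreviation, the same court spelled two different ways will receive two different IDs.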

---

## What gets produced

Subject to filters and options, the converter writes:

* `cases.csv` — ADAH case metadata

  * Key columns include:

    * `id`
    * `name`
    * `case_full_name`
    * `decision_date`
    * `docket_number`
    * `court_id`
    * `jurisdiction_id`
    * `court_name_abbreviation`
    * `court_name`
    * `jurisdiction_name`
    * `citation_pipe`
    * `file_name`
    * `adah_case` (Boolean-like; `True` for `Citing_Relationship="adah"`)
    * optional `court_listener_url`
    * optional **inline opinion column** (name = `opinion_property`, e.g. `opinion_text`)

* `courts.csv` — court reference table

  * Columns:
    * `id`
    * `name`
    * `name_abbreviation`
    * `court_level` (1–5; default 5 when unknown)

* `jurisdictions.csv` — ADAH jurisdiction table

  * Columns:
    * `id`
    * `jurisdiction_name`

* `cites_to.csv` — **resolved** case-to-case citations

  * Columns:
    * `src_case_id`
    * `src_case_name`
    * `tgt_case_id`
    * `tgt_case_name`
    * `src_case_court_listener_url`
    * `tgt_case_court_listener_url`

* If `opinion_storage="chunks"`:

  * `opinion_chunks.csv` — opinion chunk nodes

    * Columns:
      * `id`
      * `case_id`
      * `chunk_index`
      * `opinion_type`
      * `opinion_author`
      * `text`

  * `case_opinion_edges.csv` — mapping from Case to OpinionChunk

    * Columns:
      * `case_id`
      * `chunk_id`

* If `opinion_storage="inline"`:

  * No opinion chunk files are produced.
  * Full opinion text is stored on the Case rows in `cases.csv`.
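
Before importing, it can be worth checking that the chunk files are consistent with each other. The sketch below assumes the files are written as UTF-8 with a BOM (as the `wcsv` helper in the source does) and that chunked mode was used without splits; the function names `read_rows` and `check_chunk_edges` are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List

def read_rows(path: Path) -> List[Dict[str, str]]:
    # "utf-8-sig" strips the BOM so the first header is "id", not "\ufeffid".
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

def check_chunk_edges(outdir: str) -> List[str]:
    """Return chunk IDs referenced by edges but missing from opinion_chunks.csv."""
    out = Path(outdir)
    chunk_ids = {r["id"] for r in read_rows(out / "opinion_chunks.csv")}
    edges = read_rows(out / "case_opinion_edges.csv")
    return [e["chunk_id"] for e in edges if e["chunk_id"] not in chunk_ids]
```

An empty list from `check_chunk_edges("full_adah_set_csv_out")` would mean every edge points at an existing chunk row.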

---

## Parameters (high-level)

```python
json_to_csv(
    json_path: str | None = None,      # local JSON array file
    outdir: str | None = None,         # output folder (defaults to "<json_stem>_csv_out")
    # OR load from S3
    s3_bucket: str | None = None,
    s3_key: str | None = None,

    # opinions
    semantic_chunk_size: int = 4000,   # ~target chars per chunk
    opinion_splits: int = 1,           # if >1, writes balanced, case-aligned splits
    opinion_storage: str = "chunks",   # "chunks" | "inline"
    opinion_property: str = "opinion_text",
    max_opinion_chars: int | None = None,

    # ADAH sampling / filtering
    subset: int | None = None,         # ADAH seed sample size (adah + its citers)
    only_case_ids: Iterable[int|str] | None = None,

    # extras
    include_url: bool = False          # adds "court_listener_url" to cases.csv only
)
```

Notes:

* Exactly one of `json_path` or (`s3_bucket` + `s3_key`) must be provided.
* The input must be a JSON **array** of ADAH case objects.

---

## Examples

### 1) Default: ADAH + citing_ADAH, chunked opinions

```python
json_to_csv("full_adah_set.json")
```

* Keeps all cases with:

  * `Citing_Relationship = "adah"` or `"citing_adah"`.
* Writes chunked opinions to `opinion_chunks.csv` and `case_opinion_edges.csv`.

---

### 2) Inline full opinions on Case nodes

```python
json_to_csv(
    "full_adah_set.json",
    opinion_storage="inline",
    opinion_property="full_opinion_text"  # column name in cases.csv
)
```

* Puts the full (cleaned) opinion text into the `full_opinion_text` column of `cases.csv`.
* Does not create opinion chunk CSVs.

---

### 3) Split opinion chunks into three balanced folders

```python
json_to_csv("full_adah_set.json", opinion_splits=3)
```

Output:

* Base CSVs in `full_adah_set_csv_out/`:

  * `cases.csv`
  * `courts.csv`
  * `jurisdictions.csv`
  * `cites_to.csv`

* Opinion chunks split into folders:

  * `Opinion Chunks/split_01/opinion_chunks.csv`
  * `Opinion Chunks/split_01/case_opinion_edges.csv`
  * `Opinion Chunks/split_02/...`
  * `Opinion Chunks/split_03/...`

Each case appears in **exactly one** split.
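
One way to confirm the case alignment is to scan the split folders and record which split each `case_id` appears in. The sketch takes the path of the `Opinion Chunks` folder as its argument and assumes the `split_XX/opinion_chunks.csv` layout shown above; the function name `cases_in_multiple_splits` is hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set

def cases_in_multiple_splits(chunks_root: str) -> Dict[str, Set[str]]:
    """Map case_id -> split folder names, keeping only cases seen in more than one split."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for csv_path in sorted(Path(chunks_root).glob("split_*/opinion_chunks.csv")):
        with open(csv_path, newline="", encoding="utf-8-sig") as f:
            for row in csv.DictReader(f):
                seen[row["case_id"]].add(csv_path.parent.name)
    return {cid: splits for cid, splits in seen.items() if len(splits) > 1}
```

If the splits are case-aligned, the function should return an empty dict.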

---

### 4) ADAH model-dev subset: up to 100 ADAH cases + all citers of those

```python
json_to_csv("full_adah_set.json", subset=100)
```

* Step 1: picks up to 100 cases with `Citing_Relationship="adah"`.
* Step 2: adds all cases that cite any of the selected cases.
* Applies the same opinion and URL options as usual.
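
The two steps can be sketched on toy data as follows. This is a simplified illustration rather than the converter's own code: the helper name `select_subset` and the sample records are hypothetical, and only the `ID`, `Citing_Relationship`, and `cite_to` keys are used.

```python
from typing import Any, Dict, List, Set

def select_subset(cases: List[Dict[str, Any]], subset: int) -> Set[int]:
    # Step 1: take the first `subset` seed cases in file order.
    seeds: List[int] = []
    for c in cases:
        if len(seeds) >= subset:
            break
        if str(c.get("Citing_Relationship", "")).strip().lower() == "adah":
            seeds.append(int(c["ID"]))
    seed_set = set(seeds)

    # Step 2: add every case whose cite_to list points at a selected seed.
    keep = set(seed_set)
    for c in cases:
        if any(int(t) in seed_set for t in c.get("cite_to") or []):
            keep.add(int(c["ID"]))
    return keep

toy = [
    {"ID": 1, "Citing_Relationship": "adah", "cite_to": []},
    {"ID": 2, "Citing_Relationship": "adah", "cite_to": []},
    {"ID": 3, "Citing_Relationship": "citing_adah", "cite_to": [1]},
    {"ID": 4, "Citing_Relationship": "citing_adah", "cite_to": [2]},
]
print(sorted(select_subset(toy, subset=1)))  # [1, 3]
```

Case 4 is left out because the only seed it cites (case 2) was not selected, which is what keeps the sampled subgraph connected to its seeds.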

---

### 5) Restrict to a specific set of case IDs

```python
json_to_csv(
    "full_adah_set.json",
    only_case_ids=["2113004", "174213", "174213", 2461771]
)
```

* Filters the default ADAH + citing_ADAH set to the provided IDs.
* Duplicate values are safe; they are deduplicated internally.
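
The ID normalization can be pictured with this small sketch, which mirrors the converter's handling (convert to `int`, drop duplicates, skip values that do not parse). The function name `normalize_ids` and the `"n/a"` entry are hypothetical.

```python
from typing import Iterable, Set, Union

def normalize_ids(values: Iterable[Union[str, int]]) -> Set[int]:
    ids: Set[int] = set()
    for v in values:
        try:
            ids.add(int(v))
        except (TypeError, ValueError):
            continue  # unparseable entries are ignored
    return ids

print(sorted(normalize_ids(["2113004", "174213", "174213", 2461771, "n/a"])))
# [174213, 2113004, 2461771]
```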

---

### 6) Include CourtListener URLs in `cases.csv`

```python
json_to_csv("full_adah_set.json", include_url=True)
```

* Adds `court_listener_url` to **`cases.csv`**.
* `cites_to.csv` always includes URLs regardless of this flag.
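
For reference, the URL construction can be sketched like this. The helper name `court_listener_url` and the sample `absolute_url` value are hypothetical; the pattern itself is the one given in the behavior section.

```python
from typing import Optional

def court_listener_url(case_id: int, absolute_url: Optional[str]) -> str:
    """Build the CourtListener URL from the case ID and the tail of absolute_url."""
    if not absolute_url:
        return ""
    slug = absolute_url.strip("/").split("/")[-1]
    return f"https://www.courtlistener.com/opinion/{case_id}/{slug}/" if slug else ""

print(court_listener_url(101, "/opinion/101/doe-v-roe/"))
# https://www.courtlistener.com/opinion/101/doe-v-roe/
```

When `absolute_url` is missing or has no usable tail, the sketch returns an empty string instead of a partial URL.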

---

### 7) Load from S3 instead of local

```python
json_to_csv(
    s3_bucket="my-bucket",
    s3_key="exports/full_adah_set.json",
    outdir="adah_csv_out"
)
```

* Uses `boto3` to read the JSON array from S3 (`boto3` must be installed; otherwise a `RuntimeError` is raised).
* Writes all CSVs into `adah_csv_out/`.

---

## Importing into Neo4j Aura Data Importer

### A) Base CSVs

Upload from the output folder:

* `cases.csv`
* `courts.csv`
* `jurisdictions.csv`
* `cites_to.csv`

**Node mappings**

* **Case** ← `cases.csv`

  * Key: `id` (Integer, Unique)
  * Useful properties:

    * `name`
    * `case_full_name`
    * `decision_date`
    * `docket_number`
    * `citation_pipe`
    * `file_name`
    * `adah_case` (Boolean)
    * `court_listener_url` (optional)
    * `opinion_text` or your custom `opinion_property` (if using inline mode)

* **Court** ← `courts.csv`

  * Key: `id` (Integer, Unique)
  * Properties:

    * `name`
    * `name_abbreviation`
    * `court_level` (Integer)

* **Jurisdiction** ← `jurisdictions.csv`

  * Key: `id` (Integer, Unique)
  * Map `jurisdiction_name` → node property `name`.

**Relationship mappings**

* `(:Case)-[:HEARD_IN]->(:Court)` from `cases.csv`

  * Source: `cases.id` → Case.id
  * Target: `cases.court_id` → Court.id

* `(:Case)-[:UNDER_JURISDICTION]->(:Jurisdiction)` from `cases.csv`

  * Source: `cases.id` → Case.id
  * Target: `cases.jurisdiction_id` → Jurisdiction.id

* `(:Case)-[:CITES_TO]->(:Case)` from `cites_to.csv`

  * Source: `src_case_id` → Case.id
  * Target: `tgt_case_id` → Case.id
  * You can keep the `src_case_name`, `tgt_case_name`, and URL fields as edge properties for easier debugging and visualization.

---

### B) Opinions (if `opinion_storage="chunks"`)

1. **Nodes:** `opinion_chunks.csv` → **OpinionChunk**

   * Key: `id` (String, Unique)
   * Properties:

     * `case_id` (Integer)
     * `chunk_index` (Integer)
     * `opinion_type` (String)
     * `opinion_author` (String)
     * `text` (String)

2. **Edges:** `case_opinion_edges.csv` → `(:Case)-[:HAS_OPINION_CHUNK]->(:OpinionChunk)`

   * Source: `case_id` → Case.id
   * Target: `chunk_id` → OpinionChunk.id

If you enabled `opinion_splits > 1`, repeat the node and edge mappings for each `Opinion Chunks/split_XX` folder.
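
If the full opinion is needed again later, the chunks of a case can be put back in reading order using `chunk_index`. The sketch below works on rows read from `opinion_chunks.csv`; the function name `reassemble` and the sample rows are hypothetical, and joining with a single space is an assumption, since chunk text is stored stripped and the original whitespace at chunk boundaries is not preserved.

```python
from typing import Dict, Iterable

def reassemble(rows: Iterable[Dict[str, str]], case_id: int) -> str:
    """Join the chunks of one case in chunk_index order, separated by spaces."""
    mine = [r for r in rows if int(r["case_id"]) == case_id]
    mine.sort(key=lambda r: int(r["chunk_index"]))  # CSV values are strings
    return " ".join(r["text"] for r in mine)

rows = [
    {"id": "7:0:1", "case_id": "7", "chunk_index": "1", "text": "Second part."},
    {"id": "7:0:0", "case_id": "7", "chunk_index": "0", "text": "First part."},
    {"id": "8:0:0", "case_id": "8", "chunk_index": "0", "text": "Other case."},
]
print(reassemble(rows, 7))  # First part. Second part.
```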



# -*- coding: utf-8 -*-
# JSON → CSV converter for ADAH cases

from __future__ import annotations

import csv
import io
import json
import math
import re
import zlib
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

try:
    import boto3  # optional for S3 mode
except Exception:
    boto3 = None  # type: ignore

# =========================
# Court level mapping
# =========================
COURT_LEVEL_MAP: Dict[int, List[str]] = {
    1: [
        "Federal Supreme Court",
        "U.S. Court of International Trade",
        "U.S. Tax Court",
        "Merit Systems Protection Board",
        "Office of Legal Counsel",
        "U.S. Bankruptcy Court, D of Penn",
        "Board of Immigration Appeals",
        "Armed Services Board of Contract Appeals",
        "U.S. Court of Federal Claims",
    ],
    2: [
        "U.S. Court of Appeals for the First Circuit",
        "U.S. Court of Appeals for the Second Circuit",
        "U.S. Court of Appeals for the Third Circuit",
        "U.S. Court of Appeals for the Fourth Circuit",
        "U.S. Court of Appeals for the Fifth Circuit",
        "U.S. Court of Appeals for the Sixth Circuit",
        "U.S. Court of Appeals for the Seventh Circuit",
        "U.S. Court of Appeals for the Eighth Circuit",
        "U.S. Court of Appeals for the Ninth Circuit",
        "U.S. Court of Appeals for the Tenth Circuit",
        "U.S. Court of Appeals for the Eleventh Circuit",
        "U.S. Court of Appeals for the D.C. Circuit",
        "U.S. Court of Appeals for the Federal Circuit",
        "U.S. Court of Appeals for Veterans Claims",
        "U.S. Court of Appeals for the Armed Forces",
        "U.S. Army Court of Criminal Appeals",
    ],
    3: [
        "U.S. District Court for the Northern District of California",
        "U.S. District Court for the Central District of California",
        "U.S. District Court for the Southern District of California",
        "U.S. District Court for the Eastern District of California",
        "U.S. District Court for the District of Massachusetts",
        "U.S. District Court for the District of Maryland",
        "U.S. District Court for the Middle District of Louisiana",
        "U.S. District Court for the District of Colorado",
        "U.S. District Court for the District of Oregon",
        "U.S. District Court for the District of Hawaii",
        "U.S. District Court for the Southern District of New York",
        "U.S. District Court for the Eastern District of New York",
        "U.S. District Court for the District of New Jersey",
        "U.S. District Court for the District of Columbia",
        "U.S. District Court for the District of the Virgin Islands",
        "U.S. District Court for the District of Minnesota",
        "U.S. District Court for the Middle District of Pennsylvania",
    ],
    4: [
        "Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware","Florida","Georgia",
        "Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
        "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada",
        "New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
        "Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
        "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming","Virgin Islands","Puerto Rico",
        "Guam","Northern Mariana Islands",
    ],
    5: ["Unknown"],
}

# Build a lowercase lookup: jurisdiction_name -> level
_JURIS_TO_LEVEL: Dict[str, int] = {}
for lvl, names in COURT_LEVEL_MAP.items():
    for nm in names:
        key = (nm or "").strip().lower()
        if key:
            # keep the smallest level if duplicates appear
            _JURIS_TO_LEVEL[key] = min(lvl, _JURIS_TO_LEVEL.get(key, lvl))


def court_level_from_jurisdiction(name: Optional[str]) -> int:
    if not name:
        return 5
    return _JURIS_TO_LEVEL.get(name.strip().lower(), 5)

# =========================
# Helpers
# =========================

def clean_text_basic(s: str) -> str:
    """Remove control chars, collapse whitespace; safe for CSV/DB."""
    if s is None:
        return ""
    s = str(s)
    TRANSLATE = {
        0x00A0: 0x20, 0x0085: 0x20, 0x2028: 0x20, 0x2029: 0x20,
        0x200E: 0x20, 0x200F: 0x20, 0x202A: 0x20, 0x202B: 0x20,
        0x202C: 0x20, 0x202D: 0x20, 0x202E: 0x20, 0x2066: 0x20,
        0x2067: 0x20, 0x2068: 0x20, 0x2069: 0x20, 0x0009: 0x20,
        0x000A: 0x20, 0x000D: 0x20,
    }
    s = s.translate(TRANSLATE)
    s = re.sub(r"[\x00-\x1F\x7F-\x9F]", " ", s)
    s = re.sub(r"[ \t]{2,}", " ", s).strip()
    return s


def wcsv(path: Path, rows: Iterable[Dict[str, Any]], header: List[str]):
    """Write CSV with UTF-8 BOM and CRLF; create parent dirs. No file is written if `rows` is empty."""
    path.parent.mkdir(parents=True, exist_ok=True)
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        w = csv.DictWriter(f, fieldnames=header, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k, "") for k in header})


def infer_header(rows: List[Dict[str, Any]], preferred_order: Optional[List[str]] = None) -> List[str]:
    """Create a stable header from the union of keys across all rows."""
    keys: List[str] = []
    seen = set()
    if preferred_order:
        for k in preferred_order:
            if any(k in r for r in rows):
                keys.append(k)
                seen.add(k)
    for r in rows:
        for k in r.keys():
            if k not in seen:
                keys.append(k)
                seen.add(k)
    return keys


def join_or_blank(items, sep=" | "):
    if not items:
        return ""
    return sep.join(str(x) for x in items if x)


def _stable_id(s: str) -> int:
    """Deterministic positive 31-bit int from a string (for synthetic IDs)."""
    return zlib.crc32(s.encode("utf-8")) & 0x7FFFFFFF


def _tail_slug(absolute_url: Optional[str]) -> str:
    if not absolute_url:
        return ""
    parts = absolute_url.strip("/").split("/")
    return parts[-1] if parts else ""


def _norm_rel(val: Any) -> str:
    """Normalize Citing_Relationship values to lowercase strings."""
    return (str(val).strip().lower() if val is not None else "")


# =========================
# Text compaction (opinion)
# =========================

def compact_snippet(text: str) -> str:
    """
    Normalize opinion text/snippet into a single readable paragraph:
      - remove form feeds
      - heal hyphenations across line breaks
      - drop standalone line-number lines
      - collapse whitespace to single spaces
    """
    if not text:
        return text
    s = text.replace("\r", "\n").replace("\x0c", " ")
    s = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', s)  # heal hyphenation
    lines = []
    for ln in s.splitlines():
        if re.match(r'^\s*\d{1,3}\s*$', ln):
            continue
        lines.append(ln)
    s = "\n".join(lines)
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)
    return s


def _compact_for_csv(s: str) -> str:
    """Compact first (line numbers, hyphenation), then scrub control chars."""
    return clean_text_basic(compact_snippet(s))


# =========================
# Sentence-aware chunking
# =========================

_CLOSERS_AFTER_SPACE = set(['"', "'", ")", "]", "”", "’", "»"])
_TERMINATORS = {'.', '!', '?'}


def _precompute_paren_depths(text: str) -> List[int]:
    depths = [0] * len(text)
    depth = 0
    for i, ch in enumerate(text):
        depths[i] = depth
        if ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
    return depths


def _is_ellipsis_dot(text: str, idx: int) -> bool:
    if idx < 0 or idx >= len(text) or text[idx] != '.':
        return False

    def count_forward(i: int) -> int:
        cnt = 0
        j = i
        if j < len(text) and text[j] == '.':
            cnt += 1
            j += 1
        while j < len(text):
            if text[j].isspace():
                j += 1
                continue
            if text[j] == '.':
                cnt += 1
                j += 1
                continue
            break
        return cnt

    def count_backward(i: int) -> int:
        cnt = 0
        j = i
        if j >= 0 and text[j] == '.':
            cnt += 1
            j -= 1
        while j >= 0:
            if text[j].isspace():
                j -= 1
                continue
            if text[j] == '.':
                cnt += 1
                j -= 1
                continue
            break
        return cnt

    total = 1
    total += count_forward(idx + 1)
    total += count_backward(idx - 1)
    return total >= 3


def _is_good_boundary_next_char_rule(text: str, abs_pos: int) -> bool:
    ch = text[abs_pos]
    if ch == '.' and _is_ellipsis_dot(text, abs_pos):
        return False
    k = abs_pos + 1
    n = len(text)
    while k < n and text[k].isspace():
        k += 1
    if k >= n:
        return True
    nxt = text[k]
    if nxt in _CLOSERS_AFTER_SPACE:
        return False
    if nxt.isalpha():
        return True
    return False


def sentence_chunks(text: str, target_size: int) -> List[str]:
    if not text:
        return [""]

    n = len(text)
    out: List[str] = []
    start = 0
    min_len = max(1, int(target_size * 0.6))
    paren_depths = _precompute_paren_depths(text)

    while start < n:
        max_end = min(start + target_size, n)
        window = text[start:max_end]
        candidates = [i for i, ch in enumerate(window) if ch in _TERMINATORS]
        chosen_end = None

        for i in reversed(candidates):
            abs_i = start + i
            if paren_depths[abs_i] > 0:
                continue
            if window[i] == '.' and _is_ellipsis_dot(text, abs_i):
                continue
            if i + 1 >= min_len and _is_good_boundary_next_char_rule(text, abs_i):
                chosen_end = abs_i + 1
                break

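        if chosen_end is None:
            # Fall back to the last space in the window. Require ws > 0 so the
            # chunk is never empty, which would leave `start` stuck in place.
            ws = window.rfind(" ")
            if ws > 0 and ws + 1 >= min_len:
                chosen_end = start + ws
            else:
                chosen_end = max_end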

        piece = text[start:chosen_end].strip()
        if piece:
            out.append(piece)
        start = chosen_end

    return out if out else [""]


# =========================
# Utilities for splitting
# =========================

def _build_case_splits(
    case_id_order: List[int],
    case_chunk_counts: Dict[int, int],
    total_chunks: int,
    num_splits: int,
) -> Dict[int, int]:
    """
    Return case_id -> split_index (0..num_splits-1), approximating equal
    opinion_chunk rows per split while never splitting a case across files.
    """
    if num_splits <= 1:
        return {cid: 0 for cid in case_id_order}
    target = int(math.ceil(total_chunks / num_splits))
    mapping: Dict[int, int] = {}
    split_idx = 0
    running = 0
    for cid in case_id_order:
        cnt = case_chunk_counts.get(cid, 0)
        if split_idx < num_splits - 1 and running > 0 and running + cnt > target:
            split_idx += 1
            running = 0
        mapping[cid] = split_idx
        running += cnt
    return mapping


# =========================
# Type check
# =========================

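def _is_adah_case(obj: Dict[str, Any]) -> bool:
    """Heuristic: ADAH cases are dicts with 'ID' (int-like) and 'opinion' (string) keys."""
    # The isinstance check keeps non-dict items (e.g. strings) from matching by substring.
    return isinstance(obj, dict) and ("ID" in obj) and ("opinion" in obj)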


# =========================
# S3 / Local loader
# =========================

def _load_json_from_local_or_s3(
    json_path: Optional[str],
    s3_bucket: Optional[str],
    s3_key: Optional[str],
) -> List[Dict[str, Any]]:
    """
    Load a JSON array either from local path or from S3 (via boto3).
    Exactly one of (json_path) or (s3_bucket+s3_key) must be provided.
    """
    if json_path and (s3_bucket or s3_key):
        raise ValueError("Provide either json_path OR (s3_bucket and s3_key), not both.")
    if not json_path and not (s3_bucket and s3_key):
        raise ValueError("You must provide json_path OR (s3_bucket and s3_key).")

    if json_path:
        with open(json_path, "r", encoding="utf-8") as f:
            return json.load(f)

    if boto3 is None:
        raise RuntimeError("boto3 is required for S3 mode")
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=s3_bucket, Key=s3_key)  # type: ignore[arg-type]
    with io.TextIOWrapper(obj["Body"], encoding="utf-8") as f:
        return json.load(f)


def _default_outdir_name(json_path: Optional[str], s3_key: Optional[str]) -> str:
    """Derive a default output directory name from local json filename or S3 key."""
    if json_path:
        stem = Path(json_path).stem
        return f"{stem}_csv_out"
    if s3_key:
        stem = Path(s3_key).stem
        return f"{stem}_csv_out"
    return "csv_out"


# =========================
# Small helpers for IDs & edges (ADAH-only)
# =========================

def _get_case_id(obj: Dict[str, Any]) -> Optional[int]:
    """Return canonical int case id from an ADAH object, else None."""
    try:
        return int(obj["ID"])
    except Exception:
        return None


def _get_cited_ids(obj: Dict[str, Any]) -> List[int]:
    """
    Return list of resolved target ids that this ADAH object cites.
    Uses top-level 'cite_to' (list of ints/strings).
    """
    out: List[int] = []
    for tid in (obj.get("cite_to") or []):
        try:
            out.append(int(tid))
        except Exception:
            continue
    return out


def _compute_case_meta_for_allowed(
    data: List[Dict[str, Any]],
    allowed_case_ids: set[int],
) -> Dict[int, Dict[str, str]]:
    """
    Build id → {name, file_slug, url} for all allowed ADAH cases, so cites_to rows can
    include names and CourtListener URLs regardless of processing order.
    """
    meta: Dict[int, Dict[str, str]] = {}
    for c in data:
        if not _is_adah_case(c):
            continue
        cid = _get_case_id(c)
        if cid is None or cid not in allowed_case_ids:
            continue
        name = c.get("case_name") or ""
        file_slug = clean_text_basic(_tail_slug(c.get("absolute_url")))
        url = f"https://www.courtlistener.com/opinion/{cid}/{file_slug}/" if file_slug else ""
        meta[cid] = {
            "name": clean_text_basic(name),
            "file_slug": file_slug,
            "url": url,
        }
    return meta


# =========================
# Main conversion
# =========================

def json_to_csv(
    json_path: Optional[str] = None,
    outdir: Optional[str] = None,
    *,
    # ALSO SUPPORT S3
    s3_bucket: Optional[str] = None,
    s3_key: Optional[str] = None,
    # opinion handling
    semantic_chunk_size: int = 4000,
    opinion_splits: int = 1,          # used only when opinion_storage="chunks"
    opinion_storage: str = "chunks",  # "chunks" | "inline"
    opinion_property: str = "opinion_text",
    max_opinion_chars: Optional[int] = None,
    # ADAH model-dev subsetting:
    # keep up to `subset` ADAH cases (Citing_Relationship='adah')
    # plus all cases that cite these ADAH cases
    subset: Optional[int] = None,
    # Optional hard filter to keep only these case IDs (strings or ints; deduped)
    only_case_ids: Optional[Iterable[Union[str, int]]] = None,
    # Optional: include URLs in cases.csv (cites_to.csv always includes URLs)
    include_url: bool = False,
):
    """
    Convert ADAH JSON cases into CSVs for Neo4j Aura Data Importer.

    Behavior:
      - Always keeps ADAH cases (Citing_Relationship='adah') and cases that cite ADAH
        (Citing_Relationship='citing_adah').
      - If `subset` is given, first selects up to `subset` ADAH cases, then adds all
        cases that cite any of those ADAH cases.
      - cites_to.csv outputs: src_case_id, src_case_name, tgt_case_id, tgt_case_name,
        src_case_court_listener_url, tgt_case_court_listener_url.
      - courts.csv includes `court_level` derived from jurisdiction.
      - include_url toggles URLs in cases.csv only (cites_to.csv always includes URLs).

    This converter only supports ADAH-style JSON objects (with keys 'ID' and 'opinion').
    """
    # ---------- load payload ----------
    data = _load_json_from_local_or_s3(json_path, s3_bucket, s3_key)

    # Check that this looks like ADAH data
    if not any(_is_adah_case(obj) for obj in data):
        raise ValueError(
            "This converter only supports ADAH-format cases (objects with 'ID' and 'opinion'). "
            "The provided JSON does not appear to contain ADAH cases."
        )

    # Normalize only_case_ids to a set of ints, if provided
    only_ids_set: Optional[set[int]] = None
    if only_case_ids is not None:
        only_ids_set = set()
        for v in only_case_ids:
            try:
                only_ids_set.add(int(v))
            except Exception:
                continue

    # ---------- setup ----------
    if outdir is None:
        outdir = _default_outdir_name(json_path, s3_key)
    OUTDIR = Path(outdir)
    OUTDIR.mkdir(parents=True, exist_ok=True)

    # ---- Build allowed_case_ids ----
    # Always using Citing_Relationship values "adah" and "citing_adah"
    allowed_case_ids: set[int] = set()
    REL_ADAH = "adah"
    REL_CITING_ADAH = "citing_adah"

    if subset:
        # Step 1: pick up to `subset` ADAH cases
        selected_adah_ids: List[int] = []
        for obj in data:
            if not _is_adah_case(obj):
                continue
            if _norm_rel(obj.get("Citing_Relationship")) == REL_ADAH:
                cid = _get_case_id(obj)
                if cid is None:
                    continue
                selected_adah_ids.append(cid)
                if len(selected_adah_ids) >= subset:
                    break

        selected_adah_ids_set = set(selected_adah_ids)
        allowed_case_ids.update(selected_adah_ids_set)

        # Step 2: include all cases that cite any selected ADAH case
        for obj in data:
            if not _is_adah_case(obj):
                continue
            cid = _get_case_id(obj)
            if cid is None:
                continue
            cited = _get_cited_ids(obj)
            if any(t in selected_adah_ids_set for t in cited):
                allowed_case_ids.add(cid)

    else:
        # Default: keep all ADAH and citing-ADAH cases
        for obj in data:
            if not _is_adah_case(obj):
                continue
            rel = _norm_rel(obj.get("Citing_Relationship"))
            if rel in (REL_ADAH, REL_CITING_ADAH):
                cid = _get_case_id(obj)
                if cid is not None:
                    allowed_case_ids.add(cid)

    if only_ids_set is not None:
        allowed_case_ids &= only_ids_set

    # ---- Precompute meta (names, URLs) for all allowed cases ----
    case_meta = _compute_case_meta_for_allowed(data, allowed_case_ids)

    # ---- Root-level tables ----
    cases_rows: List[Dict[str, Any]] = []
    courts: Dict[Any, Dict[str, Any]] = {}
    jurisdictions: Dict[Any, Dict[str, Any]] = {}

    cites_to: List[Dict[str, Any]] = []

    # Opinion (chunks mode)
    opinion_chunks: List[Dict[str, Any]] = []
    case_opinion_edges: List[Dict[str, Any]] = []

    case_id_order_with_opinions: List[int] = []
    per_case_chunk_count: Dict[int, int] = defaultdict(int)

    # Synthetic jurisdiction id mapping for ADAH (by name)
    adah_juris_name_to_id: Dict[str, int] = {}

    def chunk_id(case_id, op_idx, ch_idx):
        return f"{case_id}:{op_idx}:{ch_idx}"

    def maybe_cap(s: str) -> str:
        if s is None:
            return ""
        if max_opinion_chars is not None and len(s) > max_opinion_chars:
            return s[:max_opinion_chars]
        return s

    # Helper to upsert courts with court_level (min level wins if repeated)
    def upsert_court(court_key, name, abbrev, level: int):
        cur = courts.get(court_key)
        if not cur:
            courts[court_key] = {
                "id": court_key,
                "name": name,
                "name_abbreviation": abbrev,
                "court_level": int(level) if level else 5,
            }
        else:
            cur["name"] = cur.get("name") or name
            cur["name_abbreviation"] = cur.get("name_abbreviation") or abbrev
            prev = cur.get("court_level")
            if prev is None:
                cur["court_level"] = int(level) if level else 5
            else:
                cur["court_level"] = min(int(prev), int(level) if level else 5)

    # ---- Transform loop (ADAH-only) ----
    for c in data:
        if not _is_adah_case(c):
            continue
        cid = _get_case_id(c)
        if cid is None or cid not in allowed_case_ids:
            continue

        rel = _norm_rel(c.get("Citing_Relationship"))

        # Base ADAH metadata
        name = c.get("case_name") or ""
        decision_date = c.get("date_filed")
        docket_number = c.get("docket_number") or c.get("docker_number")

        # Court & jurisdiction (ADAH)
        court_name = (c.get("court") or "").strip()
        court_abbrev = (c.get("court_citation_string") or "").strip()
        court_id = _stable_id(f"adah-court:{court_name}|{court_abbrev}")

        juris_name = c.get("jurisdiction_inferred") or ""
        jurisdiction_id = None
        if juris_name:
            if juris_name not in adah_juris_name_to_id:
                adah_juris_name_to_id[juris_name] = _stable_id(f"adah-juris:{juris_name}")
            jurisdiction_id = adah_juris_name_to_id[juris_name]
            jurisdictions[jurisdiction_id] = {
                "id": jurisdiction_id,
                "name": juris_name,
                "name_long": "",
            }

        # Court level from jurisdiction (ADAH)
        lvl = court_level_from_jurisdiction(juris_name)
        upsert_court(court_id, court_name, court_abbrev, lvl)

        # case_full_name (tolerate literal "NULL")
        cf_raw = c.get("case_full_name")
        case_full_name = (
            "N/A"
            if (isinstance(cf_raw, str) and cf_raw.strip().upper() == "NULL")
            else (clean_text_basic(cf_raw) if cf_raw else "")
        )

        # citation pipe (single or list)
        cit_val = c.get("citation")
        if isinstance(cit_val, list):
            citation_pipe = join_or_blank([str(x).strip() for x in cit_val if x], sep=" | ")
        else:
            citation_pipe = str(cit_val).strip() if cit_val else ""

        # file slug + optional CourtListener URL for cases.csv
        file_name_val = case_meta.get(cid, {}).get("file_slug", "")
        cl_url = case_meta.get(cid, {}).get("url", "")

        # Base case row (ADAH)
        base = {
            "id": cid,
            "name": name,
            "case_full_name": case_full_name,
            "decision_date": decision_date,
            "docket_number": docket_number,
            "court_id": court_id,
            "jurisdiction_id": jurisdiction_id,
            "court_name_abbreviation": court_abbrev,
            "court_name": court_name,
            "jurisdiction_name": juris_name,
            "citation_pipe": citation_pipe,
            "file_name": file_name_val,
            # compare against the same constant used by the allow-list filter above
            "adah_case": (rel == REL_ADAH),
        }
        if include_url:
            base["court_listener_url"] = cl_url
        if opinion_storage == "inline":
            base[opinion_property] = ""

        # Opinions (ADAH is one long string)
        otype = c.get("opinion_type") or ""
        author = c.get("judge") or ""
        panel = c.get("panel_names") or []
        if isinstance(panel, str):
            # a bare string would otherwise be joined character by character
            panel = [panel]
        if author and panel:
            author = f"{author} (panel: {', '.join(str(p) for p in panel)})"
        otext = c.get("opinion") or ""

        if opinion_storage == "inline":
            full_txt = _compact_for_csv(otext)
            base[opinion_property] = maybe_cap(full_txt)
        else:
            if otext.strip():
                otext_comp = _compact_for_csv(otext)
                sem_chunks = sentence_chunks(otext_comp, semantic_chunk_size)
                if cid not in per_case_chunk_count:
                    case_id_order_with_opinions.append(cid)
                for ch_idx, ch_text in enumerate(sem_chunks):
                    oid = chunk_id(cid, 0, ch_idx)
                    opinion_chunks.append(
                        {
                            "id": oid,
                            "case_id": cid,
                            "chunk_index": ch_idx,
                            "opinion_type": otype,
                            "opinion_author": author,
                            "text": ch_text,
                        }
                    )
                    per_case_chunk_count[cid] += 1
                    case_opinion_edges.append({"case_id": cid, "chunk_id": oid})

        cases_rows.append(base)

        # cites_to (ADAH → numeric IDs)
        for tid_raw in (c.get("cite_to") or []):
            try:
                tid = int(tid_raw)
            except (TypeError, ValueError, OverflowError):
                continue  # skip citation targets that are not int-like
            if tid in allowed_case_ids:
                cites_to.append(
                    {
                        "src_case_id": cid,
                        "src_case_name": case_meta.get(cid, {}).get("name", ""),
                        "tgt_case_id": tid,
                        "tgt_case_name": case_meta.get(tid, {}).get("name", ""),
                        "src_case_court_listener_url": case_meta.get(cid, {}).get("url", ""),
                        "tgt_case_court_listener_url": case_meta.get(tid, {}).get("url", ""),
                    }
                )

    # ---------- write CSVs ----------
    if cases_rows:
        preferred = [
            "id",
            "name",
            "case_full_name",
            "decision_date",
            "docket_number",
            "court_id",
            "jurisdiction_id",
            "court_name_abbreviation",
            "court_name",
            "jurisdiction_name",
            "citation_pipe",
            "file_name",
            "court_listener_url",
            "adah_case",
        ]
        if any(opinion_property in r for r in cases_rows):
            preferred.append(opinion_property)
        header = infer_header(cases_rows, preferred_order=preferred)
        wcsv(Path(OUTDIR / "cases.csv"), cases_rows, header)

    if courts:
        # ensure every court row has court_level (default 5 if somehow missing)
        for v in courts.values():
            if "court_level" not in v or v["court_level"] in (None, ""):
                v["court_level"] = 5
        wcsv(
            Path(OUTDIR / "courts.csv"),
            courts.values(),
            ["id", "name", "name_abbreviation", "court_level"],
        )

    # jurisdictions.csv (two passes; final file is ADAH minimal rows)
    if jurisdictions:
        adah_ids = set(adah_juris_name_to_id.values()) if adah_juris_name_to_id else set()
        cap_jur_rows = [r for r in jurisdictions.values() if r.get("id") not in adah_ids]
        if cap_jur_rows:
            # no-op in pure ADAH case, but kept for compatibility
            wcsv(Path(OUTDIR / "jurisdictions.csv"), cap_jur_rows, ["id", "name", "name_long"])

    if adah_juris_name_to_id:
        minimal_rows = [
            {"id": jid, "jurisdiction_name": name}
            for name, jid in sorted(adah_juris_name_to_id.items(), key=lambda x: x[0].lower())
        ]
        wcsv(Path(OUTDIR / "jurisdictions.csv"), minimal_rows, ["id", "jurisdiction_name"])

    if cites_to:
        wcsv(
            Path(OUTDIR / "cites_to.csv"),
            cites_to,
            [
                "src_case_id",
                "src_case_name",
                "tgt_case_id",
                "tgt_case_name",
                "src_case_court_listener_url",
                "tgt_case_court_listener_url",
            ],
        )

    # ---- Opinion output ----
    if opinion_storage == "inline":
        print("Done (inline opinions). Files in:", OUTDIR.resolve())
        return

    if not opinion_chunks:
        print("Done. Files in:", OUTDIR.resolve())
        return

    # Single-file opinion chunks
    if opinion_splits <= 1:
        oc_pref = [
            "id",
            "case_id",
            "chunk_index",
            "opinion_type",
            "opinion_author",
            "text",
        ]
        oc_header = infer_header(opinion_chunks, preferred_order=oc_pref)
        wcsv(Path(OUTDIR / "opinion_chunks.csv"), opinion_chunks, oc_header)

        wcsv(Path(OUTDIR / "case_opinion_edges.csv"), case_opinion_edges, ["case_id", "chunk_id"])
        print("Done. Files in:", OUTDIR.resolve())
        return

    # Multi-file split by case boundaries, balancing chunk counts
    total_chunks = len(opinion_chunks)
    case_to_split = _build_case_splits(
        case_id_order_with_opinions, per_case_chunk_count, total_chunks, opinion_splits
    )

    base = OUTDIR / "Opinion Chunks"
    base.mkdir(parents=True, exist_ok=True)

    for k in range(opinion_splits):
        split_dir = base / f"split_{k+1:02d}"
        split_dir.mkdir(parents=True, exist_ok=True)

        cases_in_split = {cid for cid, sidx in case_to_split.items() if sidx == k}

        oc_rows = [r for r in opinion_chunks if r["case_id"] in cases_in_split]
        coe_rows = [r for r in case_opinion_edges if r["case_id"] in cases_in_split]

        if oc_rows:
            oc_pref = [
                "id",
                "case_id",
                "chunk_index",
                "opinion_type",
                "opinion_author",
                "text",
            ]
            oc_header = infer_header(oc_rows, preferred_order=oc_pref)
            wcsv(split_dir / "opinion_chunks.csv", oc_rows, oc_header)
        if coe_rows:
            wcsv(split_dir / "case_opinion_edges.csv", coe_rows, ["case_id", "chunk_id"])

    print("Done. Files in:", OUTDIR.resolve())


# ---------------------------
# Example calls
# ---------------------------

# 1) Default “production” filter: keep ADAH + citing_ADAH, chunk opinions into single CSVs
# json_to_csv("adah_cases_full.json")

# 2) Inline full opinions
# json_to_csv("adah_cases_full.json", opinion_storage="inline", opinion_property="full_opinion_text")

# 3) Chunked opinions, split across 3 folders
# json_to_csv("adah_cases_full.json", opinion_splits=3)

# 4) ADAH model-dev subset: keep ~100 ADAH cases + ALL cases that cite them
# json_to_csv("adah_cases_full.json", subset=100)

# 5) Restrict to a specific set of case IDs (strings or ints; duplicates OK)
# json_to_csv("adah_cases_full.json", only_case_ids=["2113004", "174213", "174213", 2461771])

# 6) Include CourtListener URLs in cases.csv (cites_to.csv always includes URLs)
# json_to_csv("adah_cases_full.json", include_url=True)
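
Example 3 relies on `_build_case_splits`, which is not shown here. As a rough illustration of what "split by case boundaries, balancing chunk counts" could look like, the sketch below walks the cases in order and moves to the next split once the running chunk total reaches that split's share. The name `assign_cases_to_splits` and the threshold rule are assumptions made for illustration; the notebook's real helper may balance differently.

```python
from typing import Dict, List


def assign_cases_to_splits(
    case_order: List[int],
    chunk_counts: Dict[int, int],
    n_splits: int,
) -> Dict[int, int]:
    """Hypothetical stand-in for _build_case_splits: map case_id -> split index."""
    if n_splits < 1:
        raise ValueError("n_splits must be >= 1")
    total = sum(chunk_counts.get(cid, 0) for cid in case_order)
    target = total / n_splits  # ideal number of chunks per split
    assignment: Dict[int, int] = {}
    split_idx = 0
    running = 0
    for cid in case_order:
        # A case is never divided: all of its chunks go to one split.
        assignment[cid] = split_idx
        running += chunk_counts.get(cid, 0)
        # Advance once the cumulative total reaches this split's share.
        if running >= target * (split_idx + 1) and split_idx < n_splits - 1:
            split_idx += 1
    return assignment


# Made-up chunk counts for four cases, divided across two splits.
demo = assign_cases_to_splits(
    [101, 102, 103, 104], {101: 5, 102: 1, 103: 3, 104: 3}, 2
)
print(demo)
```

Because the assignment is keyed by case, every chunk of a case and its `case_opinion_edges` row land in the same `split_XX` folder, which matches how the writer loop above filters rows by `case_id`.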

## Example: extracting a subset (15 ADAH cases)

# json_to_csv(json_path="adah_cases_full.json", outdir="adah_cases_subset_eval", semantic_chunk_size=4000, subset=15, include_url=True)

Done. Files in: C:\Users\kentb\OneDrive\Desktop\Berkeley MIDS\DATASCI 210\Post-ADA Dataset Work\adah_cases_subset_eval
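
After a subset run, it can be worth checking that every edge in `cites_to.csv` points at a row that exists in `cases.csv` before loading into the Data Importer. The sketch below is one possible check. The column names come from the writer code above, while the helper names (`read_rows`, `dangling_citations`), the UTF-8 encoding, and the demo rows are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path: Path):
    """Load a CSV file as a list of dicts (assumes UTF-8 and a header row)."""
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def dangling_citations(outdir: Path):
    """Return (src, tgt) id pairs from cites_to.csv that are missing from cases.csv."""
    case_ids = {row["id"] for row in read_rows(outdir / "cases.csv")}
    return [
        (row["src_case_id"], row["tgt_case_id"])
        for row in read_rows(outdir / "cites_to.csv")
        if row["src_case_id"] not in case_ids or row["tgt_case_id"] not in case_ids
    ]


# Demo on a throwaway directory with made-up rows (not real ADAH data).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "cases.csv").write_text("id,name\n1,Case A\n2,Case B\n", encoding="utf-8")
    (out / "cites_to.csv").write_text(
        "src_case_id,tgt_case_id\n1,2\n2,3\n", encoding="utf-8"
    )
    demo_dangling = dangling_citations(out)

print(demo_dangling)
```

On real output, point `dangling_citations` at the `outdir` used in the call above. Since citation targets are filtered against the allowed case IDs, an empty list is the expected result, and anything else suggests an allowed ID that never produced a case row.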

