# Opinion Snippet Retriever

This notebook **scans opinions and writes regex-based snippets** onto each `(:Case)-[:CITES_TO]->(:Case)` edge as `snippet_1`, `snippet_2`, … plus metadata (`snippet_method`, `snippet_count`, and optional `id_hit`).

---

## What this does

1. **Page through citation edges**  
   Batches over `:CITES_TO` using `id(r)` as a cursor (`BATCH_RELATIONSHIPS`).

2. **Load the citer’s opinion text**  
   Concatenates all `OpinionChunk` nodes for the source case ordered by `chunk_index`.

3. **Find “hits” in strict order (first method that hits wins):**  
   A) **`citation_pipe_first` (strict-ish)**: exact regex for the **first** element in `t.citation_pipe` (periods may be followed by an optional space; all whitespace collapsed).  
   B) **`relaxed_citation_first`**: robust relaxed regex for the **first** cite that **ignores all punctuation and whitespace** (e.g., `53 L. Ed. 2d 113` ⇔ `53 L.Ed.2d 113`, `432 U. S. 63` ⇔ `432 U.S. 63`).  
   C) **`exact_name`**: exact case-name regex (case-insensitive).  
   D) **`relaxed_name`**: tolerant case-name regex (handles punctuation/abbrev variants like `Department/Dep't/Dept.`).  
   E) **`citation_pipe_other`** (only if `search_all_citation_pipe_elements=True`): exact regex over **all remaining** `citation_pipe` elements (handles spacing like `432 U.S. 63` vs `432 U. S. 63`).  
   F) **`relaxed_citation_other`** (only if enabled): relaxed regex over **remaining** cites (punctuation/whitespace ignored).

   > Once a method yields hits, **all snippets for that edge come from that same method** (no mixing).

4. **Merge overlapping hits into a single expanded snippet**  
   For each edge, hits become character windows (`APPROX_CHAR_WINDOW` total). Overlapping/touching windows are **merged** to avoid near-duplicates, then optionally **expanded to sentence boundaries** (`EXPAND_TO_SENTENCE`).

5. **“Id.” follow-on hits (optional)**  
   If `find_Id_hits=True`, after each primary hit we **scan forward** and treat `"Id."` as an additional hit **until** another case appears (detected by a `v`/`v.` pattern). If any `"Id."` is included, we set `id_hit = true`; otherwise `false`. If `find_Id_hits=False`, this step is skipped.

6. **Build snippet text**  
   From each (merged) window, extract a **character window** and optionally **compact** the text (`USE_COMPACT_SNIPPET`) to heal hyphenations, drop line-number rows, and collapse whitespace.

7. **Write back to Neo4j**  
   On each `:CITES_TO` edge we set:
   * `snippet_1 … snippet_N` (up to `MAX_SNIPPETS_PER_EDGE`)
   * `snippet_method ∈ {"citation_pipe_first","relaxed_citation_first","exact_name","relaxed_name","citation_pipe_other","relaxed_citation_other"}`
   * `snippet_count` (N)
   * `id_hit` (boolean) **only if** `find_Id_hits=True`
   * If `force=True` (or global `CLEAR_PREVIOUS_SNIPPETS=True`), we first clear any existing `snippet_*` props.

8. **Progress & debug reporting**  
   Every `SECONDS_BETWEEN_REPORTS` seconds we print:
   * edges processed
   * successful edges (and breakdown by method)
   * unsuccessful (no-hit) count
   * total `"Id."` hits (when enabled)

   Additionally, with `echo=True` the notebook prints **per-edge** debug lines:
   * On name/other-element hits:  
     `source case -> target case - {hit_type} - {#hits}`
   * On misses:  
     `source case -> target case - No hit found`
   * When the source opinion text is missing/empty:  
     `source case -> target case - No source opinion text`
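
The "first method that hits wins" rule in step 3 can be sketched as a simple cascade. This is a minimal illustration, not the notebook's implementation: `first_method_that_hits`, `regex_finder`, the two hard-coded patterns, and the case name `Smith v. Jones` are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]
Finder = Callable[[str], List[Span]]

def regex_finder(pattern: str) -> Finder:
    """Return a function that lists (start, end) spans of a case-insensitive pattern."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda text: [(m.start(), m.end()) for m in compiled.finditer(text)]

def first_method_that_hits(text: str,
                           strategies: List[Tuple[str, Finder]]) -> Tuple[Optional[str], List[Span]]:
    """Try strategies in order; all spans come from the first one that matches."""
    for name, finder in strategies:
        spans = finder(text)
        if spans:
            return name, spans
    return None, []

# Hypothetical strategies: a strict citation pattern, then an exact-name pattern.
strategies = [
    ("citation_pipe_first", regex_finder(r"432\s*U\.\s?S\.\s*63")),
    ("exact_name", regex_finder(r"Smith\s+v\.\s+Jones")),
]

text = "As we held in Smith v. Jones, the rule applies."
method, spans = first_method_that_hits(text, strategies)
print(method, [text[a:b] for a, b in spans])
# → exact_name ['Smith v. Jones']
```

Because the citation does not appear in the sample text, the cascade falls through to the name pattern, and only name hits are returned.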

---

## How to run

```python
# Small dry run (first page)
retrieve_all_snippets(
    batch_size=50,
    echo=True,
    force=True,                           # overwrite existing snippet_* props
    search_all_citation_pipe_elements=False,  # only first element of citation_pipe
    find_Id_hits=False                    # skip "Id." follow-on scanning
)

# Full run with broader reporter-cite search + Id. detection
retrieve_all_snippets(
    echo=True,
    force=True,
    search_all_citation_pipe_elements=True,   # search remaining citation_pipe elements as LAST resort
    find_Id_hits=True
)

# Quiet run without the clearing step (stale higher-numbered snippet_* may remain)
retrieve_all_snippets(
    echo=False,
    force=False,
    search_all_citation_pipe_elements=False,
    find_Id_hits=False
)
```

**Parameter notes**

* `force`:

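  * `True` → clear any existing `snippet_*` props on the edge before writing
  * `False` → skip the clearing step; newly found snippets still overwrite same-numbered `snippet_*` props, but stale higher-numbered ones are left in place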
  * `None` → use global `CLEAR_PREVIOUS_SNIPPETS`
* `search_all_citation_pipe_elements`:

  * `False` (default) → search **only the first** `citation_pipe` element
  * `True` → after name strategies fail, search **remaining** `citation_pipe` elements (exact, then relaxed)
* `find_Id_hits`:

  * `False` (default) → no `"Id."` scanning
  * `True` → scan forward for `"Id."` before the next case (`v`/`v.`); set `id_hit` accordingly
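
The forward `"Id."` scan described for `find_Id_hits` can be sketched as follows. This is an illustrative sketch under stated assumptions: `id_hits_after`, `ID_RE`, and `CASE_V_RE` are hypothetical names, and the sample text is invented. The `Id.` pattern has no trailing word boundary, because a period followed by a space is not a word boundary.

```python
import re
from typing import List, Tuple

ID_RE = re.compile(r"\bId\.")        # "Id." at the start of a word
CASE_V_RE = re.compile(r"\bv\.?\b")  # "v" / "v." token that signals another case name

def id_hits_after(text: str, start: int) -> List[Tuple[int, int]]:
    """Collect 'Id.' spans after `start`, stopping at the next 'v'/'v.' token."""
    hits: List[Tuple[int, int]] = []
    cur = start
    while True:
        m_id = ID_RE.search(text, cur)
        if m_id is None:
            break
        m_v = CASE_V_RE.search(text, cur)
        if m_v is not None and m_v.start() < m_id.start():
            break  # another case was named before this "Id."
        hits.append((m_id.start(), m_id.end()))
        cur = m_id.end()
    return hits

text = ("Smith v. Jones, 432 U.S. 63. Id. at 70. Id. at 72. "
        "But see Roe v. Doe. Id. at 5.")
start = text.index("63") + len("63")   # position just after the primary hit
hits = id_hits_after(text, start)
print([text[a:b] for a, b in hits])
# → ['Id.', 'Id.']
```

The third `Id.` is skipped because `Roe v. Doe` appears before it, so it presumably refers to a different case.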

---

## Key tunables

* **Batching:** `BATCH_RELATIONSHIPS` (edges per page)
* **Snippets per edge:** `MAX_SNIPPETS_PER_EDGE`
* **Snippet windowing:** `APPROX_CHAR_WINDOW`, `EXPAND_TO_SENTENCE`, `USE_COMPACT_SNIPPET`
* **Overwrite behavior:** `CLEAR_PREVIOUS_SNIPPETS` (used when `force=None`)
* **Logging cadence:** `SECONDS_BETWEEN_REPORTS`
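
To see how `APPROX_CHAR_WINDOW` interacts with merging, here is a small worked example. It is a sketch of the windowing idea only; `merge_windows` and the hit offsets are hypothetical.

```python
from typing import List, Tuple

def merge_windows(text_len: int, spans: List[Tuple[int, int]],
                  total_chars: int) -> List[Tuple[int, int]]:
    """Pad each hit by half the window on both sides, then merge overlapping windows."""
    half = max(1, total_chars // 2)
    windows = sorted((max(0, a - half), min(text_len, b + half)) for a, b in spans)
    merged: List[Tuple[int, int]] = []
    for left, right in windows:
        if merged and left <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged

# Three hits in a 20,000-character opinion; the first two are 500 chars apart.
hits = [(1000, 1011), (1500, 1511), (9000, 9011)]
print(merge_windows(20000, hits, total_chars=1200))
# → [(400, 2111), (8400, 9611)]
```

With a 1200-character window, the two nearby hits collapse into one snippet while the distant hit keeps its own window.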

---

## Properties written on `:CITES_TO`

* `snippet_1 … snippet_N` (text)
* `snippet_method` (string)
* `snippet_count` (int)
* `id_hit` (bool; **only** if `find_Id_hits=True`)
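
The property layout above can be illustrated by building the map of values for one edge. This is a sketch of the documented layout, not the notebook's write path; `build_edge_properties` and its parameters are hypothetical.

```python
from typing import Any, Dict, List, Optional

def build_edge_properties(snippets: List[str], method: str,
                          id_hit: Optional[bool] = None,
                          max_snippets: int = 10) -> Dict[str, Any]:
    """Assemble the snippet_* properties for a single :CITES_TO edge."""
    kept = snippets[:max_snippets]  # cap, as MAX_SNIPPETS_PER_EDGE does
    props: Dict[str, Any] = {f"snippet_{i}": s for i, s in enumerate(kept, start=1)}
    props["snippet_method"] = method
    props["snippet_count"] = len(kept)
    if id_hit is not None:          # include id_hit only when Id. scanning ran
        props["id_hit"] = bool(id_hit)
    return props

props = build_edge_properties(["first snippet", "second snippet"], "exact_name")
print(props)
```

Passing `id_hit=None` models a run with `find_Id_hits=False`, where the `id_hit` key is omitted.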

---

## Environment & paths

* **SageMaker/local project folder:** set `PROJECT_DIR` (optional).
* **Credentials:** `.env` loaded from `ENV_PATH` (default `"../.env"`). Required keys:

  * `NEO4J_URI` (e.g., `neo4j+s://<host>`)
  * `NEO4J_USERNAME`
  * `NEO4J_PASSWORD`
  * `NEO4J_DATABASE` (optional; defaults to `"neo4j"`)

> Install dependencies with `pip install neo4j python-dotenv pandas` (SageMaker: install into the active kernel).

---

## Assumptions

* Graph contains `Case`, `OpinionChunk`, and `:CITES_TO`.
  `OpinionChunk` has `case_id`, `chunk_index`, `text`.
* Target cases have `t.citation_pipe` populated (pipe/semicolon-separated reporter cites).
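
As a quick check of the `citation_pipe` assumption, the splitting rule can be tried on a sample value. The helper name `split_cites` is hypothetical; the sample cites are the ones used earlier in this README.

```python
import re
from typing import List, Optional

def split_cites(citation_pipe: Optional[str]) -> List[str]:
    """Split a pipe- or semicolon-separated citation string into trimmed, non-empty cites."""
    if not citation_pipe:
        return []
    return [part.strip() for part in re.split(r"[;|]+", citation_pipe) if part.strip()]

print(split_cites("432 U.S. 63 | 53 L. Ed. 2d 113; "))
# → ['432 U.S. 63', '53 L. Ed. 2d 113']
```

The first element of the result is the cite that the `*_first` methods search for.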

---

## Troubleshooting

* **Few/no hits:** verify `t.citation_pipe` values (relaxed citation matching already runs automatically after the strict pass); optionally enable `search_all_citation_pipe_elements=True`.
* **Near-duplicate snippets:** merging is automatic; consider increasing `APPROX_CHAR_WINDOW` for dense clusters.
* **Unexpected `"Id."` behavior:** `"Id."` scanning is forward-only until a `v`/`v.`; enable `find_Id_hits=True` to set and observe `id_hit`.
* **Connection errors:** confirm `NEO4J_URI` reachability and `.env` path (`ENV_PATH`).
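
When debugging missing hits, it can help to test a cite against sample text with both a strict and a relaxed pattern. The sketch below is a simplified stand-in for the notebook's regex builders; `strict_cite_pattern`, `relaxed_cite_pattern`, and `SEP` are hypothetical names.

```python
import re

# Separator allowed between characters in the relaxed pattern (assumed set).
SEP = r"[\s.,\-;:()\[\]/]*"

def strict_cite_pattern(cite: str) -> "re.Pattern[str]":
    """Keep periods, but allow an optional space after each and flexible whitespace."""
    tokens = [re.escape(tok).replace(r"\.", r"\.\s?") for tok in cite.split()]
    return re.compile(r"\s*".join(tokens), re.IGNORECASE)

def relaxed_cite_pattern(cite: str) -> "re.Pattern[str]":
    """Ignore punctuation and whitespace: match only the letters and digits, in order."""
    base = re.sub(r"[^A-Za-z0-9]+", "", cite)
    return re.compile(SEP.join(re.escape(ch) for ch in base), re.IGNORECASE)

cite = "53 L. Ed. 2d 113"
sample = "See 53 L Ed 2d 113 (periods dropped by OCR)."

print(bool(strict_cite_pattern(cite).search(sample)))   # strict pass misses
print(bool(relaxed_cite_pattern(cite).search(sample)))  # relaxed pass hits
```

If both patterns miss on text you can see by eye, the stored `citation_pipe` value is the more likely culprit.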


In [1]:
!pip install neo4j python-dotenv pandas



In [2]:
# ---------- imports ----------
import os, re, time, pathlib, logging
from typing import List, Dict, Any, Tuple, Optional

import pandas as pd  # (handy for ad-hoc checks / CSVs if you add later)
from neo4j import GraphDatabase
from dotenv import load_dotenv

# =========================
# Optional: project directory
# =========================
PROJECT_DIR = ""  # e.g., "/home/ec2-user/SageMaker/edge-classifier"
if PROJECT_DIR:
    os.makedirs(PROJECT_DIR, exist_ok=True)
    os.chdir(PROJECT_DIR)
    print(f"Working directory set to: {os.getcwd()}")
else:
    print(f"Working in current directory: {os.getcwd()}")

# =========================
# .env loading
# =========================
ENV_PATH = "../.env"
if not os.path.exists(ENV_PATH):
    print(f"Warning: .env not found at {ENV_PATH}. Make sure your Neo4j creds are available in environment.")
load_dotenv(ENV_PATH, override=True)

# Suppress very noisy Neo4j notifications in console
for _n in ("neo4j", "neo4j.notifications", "neo4j.work.simple"):
    logging.getLogger(_n).setLevel(logging.ERROR)

# =========================
# ENV / CONFIG
# =========================
def _clean_env(v: Optional[str]) -> Optional[str]:
    if v is None:
        return None
    v = v.strip().strip('"').strip("'").strip()
    return v or None

NEO4J_URI       = _clean_env(os.getenv("NEO4J_URI"))
NEO4J_USERNAME  = _clean_env(os.getenv("NEO4J_USERNAME"))
NEO4J_PASSWORD  = _clean_env(os.getenv("NEO4J_PASSWORD"))
NEO4J_DATABASE  = _clean_env(os.getenv("NEO4J_DATABASE")) or "neo4j"

if not NEO4J_URI:
    raise RuntimeError("Missing NEO4J_URI. Set it in .env (e.g., NEO4J_URI=neo4j+s://<host>).")
if not (NEO4J_USERNAME and NEO4J_PASSWORD):
    raise RuntimeError("Missing NEO4J_USERNAME or NEO4J_PASSWORD. Set them in .env.")

Working in current directory: /home/sagemaker-user/Snippet Retriever


## Tunables

In [5]:
# =========================
# Tunables
# =========================
BATCH_RELATIONSHIPS     = 250      # :CITES_TO per page
MAX_SNIPPETS_PER_EDGE   = 10       # hard cap of snippets stored per edge
APPROX_CHAR_WINDOW      = 1200     # total chars (≈ half on each side)
EXPAND_TO_SENTENCE      = True     # widen to sentence boundaries
USE_COMPACT_SNIPPET     = True     # heal hyphenations, drop line numbers, collapse spaces
SECONDS_BETWEEN_REPORTS = 5        # periodic progress print
CLEAR_PREVIOUS_SNIPPETS = True     # default behavior; overridden by 'force' param

## Cypher Queries

In [6]:
# =========================
# Cypher Queries
# =========================
Q_PAGE_REL = """
MATCH (s:Case)-[r:CITES_TO]->(t:Case)
WHERE id(r) > $after_id
RETURN id(r) AS rel_id,
       s.id AS src_id, coalesce(s.name,'') AS src_name,
       t.id AS tgt_id, coalesce(t.name,'') AS tgt_name,
       t.citation_pipe AS tgt_cite
ORDER BY rel_id
LIMIT $limit
"""

Q_GET_CHUNKS_FOR_CASE = """
MATCH (oc:OpinionChunk {case_id:$case_id})
RETURN oc.chunk_index AS chunk_index, oc.text AS text
ORDER BY oc.chunk_index
"""

In [7]:
def _mk_clear_snippets_cypher(max_snips:int) -> str:
    assigns = []
    for i in range(1, max_snips+1):
        assigns.append(f"r.snippet_{i} = NULL")
    assigns.append("r.snippet_method = NULL")
    assigns.append("r.snippet_count  = 0")
    assigns.append("r.id_hit = NULL")  # property name matches the documented `id_hit`
    return "MATCH (src:Case {id:$src_id})-[r:CITES_TO]->(t:Case {id:$tgt_id}) SET " + ", ".join(assigns) + " RETURN 1 as ok"

Q_WRITE_SNIPPETS_BASE = """
MATCH (src:Case {id:$src_id})-[r:CITES_TO]->(t:Case {id:$tgt_id})
SET r.snippet_method = $snippet_method,
    r.snippet_count  = $snippet_count,
    r.id_hit         = $id_hit
SET r += $snippet_map
RETURN r.snippet_count AS snippet_count
"""

# =========================
# Regex helpers
# =========================
_ABBREV_ALTS = {
    "county": r"(County|Cnty\.?|Cty\.?)",
    "city": r"(City|Cty\.?)",
    "department": r"(Department|Dep't|Dept\.?|Dep’t)",
    "corporation": r"(Corporation|Corp\.?)",
    "incorporated": r"(Incorporated|Inc\.?)",
    "limited": r"(Limited|Ltd\.?)",
    "company": r"(Company|Co\.?)",
    "association": r"(Association|Ass'n|Assn\.?)",
    "commission": r"(Commission|Comm'n|Commn\.?)",
    "board": r"(Board|Bd\.?)",
    "authority": r"(Authority|Auth\.?)",
    "saint": r"(Saint|St\.?)",
    "mount": r"(Mount|Mt\.?)",
}

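def compile_name_regex_exact(case_name: str) -> re.Pattern:
    escaped = re.escape(case_name.strip())
    escaped = re.sub(r"\\\s+", r"\\s+", escaped)  # escaped whitespace => flexible \s+
    # Lookarounds instead of \b: a trailing \b fails after names that end in
    # punctuation (e.g. "... Inc.") when the next character is a space.
    return re.compile(rf"(?i)(?<!\w){escaped}(?!\w)")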

def _token_to_relaxed(tok: str) -> str:
    base = tok.strip().strip(",.;:")
    key = base.lower().replace("’", "'").replace("–","-").replace("—","-")
    if key in _ABBREV_ALTS:
        return _ABBREV_ALTS[key]
    esc = re.escape(base)
    esc = re.sub(r'\\\.', r'\\.?', esc)
    return esc

def make_relaxed_case_name_regex(tgt_name: str) -> re.Pattern:
    m = re.split(r'\s+v\.?\s+', tgt_name, flags=re.IGNORECASE, maxsplit=1)
    if len(m) == 2:
        left, right = m[0], m[1]
    else:
        left, right = tgt_name, ""
    right = re.sub(r',\s*[A-Za-z][A-Za-z.]*\s*$', '', right).strip()

    def party_to_regex(party: str) -> str:
        tokens = re.split(r'\s+', party.strip())
        parts = [_token_to_relaxed(t) for t in tokens if t]
        return r'(?:\s+|,\s*)'.join(parts)

    left_re  = party_to_regex(left)  if left  else ""
    right_re = party_to_regex(right) if right else ""
    if left_re and right_re:
        pat = rf"(?i)\b{left_re}\s+v\.?\s+{right_re}(?:,\s*[A-Za-z][A-Za-z.]+)?\b"
    else:
        side = left_re or right_re or re.escape(tgt_name)
        pat = rf"(?i)\b{side}(?:,\s*[A-Za-z][A-Za-z.]+)?\b"

    try:
        return re.compile(pat)
    except re.error:
        return compile_name_regex_exact(tgt_name)

def compile_citation_regex(cite: str) -> re.Pattern:
    """
    Strict(ish) citation: keeps periods but allows optional single space after them and collapses whitespace.
    Good for exact matches like '261 A.D.2d 890'.
    """
    s = cite.strip()
    s = re.escape(s)
    s = re.sub(r'\\\s+', r'\\s*', s)      # any whitespace => \s*
    s = s.replace(r'\.', r'\. ?')         # period + optional space
    return re.compile(rf"(?i){s}")

def compile_relaxed_citation_regex(cite: str) -> re.Pattern:
    """
    Robust relaxed citation: ignore ALL punctuation and whitespace.
    Works for '53 L. Ed. 2d 113' vs '53 L.Ed.2d 113' and '432 U. S. 63' vs '432 U.S. 63'.
    Strategy:
      - Strip everything except A–Z / a–z / 0–9 from the target cite to form 'base'
      - Build a regex that allows arbitrary punctuation/space between EVERY character
        in base when searching the opinion text.
    """
    base = re.sub(r'[^A-Za-z0-9]+', '', cite or '')
    if not base or len(base) < 3:
        return compile_citation_regex(cite or '')
    # Allow ANY mix of punctuation/space (incl. non-breaking) between chars
    sep = r'[\s\.\,\-–—;:\(\)\[\]/\u00A0]*'
    parts = [re.escape(ch) for ch in base]
    pat = r'(?i)' + sep.join(parts)
    return re.compile(pat)

def split_citation_pipe(citation_pipe: Optional[str]) -> List[str]:
    if not citation_pipe:
        return []
    return [p.strip() for p in re.split(r'[;|]+', citation_pipe) if p.strip()]

# =========================
# Text & snippet helpers
# =========================
_SENT_SPLIT_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z(“"\'\[])')

def _split_sentences_with_spans(text: str) -> List[Tuple[int,int,str]]:
    spans: List[Tuple[int,int,str]] = []
    start = 0
    n = len(text)
    for m in _SENT_SPLIT_RE.finditer(text):
        end = m.start()
        if end > start:
            spans.append((start, end, text[start:end]))
        start = m.end()
    if start < n:
        spans.append((start, n, text[start:]))
    return spans

def compact_snippet(text: str) -> str:
    if not text:
        return text
    s = text.replace("\r", "\n").replace("\x0c", " ")
    # heal hyphenation across line breaks
    s = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', s)
    # drop bare line-number rows
    lines = []
    for ln in s.splitlines():
        if re.match(r'^\s*\d{1,3}\s*$', ln):
            continue
        lines.append(ln)
    s = "\n".join(lines)
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)
    return s

def _expand_bounds_to_sentences(text: str, L: int, R: int) -> Tuple[int,int]:
    if not EXPAND_TO_SENTENCE or not text:
        return L, R
    sents = _split_sentences_with_spans(text)
    if not sents:
        return L, R
    n = len(text)
    L = max(0, min(L, n)); R = max(0, min(R, n))
    if L >= R:
        return L, R
    li = next((i for i,(sa,sb,_) in enumerate(sents) if sa <= L < sb), None)
    ri = next((i for i,(sa,sb,_) in enumerate(sents) if sa <= max(0,R-1) < sb), None)
    if li is not None: L = sents[li][0]
    if ri is not None: R = sents[ri][1]
    return L, R

def _extract_by_bounds(text: str, L: int, R: int) -> str:
    if R <= L:
        return ""
    L, R = _expand_bounds_to_sentences(text, L, R)
    snippet = text[L:R].strip()
    return compact_snippet(snippet) if USE_COMPACT_SNIPPET else snippet

# =========================
# Hit finding helpers (first element vs other elements)
# =========================
def _collect_hits_citation_first_exact(full_text: str, citation_pipe: Optional[str]) -> List[Tuple[int,int]]:
    cites = split_citation_pipe(citation_pipe)
    if not cites:
        return []
    pat = compile_citation_regex(cites[0])
    return [(m.start(), m.end()) for m in pat.finditer(full_text)]

def _collect_hits_citation_first_relaxed(full_text: str, citation_pipe: Optional[str]) -> List[Tuple[int,int]]:
    cites = split_citation_pipe(citation_pipe)
    if not cites:
        return []
    pat = compile_relaxed_citation_regex(cites[0])
    return [(m.start(), m.end()) for m in pat.finditer(full_text)]

def _collect_hits_citation_other_exact(full_text: str, citation_pipe: Optional[str]) -> List[Tuple[int,int]]:
    cites = split_citation_pipe(citation_pipe)
    if not cites or len(cites) < 2:
        return []
    spans: List[Tuple[int,int]] = []
    for cite in cites[1:]:
        pat = compile_citation_regex(cite)
        for m in pat.finditer(full_text):
            spans.append((m.start(), m.end()))
    spans.sort()
    return spans

def _collect_hits_citation_other_relaxed(full_text: str, citation_pipe: Optional[str]) -> List[Tuple[int,int]]:
    cites = split_citation_pipe(citation_pipe)
    if not cites or len(cites) < 2:
        return []
    spans: List[Tuple[int,int]] = []
    for cite in cites[1:]:
        pat = compile_relaxed_citation_regex(cite)
        for m in pat.finditer(full_text):
            spans.append((m.start(), m.end()))
    spans.sort()
    return spans

def _collect_hits_exact_name(full_text: str, tgt_name:str) -> List[Tuple[int,int]]:
    if not tgt_name:
        return []
    pat = compile_name_regex_exact(tgt_name)
    return [(m.start(), m.end()) for m in pat.finditer(full_text)]

def _collect_hits_relaxed_name(full_text: str, tgt_name:str) -> List[Tuple[int,int]]:
    if not tgt_name:
        return []
    pat = make_relaxed_case_name_regex(tgt_name)
    return [(m.start(), m.end()) for m in pat.finditer(full_text)]

# No trailing \b: "Id." is normally followed by a space or punctuation, where \b cannot match.
_ID_RE   = re.compile(r'\bId\.')
_VTOK_RE = re.compile(r'\bv\.?\b')

def _collect_following_id_hits(full_text: str, start_pos: int) -> List[Tuple[int,int]]:
    """
    From start_pos, collect ALL 'Id.' spans that occur BEFORE the next 'v'/'v.' token.
    """
    out = []
    cur = start_pos
    n = len(full_text)
    while cur < n:
        m_v  = _VTOK_RE.search(full_text, cur)
        m_id = _ID_RE.search(full_text, cur)
        if not m_id:
            break
        if m_v and (m_v.start() <= m_id.start()):
            break  # hit a new 'v' before the next Id.
        out.append((m_id.start(), m_id.end()))
        cur = m_id.end() + 1
    return out

def _merge_hit_windows(text_len: int, spans: List[Tuple[int,int]], total_chars:int) -> List[Tuple[int,int]]:
    """
    Convert raw hit spans to [L,R] windows and MERGE overlapping/touching windows.
    """
    if not spans:
        return []
    half = max(1, total_chars // 2)
    windows = []
    for a,b in spans:
        L = max(0, a - half)
        R = min(text_len, b + half)
        windows.append((L,R))
    windows.sort()
    merged = []
    curL, curR = windows[0]
    for L,R in windows[1:]:
        if L <= curR:
            curR = max(curR, R)
        else:
            merged.append((curL, curR))
            curL, curR = L, R
    merged.append((curL, curR))
    return merged

# =========================
# Per-edge retrieval
# =========================
def retrieve_snippets_for_edge(session, row: Dict[str,Any], *,
                               search_all_citation_pipe_elements: bool,
                               find_Id_hits: bool,
                               echo: bool=False) -> Dict[str,Any]:
    """
    Returns dict with:
      status: 'ok'|'no_source_chunks'|'no_hit'
      method: one of:
        'citation_pipe_first', 'relaxed_citation_first',
        'exact_name', 'relaxed_name',
        'citation_pipe_other', 'relaxed_citation_other'
      snippets: List[str]
      id_hit: bool
      hit_count: int
    """
    src_id   = row["src_id"]
    src_name = row.get("src_name") or ""
    tgt_name = row["tgt_name"] or ""
    tgt_cite = row.get("tgt_cite") or ""

    chunks = session.run(Q_GET_CHUNKS_FOR_CASE, {"case_id": src_id}).data()
    if not chunks:
        return {"status":"no_source_chunks","method":None,"snippets":[], "id_hit": False, "hit_count": 0}

    full_text = "".join(ch.get("text") or "" for ch in chunks)
    if not full_text.strip():
        return {"status":"no_source_chunks","method":None,"snippets":[], "id_hit": False, "hit_count": 0}

    # a) first-element exact citation
    spans = _collect_hits_citation_first_exact(full_text, tgt_cite)
    method = None
    if spans:
        method = "citation_pipe_first"
    else:
        # b) first-element relaxed citation
        spans = _collect_hits_citation_first_relaxed(full_text, tgt_cite)
        if spans:
            method = "relaxed_citation_first"
        else:
            # c) exact name
            spans = _collect_hits_exact_name(full_text, tgt_name)
            if spans:
                method = "exact_name"
                if echo:
                    print(f"{src_name} -> {tgt_name} - exact_name - {len(spans)}")
            else:
                # d) relaxed name
                spans = _collect_hits_relaxed_name(full_text, tgt_name)
                if spans:
                    method = "relaxed_name"
                    if echo:
                        print(f"{src_name} -> {tgt_name} - relaxed_name - {len(spans)}")
                else:
                    # e) other-elements exact (only if enabled)
                    spans = []
                    if search_all_citation_pipe_elements:
                        spans = _collect_hits_citation_other_exact(full_text, tgt_cite)
                        if spans:
                            method = "citation_pipe_other"
                            if echo:
                                print(f"{src_name} -> {tgt_name} - citation_pipe_other_elements - {len(spans)}")
                        else:
                            # f) other-elements relaxed (only if enabled)
                            spans = _collect_hits_citation_other_relaxed(full_text, tgt_cite)
                            if spans:
                                method = "relaxed_citation_other"
                                if echo:
                                    print(f"{src_name} -> {tgt_name} - relaxed_citation_pipe_other_elements - {len(spans)}")

    if not spans:
        return {"status":"no_hit","method":None,"snippets":[], "id_hit": False, "hit_count": 0}

    # Optional: add 'Id.' follow-up hits that occur BEFORE next 'v' token from each original hit
    id_hit_found = False
    if find_Id_hits:
        id_spans = []
        for (a,b) in spans:
            id_spans.extend(_collect_following_id_hits(full_text, b))
        if id_spans:
            id_hit_found = True
            spans.extend(id_spans)
            spans.sort()

    # Merge windows BEFORE extracting text to avoid overlapping snippets
    merged_windows = _merge_hit_windows(len(full_text), spans, total_chars=APPROX_CHAR_WINDOW)

    # Extract snippets up to cap
    snippets: List[str] = []
    for (L,R) in merged_windows[:MAX_SNIPPETS_PER_EDGE]:
        snip = _extract_by_bounds(full_text, L, R)
        if snip:
            snippets.append(snip)

    return {"status":"ok","method":method,"snippets":snippets, "id_hit": id_hit_found, "hit_count": len(spans)}

# =========================
# Write helper
# =========================
def write_snippets(session, src_id:int, tgt_id:int, method:str, snippets:List[str], *,
                   id_hit: bool,
                   clear_previous:bool=True):
    if clear_previous:
        cy = _mk_clear_snippets_cypher(max_snips=MAX_SNIPPETS_PER_EDGE)
        session.run(cy, {"src_id": src_id, "tgt_id": tgt_id}).consume()

    snip_map = {f"snippet_{i+1}": s for i, s in enumerate(snippets)}
    session.run(
        Q_WRITE_SNIPPETS_BASE,
        {
            "src_id": src_id,
            "tgt_id": tgt_id,
            "snippet_method": method,
            "snippet_count": len(snippets),
            "id_hit": bool(id_hit),
            "snippet_map": snip_map,
        }
    ).consume()

# =========================
# Driver
# =========================
def retrieve_all_snippets(*,
                          batch_size:int = BATCH_RELATIONSHIPS,
                          echo:bool = True,
                          force: Optional[bool] = None,
                          search_all_citation_pipe_elements: bool = False,
                          find_Id_hits: bool = False):
    """
    force:
        - True  -> clear previous snippet_* before writing (overwrite)
        - False -> skip clearing; new snippets still overwrite same-numbered snippet_* props
        - None  -> use global CLEAR_PREVIOUS_SNIPPETS
    search_all_citation_pipe_elements:
        - False (default) -> only search FIRST element of citation_pipe
        - True            -> after name strategies, also search OTHER elements (exact then relaxed)
    find_Id_hits:
        - False (default) -> do not scan 'Id.' follow-ups
        - True            -> add 'Id.' hits occurring before the next 'v'/'v.' token after a hit
    """
    clear_previous_effective = CLEAR_PREVIOUS_SNIPPETS if (force is None) else bool(force)

    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
    try:
        session = driver.session(database=NEO4J_DATABASE)
    except TypeError:
        session = driver.session()

    # counters
    processed = 0
    success_total = 0
    success_by_method = {
        "citation_pipe_first": 0,
        "relaxed_citation_first": 0,
        "exact_name": 0,
        "relaxed_name": 0,
        "citation_pipe_other": 0,
        "relaxed_citation_other": 0
    }
    unsuccessful = 0
    id_hit_total = 0

    last_report = time.time()
    after = -1

    if echo:
        print("Starting Opinion Snippet Retriever…")
        print(f"  · clear_previous_snippets = {clear_previous_effective}")
        print(f"  · search_all_citation_pipe_elements = {search_all_citation_pipe_elements}")
        print(f"  · find_Id_hits = {find_Id_hits}")

    with session as s:
        while True:
            rels = s.run(Q_PAGE_REL, {"after_id": after, "limit": batch_size}).data()
            if not rels:
                break

            for row in rels:
                res = retrieve_snippets_for_edge(
                    s, row,
                    search_all_citation_pipe_elements=search_all_citation_pipe_elements,
                    find_Id_hits=find_Id_hits,
                    echo=echo  # edge-level debug prints
                )
                if res["status"] == "ok" and res["snippets"]:
                    success_total += 1
                    if res["method"] in success_by_method:
                        success_by_method[res["method"]] += 1
                    if res.get("id_hit"): id_hit_total += 1

                    write_snippets(
                        s,
                        src_id=row["src_id"],
                        tgt_id=row["tgt_id"],
                        method=res["method"],
                        snippets=res["snippets"],
                        id_hit=res.get("id_hit", False),
                        clear_previous=clear_previous_effective
                    )
                else:
                    unsuccessful += 1
                    if echo:
                        src_nm = row.get("src_name", "")
                        tgt_nm = row.get("tgt_name", "")
                        if res["status"] == "no_hit":
                            print(f"{src_nm} -> {tgt_nm} - No hit found")
                        elif res["status"] == "no_source_chunks":
                            print(f"{src_nm} -> {tgt_nm} - No source opinion text")


                processed += 1
                after = row["rel_id"]

                # periodic report every N seconds
                now = time.time()
                if echo and (now - last_report) >= SECONDS_BETWEEN_REPORTS:
                    print(
                        f"[progress] edges processed: {processed} | "
                        f"success: {success_total} "
                        f"(first_exact:{success_by_method['citation_pipe_first']}, "
                        f"first_relaxed:{success_by_method['relaxed_citation_first']}, "
                        f"exact_name:{success_by_method['exact_name']}, "
                        f"relaxed_name:{success_by_method['relaxed_name']}, "
                        f"other_exact:{success_by_method['citation_pipe_other']}, "
                        f"other_relaxed:{success_by_method['relaxed_citation_other']}) | "
                        f"unsuccessful: {unsuccessful} | "
                        f"\"Id.\" hits: {id_hit_total}"
                    )
                    last_report = now

    driver.close()  # release pooled connections once the session has closed

    # final summary
    if echo:
        print("\n=== Final Summary ===")
        print(f"Edges processed: {processed}")
        print(f"Successful (hit found): {success_total}")
        print("  · By method → "
              f"first_exact: {success_by_method['citation_pipe_first']}, "
              f"first_relaxed: {success_by_method['relaxed_citation_first']}, "
              f"exact_name: {success_by_method['exact_name']}, "
              f"relaxed_name: {success_by_method['relaxed_name']}, "
              f"other_exact: {success_by_method['citation_pipe_other']}, "
              f"other_relaxed: {success_by_method['relaxed_citation_other']}")
        print(f"Unsuccessful (no hit / no source): {unsuccessful}")
        print(f"\"Id.\" hits recorded: {id_hit_total}")

In [8]:
# =========================
# Run
# =========================
retrieve_all_snippets(
    force=True,
    echo=True,
    find_Id_hits=False,
    search_all_citation_pipe_elements=True
)


Starting Opinion Snippet Retriever…
  · clear_previous_snippets = True
  · search_all_citation_pipe_elements = True
  · find_Id_hits = False
Alexander v. Choate -> Southeastern Community College v. Davis - exact_name - 3
Olmstead v. L.C. -> Southeastern Community College v. Davis - exact_name - 1
Olmstead v. L.C. -> Alexander v. Choate - exact_name - 1
Singh v. George Washington University School of Medicine & Health Sciences -> Southeastern Community College v. Davis - citation_pipe_other_elements - 1
[progress] edges processed: 27 | success: 27 (first_exact:23, first_relaxed:0, exact_name:3, relaxed_name:0, other_exact:1, other_relaxed:0) | unsuccessful: 0 | "Id." hits: 0
Keane, Judith v. Sears Roebuck -> US Airways, Inc. v. Barnett - exact_name - 1
State of Hawaii v. Usedu -> Christensen v. Harris County - relaxed_name - 1
Daniel Lopez v. Catalina Channel Express, Inc. -> US Airways, Inc. v. Barnett - citation_pipe_other_elements - 1
Felicia Pugliese v. State-Operated School Distric