---

# Case Opinion Summarization

This notebook writes an **opinion summary** to each `(:Case).opinion_summary` using **Amazon Bedrock – `mistral.mistral-7b-instruct-v0:2`**. Opinions that fit in one model call get a single paragraph; longer opinions get one summary paragraph per part.
It uses the **Mistral v0.2 tokenizer** locally (no model weights downloaded) for **token-accurate chunking** and calls Bedrock for generation.

---

## What the pipeline does

1. **Loads env & connects to Neo4j**
   Requires `.env` with `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`; `NEO4J_DATABASE` is optional and defaults to `neo4j`.

2. **Uses Bedrock for generation (no GPU needed locally)**
   Invokes `mistral.mistral-7b-instruct-v0:2` via `boto3` in the Region set by the `BEDROCK_REGION` environment variable (default `us-east-1`).

3. **Token-based chunking with Mistral tokenizer**

   * Downloads only the **tokenizer** (`mistralai/Mistral-7B-Instruct-v0.2`).
   * Splits long opinions into parts of ≤ **context_tokens − overhead − output_reserve**, with **80-token overlap**.

4. **Fetches and compacts opinion text**

   * If chunked: `(:Case)-[:HAS_OPINION_CHUNK]->(:OpinionChunk)` ordered by `chunk_index`.
   * Else: inline `c.opinion_text`.
   * Cleans hyphenations, line-number stubs, and collapses whitespace.

5. **Summarizes each part, then merges**
   Rubric: parties’ key arguments (≤6 sentences), court conclusion (≤3 sentences), major statutes (≤3 laws).
   Post-processing strips lead-ins such as “Here’s a summary…”, removes redaction boilerplate, and cuts the output at the last complete sentence so it ends with proper punctuation.

6. **Writes back to Neo4j**
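   Only `c.opinion_summary` is written. An opinion that fits in one call gets a single paragraph; for a chunked opinion the per-part summaries are joined, each introduced by “The first/second/… part of the opinion text is summarized as follows:”. The console ends with a run summary.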
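
The overlap-based splitting in step 3 can be sketched on a plain list of integers standing in for token ids. This is only an illustration: the helper name `split_ids` and the toy sizes (`max_tokens=10`, `overlap=3`) are assumptions chosen for readability, while the notebook itself works on real tokenizer output with an 80-token overlap.

```python
from typing import List


def split_ids(ids: List[int], max_tokens: int, overlap: int) -> List[List[int]]:
    """Split ids into windows of at most max_tokens that share `overlap` ids."""
    if len(ids) <= max_tokens:
        return [ids]
    step = max(1, max_tokens - overlap)  # how far each window advances
    windows = []
    for start in range(0, len(ids), step):
        end = min(len(ids), start + max_tokens)
        windows.append(ids[start:end])
        if end == len(ids):  # this window reached the end of the input
            break
    return windows


toy_ids = list(range(24))  # stand-in for the token ids of a long opinion
for window in split_ids(toy_ids, max_tokens=10, overlap=3):
    print(window[0], "...", window[-1], f"({len(window)} ids)")
```

Each window starts `max_tokens - overlap` ids after the previous one, so the last `overlap` ids of one part reappear at the start of the next.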

---

## What you’ll see in the logs

For each case:

```
- Case <CASE_ID> — <Case Name>
    · opinion too long → split into <K> parts (<= <max_ctx_tokens> tokens each)
      - finished part 1/<K>
      - finished part 2/<K>
      ...
    · wrote opinion_summary
```

If the opinion fits:

```
- Case <CASE_ID> — <Case Name>
    · wrote opinion_summary
```

Skips:

```
- Case <CASE_ID> — <Case Name>
    · SKIP (no opinion text)
```

or

```
- Case <CASE_ID> — <Case Name>
    · SKIP (already summarized; force=False)
```

End of run:

```
=== Summary ===
LLM backend: Bedrock Mistral (mistral.mistral-7b-instruct-v0:2, region=<REGION>)
Processed: <N> | Wrote: <M> | Skipped-existing: <A> | Skipped-empty: <B> | Chunked cases: <C>
Elapsed: <minutes> min
```

---

## How to run (SageMaker Studio / Notebook Instance)

1. **AWS prerequisites**

   * Bedrock access in your account/region (`bedrock:InvokeModel`).
   * Network egress to Bedrock (public internet access or proper VPC endpoints).
   * IAM execution role attached to the notebook with Bedrock permissions.

2. **Project setup**

   * Place a `.env` file (Neo4j creds) at project root or update the path.
   * Ensure your graph has `Case`, optional `OpinionChunk`, and relationships as described.

3. **Run the notebook cells top-to-bottom**
   The notebook will fetch the tokenizer automatically and then call Bedrock for text generation.
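
Before running the cells, it can help to confirm that the Neo4j settings are actually visible to the process. The check below is a small sketch, not part of the notebook: `missing_vars` is a hypothetical helper, and it assumes the `.env` file has already been loaded into the environment (for example with `load_dotenv`).

```python
import os
from typing import List, Mapping

# NEO4J_DATABASE is left out on purpose: the notebook falls back to "neo4j".
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_vars(env: Mapping[str, str], required=REQUIRED_VARS) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_vars(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required Neo4j settings are present.")
```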

---

## The `summarize(...)` function

**Signature (defaults):**

```python
summarize(
    max_new_tokens=500,      # generation cap per Bedrock call
    min_out_tokens=200,      # accepted for API compatibility; currently unused
    batch_cases=60,          # cases queried per Neo4j page
    echo=True,               # per-case logging
    force=False              # re-summarize even if opinion_summary exists
)
```

**Behavior:**

* Computes a conservative **token budget** based on Mistral v0.2 context, subtracting overhead and output reserve.
* Splits with **token overlap (80)** to preserve context across parts.
* Summarizes each part via Bedrock, merges, and writes `c.opinion_summary`.
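
The budget arithmetic can be written out as a short sketch. The constants 512 (prompt overhead) and 64 (extra output margin) mirror the notebook's calculator; the function name `token_budget` is hypothetical, and the 32,000-token context is the fallback the notebook uses when the tokenizer reports no usable limit, so the real figure may differ.

```python
def token_budget(model_max: int, max_new_tokens: int,
                 overhead: int = 512, output_margin: int = 64) -> int:
    """Tokens left for opinion text in a single call."""
    reserve_out = max_new_tokens + output_margin  # room kept for the answer
    return max(1024, model_max - overhead - reserve_out)


# 32000 - 512 - (500 + 64) = 30924
print(token_budget(32000, 500))
```

The `max(1024, ...)` floor keeps the budget usable even if the reported context size is unexpectedly small.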

---

## Assumptions & Notes

* **Tokenizer only** is downloaded from Hugging Face; all generation happens in Bedrock.
* Neo4j sessions are opened with **`DEPRECATION` notifications disabled** where the driver supports it (this hides the `id()` warning); driver loggers are also set to `ERROR` as a fallback.
* If many cases are very long, expect a handful of parts per case rather than many tiny chunks (token-aware splitting).
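
To get a feel for how many parts a long opinion produces, the count can be estimated from the token length alone. This is a rough sketch: `estimate_parts` is a hypothetical helper, and the 30,924-token budget assumes the 32,000-token fallback context with the default `max_new_tokens=500`.

```python
import math


def estimate_parts(n_tokens: int, max_tokens: int, overlap: int = 80) -> int:
    """Number of overlapping windows needed to cover n_tokens."""
    if n_tokens <= max_tokens:
        return 1
    step = max(1, max_tokens - overlap)  # new tokens covered per extra window
    return math.ceil((n_tokens - max_tokens) / step) + 1


print(estimate_parts(100_000, 30_924))  # a 100k-token opinion -> 4 parts
```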

---

## Troubleshooting

* **AccessDenied / Model not enabled** → Verify Bedrock access for `mistral.mistral-7b-instruct-v0:2` in your Region.
* **Networking** → If running in a VPC-only notebook, confirm NAT or Bedrock VPC endpoints.
* **Neo4j auth / TLS** → Check `.env` values and URI scheme (`neo4j+s://` for TLS).
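* **Throughput** → Increase `batch_cases` for faster DB paging; lower it if your DB is small/slow. One session stays open for a whole batch, so a smaller value also shortens how long each connection has to stay alive.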
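
For the TLS item, a quick look at the URI scheme often explains a failed connection. The sketch below classifies a URI by scheme only; `describe_uri` is a hypothetical helper, the host name is a placeholder, and it does not attempt a real connection.

```python
# Encrypted schemes; the "+ssc" variants also accept self-signed certificates.
TLS_SCHEMES = {"neo4j+s", "neo4j+ssc", "bolt+s", "bolt+ssc"}
PLAIN_SCHEMES = {"neo4j", "bolt"}


def describe_uri(uri: str) -> str:
    """Classify a Neo4j URI by its scheme: 'tls', 'plain', or 'unknown'."""
    scheme = uri.split("://", 1)[0].lower() if "://" in uri else ""
    if scheme in TLS_SCHEMES:
        return "tls"
    if scheme in PLAIN_SCHEMES:
        return "plain"
    return "unknown"


print(describe_uri("neo4j+s://example.databases.neo4j.io"))  # placeholder host
```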

---


In [19]:
!pip install neo4j boto3 python-dotenv transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [20]:
import os, re, time, pathlib, datetime as dt, json, logging
from typing import Optional, Dict, Any, List

import boto3
from dotenv import load_dotenv
from neo4j import GraphDatabase
from neo4j.exceptions import SessionExpired, ServiceUnavailable
from transformers import AutoTokenizer

# =========================
# Configuration
# =========================
BEDROCK_REGION   = os.getenv("BEDROCK_REGION", "us-east-1")
BEDROCK_MODEL_ID = "mistral.mistral-7b-instruct-v0:2"

# Neo4j connection
load_dotenv("../.env")
NEO4J_URI      = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

# Silence noisy driver logs as a fallback (in case notification suppression
# below isn't supported by your driver/server combo)
for _name in ("neo4j", "neo4j.notifications", "neo4j.work.simple"):
    logging.getLogger(_name).setLevel(logging.ERROR)

## Summary Prompt

In [21]:
SUMMARY_SYSTEM_PROMPT = """You are a legal assistant specializing in writing court case briefs.
Summarize the case by outlining the key arguments of both parties (max 6 sentences),
the court's conclusion (max 3 sentences) and the major statute law being applied (max 3 laws).
Do NOT repeat the instructions or case text. Output only one paragraph of case summary.

Example output:
'The plaintiff (the Clinic) argued that Bill 39-02 was enacted due to community hostility toward methadone clinics, violating Title II of the ADA and substantive due process rights.
The Defendant (Baltimore County) argued there was insufficient evidence that it regarded the Clinic's patients as disabled and challenged the district court's rulings and remedies.
The Fourth Circuit held that a reasonable jury could differ on whether the County regarded patients as disabled. It affirmed parts of the due process verdict but vacated the injunction. The case was affirmed in part, reversed in part, vacated in part, and remanded.
Major statute law applied: Title II of the Americans with Disabilities Act (42 U.S.C. § 12131 et seq.) and Substantive Due Process (Fourteenth Amendment)'
"""

USER_SUMMARY_PROMPT = """Summarize the case below by outlining the key arguments of both parties (max 6 sentences),
the court's conclusion (max 3 sentences) and the major statute law being applied (max 3 laws).
Case Name: {case_name}
Case Context:
{case_context}
"""

def _inst(system_text: str, user_text: str) -> str:
    return f"<s>[INST]{system_text}\n{user_text}[/INST]"


## Cypher Queries

In [22]:
# =========================
# Cypher Queries
# =========================
Q_PAGE_CASES = """
MATCH (c:Case)
WHERE id(c) > $after_id
RETURN id(c) AS nid,
       c.id AS case_id,
       coalesce(c.name, '') AS case_name,
       coalesce(c.opinion_summary, '') AS existing_summary
ORDER BY nid
LIMIT $limit
"""

Q_GET_CHUNKS = """
MATCH (c:Case {id:$case_id})-[:HAS_OPINION_CHUNK]->(oc:OpinionChunk)
RETURN oc.text AS text
ORDER BY oc.chunk_index
"""

Q_GET_INLINE = "MATCH (c:Case {id:$case_id}) RETURN c.opinion_text AS text"

Q_WRITE_SUMMARY = "MATCH (c:Case {id:$case_id}) SET c.opinion_summary = $summary RETURN c.id AS case_id"
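
`Q_PAGE_CASES` pages by internal id: each query asks for ids greater than the last one seen, ordered ascending, with a `LIMIT`. The same cursor logic can be sketched without a database. `page_ids` is a hypothetical helper and the ids are made up for illustration.

```python
from typing import Iterator, List


def page_ids(all_ids: List[int], limit: int) -> Iterator[List[int]]:
    """Yield pages of ids, each starting after the last id already seen."""
    ordered = sorted(all_ids)
    after_id = -1  # same starting cursor as the notebook
    while True:
        page = [i for i in ordered if i > after_id][:limit]
        if not page:
            return  # an empty page means there is nothing left
        yield page
        after_id = page[-1]  # move the cursor to the last id on this page


for page in page_ids([3, 10, 11, 42, 57], limit=2):
    print(page)
```

Because the cursor is an id rather than an offset, a page boundary stays stable even if earlier nodes are updated between queries.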

## Cleaning Functions

In [23]:
# =========================
# Cleaning & small utils
# =========================
LEADING_FLUFF_RE = re.compile(
    r"^\s*(certainly[, ]+)?(here(?:’|')?s|here is)\s+(a|the)\s+summary\s*[:\-–—]*\s*",
    re.IGNORECASE,
)
REDACTION_RE = re.compile(
    r"\b(portions of (?:the )?(?:document|legal document) have been redacted|redacted pursuant to)\b",
    re.IGNORECASE
)
_SENT_END_RE = re.compile(r'[.?!](?:["”\'\)\]]+)?')

def _strip_leading_fluff(text: str) -> str:
    return LEADING_FLUFF_RE.sub("", text).strip()

def _finalize_sentence(text: str) -> str:
    s = text.strip()
    last = None
    for m in _SENT_END_RE.finditer(s):
        last = m
    if last:
        s = s[: last.end()].rstrip()
    if not s or s[-1] not in ".!?":
        s = s.rstrip() + "."
    return s

def _compact(text: str) -> str:
    if not text:
        return text
    s = text.replace("\r", "\n").replace("\x0c", " ")
    s = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', s)
    lines = []
    for ln in s.splitlines():
        if re.match(r'^\s*\d{1,3}\s*$', ln):
            continue
        lines.append(ln)
    s = "\n".join(lines)
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)
    return s

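def _ordinal(n: int) -> str:
    words = ["first","second","third","fourth","fifth","sixth","seventh","eighth","ninth","tenth"]
    if 1 <= n <= 10:
        return words[n-1]
    # Numeric fallback: 11th-13th are irregular; otherwise the last digit decides (21st, 22nd, 23rd).
    suffix = "th" if 10 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"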
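
The effect of the text cleanup is easiest to see on a small sample. The sketch below repeats the same regular-expression steps as `_compact` in a standalone function (`clean_sample` is a hypothetical name, and the sample text is invented) so it can be run on its own.

```python
import re


def clean_sample(raw: str) -> str:
    """Rejoin hyphenated words, drop bare line numbers, tidy whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)       # defen-\ndant -> defendant
    lines = [ln for ln in text.splitlines()
             if not re.match(r"^\s*\d{1,3}\s*$", ln)]        # lines holding only a number
    text = re.sub(r"\s+", " ", "\n".join(lines)).strip()     # collapse all whitespace
    return re.sub(r"\s+([,.;:!?])", r"\1", text)             # no space before punctuation


raw = "The defen-\ndant moved to dis-\nmiss .\n 12 \nThe court   denied\nthe motion ."
print(clean_sample(raw))
```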

## Token-based chunking

In [24]:
# =========================
# Token-based chunking (Mistral tokenizer)
# =========================
# We only download the tokenizer; no model weights are fetched.
_MISTRAL_TOKENIZER_ID = "mistralai/Mistral-7B-Instruct-v0.2"
_tokenizer = AutoTokenizer.from_pretrained(_MISTRAL_TOKENIZER_ID, use_fast=True)

def _split_text_by_tokens(tokenizer, text: str, max_tokens: int, overlap: int = 50) -> List[str]:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) <= max_tokens:
        return [text]
    parts: List[str] = []
    step = max(1, max_tokens - overlap)
    for start in range(0, len(ids), step):
        end = min(len(ids), start + max_tokens)
        chunk_ids = ids[start:end]
        part = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        parts.append(_compact(part))
        if end == len(ids):
            break
    return parts

def _mistral_max_ctx_tokens(max_new_tokens: int) -> int:
    """
    Conservative context calculator for Mistral 7B Instruct v0.2 (~32k context).
    Reserve some tokens for formatting + output.
    """
    model_max = getattr(_tokenizer, "model_max_length", 32000)
    if model_max is None or model_max > 1_000_000_000:
        model_max = 32000
    overhead = 512
    reserve_out = max_new_tokens + 64
    return max(1024, model_max - overhead - reserve_out)

In [25]:
# =========================
# Bedrock client (lazy)
# =========================
_bedrock_client = None
def _get_bedrock_client():
    global _bedrock_client
    if _bedrock_client is None:
        _bedrock_client = boto3.client("bedrock-runtime", region_name=BEDROCK_REGION)
    return _bedrock_client

In [26]:
def summarize(
    *,
    max_new_tokens: int = 500,
    min_out_tokens: int = 200,   # best-effort (post-gen; unused but kept for API)
    batch_cases: int = 60,
    echo: bool = True,
    force: bool = False,
):
    """
    Summarizes each Case node into c.opinion_summary using Amazon Bedrock
    model 'mistral.mistral-7b-instruct-v0:2'.

    If a Neo4j connection error (SessionExpired / ServiceUnavailable / OSError)
    occurs for a given case, that case (and the remainder of that batch) are
    skipped and processing continues with a fresh session on the next batch.
    """
    client = _get_bedrock_client()

    # Build chunker based on tokens
    max_ctx_tokens = _mistral_max_ctx_tokens(max_new_tokens)

    def _split_context(opinion_text: str) -> List[str]:
        return _split_text_by_tokens(
            _tokenizer,
            opinion_text,
            max_tokens=max_ctx_tokens,
            overlap=80,
        )

    def _gen_chunk(case_name: str, chunk_text: str) -> str:
        user = USER_SUMMARY_PROMPT.format(
            case_name=case_name,
            case_context=chunk_text,
        )
        formatted_prompt = _inst(SUMMARY_SYSTEM_PROMPT, user)

        payload = {
            "prompt": formatted_prompt,
            "max_tokens": max_new_tokens,
            "temperature": 0,
        }
        resp = client.invoke_model(
            modelId=BEDROCK_MODEL_ID,
            body=json.dumps(payload),
            contentType="application/json",
            accept="application/json",
        )
        body = json.loads(resp["body"].read())
        text = body["outputs"][0]["text"]
        text = re.sub(r"\s+", " ", text).strip()
        text = _strip_leading_fluff(text)

        # Drop redaction boilerplate; sub() is a no-op when nothing matches
        text = REDACTION_RE.sub("", text).strip()
        return _finalize_sentence(text)

    # ---- Neo4j driver ----
    driver = GraphDatabase.driver(
        NEO4J_URI,
        auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    )

    # Try to disable only DEPRECATION notifications (cleaner logs)
    session_kwargs: Dict[str, Any] = {"database": NEO4J_DATABASE}
    try:
        from neo4j.notifications import NotificationDisabledCategory

        session_kwargs["notifications_disabled_categories"] = [
            NotificationDisabledCategory.DEPRECATION
        ]
    except Exception:
        # Older driver versions may not support this; ignore
        pass

    processed = wrote = sk_empty = 0
    skipped_existing = 0
    chunked_cases = 0
    after_id = -1
    t0 = time.time()
    done = False

    try:
        while not done:
            # Open a fresh session for each batch
            with driver.session(**session_kwargs) as s:
                session_had_error = False

                # --- Fetch next batch of cases ---
                try:
                    batch = s.run(
                        Q_PAGE_CASES,
                        {"after_id": after_id, "limit": batch_cases},
                    ).data()
                except (SessionExpired, ServiceUnavailable, OSError) as e:
                    session_had_error = True
                    if echo:
                        print(
                            f"\nConnection error while paging cases after id {after_id}: {e}"
                        )
                        print("    · Skipping this batch and retrying with a new session.")
                    # Go back to the top of the while loop, new session next time
                    continue

                if not batch:
                    if echo:
                        print("Done. No more cases.")
                    done = True
                    continue

                if echo:
                    print(f"\nBatch after internal id {after_id}: {len(batch)} case(s)")

                # --- Process each case in the batch ---
                for row in batch:
                    nid = row["nid"]
                    case_id = row["case_id"]
                    case_name = row["case_name"] or ""
                    existing = (row.get("existing_summary") or "").strip()
                    after_id = nid  # important for paging

                    # Skip if already summarized (unless force=True)
                    if (not force) and existing:
                        skipped_existing += 1
                        processed += 1
                        if echo:
                            print(
                                f"- Case {case_id} — {case_name[:90] or '[no name]'}"
                            )
                            print("    · SKIP (already summarized; force=False)")
                        continue

                    # --- Read opinion text from Neo4j (with error handling) ---
                    try:
                        rows = s.run(
                            Q_GET_CHUNKS, {"case_id": case_id}
                        ).data()
                    except (SessionExpired, ServiceUnavailable, OSError) as e:
                        session_had_error = True
                        if echo:
                            print(
                                f"- Case {case_id} — {case_name[:90] or '[no name]'}"
                            )
                            print(
                                f"    · SKIP (Neo4j read error while getting chunks: {e})"
                            )
                            print("    · Will reopen session for remaining cases.")
                        # Break out of this batch; new session next loop
                        break

                    if rows:
                        parts_raw = [
                            _compact((r.get("text") or ""))
                            for r in rows
                            if (r.get("text") or "").strip()
                        ]
                        opinion_text = "\n\n".join(parts_raw)
                    else:
                        # Need inline text; may also fail
                        try:
                            r = s.run(
                                Q_GET_INLINE, {"case_id": case_id}
                            ).single()
                        except (SessionExpired, ServiceUnavailable, OSError) as e:
                            session_had_error = True
                            if echo:
                                print(
                                    f"- Case {case_id} — {case_name[:90] or '[no name]'}"
                                )
                                print(
                                    f"    · SKIP (Neo4j read error while getting inline text: {e})"
                                )
                                print("    · Will reopen session for remaining cases.")
                            break

                        raw = (r["text"] or "") if r and r.get("text") else ""
                        opinion_text = _compact(raw) if raw else ""

                    if not opinion_text:
                        sk_empty += 1
                        processed += 1
                        if echo:
                            print(
                                f"- Case {case_id} — {case_name[:90] or '[no name]'}"
                            )
                            print("    · SKIP (no opinion text)")
                        continue

                    if echo:
                        print(
                            f"- Case {case_id} — {case_name[:90] or '[no name]'}"
                        )

                    # --- Token-based chunking ---
                    parts = _split_context(opinion_text)
                    if len(parts) > 1:
                        chunked_cases += 1  # count chunked cases even when echo=False
                        if echo:
                            print(
                                f"    · opinion too long → split into {len(parts)} parts "
                                f"(<= {max_ctx_tokens} tokens each)"
                            )

                    # --- Summarize all parts (LLM only; no Neo4j calls) ---
                    part_summaries: List[str] = []
                    for idx, ctx in enumerate(parts, 1):
                        psum = _gen_chunk(case_name, ctx)
                        part_summaries.append(psum)
                        if echo and len(parts) > 1:
                            print(f"      - finished part {idx}/{len(parts)}")

                    # Merge partial summaries
                    if len(part_summaries) == 1:
                        merged_summary = part_summaries[0]
                    else:
                        merged_lines = []
                        for i, ps in enumerate(part_summaries, 1):
                            tag = _ordinal(i)
                            merged_lines.append(
                                f"The {tag} part of the opinion text is summarized as follows:\n{ps}"
                            )
                        merged_summary = "\n\n".join(merged_lines).strip()

                    # --- Write summary back to Neo4j (with error handling) ---
                    try:
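                        # consume() waits for the write to finish, so a failure
                        # surfaces inside this try block rather than later
                        s.run(
                            Q_WRITE_SUMMARY,
                            {"case_id": case_id, "summary": merged_summary},
                        ).consume()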
                    except (SessionExpired, ServiceUnavailable, OSError) as e:
                        session_had_error = True
                        if echo:
                            print(
                                f"    · SKIP (Neo4j write error on case {case_id}: {e})"
                            )
                            print(
                                "    · This case summary may not have been saved; "
                                "reopening session for remaining cases."
                            )
                        # Do not count as wrote; break out to reopen session
                        break

                    wrote += 1
                    processed += 1
                    if echo:
                        print("    · wrote opinion_summary")

                # End for row in batch

                if session_had_error:
                    # We broke out of the for-loop due to a connection error.
                    # Outer while-loop will open a new session and continue from
                    # the current `after_id` (effectively skipping this case).
                    continue

        # End while not done

    finally:
        driver.close()

    dt_sec = time.time() - t0
    print("\n=== Summary ===")
    print(
        f"LLM backend: Bedrock Mistral ({BEDROCK_MODEL_ID}, region={BEDROCK_REGION})"
    )
    print(
        "Processed: {p} | Wrote: {w} | "
        "Skipped-existing: {se} | Skipped-empty: {sk} | Chunked cases: {cc}".format(
            p=processed,
            w=wrote,
            se=skipped_existing,
            sk=sk_empty,
            cc=chunked_cases,
        )
    )
    print(f"Elapsed: {dt_sec/60:.1f} min")


In [27]:
# =========================
# Example usage
# =========================
summarize(force=False, batch_cases=60, max_new_tokens=500, min_out_tokens=200)


Batch after internal id -1: 60 case(s)
- Case 187059 — Singh v. George Washington University School of Medicine & Health Sciences
    · SKIP (already summarized; force=False)
- Case 2349194 — Napreljac v. John Q. Hammons Hotels, Inc.
    · SKIP (already summarized; force=False)
- Case 3039638 — Raymond Battle v. UPS
    · SKIP (already summarized; force=False)
- Case 793399 — Raymond Battle, Plaintiff-Appellee/cross-Appellant v. United Parcel Service, Inc., Defenda
    · SKIP (already summarized; force=False)
- Case 2450784 — Gretillat v. Care Initiatives
    · SKIP (already summarized; force=False)
- Case 2567088 — Shape v. Barnes County, ND
    · SKIP (already summarized; force=False)
- Case 2998138 — Keane, Judith v. Sears Roebuck
    · SKIP (already summarized; force=False)
- Case 791297 — Equal Employment Opportunity Commission, and Judith Keane, Intervening v. Sears, Roebuck &
    · SKIP (already summarized; force=False)
- Case 2408258 — Burrell v. Cummins Great Plains, Inc.
   

[#BD66]  _: <CONNECTION> error: Failed to read from defunct connection IPv4Address(('si-df1a98d5-fd3b.production-orch-0696.neo4j.io', 7687)) (ResolvedIPv4Address(('34.28.184.63', 7687))): OSError('No data')


      - finished part 31/31
    · SKIP (Neo4j write error on case 4040130: Failed to read from defunct connection IPv4Address(('si-df1a98d5-fd3b.production-orch-0696.neo4j.io', 7687)) (ResolvedIPv4Address(('34.28.184.63', 7687))))
    · This case summary may not have been saved; reopening session for remaining cases.

Batch after internal id 3865: 60 case(s)
- Case 117835 — Ticor Title Insurance v. Brown
    · wrote opinion_summary
- Case 7218418 — Potomac Electric Power Co. v. United States
    · wrote opinion_summary
- Case 2592996 — Seward v. B.O.C. Division of General Motors Corp.
    · wrote opinion_summary
- Case 4762369 — Brownfield v. City of Yakima
    · wrote opinion_summary
- Case 8737217 — Bruggeman v. Blagojevich
    · wrote opinion_summary
- Case 6935679 — Brumley v. Pena
    · wrote opinion_summary
- Case 4560204 — Brumley, Melissa v. United Parcel Service, Inc.
    · wrote opinion_summary
- Case 6927205 — Brunet v. City of Columbus
    · wrote opinion_summary
- Case 303