# Ontario Damages Compendium - Hybrid Camelot + LLM Extraction

This notebook uses a hybrid approach:
1. **Camelot** extracts tables from PDF (better table detection)
2. **LLM** parses each row (handles multi-plaintiff cases and complex data)

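Camelot does the table detection, and the LLM handles the messy cell contents (multiple plaintiffs, damage breakdowns) that are hard to parse with fixed rules.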

## 1. Setup and Imports

In [1]:
import json
import os
import warnings
from pathlib import Path

from dotenv import load_dotenv

from damages_parser_table import parse_compendium_tables
from data_transformer import add_embeddings_to_cases

# Silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

# Configuration
PDF_PATH = "2024damagescompendium.pdf"
OUTPUT_JSON = "damages_table_based.json"
DASHBOARD_JSON = "data/damages_with_embeddings.json"

# Azure configuration, read from the environment (or a local .env file)
load_dotenv()
ENDPOINT = os.getenv("ENDPOINT")
API_KEY = os.getenv("API_KEY")
MODEL = os.getenv("MODEL")

# Create data directory
Path("data").mkdir(exist_ok=True)

print("✅ Imports complete")

✅ Imports complete
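
Because the three Azure settings are read with `os.getenv`, a missing variable silently becomes `None` and only surfaces later as a failed request. A minimal sketch of an early check follows; `missing_settings` is a hypothetical helper, not part of `damages_parser_table`, and the example values are placeholders.

```python
import os

REQUIRED_SETTINGS = ("ENDPOINT", "API_KEY", "MODEL")


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


# Example with an explicit mapping instead of the real environment
example_env = {"ENDPOINT": "https://example.invalid", "API_KEY": "", "MODEL": "gpt-5-chat"}
print(missing_settings(example_env))  # ['API_KEY']
```

Calling `missing_settings()` with no arguments checks the real environment, so it could run right after the configuration cell and before any parsing starts.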


## 2. Test Table Extraction (Small Sample)

Let's first test on a small page range to verify the approach works:

In [2]:
# Test on just 5 pages first
test_cases = parse_compendium_tables(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json="test_output.json",
    start_page=1,
    end_page=5,
    verbose=True
)

print(f"\n✅ Test complete: {len(test_cases)} cases extracted")

# View a sample case
if test_cases:
    print("\nSample case:")
    print(json.dumps(test_cases[0], indent=2))

Rate limiting: 200 requests/minute
Parsing pages 1-5
Using Camelot table extraction + LLM row parsing
Model: gpt-5-chat

📄 Extracting section headers with stream mode...
✅ Found 2 section headers from stream mode

📄 Extracting tables with lattice mode...
✅ Extracted 3 tables from lattice mode

Page 1... SKIP - headers: ['COMPENDIUM OF DAMAGES AWARDED IN', 'PERSONAL INJURY ACTIONS ACROSS ONTARIO', 'JANUARY 1999 - OCTOBER 2024', 'THE HONOURABLE JUSTICE JAYE HOOPER', 'AND THE HONOURABLE JAMES B. CHADWICK, Q.C.']

Page 4... 
DEBUG Type 3:
  row1_cell0: 'Plaintiff'
  num_filled_row1: 10
  row1_values: ['Plaintiff', 'Defendant', 'Year']
Headers: ['Plaintiff', 'Defendant', 'Year', 'Citation', 'Court'], data_start: 2, df_len: 6
4 rows, 4 new, 0 merged

Page 5... 
DEBUG Type 3:
  row1_cell0: 'Plaintiff \nDefendant \nYear \nCitation \nCourt \nJudge \nSex \nNon-Pecuniary \nOther Damages \nComments \nAge'
  num_filled_row1: 1
  row1_values: ['Plaintiff \nDefendant \nYear \nCitation \nCou
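
The page 5 debug output shows every column name collapsed into a single cell (`num_filled_row1: 1`), separated by newlines. One way such a cell could be recovered is sketched below. `split_collapsed_header` is a hypothetical helper, and the actual handling inside `damages_parser_table` may differ.

```python
def split_collapsed_header(cell):
    """Split a header cell whose column names were fused with newlines."""
    return [part.strip() for part in cell.split("\n") if part.strip()]


# The row1_cell0 value reported for page 5 in the debug output
cell = ('Plaintiff \nDefendant \nYear \nCitation \nCourt \nJudge \nSex \n'
        'Non-Pecuniary \nOther Damages \nComments \nAge')
headers = split_collapsed_header(cell)
print(len(headers))  # 11
print(headers[:3])   # ['Plaintiff', 'Defendant', 'Year']
```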

## 3. Parse Full PDF

Once the test looks good, parse the entire PDF:

In [None]:
# Parse full PDF
cases = parse_compendium_tables(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json=OUTPUT_JSON,
    verbose=True,
    requests_per_minute=200  # Azure rate limit
)

print(f"\n✅ Parsed {len(cases)} cases")

Rate limiting: 200 requests/minute
Parsing pages all
Using Camelot table extraction + LLM row parsing
Model: gpt-5-chat

📄 Extracting section headers with stream mode...
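
`requests_per_minute=200` implies a spacing of 60 / 200 = 0.3 seconds between LLM calls. How the parser enforces the limit is not shown in this notebook; the following is a minimal sketch of one common technique, minimum-interval throttling, with `MinIntervalLimiter` as a hypothetical name.

```python
import time


class MinIntervalLimiter:
    """Space calls so that at most `requests_per_minute` start per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request slot and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


limiter = MinIntervalLimiter(requests_per_minute=200)
print(limiter.interval)  # 0.3
```

Calling `limiter.wait()` immediately before each request keeps the request rate at or below the configured limit. The injectable `clock` and `sleep` arguments exist only so the behaviour can be checked without real delays.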


## 4. Parse Specific Page Range

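Alternatively, parse a specific page range to resume an interrupted run or to test one section. The cell below is wrapped in a string literal so it does not run by default; remove the surrounding `'''` to enable it: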

In [None]:
'''# Parse specific range
cases = parse_compendium_tables(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json=OUTPUT_JSON,
    start_page=10,
    end_page=50,
    verbose=True
)

print(f"\n✅ Parsed pages 10-50: {len(cases)} cases")'''
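
For a long PDF, resuming is easier if the page range is processed in fixed-size chunks. The sketch below only computes the chunk boundaries; `page_chunks` is a hypothetical helper. Whether `parse_compendium_tables` appends to or overwrites an existing `output_json` is not shown in this notebook, so writing each chunk to its own file is the safer assumption.

```python
def page_chunks(first_page, last_page, chunk_size):
    """Yield inclusive (start_page, end_page) pairs covering the range."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    start = first_page
    while start <= last_page:
        end = min(start + chunk_size - 1, last_page)
        yield start, end
        start = end + 1


print(list(page_chunks(10, 50, 20)))  # [(10, 29), (30, 49), (50, 50)]
```

Each pair can be passed as `start_page` and `end_page`, so a failed run only needs to repeat the chunk that was in progress.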

## 5. Generate Embeddings for Dashboard

Convert parsed cases to dashboard format with embeddings:

In [None]:
# Convert to dashboard format and generate embeddings
dashboard_cases = add_embeddings_to_cases(
    OUTPUT_JSON,
    DASHBOARD_JSON
)

print(f"\n✅ Created {len(dashboard_cases)} dashboard-ready cases")
print(f"\n📁 Saved to:")
print(f"  - Raw parsed: {OUTPUT_JSON}")
print(f"  - Dashboard: {DASHBOARD_JSON}")

## 6. Analyze Results

In [None]:
# Load and analyze
with open(OUTPUT_JSON) as f:
    cases = json.load(f)

print(f"📊 Statistics:")
print(f"  Total cases: {len(cases):,}")

# Count multi-plaintiff cases
multi_plaintiff = sum(1 for c in cases if len(c.get('plaintiffs', [])) > 1)
print(f"  Multi-plaintiff cases: {multi_plaintiff:,}")

# Count cases with damages
with_damages = sum(1 for c in cases if c.get('non_pecuniary_damages'))
print(f"  Cases with damages: {with_damages:,}")

# Count by category
categories = {}
for c in cases:
    cat = c.get('category', 'UNKNOWN')
    categories[cat] = categories.get(cat, 0) + 1

print("\n🏥 Top categories:")
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {cat}: {count:,}")
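
The manual dictionary count above can also be written with `collections.Counter`. This is an equivalent sketch rather than a required change. `top_categories` is a hypothetical helper, the category names are made up, and unlike the loop above it also maps empty or `None` categories to `UNKNOWN`.

```python
from collections import Counter


def top_categories(cases, n=10):
    """Return the n most common (category, count) pairs."""
    counts = Counter(c.get("category") or "UNKNOWN" for c in cases)
    return counts.most_common(n)


# Made-up records, for illustration only
sample = [{"category": "HEAD"}, {"category": "SPINE"}, {"category": "HEAD"}, {}]
print(top_categories(sample, n=2))  # [('HEAD', 2), ('SPINE', 1)]
```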

## 7. View Sample Cases

In [None]:
# Display sample cases
print("\n📋 Sample Cases:")
print("=" * 80)

for i, case in enumerate(cases[:3], 1):
    print(f"\nCase {i}:")
    print(f"  Case Name: {case.get('case_name')}")
    print(f"  Category: {case.get('category')}")
    print(f"  Year: {case.get('year')}")
    print(f"  Court: {case.get('court')}")
    print(f"  Judge: {case.get('judge')}")
    
    if case.get('plaintiffs'):
        print(f"  Plaintiffs: {len(case['plaintiffs'])}")
        for p in case['plaintiffs']:
            amount = p.get('non_pecuniary_damages')
            label = f"    - {p.get('plaintiff_id')}"
            # Show the award only when one was recorded for this plaintiff
            print(f"{label}: ${amount:,}" if amount else label)
    
    if case.get('injuries'):
        print(f"  Injuries: {', '.join(case['injuries'][:3])}")
    
    print("-" * 80)

## Next Steps

1. Run the Streamlit app: `streamlit run streamlit_app.py`
2. Test the search functionality with various injury descriptions
3. Verify that multi-plaintiff cases are handled correctly

## Why This Approach Works Better

**Camelot Table Extraction:**
- Better at detecting table boundaries
- Handles complex table layouts
- More reliable than pdfplumber for structured tables

**LLM Row Parsing:**
- Handles multiple plaintiffs in one cell
- Extracts complex damage breakdowns
- Normalizes judge names
- Detects continuation rows

**Cost Effective:**
- Only sends row text to LLM (not full pages)
- 10-50x cheaper than full-page approaches
- Works well with lighter models (gpt-5-nano, gpt-4o-mini)
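
To illustrate the "only sends row text" point, here is a minimal sketch of turning one extracted table row into a compact text block for a prompt. `row_to_text` is a hypothetical helper and the sample row is invented; the prompt format actually used by `damages_parser_table` is not shown in this notebook.

```python
def row_to_text(headers, cells):
    """Render one table row as 'Header: value' lines, skipping empty cells."""
    lines = []
    for header, cell in zip(headers, cells):
        value = " ".join(str(cell).split())  # collapse newlines and repeated spaces
        if value:
            lines.append(f"{header}: {value}")
    return "\n".join(lines)


headers = ["Plaintiff", "Defendant", "Year", "Non-Pecuniary"]
cells = ["Doe\n(two plaintiffs)", "Roe", "2020", ""]  # invented row
print(row_to_text(headers, cells))
```

Because empty cells are dropped and whitespace is collapsed, the text sent per row stays short, which is where the cost saving over full-page prompts comes from.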

## Generate Injury-Focused Embeddings for RAG Search

Create embeddings for semantic search focused only on injuries and sequelae:

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
from pathlib import Path
import tqdm

# Load the dashboard cases with embeddings
with open(DASHBOARD_JSON, "r", encoding="utf-8") as f:
    cases = json.load(f)

# Use the same embedding model
emb_model = SentenceTransformer("all-MiniLM-L6-v2")

# Build injury-focused search_text and embeddings
ids = []
inj_embs = []
out_cases = []

for c in tqdm.tqdm(cases, desc="Generate injury-focused embeddings"):
    # Build search_text from injuries + sequelae only
    ext = c.get("extended_data", {}) or {}
    injuries = ext.get("injuries") or []
    
    # join injuries into concise search text
    search_text = "; ".join(injuries) if injuries else ""
    
    # fallback if no injuries
    if not search_text:
        case_name = c.get("case_name", "")
        if case_name:
            search_text = case_name
        else:
            search_text = "case"
    
    c['search_text'] = search_text
    
    # Compute embedding
    emb = emb_model.encode(search_text).astype("float32")
    c['inj_emb'] = emb.tolist()
    
    ids.append(c['id'])
    inj_embs.append(emb)
    out_cases.append(c)

# Save artifacts for RAG search
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# Save cases with search_text and embeddings
with open(data_dir / "compendium_inj.json", "w", encoding="utf-8") as f:
    json.dump(out_cases, f, ensure_ascii=False, indent=2)

# Save embedding matrix for fast load
emb_matrix = np.vstack(inj_embs)
np.save(data_dir / "embeddings_inj.npy", emb_matrix)

# Save case IDs for mapping
with open(data_dir / "ids.json", "w", encoding="utf-8") as f:
    json.dump(ids, f)

print(f"✅ Created {len(out_cases)} injury-focused embeddings")
print(f"   - compendium_inj.json: {(data_dir / 'compendium_inj.json').stat().st_size / 1024 / 1024:.1f} MB")
print(f"   - embeddings_inj.npy: {(data_dir / 'embeddings_inj.npy').stat().st_size / 1024 / 1024:.1f} MB")
print(f"   - ids.json: {(data_dir / 'ids.json').stat().st_size / 1024:.1f} KB")
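
Once the injury-embedding cell has written `embeddings_inj.npy` and `ids.json`, a search reduces to ranking the stored vectors by cosine similarity against an embedded query. The sketch below is dependency-free and uses toy vectors; `cosine_top_k` is a hypothetical helper. In practice the same ranking would be a NumPy matrix-vector product over the saved matrix, with the query encoded by the same `all-MiniLM-L6-v2` model.

```python
import math


def cosine_top_k(query_vec, matrix, ids, k=5):
    """Return up to k (case_id, similarity) pairs, best match first."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query_vec)
    scored = []
    for case_id, row in zip(ids, matrix):
        denom = q_norm * norm(row)
        # Zero-length vectors get similarity 0.0 instead of dividing by zero
        score = sum(a * b for a, b in zip(query_vec, row)) / denom if denom else 0.0
        scored.append((case_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors standing in for real 384-dimensional embeddings
ids = ["a", "b", "c"]
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print([cid for cid, _ in cosine_top_k([1.0, 0.1], matrix, ids, k=2)])  # ['a', 'c']
```

The order of `ids` has to match the row order of the embedding matrix, which is why the notebook saves both from the same loop.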