# Ontario Damages Compendium Parser - Gemini Version

This notebook demonstrates how to parse the Ontario Damages Compendium PDF using Google's Gemini API.

## Features

- Intelligent case parsing with Gemini 2.0 Flash
- Multi-plaintiff support
- Family Law Act claims extraction
- Checkpoint/resume functionality
- Automatic embedding generation
- Dashboard-compatible output

## Setup

In [None]:
# Install dependencies (if needed)
# !pip install pdfplumber requests pandas sentence-transformers -q

In [None]:
# Import the parser modules
from damages_parser_gemini import (
    parse_compendium,
    DamagesCompendiumParser,
    PDFTextExtractor,
    flatten_cases_to_records
)
from gemini_data_transformer import (
    add_embeddings_to_gemini_cases,
    extract_gemini_statistics,
    convert_gemini_to_dashboard_format
)
import json
import pandas as pd
from pathlib import Path

In [None]:
# Configuration
API_KEY = "YOUR_GEMINI_API_KEY_HERE"  # ⚠️ Replace with your actual API key

# PDF path (download from https://cdn.ymaws.com/www.ccla-abcc.ca/resource/resmgr/pp-civlit/2024damagescompendium.pdf)
PDF_PATH = "2024damagescompendium.pdf"

# Output paths
GEMINI_JSON = "damages_full.json"  # Raw Gemini output
DASHBOARD_JSON = "data/damages_with_embeddings.json"  # Dashboard-compatible format

## Download the PDF (if needed)

In [None]:
import requests

if not Path(PDF_PATH).exists():
    print("Downloading PDF...")
    url = "https://cdn.ymaws.com/www.ccla-abcc.ca/resource/resmgr/pp-civlit/2024damagescompendium.pdf"
    response = requests.get(url, timeout=120)
    response.raise_for_status()  # Fail loudly instead of saving an HTTP error page as the PDF
    Path(PDF_PATH).write_bytes(response.content)
    print(f"Downloaded {len(response.content) / 1024 / 1024:.1f} MB")
else:
    print(f"PDF already exists: {PDF_PATH}")

## Full Parse (All Pages)

**Estimated time:** 30-60 minutes  
**Estimated cost:** $0.20-$0.50 with Gemini 2.0 Flash

The parser saves a checkpoint after each page. If the API quota runs out or an error occurs, use the resume cell below.

In [None]:
# Parse entire PDF (fresh start)
# ⚠️ Only run this cell if you want to start from scratch!

all_cases = parse_compendium(
    PDF_PATH,
    api_key=API_KEY,
    output_json=GEMINI_JSON
)

print(f"\n✅ Parsed {len(all_cases)} cases")

## Resume After Interruption

If the parser stopped (API quota, network error, etc.), run this cell to **resume from where it left off**.

It automatically:
- Reads the checkpoint file to find the last processed page
- Goes back 1 page for safety (in case the last page was incomplete)
- Loads existing parsed cases
- Skips duplicates
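
The resume steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual logic inside `parse_compendium`. The helper names `resume_start_page` and `merge_without_duplicates`, the 1-based page numbering, and the use of `(case_name, year)` as the duplicate key are all assumptions made for the example. Only the `last_page` field is taken from the checkpoint file this notebook reads.

```python
def resume_start_page(checkpoint, safety_margin=1):
    """Return the page to resume from (hypothetical helper, 1-based pages assumed).

    Without a checkpoint, parsing starts at page 1. Otherwise the parser would
    normally continue at last_page + 1; stepping back by `safety_margin` pages
    re-processes the last page in case it was only partially parsed.
    """
    if not checkpoint:
        return 1
    last_page = checkpoint.get("last_page") or 0
    return max(1, last_page + 1 - safety_margin)


def merge_without_duplicates(existing_cases, new_cases):
    """Append new cases, skipping any whose (case_name, year) was already seen.

    The duplicate key is an assumption; the real parser may use another one.
    """
    merged = list(existing_cases)
    seen = {(c.get("case_name"), c.get("year")) for c in merged}
    for case in new_cases:
        key = (case.get("case_name"), case.get("year"))
        if key not in seen:
            seen.add(key)
            merged.append(case)
    return merged
```

Because the last page is parsed a second time, the duplicate check is what keeps its cases from appearing twice in the output.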

In [None]:
# RESUME from checkpoint (run this if parsing was interrupted)

all_cases = parse_compendium(
    PDF_PATH,
    api_key=API_KEY,
    output_json=GEMINI_JSON,
    resume=True  # <-- This resumes from checkpoint
)

print(f"\n✅ Total cases: {len(all_cases)}")

In [None]:
# Check checkpoint status

if Path("parsing_checkpoint.json").exists():
    with open("parsing_checkpoint.json") as f:
        checkpoint = json.load(f)
    print("Last checkpoint:")
    print(f"  Page: {checkpoint.get('last_page')}")
    print(f"  Cases: {checkpoint.get('total_cases')}")
    print(f"  Time: {pd.to_datetime(checkpoint.get('timestamp'), unit='s')}")
else:
    print("No checkpoint found - start fresh")

## Generate Embeddings for Dashboard

Convert Gemini-parsed cases to dashboard format with embeddings.

In [None]:
# Convert to dashboard format and generate embeddings
# This may take 5-10 minutes for full dataset

dashboard_cases = add_embeddings_to_gemini_cases(
    GEMINI_JSON,
    DASHBOARD_JSON
)

print(f"\n✅ Created {len(dashboard_cases)} dashboard-ready cases")
print("\n📁 Saved to:")
print(f"  - Raw Gemini: {GEMINI_JSON}")
print(f"  - Dashboard: {DASHBOARD_JSON}")

## Analyze Results

In [None]:
# Load from JSON if already parsed
with open(GEMINI_JSON, "r") as f:
    cases = json.load(f)

print(f"Total cases: {len(cases)}")

In [None]:
# Extract statistics
stats = extract_gemini_statistics(cases)

print("📊 Statistics:")
print(f"  Total cases: {stats['total_cases']:,}")
print(f"  Total plaintiffs: {stats['total_plaintiffs']:,}")
print(f"  Multi-plaintiff cases: {stats['multi_plaintiff_count']:,}")
print(f"  Family Law Act cases: {stats['family_law_act_count']:,}")

print("\n💰 Damages statistics:")
print(f"  Count: {stats['damages_stats']['count']:,}")
print(f"  Mean: ${stats['damages_stats']['mean']:,.0f}")
print(f"  Median: ${stats['damages_stats']['median']:,.0f}")
print(f"  Min: ${stats['damages_stats']['min']:,.0f}")
print(f"  Max: ${stats['damages_stats']['max']:,.0f}")

print("\n🏥 Top categories:")
for cat, count in list(stats['categories'].items())[:10]:
    print(f"  {cat}: {count:,}")

## Convert to DataFrame

Flatten the nested structure for ML/analysis:

In [None]:
# Flatten to DataFrame
records = flatten_cases_to_records(cases)
df = pd.DataFrame(records)

print(f"DataFrame shape: {df.shape}")
df.head()
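
For orientation, flattening a nested case into one record per plaintiff might look like the sketch below. It is not the implementation of `flatten_cases_to_records`, whose output columns may differ. The function name `flatten_sketch` is hypothetical, and the field names are simply the ones used elsewhere in this notebook (`case_name`, `year`, `category`, `plaintiffs`, `plaintiff_id`, `sex`, `age`, `non_pecuniary_damages`).

```python
def flatten_sketch(cases):
    """Return one flat dict per plaintiff, repeating the case-level fields.

    Hypothetical illustration only; the real flattener may emit other columns.
    """
    records = []
    for case in cases:
        # Case-level fields are copied onto every plaintiff row.
        shared = {key: case.get(key) for key in ("case_name", "year", "category")}
        for plaintiff in case.get("plaintiffs", []):
            records.append({
                **shared,
                "plaintiff_id": plaintiff.get("plaintiff_id"),
                "sex": plaintiff.get("sex"),
                "age": plaintiff.get("age"),
                "non_pecuniary_damages": plaintiff.get("non_pecuniary_damages"),
            })
    return records
```

With this layout a case with two plaintiffs yields two rows, so per-plaintiff awards can be aggregated directly with `groupby`.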

In [None]:
# Summary statistics
print("Non-pecuniary damages statistics:")
print(df['non_pecuniary_damages'].describe())

print("\nBy category (top 10):")
print(
    df.groupby('category')['non_pecuniary_damages']
    .agg(['count', 'mean', 'median'])
    .sort_values('count', ascending=False)
    .head(10)
)

In [None]:
# Save to CSV for ML
csv_path = "damages_flattened.csv"
df.to_csv(csv_path, index=False)
print(f"Saved to {csv_path}")

## View Multi-Plaintiff Cases

In [None]:
# Find cases with multiple plaintiffs
multi_plaintiff_cases = [c for c in cases if len(c.get('plaintiffs', [])) > 1]

print(f"Found {len(multi_plaintiff_cases)} multi-plaintiff cases\n")

if multi_plaintiff_cases:
    example = multi_plaintiff_cases[0]
    print(f"Example: {example.get('case_name')}")
    print(f"Year: {example.get('year')}")
    print(f"Category: {example.get('category')}")
    print("\nPlaintiffs:")
    for p in example.get('plaintiffs', []):
        damages = p.get('non_pecuniary_damages')
        label = f"  {p.get('plaintiff_id')}: {p.get('sex')} {p.get('age')} years"
        # Compare against None so a recorded award of $0 is still printed
        print(f"{label} - ${damages:,.2f}" if damages is not None else label)

## View Family Law Act Claims

In [None]:
# Cases with FLA claims
fla_cases = [c for c in cases if c.get('family_law_act_claims')]

print(f"Found {len(fla_cases)} cases with Family Law Act claims\n")

if fla_cases:
    example = fla_cases[0]
    print(f"Example: {example.get('case_name')}")
    print("\nFLA Claims:")
    for claim in example.get('family_law_act_claims', []):
        amt = claim.get('amount')
        desc = claim.get('description')
        # Compare against None so a $0 claim is not reported as unspecified
        print(f"  {desc}: ${amt:,.2f}" if amt is not None else f"  {desc}: amount not specified")

## Test Dashboard Integration

Verify that the dashboard can load the data.

In [None]:
# Test loading dashboard format
with open(DASHBOARD_JSON) as f:
    dashboard_data = json.load(f)

print(f"✅ Dashboard data loaded: {len(dashboard_data)} cases")

# Check format
sample = dashboard_data[0]
print("\nSample case structure:")
print(f"  case_name: {sample.get('case_name')}")
print(f"  region: {sample.get('region')}")
print(f"  year: {sample.get('year')}")
print(f"  damages: {sample.get('damages')}")
print(f"  embedding: {len(sample.get('embedding', []))} dimensions")
print(f"  has gemini_data: {'gemini_data' in sample}")

if 'gemini_data' in sample:
    gemini = sample['gemini_data']
    print("\nGemini data:")
    print(f"  plaintiff_id: {gemini.get('plaintiff_id')}")
    print(f"  injuries: {len(gemini.get('injuries', []))} listed")
    print(f"  citations: {len(gemini.get('citations', []))} listed")

## Cost Estimation

Gemini 2.0 Flash pricing:
- Input: $0.075 per 1M tokens
- Output: $0.30 per 1M tokens

For 655 pages:
- ~500 tokens input per page
- ~1000 tokens output per page
- Total: ~328K input + ~655K output
- **Estimated cost: ~$0.22** (about $0.02 for input + $0.20 for output)

In [None]:
# Rough cost estimate
num_pages = 655
input_tokens = num_pages * 500
output_tokens = num_pages * 1000

# Gemini 2.0 Flash pricing
input_cost = (input_tokens / 1_000_000) * 0.075
output_cost = (output_tokens / 1_000_000) * 0.30

print(f"Estimated tokens: {input_tokens:,} input, {output_tokens:,} output")
print(f"Estimated cost: ${input_cost + output_cost:.2f}")

## Next Steps

1. **Run the dashboard**: `streamlit run streamlit_app.py`
2. **Test search**: The dashboard will automatically detect and use the Gemini data
3. **Verify**: Check that multi-plaintiff and FLA data displays correctly

## Troubleshooting

### API Errors
- Check that your API key is valid
- Gemini enforces rate limits; the parser includes delays and retries
- Use the resume functionality if interrupted
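
The delay-and-retry behaviour mentioned above is commonly implemented as exponential backoff. The sketch below illustrates that general technique; it is not the parser's actual retry code. The helper name `call_with_retries` and its parameters are hypothetical, and catching every `Exception` is a simplification (a real client would retry only on rate-limit or transient network errors).

```python
import random
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry on failure with exponential backoff plus jitter.

    Hypothetical helper: `sleep` is injectable so the wait can be stubbed out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # The delay doubles each attempt: base_delay, 2x, 4x, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% random jitter avoids retrying in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
```

A request function would be wrapped as `call_with_retries(lambda: send_request(page_text))`, where `send_request` stands in for whatever function issues the API call.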

### Missing Data
- Some pages may not parse correctly
- Check the `errors` list on the parser
- Re-run specific pages if needed

### Validation
- Compare output against known cases
- Spot-check multi-plaintiff cases manually
- Verify FLA claims are accurate
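
The comparison against known cases can be partly automated. The sketch below assumes a small, hand-verified mapping from case name to per-plaintiff non-pecuniary amounts. The function name `find_mismatches` and the shape of that mapping are hypothetical; the `case_name`, `plaintiffs`, and `non_pecuniary_damages` fields are the ones used in the cells above.

```python
def find_mismatches(cases, expected, tolerance=0.01):
    """Compare parsed awards with hand-verified ones (hypothetical helper).

    `expected` maps case_name -> list of amounts, one per plaintiff, in order.
    Returns (case_name, expected_amounts, parsed_amounts) for each mismatch;
    parsed_amounts is None when the case is missing from the parsed output.
    """
    by_name = {case.get("case_name"): case for case in cases}
    mismatches = []
    for name, expected_amounts in expected.items():
        case = by_name.get(name)
        if case is None:
            mismatches.append((name, expected_amounts, None))
            continue
        parsed = [p.get("non_pecuniary_damages") for p in case.get("plaintiffs", [])]
        wrong_count = len(parsed) != len(expected_amounts)
        wrong_amount = any(
            got is None or abs(got - want) > tolerance
            for got, want in zip(parsed, expected_amounts)
        )
        if wrong_count or wrong_amount:
            mismatches.append((name, expected_amounts, parsed))
    return mismatches
```

An empty result means every hand-checked case matched; anything returned is a candidate for manual review against the PDF.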