# Ontario Damages Compendium Parser - Azure AI Foundry

Parse the Ontario Damages Compendium PDF using Azure AI Foundry (OpenAI or Claude models).

**Features:**
- Supports Azure OpenAI (GPT-4o, GPT-4) and Claude models
- Checkpoint/resume support for handling API quota limits
- Multi-plaintiff and Family Law Act claims extraction
- Automatic embedding generation for the dashboard

## Setup

In [3]:
from damages_parser_azure import parse_compendium, DamagesCompendiumParser
from data_transformer import (
    add_embeddings_to_cases,
    extract_statistics
)
import json
import pandas as pd
from pathlib import Path

## Configure Azure AI Foundry

Fill in your Azure details below:

In [None]:

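ENDPOINT = "<your-azure-endpoint-url>"  # placeholder: your Azure AI Foundry endpoint
API_KEY = "<your-api-key>"              # placeholder: never commit a real key
MODEL = "<your-deployment-name>"        # placeholder: e.g. your gpt-5-chat or gpt-4o deployment
PDF_PATH = "2024damagescompendium.pdf"
OUTPUT_JSON = "damages_full.json"
DASHBOARD_JSON = "data/damages_with_embeddings.json"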

## Download PDF (if needed)

In [3]:
import requests

if not Path(PDF_PATH).exists():
    print("Downloading PDF...")
    url = "https://cdn.ymaws.com/www.ccla-abcc.ca/resource/resmgr/pp-civlit/2024damagescompendium.pdf"
    response = requests.get(url, timeout=120)
    response.raise_for_status()  # fail loudly instead of saving an error page as the PDF
    Path(PDF_PATH).write_bytes(response.content)
    print(f"Downloaded {len(response.content) / 1024 / 1024:.1f} MB")
else:
    print(f"PDF already exists: {PDF_PATH}")

PDF already exists: 2024damagescompendium.pdf


## Check Current Progress

See if you have any existing progress:

In [4]:
# Check checkpoint from previous run
if Path("parsing_checkpoint.json").exists():
    with open("parsing_checkpoint.json") as f:
        checkpoint = json.load(f)
    print("Checkpoint found:")
    print(f"  Last page: {checkpoint.get('last_page_processed')}")
    print(f"  Cases so far: {checkpoint.get('cases_count')}")
    print(f"  Timestamp: {pd.to_datetime(checkpoint.get('timestamp'), unit='s')}")
else:
    print("No checkpoint found")

# Check existing cases
if Path(OUTPUT_JSON).exists():
    with open(OUTPUT_JSON) as f:
        existing = json.load(f)
    print(f"\nExisting cases file: {len(existing)} cases")
else:
    print("\nNo existing cases file")

No checkpoint found

Existing cases file: 112 cases
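
The fields read above (`last_page_processed`, `cases_count`, `timestamp`) suggest a small JSON file that is rewritten as pages are processed. As a rough sketch of that pattern, and not the actual code inside `damages_parser_azure`, the hypothetical helpers `save_checkpoint` and `next_page_to_parse` below write the file atomically and work out where a resumed run would start:

```python
import json
import os
import tempfile
import time
from pathlib import Path


def save_checkpoint(path, last_page, cases_count):
    """Write the checkpoint atomically so an interrupted run cannot corrupt it."""
    payload = {
        "last_page_processed": last_page,
        "cases_count": cases_count,
        "timestamp": time.time(),  # seconds since the epoch
    }
    path = Path(path)
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_name, path)


def next_page_to_parse(path, first_page=1):
    """Return the page to resume from, or first_page if no checkpoint exists."""
    path = Path(path)
    if not path.exists():
        return first_page
    with open(path) as f:
        checkpoint = json.load(f)
    return checkpoint.get("last_page_processed", first_page - 1) + 1
```

Storing the timestamp as epoch seconds matches the `unit='s'` conversion used in the cell above.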


## Parse PDF - Fresh Start

⚠️ **Run this cell to start parsing from the beginning**

In [4]:
# Start fresh with rate limiting for Azure 200 req/min limit
cases = parse_compendium(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json=OUTPUT_JSON,
    resume=False,
    requests_per_minute=200  # Azure rate limit
)

print(f"\n✅ Parsed {len(cases)} cases")

Rate limiting enabled: 200 requests/minute
Parsing pages 1 to 667 of 667
Using model: gpt-5-chat
Using sliding window for multi-page case handling

Page 1/667... found 0 new (total: 0, 0 duplicates processed)

Page 2/667... found 0 new (total: 0, 0 duplicates processed)

Page 3/667... found 0 new (total: 0, 0 duplicates processed)

Page 4/667... found 4 new (total: 4, 0 duplicates processed)

Page 5/667... found 2 new (total: 6, 0 duplicates processed)

Page 6/667... found 2 new (total: 8, 0 duplicates processed)

Page 7/667... found 4 new (total: 12, 0 duplicates processed)

Page 8/667... found 4 new (total: 16, 0 duplicates processed)

Page 9/667... found 2 new (total: 18, 0 duplicates processed)

Page 10/667... found 3 new (total: 21, 0 duplicates processed)

Page 11/667... found 4 new (total: 25, 0 duplicates processed)

Page 12/667... found 2 new (total: 27, 0 duplicates processed)

Page 13/667... found 2 new (total: 29, 0 duplicates processed)

Page 14/667... found 1 new (total: 
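
The `requests_per_minute=200` setting implies the parser spaces out its API calls. How it does so internally is not shown in this notebook; one common approach is a minimum-interval limiter, sketched here with a hypothetical `MinIntervalLimiter` class. The `clock` and `sleep` parameters are assumptions added so the class can be exercised without real waiting.

```python
import time


class MinIntervalLimiter:
    """Space calls so that at most `requests_per_minute` start per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request slot is available."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `limiter.wait()` before each request keeps calls at least 0.3 seconds apart for a limit of 200 requests/minute.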

## Resume from Checkpoint

If parsing was interrupted, run this cell to continue. The code is disabled by wrapping it in triple quotes; remove them before running:

In [None]:
'''# Resume from checkpoint with rate limiting
cases = parse_compendium(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json=OUTPUT_JSON,
    resume=True,  # <-- Resume from last checkpoint
    requests_per_minute=200  # Azure rate limit
)

print(f"\n✅ Total cases: {len(cases)}")'''

## Parse Specific Page Range

Useful for testing or for parsing specific sections. This cell is also disabled with triple quotes; remove them before running:

In [None]:
'''# Parse specific page range with rate limiting
cases = parse_compendium(
    PDF_PATH,
    endpoint=ENDPOINT,
    api_key=API_KEY,
    model=MODEL,
    output_json=OUTPUT_JSON,
    start_page=588,  # Start here
    end_page=591,    # Stop here
    requests_per_minute=200  # Azure rate limit
)

print(f"\n✅ Parsed pages 588-591: {len(cases)} cases")'''

## Generate Embeddings for Dashboard

Convert parsed cases to dashboard format with embeddings:

In [None]:
# Convert to dashboard format and generate embeddings.
# This may take 5-10 minutes for the full dataset.

dashboard_cases = add_embeddings_to_cases(
    OUTPUT_JSON,
    DASHBOARD_JSON
)

print(f"\n✅ Created {len(dashboard_cases)} dashboard-ready cases")
print(f"\n📁 Saved to:")
print(f"  - Raw Azure: {OUTPUT_JSON}")
print(f"  - Dashboard: {DASHBOARD_JSON}")
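
Stored embeddings are typically used for similarity search. The dashboard's own search code is not part of this notebook, so the following is only a sketch of the general idea: rank records by cosine similarity between a query vector and each record's `embedding` field. The function names `cosine_similarity` and `top_matches` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_embedding, cases, k=5):
    """Return the k cases whose 'embedding' is most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, c["embedding"]), c)
        for c in cases
        if c.get("embedding")  # skip records without an embedding
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

The query vector must come from the same embedding model as the stored vectors, otherwise the similarity scores are meaningless.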

## Analyze Results

In [None]:
# Load all cases
with open(OUTPUT_JSON) as f:
    cases = json.load(f)

stats = extract_statistics(cases)

print(f"üìä Statistics:")
print(f"  Total cases: {stats['total_cases']:,}")
print(f"  Total plaintiffs: {stats['total_plaintiffs']:,}")
print(f"  Multi-plaintiff cases: {stats['multi_plaintiff_count']:,}")
print(f"  Family Law Act cases: {stats['family_law_act_count']:,}")

print("\nüí∞ Damages statistics:")
print(f"  Count: {stats['damages_stats']['count']:,}")
print(f"  Mean: ${stats['damages_stats']['mean']:,.0f}")
print(f"  Median: ${stats['damages_stats']['median']:,.0f}")
print(f"  Min: ${stats['damages_stats']['min']:,.0f}")
print(f"  Max: ${stats['damages_stats']['max']:,.0f}")

print("\nüè• Top categories:")
for cat, count in list(stats['categories'].items())[:10]:
    print(f"  {cat}: {count:,}")

In [None]:
# View sample case
if cases:
    print("Sample case (most recent):")
    print(json.dumps(cases[-1], indent=2))
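
To sanity-check the figures reported in this section, the same keys can be recomputed by hand. The sketch below is not the implementation of `extract_statistics` in `data_transformer`; it is a hypothetical `summarize_cases` that assumes each case carries an optional `plaintiffs` list with a numeric `non_pecuniary_damages` value, as the flattening code later in this notebook also assumes.

```python
import statistics
from collections import Counter


def summarize_cases(cases):
    """Compute summary figures using the same key names read from `stats`."""
    plaintiffs = [p for c in cases for p in (c.get("plaintiffs") or [])]
    # Keep only numeric awards; missing or null values are ignored.
    amounts = [
        p["non_pecuniary_damages"]
        for p in plaintiffs
        if isinstance(p.get("non_pecuniary_damages"), (int, float))
    ]
    damages_stats = {"count": len(amounts)}
    if amounts:
        damages_stats.update(
            mean=statistics.mean(amounts),
            median=statistics.median(amounts),
            min=min(amounts),
            max=max(amounts),
        )
    categories = Counter(c.get("category") for c in cases if c.get("category"))
    return {
        "total_cases": len(cases),
        "total_plaintiffs": len(plaintiffs),
        "multi_plaintiff_count": sum(
            1 for c in cases if len(c.get("plaintiffs") or []) > 1
        ),
        "family_law_act_count": sum(
            1 for c in cases if c.get("family_law_act_claims")
        ),
        "damages_stats": damages_stats,
        # most_common() orders categories from most to least frequent
        "categories": dict(categories.most_common()),
    }
```

If these numbers disagree with `extract_statistics`, the difference most likely lies in how missing damages or plaintiff lists are counted.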

## Convert to DataFrame

In [None]:
def flatten_cases(cases):
    """Flatten nested case data to one row per plaintiff"""
    rows = []
    
    for case in cases:
        base = {
            'case_id': case.get('case_id'),
            'case_name': case.get('case_name'),
            'plaintiff_name': case.get('plaintiff_name'),
            'defendant_name': case.get('defendant_name'),
            'year': case.get('year'),
            'category': case.get('category'),
            'court': case.get('court'),
            'source_page': case.get('source_page'),
            'num_plaintiffs': len(case.get('plaintiffs') or []),  # tolerate a null list
            'has_fla_claims': bool(case.get('family_law_act_claims')),
        }
        
        plaintiffs = case.get('plaintiffs') or []  # treat missing or null as empty
        if not plaintiffs:
            rows.append(base)
        else:
            for p in plaintiffs:
                row = base.copy()
                row.update({
                    'plaintiff_id': p.get('plaintiff_id'),
                    'sex': p.get('sex'),
                    'age': p.get('age'),
                    'non_pecuniary_damages': p.get('non_pecuniary_damages'),
                    'is_provisional': p.get('is_provisional'),
                    'injuries': ', '.join(p.get('injuries') or []),
                })
                rows.append(row)
    
    return pd.DataFrame(rows)

df = flatten_cases(cases)
print(f"DataFrame shape: {df.shape}")
df.head(10)

In [None]:
# Save to CSV
df.to_csv("damages_flattened.csv", index=False)
print("Saved to damages_flattened.csv")

In [None]:
# Summary stats by category
print("Non-pecuniary damages by category:")
print(
    df.groupby('category')['non_pecuniary_damages']
    .agg(['count', 'mean', 'median'])
    .sort_values('count', ascending=False)
    .head(15)
)

## Test Dashboard Integration

In [None]:
# Verify dashboard can load the data
with open(DASHBOARD_JSON) as f:
    dashboard_data = json.load(f)

print(f"✅ Dashboard data loaded: {len(dashboard_data)} cases")

# Check format
sample = dashboard_data[0]
print("\nSample case structure:")
print(f"  case_name: {sample.get('case_name')}")
print(f"  region: {sample.get('region')}")
print(f"  year: {sample.get('year')}")
print(f"  damages: {sample.get('damages')}")
print(f"  embedding: {len(sample.get('embedding', []))} dimensions")
print(f"  has extended_data: {'extended_data' in sample}")

if 'extended_data' in sample:
    extended = sample['extended_data']
    print(f"\nExtended data:")
    print(f"  source_page: {extended.get('source_page')}")
    print(f"  plaintiff_id: {extended.get('plaintiff_id')}")
    print(f"  injuries: {len(extended.get('injuries', []))} listed")
    print(f"  citations: {len(extended.get('citations', []))} listed")

## Next Steps

1. **Run the dashboard**: `streamlit run streamlit_app.py`
2. **Test search**: The dashboard will automatically load your Azure-parsed data
3. **Verify**: Check that all enhanced features display correctly

## Cost Estimation

### Azure OpenAI (GPT-5)
- Input: ~$2.50 per 1M tokens
- Output: ~$10 per 1M tokens
- **Full PDF (667 pages)**: ~$4-6

### Azure OpenAI (GPT-4o)
- Input: ~$2.50 per 1M tokens
- Output: ~$10 per 1M tokens
- **Full PDF (667 pages)**: ~$4-6
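
The estimate follows from per-token pricing: cost = tokens / 1,000,000 × price, summed over input and output. As a worked example using the prices listed above and the 667 pages reported in the parser output, the sketch below uses illustrative token counts per page (2,000 input, 300 output). Those counts are assumptions, not measurements from this run, and the sketch assumes one request per page.

```python
def estimate_cost(pages, input_tokens_per_page, output_tokens_per_page,
                  input_price_per_million, output_price_per_million):
    """Estimated cost in dollars, assuming one model request per page."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * input_price_per_million
    output_cost = pages * output_tokens_per_page / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Token counts per page are illustrative guesses, not measurements.
total = estimate_cost(
    pages=667,
    input_tokens_per_page=2_000,
    output_tokens_per_page=300,
    input_price_per_million=2.50,
    output_price_per_million=10.0,
)
print(f"Estimated cost: ${total:.2f}")
```

With these assumed token counts the total is about $5.34, inside the quoted range. Check current Azure pricing for your region and model before relying on any of these numbers.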

## Troubleshooting

### API Errors
- Verify your endpoint URL is correct
- Check that your API key is valid
- Ensure the deployment name matches your Azure deployment
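
Before calling the API, a quick local check can catch the most common configuration mistakes. This is only a sketch with a hypothetical helper, `config_problems`; it assumes the endpoint is an `https://` URL and cannot confirm that a key or deployment is actually valid.

```python
from urllib.parse import urlparse


def config_problems(endpoint, api_key, model):
    """Return a list of human-readable problems; empty means nothing obvious."""
    problems = []
    parsed = urlparse(endpoint or "")
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("endpoint should be a full https:// URL")
    if not api_key or api_key != api_key.strip():
        problems.append("API key is empty or has stray whitespace")
    if not model:
        problems.append("model (deployment name) is empty")
    return problems
```

For example, `config_problems(ENDPOINT, API_KEY, MODEL)` returns an empty list when none of these obvious problems is present.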

### Rate Limits
- The parser includes automatic retry with exponential backoff
- Use `resume=True` to continue after quota limits
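
For reference, retry with exponential backoff generally looks like the sketch below. It is a generic illustration with a hypothetical `call_with_backoff` helper, not the parser's actual retry code. The delay cap, the jitter, and the injectable `sleep` parameter are assumptions.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying failures with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so parallel clients do not sync up.
            sleep(random.uniform(0, delay))
```

In practice `retry_on` would be narrowed to the rate-limit or transient error types raised by your client library, so that genuine bugs are not retried.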

### Model Not Found
- Check deployment name in Azure portal
- Verify the model is deployed and available (gpt-5-chat or gpt-4o)
- Ensure your deployment is active and not paused