# Parser Debugging Notebook

This notebook helps debug how the damages compendium PDF is being parsed.

**What to look for:**
- Section headers (Forearm, Spine, etc.) that become the `category` field
- Whether the `region` field should come from somewhere else
- What gets sent to the LLM for each row

In [None]:
import camelot
import json
from pathlib import Path
import pandas as pd
from IPython.display import display, HTML

# Configuration
PDF_PATH = "2024damagescompendium.pdf"
SAMPLE_PAGES = [10, 20, 30, 50, 75, 100, 150, 200, 250, 300]  # 10 sample pages

## Step 1: Extract Tables from Sample Pages

We'll use both STREAM (for section headers) and LATTICE (for table structure) modes.

In [None]:
def extract_page_info(pdf_path, page_num):
    """
    Extract table information from a single page using both modes.
    
    Returns:
        dict with 'stream_tables', 'lattice_tables', and 'section_header'
    """
    result = {
        'page': page_num,
        'stream_tables': [],
        'lattice_tables': [],
        'section_header': None
    }
    
    # Extract with STREAM mode (captures section headers)
    try:
        stream_tables = camelot.read_pdf(
            pdf_path,
            pages=str(page_num),
            flavor='stream',
            edge_tol=50
        )
        
        if stream_tables:
            for table in stream_tables:
                df = table.df
                result['stream_tables'].append(df)
                
                # Check first row for section header
                if len(df) > 0:
                    first_row = ' '.join(str(x) for x in df.iloc[0].values if str(x).strip())
                    # Common section headers: Forearm, Spine, Head, etc.
                    # Skip empty rows so a blank first row cannot overwrite a header found earlier
                    if first_row and len(first_row) < 50 and not any(kw in first_row.lower() for kw in ['case', 'damages', 'age']):
                        result['section_header'] = first_row.strip()
    except Exception as e:
        print(f"Stream extraction error on page {page_num}: {e}")
    
    # Extract with LATTICE mode (better table structure)
    try:
        lattice_tables = camelot.read_pdf(
            pdf_path,
            pages=str(page_num),
            flavor='lattice'
        )
        
        if lattice_tables:
            for table in lattice_tables:
                df = table.df
                result['lattice_tables'].append(df)
    except Exception as e:
        print(f"Lattice extraction error on page {page_num}: {e}")
    
    return result

# Extract all sample pages
print(f"Extracting tables from {len(SAMPLE_PAGES)} sample pages...\n")
page_data = {}

for page_num in SAMPLE_PAGES:
    print(f"Processing page {page_num}...")
    page_data[page_num] = extract_page_info(PDF_PATH, page_num)
    
print(f"\n✅ Extracted data from {len(page_data)} pages")

## Step 2: Display Section Headers Found

These become the `category` field in the parsed data.
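
The header check from Step 1 can be pulled out into a small pure function, which makes it testable without the PDF. The sketch below is an approximation of that check, not the parser's actual code; the names `looks_like_section_header` and `HEADER_STOPWORDS` are made up for illustration.

```python
# Hypothetical helper: mirrors the first-row heuristic from Step 1.
HEADER_STOPWORDS = ("case", "damages", "age")  # same keywords as the Step 1 check

def looks_like_section_header(cells, max_len=50):
    """Return the candidate header text, or None if the row is empty,
    too long, or contains a table-column keyword."""
    text = " ".join(str(c).strip() for c in cells if str(c).strip())
    if not text or len(text) >= max_len:
        return None
    lowered = text.lower()
    # Substring match, as in Step 1 (not a whole-word match)
    if any(kw in lowered for kw in HEADER_STOPWORDS):
        return None
    return text

print(looks_like_section_header(["Forearm", "", ""]))
print(looks_like_section_header(["Case", "Year", "Damages"]))
print(looks_like_section_header(["Cartilage"]))
```

Because the keyword test is a substring match, a header containing "age" (the made-up example "Cartilage" above) is rejected. If pages that should have a header show none, this is one thing worth checking.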

In [None]:
print("Section Headers Found:\n" + "="*50)

for page_num in SAMPLE_PAGES:
    info = page_data[page_num]
    # The key always exists, so a .get() default would never apply; use `or` instead
    section = info.get('section_header') or '(none found)'
    print(f"Page {page_num:3d}: {section}")

## Step 3: View Sample Table Structure

Let's look at what the actual table data looks like.

In [None]:
# Pick a page with a section header
sample_page = None
for page_num in SAMPLE_PAGES:
    if page_data[page_num].get('section_header'):
        sample_page = page_num
        break

if sample_page:
    info = page_data[sample_page]
    
    print(f"\n{'='*70}")
    print(f"SAMPLE PAGE {sample_page}")
    print(f"Section Header: {info['section_header']}")
    print(f"{'='*70}\n")
    
    # Show STREAM table
    if info['stream_tables']:
        print("\n--- STREAM Mode (shows section headers) ---")
        display(info['stream_tables'][0].head(10))
    
    # Show LATTICE table
    if info['lattice_tables']:
        print("\n--- LATTICE Mode (cleaner structure) ---")
        display(info['lattice_tables'][0].head(10))
else:
    print("No pages with section headers found in sample")

## Step 4: Simulate What Gets Sent to the LLM

This shows exactly what the parser sends to the LLM for each row.

In [None]:
def simulate_llm_prompt(section, columns, row_data):
    """
    Recreate the exact prompt sent to the LLM.
    """
    prompt = f"""Parse this table row from a legal damages compendium.

Body Region/Category: {section}
Table Columns: {columns}
Row Data: {row_data}

Extract the following information and return as JSON:
{{
  "case_name": "Full case name (plaintiff v. defendant)" or null,
  "plaintiff_name": "Plaintiff name only" or null,
  "defendant_name": "Defendant name only" or null,
  "year": year as integer or null,
  "citation": "Citation string" or null,
  "court": "Court name" or null,
  "judge": "Judge's LAST NAME ONLY (e.g., 'Smith' not 'A. Smith J.'). For appeals with multiple judges, use a list like ['Smith', 'Jones', 'Brown']" or null,
  "sex": "M" or "F" or null,
  "age": age as integer or null,
  "non_pecuniary_damages": amount in dollars (number, no $ or commas) or null,
  "is_provisional": true/false or null,
  "injuries": ["injury1", "injury2"] or [],
  "other_damages": [{{"type": "future_loss_of_income|past_loss_of_income|cost_of_future_care|housekeeping_capacity|other", "amount": number, "description": "text"}}] or [],
  "comments": "Additional notes" or null,
  "is_continuation": true if this row lacks case_name/citation (continuation of previous case), false otherwise
}}
"""
    return prompt

# Show sample prompts
if sample_page and info['lattice_tables']:
    df = info['lattice_tables'][0]
    section = info['section_header'] or 'UNKNOWN'
    
    # Get column names from first row
    if len(df) > 1:
        columns = list(df.iloc[0].values)
        
        print(f"\n{'='*70}")
        print(f"SAMPLE LLM PROMPTS FOR PAGE {sample_page}")
        print(f"{'='*70}\n")
        
        # Show prompts for first 3 data rows
        for i in range(1, min(4, len(df))):
            row_data = list(df.iloc[i].values)
            prompt = simulate_llm_prompt(section, columns, row_data)
            
            print(f"\n--- ROW {i} PROMPT ---")
            print(prompt[:800] + "...\n" if len(prompt) > 800 else prompt)
            print(f"\nNote: The section '{section}' is saved as 'category' field, NOT 'region' field!\n")

## Step 5: Check Actual Parsed Data

Compare the prompts above with what is actually stored in `damages_full.json`.

In [None]:
# Load parsed data
with open('damages_full.json', 'r') as f:
    parsed_cases = json.load(f)

print(f"\nTotal parsed cases: {len(parsed_cases)}\n")

# Analyze region vs category
cases_with_region = sum(1 for c in parsed_cases if c.get('region'))
cases_with_category = sum(1 for c in parsed_cases if c.get('category'))

total = len(parsed_cases) or 1  # avoid ZeroDivisionError if the file is empty
print(f"Cases with 'region' field:   {cases_with_region:4d} ({cases_with_region/total*100:.1f}%)")
print(f"Cases with 'category' field: {cases_with_category:4d} ({cases_with_category/total*100:.1f}%)")

# Sample cases
print("\n" + "="*70)
print("SAMPLE PARSED CASES")
print("="*70)

for i, case in enumerate(parsed_cases[:5]):
    print(f"\nCase {i+1}: {case.get('case_name', 'Unknown')}")
    print(f"  Category: {case.get('category', 'N/A')}")
    print(f"  Region:   {case.get('region', 'N/A')}")
    print(f"  Pages:    {case.get('source_pages', [])[:3]}")

## Step 6: Diagnosis

**Key Findings:**

1. The parser extracts section headers (e.g., "Forearm", "Spine") from the PDF
2. These section headers are sent to the LLM as `Body Region/Category: {section}`
3. The parsed data saves this as the `category` field
4. The `region` field is a SEPARATE field that appears to come from somewhere else

**Possible Issues:**

- The `region` field might be from a different data source (manual annotations?)
- The parser may need to be updated to ALSO save the section as `region`
- Or the dashboard should use `category` instead of expecting `region`
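
One way to tell these possibilities apart is to tally how `region` and `category` relate in the parsed records. The following is a minimal sketch; it assumes `region` may be stored either as a string or as a list of strings, which this notebook has not confirmed. The function name and the sample records are illustrative only.

```python
from collections import Counter

def compare_region_category(cases):
    """Tally how `region` and `category` relate across parsed cases."""
    tally = Counter()
    for case in cases:
        category = case.get("category")
        region = case.get("region")
        # Assumption: region is a string, a list of strings, or missing
        if isinstance(region, list):
            regions = region
        elif region:
            regions = [region]
        else:
            regions = []
        if category and regions:
            tally["match" if category in regions else "differ"] += 1
        elif category:
            tally["category_only"] += 1
        elif regions:
            tally["region_only"] += 1
        else:
            tally["neither"] += 1
    return tally

# Illustrative records, not taken from damages_full.json
sample = [
    {"category": "Forearm", "region": ["Forearm"]},
    {"category": "Spine", "region": "Back"},
    {"category": "Head"},
    {},
]
print(compare_region_category(sample))
```

Running it on `parsed_cases` from Step 5 would show whether `region` mostly mirrors `category` (suggesting the same source) or differs (suggesting a separate source).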

**Recommendation:**

Since the section header IS being captured and sent to the LLM, we should either:
1. Update the parser to save section as BOTH `category` AND `region`
2. Update the dashboard to use `category` when `region` is missing
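
Option 2 could look roughly like the helper below. This is a sketch under assumptions: the dashboard code is not part of this notebook, `effective_region` is a hypothetical name, and `region` is assumed to be a list (or occasionally a string).

```python
def effective_region(case):
    """Return the region as a list, falling back to `category`
    when `region` is missing or empty."""
    region = case.get("region")
    if isinstance(region, str):
        region = [region]  # normalize a bare string to a list
    if region:
        return list(region)
    category = case.get("category")
    return [category] if category else []

# Illustrative records only
print(effective_region({"category": "Forearm"}))
print(effective_region({"category": "Spine", "region": ["Back"]}))
print(effective_region({}))
```

This option leaves the stored data untouched, so it can be rolled back without re-running the parser.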

## Step 7: Suggested Fix

Add this to the parser (around line 469 in damages_parser_table.py):

```python
data['source_page'] = page_number
data['category'] = section
data['region'] = [section]  # ADD THIS LINE - save section as region too
```

This would populate both fields with the anatomical section: `category` as a string and `region` as a one-element list.
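
A parser change only affects records produced by later runs. For records already in `damages_full.json`, a one-off backfill along the following lines may be enough. This is a sketch, not part of the parser: `backfill_region` is a hypothetical helper, the sample records are made up, and it assumes a record with an empty or missing `region` should take its value from `category`.

```python
def backfill_region(cases):
    """Copy `category` into `region` (as a one-element list) where
    `region` is missing or empty.

    Mutates the dicts in place and returns the number of records changed.
    """
    changed = 0
    for case in cases:
        if not case.get("region") and case.get("category"):
            case["region"] = [case["category"]]
            changed += 1
    return changed

# Made-up records for illustration
records = [
    {"case_name": "A v. B", "category": "Forearm"},
    {"case_name": "C v. D", "category": "Spine", "region": ["Back"]},
]
print(backfill_region(records), records[0]["region"])
```

Existing `region` values are left alone, so running the backfill twice changes nothing the second time.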