# Guided OCR with Structure Recognition

This notebook focuses on Phase 1 of our enhanced OCR process: identifying the document structure from PDF page images.

## Steps:
1. Convert PDF to page images
2. Send the page images to the Claude 3.7 Sonnet API
3. Have Claude identify hierarchical heading structure
4. Create a structural map of the document

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
import os
import json
import base64
from pathlib import Path
from typing import List, Dict, Optional, Any
import anthropic
import re
from pdf2image import convert_from_path
from pydantic import BaseModel, Field
from IPython.display import display, Image

# Set your API key
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")

if not ANTHROPIC_API_KEY:
    print("Warning: No Anthropic API key found. Set the ANTHROPIC_API_KEY environment variable.")
    # Uncomment and set directly if needed
    # ANTHROPIC_API_KEY = "your_api_key_here"

## Define Data Models for Document Structure

We'll use Pydantic models to structure our document data.

In [None]:
class HeadingElement(BaseModel):
    """A heading element in the document"""
    text: str = Field(..., description="The heading text")
    level: int = Field(..., description="Heading level (1 for main headings, 2 for subheadings, etc.)")
    page: int = Field(..., description="Page number containing the heading (1-indexed)")
    position: Optional[Dict[str, int]] = Field(None, description="Position coordinates on page (x, y, width, height)")

class DocumentStructure(BaseModel):
    """Structure of the document extracted from images"""
    headings: List[HeadingElement] = Field(default_factory=list, description="All headings in the document")
    total_pages: int = Field(..., description="Total number of pages in the document")
    document_title: Optional[str] = Field(None, description="Title of the document")
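
The model's reply is not guaranteed to use exactly the field names these models expect. As a minimal sketch, assuming a hypothetical helper `normalize_heading` and a guessed set of alias keys (neither is part of this notebook), a raw heading dict could be coerced into the `text` / `level` / `page` shape before it is passed to `HeadingElement`:

```python
from typing import Any, Dict

# Hypothetical aliases a reply might use; adjust to what you actually observe.
KEY_ALIASES = {
    "text": ("text", "heading", "title"),
    "level": ("level", "heading_level"),
    "page": ("page", "page_number"),
}


def normalize_heading(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map alias keys onto text/level/page and coerce the numeric fields to int."""
    out: Dict[str, Any] = {}
    for field, aliases in KEY_ALIASES.items():
        for key in aliases:
            if key in raw and raw[key] is not None:
                out[field] = raw[key]
                break
        else:
            # None of the aliases was present in the raw dict
            raise KeyError(f"missing required field: {field}")
    out["text"] = str(out["text"]).strip()
    out["level"] = int(out["level"])
    out["page"] = int(out["page"])
    return out


print(normalize_heading({"heading": " 1. Introduction ", "heading_level": "1", "page_number": 2}))
```

The returned dict can then be unpacked with `HeadingElement(**normalized)`, so a validation error points at a genuinely malformed entry rather than at a naming difference.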

## PDF Processing Functions

These functions handle the conversion of PDF to images and preparation for API calls.
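
To make the encoding step concrete, here is a small sketch of how raw JPEG bytes become the base64 image block used later in the request. The helper name `image_block` and the placeholder bytes are assumptions for illustration only; real input would be the JPEG bytes of a rendered page.

```python
import base64


def image_block(jpeg_bytes: bytes) -> dict:
    """Wrap JPEG bytes in a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/jpeg",
            "data": base64.b64encode(jpeg_bytes).decode("ascii"),
        },
    }


# Placeholder bytes standing in for a real page image (hypothetical data).
fake_jpeg = b"\xff\xd8\xff\xe0 placeholder page bytes"
block = image_block(fake_jpeg)

# Base64 is lossless: decoding returns the original bytes.
assert base64.b64decode(block["source"]["data"]) == fake_jpeg

# Base64 emits 4 characters per 3 input bytes (with padding), so the
# payload grows by roughly one third.
print(len(fake_jpeg), len(block["source"]["data"]))
```

Because the encoded payload is about a third larger than the image itself, a lower `dpi` is one way to keep request sizes down.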

In [None]:
def convert_pdf_to_images(pdf_path, dpi=200):
    """
    Convert PDF to a list of PIL Image objects.
    
    Args:
        pdf_path: Path to the PDF file
        dpi: Resolution for the images (higher = better quality but larger size)
        
    Returns:
        List of PIL Image objects
    """
    print(f"Converting PDF to images: {pdf_path}")
    images = convert_from_path(pdf_path, dpi=dpi)
    print(f"Converted {len(images)} pages")
    return images

def encode_image_to_base64(image):
    """
    Encode a PIL Image to base64 for API transmission.
    
    Args:
        image: PIL Image object
        
    Returns:
        Base64 encoded string
    """
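    import io  # local import: only needed for the in-memory buffer
    buffered = io.BytesIO()
    # JPEG has no alpha channel, so convert to RGB in case the page image is RGBA or palette-based
    image.convert("RGB").save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")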

def prepare_images_for_api(images, max_images=None):
    """
    Prepare images for the Anthropic API by encoding them to base64.
    
    Args:
        images: List of PIL Image objects
        max_images: Maximum number of images to process (None = all)
        
    Returns:
        List of dictionaries with page number and base64 data
    """
    if max_images is not None:
        images = images[:max_images]
        
    encoded_images = []
    for i, img in enumerate(images):
        encoded_images.append({
            "page_number": i + 1,  # 1-indexed page numbers
            "base64": encode_image_to_base64(img)
        })
        
    return encoded_images

## Structure Extraction Using Claude API

These functions handle sending images to Claude and processing the response.
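
Pages are sent in fixed-size batches, and the last batch may be shorter than the rest. The sketch below isolates that slicing logic with stand-in page dicts; `make_batches` and `page_ranges` are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Tuple


def make_batches(items: List[Dict], batch_size: int) -> List[List[Dict]]:
    """Split items into consecutive slices of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def page_ranges(batches: List[List[Dict]]) -> List[Tuple[int, int]]:
    """Return the (first_page, last_page) pair covered by each batch."""
    return [(b[0]["page_number"], b[-1]["page_number"]) for b in batches]


# Twelve stand-in pages, 1-indexed like the encoded images in this notebook.
pages = [{"page_number": n} for n in range(1, 13)]
print(page_ranges(make_batches(pages, 5)))  # [(1, 5), (6, 10), (11, 12)]
```

Each batch is analyzed independently, so heading levels can be judged differently from one batch to the next. A larger `batch_size` reduces that risk at the cost of bigger requests.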

In [None]:
def extract_structure_from_images(encoded_images, api_key, batch_size=5):
    """
    Extract document structure from page images using Claude API.
    Processes images in batches to handle API limits.
    
    Args:
        encoded_images: List of dictionaries with page number and base64 data
        api_key: Anthropic API key
        batch_size: Number of images to process in each API call
        
    Returns:
        DocumentStructure object
    """
    client = anthropic.Anthropic(api_key=api_key)
    
    # Process images in batches
    batches = [encoded_images[i:i+batch_size] for i in range(0, len(encoded_images), batch_size)]
    all_headings = []
    document_title = None
    
    for batch_index, batch in enumerate(batches):
        print(f"Processing batch {batch_index+1}/{len(batches)} (pages {batch[0]['page_number']}-{batch[-1]['page_number']})")
        
        # Prepare message content
        content = [
            {
                "type": "text",
                "text": f"""I'm sending you pages {batch[0]['page_number']}-{batch[-1]['page_number']} of a document. 
                Please identify all headings and their hierarchy in these pages by analyzing visual formatting cues:
                
                1. Look for section headings, indicated by larger font size, bold formatting, or numbering schemes (like 1., 1.1, A., etc.).
                2. Determine the level of each heading based on visual prominence (text size, styling, indentation).
                3. Level 1 should be the main headings, level 2 for subheadings, and so on.
                4. If this is the first batch and you see a document title, please identify it.
                
                For each heading, provide:
                - The exact text of the heading
                - The heading level (1, 2, 3, etc.)
                - The page number it appears on
                
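                Respond with only a JSON object of this form (no prose, no code fences):
                {{"document_title": "<title or null>", "headings": [{{"text": "<heading text>", "level": 1, "page": 1}}]}}"""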
            }
        ]
        
        # Add images to the content
        for img_data in batch:
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": img_data["base64"]
                }
            })
        
        # Call the API
        response = client.messages.create(
            model="claude-3-7-sonnet-20250219",  # Claude 3.7 Sonnet, the model named in the steps at the top
            max_tokens=4000,
            temperature=0,  # Use 0 for more consistent output (not strictly deterministic)
            system="You are an expert at document structure analysis. Your task is to identify the hierarchical structure of documents based on visual formatting cues like font size, style, indentation, and numbering systems. You will analyze document pages and extract the heading structure accurately. Respond with only JSON formatted data without explanations or markdown formatting.",
            messages=[
                {
                    "role": "user",
                    "content": content
                }
            ]
        )
        
        # Parse the response
        try:
            # Extract JSON from the response
            response_text = response.content[0].text
            # Sometimes the API returns the JSON inside a code block, so remove that
            json_text = re.sub(r'```json\n(.+?)\n```', r'\1', response_text, flags=re.DOTALL)
            json_text = re.sub(r'```(.+?)```', r'\1', json_text, flags=re.DOTALL)
            
            result = json.loads(json_text)
            
            # Extract the headings from the result
            if "headings" in result:
                batch_headings = result["headings"]
                all_headings.extend(batch_headings)
            
            # Extract document title if this is the first batch
            if batch_index == 0 and "document_title" in result and result["document_title"]:
                document_title = result["document_title"]
                
        except Exception as e:
            print(f"Error parsing response from batch {batch_index+1}: {e}")
            print(f"Response: {response.content[0].text}")
    
    # Create the final document structure
    structure = DocumentStructure(
        headings=[HeadingElement(**h) for h in all_headings],
        total_pages=len(encoded_images),
        document_title=document_title
    )
    
    return structure
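
Replies sometimes wrap the JSON in a Markdown code fence or add a sentence around it. One possible way to handle this, sketched here with a hypothetical helper `parse_json_reply`, is to take the first fenced block if there is one and otherwise parse the whole reply. The fence marker is built with string repetition only to keep literal backtick runs out of this example.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick Markdown fence marker

# Optional "json" language tag, then the fenced body captured lazily.
FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def parse_json_reply(reply_text: str):
    """Parse JSON from a reply, unwrapping the first code fence if present."""
    match = FENCE_RE.search(reply_text)
    candidate = match.group(1) if match else reply_text
    return json.loads(candidate.strip())


fenced = "Here is the structure:\n" + FENCE + 'json\n{"headings": []}\n' + FENCE
plain = '{"document_title": "Sample", "headings": []}'

print(parse_json_reply(fenced))
print(parse_json_reply(plain))
```

A reply that still fails to parse raises `json.JSONDecodeError`, which the caller can catch per batch, as the `try`/`except` in the function above does.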

## Process a Document

Now let's use these functions to process a PDF document.
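
The default output file sits next to the PDF and is named after it. A minimal sketch of that derivation follows, using a hypothetical helper `default_structure_path`. It relies on `Path.with_name`, which also works on Python versions before 3.9, where `Path.with_stem` is not available.

```python
from pathlib import Path


def default_structure_path(pdf_path) -> Path:
    """Return <directory>/<stem>_structure.json for the given PDF path."""
    p = Path(pdf_path)
    return p.with_name(f"{p.stem}_structure.json")


out = default_structure_path("reports/R35_MIRA.pdf")
print(out.name)  # R35_MIRA_structure.json
```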

In [None]:
def process_document(pdf_path, output_path=None, max_pages=None):
    """
    Process a PDF document to extract its structure.
    
    Args:
        pdf_path: Path to the PDF file
        output_path: Path to save the output JSON (default: same as pdf with _structure.json)
        max_pages: Maximum number of pages to process (None = all)
        
    Returns:
        DocumentStructure object
    """
    # Set default output path if not provided
    if output_path is None:
        output_path = Path(pdf_path).with_stem(f"{Path(pdf_path).stem}_structure").with_suffix(".json")
    
    # Convert PDF to images
    images = convert_pdf_to_images(pdf_path)
    
    # Limit the number of pages if specified
    if max_pages is not None:
        images = images[:max_pages]
        print(f"Limited to first {max_pages} pages")
    
    # Prepare images for API
    encoded_images = prepare_images_for_api(images)
    
    # Extract structure
    structure = extract_structure_from_images(encoded_images, ANTHROPIC_API_KEY)
    
    # Save the structure to JSON
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(structure.model_dump_json(indent=2))
    
    print(f"\nDocument structure saved to: {output_path}")
    
    return structure

In [None]:
# Enter the path to your PDF file
pdf_path = "R35_MIRA.pdf"  # Replace with your PDF file

# Process the document (limit to first 5 pages for testing)
# Remove max_pages parameter to process the entire document
if ANTHROPIC_API_KEY:
    try:
        structure = process_document(pdf_path, max_pages=5)
        
        # Display summary of the structure
        print(f"\nDocument Structure Summary:")
        print(f"- Document title: {structure.document_title}")
        print(f"- Total pages processed: {structure.total_pages}")
        print(f"- Headings found: {len(structure.headings)}")
        
        # Display the heading hierarchy
        print("\nHeading Hierarchy:")
        for heading in structure.headings:
            indent = "  " * (heading.level - 1)
            print(f"{indent}- {heading.text} (Level {heading.level}, Page {heading.page})")
    except Exception as e:
        print(f"Error processing document: {e}")
        import traceback
        traceback.print_exc()
else:
    print("Cannot process document: No API key provided.")

## Visualize the Structure

Let's create a visualization of the document structure.
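
The saved structure is a flat list of headings with levels. One possible way to turn it into a nested outline is a stack-based pass, sketched below with made-up sample headings; `build_outline` is a hypothetical helper, not part of the notebook.

```python
from typing import Any, Dict, List


def build_outline(headings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Nest a flat, document-ordered heading list by level."""
    root = {"text": None, "level": 0, "children": []}
    stack = [root]  # the chain of currently open ancestors
    for h in headings:
        node = {"text": h["text"], "level": h["level"], "children": []}
        # Close every open heading at the same or a deeper level.
        while len(stack) > 1 and stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Hypothetical sample data in the same shape as the saved headings.
sample = [
    {"text": "Introduction", "level": 1, "page": 1},
    {"text": "Background", "level": 2, "page": 1},
    {"text": "Methods", "level": 1, "page": 2},
    {"text": "Data", "level": 2, "page": 2},
    {"text": "Sampling", "level": 3, "page": 3},
]

outline = build_outline(sample)
print([node["text"] for node in outline])  # ['Introduction', 'Methods']
```

If a level is skipped (for example a level 3 heading directly under level 1), this sketch attaches the heading to the nearest shallower ancestor rather than raising an error.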

In [None]:
def create_structure_visualization(structure_path):
    """
    Create a simple visualization of the document structure.
    
    Args:
        structure_path: Path to the structure JSON file
    """
    # Load the structure
    with open(structure_path, "r", encoding="utf-8") as f:
        structure_data = json.load(f)
    
    # Create a structure graph
    # The saved JSON stores a missing title as null, so fall back explicitly
    print(f"Document: {structure_data.get('document_title') or 'Untitled'}")
    print(f"Total pages: {structure_data['total_pages']}")
    print("\nDocument Outline:")
    
    # Show headings by level with page numbers
    headings = sorted(structure_data["headings"], key=lambda h: (h["page"], (h.get("position") or {}).get("y", 0)))
    
    current_page = None
    for heading in headings:
        # Show page breaks
        if heading["page"] != current_page:
            current_page = heading["page"]
            print(f"\n--- Page {current_page} ---")
        
        # Show the heading with indentation based on level
        indent = "  " * (heading["level"] - 1)
        print(f"{indent}• {heading['text']}")

# If we've already processed the document, visualize the structure
structure_path = Path(pdf_path).with_stem(f"{Path(pdf_path).stem}_structure").with_suffix(".json")
if structure_path.exists():
    create_structure_visualization(structure_path)
else:
    print(f"Structure file not found at {structure_path}")

## Utility Function to Apply Structure to OCR Text

This function is a starting point for Phase 2, applying the identified structure to OCR text.
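
As a rough illustration of the matching step, the sketch below promotes a line to a markdown heading only when the heading text fills the whole line, so inline mentions are left alone. `mark_heading` and the sample text are assumptions made for this example. It is a stricter variant than the placeholder function in the next cell.

```python
import re


def mark_heading(markdown_text: str, heading_text: str, level: int) -> str:
    """Prefix the first line consisting solely of heading_text with '#' * level."""
    pattern = re.compile(
        rf"^[ \t]*{re.escape(heading_text)}[ \t]*$",
        re.MULTILINE,
    )
    marked = f"{'#' * level} {heading_text}"
    # A function replacement inserts the text literally, so backslashes or
    # group-like sequences in the heading are not reinterpreted.
    return pattern.sub(lambda m: marked, markdown_text, count=1)


sample = (
    "Intro text\n"
    "1.1 Results (pilot)\n"
    "Body text mentioning 1.1 Results (pilot) inline.\n"
)
print(mark_heading(sample, "1.1 Results (pilot)", 2))
```

OCR output often differs from the heading text in whitespace or punctuation, so exact matching like this will miss some headings. Fuzzy matching is left to Phase 2.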

In [None]:
def apply_structure_to_markdown(markdown_text, structure_data):
    """
    Apply the extracted structure to OCR-generated markdown text.
    
    Args:
        markdown_text: The OCR-generated markdown text
        structure_data: The structure data from the DocumentStructure
        
    Returns:
        Markdown text with proper heading structure applied
    """
    # This is a placeholder implementation - Phase 2 will develop this further
    headings = structure_data["headings"]
    
    # Sort headings by their appearance in the text (page, then position)
    headings = sorted(headings, key=lambda h: (h["page"], (h.get("position") or {}).get("y", 0)))
    
    # For each heading, try to find it in the text and mark it with the proper markdown heading level
    modified_text = markdown_text
    
    for heading in headings:
        heading_text = heading["text"]
        heading_level = heading["level"]
        
        # Escape special characters for regex
        escaped_text = re.escape(heading_text)
        
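        # Look for the heading text on a line of its own and replace it with a markdown heading
        pattern = f"(^|\n)({escaped_text})($|\n)"
        # \g<2> refers to the matched heading text; a bare "\2" in a normal string is the control character \x02
        replacement = f"\n{'#' * heading_level} \\g<2>\n"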
        
        modified_text = re.sub(pattern, replacement, modified_text)
    
    return modified_text

# This will be implemented more fully in Phase 2

## Next Steps

In Phase 2, we'll build on this foundation to:
1. Apply the identified structure to OCR-extracted text
2. Send both images and structured text to Claude
3. Have Claude correct LaTeX, equations, and formatting issues