# PDF Outline Extractor - Adobe Hackathon Round 1A Demo

This notebook demonstrates the PDF Outline Extractor library for Adobe Hackathon Round 1A.

## Overview

The PDF Outline Extractor is a Python solution (built on PyMuPDF plus the standard library) that:
- Extracts structured outlines from PDF documents
- Identifies titles and hierarchical headings (H1, H2, H3, H4)
- Outputs JSON with 0-based page indexing
- Supports multilingual content including CJK and RTL scripts
- Uses no ML models - only rule-based algorithms
- Handles special characters like \t, \n as literal text
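
As a rough illustration of the output format described above, the sketch below serializes a small outline with the standard `json` module. The `level`/`text`/`page` keys follow the sample output shown later in this notebook; the entry itself is made up, and reading "literal text" as "the tab stays in the heading string and is escaped on output" is an assumption rather than confirmed library behavior.

```python
import json

# Hypothetical outline; keys mirror the sample JSON shown later in this notebook.
outline = {
    "title": "序論",
    "outline": [
        # page is 0-based, so 0 means the first page of the PDF
        {"level": "H1", "text": "1.\tIntroduction", "page": 0},
    ],
}

# ensure_ascii=False keeps CJK/RTL text readable instead of \uXXXX escapes.
encoded = json.dumps(outline, ensure_ascii=False)

# The tab inside the heading is written as the two-character escape \t
# and comes back as a real tab character when the JSON is parsed.
decoded = json.loads(encoded)

print(encoded)
print(decoded == outline)
```

Round-tripping through `json.loads` is a quick way to check that no characters were lost or altered during serialization.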

## Key Features

1. **Font Size Analysis**: Font size expressed as a ratio of the page's average font size
2. **Layout Analysis**: X-coordinate (indentation) used to infer hierarchy level
3. **Pattern Recognition**: Numbering, bullets, and structural patterns
4. **Multilingual Support**: Unicode normalization and script handling
5. **Performance**: Under 10 seconds for a 50-page PDF, fully offline
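
To make the font-size idea concrete, here is a minimal sketch of ratio-based level assignment. The thresholds (1.5, 1.3, 1.1) are the ones passed to the extractor later in this notebook; the function name `classify_by_ratio` and the exact comparison logic are assumptions for illustration, not the library's implementation.

```python
from statistics import mean

# Thresholds taken from the constructor call used later in this notebook.
H1_RATIO, H2_RATIO, H3_RATIO = 1.5, 1.3, 1.1


def classify_by_ratio(font_size, page_font_sizes):
    """Return 'H1', 'H2', 'H3', or None from size relative to the page average.

    Hypothetical helper: the real extractor also uses layout and patterns.
    """
    ratio = font_size / mean(page_font_sizes)
    if ratio >= H1_RATIO:
        return "H1"
    if ratio >= H2_RATIO:
        return "H2"
    if ratio >= H3_RATIO:
        return "H3"
    return None


# Font sizes (in points) of the text on one page; average is 10.0.
page_font_sizes = [10.0, 10.0, 10.0, 10.0]

for size in (16.0, 13.5, 11.5, 10.0):
    print(size, classify_by_ratio(size, page_font_sizes))
```

Using a ratio rather than an absolute point size lets the same thresholds apply to documents set in different base font sizes.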

## Installation

First, let's clone the repository and install the package:

In [None]:
# Clone the repository
!git clone https://github.com/youruser/pdf_outline_extractor.git
%cd pdf_outline_extractor

In [None]:
# Install dependencies
!pip install -r requirements.txt

# Install the package in development mode
!pip install -e .

## Setup Directory Structure

Let's create the necessary directories for our demo:

In [None]:
from pathlib import Path

# Create demo directories
demo_dirs = [
    'examples/input_pdfs',
    'examples/output_json', 
    'examples/expected_json'
]

for dir_path in demo_dirs:
    Path(dir_path).mkdir(parents=True, exist_ok=True)
    print(f"Created directory: {dir_path}")

## Python API Usage

Let's demonstrate the Python API for extracting outlines:

In [None]:
# Import the extractor
from pdf_outline_extractor.extractor_new import PDFOutlineExtractor
import json

# Initialize the extractor with custom parameters
extractor = PDFOutlineExtractor(
    min_h1_size_ratio=1.5,      # H1 threshold
    min_h2_size_ratio=1.3,      # H2 threshold
    min_h3_size_ratio=1.1,      # H3 threshold
    h2_indent_threshold=20.0,   # H2 indentation limit
    h3_indent_threshold=40.0    # H3 indentation limit
)

print("PDF Outline Extractor initialized successfully!")
print(f"H1 ratio threshold: {extractor.min_h1_size_ratio}")
print(f"H2 ratio threshold: {extractor.min_h2_size_ratio}")
print(f"H3 ratio threshold: {extractor.min_h3_size_ratio}")

## Multilingual Text Processing Demo

Let's test the multilingual capabilities:

In [None]:
# Test multilingual text normalization
multilingual_samples = [
    "Résumé with café",           # French diacritics
    "第一章 导言",                # Chinese
    "序論",                      # Japanese
    "مقدمة",                     # Arabic
    "Введение",                  # Russian
    "Εισαγωγή",                  # Greek
    "1.\tIntroduction\nSection",  # Special characters
    "Price: $100 → Result",      # Symbols
]

print("Testing multilingual text normalization:")
print("=" * 50)

for text in multilingual_samples:
    normalized = extractor.normalize_text(text)
    print(f"Original:   {repr(text)}")
    print(f"Normalized: {repr(normalized)}")
    print(f"Preserved:  {text == normalized}")
    print("-" * 30)
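
For readers curious what Unicode normalization can look like, the sketch below uses NFC composition from the standard `unicodedata` module. Whether `normalize_text` uses NFC, another form, or additional steps is not stated in this notebook, so treat `normalize_sketch` as a hypothetical stand-in.

```python
import unicodedata


def normalize_sketch(text):
    """Compose characters with NFC; tabs and newlines are left untouched.

    Hypothetical stand-in, not the library's normalize_text.
    """
    return unicodedata.normalize("NFC", text)


# 'e' followed by a combining acute accent (two code points per accented letter)
decomposed = "Re\u0301sume\u0301"
composed = normalize_sketch(decomposed)

print(len(decomposed), len(composed))
print(normalize_sketch("1.\tIntroduction\nSection") == "1.\tIntroduction\nSection")
```

Composed and decomposed strings render identically but compare unequal, which is why normalizing before matching headings against ground truth tends to matter.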

## Pattern Recognition Demo

Let's test the heading pattern recognition:

In [None]:
# Test heading pattern detection
heading_samples = [
    ("1. Introduction", True),
    ("1.1 Overview", True),
    ("1.1.1 Details", True),
    ("Chapter 1", True),
    ("BACKGROUND", True),
    ("• Bullet Point", True),
    ("IV. Roman Numerals", True),
    ("Appendix A", True),
    ("第一章 导言", True),  # Chinese
    ("序論", True),        # Japanese
    ("This is a long paragraph that should not be detected as a heading.", False),
    ("small text", False),
]

print("Testing heading pattern recognition:")
print("=" * 60)

for text, expected in heading_samples:
    # Test numbering/bullet detection
    has_numbering = extractor._has_numbering_or_bullets(text)
    
    # Test heading-like patterns
    looks_like_heading = extractor._looks_like_heading(text, 1.2)  # Assume decent size ratio
    
    is_heading = has_numbering or looks_like_heading
    status = "✓" if is_heading == expected else "✗"
    
    print(f"{status} {text:<40} | Numbering: {has_numbering} | Pattern: {looks_like_heading} | Result: {is_heading}")
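
The numbering and bullet check exercised above could be approximated with a handful of regular expressions. The patterns and the name `has_numbering_sketch` below are illustrative assumptions; the rules inside `_has_numbering_or_bullets` may well differ.

```python
import re

# Illustrative patterns only; not the library's actual rules.
NUMBERING_PATTERNS = [
    re.compile(r"^\d+(\.\d+)*\.?\s+\S"),                      # 1.  1.1  1.1.1
    re.compile(r"^[IVXLCDM]+\.\s+\S"),                        # IV. (Roman numerals)
    re.compile(r"^[•\-\*]\s+\S"),                             # bullet markers
    re.compile(r"^(Chapter|Appendix)\s+\w+", re.IGNORECASE),  # Chapter 1, Appendix A
]


def has_numbering_sketch(text):
    """Return True if the text starts with a numbering or bullet marker."""
    stripped = text.strip()
    return any(pattern.match(stripped) for pattern in NUMBERING_PATTERNS)


for sample in ("1.1.1 Details", "IV. Roman Numerals", "• Bullet Point", "small text"):
    print(f"{sample:<20} {has_numbering_sketch(sample)}")
```

A check like this only sees the text itself, so headings such as "BACKGROUND" need a separate signal (for example capitalization or font size) to be detected.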

## Sample JSON Format

Let's create a sample expected output to demonstrate the format:

In [None]:
# Create sample expected output (following the provided format)
sample_output = {
    "title": "RFP:Request for Proposal To Present a Proposal for Developing the Business Plan for the Ontario Digital Library",
    "outline": [
        {
            "level": "H1",
            "text": "Ontario's Digital Library",
            "page": 1
        },
        {
            "level": "H1", 
            "text": "A Critical Component for Implementing Ontario's Road Map to Prosperity Strategy",
            "page": 1
        },
        {
            "level": "H2",
            "text": "Summary",
            "page": 1
        },
        {
            "level": "H3",
            "text": "Timeline:",
            "page": 1
        },
        {
            "level": "H2",
            "text": "Background",
            "page": 2
        }
    ]
}

# Save sample to expected directory
with open('examples/expected_json/sample.json', 'w', encoding='utf-8') as f:
    json.dump(sample_output, f, indent=4, ensure_ascii=False)

print("Sample JSON format (0-based page indexing):")
print(json.dumps(sample_output, indent=2, ensure_ascii=False))

## CLI Usage Demo

Now let's demonstrate the command-line interface:

In [None]:
# Check CLI help
!python -m pdf_outline_extractor.cli_clean --help

In [None]:
# Show extract command options
!python -m pdf_outline_extractor.cli_clean extract --help

In [None]:
# Show compare command options  
!python -m pdf_outline_extractor.cli_clean compare --help

## Process Sample PDFs

If the input directory contains sample PDFs, the cell below processes them:

In [None]:
# Check for PDFs in input directory
import glob

pdf_files = glob.glob('examples/input_pdfs/*.pdf')
print(f"Found {len(pdf_files)} PDF files:")
for pdf_file in pdf_files:
    print(f"  - {pdf_file}")

if pdf_files:
    print("\nProcessing PDFs...")
    !python -m pdf_outline_extractor.cli_clean extract \
        --input-dir examples/input_pdfs \
        --output-dir examples/output_json \
        --verbose
else:
    print("\nNo PDF files found. Upload PDFs to examples/input_pdfs/ to test extraction.")

## Compare Results with Ground Truth

If both predicted and expected results are available, the cell below compares them:

In [None]:
# Check for results to compare
output_files = glob.glob('examples/output_json/*.json')
expected_files = glob.glob('examples/expected_json/*.json')

print(f"Output files: {len(output_files)}")
print(f"Expected files: {len(expected_files)}")

if output_files and expected_files:
    print("\nComparing results with ground truth...")
    !python -m pdf_outline_extractor.cli_clean compare \
        --pred-dir examples/output_json \
        --gold-dir examples/expected_json \
        --verbose
else:
    print("\nNo files to compare. Process PDFs first and ensure expected results are available.")

## Manual Evaluation Example

Let's demonstrate manual evaluation of results:

In [None]:
# Manual evaluation example
from pdf_outline_extractor.cli_clean import calculate_metrics

# Example predicted results
pred_data = {
    "title": "Sample Document",
    "outline": [
        {"level": "H1", "text": "Introduction", "page": 0},
        {"level": "H2", "text": "Background", "page": 1},
        {"level": "H2", "text": "Methods", "page": 2},
        {"level": "H3", "text": "Data Collection", "page": 2},
        {"level": "H1", "text": "Results", "page": 3}
    ]
}

# Example ground truth
gold_data = {
    "title": "Sample Document", 
    "outline": [
        {"level": "H1", "text": "Introduction", "page": 0},
        {"level": "H2", "text": "Background", "page": 1},
        {"level": "H2", "text": "Methodology", "page": 2},  # Different text
        {"level": "H3", "text": "Data Collection", "page": 2},
        {"level": "H1", "text": "Results", "page": 3},
        {"level": "H2", "text": "Discussion", "page": 4}  # Missing in prediction
    ]
}

# Calculate metrics
metrics = calculate_metrics(pred_data, gold_data)

print("Evaluation Metrics:")
print("=" * 30)
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall:    {metrics['recall']:.3f}")
print(f"F1 Score:  {metrics['f1']:.3f}")
print("\nDetailed Counts:")
print(f"True Positives:  {metrics['true_positives']}")
print(f"False Positives: {metrics['false_positives']}")
print(f"False Negatives: {metrics['false_negatives']}")
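
To show where such numbers can come from, here is a minimal sketch that scores the same example by exact match on `(level, text, page)`. The matching rule and the name `metrics_sketch` are assumptions; `calculate_metrics` may use a different criterion (for instance fuzzy text matching), in which case its output would differ.

```python
def outline_keys(data):
    """Turn an outline into a set of (level, text, page) tuples."""
    return {(h["level"], h["text"], h["page"]) for h in data["outline"]}


def metrics_sketch(pred, gold):
    """Exact-match precision, recall, and F1 (hypothetical scoring rule)."""
    pred_keys, gold_keys = outline_keys(pred), outline_keys(gold)
    tp = len(pred_keys & gold_keys)   # headings found in both
    fp = len(pred_keys - gold_keys)   # predicted but not in ground truth
    fn = len(gold_keys - pred_keys)   # in ground truth but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "true_positives": tp, "false_positives": fp, "false_negatives": fn}


pred = {"outline": [
    {"level": "H1", "text": "Introduction", "page": 0},
    {"level": "H2", "text": "Background", "page": 1},
    {"level": "H2", "text": "Methods", "page": 2},
    {"level": "H3", "text": "Data Collection", "page": 2},
    {"level": "H1", "text": "Results", "page": 3},
]}
gold = {"outline": [
    {"level": "H1", "text": "Introduction", "page": 0},
    {"level": "H2", "text": "Background", "page": 1},
    {"level": "H2", "text": "Methodology", "page": 2},
    {"level": "H3", "text": "Data Collection", "page": 2},
    {"level": "H1", "text": "Results", "page": 3},
    {"level": "H2", "text": "Discussion", "page": 4},
]}

result = metrics_sketch(pred, gold)
print(result)
```

Under this exact-match rule the example has 4 true positives, 1 false positive ("Methods"), and 2 false negatives ("Methodology" and "Discussion"), giving a precision of 4/5, a recall of 4/6, and an F1 of 8/11.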

## Docker Usage Example

Here's how you would use the Docker container:

In [None]:
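# Note: these are example commands for Docker usage.
# (Docker cannot be run directly in Colab, so the commands are only printed.)

# Raw string: keeps the trailing backslashes (shell line continuations) in the printed output
docker_commands = r"""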
# Build the Docker image (AMD64 compatible)
docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

# Run with volume mounts (as per hackathon requirements)
docker run --rm \
  -v $(pwd)/input:/app/input \
  -v $(pwd)/output:/app/output \
  --network none \
  pdf-outline-extractor:latest

# Custom parameters
docker run --rm \
  -v $(pwd)/input:/app/input \
  -v $(pwd)/output:/app/output \
  --network none \
  pdf-outline-extractor:latest \
  python -m pdf_outline_extractor.cli_clean extract \
  --input-dir /app/input \
  --output-dir /app/output \
  --min-h1-size-ratio 1.6 \
  --verbose
"""

print("Docker Usage Examples:")
print(docker_commands)

## Performance and Compliance Summary

The checklist below summarizes how the solution is designed to meet the Adobe Hackathon Round 1A requirements:

In [None]:
# Performance and compliance summary
compliance_checklist = {
    "Execution Time": "≤10 seconds for 50-page PDF ✓",
    "Network Access": "No internet calls (offline) ✓", 
    "Model Size": "No ML models (0 MB) ✓",
    "Architecture": "AMD64 compatible ✓",
    "Runtime": "CPU-only, 8 vCPUs, 16GB RAM ✓",
    "Dependencies": "PyMuPDF + standard library only ✓",
    "Output Format": "JSON with 0-based page indexing ✓",
    "Multilingual": "Unicode, CJK, RTL support ✓",
    "Special Characters": "Preserves \\t, \\n as literals ✓",
    "Edge Cases": "Empty docs, mixed languages ✓"
}

print("Adobe Hackathon Round 1A Compliance:")
print("=" * 50)
for requirement, status in compliance_checklist.items():
    print(f"{requirement:<20}: {status}")

print("\nScoring Potential:")
print("• Heading Detection Accuracy: 25 points (multi-heuristic approach)")
print("• Performance & Compliance: 10 points (optimized pure Python)")
print("• Multilingual Bonus: 10 points (full Unicode support)")
print("• Total Possible: 45 points")

## Conclusion

This notebook demonstrates the PDF Outline Extractor's capabilities:

1. **Pure Python Implementation**: No ML models, fast and lightweight
2. **Robust Algorithm**: Multi-layered heading detection with font analysis
3. **Multilingual Support**: Handles diverse scripts and special characters
4. **Production Ready**: Comprehensive testing, Docker support, CLI interface
5. **Hackathon Compliant**: Meets all technical requirements and constraints

The solution is designed for maximum accuracy and reliability while maintaining simplicity and performance.

**Next Steps:**
1. Upload your PDF files to `examples/input_pdfs/`
2. Run the extraction commands above
3. Compare results with ground truth data
4. Fine-tune parameters if needed for your specific document types

For more information, see the comprehensive README.md in the repository.