# API Testing Notebook - Entity Extraction from Real Files

This notebook tests the Resume NER API with actual test files and visualizes extracted entities.

**Note:** This notebook focuses on testing real files and visualizing results. For comprehensive error handling and edge case testing, see the integration tests in `tests/integration/api/test_api_local_server.py`.

## Prerequisites

Before running this notebook, start the API server with:

```bash
python -m src.api.cli.run_api \
  --onnx-model outputs/final_training/distilroberta/distilroberta_model.onnx \
  --checkpoint outputs/final_training/distilroberta/checkpoint
```

The server should be running on `http://localhost:8000` by default.
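
The server may need a moment after start-up before it answers requests, so it can help to poll the health endpoint before running the cells below. The following is a minimal sketch, not part of the project: `is_healthy` and `wait_for_server` are hypothetical helper names, and the only assumption about the API is that `GET /health` returns HTTP 200 once the server is ready, as the health check in this notebook expects.

```python
import time
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and socket timeouts
        # are all OSError subclasses.
        return False


def wait_for_server(
    base_url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = is_healthy,
) -> bool:
    """Poll the server until it is healthy or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next probe
    return False
```

Calling `wait_for_server("http://localhost:8000")` returns `True` as soon as one probe succeeds and `False` after the last failed attempt. The `probe` parameter exists only so the polling logic can be exercised without a running server.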


## 1. Setup and Configuration


In [37]:
import sys
from pathlib import Path
import json
import time
from typing import Dict, Any, Optional, List
import requests
from IPython.display import display, Markdown, JSON
import pandas as pd

# Add project root to path to import fixtures
# Find project root by looking for tests directory
current_dir = Path.cwd()
project_root = current_dir

# If we're in notebooks/, go up one level
if current_dir.name == "notebooks":
    project_root = current_dir.parent
else:
    # Try to find project root by looking for tests directory
    for parent in current_dir.parents:
        if (parent / "tests" / "test_data").exists():
            project_root = parent
            break

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import test fixtures
from tests.test_data.fixtures import (
    get_text_fixture,
    get_file_fixture,
    get_batch_text_fixture,
    get_batch_file_fixture,
    TEXT_FIXTURES,
    FILE_FIXTURES
)


In [38]:
# API Configuration
API_BASE_URL = "http://localhost:8000"
API_TIMEOUT = 30  # seconds


In [39]:
def make_request(method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
    """
    Make HTTP request to API and return response data.
    
    Args:
        method: HTTP method (GET, POST)
        endpoint: API endpoint path
        **kwargs: Additional arguments for requests
        
    Returns:
        Dictionary with status_code, data, latency_ms, and error (if any)
    """
    url = f"{API_BASE_URL}{endpoint}"
    start_time = time.time()
    
    try:
        if method.upper() == "GET":
            response = requests.get(url, timeout=API_TIMEOUT, **kwargs)
        elif method.upper() == "POST":
            response = requests.post(url, timeout=API_TIMEOUT, **kwargs)
        else:
            raise ValueError(f"Unsupported method: {method}")
        
        latency_ms = (time.time() - start_time) * 1000
        
        result = {
            "status_code": response.status_code,
            "latency_ms": latency_ms,
            "error": None
        }
        
        try:
            result["data"] = response.json()
        except ValueError:
            # Body is not valid JSON; fall back to the raw response text
            result["data"] = {"text": response.text}
        
        return result
    except requests.exceptions.RequestException as e:
        latency_ms = (time.time() - start_time) * 1000
        return {
            "status_code": None,
            "latency_ms": latency_ms,
            "data": None,
            "error": str(e)
        }

def check_server_health() -> bool:
    """Check if server is running and healthy."""
    # make_request already catches connection errors and reports them in "error"
    result = make_request("GET", "/health")
    if result["status_code"] == 200:
        print("✓ Server is healthy and ready")
        print(f"  Status: {result['data'].get('status', 'unknown')}")
        print(f"  Model loaded: {result['data'].get('model_loaded', False)}")
        return True
    if result["error"]:
        print(f"✗ Cannot connect to server: {result['error']}")
        print(f"  Make sure the server is running on {API_BASE_URL}")
    else:
        print(f"✗ Server health check failed: {result['status_code']}")
    return False

# Check server health
if not check_server_health():
    print("\n⚠️  Please start the server before continuing!")


✓ Server is healthy and ready
  Status: ok
  Model loaded: True


In [None]:
def display_entities(entities: List[Dict[str, Any]], source_text: Optional[str] = None):
    """Display extracted entities in a formatted way."""
    if not entities:
        print("No entities extracted.")
        return
    
    # Group by label
    by_label = {}
    for entity in entities:
        label = entity.get("label", "UNKNOWN")
        if label not in by_label:
            by_label[label] = []
        by_label[label].append(entity)
    
    # Display by label
    for label, label_entities in sorted(by_label.items()):
        print(f"{label} ({len(label_entities)}):")
        for entity in label_entities:
            text = entity.get("text", "")
            confidence = entity.get("confidence")
            # Compare against None so a confidence of 0.0 is still shown
            conf_str = f" (confidence: {confidence:.3f})" if confidence is not None else ""
            print(f"  - '{text}'{conf_str}")
    
    # Show entities in context if source_text provided
    if source_text:
        highlighted_text = source_text
        sorted_entities = sorted(entities, key=lambda e: e.get("start", 0), reverse=True)
        for entity in sorted_entities:
            start = entity.get("start", 0)
            end = entity.get("end", 0)
            text = entity.get("text", "")
            label = entity.get("label", "UNKNOWN")
            highlighted_text = (
                highlighted_text[:start] + 
                f"[{text}]({label})" + 
                highlighted_text[end:]
            )
        print(f"\nContext: {highlighted_text}")
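
The context rendering above applies replacements from the last entity to the first, so the character offsets of earlier entities stay valid while the string grows. A small standalone sketch of the same idea follows, using two spans from the `text_1` output recorded in this notebook. `highlight` is a hypothetical name; unlike `display_entities`, it slices the source text by offset instead of using the entity's `text` field, and it assumes the spans do not overlap.

```python
from typing import Any, Dict, List


def highlight(source_text: str, entities: List[Dict[str, Any]]) -> str:
    """Wrap each entity span as [span](LABEL), working right to left."""
    out = source_text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        # Text left of `start` is untouched so far, so its offsets are still valid
        out = out[:start] + f"[{out[start:end]}]({ent['label']})" + out[end:]
    return out


text = "Amazon Web Services, Austin, TX."
entities = [
    {"start": 7, "end": 19, "label": "SKILL"},
    {"start": 21, "end": 31, "label": "LOCATION"},
]
print(highlight(text, entities))
# → Amazon [Web Services](SKILL), [Austin, TX](LOCATION).
```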


## 2. Single Text Prediction

Test entity extraction from individual text inputs.


### 2.1 Test with Sample Text


In [None]:
# Test with text_1
text_1 = get_text_fixture("text_1")
result = make_request("POST", "/predict", json={"text": text_1})
if result.get("status_code") == 200 and result.get("data"):
    entities = result["data"].get("entities", [])
    display_entities(entities, source_text=text_1)


Input text: Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of experience in NLP, recommender systems, and deep learning. Skills: Python, PyTorch, Spark.

Processing time: 375.0ms

Extracted 7 entities:

DESIGNATION (1):
  - 'Lead Machine Learning Engineer' at [33:63] (confidence: 0.832)

EXPERIENCE (1):
  - '7 years of experience' at [69:90] (confidence: 0.662)

LOCATION (1):
  - 'Austin, TX' at [21:31] (confidence: 0.970)

SKILL (4):
  - 'Web Services' at [7:19] (confidence: 0.751)
  - 'NLP' at [94:97] (confidence: 0.942)
  - 'recommender systems' at [99:118] (confidence: 0.963)
  - 'deep learning' at [124:137] (confidence: 0.908)

Entities in context:
--------------------------------------------------------------------------------
Amazon [Web Services](SKILL), [Austin, TX](LOCATION). [Lead Machine Learning Engineer](DESIGNATION) with [7 years of experience](EXPERIENCE) in [NLP](SKILL), [recommender systems](SKILL), and [deep learning](SKILL). Skills: Pyt

In [None]:
# Test with text_2 (contains email, phone, location)
text_2 = get_text_fixture("text_2")
result = make_request("POST", "/predict", json={"text": text_2})
if result.get("status_code") == 200 and result.get("data"):
    entities = result["data"].get("entities", [])
    display_entities(entities, source_text=text_2)



Input text: Alice Johnson is a data analyst at Meta Platforms. Email: alice.johnson@meta.com. Phone: +1-408-555-7890. Location: Menlo Park, CA.

Processing time: 422.1ms

Extracted 6 entities:

DESIGNATION (1):
  - 'data analyst' at [19:31] (confidence: 0.945)

EMAIL (2):
  - 'alice.johnson@' at [58:72] (confidence: 0.772)
  - 'com' at [77:80] (confidence: 0.401)

SKILL (3):
  - 'Email' at [51:56] (confidence: 0.755)
  - 'Phone' at [82:87] (confidence: 0.973)
  - 'Location' at [106:114] (confidence: 0.907)

Entities in context:
--------------------------------------------------------------------------------
Alice Johnson is a [data analyst](DESIGNATION) at Meta Platforms. [Email](SKILL): [alice.johnson@](EMAIL)meta.[com](EMAIL). [Phone](SKILL): +1-408-555-7890. [Location](SKILL): Menlo Park, CA.
--------------------------------------------------------------------------------


In [None]:
# Test with text_special (contains email, phone, URL)
text_special = get_text_fixture("text_special")
result = make_request("POST", "/predict", json={"text": text_special})
if result.get("status_code") == 200 and result.get("data"):
    entities = result["data"].get("entities", [])
    display_entities(entities, source_text=text_special)



Input text: Email: test@example.com, Phone: +1-555-123-4567, URL: https://example.com

Processing time: 401.0ms

Extracted 4 entities:

EMAIL (1):
  - 'example' at [12:19] (confidence: 0.473)

SKILL (3):
  - 'Email' at [0:5] (confidence: 0.996)
  - 'com' at [20:23] (confidence: 0.491)
  - 'Phone' at [25:30] (confidence: 0.995)

Entities in context:
--------------------------------------------------------------------------------
[Email](SKILL): test@[example](EMAIL).[com](SKILL), [Phone](SKILL): +1-555-123-4567, URL: https://example.com
--------------------------------------------------------------------------------
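
Several spans in the outputs above carry low confidence, for example `'com'` at 0.491 and `'example'` at 0.473. One way to inspect the effect of a cutoff is to filter the entity list before displaying it. This is a sketch only: `filter_by_confidence` and the 0.5 threshold are illustrative assumptions, not part of the API, and the sample entities are copied from the `text_special` output above.

```python
from typing import Any, Dict, List


def filter_by_confidence(
    entities: List[Dict[str, Any]], threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep entities whose confidence is present and at least `threshold`."""
    return [
        e for e in entities
        if e.get("confidence") is not None and e["confidence"] >= threshold
    ]


sample = [
    {"text": "example", "label": "EMAIL", "start": 12, "end": 19, "confidence": 0.473},
    {"text": "Email", "label": "SKILL", "start": 0, "end": 5, "confidence": 0.996},
    {"text": "com", "label": "SKILL", "start": 20, "end": 23, "confidence": 0.491},
    {"text": "Phone", "label": "SKILL", "start": 25, "end": 30, "confidence": 0.995},
]

kept = filter_by_confidence(sample)
print([e["text"] for e in kept])
# → ['Email', 'Phone']
```

A cutoff like this removes fragments, but it leaves field names such as `Email` and `Phone`, which the model tags as `SKILL` with high confidence.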


## 3. Single PDF File Prediction

Test entity extraction from PDF files.


In [None]:
# Test with PDF file
file_path = get_file_fixture("file_1", "pdf")
try:
    with open(file_path, "rb") as f:
        file_content = f.read()
    files = {"file": (file_path.name, file_content, "application/pdf")}
    result = make_request("POST", "/predict/file", files=files)
    
    if result.get("status_code") == 200 and result.get("data"):
        extracted_text = result["data"].get("extracted_text", "")
        entities = result["data"].get("entities", [])
        display_entities(entities, source_text=extracted_text)
except Exception as e:
    print(f"Error loading file: {e}")


File: test_resume_ner_1.pdf
Size: 1122 bytes

Processing time: 461.1ms

Extracted text length: 170 characters
Extracted text preview (first 200 chars):
Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of experience in NLP, recommender systems, and deep learning. Skills: Python, PyTorch, Spark....


Extracted 7 entities:

DESIGNATION (1):
  - 'Lead Machine Learning Engineer' at [33:63] (confidence: 0.832)

EXPERIENCE (1):
  - '7 years of experience' at [69:90] (confidence: 0.662)

LOCATION (1):
  - 'Austin, TX' at [21:31] (confidence: 0.970)

SKILL (4):
  - 'Web Services' at [7:19] (confidence: 0.751)
  - 'NLP' at [94:97] (confidence: 0.942)
  - 'recommender systems' at [99:118] (confidence: 0.963)
  - 'deep learning' at [124:137] (confidence: 0.908)

Entities in context:
--------------------------------------------------------------------------------
Amazon [Web Services](SKILL), [Austin, TX](LOCATION). [Lead Machine Learning Engineer](DESIGNATION) with [7 y

In [None]:
# Test with larger PDF file
file_path = get_file_fixture("file_resume_1", "pdf")
try:
    with open(file_path, "rb") as f:
        file_content = f.read()
    files = {"file": (file_path.name, file_content, "application/pdf")}
    result = make_request("POST", "/predict/file", files=files)
    
    if result.get("status_code") == 200 and result.get("data"):
        extracted_text = result["data"].get("extracted_text", "")
        entities = result["data"].get("entities", [])
        display_entities(entities, source_text=extracted_text)
except Exception as e:
    print(f"Error loading file: {e}")



File: test_resume.pdf
Size: 57267 bytes

Processing time: 451.0ms

Extracted text length: 117 characters
Extracted text preview (first 300 chars):
John Doe is a software engineer at Google. Email: john.doe@example.com. Phone: +1-555-123-4567 Location: Seattle, WA....


Extracted 5 entities:

EMAIL (1):
  - 'john.doe@example.com' at [50:70] (confidence: 0.741)

SKILL (4):
  - 'is' at [9:11] (confidence: 0.557)
  - 'software engineer' at [14:31] (confidence: 0.859)
  - 'Email' at [43:48] (confidence: 0.850)
  - 'Phone' at [72:77] (confidence: 0.932)

Entities in context:
--------------------------------------------------------------------------------
John Doe [is](SKILL) a [software engineer](SKILL) at Google. [Email](SKILL): [john.doe@example.com](EMAIL). [Phone](SKILL): +1-555-123-4567 Location: Seattle, WA.
--------------------------------------------------------------------------------


## 4. Single Image File Prediction

Test entity extraction from image files (PNG) using OCR.


In [None]:
# Test with PNG image file
file_path = get_file_fixture("file_1", "png")
try:
    with open(file_path, "rb") as f:
        file_content = f.read()
    files = {"file": (file_path.name, file_content, "image/png")}
    result = make_request("POST", "/predict/file", files=files)
    
    if result.get("status_code") == 200 and result.get("data"):
        extracted_text = result["data"].get("extracted_text", "")
        entities = result["data"].get("entities", [])
        if extracted_text:
            display_entities(entities, source_text=extracted_text)
    elif result.get("status_code") == 400:
        error_detail = result.get("data", {}).get("detail", "")
        if "EasyOCR" in error_detail or "pytesseract" in error_detail or "Pillow" in error_detail:
            print("⚠️  OCR dependencies not installed")
except Exception as e:
    print(f"Error loading file: {e}")


File: test_resume_ner_1.png
Size: 16294 bytes

Processing time: 5343.3ms

Extracted text length: 168 characters
Extracted text preview (first 200 chars):
Amazon Web Services, Austin; TX. Lead Machine Learning Engineer with
years of
experience in NLP;
recommender systems, and deep learning: Skills: Python, PyTorch, Spark:...


Extracted 9 entities:

SKILL (9):
  - 'Amazon Web Services' at [0:19] (confidence: 0.978)
  - 'Machine Learning Engineer' at [38:63] (confidence: 0.993)
  - 'NLP' at [92:95] (confidence: 0.966)
  - 'recommender systems' at [97:116] (confidence: 0.943)
  - 'deep learning' at [122:135] (confidence: 0.992)
  - 'Skills' at [137:143] (confidence: 0.997)
  - 'Python' at [145:151] (confidence: 0.996)
  - 'PyTorch' at [153:160] (confidence: 0.943)
  - 'Spark' at [162:167] (confidence: 0.953)

Entities in context:
--------------------------------------------------------------------------------
[Amazon Web Services](SKILL), Austin; TX. Lead [Machine Learning Engineer](SKILL
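
The OCR text above is close to, but not identical with, the PDF text for the same resume: some commas and periods come back as semicolons and colons, spaces become line breaks, and the leading `7` is dropped, which shifts every later offset by two characters. A rough way to quantify that drift is a character-level similarity ratio. The sketch below uses `difflib` from the standard library; `normalize_ws` and `ocr_similarity` are hypothetical helpers, and both strings are copied from the outputs in this notebook.

```python
import difflib

reference = (
    "Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with "
    "7 years of experience in NLP, recommender systems, and deep learning. "
    "Skills: Python, PyTorch, Spark."
)
ocr_text = (
    "Amazon Web Services, Austin; TX. Lead Machine Learning Engineer with\n"
    "years of\n"
    "experience in NLP;\n"
    "recommender systems, and deep learning: Skills: Python, PyTorch, Spark:"
)


def normalize_ws(s: str) -> str:
    """Collapse every whitespace run (including line breaks) to one space."""
    return " ".join(s.split())


def ocr_similarity(expected: str, actual: str) -> float:
    """Similarity in [0, 1] between two texts, ignoring whitespace layout."""
    matcher = difflib.SequenceMatcher(None, normalize_ws(expected), normalize_ws(actual))
    return matcher.ratio()


print(f"{ocr_similarity(reference, ocr_text):.3f}")
```

A high ratio alongside very different entity labels, as in the output above, suggests the model is sensitive to small punctuation changes rather than to large OCR failures.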

## 5. Batch Text Prediction

Test entity extraction from multiple text inputs in a single batch.


### 5.1 Batch with Multiple Texts


In [None]:
# Test batch with multiple texts
texts = get_batch_text_fixture("batch_text_small")
result = make_request("POST", "/predict/batch", json={"texts": texts})

if result.get("status_code") == 200 and result.get("data"):
    predictions = result["data"].get("predictions", [])
    for i, (text, prediction) in enumerate(zip(texts, predictions), 1):
        entities = prediction.get("entities", [])
        display_entities(entities, source_text=text)


Batch size: 3 texts

Text 1: Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of ...
Text 2: Alice Johnson is a data analyst at Meta Platforms. Email: alice.johnson@meta.com...
Text 3: Robert Lee holds a PhD in Artificial Intelligence from Stanford University. His ...

Total processing time: 879.7ms
Average per text: 293.2ms


Result 1/3:
Text: Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of experience in NLP, r...
Processing time: 743.7ms

Extracted 7 entities:

DESIGNATION (1):
  - 'Lead Machine Learning Engineer' at [33:63] (confidence: 0.832)

EXPERIENCE (1):
  - '7 years of experience' at [69:90] (confidence: 0.662)

LOCATION (1):
  - 'Austin, TX' at [21:31] (confidence: 0.970)

SKILL (4):
  - 'Web Services' at [7:19] (confidence: 0.751)
  - 'NLP' at [94:97] (confidence: 0.942)
  - 'recommender systems' at [99:118] (confidence: 0.963)
  - 'deep learning' at [124:137] (confidence: 0.908)

Entities in context:
-------------

## 6. Batch File Prediction

Test entity extraction from multiple files in a single batch.
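
With `requests`, several files can be sent under one form field by passing a list of `(field_name, (filename, content, content_type))` tuples, which is the shape the cell below builds by hand. A minimal sketch of a reusable builder follows. `build_files_payload` is a hypothetical helper, the `"files"` field name is taken from the cells in this notebook, and the content type is guessed from the file extension with `mimetypes` rather than hard-coded.

```python
import mimetypes
from pathlib import Path
from typing import Iterable, List, Tuple

FilePart = Tuple[str, Tuple[str, bytes, str]]


def build_files_payload(paths: Iterable[Path], field: str = "files") -> List[FilePart]:
    """Read each file and return multipart tuples suitable for requests' `files=`."""
    payload: List[FilePart] = []
    for path in paths:
        path = Path(path)
        content_type, _ = mimetypes.guess_type(path.name)
        payload.append(
            # Fall back to a generic type when the extension is unknown
            (field, (path.name, path.read_bytes(), content_type or "application/octet-stream"))
        )
    return payload
```

It could then be used as `make_request("POST", "/predict/file/batch", files=build_files_payload(file_paths))`, which also covers PNG inputs without a separate loop.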


### 6.1 Batch with PDF Files Only


In [None]:
# Test batch with PDF files
file_paths = get_batch_file_fixture("batch_file_small", "pdf")
try:
    files_list = []
    for file_path in file_paths:
        with open(file_path, "rb") as f:
            file_content = f.read()
        files_list.append(("files", (file_path.name, file_content, "application/pdf")))
    
    result = make_request("POST", "/predict/file/batch", files=files_list)
    
    if result.get("status_code") == 200 and result.get("data"):
        predictions = result["data"].get("predictions", [])
        for i, (file_path, prediction) in enumerate(zip(file_paths, predictions), 1):
            extracted_text = prediction.get("extracted_text", "")
            entities = prediction.get("entities", [])
            if extracted_text:
                display_entities(entities, source_text=extracted_text)
except Exception as e:
    print(f"Error: {e}")


Batch size: 3 PDF files

File 1: test_resume_ner_1.pdf (1122 bytes)
File 2: test_resume_ner_2.pdf (1093 bytes)
File 3: test_resume_ner_3.pdf (1105 bytes)

Total processing time: 1266.1ms
Average per file: 422.0ms


Result 1/3: test_resume_ner_1.pdf
Processing time: 430.0ms
Extracted text length: 170 characters
Text preview: Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of experience in NLP, recommender systems, and deep learning. Skills: Pyt...

Extracted 7 entities:

DESIGNATION (1):
  - 'Lead Machine Learning Engineer' at [33:63] (confidence: 0.832)

EXPERIENCE (1):
  - '7 years of experience' at [69:90] (confidence: 0.662)

LOCATION (1):
  - 'Austin, TX' at [21:31] (confidence: 0.970)

SKILL (4):
  - 'Web Services' at [7:19] (confidence: 0.751)
  - 'NLP' at [94:97] (confidence: 0.942)
  - 'recommender systems' at [99:118] (confidence: 0.963)
  - 'deep learning' at [124:137] (confidence: 0.908)

Entities in context:
-------------------------------------

## 7. Mixed Batch Prediction

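Test entity extraction from a mix of texts, PDF files, and images. The API exposes separate endpoints for texts and files, so each group is sent in its own batch request and the results are combined afterwards.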


In [None]:
# Test mixed content (texts + PDFs + images)
# Note: API endpoints are separate, so we process them separately and combine results

texts = [get_text_fixture("text_1"), get_text_fixture("text_2")]
pdf_files = [get_file_fixture("file_1", "pdf")]
png_files = [get_file_fixture("file_1", "png")]

all_results = []

# Process texts
text_result = make_request("POST", "/predict/batch", json={"texts": texts})
if text_result.get("status_code") == 200:
    all_results.extend([
        {"type": "text", "content": text, "result": pred}
        for text, pred in zip(texts, text_result["data"].get("predictions", []))
    ])

# Process PDF files
try:
    files_list = []
    for file_path in pdf_files:
        with open(file_path, "rb") as f:
            file_content = f.read()
        files_list.append(("files", (file_path.name, file_content, "application/pdf")))
    
    pdf_result = make_request("POST", "/predict/file/batch", files=files_list)
    if pdf_result.get("status_code") == 200:
        all_results.extend([
            {"type": "pdf", "file": str(fp), "result": pred}
            for fp, pred in zip(pdf_files, pdf_result["data"].get("predictions", []))
        ])
except Exception as e:
    print(f"Error processing PDFs: {e}")

# Process image files
try:
    files_list = []
    for file_path in png_files:
        with open(file_path, "rb") as f:
            file_content = f.read()
        files_list.append(("files", (file_path.name, file_content, "image/png")))
    
    png_result = make_request("POST", "/predict/file/batch", files=files_list)
    if png_result.get("status_code") == 200:
        all_results.extend([
            {"type": "image", "file": str(fp), "result": pred}
            for fp, pred in zip(png_files, png_result["data"].get("predictions", []))
        ])
    elif png_result.get("status_code") == 400:
        error_detail = png_result.get("data", {}).get("detail", "")
        # Same dependency check as the single-image cell, including Pillow
        if "EasyOCR" in error_detail or "pytesseract" in error_detail or "Pillow" in error_detail:
            print("⚠️  OCR dependencies not installed")
except Exception as e:
    print(f"Error processing images: {e}")

# Display combined results
for item in all_results:
    result = item["result"]
    entities = result.get("entities", [])
    
    if item["type"] == "text":
        display_entities(entities, source_text=item["content"])
    else:
        extracted_text = result.get("extracted_text", "")
        if extracted_text:
            display_entities(entities, source_text=extracted_text)


Testing mixed content (texts + PDFs + images)

Note: API endpoints are separate for texts and files.
We'll process them separately and show combined results.

Content to process:
  - 2 text inputs
  - 1 PDF files
  - 1 image files
  Total: 4 items

Processing texts...
Processing PDF files...
Processing image files...

COMBINED RESULTS (4 items processed)


Item 1/4: TEXT
Text: Amazon Web Services, Austin, TX. Lead Machine Learning Engineer with 7 years of experience in NLP, r...

Extracted 7 entities:

DESIGNATION (1):
  - 'Lead Machine Learning Engineer' at [33:63] (confidence: 0.832)

EXPERIENCE (1):
  - '7 years of experience' at [69:90] (confidence: 0.662)

LOCATION (1):
  - 'Austin, TX' at [21:31] (confidence: 0.970)

SKILL (4):
  - 'Web Services' at [7:19] (confidence: 0.751)
  - 'NLP' at [94:97] (confidence: 0.942)
  - 'recommender systems' at [99:118] (confidence: 0.963)
  - 'deep learning' at [124:137] (confidence: 0.908)

Entities in context:
---------------------------------

## 8. Cross-Format Consistency Test

Test the same content across different formats (text, PDF, PNG) to verify entity extraction consistency and compare performance.
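
Consistency here is judged by comparing sets of `(text, label)` pairs per format. As a compact illustration of that comparison, the sketch below computes the shared pairs and a Jaccard overlap for the Text and PDF results. `entity_pairs` and `jaccard` are hypothetical helpers, the Jaccard score is an added summary metric that the notebook itself does not compute, and the entity lists are copied from the outputs recorded in this section.

```python
from typing import Any, Dict, List, Set, Tuple

Pair = Tuple[str, str]


def entity_pairs(entities: List[Dict[str, Any]]) -> Set[Pair]:
    """Reduce entity dicts to a set of (text, label) pairs."""
    return {(e.get("text", ""), e.get("label", "")) for e in entities}


def jaccard(a: Set[Pair], b: Set[Pair]) -> float:
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


text_entities = entity_pairs([
    {"text": "john.doe@example.com", "label": "EMAIL"},
    {"text": "WA", "label": "LOCATION"},
    {"text": "software engineer", "label": "SKILL"},
    {"text": "Email", "label": "SKILL"},
    {"text": "Phone", "label": "SKILL"},
    {"text": "Location", "label": "SKILL"},
])
pdf_entities = entity_pairs([
    {"text": "john.doe@example.com", "label": "EMAIL"},
    {"text": "is", "label": "SKILL"},
    {"text": "software engineer", "label": "SKILL"},
    {"text": "Email", "label": "SKILL"},
    {"text": "Phone", "label": "SKILL"},
])

shared = text_entities & pdf_entities
print(len(shared), "shared pairs")
print(f"{jaccard(text_entities, pdf_entities):.3f}")
# → 0.571
```

Because the pairs ignore offsets, a one-character shift in the extracted text does not count as a mismatch, while a changed label or a truncated span does.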


In [52]:
# Test the same content in different formats
sample_text = "John Doe is a software engineer at Google. Email: john.doe@example.com. Phone: +1-555-123-4567. Location: Seattle, WA."

pdf_file = get_file_fixture("file_resume_1", "pdf")
png_file = get_file_fixture("file_resume_1", "png")

results = []

# Test 1: Text format
text_result = make_request("POST", "/predict", json={"text": sample_text})
if text_result.get("status_code") == 200:
    text_data = text_result["data"]
    results.append({
        "format": "Text",
        "input": sample_text,
        "extracted_text": sample_text,
        "entities": text_data.get("entities", []),
        "processing_time_ms": text_data.get("processing_time_ms", 0),
        "num_entities": len(text_data.get("entities", []))
    })
    display_entities(text_data.get("entities", []), source_text=sample_text)

# Test 2: PDF format
try:
    with open(pdf_file, "rb") as f:
        pdf_content = f.read()
    pdf_files = {"file": (pdf_file.name, pdf_content, "application/pdf")}
    pdf_result = make_request("POST", "/predict/file", files=pdf_files)
    
    if pdf_result.get("status_code") == 200:
        pdf_data = pdf_result["data"]
        extracted_text = pdf_data.get("extracted_text", "")
        results.append({
            "format": "PDF",
            "input": str(pdf_file),
            "extracted_text": extracted_text,
            "entities": pdf_data.get("entities", []),
            "processing_time_ms": pdf_data.get("processing_time_ms", 0),
            "num_entities": len(pdf_data.get("entities", []))
        })
        display_entities(pdf_data.get("entities", []), source_text=extracted_text)
except Exception as e:
    print(f"Error loading PDF: {e}")

# Test 3: PNG format
try:
    with open(png_file, "rb") as f:
        png_content = f.read()
    png_files = {"file": (png_file.name, png_content, "image/png")}
    png_result = make_request("POST", "/predict/file", files=png_files)
    
    if png_result.get("status_code") == 200:
        png_data = png_result["data"]
        extracted_text = png_data.get("extracted_text", "")
        results.append({
            "format": "PNG",
            "input": str(png_file),
            "extracted_text": extracted_text,
            "entities": png_data.get("entities", []),
            "processing_time_ms": png_data.get("processing_time_ms", 0),
            "num_entities": len(png_data.get("entities", []))
        })
        display_entities(png_data.get("entities", []), source_text=extracted_text)
    elif png_result.get("status_code") == 400:
        error_detail = png_result.get("data", {}).get("detail", "")
        if "EasyOCR" in error_detail or "pytesseract" in error_detail or "Pillow" in error_detail:
            print("⚠️  OCR dependencies not installed")
except Exception as e:
    print(f"Error loading PNG: {e}")

# Comparison Summary
if len(results) >= 2:
    comparison_data = []
    for r in results:
        comparison_data.append({
            "Format": r["format"],
            "Processing Time (ms)": f"{r['processing_time_ms']:.1f}",
            "Entities Extracted": r["num_entities"],
            "Text Length": len(r["extracted_text"])
        })
    
    comparison_df = pd.DataFrame(comparison_data)
    display(comparison_df)
    
    # Entity consistency analysis
    if len(results) == 3:
        text_entities = set((e.get("text", ""), e.get("label", "")) for e in results[0]["entities"])
        pdf_entities = set((e.get("text", ""), e.get("label", "")) for e in results[1]["entities"])
        png_entities = set((e.get("text", ""), e.get("label", "")) for e in results[2]["entities"])
        
        common_all = text_entities & pdf_entities & png_entities
        text_pdf_only = (text_entities & pdf_entities) - png_entities
        text_png_only = (text_entities & png_entities) - pdf_entities
        pdf_png_only = (pdf_entities & png_entities) - text_entities
        
        if common_all:
            print(f"\nEntities in all formats ({len(common_all)}):")
            for entity in sorted(common_all):
                print(f"  - '{entity[0]}' ({entity[1]})")
        
        if text_pdf_only or text_png_only or pdf_png_only:
            print(f"\nFormat-specific entities:")
            if text_pdf_only:
                print(f"  Text & PDF only ({len(text_pdf_only)}): {', '.join([e[0] for e in sorted(text_pdf_only)])}")
            if text_png_only:
                print(f"  Text & PNG only ({len(text_png_only)}): {', '.join([e[0] for e in sorted(text_png_only)])}")
            if pdf_png_only:
                print(f"  PDF & PNG only ({len(pdf_png_only)}): {', '.join([e[0] for e in sorted(pdf_png_only)])}")
        
        # Performance comparison
        times = [r["processing_time_ms"] for r in results]
        formats = [r["format"] for r in results]
        min_time = min(times)
        max_time = max(times)
        
        print(f"\nPerformance: {formats[times.index(min_time)]} fastest ({min_time:.1f}ms), {formats[times.index(max_time)]} slowest ({max_time:.1f}ms)")
        
        text_time = results[0]["processing_time_ms"]
        png_time = results[2]["processing_time_ms"]
        # Require a positive baseline to avoid dividing by zero below
        if png_time > text_time > 0:
            ocr_overhead = png_time - text_time
            print(f"OCR overhead: {ocr_overhead:.1f}ms ({ocr_overhead / text_time * 100:.1f}%)")



Extracted 6 entities:

EMAIL (1):
  - 'john.doe@example.com' at [50:70] (confidence: 0.763)

LOCATION (1):
  - 'WA' at [115:117] (confidence: 0.581)

SKILL (4):
  - 'software engineer' at [14:31] (confidence: 0.884)
  - 'Email' at [43:48] (confidence: 0.929)
  - 'Phone' at [72:77] (confidence: 0.980)
  - 'Location' at [96:104] (confidence: 0.889)

Entities in context:
--------------------------------------------------------------------------------
John Doe is a [software engineer](SKILL) at Google. [Email](SKILL): [john.doe@example.com](EMAIL). [Phone](SKILL): +1-555-123-4567. [Location](SKILL): Seattle, [WA](LOCATION).
--------------------------------------------------------------------------------

Extracted 5 entities:

EMAIL (1):
  - 'john.doe@example.com' at [50:70] (confidence: 0.741)

SKILL (4):
  - 'is' at [9:11] (confidence: 0.557)
  - 'software engineer' at [14:31] (confidence: 0.859)
  - 'Email' at [43:48] (confidence: 0.850)
  - 'Phone' at [72:77] (confidence: 0.932)

Enti

| Format | Processing Time (ms) | Entities Extracted | Text Length |
|--------|----------------------|--------------------|-------------|
| Text   | 412.0                | 6                  | 118         |
| PDF    | 430.7                | 5                  | 117         |
| PNG    | 2974.7               | 5                  | 117         |



Entities in all formats (3):
  - 'Email' (SKILL)
  - 'Phone' (SKILL)
  - 'software engineer' (SKILL)

Format-specific entities:
  Text & PDF only (1): john.doe@example.com
  Text & PNG only (1): Location

Performance: Text fastest (412.0ms), PNG slowest (2974.7ms)
OCR overhead: 2562.7ms (622.0%)
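
The overhead figure is the PNG time minus the text time, expressed as a percentage of the text time. A small sketch of that arithmetic with a guard for a zero baseline follows; `relative_overhead` is a hypothetical helper, and the timings are the ones reported in the comparison table above.

```python
from typing import Optional, Tuple


def relative_overhead(
    baseline_ms: float, measured_ms: float
) -> Tuple[float, Optional[float]]:
    """Return (absolute overhead in ms, overhead as % of baseline or None)."""
    overhead = measured_ms - baseline_ms
    # A zero or negative baseline has no meaningful percentage
    percent = overhead / baseline_ms * 100 if baseline_ms > 0 else None
    return overhead, percent


overhead_ms, overhead_pct = relative_overhead(412.0, 2974.7)
print(f"OCR overhead: {overhead_ms:.1f}ms ({overhead_pct:.1f}%)")
# → OCR overhead: 2562.7ms (622.0%)
```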
