# PDF Local Ingestor

This notebook ingests a PDF file locally using functions from the original project.

**Key Features:**
- Uses **PyMuPDF** for proper PDF text extraction (not Pandoc)
- Leverages `extract_pages_block_level_simple` and BioC converters from the project
- Automatically detects and removes headers/footers
- Detects section headings (Abstract, Methods, Results, etc.)
- Mocks config to bypass HPC/S3/database dependencies
- **No PostgreSQL** - **No S3** - **Fully local**
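
The mocking listed above relies on a general Python behaviour: the import system consults `sys.modules` before it searches for a module on disk. Below is a minimal sketch of that technique in isolation. The module name `demo_pkg_db` and the helper `install_stub` are hypothetical and are not part of the project.

```python
import importlib
import sys
from types import ModuleType

# Hypothetical module name, used only for this illustration.
STUB_NAME = "demo_pkg_db"


def install_stub(name: str, **attrs) -> ModuleType:
    """Register a stand-in module so later imports never execute the real one."""
    stub = ModuleType(name)
    for key, value in attrs.items():
        setattr(stub, key, value)
    sys.modules[name] = stub
    return stub


# Placeholder attributes standing in for module-level database objects.
stub = install_stub(STUB_NAME, engine=None, get_db_url=lambda: "sqlite://")

# import_module finds the entry in sys.modules and returns it directly.
imported = importlib.import_module(STUB_NAME)
print(imported is stub)
```

With a dotted name such as `package.sub.db`, Python still imports the parent packages for real, so only the leaf module is replaced by the stub.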

In [0]:
# Install requirements
%pip install -r ../requirements.txt

In [0]:
import os
import sys
from pathlib import Path
from unittest.mock import MagicMock, patch

# Add project root to path so we can import from src
project_root = Path(os.getcwd()).parent
sys.path.insert(0, str(project_root))

# ============================================================================
# MOCK CONFIG AND FILE HANDLER BEFORE ANY PROJECT IMPORTS
# This allows us to use the project code without HPC/S3/Database dependencies
# ============================================================================

# Mock config that simulates "test" storage type with local paths
MOCK_CONFIG = {
    "paths": {
        "storage": {
            "type": "test",
            "test": {
                "ingestion_path": "./output/ingestion",
                "failed_ingestion_path": "./output/failed",
                "ingestion_interim_path": "./output/interim",
                "bioc_path": "./output/bioc_xml",
                "metadata_path": "./output/metadata",
                "embeddings_path": "./output/embeddings",
            }
        },
        "model": {
            "type": "test",
            "test": {
                "summarization_model": {
                    "mistral_7b": {
                        "model_path": "./models/mistral-7b",
                        "token_limit": 2048
                    }
                }
            }
        }
    },
    # AWS config needed by file_handler_factory at import time
    "aws": {
        "aws": {
            "platform_type": "HPC"
        }
    }
}

# Create a mock YAMLConfigLoader
class MockYAMLConfigLoader:
    def get_config(self, config_name):
        return MOCK_CONFIG.get(config_name, {})

# Apply the mock BEFORE importing project modules
import src.pubtator_utils.config_handler.config_reader as config_reader
config_reader.YAMLConfigLoader = MockYAMLConfigLoader

# ============================================================================
# MOCK db.py TO PREVENT DATABASE CONNECTION AT IMPORT TIME
# db.py has module-level code: db_url = get_db_url() and engine = create_engine(db_url)
# ============================================================================
from types import ModuleType

# Create a mock db module
mock_db = ModuleType("src.pubtator_utils.db_handler.db")
mock_db.get_db_url = lambda *args, **kwargs: "postgresql://mock:mock@localhost/mock"
mock_db.db_url = "postgresql://mock:mock@localhost/mock"
mock_db.engine = None
mock_db.Session = MagicMock()
mock_db.session = MagicMock()

# Register the mock in sys.modules BEFORE any imports that might reference it
sys.modules["src.pubtator_utils.db_handler.db"] = mock_db

# Also mock the FileHandlerFactory to always return LocalFileHandler
from src.pubtator_utils.file_handler.local_handler import LocalFileHandler

class MockFileHandlerFactory:
    # Include _handlers dict that original factory has
    _handlers = {
        "local": LocalFileHandler,
        "test": LocalFileHandler,
        "s3": LocalFileHandler,  # Mock S3 to also use local
    }
    
    @staticmethod
    def get_handler(storage_type=None, platform_type=None):
        return LocalFileHandler()

import src.pubtator_utils.file_handler.file_handler_factory as file_handler_factory
file_handler_factory.FileHandlerFactory = MockFileHandlerFactory

print("✓ Mocked YAMLConfigLoader (no config file reads)")
print("✓ Mocked db.py (no PostgreSQL connection)")
print("✓ Mocked FileHandlerFactory (always returns LocalFileHandler)")

In [0]:
# ============================================================================
# IMPORT PROJECT MODULES (now safe after mocking config)
# ============================================================================

# Logger
from src.pubtator_utils.logs_handler.logger import SingletonLogger

# PDF extraction using PyMuPDF (the correct approach for PDFs!)
from src.data_ingestion.ingest_preprints_rxivs.preprint_pdf_to_bioc_converter import (
    extract_pages_block_level_simple,  # Extracts text blocks from PDF using PyMuPDF
    make_document_from_blocks,          # Converts blocks to passages with merging
    build_bioc_collection_lib,          # Creates BioC collection
    find_running_headers_footers,       # Detects repeating headers/footers
    detect_heading_and_strip_regex,     # Detects section headings
    clean_xml_text,                     # Cleans text for XML output
)

import bioc
from datetime import datetime

# Initialize logger and file handler
logger = SingletonLogger().get_logger()
file_handler = LocalFileHandler()

print("✓ All imports successful!")
print("  Using project modules (PyMuPDF-based):")
print("  - extract_pages_block_level_simple (PDF → text blocks)")
print("  - make_document_from_blocks (blocks → passages)")
print("  - build_bioc_collection_lib (passages → BioC)")
print("  - find_running_headers_footers (header/footer detection)")
print("  - detect_heading_and_strip_regex (section heading detection)")

## Configure Paths

Define input PDF and output directories. All processing happens locally.

In [0]:
# ============================================================================
# CONFIGURE INPUT/OUTPUT PATHS
# ============================================================================

# Input PDF file path
PDF_INPUT_PATH = "/Workspace/Users/jesse.americogomesdelima@gilead.com/pubtator/GileadPubtator/sample_data/attention.pdf"

# Output directory structure
OUTPUT_BASE_DIR = Path("/Workspace/Users/jesse.americogomesdelima@gilead.com/pubtator/GileadPubtator/sample_data/output")
INGESTION_PATH = OUTPUT_BASE_DIR / "ingestion"
INTERIM_PATH = OUTPUT_BASE_DIR / "interim"
BIOC_PATH = OUTPUT_BASE_DIR / "bioc_xml"
FAILED_PATH = OUTPUT_BASE_DIR / "failed"
METADATA_PATH = OUTPUT_BASE_DIR / "metadata"
EMBEDDINGS_PATH = OUTPUT_BASE_DIR / "embeddings"

# Get PDF name without extension
pdf_name = Path(PDF_INPUT_PATH).stem  # e.g., "attention"

# Create all directories
ALL_PATHS = [INGESTION_PATH, INTERIM_PATH, BIOC_PATH, FAILED_PATH, METADATA_PATH, EMBEDDINGS_PATH]
for dir_path in ALL_PATHS:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"✓ Output directories created in: {OUTPUT_BASE_DIR.resolve()}")
print(f"✓ PDF to process: {pdf_name}")

## Step 1: Read and Prepare PDF

Copy the PDF to the ingestion directory using `LocalFileHandler`.

In [0]:
# Read PDF and copy to ingestion directory
pdf_source_path = Path(PDF_INPUT_PATH).resolve()

if not file_handler.exists(str(pdf_source_path)):
    raise FileNotFoundError(f"PDF not found: {pdf_source_path}")

pdf_content = file_handler.read_file_bytes(str(pdf_source_path))
pdf_dest_path = INGESTION_PATH / f"{pdf_name}.pdf"
file_handler.write_file(str(pdf_dest_path), pdf_content)

print(f"✓ PDF: {pdf_source_path}")
print(f"✓ Size: {len(pdf_content):,} bytes")
print(f"✓ Copied to: {pdf_dest_path}")

## Step 2: Extract Text from PDF using PyMuPDF

Using `extract_pages_block_level_simple` from `preprint_pdf_to_bioc_converter.py` (PyMuPDF-based extraction).

This function:
- Opens the PDF with PyMuPDF
- Detects and removes repeating headers/footers
- Identifies table regions (to avoid duplicating table text)
- Extracts text blocks with position information
- Detects section headings (Abstract, Methods, Results, etc.)
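
To make the header/footer step concrete, here is a simplified sketch of one way to detect running headers and footers: count how often the first and last line of each page repeats. It is not the project's `find_running_headers_footers`; the function names, the `min_fraction` parameter, and the sample pages are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Set


def find_repeated_lines(pages: List[List[str]], min_fraction: float = 0.6) -> Set[str]:
    """Return top/bottom lines that recur on at least min_fraction of the pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # Only the first and last line of a page are header/footer candidates.
        candidates = {lines[0].strip(), lines[-1].strip()}
        counts.update(c for c in candidates if c)
    # Require at least two occurrences so a single page never qualifies.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    return {text for text, n in counts.items() if n >= threshold}


def strip_repeated_lines(pages: List[List[str]], repeated: Set[str]) -> List[List[str]]:
    """Drop every line whose stripped text was flagged as a running header/footer."""
    return [[ln for ln in lines if ln.strip() not in repeated] for lines in pages]


# Made-up sample pages, each given as a list of text lines.
pages = [
    ["Example running header", "Title of the paper", "Example footer"],
    ["Example running header", "We propose a new architecture.", "Example footer"],
    ["Example running header", "Results are reported below.", "Example footer"],
]
repeated = find_repeated_lines(pages)
cleaned = strip_repeated_lines(pages, repeated)
print(sorted(repeated))
print(cleaned)
```

Exact string matching misses footers that contain a changing page number; a fuller implementation would likely normalise digits or compare block positions as well.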

In [0]:
# Extract text blocks from PDF using PyMuPDF
# This properly reads the PDF (unlike Pandoc which doesn't support PDF input)

pdf_file_path = str(pdf_dest_path)

# Extract blocks from all pages
# Returns: List[List[Tuple[heading, body_text]]] - one list per page
kept_blocks_per_page = extract_pages_block_level_simple(
    pdf_path=pdf_file_path,
    table_thresh=0.2,  # Drop text blocks that overlap tables by >= 20%
)

# Count total blocks extracted
total_blocks = sum(len(page_blocks) for page_blocks in kept_blocks_per_page)
total_pages = len(kept_blocks_per_page)

print(f"✓ Extracted {total_blocks} text blocks from {total_pages} pages")
print(f"✓ Source: {pdf_file_path}")

## Step 3: Preview Extracted Blocks

Show the extracted text blocks with their detected section headings.
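
Each block in the preview is a `(heading, body_text)` pair. As a rough illustration of how a heading could be split from the start of a block, the sketch below uses a regular expression. The heading vocabulary, the `body_text` fallback label, and the function name `split_heading` are assumptions; the project's `detect_heading_and_strip_regex` may behave differently.

```python
import re
from typing import Tuple

# Assumed heading vocabulary; the project's own list may differ.
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"  # optional section number such as "2." or "3.1"
    r"(abstract|introduction|background|methods|results|discussion|conclusions?|references)"
    r"\s*(?:[:.\-]\s*|\n|$)",  # heading must end with punctuation, a newline, or the block
    re.IGNORECASE,
)


def split_heading(block_text: str, default: str = "body_text") -> Tuple[str, str]:
    """Return (heading, body); the heading falls back to `default` when none matches."""
    match = HEADING_RE.match(block_text)
    if not match:
        return default, block_text.strip()
    return match.group(1).lower(), block_text[match.end():].strip()


print(split_heading("Abstract: We study X."))
print(split_heading("2. Methods\nSamples were collected."))
print(split_heading("Results show that the model converges."))
```

Requiring punctuation or a line break after the keyword keeps ordinary sentences that merely begin with "Results" or "Methods" from being treated as headings.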

In [0]:
# Preview extracted blocks (first few from each page)
print("Extracted blocks preview:")
print("=" * 60)

for page_idx, page_blocks in enumerate(kept_blocks_per_page[:3]):  # First 3 pages
    print(f"\n📄 Page {page_idx + 1} ({len(page_blocks)} blocks)")
    print("-" * 40)
    
    for block_idx, (heading, body_text) in enumerate(page_blocks[:3]):  # First 3 blocks per page
        preview = body_text[:100] + "..." if len(body_text) > 100 else body_text
        word_count = len(body_text.split())
        print(f"  [{block_idx + 1}] {heading} ({word_count} words)")
        print(f"      {preview}")
    
    if len(page_blocks) > 3:
        print(f"      ... and {len(page_blocks) - 3} more blocks on this page")

if len(kept_blocks_per_page) > 3:
    print(f"\n... and {len(kept_blocks_per_page) - 3} more pages")

In [0]:
# Show section heading distribution
from collections import Counter

all_headings = [heading for page_blocks in kept_blocks_per_page 
                for heading, _ in page_blocks]
heading_counts = Counter(all_headings)

print("Section heading distribution:")
print("-" * 40)
for heading, count in heading_counts.most_common():
    print(f"  {heading}: {count} block(s)")

## Step 4: Convert to BioC XML

Using `make_document_from_blocks` and `build_bioc_collection_lib` from `preprint_pdf_to_bioc_converter.py`.

This step:
- Merges small consecutive blocks (minimum 100 words per passage)
- Creates properly structured BioC passages with section types
- Builds a complete BioC collection with metadata
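
The merging rule can be sketched as follows. This is a simplified stand-in, not the project's `make_document_from_blocks`: it assumes that a passage is closed when the heading changes or when the buffered text has reached `min_words`, and the name `merge_small_blocks` is hypothetical.

```python
from typing import List, Optional, Tuple

Block = Tuple[str, str]  # (heading, body_text)


def merge_small_blocks(blocks: List[Block], min_words: int = 100) -> List[Block]:
    """Concatenate consecutive blocks until each passage holds at least min_words."""
    merged: List[Block] = []
    cur_heading: Optional[str] = None
    cur_parts: List[str] = []
    cur_words = 0
    for heading, text in blocks:
        # Close the current passage when the section changes (assumed rule)
        # or when it is already long enough.
        if cur_parts and (heading != cur_heading or cur_words >= min_words):
            merged.append((cur_heading, " ".join(cur_parts)))
            cur_parts, cur_words = [], 0
        if not cur_parts:
            cur_heading = heading
        cur_parts.append(text)
        cur_words += len(text.split())
    # Flush whatever is left, even if it is shorter than min_words.
    if cur_parts:
        merged.append((cur_heading, " ".join(cur_parts)))
    return merged


blocks = [("methods", "a b"), ("methods", "c d e"), ("methods", "f"), ("results", "g h")]
print(merge_small_blocks(blocks, min_words=4))
```

Under these assumptions the last passage of a section can stay below the threshold, because merging never crosses a heading boundary.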

In [0]:
# Convert extracted blocks → BioC using the preprint converter functions

# Create metadata for the document
metadata_infons = {
    "source": "local_pdf",
    "filename": pdf_name,
    "title": pdf_name.replace("_", " ").title(),
    "full_path": str(pdf_source_path),
}

# Step 1: Convert blocks to a document dict with merged passages
# This merges consecutive blocks until each passage has at least 100 words
doc_dict = make_document_from_blocks(
    doc_id=pdf_name,
    kept_blocks_per_page=kept_blocks_per_page,
    infons=metadata_infons,
    min_words=100,  # Minimum words per passage before merging stops
)

print(f"✓ Document created: {doc_dict['id']}")
print(f"✓ Passages after merging: {len(doc_dict['passages'])}")

# Step 2: Build BioC collection
bioc_collection = build_bioc_collection_lib(
    source="Local PDF",
    date_str=datetime.now().strftime("%Y-%m-%d"),
    documents=[doc_dict],
)

print(f"✓ BioC collection created: {len(bioc_collection.documents)} document(s)")

In [0]:
# The PyMuPDF extraction already handles:
# - Header/footer removal (via find_running_headers_footers)
# - Section heading detection (via detect_heading_and_strip_regex)  
# - Small passage merging (via make_document_from_blocks with min_words)
# - XML-safe text cleaning (via clean_xml_text)

# Show passage statistics
total_passages = sum(len(doc.passages) for doc in bioc_collection.documents)
total_words = sum(
    len((passage.text or "").split())  # Treat a missing passage text as empty
    for doc in bioc_collection.documents
    for passage in doc.passages
)

print(f"✓ Total passages: {total_passages}")
print(f"✓ Total words: {total_words:,}")
print(f"✓ Average words per passage: {total_words // max(total_passages, 1)}")

In [0]:
# Save BioC XML
bioc_xml_path = BIOC_PATH / f"{pdf_name}.xml"

with open(bioc_xml_path, "w", encoding="utf-8") as f:
    bioc.dump(bioc_collection, f)

bioc_size = bioc_xml_path.stat().st_size
print(f"✓ Saved: {bioc_xml_path} ({bioc_size:,} bytes)")

## Step 5: Inspect Results

View the BioC XML structure and passages.
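
The saved file can also be checked without the `bioc` package, since BioC XML is plain XML with `document`, `passage`, `infon`, and `text` elements. The sketch below parses a small hand-written sample with the standard library; the sample content and the helper name `summarize_bioc` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hand-written sample in BioC layout; values are illustrative only.
SAMPLE_BIOC = """<collection>
  <source>Local PDF</source>
  <date>2024-01-01</date>
  <key>collection.key</key>
  <document>
    <id>attention</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>First passage text.</text>
    </passage>
    <passage>
      <infon key="type">methods</infon>
      <offset>20</offset>
      <text>Second passage with more words.</text>
    </passage>
  </document>
</collection>
"""


def summarize_bioc(xml_text: str) -> List[Tuple[str, str, int]]:
    """Return (document id, passage type, word count) for every passage."""
    root = ET.fromstring(xml_text)
    summary = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            # Collect the passage's infon key/value pairs into a dict.
            infons = {i.get("key"): i.text for i in passage.findall("infon")}
            text = passage.findtext("text") or ""
            summary.append((doc_id, infons.get("type", "body_text"), len(text.split())))
    return summary


for row in summarize_bioc(SAMPLE_BIOC):
    print(row)
```

For a file on disk, `ET.parse(path).getroot()` yields the same root element that `ET.fromstring` returns here.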

In [0]:
# Inspect BioC document
for doc in bioc_collection.documents:
    print(f"Document: {doc.id}")
    print(f"Passages: {len(doc.passages)}")
    print("-" * 50)
    
    # Show first 5 passages
    for i, passage in enumerate(doc.passages[:5]):
        # The preprint converter uses "type" for section headings
        section_type = passage.infons.get("type", "body_text")
        text = passage.text or ""  # Guard against passages without text
        word_count = len(text.split())
        preview = (text[:80] + "...") if len(text) > 80 else text
        print(f"[{i+1}] {section_type} ({word_count} words)")
        print(f"    {preview}")
    
    if len(doc.passages) > 5:
        print(f"\n... and {len(doc.passages) - 5} more passages")

In [0]:
# List generated output files
print("Generated files:")
print("-" * 50)

for root, dirs, files in os.walk(OUTPUT_BASE_DIR):
    level = root.replace(str(OUTPUT_BASE_DIR), '').count(os.sep)
    indent = '  ' * level
    folder = os.path.basename(root)
    if files:  # Only show folders with files
        print(f"{indent}{folder}/")
        for file in files:
            file_path = Path(root) / file
            size = file_path.stat().st_size
            print(f"{indent}  {file} ({size:,} bytes)")

## Summary

This notebook runs the **PDF-to-BioC ingestion steps** entirely locally, using **PyMuPDF** for PDF text extraction.

### Why PyMuPDF instead of Pandoc?

**Pandoc does NOT support PDF as an input format.** It can write PDF (through a PDF engine such as LaTeX) but cannot read it. The previous `convert_apollo_to_html()` function was broken because it passed PDF input to Pandoc.

**PyMuPDF** (`pymupdf`) properly reads PDF files and extracts text blocks with position information.
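
As a rough sketch of what block extraction with position information looks like, the code below uses PyMuPDF's `page.get_text("blocks")`, which returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The helper names are hypothetical, and the project's `extract_pages_block_level_simple` does considerably more (header/footer removal, table handling, heading detection).

```python
from typing import List, Tuple

# Shape of one PyMuPDF block: (x0, y0, x1, y1, text, block_no, block_type)
RawBlock = Tuple[float, float, float, float, str, int, int]


def keep_text_blocks(raw_blocks: List[RawBlock]) -> List[RawBlock]:
    """Keep non-empty text blocks (block_type 0), ordered top-to-bottom, left-to-right."""
    text_blocks = [b for b in raw_blocks if b[6] == 0 and b[4].strip()]
    return sorted(text_blocks, key=lambda b: (b[1], b[0]))


def extract_text_blocks(pdf_path: str) -> List[List[RawBlock]]:
    """Return one list of text blocks per page of the PDF."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append(keep_text_blocks(page.get_text("blocks")))
    return pages


# Made-up block tuples demonstrating the filter without needing a PDF file.
sample = [
    (50.0, 200.0, 300.0, 250.0, "Second\n", 1, 0),
    (50.0, 100.0, 300.0, 150.0, "First\n", 0, 0),
    (50.0, 300.0, 300.0, 400.0, "<image>", 2, 1),  # image block, dropped
    (50.0, 400.0, 300.0, 420.0, "  \n", 3, 0),  # whitespace only, dropped
]
print([b[4].strip() for b in keep_text_blocks(sample)])
```

Sorting by `(y0, x0)` is a simple reading-order heuristic that does not account for multi-column layouts.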

### Functions Used from Project:

| Function | Module | Purpose |
|----------|--------|---------|
| `LocalFileHandler` | `file_handler.local_handler` | File I/O |
| `SingletonLogger` | `logs_handler.logger` | Logging |
| `extract_pages_block_level_simple` | `preprint_pdf_to_bioc_converter` | PDF → text blocks (PyMuPDF) |
| `make_document_from_blocks` | `preprint_pdf_to_bioc_converter` | Blocks → merged passages |
| `build_bioc_collection_lib` | `preprint_pdf_to_bioc_converter` | Passages → BioC collection |
| `find_running_headers_footers` | `preprint_pdf_to_bioc_converter` | Header/footer detection |
| `detect_heading_and_strip_regex` | `preprint_pdf_to_bioc_converter` | Section heading detection |

### Built-in Processing:

The PyMuPDF-based functions automatically handle:
- ✅ **Header/footer removal** - detects repeating text at top/bottom of pages
- ✅ **Section heading detection** - Abstract, Methods, Results, Discussion, etc.
- ✅ **Table region avoidance** - prevents duplicating text from tables
- ✅ **Small passage merging** - combines blocks until min_words threshold
- ✅ **XML-safe text cleaning** - removes illegal XML characters
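
Table region avoidance depends on measuring how much of a text block falls inside a table's bounding box. The sketch below shows one plausible reading of the `table_thresh=0.2` setting used earlier: the overlap is divided by the block's own area. That choice of denominator, and the helper names, are assumptions rather than the project's confirmed behaviour.

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_fraction(block: Rect, table: Rect) -> float:
    """Fraction of the block's area that lies inside the table rectangle."""
    bx0, by0, bx1, by1 = block
    tx0, ty0, tx1, ty1 = table
    # Width and height of the intersection, clamped at zero when disjoint.
    width = max(0.0, min(bx1, tx1) - max(bx0, tx0))
    height = max(0.0, min(by1, ty1) - max(by0, ty0))
    block_area = (bx1 - bx0) * (by1 - by0)
    return (width * height) / block_area if block_area > 0 else 0.0


def overlaps_table(block: Rect, tables: Sequence[Rect], table_thresh: float = 0.2) -> bool:
    """True when the block overlaps any table by at least table_thresh."""
    return any(overlap_fraction(block, t) >= table_thresh for t in tables)


block = (0.0, 0.0, 10.0, 10.0)
print(overlap_fraction(block, (5.0, 0.0, 20.0, 10.0)))
print(overlaps_table(block, [(9.0, 0.0, 20.0, 10.0)]))
```

A block that is half covered by a table would be dropped at the 0.2 threshold, while one that only grazes the table edge would be kept.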

### Output:
```
./output/
├── ingestion/    # PDF files
└── bioc_xml/     # BioC XML output
```

**Requirement:** PyMuPDF must be installed (`pip install pymupdf`) - included in requirements.txt.