# Notebook 0: PDF to Plain Text Extraction

This notebook extracts plain text from PDF files using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), with automatic fallback to OCR when needed.

**Handles both text-based and image-based (scanned) PDFs with a 3-tier extraction strategy:**
1. **PyMuPDF `get_text()`** — fast, works for text-based PDFs
2. **Tesseract OCR** — handles scanned/image-based pages where PyMuPDF returns no text
3. **Docling** — last-resort fallback for complex layouts, tables, or formats that defeat Tesseract

The notebook auto-detects which method is needed per-page.

## Use Case
If your source documents are PDFs (e.g. corporate filings, legal documents, reports), this notebook converts them to plain text files that can then be fed into the import pipeline in **Notebook 1**.

### Prerequisites
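- Python 3.10+ (the type hints in Section 4 use the `str | None` union syntax, which fails at function definition time on 3.9)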
- **Tesseract OCR** installed on your system (`brew install tesseract` on macOS, `apt install tesseract-ocr` on Linux)
- PDF files to process (place them in a directory of your choice)

### Expected Runtime
- Text-based PDFs: ~1-2 seconds per page
- OCR (Tesseract): ~2-5 seconds per page
- Docling fallback: ~5-10 seconds per page
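
As a rough planning aid, the per-page figures above can be turned into a range estimate for a whole batch. This is a minimal sketch: `estimate_runtime` and `RATES` are hypothetical names, and the rates are only the approximate figures listed above, not measurements.

```python
# Approximate seconds per page as (low, high), copied from the list above.
RATES = {
    "text": (1, 2),
    "ocr": (2, 5),
    "docling": (5, 10),
}


def estimate_runtime(page_counts: dict) -> tuple:
    """Return a (low, high) estimate in seconds for the given page counts."""
    low = sum(RATES[method][0] * n for method, n in page_counts.items())
    high = sum(RATES[method][1] * n for method, n in page_counts.items())
    return low, high


# Example: 100 native-text pages plus 20 scanned pages.
low, high = estimate_runtime({"text": 100, "ocr": 20})
print(f"Estimated runtime: {low / 60:.1f} to {high / 60:.1f} minutes")
```

Actual timings depend heavily on hardware, page size, and the OCR resolution, so treat the result as an order-of-magnitude guide.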

## 1. Install Dependencies

In [1]:
%pip install pymupdf tqdm docling pandas
# Tesseract must be installed separately (system package, not pip):
#   macOS:  brew install tesseract
#   Linux:  sudo apt install tesseract-ocr

## 2. Imports

In [None]:
import pymupdf
import shutil
import subprocess
import tempfile
import os
from pathlib import Path
from tqdm import tqdm

# Verify Tesseract is available
if shutil.which("tesseract"):
    print(f"Tesseract found: {shutil.which('tesseract')}")
else:
    print("WARNING: Tesseract not found. OCR fallback will skip to Docling.")
    print("  Install: brew install tesseract  (macOS) or apt install tesseract-ocr (Linux)")

# Verify Docling is available
try:
    from docling.document_converter import DocumentConverter
    print("Docling found")
except ImportError:
    print("WARNING: Docling not installed. Third fallback will be unavailable.")
    print("  Install: pip install docling")

Tesseract found: /opt/homebrew/bin/tesseract


## 3. Configuration

Set the paths for your input PDFs and where you want the extracted text files written.

In [None]:
# ═══════════════════════════════════════════════════════════════
# CONFIGURATION — update these paths to match your environment
# ═══════════════════════════════════════════════════════════════

# Directory containing your PDF files
PDF_INPUT_DIR = Path("/Users/henryadamcollie/Documents/GitHub/enron_resolution_neo4j/pdfs/pdfs_pdfs")
# Directory where .txt files will be written
TEXT_OUTPUT_DIR = Path("/Users/henryadamcollie/Documents/GitHub/enron_resolution_neo4j/pdfs/pdfs_text")

# Set to True to also extract per-page files (one .txt per page)
SPLIT_BY_PAGE = False
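
Before running the extraction cells, it can be worth confirming that the configured directories make sense. The following is a minimal sketch, not part of the notebook's pipeline; `check_config` is a hypothetical helper that takes the two paths as arguments and uses the same `*.pdf` glob as the later cells.

```python
from pathlib import Path


def check_config(input_dir: Path, output_dir: Path) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not input_dir.is_dir():
        problems.append(f"Input directory does not exist: {input_dir}")
    elif not any(input_dir.glob("*.pdf")):
        problems.append(f"No .pdf files found in: {input_dir}")
    if output_dir.resolve() == input_dir.resolve():
        problems.append("Input and output directories are the same")
    return problems


# Demo with a path that is assumed not to exist on this machine.
for problem in check_config(Path("no_such_dir_for_demo"), Path("out")):
    print("WARNING:", problem)
```

In the notebook itself the call would be `check_config(PDF_INPUT_DIR, TEXT_OUTPUT_DIR)`.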

## 4. PDF Text Extractor (with OCR and Docling fallback)

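Each page is first checked for embedded text. If `get_text()` returns nothing, the page falls back to Tesseract OCR, first through PyMuPDF's built-in integration and then through the `tesseract` command-line tool. If any page is still empty after that (for example because Tesseract isn't installed), Docling is run once on the entire document as a last resort.

**Extraction strategy (steps 1-3 run per page, step 4 once per document):**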
1. `page.get_text()` — instant, works for native text PDFs
2. `page.get_textpage_ocr()` — PyMuPDF's built-in Tesseract integration (fast OCR)
3. Render page to PNG → `tesseract` subprocess — handles CCITTFax/JBIG2 1-bit fax images
4. **Docling** — processes the full document when per-page methods fail

In [None]:
def extract_page_text(page):
    """Extract text from a single page, falling back to OCR if needed.

    Three extraction paths are attempted:
    1. get_text() — fast, works for native text PDFs.
    2. get_textpage_ocr() — PyMuPDF built-in Tesseract OCR, works for most scanned pages.
    3. pixmap → PNG → tesseract subprocess — slower but handles CCITTFax/JBIG2
       1-bit fax images where get_textpage_ocr() silently returns empty.

    Returns (text, method) where method is "text", "ocr", or "empty".
    Pages returning "empty" are candidates for the Docling fallback.
    """
    # Path 1: Native text extraction (instant)
    text = page.get_text()
    if text.strip():
        return text, "text"

    # Path 2: PyMuPDF built-in OCR (fast, works for most image types)
    try:
        tp = page.get_textpage_ocr(tessdata=None, language="eng", dpi=300)
        ocr_text = page.get_text(textpage=tp)
        if ocr_text.strip():
            return ocr_text, "ocr"
    except Exception:
        pass

    # Path 3: Render page → PNG → Tesseract subprocess
    # Needed for CCITTFax/DeviceGray 1-bit images (common in USPTO 2023+ grants)
    # where get_textpage_ocr() fails silently.
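    tmpname = None
    try:
        pix = page.get_pixmap(dpi=300)
        # Reserve a temp file name, close the handle, then let PyMuPDF write the PNG.
        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
            tmpname = f.name
        pix.save(tmpname)
        result = subprocess.run(
            ["tesseract", tmpname, "stdout", "--psm", "3", "-l", "eng"],
            capture_output=True, text=True, timeout=60,
        )
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout, "ocr"
    except Exception:
        pass
    finally:
        # Always remove the temp file, even if Tesseract times out or is missing.
        if tmpname and os.path.exists(tmpname):
            os.unlink(tmpname)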

    return "", "empty"


def extract_with_docling(pdf_path: Path) -> str | None:
    """Last-resort extraction using Docling for the entire document.

    Docling handles complex layouts, tables, and formats that
    PyMuPDF and Tesseract may struggle with.

    Returns the full document text, or None if Docling fails.
    """
    try:
        from docling.document_converter import DocumentConverter
        converter = DocumentConverter()
        result = converter.convert(str(pdf_path))
        text = result.document.export_to_markdown()
        if text.strip():
            return text
    except Exception as e:
        print(f"  Docling failed for {pdf_path.name}: {e}")
    return None


def extract_text_from_pdf(pdf_path: Path, max_pages: int | None = None) -> tuple[str, dict]:
    """Extract text from a PDF using a 3-tier fallback strategy.

    Tier 1: PyMuPDF get_text() per page (fast, native text)
    Tier 2: Tesseract OCR per page (scanned/image pages)
    Tier 3: Docling on the full document (if pages remain empty after tiers 1+2)

    Args:
        pdf_path: Path to the PDF file.
        max_pages: If set, only extract the first N pages. None = all pages.

    Returns (full_text, stats) where stats has page counts by method.
    """
    pages = []
    empty_indices = []

    # Tier 1 + 2: Try PyMuPDF text extraction and Tesseract OCR per page.
    # The context manager closes the document even if a page raises.
    with pymupdf.open(pdf_path) as doc:
        stats = {"text": 0, "ocr": 0, "docling": 0, "empty": 0, "total_in_pdf": len(doc)}
        page_limit = max_pages if max_pages is not None else len(doc)
        for i, page in enumerate(doc):
            if i >= page_limit:
                break
            page_text, method = extract_page_text(page)
            pages.append(page_text)
            if method == "empty":
                empty_indices.append(i)
            stats[method] += 1

    # Tier 3: If any pages are still empty, try Docling on the full document
    if empty_indices:
        docling_text = extract_with_docling(pdf_path)
        if docling_text:
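            # Docling returns the full document as one block with no per-page output,
            # so the whole text goes into the first empty slot and the other empty
            # slots stay blank. All formerly empty pages are counted as "docling".
            # Caveat: the Docling text also covers pages already extracted by tiers
            # 1 and 2, so mixed documents can contain duplicated content, and
            # Docling is not limited by max_pages.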
            pages[empty_indices[0]] = docling_text
            stats["docling"] += 1
            stats["empty"] -= 1
            for idx in empty_indices[1:]:
                pages[idx] = ""
                stats["docling"] += 1
                stats["empty"] -= 1

    return "\n\n".join(pages), stats
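
The Tier 3 merge is easier to reason about in isolation. The sketch below mirrors the slot-filling step above as a pure function on plain lists; `fill_empty_pages` is a hypothetical name and is not part of the notebook's code. It also makes one caveat visible: because the fallback text covers the whole document, content from pages that already had text can appear twice in the joined output.

```python
def fill_empty_pages(pages, empty_indices, fallback_text):
    """Put the whole-document fallback text into the first empty slot.

    Returns a new list; the remaining empty slots stay as empty strings.
    If there is no fallback text, the pages are returned unchanged.
    """
    filled = list(pages)
    if fallback_text and empty_indices:
        filled[empty_indices[0]] = fallback_text
        for idx in empty_indices[1:]:
            filled[idx] = ""
    return filled


pages = ["page one text", "", "page three text", ""]
merged = fill_empty_pages(pages, [1, 3], "FULL DOCUMENT TEXT")
print(merged)
```

If duplicated content matters downstream, one alternative design is to use the Docling text as the entire output whenever the fallback is triggered, at the cost of discarding the per-page results.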

## 5. Single-File Test

Try extracting text from one PDF to verify everything works before processing a full directory.

In [None]:
# Find the first PDF in the input directory for a quick test
pdf_files = sorted(PDF_INPUT_DIR.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF file(s) in {PDF_INPUT_DIR.resolve()}")

if pdf_files:
    sample_pdf = pdf_files[0]
    print(f"\nTesting with: {sample_pdf.name}")
    print("=" * 60)

    sample_text, stats = extract_text_from_pdf(sample_pdf)
    # Pages processed, including any that came back empty
    processed = stats["text"] + stats["ocr"] + stats["docling"] + stats["empty"]
    print(f"Processed {processed}/{stats['total_in_pdf']} pages")
    print(f"  Text (PyMuPDF):  {stats['text']} pages")
    print(f"  OCR (Tesseract): {stats['ocr']} pages")
    print(f"  Docling:         {stats['docling']} pages")
    print(f"  Empty:           {stats['empty']} pages")
    print(f"Total characters: {len(sample_text):,}")
    print("=" * 60)

    # Show a preview (first 2000 chars)
    preview = sample_text[:2000]
    print(preview)
    if len(sample_text) > 2000:
        print(f"\n... [{len(sample_text) - 2000:,} more characters]")
else:
    print(f"\nNo PDF files found. Place your PDFs in: {PDF_INPUT_DIR.resolve()}")

## 6. Batch Processing

Process all PDFs in the input directory and write `.txt` files to the output directory. Each output file keeps the same name as its source PDF but with a `.txt` extension.

In [None]:
TEXT_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

pdf_files = sorted(PDF_INPUT_DIR.glob("*.pdf"))
print(f"Processing {len(pdf_files)} PDF(s) → {TEXT_OUTPUT_DIR.resolve()}\n")

results = []

for pdf_path in tqdm(pdf_files, desc="Extracting text"):
    try:
        full_text, stats = extract_text_from_pdf(pdf_path)
        char_count = len(full_text)

        if SPLIT_BY_PAGE:
            # Per-page mode: re-run the per-page tiers to get exact page boundaries.
            # Splitting full_text on "\n\n" is unreliable because OCR and Docling
            # output contain blank lines within a page. This repeats the per-page
            # work, and the Docling fallback is not applied to per-page files.
            with pymupdf.open(pdf_path) as doc:
                for i, page in enumerate(doc, start=1):
                    page_text, _ = extract_page_text(page)
                    out_path = TEXT_OUTPUT_DIR / f"{pdf_path.stem}_p{i}.txt"
                    out_path.write_text(page_text, encoding="utf-8")

        # Always write the combined file
        out_path = TEXT_OUTPUT_DIR / f"{pdf_path.stem}.txt"
        out_path.write_text(full_text, encoding="utf-8")

        # Record the highest fallback tier that produced any pages
        if stats["docling"] > 0:
            method = "docling"
        elif stats["ocr"] > 0:
            method = "ocr"
        elif stats["text"] > 0:
            method = "text"
        else:
            method = "none"

        extracted = stats["text"] + stats["ocr"] + stats["docling"] + stats["empty"]

        results.append({
            "file": pdf_path.name,
            "pages_extracted": extracted,
            "pages_total": stats["total_in_pdf"],
            "text_pages": stats["text"],
            "ocr_pages": stats["ocr"],
            "docling_pages": stats["docling"],
            "empty_pages": stats["empty"],
            "characters": char_count,
            "method": method,
            "status": "ok" if char_count > 0 else "empty"
        })

    except Exception as e:
        results.append({
            "file": pdf_path.name,
            "pages_extracted": 0,
            "pages_total": 0,
            "text_pages": 0,
            "ocr_pages": 0,
            "docling_pages": 0,
            "empty_pages": 0,
            "characters": 0,
            "method": "error",
            "status": f"error: {e}"
        })

print(f"\nDone. {len(results)} file(s) processed.")
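
Per-page output depends on being able to tell where one page ends and the next begins, and blank lines are not a safe marker because OCR text contains them within a page. One option, sketched here as an assumption rather than what the notebook does, is to join pages with a form feed (`\f`) after stripping any form feeds already present in the page text (Tesseract's text output, for example, can end a page with one). `join_pages`, `split_pages`, and `PAGE_SEP` are hypothetical names.

```python
PAGE_SEP = "\f"  # form feed; an assumed separator, not used by the notebook


def join_pages(pages):
    """Join page texts so the page boundaries can be recovered later."""
    return PAGE_SEP.join(p.replace(PAGE_SEP, "") for p in pages)


def split_pages(combined):
    """Recover the page texts from a string built by join_pages."""
    return combined.split(PAGE_SEP)


pages = ["Intro\n\nSecond paragraph", "", "OCR page\n\nwith blank lines\f"]
recovered = split_pages(join_pages(pages))
print(len(recovered), "pages recovered")
```

The empty middle page survives the round trip, which keeps page numbers in the per-page file names aligned with the source PDF.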

## 7. Results Summary

In [None]:
import pandas as pd

if results:
    df = pd.DataFrame(results)
    print(f"Total files:      {len(df)}")
    print(f"Successful:       {(df['status'] == 'ok').sum()}")
    print(f"Empty:            {(df['status'] == 'empty').sum()}")
    print(f"Errors:           {df['status'].str.startswith('error').sum()}")
    print("\nExtraction method breakdown:")
    print(f"  Text (PyMuPDF):  {(df['method'] == 'text').sum()} files")
    print(f"  OCR (Tesseract): {(df['method'] == 'ocr').sum()} files")
    print(f"  Docling:         {(df['method'] == 'docling').sum()} files")
    print(f"\nTotal pages:      {df['pages_extracted'].sum():,}")
    print(f"  Text pages:      {df['text_pages'].sum():,}")
    print(f"  OCR pages:       {df['ocr_pages'].sum():,}")
    print(f"  Docling pages:   {df['docling_pages'].sum():,}")
    print(f"  Empty pages:     {df['empty_pages'].sum():,}")
    print(f"Total characters: {df['characters'].sum():,}")
    print()

    empty = df[df["status"] == "empty"].sort_values("file")
    if not empty.empty:
        print(f"\n{len(empty)} empty files (no text extracted by any method):")
        for _, row in empty.iterrows():
            print(f"  {row['file']}  (total pages in PDF: {row['pages_total']})")

    errors = df[df["status"].str.startswith("error")]
    if not errors.empty:
        print("\nFiles with errors:")
        for _, row in errors.iterrows():
            print(f"  {row['file']}: {row['status']}")

    display(df)
else:
    print("No results — did you place PDF files in the input directory?")
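
For larger batches it can help to keep the summary on disk next to the text files. This is a minimal sketch using only the standard library; `write_summary_csv` and the file name `extraction_summary.csv` are hypothetical, and the rows are assumed to be dicts with identical keys, like the entries in `results`.

```python
import csv
from pathlib import Path


def write_summary_csv(rows, out_path: Path) -> int:
    """Write a list of result dicts to CSV and return the number of rows written."""
    if not rows:
        return 0
    # Column order follows the key order of the first row.
    fieldnames = list(rows[0].keys())
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In the notebook this could be called as `write_summary_csv(results, TEXT_OUTPUT_DIR / "extraction_summary.csv")`.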

## 8. Next Steps

The extracted `.txt` files are now ready for downstream processing:

1. **Feed into Notebook 1** — if the PDFs contain email-like content, adapt the parser in Notebook 1 to read from the text output directory instead of `maildir/`
2. **Direct Neo4j import** — load the text content as document nodes for full-text search and entity extraction
3. **Entity extraction** — use spaCy NER or an LLM to pull out people, organizations, and other entities from the plain text

### Extraction Pipeline

```
PDF file
  │
  ├─ Tier 1: PyMuPDF get_text()          [fast, native text]
  │    └─ Got text? ✓ Done
  │
  ├─ Tier 2: Tesseract OCR               [scanned/image pages]
  │    ├─ get_textpage_ocr() built-in
  │    └─ pixmap → PNG → tesseract CLI
  │         └─ Got text? ✓ Done
  │
  └─ Tier 3: Docling                     [complex layouts, last resort]
       └─ Full document conversion
```
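
The diagram boils down to a "first non-empty result wins" chain. The sketch below shows that pattern in isolation with stand-in functions; `first_non_empty` and the three dummy extractors are hypothetical and perform no real PDF handling.

```python
def first_non_empty(extractors, source):
    """Try each (name, function) pair in order; return the first non-blank result.

    Extractors that raise are treated as having produced nothing.
    """
    for name, extract in extractors:
        try:
            text = extract(source)
        except Exception:
            continue
        if text and text.strip():
            return text, name
    return "", "empty"


# Stand-in extractors for illustration only.
def native(_source):
    return "   "  # whitespace only, so this tier counts as empty


def ocr(_source):
    raise RuntimeError("OCR engine unavailable")


def last_resort(_source):
    return "recovered text"


result = first_non_empty(
    [("text", native), ("ocr", ocr), ("docling", last_resort)], "sample.pdf"
)
print(result)
```

Unlike this sketch, the notebook applies the first two tiers per page and the last tier per document, which is why the real code tracks empty page indices.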

### Troubleshooting

| Issue | Solution |
|-------|----------|
| Empty text output (all tiers fail) | Confirm Tesseract is installed (`tesseract --version`) and that Docling imports (`python -c "import docling"`); if the import fails, run `pip install docling` |
| Tesseract not found | `brew install tesseract` (macOS) or `apt install tesseract-ocr` (Linux) |
| Docling import error | `pip install docling` — requires Python 3.9+ |
| Slow processing | Expected for scanned input: OCR is ~2-5s/page and Docling ~5-10s/page. Text-based PDFs need no OCR and are the fastest path. |
| Garbled OCR text | Try increasing DPI in `extract_page_text` (e.g. `dpi=400`) |
| Non-English PDFs | Install Tesseract language packs (`brew install tesseract-lang` on macOS), then change both `language="eng"` and the `-l eng` argument in `extract_page_text` to the target language code |
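
When switching OCR languages, it helps to see which language packs Tesseract can actually find. This is a minimal sketch built on the `tesseract --list-langs` command; `parse_list_langs` and `installed_tesseract_langs` are hypothetical helpers, and the parser assumes the output consists of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_tesseract_langs() -> list:
    """Return installed language codes, or an empty list if Tesseract is missing."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, timeout=30
    )
    # Some Tesseract versions print the list to stderr, so check both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


print(installed_tesseract_langs())
```

If the code you need (for example `deu` or `fra`) is missing from the list, OCR with that language setting will fail until the matching language pack is installed.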