# Notebook 0: PDF to Plain Text Extraction

This notebook extracts plain text from PDF files using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), a fast and lightweight Python binding for the MuPDF library.

## Why PyMuPDF?
- **Fast**: Written in C under the hood, significantly faster than pure-Python alternatives
- **Reliable**: Handles multi-column layouts, embedded fonts, and complex formatting (scanned, image-only PDFs have no text layer and need an extra OCR step)
- **Lightweight**: No Java or external service dependencies (unlike Apache Tika)
- **Well-maintained**: Active development with regular releases

## Use Case
If your source documents are PDFs (e.g. corporate filings, legal documents, reports), this notebook converts them to plain text files that can then be fed into the import pipeline in **Notebook 1**.

### Prerequisites
- Python 3.9+
- PDF files to process (place them in a directory of your choice)

### Expected Runtime
- ~0.1-0.2 seconds per typical PDF page (text-based PDFs; actual speed depends on hardware and page complexity)
- A 100-page document takes roughly 10-20 seconds
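
These figures are rough and will vary with hardware and page complexity. To measure throughput on your own files, something like the following sketch may help. The helper name `seconds_per_page` and the stand-in extractor `fake_extract` are assumptions for illustration; in the notebook you would pass `extract_text_by_page` from section 4 instead.

```python
import time
from typing import Callable, Iterable, List


def seconds_per_page(
    extract_pages: Callable[[str], List[str]], pdf_paths: Iterable[str]
) -> float:
    """Time extract_pages over pdf_paths and return the average seconds per page."""
    total_pages = 0
    start = time.perf_counter()
    for path in pdf_paths:
        total_pages += len(extract_pages(path))
    elapsed = time.perf_counter() - start
    # Avoid dividing by zero when no pages were extracted.
    return elapsed / total_pages if total_pages else 0.0


# Stand-in extractor so the sketch runs without any PDFs on disk.
def fake_extract(path: str) -> List[str]:
    return ["page text"] * 10


rate = seconds_per_page(fake_extract, ["a.pdf", "b.pdf"])
print(f"{rate:.6f} s/page, so about {rate * 100:.4f} s for a 100-page document")
```

Timing a handful of representative PDFs before a large batch gives a more useful estimate than any general rule of thumb.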

## 1. Install Dependencies

In [None]:
%pip install pymupdf tqdm pandas

## 2. Imports

In [None]:
import pymupdf
from pathlib import Path
from tqdm import tqdm

## 3. Configuration

Set the paths for your input PDFs and where you want the extracted text files written.
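
If you would rather not hard-code machine-specific absolute paths, one option is to read them from environment variables with relative defaults. This is only a sketch: the helper `resolve_dir`, the environment variable names, and the default directories are all assumptions, not something the pipeline requires.

```python
import os
from pathlib import Path


def resolve_dir(env_var: str, default: str) -> Path:
    """Return the directory named by env_var, or default if it is unset (hypothetical helper)."""
    return Path(os.environ.get(env_var, default)).expanduser()


# Variable names and relative defaults are assumptions for illustration.
pdf_input_dir = resolve_dir("PDF_INPUT_DIR", "pdfs/pdfs_pdfs")
text_output_dir = resolve_dir("TEXT_OUTPUT_DIR", "pdfs/pdfs_text")

print(pdf_input_dir, text_output_dir)
```

Relative defaults resolve against the notebook's working directory, so the same notebook can run unchanged on another machine.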

In [None]:
# ═══════════════════════════════════════════════════════════════
# CONFIGURATION — update these paths to match your environment
# ═══════════════════════════════════════════════════════════════

PDF_INPUT_DIR = Path("/Users/henryadamcollie/Documents/GitHub/enron_resolution_neo4j/pdfs/pdfs_pdfs")          # directory containing your PDF files
TEXT_OUTPUT_DIR = Path("/Users/henryadamcollie/Documents/GitHub/enron_resolution_neo4j/pdfs/pdfs_text")    # directory where .txt files will be written

# Set to True to also extract per-page files (one .txt per page)
SPLIT_BY_PAGE = False

## 4. PDF Text Extractor

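The `extract_text_from_pdf` function opens a single PDF and returns its full plain-text content; `extract_text_by_page` returns the same text as a list with one entry per page. With PyMuPDF's `get_text()` method you can expect:
- Text in the order it is stored in the PDF, which usually, but not always, matches reading order (pass `sort=True` to order the output by position on the page)
- Embedded fonts and Unicode characters to be decoded to text
- Headers, footers, and page numbers to be included in the output, not stripped
- Tables as best-effort plain text, without row or column structure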
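
If the extracted text comes out in an unexpected order, it can help to look at the positioned blocks that `page.get_text("blocks")` returns, as tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The sketch below orders such tuples top to bottom, then left to right, which is roughly what passing `sort=True` asks PyMuPDF to do. The helper name `blocks_in_reading_order` and the sample coordinates are made up for illustration.

```python
from typing import List, Tuple

# Shape of one entry returned by page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 is text, 1 is image.
Block = Tuple[float, float, float, float, str, int, int]


def blocks_in_reading_order(blocks: List[Block]) -> str:
    """Join text blocks top to bottom, then left to right (hypothetical helper)."""
    text_blocks = [b for b in blocks if b[6] == 0]  # skip image blocks
    ordered = sorted(text_blocks, key=lambda b: (round(b[1]), b[0]))  # y0, then x0
    return "\n".join(b[4].strip() for b in ordered)


# Made-up blocks standing in for page.get_text("blocks") output,
# deliberately stored out of visual order.
sample_blocks = [
    (72.0, 300.0, 500.0, 320.0, "Second paragraph\n", 1, 0),
    (72.0, 100.0, 500.0, 120.0, "Title\n", 0, 0),
    (72.0, 200.0, 500.0, 280.0, "<image>", 2, 1),
    (72.0, 150.0, 500.0, 170.0, "First paragraph\n", 3, 0),
]

print(blocks_in_reading_order(sample_blocks))
```

A plain top-to-bottom sort interleaves side-by-side columns, so genuinely multi-column pages may need the columns separated before sorting.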

In [None]:
def extract_text_from_pdf(pdf_path: Path) -> str:
    """Extract all text from a PDF file and return it as a single string."""
    # Pages are separated by a blank line in the combined output.
    return "\n\n".join(extract_text_by_page(pdf_path))


def extract_text_by_page(pdf_path: Path) -> list[str]:
    """Extract text from a PDF file, returning a list with one entry per page."""
    # The context manager closes the document even if a page fails to parse.
    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]

## 5. Single-File Test

Try extracting text from one PDF to verify everything works before processing a full directory.

In [None]:
# Find the first PDF in the input directory for a quick test
pdf_files = sorted(PDF_INPUT_DIR.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF file(s) in {PDF_INPUT_DIR.resolve()}")

if pdf_files:
    sample_pdf = pdf_files[0]
    print(f"\nTesting with: {sample_pdf.name}")
    print("═" * 60)

    sample_text = extract_text_from_pdf(sample_pdf)
    # Show a preview (first 2000 chars)
    preview = sample_text[:2000]
    print(preview)
    if len(sample_text) > 2000:
        print(f"\n... [{len(sample_text) - 2000:,} more characters]")
    print(f"\nTotal characters extracted: {len(sample_text):,}")
else:
    print(f"\nNo PDF files found. Place your PDFs in: {PDF_INPUT_DIR.resolve()}")

## 6. Batch Processing

Process all PDFs in the input directory and write `.txt` files to the output directory. Each output file keeps the same name as its source PDF but with a `.txt` extension.
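
The naming scheme can be sketched with `pathlib` alone. The two helper names below are hypothetical; the `_p{n}` suffix for per-page files mirrors what the batch cell writes when `SPLIT_BY_PAGE` is enabled.

```python
from pathlib import Path


def combined_output_path(pdf_path: Path, output_dir: Path) -> Path:
    """Map report.pdf to <output_dir>/report.txt."""
    return output_dir / f"{pdf_path.stem}.txt"


def page_output_path(pdf_path: Path, output_dir: Path, page_number: int) -> Path:
    """Map report.pdf, page 3, to <output_dir>/report_p3.txt (1-based page numbers)."""
    return output_dir / f"{pdf_path.stem}_p{page_number}.txt"


out_dir = Path("pdfs_text")
source = Path("filings/annual.report.pdf")

# Path.stem drops only the final suffix, so dots inside the name are preserved.
print(combined_output_path(source, out_dir))
print(page_output_path(source, out_dir, 3))
```

Because output names are derived from the file stem only, two PDFs with the same name would overwrite each other's output; this is not an issue with a single flat input directory.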

In [None]:
TEXT_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

pdf_files = sorted(PDF_INPUT_DIR.glob("*.pdf"))
print(f"Processing {len(pdf_files)} PDF(s) → {TEXT_OUTPUT_DIR.resolve()}\n")

results = []

for pdf_path in tqdm(pdf_files, desc="Extracting text"):
    try:
        # Extract once, page by page; the list gives both the text and the page count.
        pages = extract_text_by_page(pdf_path)
        if SPLIT_BY_PAGE:
            # Write per-page files: document_p1.txt, document_p2.txt, ...
            for i, page_text in enumerate(pages, start=1):
                out_path = TEXT_OUTPUT_DIR / f"{pdf_path.stem}_p{i}.txt"
                out_path.write_text(page_text, encoding="utf-8")
        full_text = "\n\n".join(pages)
        char_count = len(full_text)
        page_count = len(pages)

        # Always write the combined file
        out_path = TEXT_OUTPUT_DIR / f"{pdf_path.stem}.txt"
        out_path.write_text(full_text, encoding="utf-8")

        results.append({
            "file": pdf_path.name,
            "pages": page_count,
            "characters": char_count,
            "status": "ok"
        })

    except Exception as e:
        results.append({
            "file": pdf_path.name,
            "pages": 0,
            "characters": 0,
            "status": f"error: {e}"
        })

print(f"\nDone. {len(results)} file(s) processed.")

## 7. Results Summary

In [None]:
import pandas as pd

if results:
    df = pd.DataFrame(results)
    print(f"Total files:      {len(df)}")
    print(f"Successful:       {(df['status'] == 'ok').sum()}")
    print(f"Errors:           {(df['status'] != 'ok').sum()}")
    print(f"Total pages:      {df['pages'].sum():,}")
    print(f"Total characters: {df['characters'].sum():,}")
    print()
    display(df)

    # Show any errors
    errors = df[df["status"] != "ok"]
    if not errors.empty:
        print("\n⚠ Files with errors:")
        for _, row in errors.iterrows():
            print(f"  {row['file']}: {row['status']}")
else:
    print("No results — did you place PDF files in the input directory?")

## 8. Next Steps

The extracted `.txt` files are now ready for downstream processing:

1. **Feed into Notebook 1** — if the PDFs contain email-like content, adapt the parser in Notebook 1 to read from the text output directory instead of `maildir/`
2. **Direct Neo4j import** — load the text content as document nodes for full-text search and entity extraction
3. **Entity extraction** — use spaCy NER or an LLM to pull out people, organizations, and other entities from the plain text

### Troubleshooting

| Issue | Solution |
|-------|----------|
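| Empty text output | The PDF may be image-based (scanned), so there is no text layer to extract. Consider adding OCR, for example PyMuPDF's Tesseract integration (`page.get_textpage_ocr()`, which requires Tesseract to be installed). |
| Garbled characters | The PDF may use a non-standard font encoding, which plain text extraction cannot repair; running OCR on the rendered page is a common fallback. If the characters are fine but the text is out of order, try `page.get_text("text", sort=True)`. |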
| Missing tables | For structured table extraction, consider `pdfplumber` as an alternative. |
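
To spot the first issue early, a quick check for pages that yielded little or no text may be enough. The sketch below works on the list returned by `extract_text_by_page`; the helper name `pages_without_text` and the 20-character threshold are assumptions you should tune for your documents.

```python
from typing import List


def pages_without_text(pages: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose stripped text is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len(text.strip()) < min_chars
    ]


# Stand-in for the list returned by extract_text_by_page(pdf_path).
sample_pages = [
    "Quarterly report for the board of directors.",
    "",
    "  \n ",
    "Appendix A: supporting tables and figures.",
]

suspect = pages_without_text(sample_pages)
print(f"Pages that may need OCR: {suspect}")
```

Pages flagged this way are candidates for OCR, though intentionally blank pages will be flagged as well.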