
## Catalog Product Extraction
This notebook walks through extracting product information from a PDF catalog; no prior experience with the process is needed. You choose a PDF, set the exact prompt text in one place, and run a memory-safe extraction that processes the document page by page. The system reads the page text when it is available and uses the page image when that works better, so it can handle different page layouts without code changes.

Each step has a short description and ends with a single function call. After running every step, you will have a JSON file (and a CSV for convenience) listing the products the system found, with fields such as manufacturer, model, years, and part numbers. If the results are not what you expect, adjust the prompt in the prompt cell and rerun the steps; you should not need to edit the library code.


## Step 1: Environment and Imports

Purpose: Load environment variables and import the refactored library entry points.


In [1]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Auto-reload library modules so edits take effect without kernel restart
try:
    get_ipython().run_line_magic("load_ext", "autoreload")  # type: ignore[name-defined]
    get_ipython().run_line_magic("autoreload", "2")  # type: ignore[name-defined]
except Exception:
    # Not running under IPython/Jupyter; autoreload is optional, so continue without it
    pass

# Ensure project root on sys.path so we can import the local `lib` package
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

load_dotenv()

from lib import (
    PDFAnalyzer,
    AIDataExtractor,
    process_document,
    process_document_streaming,
)
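
Because `load_dotenv()` fails silently when a variable is absent, a quick check before the extraction steps can save a wasted run. The following is a minimal sketch; the helper `missing_env_vars` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not something the library is known to require under that exact name.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical variable name; substitute whatever your .env file defines.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.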


## Step 2: Configure Inputs

Purpose: Choose the PDF file and output locations.


In [2]:
PDF_DIR = Path("../pdfs").resolve()
EXTRACTED_DIR = Path("../extracted_data").resolve()
TEMP_DIR = Path("../temp").resolve()
OUTPUT_DIR = Path("../output").resolve()

PDF_DIR.mkdir(exist_ok=True, parents=True)
EXTRACTED_DIR.mkdir(exist_ok=True, parents=True)
TEMP_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# Default to the alphabetically first PDF in PDF_DIR (None if the folder is empty)
pdf_files = sorted(PDF_DIR.glob("*.pdf"))
selected_pdf = pdf_files[0] if pdf_files else None
selected_pdf


PosixPath('/Users/owen/Repos/pdf-catalog-anaylyzer/pdfs/2019_Undercar_Catalog_Small.pdf')
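
When the folder holds several catalogs, taking the alphabetically first file may not be what you want. As a sketch, a small helper could pick a file by a fragment of its name instead; `pick_pdf` is a hypothetical function written for this example and is not part of the `lib` package.

```python
from pathlib import Path
from typing import List, Optional


def pick_pdf(pdf_files: List[Path], name_contains: Optional[str] = None) -> Optional[Path]:
    """Pick a PDF by case-insensitive filename fragment, or the first one if no fragment is given."""
    if name_contains:
        needle = name_contains.lower()
        for path in pdf_files:
            if needle in path.name.lower():
                return path
        return None  # a fragment was given but nothing matched
    return pdf_files[0] if pdf_files else None
```

With the variables from the cell above, `selected_pdf = pick_pdf(pdf_files, "undercar")` would select the catalog shown in the output.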

## Step 3: Edit Prompt(s)

Purpose: Store prompt text in one cell for easy iteration during analysis.


In [3]:
# Edit the exact extraction prompt text here. This is the active prompt.
EXTRACTION_PROMPT = """
You are an expert automotive parts specialist analyzing a catalog page. Use your deep understanding of automotive systems to extract contextually accurate product information.

For each product/fitment entry, extract a JSON array of objects with:
- manufacturer, model, years, specifications, side, inner_part_number, outer_part_number, part_number, category, notes

Rules:
- Preserve exact part numbers and year formats
- If unsure about a field, use null
- Return only a JSON array; no extra commentary
"""
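
The prompt asks for a bare JSON array, but model replies sometimes arrive wrapped in a Markdown code fence or contain extra keys. The sketch below shows one way such a reply could be parsed defensively. The function `parse_items` is hypothetical and is not how the library is known to parse responses; the field list is copied from the prompt above.

```python
import json
import re
from typing import Any, Dict, List

EXPECTED_FIELDS = (
    "manufacturer", "model", "years", "specifications", "side",
    "inner_part_number", "outer_part_number", "part_number", "category", "notes",
)

# Markdown code-fence marker (three backticks), built indirectly for readability
FENCE = "`" * 3
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_items(reply: str) -> List[Dict[str, Any]]:
    """Parse a reply that should be a JSON array of objects; return [] if it is not."""
    text = reply.strip()
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)  # keep only the content inside the fence
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only the expected fields; missing ones become None (JSON null)
    return [
        {field: entry.get(field) for field in EXPECTED_FIELDS}
        for entry in data
        if isinstance(entry, dict)
    ]
```

Filling missing fields with `None` mirrors the prompt rule "If unsure about a field, use null", so every item ends up with the same set of keys.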


## Step 4: Analyze PDF Structure

Purpose: Load the PDF, extract text for all pages, and create per-page images for vision tasks (memory-safe).


In [4]:
assert selected_pdf is not None, "Add a PDF to pdfs/ before proceeding."

# Streaming-friendly setup: load PDF and basic info only
analyzer = PDFAnalyzer(selected_pdf, temp_dir=TEMP_DIR, extracted_data_dir=EXTRACTED_DIR)
loaded = analyzer.load_pdf()

if loaded:
    pdf_info = analyzer.extract_basic_info()
    print(f"Loaded {pdf_info['filename']} with {pdf_info['page_count']} pages")
else:
    raise RuntimeError("Failed to load PDF")


Extracting text: 100%|██████████| 20/20 [00:01<00:00, 12.60it/s]
Processing pages: 100%|██████████| 20/20 [00:01<00:00, 12.34it/s]

Loaded 2019_Undercar_Catalog_Small.pdf with 20 pages





## Step 5: Page Strategy and Product Page Detection

Purpose: Identify likely product pages to prioritize hybrid extraction.


In [5]:
# Page classification to focus extraction on product pages
product_pages_analysis = analyzer.detect_product_data_pages()
num_product_pages = sum(1 for p in product_pages_analysis if p.get("is_product_page"))
print(f"Identified {num_product_pages} likely product pages out of {len(product_pages_analysis)}")


Identified 18 likely product pages out of 20
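
To see which pages were flagged rather than only how many, the same `is_product_page` key can be used to list positions. This sketch assumes the analysis list contains one entry per page in page order, which the cell above does not confirm; `product_page_numbers` is a hypothetical helper.

```python
from typing import Any, Dict, List


def product_page_numbers(pages: List[Dict[str, Any]]) -> List[int]:
    """Return 1-based positions of entries flagged as product pages.

    Assumes the list has one entry per page, in page order.
    """
    return [i for i, page in enumerate(pages, start=1) if page.get("is_product_page")]
```

For example, `product_page_numbers(product_pages_analysis)` would list the positions of the flagged entries.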


## Step 6 (Option A): Streaming Hybrid Extraction

Purpose: Process pages one-by-one with a single reusable AdvancedDocumentAI instance to minimize memory usage.
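
The page-by-page idea can be illustrated with a generator that hands one page at a time to a single reusable extractor. This is a sketch of the general pattern only, not the implementation of `process_document_streaming`; `stream_items` and `fake_extractor` are hypothetical names, and the extractor is a stand-in for a model call.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Item = Dict[str, Any]


def stream_items(
    pages: Iterable[str], extract_page: Callable[[str], List[Item]]
) -> Iterator[Tuple[int, List[Item]]]:
    """Yield (page_number, items) lazily, one page at a time."""
    for page_number, page_text in enumerate(pages, start=1):
        yield page_number, extract_page(page_text)


def fake_extractor(page_text: str) -> List[Item]:
    # Hypothetical stand-in for a model call: one item per non-empty line
    return [{"part_number": line.strip()} for line in page_text.splitlines() if line.strip()]


all_items: List[Item] = []
# The same extractor callable is reused for every page
for page_number, items in stream_items(["", "A1\nB2", "C3"], fake_extractor):
    all_items.extend(items)
    print(f"Page {page_number}: {len(items)} items | total so far: {len(all_items)}")
```

Because the generator is lazy, only the current page's text and results need to be in memory alongside the accumulated item list.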


In [None]:
# Run streaming hybrid extraction (lower memory)
stream_result = process_document_streaming(
    pdf_path=selected_pdf,
    temp_dir=TEMP_DIR,
    extracted_data_dir=EXTRACTED_DIR,
    extraction_prompt=EXTRACTION_PROMPT,
    catalog_type="automotive",
    use_openai=True,
    use_gemini=False,
    use_advanced_ai=True,
    high_quality_images=False,
    minimize_memory=True,
    verbose=True,
)

extracted_items = stream_result.get("items", [])
len(extracted_items)


Streaming 20 pages in low-memory mode using OpenAI...
Page 1/20: candidate=False text_len=0
  Extracted 0 items this page | Total so far: 0
Page 2/20: candidate=True text_len=2437
  Rendered image for page 2 -> page_2.png
  Extracted 0 items this page | Total so far: 0
Page 3/20: candidate=False text_len=4568
  Extracted 0 items this page | Total so far: 0
Page 4/20: candidate=True text_len=6783
  Rendered image for page 4 -> page_4.png
  Extracted 0 items this page | Total so far: 0
Page 5/20: candidate=True text_len=1766
  Rendered image for page 5 -> page_5.png
  Extracted 3 items this page | Total so far: 3
Page 6/20: candidate=True text_len=2104
  Rendered image for page 6 -> page_6.png
  Extracted 12 items this page | Total so far: 15
Page 7/20: candidate=True text_len=3004
  Rendered image for page 7 -> page_7.png
  Extracted 77 items this page | Total so far: 92
Page 8/20: candidate=True text_len=3201
  Rendered image for page 8 -> page_8.png
  Extracted 49 items this page | To

## Step 7: Save Outputs

Purpose: Persist extracted data to JSON/CSV for downstream analysis.


In [None]:
import json
import pandas as pd

# Guard against running this cell before the earlier steps
if 'selected_pdf' not in globals() or selected_pdf is None:
    raise RuntimeError("No PDF selected. Run Steps 2-4 first.")
if 'EXTRACTED_DIR' not in globals() or 'OUTPUT_DIR' not in globals():
    raise RuntimeError("Output directories not configured. Run Step 2 first.")
if 'extracted_items' not in globals():
    print("No extracted_items found. Run Step 6 first.")
    extracted_items = []

json_path = EXTRACTED_DIR / f"{selected_pdf.stem}_items.json"
csv_path = OUTPUT_DIR / f"{selected_pdf.stem}_items.csv"

# Always write the JSON file, even when the item list is empty
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(extracted_items, f, ensure_ascii=False, indent=2)

# Write the CSV only when there is at least one item
if extracted_items:
    pd.DataFrame(extracted_items).to_csv(csv_path, index=False)
    result_paths = (json_path, csv_path)
else:
    result_paths = (json_path, None)
result_paths
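
After saving, a quick sanity check on the JSON file can show whether the prompt needs adjusting. The sketch below reloads the file and lists items that carry no part number in any of the three part-number fields named in the prompt. Both helpers are hypothetical and assume the file holds a JSON array of objects, as written by the cell above.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

PART_FIELDS = ("part_number", "inner_part_number", "outer_part_number")


def load_items(json_path: Path) -> List[Dict[str, Any]]:
    """Load the saved items; return [] if the file does not hold a JSON array."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else []


def items_without_part_number(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return items where every part-number field is missing, null, or empty."""
    return [item for item in items if not any(item.get(field) for field in PART_FIELDS)]
```

For example, `items_without_part_number(load_items(json_path))` returns the entries worth inspecting by hand; a large share of them may suggest the prompt rules need tightening.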
