# OCR Processing for PDF Page Images

This notebook processes the PDFs and images found in `../tmp/input_pdf`: it extracts native text where available, applies OCR to scanned pages, and saves the extracted text and metadata as JSON files in `../tmp/output/`.

In [81]:
# OCR Processing for PDF & Image Files
# ====================================
#
# Renders PDF pages to PNG when needed, applies native-text/table extraction
# or OCR/VLM for scanned pages, and writes one JSON per page (or per single image)
# under ../tmp/output/.
#
# JSON schema per page:
# {
#   "schema_version": "1.0",
#   "page": 0,
#   "size": {"width": 1653, "height": 2338},
#   "items": [
#     {"block_id":"p0_b0","text":"…","confidence":0.95,"bbox":[l,t,r,b],"page_index":0}
#   ]
# }

# 0. autoreload for rapid iteration
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
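
The per-page JSON schema listed in the header comment can be checked mechanically. The following is a minimal sketch that assumes only the keys shown in that comment; `validate_page` and `REQUIRED_ITEM_KEYS` are hypothetical names introduced for illustration and are not part of the project code.

```python
from typing import Any

# Keys every entry in "items" is expected to carry, taken from the schema comment.
REQUIRED_ITEM_KEYS = {"block_id", "text", "confidence", "bbox", "page_index"}


def validate_page(page: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one per-page dict (empty if none)."""
    problems = []
    for key in ("schema_version", "page", "size", "items"):
        if key not in page:
            problems.append(f"missing top-level key: {key}")
    size = page.get("size")
    if not isinstance(size, dict) or not {"width", "height"} <= size.keys():
        problems.append("size must contain width and height")
    for i, item in enumerate(page.get("items", [])):
        missing = REQUIRED_ITEM_KEYS - set(item)
        if missing:
            problems.append(f"item {i} missing keys: {sorted(missing)}")
        if len(item.get("bbox", [])) != 4:
            problems.append(f"item {i} bbox must be [l, t, r, b]")
    return problems


# Example page built by hand to match the schema comment.
example_page = {
    "schema_version": "1.0",
    "page": 0,
    "size": {"width": 1653, "height": 2338},
    "items": [
        {"block_id": "p0_b0", "text": "hello", "confidence": 0.95,
         "bbox": [10, 20, 110, 40], "page_index": 0}
    ],
}
print(validate_page(example_page))
```

Returning a list of problems rather than raising on the first one makes it easier to review every defect in a page file at once.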


In [82]:
# TODO: `results = ocr_engine.run_ocr(image)` in `process_document_folder(image_dir, output_dir)` is currently only a dummy implementation.
# Ideas for models/approaches to try:
# - SuryaOCR
# - TrOCR
# - Huggingface: SmolVLM, ColPali
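
One way to keep the candidate models above interchangeable is to code against a small engine interface. The sketch below is an assumption, not the project's actual API: only the method name `run_ocr(image)` comes from the TODO, while `OCREngine`, `DummyEngine`, `extract_text`, and the shape of the returned blocks are hypothetical.

```python
from typing import Any, Protocol


class OCREngine(Protocol):
    """Assumed interface: any object exposing run_ocr(image) can be plugged in."""

    def run_ocr(self, image: Any) -> list[dict[str, Any]]:
        ...


class DummyEngine:
    """Placeholder engine that ignores the image and returns one empty block."""

    def run_ocr(self, image: Any) -> list[dict[str, Any]]:
        return [{"text": "", "confidence": 0.0, "bbox": [0, 0, 0, 0]}]


def extract_text(engine: OCREngine, image: Any) -> str:
    """Join the text of all blocks an engine returns for one image."""
    return "\n".join(block["text"] for block in engine.run_ocr(image))


# The dummy engine never looks at the image, so None is enough for a demo.
placeholder_text = extract_text(DummyEngine(), image=None)
print(repr(placeholder_text))
```

With this shape, a wrapper around SuryaOCR, TrOCR, or a VLM would only need to implement `run_ocr` to replace the dummy.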

In [83]:
# 1. bootstrap paths & env
import sys, os, shutil, importlib, json, itertools
from pathlib import Path
from dotenv import load_dotenv

notebook_dir   = Path.cwd().resolve()        # .../repo/notebooks
repo_root      = notebook_dir.parent         # .../repo
src_dir        = repo_root / "src"
tmp_dir        = repo_root / "tmp"

sys.path.extend([str(repo_root), str(src_dir)])
load_dotenv(repo_root / ".env")

True

In [84]:
# 2. check settings
from config.settings import settings
assert settings.model_type in ("gemma","smolvlm","dummy"), settings.model_type

In [85]:
# 3. import & reload our unified OCR runner
import tests.local_test as lt
importlib.reload(lt)

<module 'tests.local_test' from '/Users/markuskuehnle/Documents/projects/credit-ocr-module/tests/local_test.py'>

In [86]:
# 4. prepare clean tmp/data and tmp/output
data_dir   = tmp_dir / "data"
output_dir = tmp_dir / "output"

for d in (data_dir, output_dir):
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True, exist_ok=True)

In [87]:
# 5. collect all input PDFs/images into tmp/input
#    source folder: point input_pdf_dir at wherever your PDFs/images currently live
input_pdf_dir = tmp_dir / "input_pdf"
input_dir = tmp_dir / "input"

if input_dir.exists():
    shutil.rmtree(input_dir)
input_dir.mkdir(parents=True, exist_ok=True)

# copy all existing PDFs/images into tmp/input
for ext in ("*.pdf","*.png","*.jpg","*.jpeg"):
    for f in (input_pdf_dir.glob(ext) if input_pdf_dir.exists() else []):
        shutil.copy(f, input_dir)

In [88]:
# 6. run OCR/image processing on every file under tmp/input
for file_path in sorted(input_dir.glob("*")):
    print("→ processing", file_path.name)
    lt.process_input(
        file_path,
        image_base_directory=data_dir,
        json_output_root=output_dir,
    )

print("\nOCR done → JSON in", output_dir.resolve())

→ processing sample.pdf


INFO:generic_ocr:00:28:45:Starting PDF sample.pdf with engine SmolVLMEngine
INFO:generic_ocr:00:28:45:Page 00 (PDF) → page_00.json
INFO:generic_ocr:00:28:45:Page 01 (PDF) → page_01.json
INFO:generic_ocr:00:28:45:Page 02 (PDF) → page_02.json
INFO:generic_ocr:00:28:46:Page 03 (PDF) → page_03.json
INFO:generic_ocr:00:28:46:Page 04 (PDF) → page_04.json
INFO:generic_ocr:00:28:47:Page 05 (PDF) → page_05.json
INFO:generic_ocr:00:28:47:Page 06 (PDF) → page_06.json
INFO:generic_ocr:00:28:47:Page 07 (PDF) → page_07.json
INFO:generic_ocr:00:28:47:PDF sample.pdf done: 8 native, 0 OCR



OCR done → JSON in /Users/markuskuehnle/Documents/projects/credit-ocr-module/tmp/output
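
A quick sanity check after the run is to count the page files written per document and compare the result with the log summary. This is a minimal sketch, assuming the `tmp/output/<document>/page_<n>.json` layout seen in this notebook; `count_pages` is a hypothetical helper, and the demo uses a throwaway directory so that it runs on its own.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_pages(output_root: Path) -> dict[str, int]:
    """Map each document folder name to the number of page_*.json files in it."""
    return dict(Counter(p.parent.name for p in output_root.rglob("page_*.json")))


# Demo on a temporary directory that mimics the output layout of one 8-page PDF.
with tempfile.TemporaryDirectory() as tmp:
    doc_dir = Path(tmp) / "sample"
    doc_dir.mkdir()
    for n in range(8):
        (doc_dir / f"page_{n:02d}.json").write_text("{}", encoding="utf-8")
    demo_counts = count_pages(Path(tmp))

print(demo_counts)
```

In the notebook itself, `count_pages(output_dir)` would be the corresponding call.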


In [89]:
# 7. Preview first page of first document
first_json = next(output_dir.rglob("page_00.json"), None)
if first_json is None:
    raise RuntimeError(f"No page_00.json found under {output_dir}")

page_data = json.loads(first_json.read_text(encoding="utf-8"))
print(f"\nPreview – document “{first_json.parent.name}”, page {page_data['page']}, size={page_data['size']}\n")
print("\n".join(itertools.islice((it["text"] for it in page_data["items"]), 30)))


Preview – document “sample”, page 0, size={'width': 612, 'height': 792}

{"text": "2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)\nFrom explanations to feature selection: assessing\nSHAP values as feature selection mechanism\nWilson E. Marcilio-Jr and Danilo M. Eler\nSa˜o Paulo State University - Department of Mathematics and Computer Science\nPresidente Prudente, Sa˜o Paulo/Brazil\nEmail: wilson.marcilio@unesp.br, danilo.eler@unesp.br\nAbstract—Explainabilityhasbecomeoneofthemostdiscussed technique has the same problem of trying to choose a model\ntopicsinmachinelearningresearchinrecentyears,andalthough for a task (e.g., in finance or clinical) only based on these\nalotofmethodologiesthattrytoprovideexplanationstoblack-\nmetrics; finally, although calculated as a part of the training\nbox models have been proposed to address such an issue, little\nprocess, embedded methods have to be incorporated based on\ndiscussion has been made on the pre-processing step

In [90]:
# 8. Merge per-page JSON into one Markdown for quick inspection
def merge_to_markdown(document_stem: str) -> Path:
    """
    Concatenate all page_<n>.json under tmp/output/<document_stem>/ 
    into a single Markdown file for eyeballing.
    """
    doc_dir = output_dir / document_stem
    md_path = doc_dir / f"{document_stem}.md"
    with open(md_path, "w", encoding="utf-8") as md_file:
        for page_json in sorted(doc_dir.glob("page_*.json")):
            data = json.loads(page_json.read_text(encoding="utf-8"))
            md_file.write(f"# page {data['page']}\n\n")
            for item in data["items"]:
                md_file.write(item["text"] + "\n\n")
    return md_path

merged_md = merge_to_markdown(first_json.parent.name)
print("Merged text written to", merged_md.resolve())

Merged text written to /Users/markuskuehnle/Documents/projects/credit-ocr-module/tmp/output/sample/sample.md
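
`merge_to_markdown` orders pages with a plain `sorted(...)`, which compares file names as strings. That works while every page number has the same zero-padded width, but a document with 100 or more pages would place `page_100.json` before `page_99.json`. The following sketch shows a numeric sort key; `page_number` is a hypothetical helper, not part of the notebook.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Extract the integer page index from a name like page_07.json."""
    match = re.search(r"page_(\d+)\.json$", path.name)
    if match is None:
        raise ValueError(f"not a page file: {path.name}")
    return int(match.group(1))


names = [Path("page_100.json"), Path("page_99.json"), Path("page_02.json")]

# Plain string ordering misplaces page_100; the numeric key does not.
lexicographic = [p.name for p in sorted(names)]
numeric = [p.name for p in sorted(names, key=page_number)]
print(lexicographic)
print(numeric)
```

Passing `key=page_number` to the `sorted(...)` call inside `merge_to_markdown` would apply the same ordering there.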
