# OCR Processing for PDF Page Images

This notebook applies OCR to PNG page images (from PDFs in `../tmp/input_pdf`) and saves the extracted text and metadata as JSON files in `../tmp/output/`.

In [8]:
%load_ext autoreload
%autoreload 2 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
# TODO: Currently `results = ocr_engine.run_ocr(image)` in `process_document_folder(image_dir, output_dir)` is just a dummy
# Ideas for models/approaches to try:
# - SuryaOCR
# - TrOCR
# - Huggingface: SmolVLM, ColPali
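
One way the dummy `run_ocr` could be filled in, given that later cells in this notebook post to Ollama's `/api/generate` endpoint, is sketched below. The names `build_generate_payload`, `text_to_records`, and `run_ocr_sketch`, the prompt, and the one-record-per-line output are assumptions for illustration, not the project's actual implementation. The `{"text": ...}` record shape simply mirrors what the later cells read back from the JSON files.

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_payload(image_path, model="gemma3:12b",
                           prompt="Transcribe all text on this page."):
    """Build the JSON body for /api/generate with one base64-encoded image."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}


def text_to_records(text):
    """Split model output into one {"text": ...} record per non-empty line (assumed shape)."""
    return [{"text": line.strip()} for line in text.splitlines() if line.strip()]


def run_ocr_sketch(image_path, url=OLLAMA_URL, timeout=300):
    """Hypothetical stand-in for ocr_engine.run_ocr; needs a running Ollama daemon."""
    body = json.dumps(build_generate_payload(image_path)).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return text_to_records(json.loads(resp.read())["response"])
```

Setting `"stream": False` asks the server for a single JSON object instead of streamed chunks, which keeps the parsing step to one `json.loads` call.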

### Prerequisites
**Ollama 0.6+ is required**

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull the multimodal Gemma 3 build that fits the VRAM you actually have:

```bash
ollama pull gemma3:12b         # ~8 GB VRAM, good quality
```

or   

```bash
ollama pull gemma3:4b          # ~3 GB VRAM, lighter and faster, lower quality
```

### Quick-start cheat sheet for running **Ollama** with Gemma locally

| Step | Command | What it does |
| --- | --- | --- |
| **Install** | `curl -fsSL https://ollama.com/install.sh \| sh` | Installs the daemon (`ollama serve`) and CLI (`ollama`). |
| **Start the daemon** | `ollama serve` | Runs a local REST API on **`http://localhost:11434`** (default port). Keep this terminal open, or run it as a systemd service. |
| **Pull a model** | `ollama pull gemma3:12b`<br>*(or `gemma3:4b` if VRAM-starved)* | Downloads the model weights. The model then shows up in `ollama list`. |
| **Quick sanity test** | `ollama run gemma3:12b` | Opens an interactive REPL; type anything and get a response. Type `/bye` or press **Ctrl-D** to exit. |
| **Simple REST call** | `curl -s http://localhost:11434/api/generate -d '{"model":"gemma3:12b","prompt":"hello"}'` | Should stream back JSON chunks with text content. |
| **Python** | `import ollama`<br>`resp = ollama.generate(model="gemma3:12b", prompt="hello")`<br>`print(resp["response"])` | Works because `ollama serve` is already listening on port 11434. |
| **Custom host/port** | `export OLLAMA_HOST=http://my-server:11434` | Both the CLI and the Python lib will point to this URL. |

#### Typical friction points & fixes

* **Daemon not running** → Every CLI/API request will fail, typically with a connection error. Make sure `ollama serve` (or the systemd service) is active.

  ```bash
  sudo systemctl enable --now ollama  # on Linux, if the installer set up a system service
  ```

* **GPU out of memory** → Use the smaller 4B model (`gemma3:4b`), or a more heavily quantized tag if the Ollama library lists one for the model. `nvidia-smi` shows how much VRAM is in use.

* **Port already in use** → Start the daemon on another port with `OLLAMA_HOST=127.0.0.1:11500 ollama serve`, and set the same `OLLAMA_HOST` for the clients.

* **Proxy / WSL networking issues** → Set `OLLAMA_HOST` to the real IP/port reachable from the client.
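
Most of the friction points above come down to whether the daemon is reachable at the address the client expects. A minimal sketch of such a check follows, using only the standard library. The helper names `ollama_base_url` and `daemon_is_up` are hypothetical, and the check assumes the daemon answers `GET /api/tags` (the endpoint that lists local models).

```python
import os
import urllib.error
import urllib.request


def ollama_base_url():
    """Return the base URL from OLLAMA_HOST, falling back to the default port."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host  # OLLAMA_HOST may be given as bare host:port
    return host.rstrip("/")


def daemon_is_up(base_url=None, timeout=2.0):
    """True if GET /api/tags answers with HTTP 200, False on any connection problem."""
    url = (base_url or ollama_base_url()) + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print("Ollama reachable:", daemon_is_up())
```

Running this in a cell before the OCR loop gives a quick yes/no answer instead of a stack trace from deep inside the client.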

---

#### Recap for the notebook

1. **Start Ollama once**:

   ```bash
   ollama serve
   ```

2. **Pull the model once**:

   ```bash
   ollama pull gemma3:12b
   ```

3. **Run the model from CLI**:

   ```bash
   ollama run gemma3:12b
   ```

4. **Notebook cells** use the Python client, which talks to `http://localhost:11434` by default.

Nothing fancier is required. If `ollama run gemma3:12b` prints a reply, your daemon is good to go.


In [10]:
# --- 0. Bootstrap ----------------------------------------------------------
import sys, os
from pathlib import Path
from dotenv import load_dotenv   # pip install python-dotenv
import json, itertools

nb_dir       = Path.cwd().resolve()       # .../repo/notebooks
project_root = nb_dir.parent              # .../repo
src_dir      = project_root / "src"

# Make BOTH dirs importable: root → `config.*`, src → `src.*`
sys.path.extend([str(project_root), str(src_dir)])

# Load env so MODEL_TYPE=gemma is visible to pydantic
load_dotenv(project_root / ".env")

from config.settings import settings
assert settings.model_type == "gemma", settings.model_type

In [11]:
from tests.local_test import process_document_folder
from pathlib import Path

INPUT_PDF_DIR     = Path("../tmp/input_pdf")
IMAGE_OUTPUT_BASE = Path("../tmp/data")
OUTPUT_ROOT       = Path("../tmp/output")

for pdf in INPUT_PDF_DIR.glob("*.pdf"):
    stem        = pdf.stem.lower().replace(" ", "_")
    img_dir     = IMAGE_OUTPUT_BASE / stem
    out_dir     = OUTPUT_ROOT / stem

    process_document_folder(img_dir, out_dir)

print("OCR finished. JSONs live in", OUTPUT_ROOT)

INFO:tests.local_test:OCR processing: sample
INFO:tests.local_test:file=page_00.png (../tmp/data/sample/page_00.png)
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:tests.local_test:Written: ../tmp/output/sample/page_00.json
INFO:tests.local_test:file=page_01.png (../tmp/data/sample/page_01.png)
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:tests.local_test:Written: ../tmp/output/sample/page_01.json
INFO:tests.local_test:file=page_02.png (../tmp/data/sample/page_02.png)
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:tests.local_test:Written: ../tmp/output/sample/page_02.json
INFO:tests.local_test:file=page_03.png (../tmp/data/sample/page_03.png)
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:tests.local_test:Written: ../tmp/output/sample/page_03.json
INFO:tests.local_test:file=page_04.png (../tmp/data/sample/page_04.png)

OCR finished. JSONs live in ../tmp/output
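
The log above shows no `Written:` line for `page_04.png`, which may just be truncated output. A small check like the following can confirm whether every page image got a JSON file. The helper name `pages_missing_output` is hypothetical, and it assumes the `page_*.png` to `page_*.json` naming seen in the log.

```python
from pathlib import Path


def pages_missing_output(img_dir, out_dir):
    """Return page stems that have a PNG in img_dir but no JSON in out_dir."""
    img_dir, out_dir = Path(img_dir), Path(out_dir)
    page_stems = sorted(p.stem for p in img_dir.glob("page_*.png"))
    return [s for s in page_stems if not (out_dir / f"{s}.json").exists()]
```

For the sample document this would be called as `pages_missing_output("../tmp/data/sample", "../tmp/output/sample")`; an empty list means every page was written.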


In [14]:
some_file = next(Path(OUTPUT_ROOT).rglob("page_00.json"))
with open(some_file, encoding="utf-8") as f:
    data = json.load(f)

print("\n".join(itertools.islice((r["text"] for r in data), 30)))

World


In [15]:
def merge_pages(pdf_stem: str):
    page_dir = OUTPUT_ROOT / pdf_stem
    md_path  = page_dir / f"{pdf_stem}.md"

    page_files = sorted(page_dir.glob("page_*.json"))
    with open(md_path, "w", encoding="utf-8") as fout:
        for p in page_files:
            with open(p, encoding="utf-8") as f:
                objs = json.load(f)
                lines = [o["text"] for o in objs]
                fout.write(f"# {p.stem}\n\n")
                fout.write("\n".join(lines) + "\n\n")

    return md_path

merged = merge_pages("sample")     # whatever your PDF stem is
print("Merged text written to", merged)

Merged text written to ../tmp/output/sample/sample.md
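
`merge_pages` relies on plain `sorted()`, which orders file names as strings. That works for the two-digit, zero-padded names used here, but would misorder pages once a document passes 99 pages. A possible numeric sort key is sketched below; `page_number` is a hypothetical helper, not part of the project.

```python
import re
from pathlib import Path


def page_number(path):
    """Extract the integer from a name like page_07.json; unnumbered names sort first."""
    match = re.search(r"(\d+)$", Path(path).stem)
    return int(match.group(1)) if match else -1


names = ["page_100.json", "page_11.json", "page_02.json"]
print(sorted(names))                    # string order
print(sorted(names, key=page_number))   # numeric page order
```

Inside `merge_pages` this would amount to `sorted(page_dir.glob("page_*.json"), key=page_number)`.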
