<a href="https://colab.research.google.com/github/ngusadeep/CRUD-springboot/blob/main/docs_parser_with_deepseek_ocr_3b_model2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **📄 OCR + Structured Document Parser Prototype**
### Using DeepSeek-OCR:3B (Ollama)

### This notebook demonstrates a complete prototype pipeline for an OCR-based document parser designed for processing Tanzanian shipment documents (EIRs, Export Orders, etc.).

### **Core Functions**
- Accept **any document format**: `.jpg, .jpeg, .png, .webp, .tiff, .bmp, .pdf`
- If the uploaded file is a **PDF**, automatically convert it to images
- Use **Ollama + DeepSeek-OCR:3B** locally inside Google Colab
- Process image(s) and extract clean text
- Output **structured data in the desired format**

Let's begin!


## 🔧 Install Dependencies


In [None]:
# Install Ollama (server + CLI)
!curl -fsSL https://ollama.com/install.sh | sh

# Install poppler-utils for PDF -> images conversion
!apt-get update -qq && apt-get install -y -qq poppler-utils

# Python packages
!pip install -q pdf2image pillow pandas openpyxl

## Start Ollama (background) and pull the model

In [None]:
# Start the server in the background, then give it a few seconds to come up
!nohup ollama serve > /content/ollama.log 2>&1 &
!sleep 5
!ollama pull deepseek-ocr:3b
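
A fixed sleep can be too short on a slow Colab instance. The sketch below polls the server until it answers, instead of guessing a delay. It assumes Ollama listens on its default address `http://localhost:11434`; `server_is_up` and `wait_for_server` are hypothetical helper names, not part of the notebook.

```python
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def server_is_up(url=OLLAMA_URL, timeout=2.0):
    """Return True if an HTTP GET to the server root succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(probe=server_is_up, max_wait=30.0, interval=1.0):
    """Poll `probe` until it returns True or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

One way to use it is to call `wait_for_server()` in a cell between starting the server and pulling the model, and to stop with an error if it returns `False`.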

## Helper Imports, Libraries, and Constants

In [None]:
from pdf2image import convert_from_path
from PIL import Image
from IPython.display import display
from google.colab import files
import base64, subprocess, json, os, tempfile, re, io, time
import pandas as pd

# Where outputs will be stored
OUTPUT_DIR = "/content/ocr_outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)

## System Prompt

In [None]:
MASTER_SYSTEM_PROMPT = r"""
You are a precise and reliable OCR + document parser AI that extracts
structured information from ANY type of scanned document or image, including:
- shipment documents
- container EIRs
- export orders
- invoices
- receipts
- IDs
- contracts
- PDFs converted to images

Your responsibilities:
1. Read the user's instruction and extract ONLY the requested information.
2. Understand the structure and layout of the uploaded document.
3. Use intelligent field detection even if the document format changes.
4. Be robust to low-quality scans, rotated images, blur, stamps, handwriting.
5. Always respond in STRICT JSON unless the user specifies a different format.

RULES:
- Never add explanations or text outside the JSON or chosen output format.
- If a requested field does not exist, return it as null.
- If multiple pages exist, return an array of results by page_number.
- Maintain consistent keys, casing, and values.
- Do NOT hallucinate values.

You will receive:
1. The user's extraction instruction in natural language.
2. The document image (base64).
3. Optional preview metadata.

Your task:
- Parse the document.
- Extract exactly what the user asked for.
- Output in the requested format (JSON/CSV/TXT/MD/YAML).

If the user instruction is vague:
- Ask for clarification in a single short question.

You are optimized for DeepSeek-OCR:3B running inside Ollama.

"""
print("Master prompt loaded (hidden).")
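
The prompt asks the model to return missing fields as null and to keep keys consistent, but a model may not always comply. A minimal sketch of enforcing those two rules on the client side follows. `normalize_record` is a hypothetical helper, and the field list is passed in by the caller.

```python
def normalize_record(parsed, fields):
    """Keep only the requested fields; fill missing ones with None."""
    # Anything that is not a dict (None, a string, a list) yields all-None fields
    source = parsed if isinstance(parsed, dict) else {}
    return {name: source.get(name) for name in fields}
```

For example, `normalize_record(result, ["container_terminal", "shipment_date", "shipment_number", "container_number", "container_size"])` would cover the five fields requested in this notebook's extraction instruction.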

## File Upload

In [None]:
print("Upload your document here")
uploaded = files.upload()

if not uploaded:
    raise SystemExit("No file uploaded. Re-run cell and upload a file.")

uploaded_filename = list(uploaded.keys())[0]
local_path = "/content/" + uploaded_filename
with open(local_path, "wb") as f:
    f.write(uploaded[uploaded_filename])

print("Saved to:", local_path)

## Convert PDF to Images

In [None]:
def pdf_to_images(pdf_path, dpi=200, out_dir="/content/pdf_pages"):
    os.makedirs(out_dir, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for i, page in enumerate(pages):
        p = os.path.join(out_dir, f"page_{i+1}.png")
        page.save(p, "PNG")
        paths.append(p)
    return paths

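def prepare_images(file_path):
    # splitext handles names without an extension or with dots in the directory path
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        print("Converting PDF to images...")
        return pdf_to_images(file_path)
    else:
        # Already an image: pass it through unchanged
        return [file_path]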

image_paths = prepare_images(local_path)
print("Prepared images:", image_paths)
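
The dispatch above treats every non-PDF upload as an image. A small sketch of rejecting unsupported files early, using the extensions listed in the introduction, might look like this. `classify_upload` and `IMAGE_EXTS` are hypothetical names introduced for illustration.

```python
import os

# Extensions listed in the notebook's introduction
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".tiff", ".bmp"}


def classify_upload(file_path):
    """Return "pdf" or "image", or raise ValueError for anything else."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported file type: {ext or '(no extension)'}")
```

Calling it before conversion turns a confusing downstream failure (for example, a `.docx` file passed to the model as an image) into an immediate, readable error.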

In [None]:
for p in image_paths:
    print("----", p)
    try:
        img = Image.open(p)
        display(img.resize((800, int(800 * img.height / img.width))))
    except Exception as e:
        print("Error displaying image:", e)

In [None]:
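def image_to_data_uri(path):
    """Read an image file and return it as a base64 data URI."""
    with open(path, "rb") as f:
        b = f.read()
    b64 = base64.b64encode(b).decode("utf-8")
    # Map the file extension to its MIME type; fall back to JPEG if unknown
    ext = path.split('.')[-1].lower()
    mime_types = {
        "png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
        "webp": "image/webp", "tiff": "image/tiff", "bmp": "image/bmp",
    }
    mime = mime_types.get(ext, "image/jpeg")
    return f"data:{mime};base64,{b64}"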

In [None]:
user_instruction = """
Extract the following fields from the document image:
- container_terminal
- shipment_date
- shipment_number
- container_number
- container_size
"""

print("Choose your desired output format (json, csv, xlsx, txt). Default is csv:")
user_fmt = input().strip().lower()
if user_fmt in ("json", "csv", "xlsx", "txt"):
    fmt = user_fmt
else:
    fmt = "csv"
print("Extraction instruction set.")
print("Output format:", fmt)
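
Once records have been parsed, the chosen format decides how they are written. The following is a sketch only: `save_records` is a hypothetical helper, it assumes each record is a flat dict, and the `xlsx` branch assumes `pandas` and `openpyxl` are installed as in the dependency cell.

```python
import csv
import json
import os


def save_records(records, fmt, out_dir, basename="extracted"):
    """Write a list of flat dicts to out_dir in the given format; return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{basename}.{fmt}")
    # Collect column names in first-seen order across all records
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    if fmt == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f, indent=2, ensure_ascii=False)
    elif fmt == "csv":
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()
            writer.writerows(records)  # missing keys become empty cells
    elif fmt == "txt":
        with open(path, "w", encoding="utf-8") as f:
            for rec in records:
                for key in columns:
                    f.write(f"{key}: {rec.get(key)}\n")
                f.write("\n")
    elif fmt == "xlsx":
        import pandas as pd  # assumption: pandas + openpyxl are available
        pd.DataFrame(records, columns=columns).to_excel(path, index=False)
    else:
        raise ValueError(f"Unsupported format: {fmt}")
    return path
```

With the notebook's variables, a call such as `save_records(records, fmt, OUTPUT_DIR)` would write one file under `/content/ocr_outputs`.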


In [None]:
def build_model_payload(system_prompt, user_instruction, base64_image):
    """
    Prepare the prompt structure expected by the model.
    We include system prompt + user content that contains instruction and image.
    """
    # Keep message structure simple: system (context) + user (instruction + image)
    user_content = [
        {"type": "text", "text": user_instruction},
        {"type": "image_url", "image_url": {"url": base64_image}},
    ]
    return system_prompt, user_content

In [None]:
def run_deepseek_on_image(image_path, system_prompt, user_instruction, max_retries=2):
    data_uri = image_to_data_uri(image_path)
    system, user_content = build_model_payload(system_prompt, user_instruction, data_uri)

    # Prepare the Ollama CLI call; the prompt is passed via stdin.
    # The CLI reads stdin as plain text, so the payload is serialized to a JSON string.
    prompt_text = json.dumps({
        "system": system,
        "user": user_content
    })
    cmd = ["ollama", "run", "deepseek-ocr:3b"]

    for attempt in range(1, max_retries+1):
        try:
            proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = proc.communicate(input=prompt_text.encode(), timeout=120)
            output_text = out.decode(errors="ignore").strip()
            if not output_text:
                output_text = err.decode(errors="ignore").strip()
            # Return the raw text as-is; no JSON parsing is done here
            return output_text
        except subprocess.TimeoutExpired:
            proc.kill()
            if attempt == max_retries:
                raise RuntimeError("DeepSeek-OCR call timed out.")
            print("Retrying... attempt", attempt+1)
    raise RuntimeError("Failed to call DeepSeek-OCR")
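
An alternative to piping text into the CLI is Ollama's HTTP API, whose `/api/generate` endpoint accepts base64-encoded images in an `images` list. The sketch below is not the notebook's method. It assumes the server listens on the default port 11434, and `build_generate_payload` and `generate_via_api` are hypothetical helper names. Whether `deepseek-ocr:3b` honors a system prompt is not verified here.

```python
import base64
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # assumption: default port


def build_generate_payload(image_bytes, prompt, system=None, model="deepseek-ocr:3b"):
    """Build the JSON body for a non-streaming /api/generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        # The API takes raw base64 strings here, without a "data:" URI prefix
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    if system:
        payload["system"] = system
    return payload


def generate_via_api(image_path, prompt, system=None, timeout=120):
    """POST one image and prompt to the local server; return the response text."""
    with open(image_path, "rb") as f:
        payload = build_generate_payload(f.read(), prompt, system)
    req = urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Keeping the payload builder separate from the network call makes the request body easy to inspect without a running server.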


In [None]:
raw_model_outputs = []
start = time.time()

for i, img_path in enumerate(image_paths):
    print(f"Processing page {i+1}/{len(image_paths)} ...")
    try:
        raw = run_deepseek_on_image(img_path, MASTER_SYSTEM_PROMPT, user_instruction)
        print("Model raw output (first 400 chars):")
        print(raw[:400])
        raw_model_outputs.append({"page_number": i+1, "image_path": img_path, "raw_output": raw})
    except Exception as e:
        print("Error processing page:", e)
        raw_model_outputs.append({"page_number": i+1, "image_path": img_path, "raw_output": None, "error": str(e)})

duration = time.time() - start
print(f"Done. Time elapsed: {duration:.1f}s")
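
The raw outputs collected above are plain strings, and a model may wrap its JSON in extra text or Markdown code fences. A minimal sketch of pulling the first JSON value out of such a string is shown here. `extract_json` is a hypothetical helper, and it returns `None` rather than raising when nothing parseable is found.

```python
import json
import re


def extract_json(text):
    """Return the first JSON object or array found in text, or None."""
    if not text:
        return None
    # Drop Markdown code-fence markers (three backticks, optionally followed by "json")
    cleaned = re.sub(r"`{3}(?:json)?", "", text)
    decoder = json.JSONDecoder()
    # Try to decode starting at each opening brace or bracket, in order
    for match in re.finditer(r"[{\[]", cleaned):
        try:
            value, _ = decoder.raw_decode(cleaned, match.start())
            return value
        except json.JSONDecodeError:
            continue
    return None
```

Applied per page, for example `extract_json(item["raw_output"])` for each entry in `raw_model_outputs`, this yields dicts or lists that can then be validated and saved.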
