## 📄 SmolDocling-256M App

Developed by IBM Research and Hugging Face, SmolDocling is an ultra-compact vision-language model designed for end-to-end multimodal document conversion. At just 256M parameters, it competes with models over 10x larger, offering accurate, structured document parsing without the overhead of traditional LVLMs (Large Vision-Language Models).

Paper Link: https://arxiv.org/html/2503.11576v1

In [None]:
!pip install gradio python-dotenv docling-core Pillow


In [None]:
# smol_docling_gradio_app.py

import os
import time
import torch
import gradio as gr
from PIL import Image
from dotenv import load_dotenv

from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import login
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Load HuggingFace token
load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")

# -------------------------------
# Load Model and Processor
# -------------------------------
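MODEL_ID = "ds4sd/SmolDocling-256M-preview"
_MODEL_CACHE = None  # (processor, model, device), filled on first load


def load_smol_docling_model():
    """Load the HuggingFace model and processor once, then reuse them."""
    global _MODEL_CACHE
    if _MODEL_CACHE is not None:
        return _MODEL_CACHE

    if HF_TOKEN:
        login(token=HF_TOKEN)

    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float32
    ).to(device)

    # Cache the loaded objects so each button click does not reload the weights
    _MODEL_CACHE = (processor, model, device)
    return _MODEL_CACHE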


# -------------------------------
# Inference Function
# -------------------------------
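def run_docling_ocr(image: Image.Image, task_prompt: str):
    """Run inference using SmolDocling on one image."""
    # Gradio passes None when the button is clicked before an image is uploaded
    if image is None:
        return "", "", "⚠️ Please upload an image first"

    processor, model, device = load_smol_docling_model()

    start_time = time.time()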

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": task_prompt}
            ]
        }
    ]

    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=1024)

    prompt_length = inputs.input_ids.shape[1]
    trimmed_ids = generated_ids[:, prompt_length:]
    doctags = processor.batch_decode(trimmed_ids, skip_special_tokens=False)[0].lstrip()
    doctags = doctags.replace("<end_of_utterance>", "").strip()

    # Convert to markdown using docling
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)
    md_content = doc.export_to_markdown()

    processing_time = time.time() - start_time

    return doctags, md_content, f"✅ Processed in {processing_time:.2f} seconds"


# -------------------------------
# Gradio UI
# -------------------------------
task_options = [
    "Convert this page to docling.",
    "Convert this table to OTSL.",
    "Convert code to text.",
    "Convert formula to latex.",
    "Convert chart to OTSL.",
    "Extract all section header elements on the page."
]

with gr.Blocks() as demo:
    gr.Markdown("# 📄 SmolDocling-256M App")
    gr.Markdown("Upload a document image and extract structured content using SmolDocling + Docling.")

    with gr.Row():
        with gr.Column(scale=1):
            image_input = gr.Image(type="pil", label="Upload Image")
            prompt_input = gr.Dropdown(label="Select Task Prompt", choices=task_options, value=task_options[0])
            run_button = gr.Button("🔍 Run the model")

        with gr.Column(scale=2):
            doctag_output = gr.Textbox(label="DocTags Output", lines=15)
            markdown_output = gr.Markdown()
            status = gr.Textbox(label="Status")

    run_button.click(
        fn=run_docling_ocr,
        inputs=[image_input, prompt_input],
        outputs=[doctag_output, markdown_output, status]
    )


if __name__ == "__main__":
    demo.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
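
The app strips the `<end_of_utterance>` marker from the decoded text before handing it to Docling. Because generation is capped at `max_new_tokens=1024`, a dense page may be cut off before the model finishes. The sketch below isolates that cleanup step and adds a simple truncation check. The helper names `clean_doctags` and `looks_truncated` are hypothetical, the sample string is a simplified stand-in for real model output, and the check assumes that a complete DocTags string ends with `</doctag>`.

```python
END_OF_UTTERANCE = "<end_of_utterance>"  # marker the app removes from decoded output
CLOSING_TAG = "</doctag>"  # assumption: complete DocTags output ends with this tag


def clean_doctags(raw: str) -> str:
    """Remove the end-of-utterance marker and surrounding whitespace."""
    return raw.replace(END_OF_UTTERANCE, "").strip()


def looks_truncated(doctags: str) -> bool:
    """Heuristic: True when the cleaned output lacks the assumed closing tag."""
    return not doctags.endswith(CLOSING_TAG)


# Simplified sample, not real model output
sample = "  <doctag><text>Hello</text></doctag><end_of_utterance>\n"
cleaned = clean_doctags(sample)
print(cleaned)
print(looks_truncated(cleaned))
print(looks_truncated("<doctag><text>Hel"))
```

If the check reports truncation, raising `max_new_tokens` in the `model.generate` call is one option to try, at the cost of longer inference time.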
