From d44a543e6314e7b6b1dc9fdca6713b76c39e8252 Mon Sep 17 00:00:00 2001
From: merveenoyan
Date: Tue, 21 Oct 2025 16:46:40 +0200
Subject: [PATCH 01/28] initial commit

---
 ocr-open-models.md            | 348 ++++++++++++++++++++++++++++++++++
 ocr-open-models/thumbnail.png | Bin 0 -> 77228 bytes
 2 files changed, 348 insertions(+)
 create mode 100644 ocr-open-models.md
 create mode 100644 ocr-open-models/thumbnail.png

diff --git a/ocr-open-models.md b/ocr-open-models.md
new file mode 100644
index 0000000000..7634a799f0
--- /dev/null
+++ b/ocr-open-models.md
@@ -0,0 +1,348 @@
---
title: "Supercharge your OCR Pipelines with Open Models"
thumbnail: /blog/assets/ocr-open-models/thumbnail.png
authors:
- user: merve
- user: ariG23498
- user: davanstrien
- user: hynky
- user: andito
- user: reach-vb
- user: pcuenq
---

# Supercharge your OCR Pipelines with Open Models

TL;DR: The rise of powerful vision-language models has transformed document AI. Each model comes with unique strengths, making it tricky to choose the right one. Open-weight models offer better cost efficiency and privacy. To help you get started with them, we’ve put together this guide.

You’ll learn:

* The landscape of current models and their capabilities
* When to fine-tune models vs. use models out-of-the-box
* Key factors to consider when selecting a model for your use case
* How to move beyond OCR with multimodal retrieval and document QA

By the end, you’ll know how to choose the right OCR model, start building with it, and gain deeper insights into document AI. Let’s go!

# Table-of-Contents

- [Supercharge your OCR Pipelines with Open Models](#supercharge-your-ocr-pipelines-with-open-models)
  - [Brief Introduction to Modern OCR](#brief-introduction-to-modern-ocr)
  - [Model Capabilities](#model-capabilities)
    - [Transcription](#transcription)
    - [Handling complex components in documents](#handling-complex-components-in-documents)
    - [Output formats](#output-formats)
    - [Locality Awareness in OCR](#locality-awareness-in-ocr)
    - [Model Prompting](#model-prompting)
  - [Cutting-edge Open OCR Models](#cutting-edge-open-ocr-models)
    - [Comparing Latest Models](#comparing-latest-models)
  - [Evaluating Models](#evaluating-models)
    - [Benchmarks](#benchmarks)
    - [Cost-efficiency](#cost-efficiency)
  - [Open OCR Datasets](#open-ocr-datasets)
  - [Tools to Run Models](#tools-to-run-models)
    - [Locally](#locally)
      - [Transformers](#transformers)
      - [MLX](#mlx)
    - [Remotely](#remotely)
      - [Inference Endpoints for Managed Deployment](#inference-endpoints-for-managed-deployment)
      - [Hugging Face Jobs for Batch Inference](#hugging-face-jobs-for-batch-inference)
  - [Going Beyond OCR](#going-beyond-ocr)
    - [Visual Document Retrievers](#visual-document-retrievers)
    - [Using Vision Language Models for Document Question Answering](#using-vision-language-models-for-document-question-answering)
  - [Wrapping up](#wrapping-up)

# Brief Introduction to Modern OCR

Optical Character Recognition (OCR) is one of the earliest and longest-running challenges in computer vision. Many of AI’s first practical applications focused on turning printed text into digital form.

With the surge of [vision-language models](https://huggingface.co/blog/vlms) (VLMs), OCR has advanced significantly. Recently, many OCR models have been developed by fine-tuning existing VLMs. But today’s capabilities extend far beyond OCR: you can retrieve documents by query or answer questions about them directly.
Thanks to stronger vision features, these models can also handle low-quality scans, interpret complex elements like tables, charts, and images, and fuse text with visuals to answer open-ended questions across documents.

## Model Capabilities

### Transcription

Recent models transcribe text into a machine-readable format.
The input can include:

- Handwritten text
- Various scripts like Latin, Arabic, and Japanese characters
- Mathematical expressions
- Chemical formulas
- Image/Layout/Page number tags

OCR models convert these inputs into machine-readable text that comes in many different formats like HTML, Markdown and more.

### Handling complex components in documents

On top of text, some models can also recognize:

- Images
- Charts
- Tables

Some models know where images are inside the document, extract their coordinates, and insert them appropriately between texts. Other models generate captions for images and insert them where they appear. This is especially useful if you are feeding the machine-readable output into an LLM. Example models include [OlmOCR by AllenAI](https://huggingface.co/allenai/olmOCR-7B-0825) and [PaddleOCR-VL by PaddlePaddle](https://huggingface.co/PaddlePaddle/PaddleOCR-VL).

Models use different machine-readable output formats, such as **DocTags**, **HTML** or **Markdown** (explained in the next section, *Output formats*). The way a model handles tables and charts often depends on the output format it uses. Some models treat charts like images: they are kept as is. Other models convert charts into markdown tables or JSON, e.g., a bar chart can be converted as follows.

![Chart Rendering](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ocr/chart-rendering.png)

Similarly for tables, cells are converted into a machine-readable format while retaining context from headings and columns.

![Table Rendering](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ocr/table-rendering.png)

### Output formats

Different OCR models have different output formats. Briefly, here are the common output formats used by modern models.

- **DocTags:** DocTags is an XML-like format for documents that expresses location, text format, component-level information, and more. Below is an illustration of a paper parsed into DocTags. This format is employed by the open Docling models.

![DocTags](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ocr/doctags_v2.png)

- **HTML:** HTML is one of the most popular output formats used for document parsing as it properly encodes structure and hierarchical information.
- **Markdown:** Markdown is the most human-readable format. It’s simpler than HTML but not as expressive. For example, it can’t represent split-column tables.
- **JSON:** JSON is not a format that models use for the entire output, but it can be used to represent information in tables or charts.

The right model depends on how you plan to use its outputs:

* **Digital reconstruction**: To reconstruct documents digitally, choose a model with a layout-preserving format (e.g., DocTags or HTML).
* **LLM input or Q&A**: If the use case involves passing outputs to an LLM, pick a model that outputs Markdown and image captions, since they’re closer to natural language.
* **Programmatic use**: If you want to pass your outputs to a program (like data analysis), opt for a model that generates structured outputs like JSON (see the sketch below).
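
To make the programmatic path concrete, here is a minimal sketch of turning a structured OCR output into JSON records. The HTML table, its column names, and its values are invented for illustration; the only assumption is that your model emits tables as HTML.

```python
import io

import pandas as pd

# Hypothetical example: an HTML table, as an OCR model might emit it for a
# parsed document. The contents and column names are made up for illustration.
ocr_table_html = """
<table>
  <tr><th>Item</th><th>Q1</th><th>Q2</th></tr>
  <tr><td>Revenue</td><td>120</td><td>135</td></tr>
  <tr><td>Costs</td><td>80</td><td>90</td></tr>
</table>
"""

# pandas parses every <table> in the HTML into a DataFrame
# (this needs an HTML parser backend such as lxml installed).
table = pd.read_html(io.StringIO(ocr_table_html))[0]

# From here the data is easy to consume programmatically, e.g. as JSON records.
print(table.to_json(orient="records", indent=2))
```

The same idea applies to Markdown tables or DocTags: the more structure a model preserves, the less post-processing you need before analysis.
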
### Locality Awareness in OCR

Documents can have complex structures, like multi-column text blocks and floating figures. Older OCR models handled these documents by detecting words first and then reconstructing the page layout in post-processing so that the text is rendered in reading order, which is brittle. Modern OCR models, on the other hand, incorporate layout metadata to help preserve reading order and accuracy. This metadata is called an “anchor”, and it often comes in the form of bounding boxes. The process is also called “grounding” or “anchoring” because it helps reduce hallucinations.

### Model Prompting

OCR models take in either an image alone or a text prompt along with the image, which heavily depends on the architecture and the pre-training setup.
Some OCR models support prompt-based task switching, e.g. [granite-docling](https://huggingface.co/ibm-granite/granite-docling-258M) can parse an entire page with the prompt “Convert this page to Docling”, while it can also take prompts like “Convert this formula to LaTeX” along with a page full of formulas.
Other models, however, are trained only for parsing entire pages, and they are conditioned to do this through a system prompt.
For instance, [OlmOCR by AllenAI](https://huggingface.co/collections/allenai/olmocr-67af8630b0062a25bf1b54a1) takes a long conditioning prompt. Like many others, OlmOCR is technically an OCR fine-tuned version of a VLM (Qwen2.5VL in this case), so you can prompt for other tasks, but its performance will not be on par with its OCR capabilities.

## Cutting-edge Open OCR Models

We’ve seen an incredible wave of new models this past year. Because so much work is happening in the open, these players build on and benefit from each other’s work. A great example is AllenAI’s release of OlmOCR, which included not only the model but also the dataset used to train it, so others can build upon both in new directions. The field is incredibly active, but it’s not always obvious which model to use.

### Comparing Latest Models

To make things a bit easier, we’re putting together a non-exhaustive comparison of some of our current favorite models. All of the models below are layout-aware and can parse tables, charts, and math equations. The full list of languages each model supports is detailed in its model card, so make sure to check them if you’re interested.

| Model Name | Output formats | Features | Model Size | Multilingual? |
| :---- | :---- | :---- | :---- | :---- |
| [Nanonets-OCR2-3B](https://huggingface.co/collections/nanonets/nanonets-ocr2-68ed207f17ee6c31d226319e) | structured Markdown with semantic tagging (plus HTML tables, etc.) | Captions images in the documents; signature & watermark extraction; handles checkboxes, flowcharts, and handwriting | 4B | ✅ Supports English, Chinese, French, Arabic and more. |
| [PaddleOCR-VL](https://huggingface.co/collections/PaddlePaddle/paddleocr-vl-68f0db852483c7af0bc86849) | Markdown, JSON, HTML tables and charts | Handles handwriting and old documents; allows prompting; converts tables & charts to HTML; extracts and inserts images directly | 0.9B | ✅ Supports 109 languages |
| [dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr) | Markdown, JSON | Grounding; extracts and inserts images; handles handwriting | 3B | ✅ Multilingual (supported languages not listed) |
| [OlmOCR](https://huggingface.co/allenai/olmOCR-7B-0825) | Markdown, HTML, LaTeX | Grounding; optimized for large-scale batch processing | 8B | ❎ English-only |
| [Granite-Docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M) | DocTags | Prompt-based task switching; ability to prompt element locations with location tokens; rich output | 258M | ✅ Supports English, Japanese, Arabic and Chinese. |
| [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) | Markdown + HTML | Supports general visual understanding; can parse and re-render all charts, tables, and more into HTML; handles handwriting; memory-efficient, compresses text through images | 3B | ✅ Supports nearly 100 languages |

Here’s a small demo for you to try some of the latest models and compare their outputs.

### Evaluating Models

From cb5a45171f4348aa7c3c607327a70965fde8a719 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 17:50:33 +0200
Subject: [PATCH 20/28] Update ocr-open-models.md

Co-authored-by: Pedro Cuenca
---
 ocr-open-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 9c1e70f9b8..1b65fbbf01 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -320,7 +320,7 @@ For example, to run OCR on 100 images:
hf jobs uv run \--flavor l4x1 \\
  https://huggingface.co/datasets/uv-scripts/ocr/raw/main/nanonets-ocr.py \\
  your-input-dataset your-output-dataset \\
- \--max-samples 100
+ --max-samples 100
```

The scripts handle all the vLLM configuration and batching automatically, making batch OCR accessible without infrastructure setup.

From ed0e841f36faad5ad75ea550bef704d559f0d798 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 17:51:04 +0200
Subject: [PATCH 21/28] Update ocr-open-models.md

Co-authored-by: Pedro Cuenca
---
 ocr-open-models.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 1b65fbbf01..1d08ab5ffa 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -310,9 +310,9 @@ For many OCR applications, you want to do efficient batch inference, i.e., runni
To make this even easier, we've created [uv-scripts/ocr](https://huggingface.co/datasets/uv-scripts/ocr), a collection of ready-to-run OCR scripts that work with Hugging Face Jobs. These scripts let you run OCR on any dataset without needing your own GPU.
Simply point the script at your input dataset, and it will:

-\- Process all images in a dataset column using many different open OCR models
-\- Add OCR results as a new markdown column to the dataset
-\- Push the updated dataset with OCR results to the Hub
+- Process all images in a dataset column using many different open OCR models
+- Add OCR results as a new markdown column to the dataset
+- Push the updated dataset with OCR results to the Hub

For example, to run OCR on 100 images:

From d2a41ad73c6549b213dd831a788473467934fcaa Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 17:51:12 +0200
Subject: [PATCH 22/28] Update ocr-open-models.md

Co-authored-by: vb
---
 ocr-open-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 1d08ab5ffa..03af7907b7 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -15,7 +15,7 @@ authors:
TL;DR: The rise of powerful vision-language models has transformed document AI. Each model comes with unique strengths, making it tricky to choose the right one. Open-weight models offer better cost efficiency and privacy. To help you get started with them, we’ve put together this guide.

-You’ll learn:
+In this guide, you’ll learn:

* The landscape of current models and their capabilities
* When to fine-tune models vs. use models out-of-the-box

From d70bccd96fe06c301158076adace4f29bfe5f433 Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 17:53:43 +0200
Subject: [PATCH 23/28] fix

---
 ocr-open-models.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 03af7907b7..1dd16ac82e 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -298,7 +298,7 @@ Here is a simple method of deploying `nanonets` using vLLM as the inference engi
3. Configure the deployment setup within seconds

-![Inference Endpoints][https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/resolve/ocr/IE2.png]
+![Inference Endpoints](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/resolve/ocr/IE2.png)

4. After the endpoint is created, you can consume it using the OpenAI client snippet we provided in the previous section.

@@ -344,6 +344,6 @@ If you want to learn more about OCR and vision language models, we encourage you

- [Vision Language Models Explained](https://huggingface.co/blog/vlms)
- [Vision Language Models 2025 Update](https://huggingface.co/blog/vlms-2025)
-- \[PP-OCR-v5\](https://huggingface.co/blog/baidu/ppocrv5)
-- \[SOTA OCR on-device with Core ML and dots.ocr\](https://huggingface.co/blog/dots-ocr-ne)
+- [PP-OCR-v5](https://huggingface.co/blog/baidu/ppocrv5)
+- [SOTA OCR on-device with Core ML and dots.ocr](https://huggingface.co/blog/dots-ocr-ne)

From 3e6a3654a7cbe775370a93541130b005fd14acba Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 18:00:49 +0200
Subject: [PATCH 24/28] Update ocr-open-models.md

Co-authored-by: vb
---
 ocr-open-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 1dd16ac82e..540ab8e1e4 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -270,7 +270,7 @@ print(result)
```

**MLX**
-MLX is an open-source machine learning framework for Apple Silicon. MLX-VLM is built on top of MLX to serve vision language models easily. You can explore all the OCR models available in MLX format [here](https://huggingface.co/models?sort=trending&search=ocr).
They also come in quantized versions.
+MLX is an open-source machine learning framework for Apple Silicon. [MLX-VLM](https://github.com/Blaizzy/mlx-vlm) is built on top of MLX to serve vision language models easily. You can explore all the OCR models available in MLX format [here](https://huggingface.co/models?sort=trending&search=ocr). They also come in quantized versions.

You can install MLX-VLM as follows.
```

From 74b59765c0d7eef76510e06f074761bc3ec49b1c Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Tue, 21 Oct 2025 18:01:04 +0200
Subject: [PATCH 25/28] Update ocr-open-models.md

Co-authored-by: vb
---
 ocr-open-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ocr-open-models.md b/ocr-open-models.md
index 540ab8e1e4..20f28a4bef 100644
--- a/ocr-open-models.md
+++ b/ocr-open-models.md
@@ -137,7 +137,7 @@ To make things a bit easier, we’re putting together a non-exhaustive compariso
| [Granite-Docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M) | DocTags | Prompt-based task switching Ability to prompt element locations with location tokens Rich output | 258M | ✅Supports English, Japanese, Arabic and Chinese. |
| [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) | Markdown \+ HTML | Supports general visual understanding Can parse and re-render all charts, tables, and more into HTML Handles handwriting Memory-efficient, solves text through image | 3B | ✅Supports nearly 100 languages |

-Here’s a small demo for you to try some of the latest models and compare their outputs.
+Here’s a [small demo](https://prithivMLmods-Multimodal-OCR3.hf.space) for you to try some of the latest models and compare their outputs.
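
The patches above also mention consuming a deployed endpoint with the OpenAI client. As a rough sketch (not the exact snippet from the post), calling a vLLM-backed OCR endpoint through its OpenAI-compatible API can look like the following; the endpoint URL, token, and model id are placeholders to replace with your own deployment.

```python
import base64

from openai import OpenAI

# Placeholder values: point the client at your own vLLM-backed Inference
# Endpoint (or any OpenAI-compatible server) and the model it serves.
client = OpenAI(
    base_url="https://your-endpoint.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",  # your Hugging Face token
)

# Send a local page scan as a base64 data URL.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nanonets/Nanonets-OCR2-3B",  # placeholder model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "Convert this page to markdown."},
            ],
        }
    ],
    max_tokens=2048,
)

print(response.choices[0].message.content)
```

The same request shape works whether the model runs on Inference Endpoints, a local vLLM server, or another OpenAI-compatible backend, so you can swap deployments without changing client code.
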