# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/document_visual_ai_with_fiftyone_workshop/blob/main/03_using_ocr_models.ipynb)


In [None]:
!pip install fiftyone
!pip install "mineru-vl-utils[transformers]"

### Load local dataset

You can load the dataset we created in the first notebook as follows:

In [None]:
import fiftyone as fo

dataset = fo.load_dataset("neurips-2025-vision-papers")

### (Alternatively) Load dataset from Hugging Face Hub

If you're starting in a fresh Colab notebook or skipped the first notebook, you can download the [Visual AI at NeurIPS 2025 dataset, which includes the embeddings from the Jina models used in the previous notebook](https://huggingface.co/datasets/harpreetsahota/visual_ai_at_neurips2025_jina), from the Hugging Face Hub.


In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("harpreetsahota/visual_ai_at_neurips2025_jina")

### Set up the model

We'll start by using [MinerU2.5](https://github.com/harpreetsahota204/mineru_2_5), a 1.2B-parameter vision-language model for high-resolution document parsing.

This model is good at:

- Comprehensive layout analysis (headers, footers, lists, code blocks)

- Complex mathematical formula parsing (including mixed Chinese-English)

- Robust table parsing (handles rotated, borderless, and partial-border tables)



#### Register the model

In [None]:
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/mineru_2_5",
    overwrite=True
)

#### Instantiate the model

In [None]:
# Load the model with its default settings
model = foz.load_zoo_model("opendatalab/MinerU2.5-2509-1.2B")

#### Use the model for OCR with bounding boxes

In [None]:
import fiftyone as fo

# Apply model for structured extraction
model.operation = "ocr_detection"
dataset.apply_model(model, label_field="text_detections")

#### Use the model for OCR text extraction

In [None]:
model.operation = "ocr"
dataset.apply_model(model, label_field="text_extraction")

You can inspect the output as follows:


In [None]:
dataset.first()['text_detections']

In [None]:
dataset.first()['text_extraction']

### Some models are promptable

Below is an example of another VLM for OCR that you can prompt to extract specific content from a document image:

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/nanonets_ocr2",
    overwrite=True
)

# Load the model
nanonets_model = foz.load_zoo_model("nanonets/Nanonets-OCR2-3B")

nanonets_model.custom_prompt = "Extract the text from the abstract section of this paper"

# Apply OCR to your dataset
dataset.apply_model(nanonets_model, label_field="nanonets_abstract")
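
To extract several pieces of content, the same pattern can be repeated with one prompt per output field. The following is a minimal sketch, assuming that reassigning `custom_prompt` between calls to `apply_model` behaves the same way as the single assignment above. The helper `run_prompts`, the `"title"` prompt, and the `nanonets_` field prefix are illustrative choices, not part of the model's API.

```python
def run_prompts(model, dataset, prompts, field_prefix="nanonets_"):
    """Apply one prompt per output field and return the field names written."""
    written = []
    for name, prompt in prompts.items():
        # Set the prompt, then store results in a field named after the prompt key
        model.custom_prompt = prompt
        label_field = f"{field_prefix}{name}"
        dataset.apply_model(model, label_field=label_field)
        written.append(label_field)
    return written


prompts = {
    "abstract": "Extract the text from the abstract section of this paper",
    "title": "Extract the title of this paper",  # hypothetical second prompt
}

# With the objects defined above, usage would look like:
# fields = run_prompts(nanonets_model, dataset, prompts)
```

Each prompt triggers a full inference pass over the dataset, so it may be worth trying new prompts on a few samples before running them everywhere.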

##### 📌 Some other models you may want to check out later:

| Model | Parameters | Output | Key Features | Good For |
|:---|:---|:---|:---|:---|
| **[`mineru-2.5`](https://github.com/harpreetsahota204/mineru_2_5)** | 1.2B | Structured markdown | Two-stage strategy: global layout on downsampled image, then fine-grained recognition on native resolution; handles complex math formulas and tables (rotated, borderless, partial-border) | Documents with complex layouts and mathematical content |
| **[`deepseek-ocr`](https://docs.voxel51.com/plugins/plugins_ecosystem/deepseek_ocr.html)** | Dual-encoder (SAM + CLIP) | Structured markdown with bounding boxes | Five resolution modes (gundam default uses multi-view processing); contextual optical compression; supports custom prompts for specific extraction tasks | Complex PDFs and multi-column layouts where you need structured output |
| **[`olmocr-2`](https://docs.voxel51.com/plugins/plugins_ecosystem/olmocr_2.html)** | 7B (Qwen2.5-VL) | Markdown with YAML front matter | Converts equations to LaTeX, tables to HTML; outputs metadata (language, rotation, table/diagram detection); aims to follow natural reading order | Academic papers and technical documents with equations and structured data |
| **[`kosmos-2.5`](https://docs.voxel51.com/plugins/plugins_ecosystem/kosmos2_5.html)** | 1.37B | OCR with bounding boxes or markdown | Two modes (OCR/markdown); automatic hardware optimization (bfloat16/float16/float32); handles handwritten text and diverse document types | General-purpose OCR when you need either coordinates or clean markdown |
| **[`nanonets-ocr2`](https://docs.voxel51.com/plugins/plugins_ecosystem/nanonets_ocr2.html)** | 3B | Structured markdown with semantic tags | LaTeX equations, image descriptions, signature/watermark detection, checkboxes to Unicode, tables to HTML, flowcharts as Mermaid, multilingual, VQA support | Documents needing semantic markup and intelligent content recognition |