# DeepSeek-OCR FiftyOne Integration Example

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/deepseek_ocr/blob/main/deepseek_ocr_example.ipynb)

This notebook demonstrates how to use DeepSeek-OCR as a FiftyOne zoo model for document analysis and OCR tasks.


## Installation

Install the required dependencies. DeepSeek-OCR requires the specific versions of `transformers` and `tokenizers` pinned below.


In [None]:
!pip install transformers==4.46.3
!pip install tokenizers==0.20.3
!pip install addict
!pip install fiftyone
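
Because the pins matter, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the standard library; `check_pins` is a hypothetical helper name, and the pinned versions are copied from the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINS = {"transformers": "4.46.3", "tokenizers": "0.20.3"}


def check_pins(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != wanted:
            mismatches[name] = found
    return mismatches


# An empty dict means every pinned package is installed at the expected version.
print(check_pins(PINS) or "all pins satisfied")
```

If a mismatch is reported, re-running the install cell and restarting the kernel is usually enough for the new versions to be picked up.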


### Optional: GPU Acceleration

For faster inference on GPU, install Flash Attention:


In [None]:
!pip install flash-attn==2.7.3 --no-build-isolation


## Load a Dataset

Load a sample document dataset from Hugging Face:


In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset
# Note: load_from_hub also accepts optional arguments such as max_samples
dataset = load_from_hub("Voxel51/document-haystack-10pages")


## Register the Zoo Model Source

Register the DeepSeek-OCR model as a remote FiftyOne zoo model source:


In [None]:
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/deepseek_ocr",
    overwrite=True,  # Overwrite any existing copy so the latest implementation is used
)


## Load the Model

Load the DeepSeek-OCR model from the zoo:


In [None]:
# Load the model
model = foz.load_zoo_model("deepseek-ai/DeepSeek-OCR")


## Understanding Resolution Modes

DeepSeek-OCR provides five resolution modes optimized for different document types:

**Single-View Modes** (`crop_mode=False`):
- **`"tiny"`** - 512x512 resolution, 64 vision tokens. Fastest, for very simple documents.
- **`"small"`** - 640x640 resolution, 100 vision tokens. Fast, for simple receipts/forms.
- **`"base"`** - 1024x1024 resolution, 256 vision tokens. Balanced, for standard documents.
- **`"large"`** - 1280x1280 resolution, 400 vision tokens. Highest quality, slower.

**Multi-View Mode**:
- **`"gundam"`** (default) - 1024 base + 640 patches, variable tokens. Multi-view for complex layouts with tables, multi-column documents, and academic papers.

The model automatically handles any input image size. Choose the mode based on document complexity, not on image dimensions.
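
The token counts listed for the single-view modes are consistent with one vision token per 64x64 pixel block (for example, 1024 / 64 = 16 and 16² = 256). This is an observation drawn from the numbers above, not a documented guarantee. The sketch below checks it and adds a hypothetical helper, `best_mode_within`, that picks the largest single-view mode fitting a token budget.

```python
# Values copied from the mode list above: square input side (pixels), vision tokens.
SINGLE_VIEW_MODES = {
    "tiny": (512, 64),
    "small": (640, 100),
    "base": (1024, 256),
    "large": (1280, 400),
}


def tokens_for_side(side):
    """Token count implied by the listed values: one token per 64x64 pixel block."""
    return (side // 64) ** 2


def best_mode_within(budget):
    """Hypothetical helper: largest single-view mode whose token count fits the budget."""
    fitting = [
        (tokens, name)
        for name, (_, tokens) in SINGLE_VIEW_MODES.items()
        if tokens <= budget
    ]
    return max(fitting)[1] if fitting else None


# Every listed mode matches the (side / 64) ** 2 relationship.
for name, (side, tokens) in SINGLE_VIEW_MODES.items():
    assert tokens_for_side(side) == tokens
```

The `"gundam"` mode is left out on purpose, since its token count varies with the number of patches.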


## Example 1: Grounding Mode - Extract Text with Bounding Boxes

Use grounding mode to extract text along with bounding box coordinates:


In [None]:
# Grounding Mode - Extract text with bounding boxes
model.resolution_mode = "gundam"
model.operation = "grounding"

dataset.apply_model(model, label_field="text_detections")
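
FiftyOne stores a detection's bounding box as `[top-left-x, top-left-y, width, height]`, with every value relative to the image size in `[0, 1]`. The integration's own parsing is not shown in this notebook, so the following is only a sketch of the coordinate arithmetic involved. It assumes the model emits corner coordinates `[x1, y1, x2, y2]` on a 0 to 999 grid, as the upstream DeepSeek-OCR reference code appears to do; `corners_to_fiftyone` is a hypothetical helper.

```python
def corners_to_fiftyone(box, grid=999):
    """Convert [x1, y1, x2, y2] on a 0..grid scale to relative [x, y, width, height].

    The default grid of 999 is an assumption about the model's output scale.
    """
    x1, y1, x2, y2 = box
    return [x1 / grid, y1 / grid, (x2 - x1) / grid, (y2 - y1) / grid]


# A box covering the whole image maps to the full relative extent.
print(corners_to_fiftyone([0, 0, 999, 999]))
```

Because the result is relative to the image size, the same box stays valid whatever resolution mode produced it.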


## Example 2: Free OCR - Text Extraction Only

Extract text without bounding boxes:


In [None]:
# Free OCR
model.operation = "ocr"
dataset.apply_model(model, label_field="text_extraction")


## Example 3: Describe Mode - Document Description

Generate descriptions of document content:


In [None]:
# Describe mode
model.operation = "describe"
dataset.apply_model(model, label_field="doc_description")


## Example 4: Custom Prompt

Use a custom prompt to guide the model toward specific extraction tasks:


In [None]:
# Custom prompt
model.prompt = "<image>\n<|grounding|>Locate <|ref|>The secret<|/ref|> in the image."
dataset.apply_model(model, label_field="custom_detections")
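
The prompt above wraps the phrase to find in `<|ref|>` and `<|/ref|>` tags. When searching for several phrases, a small builder keeps the tag structure consistent. This is a sketch; `locate_prompt` is a hypothetical helper, and the template is copied from the cell above.

```python
def locate_prompt(phrase):
    """Build a grounding prompt that asks the model to locate `phrase` in the image."""
    return f"<image>\n<|grounding|>Locate <|ref|>{phrase}<|/ref|> in the image."


# Reproduces the prompt used in the cell above.
prompt = locate_prompt("The secret")
```

The result can be assigned to `model.prompt` before calling `dataset.apply_model`, exactly as in the cell above.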


## Visualize Results

Launch the FiftyOne App to visualize the results:


In [None]:
session = fo.launch_app(dataset)


## Optional: Install Caption Viewer Plugin

For better visualization of extracted text and captions, install the Caption Viewer plugin:


In [None]:
!fiftyone plugins download https://github.com/mythrandire/caption-viewer


## Additional Examples

### Using Different Resolution Modes

Try different resolution modes for different document types:


In [None]:
# Fast processing for simple documents
model.resolution_mode = "small"
model.operation = "ocr"
dataset.apply_model(model, label_field="fast_ocr")

# High quality for complex documents
model.resolution_mode = "large"
model.operation = "grounding"
dataset.apply_model(model, label_field="high_quality_detections")


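### Custom Prompts for Specific Tasks

Tailor the prompt to a specific task. The `<|grounding|>` tag appears in the prompts that request bounding boxes: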


In [None]:
# Extract tables with bounding boxes
model.prompt = "<image>\n<|grounding|>Extract all table content."
dataset.apply_model(model, label_field="table_detections")

# Extract headers and titles
model.prompt = "<image>\n<|grounding|>Find all headers and section titles."
dataset.apply_model(model, label_field="header_detections")

# Summarize document
model.prompt = "<image>\nSummarize the main points in bullet format."
dataset.apply_model(model, label_field="summary")


## Resources

- **GitHub Repository:** https://github.com/harpreetsahota204/deepseek_ocr
- **Official DeepSeek-OCR:** https://github.com/deepseek-ai/DeepSeek-OCR
- **Model Card:** https://huggingface.co/deepseek-ai/DeepSeek-OCR
- **FiftyOne Documentation:** https://docs.voxel51.com/
