[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/kosmos2_5/blob/main/kosmos2_5_example.ipynb)

# Kosmos-2.5 for FiftyOne: Complete Example

This notebook demonstrates how to use Microsoft's Kosmos-2.5 model with FiftyOne for document understanding and OCR tasks.

## What you'll learn:
1. How to set up and register the Kosmos-2.5 model
2. How to process PDFs with OCR and markdown extraction
3. How to work with text detection datasets
4. How to visualize results in FiftyOne
5. How to compare OCR and markdown outputs


## 1. Installation and Setup

First, let's install the required packages and set up our environment.


In [None]:
# Install required packages
%pip install -q fiftyone
%pip install -q transformers torch torchvision
%pip install -q huggingface-hub

# For PDF processing
!apt-get install -y poppler-utils > /dev/null 2>&1
%pip install -q pdf2image
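
Because the `apt-get` output is redirected to `/dev/null`, a failed Poppler install stays silent until PDF conversion breaks later. `pdf2image` relies on Poppler's command-line tools, so a quick check is possible. This is a minimal sketch, assuming that finding `pdftoppm` on the `PATH` is a good enough proxy for a working install; the helper name `poppler_available` is hypothetical.

```python
import shutil

def poppler_available():
    """Return True if Poppler's pdftoppm binary can be found on the PATH."""
    return shutil.which("pdftoppm") is not None

if poppler_available():
    print("Poppler found:", shutil.which("pdftoppm"))
else:
    # Re-running the apt-get line without the redirect shows the real error
    print("Poppler not found; re-run apt-get without '> /dev/null 2>&1'")
```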


In [None]:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.operators as foo
import fiftyone.utils.huggingface as fouh
import torch
import requests
import os
from PIL import Image

## 2. Register and Load Kosmos-2.5 Model

Now let's register the Kosmos-2.5 model source and load the model. The first load will download the model weights (~3.5GB).
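
Since the first load pulls several gigabytes, it can help to confirm there is room on disk before starting. The sketch below uses only the standard library. The helper name `has_free_space`, the 1 GB safety margin, and the choice of directory to check are all assumptions; point it at wherever your model cache actually lives.

```python
import shutil

GIB = 1024 ** 3

def has_free_space(path, required_gb, margin_gb=1.0):
    """Return True if the filesystem holding `path` fits the download plus a margin."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= (required_gb + margin_gb) * GIB

# 3.5 is the estimate quoted above; "." is only a stand-in for the
# directory where the weights end up being cached.
if not has_free_space(".", required_gb=3.5):
    print("Low disk space: the model download may fail")
```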


In [None]:
# Register the Kosmos-2.5 model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/kosmos2_5", 
    overwrite=True
)

print("Model source registered successfully!")


In [None]:
# Load the model (this will download on first run)
print("Loading Kosmos-2.5 model...")
model = foz.load_zoo_model("microsoft/kosmos-2.5")
print("Model loaded successfully!")
print("Available operations: ocr, md")


## 3. Example 1: Processing a Research Paper PDF

Let's download a research paper and process it with both OCR and markdown extraction modes.


In [None]:
# Download the PDF loader plugin
!fiftyone plugins download https://github.com/brimoor/pdf-loader

In [None]:
# Download a sample research paper (Kosmos-2.5 paper itself!)
url = "https://arxiv.org/pdf/2309.11419"
filename = "kosmos25_paper.pdf"

response = requests.get(url, timeout=60)  # fail instead of hanging on a stalled connection
if response.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"✅ Downloaded {filename}")
else:
    print(f"❌ Failed to download. Status code: {response.status_code}")
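
A 200 status code does not guarantee that the body is a PDF, since a server can return an HTML page with the same code. PDF files start with the bytes `%PDF-`, so a cheap check is to read the first five bytes. This is a minimal sketch; the helper name `looks_like_pdf` is hypothetical, and it only inspects the header rather than validating the whole file.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file begins with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Only runs the check if the download cell above has produced the file
pdf_path = Path("kosmos25_paper.pdf")
if pdf_path.exists():
    print("Looks like a PDF:", looks_like_pdf(pdf_path))
```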


In [None]:
# Create a dataset and load the PDF as images
# overwrite=True lets this cell be re-run without a name collision
pdf_dataset = fo.Dataset("kosmos25_paper_demo", overwrite=True)

# Load PDF pages as images
pdf_loader = foo.get_operator("@brimoor/pdf-loader/pdf_loader")

pdf_loader(
    pdf_dataset,
    input_path="./kosmos25_paper.pdf",
    output_dir="./pdf_images",
    dpi=200,  # Higher DPI for better OCR quality
    fmt="png",
    tags=["research_paper", "kosmos25"]
)


pdf_dataset.first()
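
The `dpi` value sets how many pixels each page is rendered to, which drives both OCR legibility and memory use. As a rough worked example, assuming US Letter pages (8.5 x 11 inches, which may not match this PDF), the rendered size can be estimated as follows. The helper name `page_pixels` is hypothetical.

```python
def page_pixels(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

# US Letter at the 200 DPI used above, and at a screen-like 72 DPI for contrast
print(page_pixels(8.5, 11, 200))  # (1700, 2200)
print(page_pixels(8.5, 11, 72))   # (612, 792)
```

Doubling the DPI quadruples the pixel count, so very high values slow down inference without necessarily improving recognition.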


In [None]:
# Apply OCR mode to detect text with bounding boxes
print("🔍 Running OCR detection...")
model.operation = "ocr"
pdf_dataset.apply_model(model, label_field="text_detections")
print("✅ OCR detection complete!")
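
FiftyOne stores detection boxes in relative coordinates, as `[top-left-x, top-left-y, width, height]` with every value in `[0, 1]`. Assuming the OCR results are saved as standard FiftyOne detections (the notebook does not state the label type), a sketch like the one below converts a box to pixel corners for cropping or drawing. The helper name `to_pixel_box` is hypothetical.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, w, h] box to absolute (x1, y1, x2, y2) pixels."""
    x, y, w, h = bounding_box
    return (
        round(x * image_width),
        round(y * image_height),
        round((x + w) * image_width),
        round((y + h) * image_height),
    )

# A box covering the central half of a 1700 x 2200 page image
print(to_pixel_box([0.25, 0.25, 0.5, 0.5], 1700, 2200))  # (425, 550, 1275, 1650)
```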


In [None]:
# Apply markdown mode to extract structured text
print("📝 Extracting markdown text...")
model.operation = "md"
pdf_dataset.apply_model(model, label_field="text_extraction")
print("✅ Markdown extraction complete!")
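
Both modes follow the same pattern: set `model.operation`, then call `apply_model` with a field name. If you run several modes often, the pattern can be wrapped in a small helper. This is only a sketch; `apply_operations` is a hypothetical name, and it uses nothing beyond the two calls already shown in this notebook.

```python
def apply_operations(dataset, model, field_by_operation):
    """Run the model once per operation, storing each result in its own field."""
    for operation, label_field in field_by_operation.items():
        model.operation = operation
        dataset.apply_model(model, label_field=label_field)

# Usage with the objects defined earlier in this notebook:
# apply_operations(
#     pdf_dataset,
#     model,
#     {"ocr": "text_detections", "md": "text_extraction"},
# )
```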


In [None]:
# Display extracted markdown from the first page
first_sample = pdf_dataset.first()
text = first_sample.text_extraction
if text:
    print("📄 Markdown from first page:")
    print("-" * 50)
    # Show at most the first 1000 characters
    preview = (text[:1000] + "...") if len(text) > 1000 else text
    print(preview)


In [None]:
# Visualize the results
session = fo.launch_app(pdf_dataset)


## 4. Example 2: Working with Text Detection Datasets

Now let's work with a text detection dataset from Hugging Face Hub.


In [None]:
# Load a text detection dataset from Hugging Face
print("📦 Loading Total-Text dataset from Hugging Face...")
text_dataset = fouh.load_from_hub(
    "Voxel51/Total-Text-Dataset", 
    max_samples=10  # Load only 10 samples for demo
)


In [None]:
# Apply OCR detection
print("🔍 Running OCR on text dataset...")
model.operation = "ocr"
text_dataset.apply_model(model, label_field="kosmos_detections")
print("✅ OCR complete!")


In [None]:
# Also extract markdown text
print("📝 Extracting markdown text...")
model.operation = "md"
text_dataset.apply_model(model, label_field="markdown_text")
print("✅ Markdown extraction complete!")


In [None]:
# Launch app to visualize and compare results
session = fo.launch_app(text_dataset)
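
Beyond eyeballing the two fields in the App, a rough numeric comparison of the OCR and markdown text is possible with the standard library. How to flatten the OCR detections into one string depends on how the model stores the recognized text, which this notebook does not show, so the sketch below takes two plain strings. The helper name `text_similarity` and the sample strings are illustrative only.

```python
import difflib

def text_similarity(a, b):
    """Return a 0-1 similarity ratio between two strings, ignoring case and spacing."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    # autojunk=False keeps long page texts from being scored with the junk heuristic
    matcher = difflib.SequenceMatcher(None, norm_a, norm_b, autojunk=False)
    return matcher.ratio()

# Made-up stand-ins for the text recovered by each mode
ocr_text = "Kosmos-2.5: A Multimodal Literate Model"
md_text = "# Kosmos-2.5: A Multimodal   Literate Model"
print(round(text_similarity(ocr_text, md_text), 3))
```

A low ratio does not by itself mean one mode is wrong, since markdown output adds formatting characters that plain OCR text lacks.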
