[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/minicpm-v/blob/main/minicpm_v_fiftyone_example.ipynb)


# MiniCPM-V Integration with FiftyOne - Example Notebook

This notebook demonstrates how to use MiniCPM-V 4.5, an 8B-parameter multimodal language model, as a remotely-sourced zoo model in FiftyOne.

## What You'll Learn

- How to register and load MiniCPM-V as a FiftyOne zoo model
- How to use all 6 supported operations:
  - Visual Question Answering (VQA)
  - Object Detection
  - Phrase Grounding
  - Image Classification
  - Keypoint Detection
  - OCR (Optical Character Recognition)

## 1. Setup and Installation

First, let's make sure we have all the necessary dependencies installed and import the required libraries.


In [None]:
# Install required packages if not already installed
# Uncomment the following lines if needed:
# !pip install fiftyone
# !pip install torch torchvision
# !pip install transformers
# !pip install huggingface-hub

import fiftyone as fo
import fiftyone.zoo as foz

## 2. Register and Load MiniCPM-V Model

Now let's register the MiniCPM-V model source and load the model. The model will be downloaded on first use (approximately 16GB).


In [None]:
# Register the MiniCPM-V model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/minicpm-v", 
    overwrite=True
)

print("✅ MiniCPM-V model source registered successfully!")


In [None]:
# Load the model (this will download the model on first use)
# Note: The download is approximately 16GB and may take some time
model = foz.load_zoo_model(
    "openbmb/MiniCPM-V-4_5",
    # install_requirements=True  # Uncomment if you're unsure about dependencies
)

print("✅ Model loaded successfully!")
print(f"Device: {model.device}")  # Will show cuda, mps, or cpu


## 3. Load Sample Dataset

Let's load a sample dataset from the FiftyOne zoo to demonstrate the model's capabilities. We'll use the quickstart dataset with a small number of samples.


In [None]:
# Load a sample dataset
dataset = foz.load_zoo_dataset(
    "quickstart", 
    max_samples=10,  # Using 10 samples for quick demonstration
    overwrite=True
)

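# Prepare per-sample object labels to use as detection prompts
labels_per_sample = dataset.values("ground_truth.detections.label")
# Samples without ground truth yield None, so fall back to an empty list;
# sorting keeps the label order deterministic across runs
unique_labels_per_sample = [sorted(set(labels or [])) for labels in labels_per_sample]
dataset.set_values("objects", unique_labels_per_sample)

print("Added 'objects' field with unique labels per sample")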
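The `objects` field holds one de-duplicated list of label strings per sample. The transformation can be sketched in plain Python, without FiftyOne, as below. The helper name `unique_labels` and the example labels are made up for illustration, and treating a missing value (`None`) as an empty list is an assumption about how samples without detections appear.

```python
def unique_labels(labels_per_sample):
    """Return one sorted, de-duplicated label list per sample.

    Hypothetical helper for illustration; a None entry (assumed to mean
    "no detections") becomes an empty list.
    """
    return [sorted(set(labels or [])) for labels in labels_per_sample]


# Made-up labels, not taken from the quickstart dataset
example = [["person", "dog", "person"], None, ["car", "car"]]
print(unique_labels(example))
# [['dog', 'person'], [], ['car']]
```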


## 4. Visual Question Answering (VQA)

Let's start with VQA to generate natural language descriptions of our images.


In [None]:
# Visual Question Answering - Generate descriptions
model.operation = "vqa"
model.prompt = "Describe this image in detail, including the main subjects, actions, and setting."

print("🔄 Generating image descriptions...")
dataset.apply_model(model, label_field="descriptions")

print("✅ Descriptions generated!")
print("\nSample descriptions:")
for i, sample in enumerate(dataset.head(3)):
    print(f"\nSample {i+1}: {sample.descriptions[:100]}...")  # Show first 100 chars


In [None]:
# You can also ask specific questions
model.prompt = "What is the main color scheme in this image?"

print("🔄 Analyzing color schemes...")
dataset.apply_model(model, label_field="color_analysis")

print("✅ Color analysis complete!")
print("\nSample color analyses:")
for i, sample in enumerate(dataset.head(3)):
    print(f"\nSample {i+1}: {sample.color_analysis}")


## 5. Object Detection

Now let's detect and localize objects in the images using bounding boxes.


In [None]:
# Object Detection with a predefined list of objects
model.operation = "detect"
model.prompt = ['person', 'car', 'dog', 'cat', 'bicycle', 'traffic light']

print("🔄 Detecting objects...")
dataset.apply_model(model, label_field="pred_detections")

print("✅ Object detection complete!")
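FiftyOne stores a detection's `bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates between 0 and 1. Assuming the predictions in `pred_detections` follow that standard format, a minimal sketch for converting one box to pixel corners looks like this. The helper `to_pixel_box` is hypothetical and not part of FiftyOne or the MiniCPM-V integration.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, w, h] box to pixel (x1, y1, x2, y2) corners.

    Hypothetical helper; assumes FiftyOne's relative box convention.
    """
    x, y, w, h = bounding_box
    x1 = round(x * image_width)
    y1 = round(y * image_height)
    x2 = round((x + w) * image_width)
    y2 = round((y + h) * image_height)
    return x1, y1, x2, y2


# A box covering the middle-right region of a 640x480 image
print(to_pixel_box([0.25, 0.5, 0.5, 0.25], 640, 480))
# (160, 240, 480, 360)
```

The image dimensions would have to come from the sample itself, for example from its metadata once that has been computed.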


In [None]:
# Object Detection using a prompt field from the dataset
# This uses the 'objects' field we created earlier
model.operation = "detect"

print("🔄 Detecting objects using prompt field...")
dataset.apply_model(model, label_field="pf_detections", prompt_field="objects")

print("✅ Prompt field detection complete!")
print("This detection used the unique objects from ground truth for each image")
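Because these prompts come from the ground truth labels, it is natural to check how closely the predicted boxes match the ground truth boxes. The usual measure is intersection over union (IoU). Below is a minimal sketch for two boxes in FiftyOne's relative `[x, y, width, height]` format; the helper `box_iou` is hypothetical and only illustrates the arithmetic.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes.

    Hypothetical helper; both boxes must use the same coordinate system.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clamped at zero when the boxes are disjoint
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two equally sized boxes offset diagonally by half their width
print(box_iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.5, 0.5]))  # about 0.143
```

For a full evaluation across the dataset, FiftyOne's built-in `evaluate_detections()` method is likely the better tool; the sketch only shows what an IoU threshold means for a single pair of boxes.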


## 6. Phrase Grounding

Phrase grounding locates specific regions described by natural language phrases.


In [None]:
# Phrase Grounding - Find regions based on descriptions
model.operation = "phrase_grounding"

print("🔄 Performing phrase grounding using descriptions...")
dataset.apply_model(model, label_field="pg_detections", prompt_field="descriptions")

print("✅ Phrase grounding complete!")
print("The model located regions based on the generated descriptions")


## 7. Image Classification

Classify images into predefined or open-ended categories.


In [None]:
# Classification with specific categories
model.operation = "classify"
model.prompt = "Classify this image into exactly one of the following: indoor, outdoor, people, animals, vehicles, food"

print("🔄 Classifying images...")
dataset.apply_model(model, label_field="scene_class")

print("✅ Classification complete!")


In [None]:
# Multi-label classification
model.prompt = "Identify all relevant attributes: daytime/nighttime, urban/rural, crowded/empty"

print("🔄 Performing multi-label classification...")
dataset.apply_model(model, label_field="attributes")

print("✅ Multi-label classification complete!")


## 8. Keypoint Detection

Identify key points of interest in images.


In [None]:
# Keypoint Detection
model.operation = "point"

print("🔄 Detecting keypoints...")
dataset.apply_model(model, label_field="keypoints", prompt_field="objects")

print("✅ Keypoint detection complete!")


## 9. OCR (Optical Character Recognition)

Extract text from images while preserving formatting.


In [None]:
# OCR - Extract text from images
model.operation = "ocr"
model.prompt = "Extract all visible text from this image"

print("🔄 Extracting text from images...")
dataset.apply_model(model, label_field="extracted_text")

print("✅ OCR complete!")

## 10. Custom System Prompts

You can customize the system prompt for any operation to specialize the model's behavior.


In [None]:
# Example: Custom system prompt for specialized analysis
model.operation = "vqa"
model.system_prompt = """You are a photography expert. Analyze images from a technical perspective, 
commenting on composition, lighting, color balance, and artistic elements. 
Keep responses concise and professional."""

model.prompt = "Analyze the photographic qualities of this image"

print("🔄 Performing technical photography analysis...")
dataset.apply_model(model, label_field="photo_analysis")

print("✅ Photography analysis complete!")

# Show a sample analysis
sample = dataset.first()
if sample.photo_analysis:
    print(f"\nSample photography analysis:\n{sample.photo_analysis}")

# Reset to default system prompt
model.system_prompt = None  # This will use the default for the operation

## 11. Visualizing Results in FiftyOne App

The FiftyOne App provides powerful visualization capabilities for all the predictions we've generated.


In [None]:
# Launch the App to browse all the new fields; keep the session handle
# so the view can be updated later
session = fo.launch_app(dataset)


print("\n🎯 Tips for using the FiftyOne App:")
print("  1. Click on samples to see detailed predictions")
print("  2. Use the sidebar to toggle different label fields on/off")
print("  3. Filter samples based on predictions using the filter bar")
print("  4. Compare ground truth with predictions side by side")
print("  5. Use the color scheme options to differentiate label types")


## Important License Note

**MiniCPM-V Model License**: The model weights are subject to the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM-V/blob/main/MiniCPM%20Model%20License.md).

**Commercial Use Restrictions:**
- Free use allowed for edge devices ≤5,000 units or apps with <1M daily active users (registration required)
- Other commercial use requires explicit authorization from OpenBMB
- Cannot use outputs to enhance other models
- See the full license for complete terms and restrictions
