# Using Moondream2 as a Remotely Sourced Zoo Model

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz

# Load a dataset
dataset = foz.load_zoo_dataset("quickstart", overwrite=True)
dataset = dataset.take(3)

Overwriting existing directory '/home/harpreet/fiftyone/quickstart'
Downloading dataset to '/home/harpreet/fiftyone/quickstart'
Downloading dataset...
 100% |████|  187.5Mb/187.5Mb [451.4ms elapsed, 0s remaining, 415.5Mb/s]      
Extracting dataset...
Parsing dataset metadata
Found 200 samples
Dataset info written to '/home/harpreet/fiftyone/quickstart/info.json'
Loading existing dataset 'quickstart'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


# Set Up the Zoo Model

In [None]:
# import fiftyone as fo
# import fiftyone.zoo as foz
# foz.register_zoo_model_source("https://github.com/harpreetsahota204/moondream2", overwrite=True)

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/moondream2",
    model_name="vikhyatk/moondream2"
)

In [3]:
import fiftyone as fo
import fiftyone.zoo as foz
model = foz.load_zoo_model(
    "vikhyatk/moondream2",
    revision="2025-03-27",
)

Fetching 30 files:   0%|          | 0/30 [00:00<?, ?it/s]

Loading moondream2 model from /home/harpreet/fiftyone/__models__/moondream2/moondream2


# Use Moondream2 for Captions

Captioning requires nothing beyond selecting the `caption` operation and a caption `length`.

Supported `length` values:

* `short`

* `normal`

* `long`

In [4]:
model.set_operation(operation="caption", length="long")

dataset.apply_model(
    model,
    label_field="captions",
)

 100% |█████████████████████| 3/3 [7.4s elapsed, 0s remaining, 0.4 samples/s]   


In [5]:
dataset.first()['captions']

' The image shows a woman standing outdoors, holding a black umbrella in her right hand and smiling at the camera. She is wearing a beige trench coat, blue jeans, and flat shoes, and she has a black handbag slung over her left shoulder. The woman appears to be posing for a photograph, standing confidently on a wooden deck or platform made of planks. Behind her, there is a unique architectural structure composed of wooden beams and panels, creating an artistic and visually appealing setting. The structure has a translucent, semi-transparent glass covering, allowing natural light to filter through and creating an ethereal atmosphere.\n\nIn the background, there is a serene park-like setting with lush green grass, trees, and a calm body of water visible behind the structure. The water appears still and reflective, adding to the peaceful ambiance. Further in the distance, there are buildings and structures, indicating an urban or suburban environment. The overall color palette of the image
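
To compare the three supported lengths side by side, the same two calls can be repeated once per length, writing each result to its own field. The following is a minimal sketch under stated assumptions: the helper name `caption_at_all_lengths` and the `captions_<length>` field names are hypothetical, and the only model and dataset calls used are `set_operation` and `apply_model` as they appear in this notebook.

```python
# Caption lengths listed in this notebook
CAPTION_LENGTHS = ("short", "normal", "long")


def caption_at_all_lengths(model, dataset, prefix="captions"):
    """Apply the caption operation once per length.

    Hypothetical helper: each length is written to its own field,
    e.g. "captions_short", and the list of field names is returned.
    """
    fields = []
    for length in CAPTION_LENGTHS:
        # Same calls as in the cell above, with a different length each time
        model.set_operation(operation="caption", length=length)
        field = f"{prefix}_{length}"
        dataset.apply_model(model, label_field=field)
        fields.append(field)
    return fields
```

Calling `caption_at_all_lengths(model, dataset)` with the objects loaded earlier would run the model three times per sample, so on larger datasets it may be worth restricting it to a small view first.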

# Use Moondream2 for Detection


The `detect` operation takes an `object_type` naming what to look for. The results are stored as `Detections` objects containing bounding boxes and labels:

In [8]:
model.set_operation(
    operation="detect",
    object_type="people",
)

dataset.apply_model(
    model,
    label_field="detections",
)

 100% |█████████████████████| 3/3 [779.1ms elapsed, 0s remaining, 3.9 samples/s]      


In [9]:
dataset.first()['detections']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e84283c0b17942fc9c06c4',
            'attributes': {},
            'tags': [],
            'label': 'people',
            'bounding_box': [
                0.36569203436374664,
                0.3885642886161804,
                0.2686159312725067,
                0.6056839227676392,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>
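
FiftyOne stores each `bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates between 0 and 1, so the values above are fractions of the image size rather than pixels. The sketch below converts such a box to pixel units; the helper name `to_pixel_box` and the 640x480 image size are assumptions made for illustration only.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, w, h] box to integer pixel units.

    Hypothetical helper; the box layout follows FiftyOne's convention of
    top-left corner plus width and height, all in the range [0, 1].
    """
    x, y, w, h = bounding_box
    return (
        round(x * image_width),
        round(y * image_height),
        round(w * image_width),
        round(h * image_height),
    )


# Illustrative box on an assumed 640x480 image
print(to_pixel_box([0.25, 0.5, 0.5, 0.25], 640, 480))  # (160, 240, 320, 120)
```

With a real dataset, the image size would typically come from `sample.metadata.width` and `sample.metadata.height` after running `dataset.compute_metadata()`.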

# Use Moondream2 for Keypoints


In [10]:
model.set_operation(
    operation="point",
    object_type="people",
)

# Apply the model again, now using the point operation
dataset.apply_model(
    model,
    label_field="pointings",
)

 100% |█████████████████████| 3/3 [684.1ms elapsed, 0s remaining, 4.4 samples/s]      


In [11]:
dataset.first()['pointings']

<Keypoints: {
    'keypoints': [
        <Keypoint: {
            'id': '67e842c6c0b17942fc9c06d1',
            'attributes': {},
            'tags': [],
            'label': 'people',
            'points': [[0.4873046875, 0.5712890625]],
            'confidence': None,
            'index': None,
        }>,
    ],
}>
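
Keypoint `points` use the same relative `[x, y]` coordinates as bounding boxes, which makes it easy to cross-check the two operations. As a minimal sketch, the hypothetical helper `point_in_box` below tests whether a point falls inside a relative `[x, y, width, height]` box; the values are copied from the outputs shown in this notebook.

```python
def point_in_box(point, bounding_box):
    """Return True if a relative [x, y] point lies inside a relative
    [top-left-x, top-left-y, width, height] box (edges inclusive).

    Hypothetical helper for illustration; not part of FiftyOne.
    """
    px, py = point
    x, y, w, h = bounding_box
    return x <= px <= x + w and y <= py <= y + h


# Values copied from the 'pointings' and 'detections' outputs in this notebook
point = [0.4873046875, 0.5712890625]
box = [
    0.36569203436374664,
    0.3885642886161804,
    0.2686159312725067,
    0.6056839227676392,
]
print(point_in_box(point, box))  # True
```

Here the point returned for `people` lands inside the box detected for the same object type on the first sample.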

# Use Moondream2 for VQA


In [12]:
model.set_operation(
    operation="query",
    query_text="What is in the background of the image?",
)

dataset.apply_model(
    model,
    label_field="query_text_response",
)

 100% |█████████████████████| 3/3 [1.0s elapsed, 0s remaining, 2.9 samples/s]         


In [13]:
dataset.first()['query_text_response']

' In the background of the image, there is a body of water, possibly a lake or a pond, and a building.'

To ask a question that is stored on each sample rather than one fixed prompt, write the questions to a sample field and pass that field's name as `query_field`:

In [15]:
# dataset.add_sample_field("questions")

dataset.set_values("questions", ["Where is the general location of this scene?"]*len(dataset))

In [16]:
dataset.first()['questions']

'Where is the general location of this scene?'
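
The cell above stores the same question on every sample, but `set_values` accepts any list with one entry per sample, in dataset order, so each sample can carry its own question. The sketch below builds such a list in plain Python; the helper name `questions_for`, the file paths, and the override question are hypothetical and only illustrate the shape of the list.

```python
def questions_for(filepaths, default_question, overrides=None):
    """Return one question per sample, in the same order as `filepaths`.

    Hypothetical helper: `overrides` maps a file path to a question that
    replaces the default for that sample.
    """
    overrides = overrides or {}
    return [overrides.get(path, default_question) for path in filepaths]


# Hypothetical paths standing in for dataset.values("filepath")
filepaths = ["/data/a.jpg", "/data/b.jpg", "/data/c.jpg"]
questions = questions_for(
    filepaths,
    default_question="Where is the general location of this scene?",
    overrides={"/data/b.jpg": "What is the weather like in this scene?"},
)

# With a real dataset, the list would then be stored with:
# dataset.set_values("questions", questions)
```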

In [19]:
model.set_operation(
    operation="query",
    query_field="questions",
)

dataset.apply_model(
    model,
    label_field="query_field_response",
    query_field="questions",
)

 100% |█████████████████████| 3/3 [1.0s elapsed, 0s remaining, 2.9 samples/s]         


In [20]:
dataset.first()['query_field_response']

' The general location of this scene is a park, where the woman is standing under a large umbrella.'