# MiMo-VL Tutorial: Multimodal Analysis with FiftyOne

This tutorial demonstrates how to use the MiMo-VL vision-language models with FiftyOne for various visual understanding tasks.

## 1. Load a Sample Dataset

First, let's load a small sample of the WaveUI screenshot dataset from the Hugging Face Hub.

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load 10 shuffled samples from the WaveUI dataset
dataset = load_from_hub(
    "Voxel51/WaveUI-25k",
    max_samples=10,
    shuffle=True
)

Optionally, launch the FiftyOne App to visualize the dataset:

In [None]:
fo.launch_app(dataset)

## 2. Set Up MiMo-VL Integration

Register the MiMo-VL remote zoo model source and load the model.

In [None]:
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source("https://github.com/harpreetsahota204/MiMo_VL", overwrite=True)

Load the MiMo-VL-7B-SFT model. You can also use `XiaomiMiMo/MiMo-VL-7B-RL` or `XiaomiMiMo/MiMo-VL-7B-SFT-GGUF`.

In [None]:
model = foz.load_zoo_model(
    "XiaomiMiMo/MiMo-VL-7B-SFT",
    # install_requirements=True,  # uncomment to install any missing requirements
)

## 3. Visual Question Answering

Ask the model to describe UI screenshots.

In [None]:
model.operation = "vqa"
model.prompt = "Describe this screenshot and what the user might be doing in it."
dataset.apply_model(model, label_field="vqa_results")

For any of the following operations, you can use an existing field on your dataset as the prompt: pass the field's name as `prompt_field` when you call `apply_model`. For example:

```python
dataset.apply_model(model, prompt_field="<field-name>", label_field="<label-field>")
```

## 4. Object Detection

Detect interactive UI elements with bounding boxes.

In [None]:
model.operation = "detect"
model.prompt = "Locate the elements of this UI that a user can interact with."
dataset.apply_model(model, label_field="ui_detections")
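
To use the detections outside the App, it helps to know that FiftyOne stores each `Detection.bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates in `[0, 1]`. The following is a minimal sketch of converting such a box to pixel corners. The helper name `relative_to_pixel_box` and the 800x600 image size are illustrative assumptions, not part of the MiMo-VL integration.

```python
def relative_to_pixel_box(bbox, img_width, img_height):
    """Convert a FiftyOne-style relative box to pixel corner coordinates.

    bbox is [top-left-x, top-left-y, width, height], each value in [0, 1].
    Returns (x1, y1, x2, y2) as integers.
    """
    x, y, w, h = bbox
    x1 = round(x * img_width)
    y1 = round(y * img_height)
    x2 = round((x + w) * img_width)
    y2 = round((y + h) * img_height)
    return x1, y1, x2, y2


# Hypothetical box on a hypothetical 800x600 screenshot
print(relative_to_pixel_box([0.25, 0.5, 0.5, 0.25], 800, 600))
```

With real data, and assuming the field holds a `fo.Detections` label, the boxes would come from `sample.ui_detections.detections`, and the image size from `sample.metadata` after calling `dataset.compute_metadata()`.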


## 5. Optical Character Recognition (OCR)

Extract and locate text in the UI.

In [None]:
model.operation = "ocr"
model.prompt = "OCR all the text in the user interface."
dataset.apply_model(model, label_field="ocr_results")


## 6. Keypoint Detection

Identify important points in the UI.

In [None]:
model.operation = "point"
model.prompt = "Point to all the clickable and interactable elements in the user interface."
dataset.apply_model(model, label_field="ui_keypoints")
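
Sections 3 through 6 all follow the same three steps: set `model.operation`, set `model.prompt`, and call `apply_model`. As a sketch, that pattern can be driven from a small table. The names `TASKS` and `run_tasks` are hypothetical helpers introduced here for illustration; the operations, prompts, and field names are the ones used above.

```python
# (operation, prompt, label_field) for each task shown so far
TASKS = [
    ("vqa", "Describe this screenshot and what the user might be doing in it.",
     "vqa_results"),
    ("detect", "Locate the elements of this UI that a user can interact with.",
     "ui_detections"),
    ("ocr", "OCR all the text in the user interface.", "ocr_results"),
    ("point",
     "Point to all the clickable and interactable elements in the user interface.",
     "ui_keypoints"),
]


def run_tasks(model, dataset, tasks=TASKS):
    """Apply the model once per task and return the label fields written."""
    written = []
    for operation, prompt, label_field in tasks:
        # Reconfigure the same model object before each pass
        model.operation = operation
        model.prompt = prompt
        dataset.apply_model(model, label_field=label_field)
        written.append(label_field)
    return written
```

Calling `run_tasks(model, dataset)` with the objects created earlier should be equivalent to running those four cells in order.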

## 7. Classification

Classify the type of UI platform.

In [None]:
model.operation = "classify"
model.prompt = "Classify the type of platform. Choose from one of: desktop, mobile, web"
dataset.apply_model(model, label_field="ui_classifications")
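
Because the class names are supplied in a free-text prompt, the returned labels may not always match them exactly (for example, different capitalization or a full sentence). That is an assumption about model behavior, not something this tutorial verifies. A minimal sketch of a post-processing step, using the hypothetical helper `normalize_platform`:

```python
ALLOWED_PLATFORMS = ("desktop", "mobile", "web")


def normalize_platform(raw_label, allowed=ALLOWED_PLATFORMS, default="unknown"):
    """Map a free-text label onto one of the allowed class names."""
    text = raw_label.strip().lower()
    # Prefer an exact match, then fall back to the first allowed name mentioned
    if text in allowed:
        return text
    for name in allowed:
        if name in text:
            return name
    return default


print(normalize_platform("Mobile"))
print(normalize_platform("This looks like a web page."))
print(normalize_platform("tablet"))
```

Substring matching is deliberately simple; a label that mentions more than one class resolves to whichever appears first in `allowed`.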


## 8. Using Dataset Fields as Prompts

You can use existing fields in your dataset as prompts.
In this example, we assume there's a "purpose" field that contains instructions.

In [None]:
# If your dataset has a field called "purpose" with instructions
model.operation = "agentic"
dataset.apply_model(model, prompt_field="purpose", label_field="agentic_output")
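
If your dataset has no suitable field yet, one can be populated before calling `apply_model`. The sketch below uses FiftyOne's `set_values()`, which writes one value per sample in collection order. The helper name `add_prompt_field` is an assumption made for illustration.

```python
def add_prompt_field(dataset, field_name, prompts):
    """Store one prompt string per sample in `field_name`.

    `prompts` must contain exactly one entry per sample, in dataset order.
    """
    prompts = list(prompts)
    if len(prompts) != len(dataset):
        raise ValueError(
            f"Expected {len(dataset)} prompts, got {len(prompts)}"
        )
    # set_values assigns the i-th value to the i-th sample
    dataset.set_values(field_name, prompts)
    return field_name
```

For example, `add_prompt_field(dataset, "purpose", ["Open the settings menu"] * len(dataset))` would give every sample the same (made-up) instruction.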

## 9. View Results

Examine the results for the first sample.

In [None]:
sample = dataset.first()
print(f"VQA Result: {sample.vqa_results}")
print(f"Detections: {sample.ui_detections}")
# You can view all results in the FiftyOne App with: fo.launch_app(dataset)
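
If you skipped some of the earlier sections, attribute access such as `sample.ocr_results` can fail because the field was never created. As a sketch, the hypothetical helper `collect_results` below gathers whichever fields exist, using the sample's `has_field()` method to check first.

```python
def collect_results(sample, field_names):
    """Return {field_name: value} for the requested fields.

    Fields that do not exist on the sample are mapped to None
    instead of raising an error.
    """
    results = {}
    for name in field_names:
        results[name] = sample[name] if sample.has_field(name) else None
    return results
```

For example, `collect_results(dataset.first(), ["vqa_results", "ui_detections", "ocr_results"])` returns a dictionary you can print or inspect field by field.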

In [None]:
# Visualize all results in the FiftyOne App
fo.launch_app(dataset)