# Qwen3-VL-Embedding in FiftyOne

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/qwen3vl_embeddings/blob/main/qwen3vl_embeddings_in_fiftyone.ipynb)

This notebook demonstrates how to use [Qwen3-VL-Embedding](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) with [FiftyOne](https://docs.voxel51.com/) for multimodal embeddings, text-to-media similarity search, and zero-shot classification.

Qwen3-VL-Embedding maps text, images, and video into a unified representation space, so items from different modalities can be compared directly. This is what enables cross-modal retrieval, such as searching videos with a text query.
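To make the idea of a shared representation space concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. The vectors are made-up 3-dimensional toy values, not real Qwen3-VL-Embedding outputs, and `cosine_similarity` and `rank_media` are illustrative helper names rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_media(query_vec, media_vecs):
    """Return media names sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in media_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (assumed values, for illustration only)
text_query = [0.9, 0.1, 0.0]  # stands in for the text "a dog running"
media = {
    "dog_video.mp4": [0.8, 0.2, 0.1],
    "cat_image.jpg": [0.1, 0.9, 0.2],
    "car_video.mp4": [0.0, 0.1, 0.9],
}

ranking = rank_media(text_query, media)
print(ranking)
```

Because the text and media vectors live in the same space, a single similarity function is enough to rank videos and images against a text query.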

## Setup

Install the required dependencies:

In [None]:
!pip install -q fiftyone umap-learn decord qwen-vl-utils transformers torch torchvision

Note: installing FlashAttention 2 (the `flash-attn` package) is recommended for faster inference.
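If you are unsure whether FlashAttention 2 is present in your environment, a quick check along these lines may help. This is only a sketch: it tests whether the `flash_attn` package is importable, and `flash_attention_available` is a hypothetical helper name. Whether the model wrapper uses the package automatically is not stated in this notebook.

```python
import importlib.util

def flash_attention_available():
    """Return True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

has_flash_attn = flash_attention_available()
print("flash_attn installed:", has_flash_attn)
```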

## Load the Model

Register the remote model source and load the Qwen3-VL-Embedding model. The 2B parameter variant offers a good balance of quality and speed; an 8B variant is also available for higher quality embeddings.

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

# Register the model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_embeddings",
    overwrite=True
)

In [None]:
# Load the Qwen3-VL-Embedding model (2B variant)
model = foz.load_zoo_model("Qwen/Qwen3-VL-Embedding-2B")

## Video Dataset

Qwen3-VL-Embedding can generate embeddings for video content by sampling frames at a configurable FPS. Set `media_type="video"` to process video datasets.
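As a rough illustration of what sampling frames at a target FPS means, the sketch below computes which frame indices would be kept when a clip is resampled. It is not the model's actual sampling code: `sampled_frame_indices` and its parameters are hypothetical, and the real implementation may pick frames differently.

```python
import math

def sampled_frame_indices(total_frames, native_fps, target_fps):
    """Indices of frames kept when resampling a video to `target_fps`."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample more densely than the source video provides
    step = max(native_fps / target_fps, 1.0)
    n_samples = math.ceil(total_frames / step)
    # Clamp to the last valid index as a guard against float rounding
    return [min(int(i * step), total_frames - 1) for i in range(n_samples)]

# A 3-second clip at 30 FPS, sampled at 1 FPS
print(sampled_frame_indices(90, 30.0, 1.0))  # [0, 30, 60]
```

A lower target FPS means fewer frames per video, which trades temporal detail for speed and memory.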

We'll load a sample video dataset from the Hugging Face Hub and compute embeddings for each video.

In [None]:
model.media_type = "video"

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "harpreetsahota/random_short_videos",
    name="random_short_videos",
    overwrite=True,
)

In [None]:
dataset.compute_embeddings(
    model,
    embeddings_field="qwen_embeddings",
    skip_failures=False,
    batch_size=32,
    num_workers=8,
)

If you would rather skip inference and go straight to the results, you can download the following dataset instead. You still need to register and load the zoo model as shown above:

In [None]:
from fiftyone.utils.huggingface import load_from_hub

# Assign the result so the cells below can keep using `dataset`
dataset = load_from_hub("harpreetsahota/testing_qwen3vl_embeddings")

### Text-to-Video Similarity Search

Build a similarity index to enable natural language search over your video dataset. Once indexed, you can find videos matching text queries like "a person cooking in a kitchen".

In [None]:
import fiftyone.brain as fob

# Build similarity index
sim = fob.compute_similarity(
    dataset,
    model="Qwen/Qwen3-VL-Embedding-2B",
    brain_key="qwen_video_sim",
    embeddings="qwen_embeddings"
)
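Once the index exists, a text query can be run with FiftyOne's `sort_by_similarity()`. The helper below is a small sketch: `search_by_text` is a hypothetical wrapper name, and it assumes the index stored under `brain_key` supports text prompts.

```python
def search_by_text(dataset, query, k=5, brain_key="qwen_video_sim"):
    """Return a view with the `k` samples most similar to a text query.

    Assumes `dataset` is a FiftyOne dataset or view with a similarity
    index stored under `brain_key` that supports text prompts.
    """
    return dataset.sort_by_similarity(query, k=k, brain_key=brain_key)

# Example (run after the similarity index has been built):
# view = search_by_text(dataset, "a person cooking in a kitchen")
# print(view.values("filepath"))
```

The returned view can be passed to `fo.launch_app()` or inspected directly in the notebook.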


### Embedding Visualization

Use UMAP to project the high-dimensional embeddings into 2D for visualization. This helps you explore the semantic structure of your dataset in the FiftyOne App.

In [None]:
# Compute UMAP visualization
results = fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="qwen_video_viz",
    embeddings="qwen_embeddings",
    num_dims=2
)

print("UMAP visualization computed!")

### Zero-Shot Classification

Classify videos using text prompts without any training. Define a list of classes and an optional text prompt prefix, then apply the model to generate predictions based on embedding similarity.
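The general idea behind embedding-based zero-shot classification can be sketched as follows: prepend the prompt prefix to each class name, embed the resulting strings, and pick the class whose embedding is closest to the media embedding. The code uses toy 2-dimensional vectors in place of real embeddings, and `build_prompts`, `softmax`, and `classify` are illustrative names. The model's actual scoring may differ, for example in how similarities are scaled.

```python
import math

def build_prompts(classes, text_prompt=""):
    """Prefix each class name with the text prompt."""
    return [f"{text_prompt}{name}" for name in classes]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(media_vec, prompt_vecs, classes):
    """Pick the class whose prompt embedding best matches the media embedding.

    Uses the dot product, which equals cosine similarity when all
    vectors are unit-normalized (assumed here).
    """
    sims = [sum(m * p for m, p in zip(media_vec, vec)) for vec in prompt_vecs]
    probs = softmax(sims)
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs[best]

classes = ["cartoon", "person sleeping"]
prompts = build_prompts(classes, "A video of ")
print(prompts)

# Toy unit vectors standing in for the media and prompt embeddings
label, confidence = classify([1.0, 0.0], [[0.8, 0.6], [0.0, 1.0]], classes)
print(label, round(confidence, 2))
```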

In [None]:
# Configure model for classification
model.classes = [
    "children",
    "babies",
    "people exercising",
    "bottle opening",
    "pets or animals",
    "cartoon",
    "a door opening",
    "person sleeping",
    "undetermined activity",
]

model.text_prompt = "A video of "

# Apply zero-shot classification
dataset.apply_model(model, label_field="zero_shot_classification")
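To get a quick summary of the predictions without opening the App, you can count the predicted labels with FiftyOne's `count_values()`. This sketch assumes the predictions are stored as classification labels with a `label` attribute, which the notebook does not state explicitly. `label_distribution` is a hypothetical helper name.

```python
def label_distribution(dataset, label_field="zero_shot_classification"):
    """Count how many samples were assigned to each predicted class.

    Assumes `label_field` holds labels with a `label` attribute.
    """
    return dataset.count_values(f"{label_field}.label")

# Example (run after apply_model has finished):
# print(label_distribution(dataset))
```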

### Explore in FiftyOne App

Launch the FiftyOne App to explore your dataset, view the UMAP visualization, run similarity searches, and inspect the zero-shot classification results.

In [None]:
session = fo.launch_app(dataset, auto=False)
session.url

## Image Dataset

The same model can be used for image datasets by switching `media_type` to `"image"`. This allows you to reuse a single loaded model for both video and image workflows without reloading weights.

Load the FiftyOne quickstart dataset, which contains 200 images with various object detection annotations.

In [None]:
import fiftyone.zoo as foz

image_dataset = foz.load_zoo_dataset("quickstart")

In [None]:
model.media_type = "image"

In [None]:
image_dataset.compute_embeddings(
    model,
    embeddings_field="qwen_embeddings",
    skip_failures=False,
    batch_size=32,
    num_workers=8,
)

### Text-to-Image Similarity Search

Build a similarity index for the image dataset using the same model. You can now search for images using natural language queries.

In [None]:
import fiftyone.brain as fob

# Build similarity index
sim = fob.compute_similarity(
    image_dataset,
    model="Qwen/Qwen3-VL-Embedding-2B",
    brain_key="qwen_img_sim",
    embeddings="qwen_embeddings"
)
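For image search it can be useful to see how close each result is to the query. The sketch below uses the `dist_field` parameter of `sort_by_similarity()` to store each sample's distance to the query, then pairs it with the file path. The function name `search_with_distances` and the field name `qwen_text_dist` are hypothetical choices for this example.

```python
def search_with_distances(
    dataset,
    query,
    k=10,
    brain_key="qwen_img_sim",
    dist_field="qwen_text_dist",
):
    """Return (filepath, distance) pairs for the `k` best matches to a text query.

    Assumes the index under `brain_key` supports text prompts.
    """
    view = dataset.sort_by_similarity(
        query, k=k, brain_key=brain_key, dist_field=dist_field
    )
    # `values()` returns one entry per sample, in the view's sorted order
    return list(zip(view.values("filepath"), view.values(dist_field)))

# Example (run after the image similarity index has been built):
# for path, dist in search_with_distances(image_dataset, "a dog on a beach", k=5):
#     print(path, dist)
```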


### Embedding Visualization

Visualize the image embeddings with UMAP to explore semantic clusters in your image dataset.

In [None]:
# Compute UMAP visualization
results = fob.compute_visualization(
    image_dataset,
    method="umap",
    brain_key="qwen_img_viz",
    embeddings="qwen_embeddings",
    num_dims=2
)