# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/fiftyone_video_workshop/blob/main/workshop.ipynb)

# Understanding Video Data at Scale
### A Hands-On Workshop with Action100M and FiftyOne

---

Video is a hard modality to work with. 

You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This notebook tackles a practical question: **given a large video dataset, how do you understand what's in it without manually watching thousands of clips?**

We're working with a subset of **[Action100M preview](https://www.arxiv.org/abs/2601.10592)**. In this subset there are 1,144 YouTube videos, each clipped to 90 seconds, annotated with a hierarchical *Tree-of-Captions* structure produced by a fully automated AI pipeline: V-JEPA 2 for segmentation, PerceptionLM-3B and Llama-3.2-Vision-11B for captioning, and GPT-OSS-120B with multi-round Self-Refine for structured annotation extraction.

Every label in this dataset was written by a model. None of it was seen by a human annotator.

As AI-generated datasets become the norm, **the skill of interrogating machine-generated annotations is increasingly important**. This notebook shows you how to do that systematically.

---

### What We'll Build

| Section | Question | Tools |
|---|---|---|
| 1. What We Were Given | What does this dataset claim to contain? | FiftyOne App |
| 2. Three Lenses | What does the raw data actually look like? | Qwen3-VL-Embedding, Molmo2, Sentence Transformers |
| 3. The Second Opinion | Does a second model agree with the first? | Qwen3-VL |
| 4. Measuring Agreement | How much do they agree, per sample? | Text Evaluation Plugin |
| 5. Grounding the Hard Cases | Where they disagree, who's right? | Molmo2 |
| 6. The Payoff | What can I now do with this? | FiftyOne App |

By the end, you'll have a **confidence map** of the dataset's annotations and a reusable workflow for understanding any video dataset with AI-generated labels.

In [None]:
import fiftyone as fo

fo.config.requirement_error_level=2

---
## Section 1: What We Were Given

Before running any models, let's understand the shape of what we have. 

The Action100M preview comes with rich pre-existing annotations from the Tree-of-Captions pipeline. Each video has temporal segments annotated at multiple levels of granularity:

- **Level 0 (root/tier):** The full video — one annotation covering the entire 90-second clip
- **Mid levels:** Sub-segments — multi-second chunks describing coherent activities
- **Leaf level:** The finest grain — individual moments

At each level, there are five annotation fields:
- `gpt_summary_brief` — one-sentence clip caption
- `gpt_summary_detailed` — full play-by-play description
- `gpt_action_brief` — short verb phrase ("spread almonds on tray")
- `gpt_action_detailed` — instruction-manual version of the action
- `gpt_action_actor` — who's performing it

All of these were generated by GPT-OSS-120B with three rounds of Self-Refine. We're going to take these at face value for now — and then build the tools to interrogate them.

You can either download the dataset from the [Voxel51 Hugging Face org](https://huggingface.co/datasets/Voxel51/action100m_tiny_subset), or if you face rate limit issues, you can follow the instructions in the `download_scripts` directory of this repository to download the dataset and parse it into FiftyOne format.

If you are downloading from the Hugging Face Hub, run:

```python
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/action100m_tiny_subset",
    dataset_name="action100m",
    overwrite=True,
    persistent=True,
)
```

If you've downloaded the dataset and parsed manually, then run:

In [1]:
import fiftyone as fo

dataset = fo.load_dataset("action100m")



A **[FiftyOne dataset](https://docs.voxel51.com/user_guide/using_datasets.html)** is the core data structure you work with in [FiftyOne](https://docs.voxel51.com/user_guide/basics.html). It:

- Logically represents your visual data (images, videos, point clouds, etc.) along with all associated information: labels, metadata, predictions, and other fields.

- Is stored in a lightweight, non-relational database (MongoDB) so it can scale to large datasets without loading all media into RAM. 

- Can be created from many sources: directories of files, common dataset formats (like COCO), or the built‑in Dataset Zoo. 

Conceptually, you can [think of a dataset like a table in pandas](https://github.com/voxel51/fiftyone/blob/develop/docs/source/tutorials/pandas_comparison.ipynb): it has **rows and columns**, but specialized for computer vision:

- In pandas: **row = record**, **column = feature**
- In FiftyOne: **sample = record**, **field = feature** 

You can inspect a dataset's summary and schema by evaluating it in a notebook cell (or with `print(dataset)`):

In [None]:
dataset

### What is a sample?

A [**sample**](https://docs.voxel51.com/user_guide/basics.html#samples) is the atomic element of a FiftyOne dataset. It is:

- One “item” of your data (for example, a single image or a single video) plus everything you know about it. 
- A flexible container of [**fields**](https://docs.voxel51.com/user_guide/basics.html#fields), which can include:
  - `filepath` to the media
  - `metadata` (dimensions, duration, etc.)
  - Ground truth labels (detections, classifications, segmentations, etc.)
  - Model predictions
  - Tags, scalar values, strings, arrays, and more 

So, **a dataset is a collection of samples**, and **each sample is everything you care about for one media file**, stored in a structured way that’s easy to query, visualize, and modify.

You can see what a sample looks like by inspecting the [first](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.first) sample:

In [None]:
dataset.first()

Once you have your dataset, you can [launch the app](https://docs.voxel51.com/user_guide/app.html#using-the-fiftyone-app) and see what's in it.


In [None]:
# Open the FiftyOne App and explore the dataset
session = fo.launch_app(dataset, auto=False)
session.url

**What to look at in the App:**

1. Play a few videos and toggle on `gpt_action_brief` — watch how the temporal segments track the actions you see on screen
2. Click on a sample with high `tree_depth` (try sorting by it) — notice how the annotations at level 0 are broad overviews, while leaf-level annotations are very fine-grained moments
3. Look at the `transcript` field — this is the raw ASR text, much noisier than the GPT-refined annotations
4. Check a few `gpt_summary_detailed` labels — these are the longest annotations (~540 words average)

> **Key question to hold in mind:** Every one of these labels came from an automated pipeline. They look authoritative. But should we trust them?

---
## Section 2: Three Lenses on the Same Data

Before interrogating the existing annotations, let's build our own understanding of the dataset — using nothing but the raw videos and transcripts.

We'll create three separate embedding spaces:

1. **Visual ([Qwen3-VL-Embedding](https://docs.voxel51.com/plugins/plugins_ecosystem/qwen3vl_embeddings.html)):** What the videos look like — and crucially, this lives in a shared text-video space, so we can search with natural language

2. **Visual-Grounding ([Molmo2](https://docs.voxel51.com/plugins/plugins_ecosystem/molmo2.html)):** A different visual understanding — video-to-video similarity only, but from a model trained on grounding and spatial reasoning

3. **Language (Transcript):** What people are *saying* in these videos, embedded with a text model

The insight from comparing these three spaces is the foundation of everything that follows: **what you embed determines what you find**.

### Lens 1: Visual Content (Qwen3-VL-Embedding)

[Qwen3-VL-Embedding](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) maps video and text into a shared vector space. This means we can embed a natural language query and find videos that match without touching any of the existing labels. It's semantic search, not keyword search.

In [2]:
# Load the 2B embedding model — good balance of quality and speed
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_embeddings",
    overwrite=True
)

qwen_emb_model = foz.load_zoo_model(
    "Qwen/Qwen3-VL-Embedding-2B",
)

qwen_emb_model.max_length=32768

# Compute embeddings — stores a vector on each sample
dataset.compute_embeddings(
    qwen_emb_model,
    embeddings_field="qwen_embeddings",
    batch_size=64,
    num_workers=8,
    skip_failures=False,
)

print(f"Embeddings computed for {dataset.exists('qwen_embeddings').count()} samples")

Downloading https://github.com/harpreetsahota204/qwen3vl_embeddings...
  121.6Mb [885.4ms elapsed, ? remaining, 137.3Mb/s] 
Overwriting existing model source '/home/harpreet/fiftyone/__models__/qwen3vl-embeddings'


Fetching 14 files: 100%|██████████| 14/14 [00:33<00:00,  2.41s/it]

   0% ||--------------|    0/1144 [91.8ms elapsed, ? remaining, ? samples/s] 


`torch_dtype` is deprecated! Use `dtype` instead!
qwen-vl-utils using decord to read video.


 100% |███████████████| 1144/1144 [26.4m elapsed, 0s remaining, 0.8 samples/s]    
Embeddings computed for 1144 samples


### What you can do after computing embeddings in FiftyOne

Once you've computed embeddings, you unlock powerful workflows:

#### Visualize your embeddings in the App

[Use `compute_visualization()`](https://docs.voxel51.com/brain.html#visualizing-embeddings) to reduce embeddings with UMAP, t‑SNE, or PCA and explore them in the Embeddings panel to:
- See clusters and data structure  
- Understand how classes group together  
- Lasso regions to create views and filter subsets


In [3]:
import fiftyone.brain as fob

# Project into 2D with UMAP for visualization
fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="qwen_viz",
    embeddings="qwen_embeddings",
    num_dims=2,
)

Generating visualization...
UMAP( verbose=True)
Mon Feb 23 13:16:41 2026 Construct fuzzy simplicial set
Mon Feb 23 13:16:42 2026 Finding Nearest Neighbors
Mon Feb 23 13:16:45 2026 Finished Nearest Neighbor Search
Mon Feb 23 13:16:47 2026 Construct embedding


Mon Feb 23 13:16:48 2026 Finished embedding





<fiftyone.brain.visualization.VisualizationResults at 0x7a6ede3b8d50>


####  Search by visual similarity

[Build a similarity index](https://docs.voxel51.com/brain.html#similarity) with `compute_similarity()` to:
- Find similar samples (image‑to‑image or, as here, video‑to‑video search)  
- Detect duplicates or unique samples  
- Enable visual search in the App or via Python


In [4]:
# Build a similarity index — this is what powers text-to-video search
fob.compute_similarity(
    dataset,
    brain_key="qwen_sim",
    embeddings="qwen_embeddings",
)

<fiftyone.brain.internal.core.sklearn.SklearnSimilarityIndex at 0x7a6fa89a1190>
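
To make the idea of a shared text-video space concrete, here is a minimal, library-free sketch of what a similarity query reduces to: ranking stored video vectors by cosine similarity to a query vector. The 3-dimensional vectors, clip names, and helper functions are made up for illustration. In the notebook, the real vectors live in `qwen_embeddings` and the similarity index does the searching.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, embeddings, k=3):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings (real ones have far more dimensions)
video_embeddings = {
    "chopping_onions": [0.9, 0.1, 0.0],
    "fixing_a_bike": [0.1, 0.9, 0.2],
    "baking_bread": [0.8, 0.2, 0.1],
}
# Hypothetical embedding of the text query "someone cooking"
query_embedding = [1.0, 0.0, 0.0]

top_matches = rank_by_similarity(query_embedding, video_embeddings, k=2)
print(top_matches)
```

Because the text query and the videos are embedded by the same model, the query never touches the existing labels. It only compares directions in the vector space.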

You can also use this model for zero-shot classification. Let's add a Field to the Dataset:

In [5]:
classes = [
    "Cooking and Food",
    "Home Improvement and DIY",
    "Health and Beauty",
    "Hobbies and Crafts",
    "Sports and Fitness",
    "Gardening",
    "Technology and Electronics",
    "Fashion and Style",
    "Arts and Music",
    "Automotive",
    "Pets and Animals",
    "Education and Learning"
]

# Configure model for classification
qwen_emb_model.classes = classes
qwen_emb_model.text_prompt = "A video about "

# Apply zero-shot classification
dataset.apply_model(
    qwen_emb_model, 
    label_field="predicted_class"
    )

 100% |███████████████| 1144/1144 [26.3m elapsed, 0s remaining, 0.8 samples/s]    
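
Zero-shot classification with an embedding model can be pictured as follows: embed each prompt plus class name as text, compare every class vector with the video vector, and keep the best match. The sketch below is one common way to do this, not necessarily what the plugin does internally. The 2D unit vectors, the `temperature` value, and the function names are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(video_vec, class_vecs, temperature=0.1):
    """Pick the class whose text embedding best matches the video embedding.

    Assumes all vectors are already L2-normalized, so the dot product
    equals cosine similarity. The temperature is a hypothetical choice.
    """
    names = list(class_vecs)
    sims = [sum(v * c for v, c in zip(video_vec, class_vecs[n])) for n in names]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], probs[best]

# Hypothetical unit vectors standing in for the embeddings of
# "A video about Cooking and Food" and "A video about Automotive"
class_embeddings = {
    "Cooking and Food": [1.0, 0.0],
    "Automotive": [0.0, 1.0],
}
video_embedding = [0.8, 0.6]  # unit length: 0.64 + 0.36 = 1

label, confidence = zero_shot_classify(video_embedding, class_embeddings)
print(label, round(confidence, 3))
```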


### Lens 2: Visual-Grounding (Molmo2)

Molmo2 is a different kind of vision-language model — it's trained heavily on grounding tasks: pointing to objects, tracking them through time, and counting. Its internal representations will weight spatial and motion features differently than Qwen3-VL-Embedding.

We can't do text-to-video search with Molmo2 embeddings, but we can still visualize the embedding space and do video-to-video similarity. 

The question is: **do two different visual models agree on which videos are similar?**

In [None]:
import fiftyone.zoo as foz 

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/molmo2",
    overwrite=True
)

molmo_model = foz.load_zoo_model("allenai/Molmo2-4B")

molmo_model.pooling_strategy = "mean"

dataset.compute_embeddings(
    molmo_model,
    embeddings_field="molmo_embeddings",
    batch_size=64,
    num_workers=4,
    skip_failures=False,
)

In [None]:
import fiftyone.brain as fob

fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="molmo_viz",
    embeddings="molmo_embeddings",
    num_dims=2,
)

fob.compute_similarity(
    dataset,
    brain_key="molmo_sim",
    embeddings="molmo_embeddings",
)

### Lens 3: Language (Transcript Embeddings)

Both visual embeddings above represent what you *see* in the videos. Now let's embed what you *hear* — the ASR transcript text.

Instructional videos vary enormously in how much they narrate. Some presenters talk through every step; others work in silence. The transcript embedding space captures this variation in a way neither visual model can.

For this we can use [`jina-embeddings-v5-text-small-clustering`](https://huggingface.co/jinaai/jina-embeddings-v5-text-small-clustering). 

This model is not available as a remote-source zoo model, but we can still use it directly. First, the `values` method of the [Dataset](https://docs.voxel51.com/user_guide/basics.html#datasets) pulls every value of the `transcript` [Field](https://docs.voxel51.com/user_guide/basics.html#fields) into a Python list.

With the values in a list, we can run the model natively and then write the results back to the Dataset as a new Field using [the `set_values`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.set_values) method.

In [None]:
import os
import torch
from sentence_transformers import SentenceTransformer

# Set an environment variable so the tokenizers library doesn't warn about parallelism.
# Note: this relates to the `transformers` and `tokenizers` libraries; it is not a FiftyOne setting.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

text_emb_model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-clustering",
    model_kwargs={"dtype": torch.bfloat16}, #recommended for GPUs
    config_kwargs={"attn_implementation": "flash_attention_2"}, #optional, but recommended
    tokenizer_kwargs={"extra_special_tokens": {}}, # This line fixes the AttributeError: 'list' object has no attribute 'keys'
)


transcripts = dataset.values("transcript")

# Encode texts
text_embeddings = text_emb_model.encode(transcripts)

dataset.set_values("text_embeddings", text_embeddings)

Default prompt name is set to 'document'. This prompt will be applied to all `encode()` calls, except if `encode()` is called with `prompt` or `prompt_name` parameters.


In [17]:
fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="transcript_viz",
    embeddings="text_embeddings",
    num_dims=2,
)

Generating visualization...
UMAP( verbose=True)
Mon Feb 23 14:26:10 2026 Construct fuzzy simplicial set
Mon Feb 23 14:26:10 2026 Finding Nearest Neighbors
Mon Feb 23 14:26:10 2026 Finished Nearest Neighbor Search
Mon Feb 23 14:26:10 2026 Construct embedding



Mon Feb 23 14:26:10 2026 Finished embedding


<fiftyone.brain.visualization.VisualizationResults at 0x7a70f44f6750>

As with the visual embeddings, the cell above projects the transcript embeddings into 2D with UMAP so they can be explored in the Embeddings panel.

### Using Embeddings to Compute Uniqueness and Representativeness Values

In FiftyOne, both **uniqueness** and **representativeness** are scalar scores computed from embeddings that describe how a sample relates to the rest of the dataset.

#### Uniqueness

- [Measures **how different (dissimilar)**](https://docs.voxel51.com/brain.html#image-uniqueness) a sample is from its neighbors in embedding space.  

- Implemented by looking at each sample’s nearest neighbors, weighting their distances (e.g. 60%-30%-10%), and normalizing to a score in **\[0, 1\]**. Higher scores mean the sample is more “isolated” or distinct; lower scores mean it has many close neighbors.
 
- The most unique sample in a collection has a uniqueness value of **1**.

- Useful for:
  - Finding **outliers / edge cases / bad actors**  
  - Detecting **near-duplicates** (by looking at *low* uniqueness)  
  - Selecting **diverse samples** for annotation when budget is limited 
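
The idea behind the score can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions (Euclidean distance, the 60%-30%-10% weights applied to the three nearest neighbors, and scaling by the maximum so the most unique sample scores 1). It is not FiftyOne's actual implementation, and `toy_uniqueness` is a hypothetical name.

```python
import math

def toy_uniqueness(embeddings, weights=(0.6, 0.3, 0.1)):
    """Weighted distance to the nearest neighbors, scaled so the max is 1."""
    raw = []
    for i, vec in enumerate(embeddings):
        # Distances from this sample to every other sample, closest first
        dists = sorted(
            math.dist(vec, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        # zip() stops at the shorter sequence, so only the nearest few count
        raw.append(sum(w * d for w, d in zip(weights, dists)))
    top = max(raw) or 1.0  # guard against all-identical embeddings
    return [r / top for r in raw]

# Three near-duplicates and one outlier (hypothetical 2D embeddings)
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
scores = toy_uniqueness(points)
print([round(s, 3) for s in scores])
```

The outlier ends up with a score of 1, while the three near-duplicates score low because each has close neighbors.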


In [54]:
import fiftyone.brain as fob

fob.compute_uniqueness(
    dataset,
    uniqueness_field = "qwen_uniqueness",
    embeddings = "qwen_embeddings",
    batch_size=64,
    num_workers=4
)

Computing uniqueness...
Uniqueness computation complete


#### Representativeness

- [Measures **how typical** or **central**](https://docs.voxel51.com/brain.html#image-representativeness) a sample is relative to the rest of the dataset in embedding space.  

- Also normalized to **\[0, 1\]**, with the most representative samples having value **1**.  

- High representativeness ≈ sample lies in a dense, central region of the data; it’s similar to many other samples.  

- Useful for:
  - **Active learning**: picking representative samples to cover common patterns  
  - **Dataset balancing**: finding under/overrepresented regions  
  - **Efficient annotation**: prioritizing samples that “stand in” for many others 

In short:

- **Uniqueness** → “How unusual is this sample?”  
- **Representativeness** → “How well does this sample represent many others?”

In [55]:
import fiftyone.brain as fob

fob.compute_representativeness(
    dataset,
    representativeness_field = "qwen_rep",
    embeddings = "qwen_embeddings",
    batch_size=64,
    num_workers=4
)

Computing representativeness...
Computing clusters for 1144 embeddings; this may take awhile...
Representativeness computation complete


### The Cross-Lens Moment

Try to **find a tight visual cluster in the Qwen UMAP, then look at where those same videos land in the transcript UMAP.**

You may find that videos which cluster tightly by visual content are often spread out in transcript space. Cooking videos that all look similar — same kitchen setup, same hands-on-food visual — may use completely different spoken language: detailed step-by-step narration, casual conversational chat, or total silence.

This is not a failure of any embedding model, but a property of the data: **visual similarity and linguistic similarity are measuring different things.** Your choice of embedding determines what questions you can ask.

- Use Qwen visual embeddings to find *visually similar content* or do text-to-video semantic search

- Use Molmo2 embeddings to find videos with similar *physical actions and spatial structure*

- Use transcript embeddings to find videos that *talk about similar topics*

In [None]:
# Pick a video and find its nearest neighbors in both visual embedding spaces,
# a concrete way to see how the spaces differ
sample = dataset.first()
print(f"Reference video: {sample.title}")
print(f"Root annotation: {sample.gpt_root_summary}\n")

for brain_key, label in [("qwen_sim", "Qwen visual"), ("molmo_sim", "Molmo2 visual")]:
    # k=4: the query video itself (assumed to rank first) plus its 3 nearest neighbors
    neighbors = dataset.sort_by_similarity(sample.id, brain_key=brain_key, k=4)
    print(f"Nearest neighbors ({label}):")
    # skip(1) drops the query video; take() would sample randomly and lose the ranking
    for n in neighbors.skip(1):
        print(f"  → {n.gpt_root_summary}")
    print()

**What to look at:** 

- In the Embeddings panel, hover over clusters. Click points to see which videos land near each other. Try coloring by `tree_depth` or `title` to see if the visual clusters map onto any metadata patterns.


- Compare the Molmo2 UMAP with the Qwen UMAP. Are the clusters in the same positions? Similar shape, different boundaries — this is two models with different visual priors producing different embedding spaces. Neither is "correct" — they're measuring different aspects of the same content.
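
If you want a number to go with the visual comparison, one option is to measure how much each sample's nearest-neighbor set overlaps across two embedding spaces. The sketch below is a plain-Python illustration with made-up 1D vectors. The function names and the choice of Jaccard overlap are assumptions, not part of FiftyOne.

```python
import math

def nearest_neighbors(index, embeddings, k):
    """Indices of the k samples closest to embeddings[index] (Euclidean)."""
    target = embeddings[index]
    others = [i for i in range(len(embeddings)) if i != index]
    others.sort(key=lambda i: math.dist(target, embeddings[i]))
    return set(others[:k])

def neighbor_overlap(space_a, space_b, k=2):
    """Mean Jaccard overlap of each sample's k-NN set across two spaces."""
    overlaps = []
    for i in range(len(space_a)):
        nn_a = nearest_neighbors(i, space_a, k)
        nn_b = nearest_neighbors(i, space_b, k)
        overlaps.append(len(nn_a & nn_b) / len(nn_a | nn_b))
    return sum(overlaps) / len(overlaps)

# Hypothetical embeddings of the same four videos under two models
space_one = [[0.0], [1.0], [10.0], [11.0]]   # pairs up videos (0, 1) and (2, 3)
space_two = [[0.0], [10.0], [1.0], [11.0]]   # pairs up videos (0, 2) and (1, 3)

print(neighbor_overlap(space_one, space_one, k=1))  # identical spaces
print(neighbor_overlap(space_one, space_two, k=1))  # disagreeing spaces
```

A value near 1 means the two models largely agree on which videos are similar. A value near 0 means they group the data very differently.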

---
## Section 3: The Second Opinion

We've built three independent maps of the dataset. Now let's do what the original pipeline did — but with a completely different model.

The Action100M annotations were produced by GPT-OSS-120B processing outputs from PerceptionLM-3B and Llama-3.2-Vision-11B. 

We're going to run **Qwen3-VL-8B** on the same videos, completely independently, and ask:

- How would *you* describe this video?

- What events do *you* see, and when?

We're not trying to beat the pipeline. We're trying to get a second opinion. Where the two agree, we gain confidence in the original annotation. Where they diverge, we've found something worth looking at.

In [None]:
import fiftyone.zoo as foz 

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_video",
    overwrite=True
)

qwen_video_model = foz.load_zoo_model(
    "Qwen/Qwen3-VL-8B-Instruct",
)

In [None]:
# Generate a full-video description for every sample.
# This produces a string field: qwen_desc_summary
qwen_video_model.operation = "description"

dataset.apply_model(
    qwen_video_model,
    label_field="qwen_desc",
    skip_failures=True,
)

print(f"Descriptions generated: {dataset.exists('qwen_desc_summary').count()}")

In [None]:
# Temporally localize events — produces TemporalDetections with start/end frames.
# Compare these against gpt_action_brief in the App.
qwen_video_model.operation = "temporal_localization"

dataset.apply_model(
    qwen_video_model,
    label_field="qwen_events",
    skip_failures=True,
)

print(f"Temporal events generated: {dataset.exists('qwen_events').count()}")

You'll notice the descriptions often describe the same video in quite different language — different vocabulary, different sentence structure, different level of detail. Sometimes one catches a detail the other misses. 

The question is: **how do we move from "these look different" to a number we can sort and filter by?**

---
## Section 4: Measuring Agreement

Eyeballing agreement doesn't scale. We need a metric.

The obvious choice is a lexical similarity metric: edit distance, word overlap, something that compares the actual characters and words in the two strings. 

For example:
- GPT might say: *"spread almonds on tray"*  
- Qwen might say: *"person distributes nuts onto a baking sheet"*

These are describing exactly the same action. But their character-level edit distance is enormous — they share almost no words. Any lexical metric will call this a bad match.

This is why the right metric for comparing paraphrased descriptions is **not** edit distance. But edit distance is still *useful* — it gives us a lower bound, and the gap between lexical and semantic similarity is itself informative.
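
To see the problem on the two example captions, here is a self-contained sketch of normalized Levenshtein similarity and an ANLS-style thresholded score. It follows the standard definitions (similarity = 1 minus edit distance divided by the longer length; scores whose normalized distance reaches the threshold are zeroed). The function names are hypothetical, and the plugin's exact normalization may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def normalized_similarity(pred, truth, case_sensitive=False):
    """1 - (edit distance / longer length); 1.0 means identical strings."""
    if not case_sensitive:
        pred, truth = pred.lower(), truth.lower()
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest

def anls_score(pred, truth, threshold=0.5):
    """Keep the similarity only if the normalized distance is under the threshold."""
    similarity = normalized_similarity(pred, truth)
    return similarity if (1.0 - similarity) < threshold else 0.0

gpt = "spread almonds on tray"
qwen = "person distributes nuts onto a baking sheet"
print(normalized_similarity(gpt, qwen), anls_score(gpt, qwen))
```

The two captions describe the same action, yet their similarity falls below 0.5, so the thresholded score collapses to zero.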

In [None]:
import fiftyone.operators as foo 

# ANLS: Average Normalized Levenshtein Similarity
# Standard metric for OCR/VLM evaluation — robust to minor errors,
# but still character-level. Watch how it penalizes semantic equivalents.
anls_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_anls")

anls_result = anls_op(
    dataset,
    pred_field="qwen_desc_summary",
    gt_field="gpt_root_summary",
    output_field="anls_score",
    threshold=0.5,
    case_sensitive=False,
    delegate=False,
)

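# Format only when a numeric mean is present; ':.3f' would raise on the 'N/A' fallback
mean_anls = anls_result.get("mean_anls")
print(f"Mean ANLS: {mean_anls:.3f}" if mean_anls is not None else "Mean ANLS: N/A")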
print("(Low score expected — lexical metrics penalize paraphrase)")

In [None]:
# Normalized Levenshtein Similarity — continuous 0–1 score without threshold
# More granular than ANLS, still lexical.
sim_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_normalized_similarity")

sim_result = sim_op(
    dataset,
    pred_field="qwen_desc_summary",
    gt_field="gpt_root_summary",
    output_field="annotation_confidence",
    case_sensitive=False,
    delegate=False,
)

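# Format only when a numeric mean is present; ':.3f' would raise on the 'N/A' fallback
mean_similarity = sim_result.get("mean_similarity")
print(f"Mean normalized similarity: {mean_similarity:.3f}" if mean_similarity is not None else "Mean normalized similarity: N/A")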

**What to look at:** In the App, sort the dataset by `annotation_confidence` and browse the lowest-scoring samples. Are these actually bad annotations, or just paraphrases that look different? This is the limitation of lexical metrics: they can't tell the difference. A future extension here is a semantic similarity operator using sentence embeddings, which would score paraphrases much more accurately.

For now, the low-confidence samples are exactly what we want for the next section: the cases where the two systems most clearly diverge.

---
## Section 5: Grounding the Hard Cases

We now have a ranked list of the samples where GPT-OSS-120B and Qwen3-VL disagree most strongly. Instead of reading more text descriptions, let's ask a different question: **can a third model show us what's actually happening?**

Molmo2's core strength is grounding — it can point to specific objects and actors in frames, and it can localize when events happen over time. When two models disagree in words, Molmo2 provides a qualitatively different kind of evidence: spatial coordinates in the actual pixels.

This isn't just a tiebreaker. It demonstrates a principle: **text descriptions are abstractions. When abstractions conflict, go back to the source.**

In [None]:
# Work on the bottom quartile: the most uncertain samples
from fiftyone import ViewField as F

# Samples where a model call failed may have no score, so drop None values
scores = [s for s in dataset.values("annotation_confidence") if s is not None]
threshold = sorted(scores)[len(scores) // 4]  # approximately the 25th percentile
uncertain_view = dataset.match(F("annotation_confidence") < threshold)
print(f"Working with {len(uncertain_view)} low-confidence samples (score < {threshold:.2f})")

In [None]:
# Temporal localization with Molmo2, independent of both previous models
molmo_model.operation = "temporal_localization"

uncertain_view.apply_model(
    molmo_model,
    label_field="molmo_events",
    skip_failures=True,
)

print(f"Molmo2 temporal localization done for {uncertain_view.exists('molmo_events').count()} samples")

In [None]:
# Pointing — ask Molmo2 to locate the main actor on each frame
# This produces frame-level keypoints in the App
molmo_model.operation = "pointing"
molmo_model.prompt = "the person performing the main action"

uncertain_view.apply_model(
    molmo_model,
    label_field="molmo_actor",
    skip_failures=True,
)

print("Molmo2 pointing done.")

In [None]:
# View the uncertain samples with all three annotation sources visible:
# gpt_action_brief (original), qwen_events (second opinion), molmo_events (third opinion)
# molmo_actor keypoints are frame-level — visible in the video frame panel
session.view = uncertain_view.exists("molmo_events")

print("In the App, for each uncertain sample, look at:")
print("  Timeline: gpt_action_brief | qwen_events | molmo_events")
print("  Frames:   molmo_actor (keypoints showing where the actor is)")
print("  Text:     gpt_root_summary | qwen_desc_summary")
print()
print("Cases to find:")
print("  ✓ Molmo2 agrees with one model — tiebreaker worked")
print("  ? All three diverge — the video is genuinely ambiguous")
print("  ✗ Low score was just a paraphrase — lexical metric was wrong")

---
## Section 6: The Payoff

Let's look at what we built.

In [None]:
from fiftyone import ViewField as F 

# High-confidence subset: both models agree — ready to use
high_confidence = dataset.match(F("annotation_confidence") >= 0.6)
print(f"High-confidence subset (score ≥ 0.6): {len(high_confidence)} samples")

# Review queue: models disagree — worth a closer look
review_queue = dataset.match(F("annotation_confidence") < 0.3)
print(f"Review queue (score < 0.3):           {len(review_queue)} samples")

# Open the full dataset with everything visible
session.view = dataset
print("\nFull dataset loaded in App. Use Filter to slice by annotation_confidence.")

---

## What We Built

We started with 1,144 videos and a set of annotations we had to take on faith.

We ended with a **confidence map**:

- **High confidence (`annotation_confidence ≥ 0.6`)** — both GPT-OSS-120B and Qwen3-VL describe these clips similarly. The annotation is likely reliable. Use these for training or evaluation.

- **Low confidence (`annotation_confidence < 0.3`)** — the models diverge. Some of these are just paraphrase failures (a limitation of lexical metrics). Others are genuinely ambiguous clips where even multiple frontier models don't agree. These are your review queue — and Molmo2's spatial grounding is your best tool for investigating them.
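
The two thresholds leave a middle band (0.3 to 0.6) that the bullets above do not name, and samples where a model call was skipped may have no score at all. A small triage helper makes those cases explicit. This is a sketch: the `"middle"` and `"unscored"` labels, the function name, and the example scores are hypothetical.

```python
def confidence_bucket(score, high=0.6, low=0.3):
    """Map an agreement score to a triage label."""
    if score is None:
        return "unscored"  # e.g. a sample where a model call failed
    if score >= high:
        return "high_confidence"
    if score < low:
        return "review_queue"
    return "middle"

# Hypothetical per-clip agreement scores
example_scores = {"clip_a": 0.72, "clip_b": 0.41, "clip_c": 0.12, "clip_d": None}
buckets = {name: confidence_bucket(s) for name, s in example_scores.items()}
print(buckets)
```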

---

### The Workflow, Generalized

Everything we did here transfers directly to any video dataset with AI-generated labels:

1. **Load and explore** — understand the annotation structure before trusting it
2. **Embed from multiple angles** — visual content, visual grounding, language; each lens reveals something different
3. **Generate an independent second opinion** — run a different model, compare outputs
4. **Quantify agreement** — a score per sample turns qualitative inspection into a sortable, filterable signal
5. **Ground the disagreements** — when text descriptions conflict, spatial evidence is the tiebreaker

The tools: **FiftyOne** for orchestration and visualization, **Qwen3-VL-Embedding** for the searchable visual index, **Qwen3-VL** for independent annotation, **Molmo2** for grounding, and a **text evaluation plugin** to quantify it all.

The dataset doesn't matter. The workflow does.