# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/fiftyone_video_workshop/blob/main/workshop.ipynb)

# Understanding Video Data at Scale
### A Hands-On Workshop with Action100M and FiftyOne

---

Video is a hard modality to work with. 

You're dealing with more data, temporal complexity, and annotation workflows that don't scale. This notebook tackles a practical question: **given a large video dataset, how do you understand what's in it without manually watching thousands of clips?**

We're working with a subset of **[Action100M preview](https://www.arxiv.org/abs/2601.10592)**. In this subset there are 1,144 YouTube videos, each clipped to 90 seconds, annotated with a hierarchical *Tree-of-Captions* structure produced by a fully automated AI pipeline: V-JEPA 2 for segmentation, PerceptionLM-3B and Llama-3.2-Vision-11B for captioning, and GPT-OSS-120B with multi-round Self-Refine for structured annotation extraction.

Every label in this dataset was written by a model. None of it was seen by a human annotator.

As AI-generated datasets become the norm, **the skill of interrogating machine-generated annotations is increasingly important**. This notebook shows you how to do that systematically.

---

### What We'll Build

|  | Question | Tools |
|---|---|---|
| 1. What We Were Given | What does this dataset claim to contain? | FiftyOne App |
| 2. Three Lenses | What does the raw data actually look like? | Qwen3-VL-Embedding, Molmo2, Sentence Transformers |
| 3. The Second Opinion | Does a second model agree with the first? | Qwen3-VL |
| 4. Measuring Agreement | How much do they agree, per sample? | Text Evaluation Plugin |
| 5. Grounding the Hard Cases | Where they disagree, who's right? | Molmo2 |
| 6. The Payoff | What can I now do with this? | FiftyOne App |

By the end, you'll have a **confidence map** of the dataset's annotations and a reusable workflow for understanding any video dataset with AI-generated labels.

In [None]:
import fiftyone as fo

fo.config.requirement_error_level=2

---
## Section 1: What We Were Given

Before running any models, let's understand the shape of what we have. 

The Action100M preview comes with rich pre-existing annotations from the Tree-of-Captions pipeline. Each video has temporal segments annotated at multiple levels of granularity:

- **Level 0 (root/tier):** The full video — one annotation covering the entire 90-second clip
- **Mid levels:** Sub-segments — multi-second chunks describing coherent activities
- **Leaf level:** The finest grain — individual moments

At each level, there are five annotation fields:
- `gpt_summary_brief` — one-sentence clip caption
- `gpt_summary_detailed` — full play-by-play description
- `gpt_action_brief` — short verb phrase ("spread almonds on tray")
- `gpt_action_detailed` — instruction-manual version of the action
- `gpt_action_actor` — who's performing it

All of these were generated by GPT-OSS-120B with three rounds of Self-Refine. We're going to take these at face value for now — and then build the tools to interrogate them.
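To make the hierarchy concrete, here is a minimal sketch of how one video's annotations could be modeled in plain Python. The `Segment` class, the time spans, the caption text, and the tier names other than `root` are hypothetical and chosen only for illustration; the actual dataset stores these annotations as FiftyOne temporal detections.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One node of a hypothetical Tree-of-Captions hierarchy."""
    tier: str          # "root", "mid", or "leaf" (only "root" appears in this notebook)
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    action_brief: str  # short verb phrase, in the spirit of gpt_action_brief


toy_segments = [
    Segment("root", 0.0, 90.0, "roast almonds"),
    Segment("mid", 0.0, 40.0, "prepare baking tray"),
    Segment("mid", 40.0, 90.0, "bake almonds"),
    Segment("leaf", 12.0, 18.0, "spread almonds on tray"),
]

# The root covers the whole clip; finer tiers cover shorter spans inside it.
toy_root = next(s for s in toy_segments if s.tier == "root")
toy_leaves = [s for s in toy_segments if s.tier == "leaf"]

print(f"root: {toy_root.action_brief} ({toy_root.end - toy_root.start:.0f}s)")
for leaf in toy_leaves:
    print(f"leaf: {leaf.action_brief} ({leaf.end - leaf.start:.0f}s)")
```

The same pattern, filtering by tier and then reading a label, is what the notebook does later against the real fields.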

You can either download the dataset from the [Voxel51 Hugging Face org](https://huggingface.co/datasets/Voxel51/action100m_tiny_subset), or if you face rate limit issues, you can follow the instructions in the `download_scripts` directory of this repository to download the dataset and parse it into FiftyOne format.

If you are downloading from the Hugging Face Hub, run:

```python
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/action100m_tiny_subset",
    dataset_name="action100m",
    overwrite=True,
    persistent=True,
)
```

If you want to use the dataset with all the enrichments, then run:

```python
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "harpreetsahota/fo_video_workshop_enriched",
    dataset_name="action100m_enriched",
    overwrite=True,
    persistent=True,
)
```

If you've downloaded the dataset and parsed manually, then run:

```python
dataset = fo.load_dataset("action100m")
```

I'll assume you want the dataset with all the enrichments:

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "harpreetsahota/fo_video_workshop_enriched",
    dataset_name="action100m_enriched",
    overwrite=True,
    persistent=True,
)

A **[FiftyOne dataset](https://docs.voxel51.com/user_guide/using_datasets.html)** is the core data structure you work with in [FiftyOne](https://docs.voxel51.com/user_guide/basics.html). It:

- Logically represents your visual data (images, videos, point clouds, etc.) along with all associated information: labels, metadata, predictions, and other fields.

- Is stored in a lightweight, non-relational database (MongoDB) so it can scale to large datasets without loading all media into RAM. 

- Can be created from many sources: directories of files, common dataset formats (like COCO), or the built‑in Dataset Zoo. 

Conceptually, you can [think of a dataset like a table in pandas](https://github.com/voxel51/fiftyone/blob/develop/docs/source/tutorials/pandas_comparison.ipynb): it has **rows and columns**, but specialized for computer vision:

- In pandas: **row = record**, **column = feature**
- In FiftyOne: **sample = record**, **field = feature** 

You can inspect the schema of a dataset by calling it:

In [None]:
dataset

### What is a sample?

A [**sample**](https://docs.voxel51.com/user_guide/basics.html#samples) is the atomic element of a FiftyOne dataset. It is:

- One “item” of your data (for example, a single image or a single video) plus everything you know about it. 
- A flexible container of [**fields**](https://docs.voxel51.com/user_guide/basics.html#fields), which can include:
  - `filepath` to the media
  - `metadata` (dimensions, duration, etc.)
  - Ground truth labels (detections, classifications, segmentations, etc.)
  - Model predictions
  - Tags, scalar values, strings, arrays, and more 

So, **a dataset is a collection of samples**, and **each sample is everything you care about for one media file**, stored in a structured way that’s easy to query, visualize, and modify.

You can see what a sample looks like by inspecting the [first](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.first) sample:

In [None]:
dataset.first()

Once you have your dataset, you can [launch the app](https://docs.voxel51.com/user_guide/app.html#using-the-fiftyone-app) and see what's in it.


In [None]:
# Open the FiftyOne App and explore the dataset
session = fo.launch_app(dataset, auto=False)
session.url

**What to look at in the App:**

1. Play a few videos and toggle on `gpt_action_brief` — watch how the temporal segments track the actions you see on screen
2. Click on a sample with high `tree_depth` (try sorting by it) — notice how the annotations at level 0 are broad overviews, while leaf-level annotations are very fine-grained moments
3. Look at the `transcript` field — this is the raw ASR text, much noisier than the GPT-refined annotations
4. Check a few `gpt_summary_detailed` labels — these are the longest annotations (~540 words average)

> **Key question to hold in mind:** Every one of these labels came from an automated pipeline. They look authoritative. But should we trust them?

---
## Section 2: Three Lenses on the Same Data

Before interrogating the existing annotations, let's build our own understanding of the dataset — using nothing but the raw videos and transcripts.

We'll create three separate embedding spaces:

1. **Visual ([Qwen3-VL-Embedding](https://docs.voxel51.com/plugins/plugins_ecosystem/qwen3vl_embeddings.html)):** What the videos look like — and crucially, this lives in a shared text-video space, so we can search with natural language

2. **Visual-Grounding ([Molmo2](https://docs.voxel51.com/plugins/plugins_ecosystem/molmo2.html)):** A different visual understanding — video-to-video similarity only, but from a model trained on grounding and spatial reasoning

3. **Language (Transcript):** What people are *saying* in these videos, embedded with a text model

The insight from comparing these three spaces is the foundation of everything that follows: **what you embed determines what you find**.

### Lens 1: Visual Content (Qwen3-VL-Embedding)

[Qwen3-VL-Embedding](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) maps video and text into a shared vector space. This means we can embed a natural language query and find videos that match without touching any of the existing labels. It's semantic search, not keyword search.

In [None]:
# Load the 2B embedding model — good balance of quality and speed
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_embeddings",
    overwrite=True
)

qwen_emb_model = foz.load_zoo_model(
    "Qwen/Qwen3-VL-Embedding-2B",
)

qwen_emb_model.max_length=32768

# Compute embeddings — stores a vector on each sample
dataset.compute_embeddings(
    qwen_emb_model,
    embeddings_field="qwen_embeddings",
    batch_size=64,
    num_workers=8,
    skip_failures=False,
)

### What you can do after computing embeddings in FiftyOne

Once you've computed embeddings, you unlock powerful workflows:

#### Visualize your embeddings in the App

[Use `compute_visualization()`](https://docs.voxel51.com/brain.html#visualizing-embeddings) to reduce embeddings with UMAP, t‑SNE, or PCA and explore them in the Embeddings panel to:
- See clusters and data structure  
- Understand how classes group together  
- Lasso regions to create views and filter subsets


In [None]:
import fiftyone.brain as fob

# Project into 2D with UMAP for visualization
fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="qwen_viz",
    embeddings="qwen_embeddings",
    num_dims=2,
)


####  Search by visual similarity

[Build a similarity index](https://docs.voxel51.com/brain.html#similarity) with `compute_similarity()` to:
- Find similar images (image‑to‑image search)  
- Detect duplicates or unique samples  
- Enable visual search in the App or via Python


In [None]:
# Build a similarity index — this is what powers text-to-video search
fob.compute_similarity(
    dataset,
    brain_key="qwen_sim",
    embeddings="qwen_embeddings",
)

You can also use this model for zero-shot classification. Let's add a Field to the Dataset:

In [None]:
classes = [
    "Cooking and Food",
    "Home Improvement and DIY",
    "Health and Beauty",
    "Hobbies and Crafts",
    "Sports and Fitness",
    "Gardening",
    "Technology and Electronics",
    "Fashion and Style",
    "Arts and Music",
    "Automotive",
    "Pets and Animals",
    "Education and Learning"
]

# Configure model for classification
qwen_emb_model.classes = classes
qwen_emb_model.text_prompt = "A video about "

# Apply zero-shot classification
dataset.apply_model(
    qwen_emb_model, 
    label_field="predicted_class"
    )

### Lens 2: Visual-Grounding (Molmo2)

Molmo2 is a different kind of vision-language model — it's trained heavily on grounding tasks: pointing to objects, tracking them through time, and counting. Its internal representations will weight spatial and motion features differently than Qwen3-VL-Embedding.

We can't do text-to-video search with Molmo2 embeddings, but we can still visualize the embedding space and do video-to-video similarity. 

The question is: **do two different visual models agree on which videos are similar?**

In [None]:
import fiftyone.zoo as foz 

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/molmo2",
    overwrite=True
)

molmo_model = foz.load_zoo_model("allenai/Molmo2-4B")

molmo_model.pooling_strategy = "mean"

dataset.compute_embeddings(
    molmo_model,
    embeddings_field="molmo_embeddings",
    batch_size=64,
    num_workers=4,
    skip_failures=False,
)

In [None]:
import fiftyone.brain as fob

fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="molmo_viz",
    embeddings="molmo_embeddings",
    num_dims=2,
)

fob.compute_similarity(
    dataset,
    brain_key="molmo_sim",
    embeddings="molmo_embeddings",
)

### Lens 3: Language (Transcript Embeddings)

Both visual embeddings above represent what you *see* in the videos. Now let's embed what you *hear* — the ASR transcript text.

Instructional videos vary enormously in how much they narrate. Some presenters talk through every step; others work in silence. The transcript embedding space captures this variation in a way neither visual model can.

For this we can make use of [`jina-embeddings-v5-text-small-clustering`](https://huggingface.co/jinaai/jina-embeddings-v5-text-small-clustering). 

This model isn't available as a remote-source zoo model, but we can still use it. First, pull the transcripts out of the dataset with the `values` method of the [Dataset](https://docs.voxel51.com/user_guide/basics.html#datasets), which returns all the values of a [Field](https://docs.voxel51.com/user_guide/basics.html#fields) as a Python list.

With the values in a list we can use the model natively and then add the results back as a Field to the Dataset using [the `set_values`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.set_values) method.

In [None]:
import os
import torch
from sentence_transformers import SentenceTransformer

# Set an environment variable so the tokenizers library doesn't warn about parallelism.
# Note: this is a `transformers`/`tokenizers` setting, not a FiftyOne-specific environment variable.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

text_emb_model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-clustering",
    model_kwargs={"dtype": torch.bfloat16}, #recommended for GPUs
    config_kwargs={"attn_implementation": "flash_attention_2", "device":"cuda"}, #flash attn optional, but recommended
    tokenizer_kwargs={"extra_special_tokens": {}}, # This line fixes the AttributeError: 'list' object has no attribute 'keys'
)

transcripts = dataset.values("transcript")

# Encode texts
text_embeddings = text_emb_model.encode(transcripts)

dataset.set_values("text_embeddings", text_embeddings)

We can compute visualizations of the embeddings just like before:

In [None]:
import fiftyone.brain as fob 

fob.compute_visualization(
    dataset,
    method="umap",
    brain_key="transcript_viz",
    embeddings="text_embeddings",
    num_dims=2,
)

fob.compute_similarity(
    dataset,
    brain_key="text_sim",
    embeddings="text_embeddings",
)

We can also generate some Classifications for the Samples based on the title and description:

In [None]:
text_cls_model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-classification",
    model_kwargs={"dtype": torch.bfloat16}, #recommended for GPUs
    config_kwargs={"attn_implementation": "flash_attention_2", "device":"cuda"}, #flash attn optional, but recommended
    tokenizer_kwargs={"extra_special_tokens": {}}, # This line fixes the AttributeError: 'list' object has no attribute 'keys'
    trust_remote_code=True
)

# grab the titles for the videos
_titles = dataset.values("title")

# grab the descriptions for the videos
_desc = dataset.values("description")

# combine them into one string, treating a missing title or description as empty
title_desc = [f"{t or ''} {d or ''}".strip() for t, d in zip(_titles, _desc)]

# Encode texts
title_embeddings = text_cls_model.encode(title_desc, prompt_name="document")

class_embeddings = text_cls_model.encode(classes, prompt = "The title of a YouTube video about ")

# compute cosine similarity
similarity = text_cls_model.similarity(title_embeddings, class_embeddings)

Now let's parse `similarity`, which is a [num_samples × num_classes] tensor of cosine similarity scores.

Each row is one video, each column is one class, and each value measures how geometrically close the text embedding is to that class label's embedding.

`argmax` gives us the column index (class) with the highest score per row (sample), so we can slice into our classes list, and `max` gives us the score itself, which we store as confidence.

This is zero-shot classification: no fine-tuning, no labeled training data. 
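To see the argmax step in isolation, here is a small sketch that uses plain Python lists in place of the tensor. The class subset and all scores are made up for illustration.

```python
toy_classes = ["Cooking and Food", "Gardening", "Automotive"]

# Hypothetical cosine similarities: one row per video, one column per class.
toy_similarity = [
    [0.62, 0.18, 0.05],
    [0.11, 0.09, 0.47],
    [0.30, 0.33, 0.29],  # near-tie: the winning class barely beats the others
]

toy_predictions = []
for row in toy_similarity:
    best_idx = max(range(len(row)), key=row.__getitem__)  # argmax over classes
    toy_predictions.append((toy_classes[best_idx], row[best_idx]))  # (label, score)

for label, score in toy_predictions:
    print(f"{label}: {score:.2f}")
```

The third row is worth a second look: the top score wins by a narrow margin, so the stored confidence says little about how decisive the prediction was.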

In [None]:
predicted_indices = similarity.argmax(dim=1).tolist()

confidence_scores = similarity.max(dim=1).values.tolist()

predicted_labels = [
    fo.Classification(
        label=classes[idx],
        confidence=conf,
    )
    for idx, conf in zip(predicted_indices, confidence_scores)
]

dataset.set_values("jina_predicted_class", predicted_labels)

Now, we don't have ground-truth classifications for these videos. 

So, let's assume that the classifications we got when using Qwen3-VL-Embeddings are the ground truth. 
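Because neither set of labels has been verified, this comparison measures agreement between two models rather than accuracy. A minimal sketch with made-up labels shows the idea: count how often the two label lists match and tally the pairs that disagree. This is a plain Python illustration, not what FiftyOne computes internally.

```python
from collections import Counter

# Hypothetical labels for five videos from two models
toy_reference = ["Cooking and Food", "Gardening", "Automotive", "Gardening", "Cooking and Food"]
toy_candidate = ["Cooking and Food", "Gardening", "Gardening", "Gardening", "Arts and Music"]

toy_matches = sum(r == c for r, c in zip(toy_reference, toy_candidate))
toy_agreement = toy_matches / len(toy_reference)

# Count each (reference, candidate) pair that disagrees
toy_disagreements = Counter(
    (r, c) for r, c in zip(toy_reference, toy_candidate) if r != c
)

print(f"agreement: {toy_agreement:.2f}")
for (ref_label, cand_label), count in toy_disagreements.items():
    print(f"{ref_label} -> {cand_label}: {count}")
```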

We can evaluate how well the classifications from the Jina text embedding model match the Qwen3-VL-Embedding labels like so: 

In [None]:
classification_results = dataset.evaluate_classifications(
    "jina_predicted_class",
    gt_field="predicted_class",
    eval_key="simple_cls_eval",
)

I'll show you a more interactive evaluation panel when we return to the app, but for now you can view the results at a high-level like so:

In [None]:
classification_results.print_report()

### Using Embeddings to Compute Uniqueness and Representativeness Values

In FiftyOne, both **uniqueness** and **representativeness** are scalar scores computed from embeddings that describe how a sample relates to the rest of the dataset.

#### Uniqueness

- [Measures **how different (dissimilar)**](https://docs.voxel51.com/brain.html#image-uniqueness) a sample is from its neighbors in embedding space.  

- Implemented by looking at each sample’s nearest neighbors, weighting their distances (e.g. 60%-30%-10%), and normalizing to a score in **\[0, 1\]**. Higher scores mean the sample is more “isolated” or distinct; lower scores mean it has many close neighbors.
 
- The most unique sample in a collection has a uniqueness value of **1**.

- Useful for:
  - Finding **outliers / edge cases / bad actors**  
  - Detecting **near-duplicates** (by looking at *low* uniqueness)  
  - Selecting **diverse samples** for annotation when budget is limited 
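The nearest-neighbor idea described above can be sketched in a few lines. This is an illustration under simple assumptions (2D toy points, three neighbors, the 60%-30%-10% weights, and division by the maximum as the normalization), not FiftyOne Brain's actual implementation, which may differ in its details.

```python
import math

# Hypothetical 2D embeddings: four clustered points and one outlier
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
TOY_WEIGHTS = (0.6, 0.3, 0.1)  # the nearest neighbor counts most


def raw_uniqueness(index: int) -> float:
    """Weighted distance from one point to its three nearest neighbors."""
    distances = sorted(
        math.dist(toy_points[index], other)
        for j, other in enumerate(toy_points)
        if j != index
    )
    return sum(w * d for w, d in zip(TOY_WEIGHTS, distances))


toy_raw_scores = [raw_uniqueness(i) for i in range(len(toy_points))]
# Scale so the most isolated point gets a score of 1
toy_uniqueness = [score / max(toy_raw_scores) for score in toy_raw_scores]

for point, score in zip(toy_points, toy_uniqueness):
    print(point, round(score, 3))
```

The outlier at `(10.0, 10.0)` ends up with a score of 1, while the four clustered points score close to 0 because each has near neighbors.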


In [None]:
import fiftyone.brain as fob

fob.compute_uniqueness(
    dataset,
    uniqueness_field = "qwen_uniqueness",
    embeddings = "qwen_embeddings",
    batch_size=64,
    num_workers=4
)

#### Representativeness

- [Measures **how typical** or **central** ](https://docs.voxel51.com/brain.html#image-representativeness) a sample is relative to the rest of the dataset in embedding space.  

- Also normalized to **\[0, 1\]**, with the most representative samples having value **1**.  

- High representativeness ≈ sample lies in a dense, central region of the data; it’s similar to many other samples.  

- Useful for:
  - **Active learning**: picking representative samples to cover common patterns  
  - **Dataset balancing**: finding under/overrepresented regions  
  - **Efficient annotation**: prioritizing samples that “stand in” for many others 

In short:

- **Uniqueness** → “How unusual is this sample?”  
- **Representativeness** → “How well does this sample represent many others?”

In [None]:
import fiftyone.brain as fob

fob.compute_representativeness(
    dataset,
    representativeness_field = "qwen_rep",
    embeddings = "qwen_embeddings",
    batch_size=64,
    num_workers=4
)

### The Cross-Lens Moment

Try to **find a tight visual cluster in the Qwen UMAP, then look at where those same videos land in the transcript UMAP.**

You may find that videos which cluster tightly by visual content are often spread out in transcript space. Cooking videos that all look similar — same kitchen setup, same hands-on-food visual — may use completely different spoken language: detailed step-by-step narration, casual conversational chat, or total silence.

This is not a failure of any embedding model, but a property of the data: **visual similarity and linguistic similarity are measuring different things.** Your choice of embedding determines what questions you can ask.

- Use Qwen visual embeddings to find *visually similar content* or do text-to-video semantic search

- Use Molmo2 embeddings to find videos with similar *physical actions and spatial structure*

- Use transcript embeddings to find videos that *talk about similar topics*
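A contrived sketch makes the point: the same query video can have a different nearest neighbor depending on the space you search in. The video names and 2D coordinates below are invented for illustration; real embeddings have far more dimensions.

```python
import math

# Hypothetical 2D coordinates for the same four videos in two embedding spaces
toy_visual_space = {
    "pasta_narrated": (0.0, 0.0),
    "pasta_silent": (0.1, 0.0),
    "bike_repair_narrated": (5.0, 5.0),
    "bike_repair_silent": (5.1, 5.0),
}
toy_transcript_space = {
    "pasta_narrated": (0.0, 0.0),
    "pasta_silent": (9.0, 9.0),
    "bike_repair_narrated": (0.5, 0.0),
    "bike_repair_silent": (9.1, 9.0),
}


def nearest_neighbor(space, query):
    """Return the other video closest to `query` by Euclidean distance."""
    return min(
        (name for name in space if name != query),
        key=lambda name: math.dist(space[query], space[name]),
    )


toy_query = "pasta_narrated"
toy_visual_nn = nearest_neighbor(toy_visual_space, toy_query)
toy_transcript_nn = nearest_neighbor(toy_transcript_space, toy_query)

print("visual neighbor:", toy_visual_nn)
print("transcript neighbor:", toy_transcript_nn)
```

In this toy setup the visual space pairs the two pasta videos, while the transcript space pairs the two narrated videos.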

In [None]:
# Pick a video and find its nearest neighbors in each embedding space — 
# a concrete way to see how the three spaces differ
from fiftyone import ViewField as F

sample = dataset.first()

root_view = dataset.filter_labels("gpt_summary_brief", F("tier") == "root")

print(f"Reference video: {sample.title}")
ref_root = root_view[sample.id].gpt_summary_brief.detections[0].label
print(f"Root summary: {ref_root}\n")

for brain_key, label in [
    ("qwen_sim", "Based on QwenVL Embeddings"), 
    ("molmo_sim", "Based on Molmo Embeddings"), 
    ("text_sim", "Based on Text Embeddings")
    ]:
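    # k=4 because the closest match is the query video itself
    neighbors = root_view.sort_by_similarity(sample.id, brain_key=brain_key, k=4)
    print(f"Nearest neighbors ({label}):")
    # skip the query video; limit() keeps similarity order, whereas take() samples randomly
    for n in neighbors.skip(1).limit(3):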
        print(f"  → {n.gpt_summary_brief.detections[0].label}")
    print()

**What to look at:** 

- In the Embeddings panel, hover over clusters. Click points to see which videos land near each other. Try coloring by `tree_depth` or `title` to see if the visual clusters map onto any metadata patterns.


- Compare the Molmo2 UMAP with the Qwen UMAP. Are the clusters in the same positions? Similar shape, different boundaries — this is two models with different visual priors producing different embedding spaces. Neither is "correct" — they're measuring different aspects of the same content.

---
## Section 3: The Second Opinion

We've built three independent maps of the dataset. Now let's do what the original pipeline did — but with a completely different model.

The Action100M annotations were produced by GPT-OSS-120B processing outputs from PerceptionLM-3B and Llama-3.2-Vision-11B. 

We're going to run **Qwen3-VL-8B** on the same videos, completely independently, and ask:

- How would *you* describe this video?

- What events do *you* see, and when?

We're not trying to beat the pipeline. We're trying to get a second opinion. Where the two agree, we gain confidence in the original annotation. Where they diverge, we've found something worth looking at.

In [None]:
import fiftyone.zoo as foz 

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/qwen3vl_video",
    overwrite=True
)

qwen_video_model = foz.load_zoo_model(
    "Qwen/Qwen3-VL-8B-Instruct",
)

#### Generate detailed video description


In [None]:
# Generate a full-video description for every sample.
# This produces a string field: qwen_desc_summary
qwen_video_model.operation = "description"

dataset.apply_model(
    qwen_video_model,
    label_field="qwen_desc",
    skip_failures=True,
)

print(f"Descriptions generated: {dataset.exists('qwen_desc_summary').count()}")

#### Comprehensive analysis

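The `comprehensive` operation analyzes each video for several aspects at once: description, events, objects, scene info, and activities.

Output fields, shown here with the prefix `analysis`. The prefix follows the `label_field` you pass, as with `qwen_desc` and `qwen_desc_summary` earlier, so the cell below should produce `qwen_comp_*` fields: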

- `analysis_summary` - Video description (string)
- `analysis_events` - Temporal events (fo.TemporalDetections)
- `analysis_objects` - Object appearances (fo.TemporalDetections)
- `analysis_scene_info_*` - Scene classifications
- `analysis_activities_*` - Activity classifications
- `sample.frames[N].objects` - Frame-level object detections
- `sample.frames[N].text_content` - Frame-level OCR

In [None]:
qwen_video_model.operation = "comprehensive"

dataset.apply_model(
    qwen_video_model,
    label_field="qwen_comp",
    skip_failures=True,
)

print(f"Temporal events generated: {dataset.exists('qwen_comp_events').count()}")

You'll notice the descriptions often describe the same video in quite different language — different vocabulary, different sentence structure, different level of detail. Sometimes one catches a detail the other misses. 

The question is: **how do we move from "these look different" to a number we can sort and filter by?**

#### Using a prompt from the paper

The Action100M paper describes some of the prompts used in its pipeline. We can prompt Qwen3-VL the same way and see what we end up with:

In [None]:
PROMPT = """Identify the main actor and the physical action performed in the current segment. Provide both a brief
description that represents the overall action step, and a detailed description that contains sufficient
procedural detail. Use "N/A" (without further explaination) if there are no visible actors or physical
actions (e.g., static).

# Response Formats
## output
{
    "type": "object",
    "properties": {
        "summary": {
            "type": "object",
            "properties": {
                "brief": {
                    "type": "string",
                    "description": "Single sentence video caption."
                },
                "detailed": {
                    "type": "string",
                    "description": "Detailed, comprehensive description."
                }
            }
        },
        "action": {
            "type": "object",
            "properties": {
                "brief": {
                    "type": "string",
                    "description": "A single verb phrase (no -ing forms) brifly summarizing the overall action content."
                },
                "detailed": {
                    "type": "string",
                    "description": "A single imperitive sentence describing how the action is performed with more details."
                },
                "actor": {
                    "type": "string",
                    "description": "Single sentece or an imformative noun phrase describing who is performing the action."
                }
            }
        }
    },
    "required": ["summary", "action"]
}"""

In [None]:
# Run the custom prompt on every sample.
# The raw model output is stored in the string field qwen_custom_result


qwen_video_model.operation = "custom"

qwen_video_model.prompt = PROMPT

dataset.apply_model(
    qwen_video_model,
    label_field="qwen_custom",
    skip_failures=True,
)

We can also do the same with the Molmo2 model, let's use the 8B parameter model:

In [None]:
molmo_8b_model = foz.load_zoo_model("allenai/Molmo2-8B")

molmo_8b_model.operation = "describe"

molmo_8b_model.prompt = PROMPT

dataset.apply_model(
    molmo_8b_model,
    label_field="molmo_custom",
    skip_failures=True,
)

Each model returned its output as a raw JSON string: Qwen3-VL's is stored in `qwen_custom_result` and Molmo2's in `molmo_custom_description`.

We parse the JSON once and promote each nested value to its own top-level string field. 

This makes every piece of structured output a first-class citizen in the dataset: filterable in the App, embeddable with a text model, and comparable against the existing GPT-generated fields (`gpt_summary_brief`, `gpt_action_actor`, etc.) that ship with the Action100M dataset.
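Before parsing the real outputs, it can help to see the flattening on a single toy payload. The raw string below is hypothetical; its nesting simply mirrors the key path that the parsing code in the next cell expects, and actual model outputs may be shaped differently or fail to parse at all.

```python
import json

# Hypothetical raw model output, nested the way the parsing code expects
toy_raw = """
{
  "properties": {
    "summary": {"properties": {
      "brief": "A person roasts almonds.",
      "detailed": "A person spreads almonds on a tray and bakes them."
    }},
    "action": {"properties": {
      "brief": "roast almonds",
      "detailed": "Spread the almonds on a tray and bake them.",
      "actor": "a home cook"
    }}
  }
}
"""

toy_parsed = json.loads(toy_raw)

# Promote each nested value to a flat "<group>_<key>" name
toy_flat = {
    f"{group}_{key}": value
    for group, body in toy_parsed["properties"].items()
    for key, value in body["properties"].items()
}

for field_name, value in toy_flat.items():
    print(f"{field_name}: {value}")
```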

In [None]:
import json

def extract(raw):
    try:
        parsed = json.loads(raw)
        props = parsed["properties"]
        s = props["summary"]["properties"]
        a = props["action"]["properties"]
        return s.get("brief"), s.get("detailed"), a.get("brief"), a.get("detailed"), a.get("actor")
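    # Missing value, malformed JSON, or an unexpected structure: treat as no output
    except (TypeError, ValueError, KeyError, AttributeError):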
        return None, None, None, None, None

SOURCES = [
    ("qwen_custom_result",       "qwen3vl"),
    ("molmo_custom_description", "molmo"),
]

for src_field, prefix in SOURCES:
    raws = dataset.values(src_field)
    results = [extract(r) for r in raws]

    dataset.set_values(f"{prefix}_summary_brief",    [r[0] for r in results])
    dataset.set_values(f"{prefix}_summary_detailed", [r[1] for r in results])
    dataset.set_values(f"{prefix}_action_brief",     [r[2] for r in results])
    dataset.set_values(f"{prefix}_action_detailed",  [r[3] for r in results])
    dataset.set_values(f"{prefix}_action_actor",     [r[4] for r in results])

Each `gpt_summary_brief`, `gpt_summary_detailed`, etc. is a [`TemporalDetections` field](https://docs.voxel51.com/user_guide/using_datasets.html#temporal-detection) containing a list of detections across multiple tiers. 

We want to "hoist" just the root-tier detection's label string up to a flat sample-level field for easy filtering and comparison. To accomplish this we can [use a combination of `ViewField` and `Filtering`](https://docs.voxel51.com/user_guide/using_views.html#querying-samples).

This is necessary for the next step in the workshop, where we will compare the Qwen3-VL outputs with the "ground truth".

In [None]:
from fiftyone import ViewField as F

# Fields to extract root labels from, and their new target field names
GPT_FIELDS = {
    "gpt_summary_brief":    "gpt_summary_root_brief",
    "gpt_summary_detailed": "gpt_summary_root_detailed",
    "gpt_action_brief":     "gpt_action_root_brief",
    "gpt_action_detailed":  "gpt_action_root_detailed",
    "gpt_action_actor":     "gpt_action_root_actor",
}

for src_field, dst_field in GPT_FIELDS.items():
    # Create a view that filters each sample's detections to root-tier only.
    # This does NOT mutate the dataset — it's a virtual filter.
    root_view = dataset.filter_labels(src_field, F("tier") == "root")

    # values() on a TemporalDetections field returns a list-of-lists:
    # one inner list per sample, containing the labels of surviving detections.
    # Since there is exactly one root per sample, each inner list has one element.
    nested = root_view.values(f"{src_field}.detections.label")

    # Flatten: take the first (only) label, or None if a sample had no root.
    flat = [labels[0] if labels else None for labels in nested]

    dataset.set_values(dst_field, flat)

---
## Section 4: Measuring Agreement

Eyeballing agreement doesn't scale. We need metrics.


##### **Normalized Levenshtein Similarity** 

This metric measures how similar two text strings are on a scale from 0.0 to 1.0.

**How it works:**

1. **Levenshtein Distance**: Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

2. **Normalization**: Converts the distance to a similarity score:
   
$$
\text{similarity} = 1.0 - \frac{\text{Levenshtein distance}}{\max(\text{len}(s_1), \text{len}(s_2))}
$$


3. **Score**: 
   - **1.0** = perfect match
   - **0.0** = completely different
   - **0.0-1.0** = partial similarity

**Example:**

- **"hello"** vs **"hello"** → 1.0 (no edits)
- **"hello"** vs **"helo"** → 0.8 (1 deletion / 5 chars)
- **"hello"** vs **"world"** → 0.2 (4 substitutions / 5 chars)

**Use Cases:**

Evaluating OCR accuracy, transcription quality, text generation models, or any scenario requiring quantification of text similarity.
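The formula above fits in a few lines of plain Python. This is a sketch for building intuition, not the plugin's implementation; the `case_sensitive` flag and the handling of two empty strings are assumptions made here.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # Only the previous row of the dynamic-programming table is needed.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(s1: str, s2: str, case_sensitive: bool = False) -> float:
    """1 - distance / length of the longer string."""
    if not case_sensitive:
        s1, s2 = s1.lower(), s2.lower()
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(s1, s2) / longest


for a, b in [("hello", "hello"), ("hello", "helo"), ("hello", "world")]:
    print(a, b, round(normalized_similarity(a, b), 2))
```

Running it on the three example pairs reproduces the scores listed above.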


In [None]:
import fiftyone.operators as foo

sim_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_normalized_similarity")

# (pred_field, gt_field, output_field)
# lex_q_ = lexical similarity, Qwen vs GPT root
# lex_m_ = lexical similarity, Molmo vs GPT root
COMPARISON_PAIRS = [
    ("qwen3vl_summary_brief",    "gpt_summary_root_brief",    "lex_q_sum_brief"),
    ("qwen3vl_summary_detailed", "gpt_summary_root_detailed", "lex_q_sum_detailed"),
    ("qwen3vl_action_brief",     "gpt_action_root_brief",     "lex_q_act_brief"),
    ("qwen3vl_action_detailed",  "gpt_action_root_detailed",  "lex_q_act_detailed"),
    ("qwen3vl_action_actor",     "gpt_action_root_actor",     "lex_q_actor"),
    ("molmo_summary_brief",      "gpt_summary_root_brief",    "lex_m_sum_brief"),
    ("molmo_summary_detailed",   "gpt_summary_root_detailed", "lex_m_sum_detailed"),
    ("molmo_action_brief",       "gpt_action_root_brief",     "lex_m_act_brief"),
    ("molmo_action_detailed",    "gpt_action_root_detailed",  "lex_m_act_detailed"),
    ("molmo_action_actor",       "gpt_action_root_actor",     "lex_m_actor"),
]

for pred_field, gt_field, output_field in COMPARISON_PAIRS:
    sim_op(
        dataset,
        pred_field=pred_field,
        gt_field=gt_field,
        output_field=output_field,
        case_sensitive=False,
        delegate=False,
    )

This metric computes **Semantic Similarity** using neural embeddings, measuring whether two texts have the same *meaning* rather than the same *characters*.


1. **Sentence Encoding**: Uses a pre-trained neural network (from the `sentence-transformers` library) to convert each text into a high-dimensional vector (embedding) that captures its meaning.

2. **Cosine Similarity**: Computes the cosine of the angle between the two embedding vectors, clamped at zero so that negative values are reported as 0.0:
   
   $$
   \text{similarity} = \max(0, \cos(\text{embedding}_{\text{gt}}, \text{embedding}_{\text{pred}}))
   $$

3. **Score**: 
   - **1.0** = semantically identical
   - **0.0** = completely unrelated
   - **0.0-1.0** = partial semantic overlap
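
The clamped cosine in step 2 can be sketched in plain Python. The vectors below are toy two-dimensional stand-ins for real sentence embeddings, the function name `clamped_cosine` is hypothetical, and returning 0.0 for a zero-length vector is an assumption of this sketch, not documented plugin behavior.

```python
import math


def clamped_cosine(u, v):
    """max(0, cos(u, v)) for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector carries no direction
    return max(0.0, dot / (norm_u * norm_v))


print(clamped_cosine([1.0, 0.0], [1.0, 0.0]))   # same direction
print(clamped_cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(clamped_cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite, clamped to 0.0
```

The clamp keeps the score in the same 0.0 to 1.0 range as the lexical metric, which makes the two directly comparable per sample.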

**Example** (approximate scores; exact values depend on the embedding model):

- **"The car is fast"** vs **"The automobile is quick"** → ~0.95 (same meaning, different words)
- **"hello"** vs **"helo"** → ~0.85 (similar meaning despite typo)
- **"cat"** vs **"dog"** → ~0.6 (related concepts)

**Comparison to Normalized Levenshtein Similarity:**

**Levenshtein** measures character-level edits using string matching. It's very fast but penalizes any character difference, even if the meaning is preserved. Best for detecting typos, OCR errors, or when exact wording matters.

**Semantic** measures meaning using neural embeddings. It's slower (requires model inference) but rewards conceptual equivalence regardless of wording. For example, "car" vs "automobile" scores 0.0 with Levenshtein but ~0.9 with Semantic.


In [None]:
import fiftyone.operators as foo

sem_op = foo.get_operator(
    "@harpreetsahota/text-evaluation-metrics/compute_semantic_similarity"
)

# (pred_field, gt_field, output_field)
# sem_q_ = semantic similarity, Qwen vs GPT root
# sem_m_ = semantic similarity, Molmo vs GPT root
COMPARISON_PAIRS = [
    ("qwen3vl_summary_brief",    "gpt_summary_root_brief",    "sem_q_sum_brief"),
    ("qwen3vl_summary_detailed", "gpt_summary_root_detailed", "sem_q_sum_detailed"),
    ("qwen3vl_action_brief",     "gpt_action_root_brief",     "sem_q_act_brief"),
    ("qwen3vl_action_detailed",  "gpt_action_root_detailed",  "sem_q_act_detailed"),
    ("qwen3vl_action_actor",     "gpt_action_root_actor",     "sem_q_actor"),
    ("molmo_summary_brief",      "gpt_summary_root_brief",    "sem_m_sum_brief"),
    ("molmo_summary_detailed",   "gpt_summary_root_detailed", "sem_m_sum_detailed"),
    ("molmo_action_brief",       "gpt_action_root_brief",     "sem_m_act_brief"),
    ("molmo_action_detailed",    "gpt_action_root_detailed",  "sem_m_act_detailed"),
    ("molmo_action_actor",       "gpt_action_root_actor",     "sem_m_actor"),
]

for pred_field, gt_field, output_field in COMPARISON_PAIRS:
    sem_op(
        dataset,
        pred_field=pred_field,
        gt_field=gt_field,
        model_name="all-mpnet-base-v2",
        output_field=output_field,
        delegate=False,
    )

**What to look at:** Sort the samples by a lexical score such as `lex_q_sum_detailed` and browse the bottom of the view (lowest scores). Are these actually bad annotations, or just paraphrases that look different? Lexical metrics can't tell the difference on their own. That is why we also computed the `sem_*` fields: a sample with a low lexical score but a high semantic score is most likely a paraphrase, not a disagreement.
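
One way to separate paraphrases from genuine disagreements is to rank samples by the gap between their semantic and lexical scores. The sketch below uses plain dictionaries with made-up scores in place of the dataset; the helper name `rank_by_gap` and the sample values are hypothetical. In the notebook the two numbers would come from fields such as `sem_q_sum_detailed` and `lex_q_sum_detailed`.

```python
# Hypothetical per-sample scores standing in for two dataset fields.
samples = [
    {"id": "a", "lex": 0.72, "sem": 0.91},  # words and meaning both match
    {"id": "b", "lex": 0.18, "sem": 0.88},  # different words, same meaning
    {"id": "c", "lex": 0.15, "sem": 0.22},  # low on both metrics
]


def rank_by_gap(rows):
    """Sort rows so the largest (semantic - lexical) gap comes first."""
    return sorted(rows, key=lambda r: r["sem"] - r["lex"], reverse=True)


ranked = rank_by_gap(samples)
for row in ranked:
    print(row["id"], round(row["sem"] - row["lex"], 2))
```

Sample `b` rises to the top: it is the likely paraphrase. Sample `c` has a small gap because both scores are low, which makes it a candidate for real disagreement.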


---
## Section 5

Now that we have both lexical and semantic agreement scores for every sample, we can slice the dataset into meaningful subsets: not just "good" and "bad", but agreement, disagreement, and a third **paraphrase** class that reveals something important about the limits of lexical evaluation alone.

We'll demonstrate this using `summary_detailed` as our primary signal: it's the richest description field, so agreement or disagreement here is the most informative. The same pattern applies directly to any of the other four field pairs (`summary_brief`, `action_brief`, `action_detailed`, `action_actor`) — try them on your own and see if the distributions differ.


In [None]:
from fiftyone import ViewField as F

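# Thresholds are judgment calls; tune them against your own score distributions.
# Semantic scores between LOW_SEM and HIGH_SEM match none of the semantic
# conditions below, so those samples are left out of the semantic buckets.
HIGH_LEX = 0.6
HIGH_SEM = 0.6
LOW_LEX  = 0.3
LOW_SEM  = 0.5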

lex_q = F("lex_q_sum_detailed")
sem_q = F("sem_q_sum_detailed")
lex_m = F("lex_m_sum_detailed")
sem_m = F("sem_m_sum_detailed")

TAGS = {
    "qwen_agrees":        (sem_q >= HIGH_SEM) & (lex_q >= HIGH_LEX),
    "molmo_agrees":       (sem_m >= HIGH_SEM) & (lex_m >= HIGH_LEX),
    "triple_agreement":   (sem_q >= HIGH_SEM) & (sem_m >= HIGH_SEM),
    "triple_disagreement":(sem_q < LOW_SEM)   & (sem_m < LOW_SEM),
    "qwen_only":          (sem_q >= HIGH_SEM) & (sem_m < LOW_SEM),
    "molmo_only":         (sem_m >= HIGH_SEM) & (sem_q < LOW_SEM),
    "paraphrase":         (lex_q < LOW_LEX)   & (sem_q >= HIGH_SEM) &
                          (lex_m < LOW_LEX)   & (sem_m >= HIGH_SEM),
}

# Tags are not mutually exclusive: one sample can match several expressions.
for tag, expr in TAGS.items():
    view = dataset.match(expr)
    view.tag_samples(tag)
    print(f"{tag:<25} → {len(view)} samples tagged")
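
To see how the seven rules interact, here is a plain-Python mirror of the same conditions applied to a single set of scores. It is only a sketch for reasoning about the thresholds: the function name `assign_tags` and the example scores are hypothetical, and the actual tagging is done by the `dataset.match(...)` expressions above.

```python
HIGH_LEX, HIGH_SEM = 0.6, 0.6
LOW_LEX, LOW_SEM = 0.3, 0.5


def assign_tags(lex_q, sem_q, lex_m, sem_m):
    """Return every tag whose condition holds for one sample's four scores."""
    rules = {
        "qwen_agrees": sem_q >= HIGH_SEM and lex_q >= HIGH_LEX,
        "molmo_agrees": sem_m >= HIGH_SEM and lex_m >= HIGH_LEX,
        "triple_agreement": sem_q >= HIGH_SEM and sem_m >= HIGH_SEM,
        "triple_disagreement": sem_q < LOW_SEM and sem_m < LOW_SEM,
        "qwen_only": sem_q >= HIGH_SEM and sem_m < LOW_SEM,
        "molmo_only": sem_m >= HIGH_SEM and sem_q < LOW_SEM,
        "paraphrase": (lex_q < LOW_LEX and sem_q >= HIGH_SEM
                       and lex_m < LOW_LEX and sem_m >= HIGH_SEM),
    }
    return [tag for tag, hit in rules.items() if hit]


print(assign_tags(0.2, 0.8, 0.1, 0.9))    # low lexical, high semantic for both
print(assign_tags(0.7, 0.8, 0.7, 0.9))    # high on every score
print(assign_tags(0.2, 0.55, 0.2, 0.55))  # semantic scores in the middle band
```

The first case lands in both `triple_agreement` and `paraphrase`, so the buckets overlap. The last case receives no tags at all, because its semantic scores sit between `LOW_SEM` and `HIGH_SEM`. Keep both effects in mind when reading the tag counts.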

---

## What We Built

We started with 1,144 videos and a set of annotations we had to take on faith. Every label was written by a model. No human ever watched these clips.

That's increasingly the norm. The question isn't whether to use AI-generated annotations, but how far you can trust them, and for which samples.

Here's what we built to answer that:

- **Three independent views of the same data.** Qwen3-VL-Embedding, Molmo2, and transcript embeddings each carve the dataset differently: visual semantics, spatial grounding, and spoken language. Where they agree on which videos are similar, that structure is real. Where they diverge, you've learned something about the data that no single model would have told you.

- **An independent second opinion on the annotations.** Qwen3-VL and Molmo2 watched the same videos cold, with no knowledge of what GPT-OSS-120B wrote. Where all three models converge, confidence is high. Where they diverge, and especially where one model agrees with GPT and the other doesn't, you've found the hard cases worth a human look.

- **A per-sample confidence map.** Lexical similarity tells you whether the words match. Semantic similarity tells you whether the *meaning* matches. The gap between them (samples with low lexical but high semantic scores) is the **paraphrase class**: descriptions that look like disagreements but aren't. Without both metrics, you'd be reviewing the wrong samples.

**Tags that survive.** Every insight is written back to the dataset as a sample tag: `qwen_agrees`, `molmo_agrees`, `triple_agreement`, `triple_disagreement`, `qwen_only`, `molmo_only`, and `paraphrase`. Open the App, filter by tag, and the confidence map is right there.

---

### The Workflow, Generalized

The dataset doesn't matter. This workflow does — and it transfers directly to any video dataset with AI-generated labels:

1. **Explore before trusting** — understand the annotation structure, know what the pipeline actually produced

2. **Embed from multiple angles** — visual content, spatial grounding, language; each lens surfaces different structure

3. **Get a second opinion** — run an independent model, compare outputs field by field

4. **Quantify agreement** — a per-sample score turns qualitative inspection into a sortable, filterable, taggable signal

The uncomfortable truth about AI-generated datasets is that they look authoritative. The labels are structured, consistent, and comprehensive. But consistency isn't correctness. The only way to know which annotations to trust is to ask another model and measure how much they agree.

That's what you just did.