# 2026 Visual AI Hackathon Enablement Kit: Manufacturing & Workplace Safety

## 1. Objective
This project builds a demo application and tutorial that acts as the primary **"Enablement Asset"** for the **CVPR 2026 Worker Safety Challenge**. It functions as a semantic dataset curator and visualizer.

The asset demonstrates an end-to-end example workflow between **TwelveLabs** and **FiftyOne**, providing a tool for participants to build a high-quality, small-data training set from raw footage without manual framing.

> **Strategic Goal**: Demonstrate that "Small Data" does not mean "Manual Data." We aim to show how modern semantic search can replace hours of manual video scrubbing.

## 2. Challenge Context
*   **Event**: Visual AI Hackathon 2026.
*   **Track**: Challenge Track (Worker Safety).

### The "Enablement" Gap
Participants will see an end-to-end workflow utilizing:
1.  **Marengo 3.0 Vector Embedding Generation**: For multimodal understanding.
2.  **Pegasus Cluster Metadata & Identification**: For zero-shot auto-labeling.
3.  **Voxel51 UI Data Curation and Visualizer**: For interactive exploration.

By working through this general-purpose semantic data curator, participants get hands-on exposure to the APIs and SDKs of both platforms and avoid spending 40+ hours on manual video scrubbing.

## 3. Setup and Dependencies
The following cell installs the necessary Python packages: `fiftyone`, `twelvelabs`, `umap-learn`, `torch`, and `torchvision`.

In [None]:
!pip install fiftyone
!pip install twelvelabs
!pip install umap-learn
!pip install torch torchvision

## Configuration

Let's go ahead and set up our TwelveLabs API key. You'll need to create an account on the TwelveLabs platform, which you can do here: https://playground.twelvelabs.io/

You can then [create an API key here](https://playground.twelvelabs.io/dashboard/api-keys), making sure that you save the API key somewhere safe as you won't be able to view it again.

Once you've done that, run the following cell and enter your API key.

In [None]:
from getpass import getpass

TL_API_KEY = getpass("Enter TwelveLabs API Key: ")
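
If you would rather not paste the key on every run, one option is to read it from an environment variable and fall back to the prompt. This is a minimal sketch: the variable name `TL_API_KEY` and the helper `resolve_api_key` are illustrative assumptions, not something the TwelveLabs SDK requires.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "TL_API_KEY", prompt=getpass) -> str:
    """Return the key from the environment if set, otherwise prompt for it."""
    # env_var is a hypothetical name chosen for this sketch
    key = os.environ.get(env_var, "").strip()
    if not key:
        key = prompt("Enter TwelveLabs API Key: ").strip()
    if not key:
        raise ValueError("No TwelveLabs API key provided")
    return key
```

With this helper in place, the assignment above could be written as `TL_API_KEY = resolve_api_key()`.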

## Download Dataset and Parse into FiftyOne 

We'll start by downloading an example dataset. Note that this dataset consumes ~9.3GB of disk space.

The quickest way to download the dataset is by pulling it from the [Voxel51 Hugging Face org](https://huggingface.co/Voxel51). You can do that as follows:

In [None]:
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/Safe_and_Unsafe_Behaviours",
    name="safe_unsafe_behaviours",
    persistent=True,
    overwrite=True
    )

What is a FiftyOne Dataset? 

It's composed of multiple Sample objects which contain Field attributes, all of which can be dynamically created, modified and deleted. FiftyOne uses a lightweight non-relational database to store datasets, so you can easily scale to datasets of any size without worrying about RAM constraints on your machine.


In the event that you encounter some rate limiting from Hugging Face, it's advised that you download and parse the dataset manually. You can download the dataset [here](https://data.mendeley.com/datasets/xjmtb22pff/1), or by running the following command in your terminal: `wget -O dataset.zip "https://data.mendeley.com/public-api/zip/xjmtb22pff/download/1"`

Once the data is downloaded, [importing it to FiftyOne](https://docs.voxel51.com/user_guide/import_datasets.html) is quite straightforward.

You'll need to unzip and extract the dataset and then follow along with the instructions below.

This dataset is in video classification directory tree format, so we can [use the appropriate loader](https://docs.voxel51.com/user_guide/import_datasets.html#video-classification-dir-tree) as follows:

```python
import fiftyone as fo

base_dir = "Video Dataset for Safe and Unsafe Behaviours/Safe and Unsafe Behaviours Dataset"

# Create test dataset
test_dataset = fo.Dataset.from_dir(
    dataset_dir=f"{base_dir}/test",
    dataset_type=fo.types.VideoClassificationDirectoryTree,
    tags=["test"]
)

# Create train dataset
train_dataset = fo.Dataset.from_dir(
    dataset_dir=f"{base_dir}/train",
    dataset_type=fo.types.VideoClassificationDirectoryTree,
    tags=["train"]
)
```

We can then combine the datasets using the [`add_collection` method](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.add_collection) of the [FiftyOne Dataset](https://docs.voxel51.com/user_guide/using_datasets.html):

```python
train_dataset.add_collection(test_dataset)
train_dataset.name = "safe_unsafe_behaviours"
train_dataset.persistent = True
```

Then, we can map the labels (which come from the subdirectory each [Sample](https://docs.voxel51.com/api/fiftyone.core.sample.html#fiftyone.core.sample.Sample) lives in) to a more human readable format using the [`map_labels` method](https://docs.voxel51.com/api/fiftyone.core.view.html#fiftyone.core.view.DatasetView.map_labels) of the Dataset.

```python
label_mapping = {
    '0_safe_walkway_violation': 'Safe Walkway Violation',
    '1_unauthorized_intervention': 'Unauthorized Intervention',
    '2_opened_panel cover': 'Opened Panel Cover',
    '3_carrying_overload_with_forklift': 'Carrying Overload with Forklift',
    '4_safe_walkway': 'Safe Walkway',
    '5_authorized_intervention': 'Authorized Intervention',
    '6_closed_panel_cover': 'Closed Panel Cover',
    '7_safe_carrying': 'Safe Carrying'
}


view = train_dataset.map_labels("ground_truth", label_mapping)
view.save()
```

Finally, you will want to use the [`compute_metadata` method](https://docs.voxel51.com/user_guide/using_datasets.html#metadata) of the Dataset. When you run `compute_metadata()` on a video dataset, FiftyOne populates each sample’s `metadata` field with a `VideoMetadata` object containing at least the following fields:

- `size_bytes` – file size in bytes  
- `mime_type` – MIME type (e.g. `video/mp4`)  
- `frame_width` – width of the video frames in pixels  
- `frame_height` – height of the video frames in pixels  
- `frame_rate` – frames per second  
- `total_frame_count` – total number of frames  
- `duration` – duration in seconds  
- `encoding_str` – codec/encoding string (e.g. `avc1`)
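
The duration filter used later in this notebook reads `metadata.duration`, so it is worth confirming that the field is populated before indexing. The following is a minimal sketch, assuming `dataset` is a FiftyOne dataset that has already been loaded; the helper name `count_videos_with_duration` is hypothetical.

```python
def count_videos_with_duration(dataset) -> int:
    """Compute missing metadata, then count samples that have a duration."""
    # With default arguments, only samples without metadata are processed
    dataset.compute_metadata()
    durations = dataset.values("metadata.duration")
    return sum(d is not None for d in durations)


# Usage, once a dataset is loaded:
# print(count_videos_with_duration(dataset), "videos have a duration")
```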

[View the docs here](https://docs.voxel51.com/user_guide/basics.html) to learn more about the basics of FiftyOne Datasets, and the docs here for the [specifics of video datasets](https://docs.voxel51.com/user_guide/using_datasets.html#video-datasets).

Once the Dataset has been parsed to FiftyOne, if you need to access it again in another session you can simply call:

```python
import fiftyone as fo

dataset = fo.load_dataset("safe_unsafe_behaviours")
```

If you need to [delete the Dataset](https://docs.voxel51.com/user_guide/using_datasets.html#deleting-a-dataset), say because you reran the code here and FiftyOne tells you the Dataset name already exists, open a terminal and run: `fiftyone datasets delete safe_unsafe_behaviours`. Note that this won't delete the files from your local disk, just the reference to them in the FiftyOne database.

## Initialize TwelveLabs Index

Now we'll connect to TwelveLabs and create an **index** to store our video data. Think of an index as a container that holds all your indexed videos along with their generated embeddings and metadata.

### What is an Index?

A [TwelveLabs Index](https://docs.twelvelabs.io/docs/concepts/indexes) is a searchable collection of videos. When you upload a video to an index, TwelveLabs processes it through the configured models to generate:

- **Vector embeddings**: Dense numerical representations that capture the semantic meaning of video content
- **Temporal segments**: The video is divided into meaningful chunks for fine-grained search
- **Multimodal understanding**: Both visual and audio tracks are analyzed

This enables powerful capabilities like semantic search ("find moments where workers are not wearing helmets") and video-to-text generation.

### Models We're Using

We'll configure our index with two complementary models:

| Model | Type | Purpose |
|-------|------|---------|
| [**Marengo 3.0**](https://docs.twelvelabs.io/v1.3/docs/concepts/models/marengo) | Embedding | Generates rich multimodal embeddings from video. Processes both `visual` and `audio` modalities to create 512-dimensional vectors that capture semantic content. These embeddings power similarity search and clustering. |
| [**Pegasus 1.2**](https://docs.twelvelabs.io/v1.3/docs/concepts/models/pegasus) | Generative | Video-to-text model that can analyze video content and generate natural language descriptions. We'll use this later for zero-shot labeling of our clusters. |

### Under the Hood

When we call [`indexes.create()`](https://docs.twelvelabs.io/v1.3/docs/concepts/indexes#create-an-index), TwelveLabs provisions cloud infrastructure to:
1. Accept video uploads via the Tasks API
2. Run the configured models on each video
3. Store the resulting embeddings in a vector database
4. Enable fast similarity search across all indexed videos

In [None]:
from twelvelabs import TwelveLabs
from twelvelabs.indexes import IndexesCreateRequestModelsItem

TL_INDEX_NAME = "fiftyone-twelvelabs-index"

# Create or retrieve TwelveLabs index
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)

def get_twelvelabs_index(index_name: str) -> str:
    """
    Returns the ID of the TwelveLabs index with the given name.
    If the index does not exist, it creates a new index with the given name.
    """
    indexes = twelvelabs_client.indexes.list()
    for index in indexes:
        # Compare against the argument rather than the global constant
        if index.index_name == index_name:
            print("Found index with name {} with ID {}".format(index_name, index.id))
            return index.id
    index = twelvelabs_client.indexes.create(
        index_name=index_name,
        models=[
            IndexesCreateRequestModelsItem(
                model_name="marengo3.0", model_options=["visual", "audio"]
            ),
            IndexesCreateRequestModelsItem(
                model_name="pegasus1.2", model_options=["visual", "audio"]
            ),
        ]
    )
    print("Created index with name {} with ID {}".format(index_name, index.id))
    return index.id

index_id = get_twelvelabs_index(TL_INDEX_NAME)

## Video Ingestion and Indexing
This step uploads videos to TwelveLabs using FiftyOne views to filter the dataset:

1.  **Filtering**: Selects videos from the target split with duration ≥ 4 seconds that haven't been indexed yet.

2.  **Sampling**: Takes up to `VIDEOS_PER_LABEL` videos per label.

3.  **Indexing**: Uploads each video to TwelveLabs and stores the `tl_video_id` on the sample.

The cell is idempotent—rerunning it will skip already-indexed videos.

First, let's define a helper function:

In [None]:
# Only run this cell if videos have not been indexed already.
import json
from fiftyone import ViewField as F

def index_video_to_twelvelabs(sample):
    """
    Upload a FiftyOne video sample to TwelveLabs for indexing.
    
    This function reads the video file from disk, uploads it to TwelveLabs,
    and waits for the indexing task to complete. The video is indexed using
    the Marengo 3.0 model which generates visual and audio embeddings.
    
    Args:
        sample: A FiftyOne Sample object with 'filepath' and 'ground_truth.label' fields.
        
    Returns:
        str: The TwelveLabs video_id assigned to the indexed video.
        
    Raises:
        Exception: If the indexing task fails (status != "ready").
    """
    # Read video bytes from the sample's filepath
    with open(sample.filepath, "rb") as f:
        # Create an indexing task in TwelveLabs
        # user_metadata stores FiftyOne sample info for later cross-referencing
        task = twelvelabs_client.tasks.create(
            index_id=index_id,
            video_file=f.read(),
            user_metadata=json.dumps({
                "filepath": sample.filepath,
                "sample_id": sample.id,
                "label": sample.ground_truth.label
            })
        )
    
    # Block until TwelveLabs finishes processing the video
    task = twelvelabs_client.tasks.wait_for_done(task_id=task.id)
    
    # Verify the task completed successfully
    if task.status != "ready":
        raise Exception(f"Task failed: {task.status}")
    
    # Retrieve and return the assigned video_id
    return twelvelabs_client.tasks.retrieve(task_id=task.id).video_id


Next, we'll use FiftyOne's powerful [view operations](https://docs.voxel51.com/user_guide/using_views.html) to filter and iterate through our dataset. 

### The Filtering Pipeline

We build a view with three filters chained together:

1. **`match_tags(DATASET_SPLIT)`** - Select only samples tagged with our target split (e.g., "train")

2. **`match(F("metadata.duration") >= MIN_DURATION)`** - Keep videos at least 4 seconds long (TwelveLabs requirement)

3. **`match(~F("tl_video_id").exists())`** - Exclude videos we've already indexed (idempotency)

The `F()` syntax is FiftyOne's [ViewField](https://docs.voxel51.com/api/fiftyone.core.expressions.html) expression language, which lets you filter on any field in your samples.

### Stratified Sampling

We then loop through each unique label and use `.take(VIDEOS_PER_LABEL)` to limit how many videos we index per class. This ensures balanced representation across categories while keeping API costs manageable during development.

You can learn more about Views in [this cheat sheet](https://docs.voxel51.com/cheat_sheets/views_cheat_sheet.html) and about Filtering in [this cheat sheet](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html).
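
To make the per-label cap concrete, here is the same idea in plain Python, independent of FiftyOne. It is a sketch only: `sample_per_label` and the toy `videos` list are hypothetical, and the notebook itself relies on `.take()` for this.

```python
import random
from collections import defaultdict


def sample_per_label(items, per_label, seed=0):
    """Pick up to `per_label` items at random from each label group.

    `items` is a list of (item_id, label) pairs.
    """
    groups = defaultdict(list)
    for item_id, label in items:
        groups[label].append(item_id)

    rng = random.Random(seed)
    # Groups smaller than the cap are returned whole
    return {
        label: rng.sample(ids, min(per_label, len(ids)))
        for label, ids in groups.items()
    }


videos = [
    ("a", "Safe Walkway"), ("b", "Safe Walkway"), ("c", "Safe Walkway"),
    ("d", "Safe Walkway"), ("e", "Safe Carrying"),
]
picked = sample_per_label(videos, per_label=3)
print({label: len(ids) for label, ids in picked.items()})
```

Classes with fewer videos than the cap contribute everything they have, so the result is balanced only up to what each class can supply.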

In [None]:
MIN_DURATION = 4.0
VIDEOS_PER_LABEL = 3
DATASET_SPLIT = "train"

# Filter: correct split, sufficient duration, not yet indexed
base_view = (
    dataset
    .match_tags(DATASET_SPLIT)
    .match(F("metadata.duration") >= MIN_DURATION)
    .match(~F("tl_video_id").exists())
)

# Index up to VIDEOS_PER_LABEL samples per label
for label in base_view.distinct("ground_truth.label"):
    label_view = base_view.match(F("ground_truth.label") == label).take(int(VIDEOS_PER_LABEL))
    print(f"\n{label}: {len(label_view)} to index")
    
    for sample in label_view.iter_samples(autosave=True, progress=True):
        try:
            sample["tl_video_id"] = index_video_to_twelvelabs(sample)
            print(f"  ✓ {sample.filename}")
        except Exception as e:
            print(f"  ✗ {sample.filename}: {e}")

print(f"\nTotal indexed: {len(dataset.exists('tl_video_id'))}")

## Fetch Embeddings and Populate Dataset

Now that our videos are indexed in TwelveLabs, we can [retrieve the **visual embeddings**](https://docs.twelvelabs.io/v1.3/api-reference/create-embeddings-v1/video-embeddings/retrieve-video-embeddings) generated by Marengo 3.0. These are 512-dimensional vectors that encode the semantic content of each video.

### What are Video Embeddings?

When TwelveLabs indexes a video, Marengo 3.0 processes the visual and audio streams to produce dense vector representations. These embeddings capture high-level semantic information like:

- Actions and activities occurring in the video
- Objects and people present
- Scene context and environment
- Temporal patterns and motion

Videos with similar content will have embeddings that are close together in vector space, enabling similarity search and clustering.
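
One common way to quantify "close together" is cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real embeddings, and the choice of cosine similarity is an assumption made for illustration; the KMeans step later in this notebook works with Euclidean distance instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values)
forklift_a = [0.9, 0.1, 0.0]
forklift_b = [0.8, 0.2, 0.1]
walkway = [0.0, 0.2, 0.9]

print(cosine_similarity(forklift_a, forklift_b))  # close to 1
print(cosine_similarity(forklift_a, walkway))     # close to 0
```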

### Efficient Retrieval Pattern

We use FiftyOne's optimized iteration pattern:

- **`select_fields("tl_video_id")`** - Only loads the field we need, avoiding unnecessary memory usage
- [**`iter_samples(autosave=True)`**](https://docs.voxel51.com/api/fiftyone.core.view.html#fiftyone.core.view.DatasetView) - Batches database writes efficiently instead of saving after each sample

The embeddings are stored on each sample as `tl_embedding` for persistence, and also collected in a list for the clustering step that follows.

In [None]:
def fetch_embedding(video_id: str) -> list:
    """Fetch visual embedding for a TwelveLabs video."""
    video_info = twelvelabs_client.indexes.videos.retrieve(
        index_id=index_id,
        video_id=video_id,
        embedding_option=["visual"]
    )
    return video_info.embedding.video_embedding.segments[0].float_

# Work with indexed samples (those with tl_video_id)
indexed_view = dataset.exists("tl_video_id")
print(f"Fetching embeddings for {len(indexed_view)} indexed videos...")

# Efficient iteration: select_fields avoids loading unnecessary data,
# autosave=True lets FiftyOne batch the writes efficiently
embeddings = []
for sample in indexed_view.select_fields("tl_video_id").iter_samples(
    progress=True, autosave=True
):
    embedding = fetch_embedding(sample.tl_video_id)
    sample["tl_embedding"] = embedding
    embeddings.append(embedding)  # Keep in memory for clustering later

print(f"\nStored {len(embeddings)} embeddings on dataset")

You can confirm the length of the embedding as follows:

In [None]:
len(indexed_view.first()['tl_embedding'])
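
Note that `fetch_embedding` keeps only the first segment returned for a video. If a response contains several segments and you want a single vector that reflects all of them, one option is to average them element-wise. This is a minimal sketch in plain Python: `mean_pool` is a hypothetical helper, and whether averaging suits your data is an assumption to validate.

```python
def mean_pool(segment_vectors):
    """Average equal-length vectors element-wise into one vector."""
    if not segment_vectors:
        raise ValueError("need at least one segment vector")
    dim = len(segment_vectors[0])
    if any(len(v) != dim for v in segment_vectors):
        raise ValueError("all segment vectors must have the same length")
    count = len(segment_vectors)
    return [sum(v[i] for v in segment_vectors) / count for i in range(dim)]


print(mean_pool([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```

Inside `fetch_embedding`, the input would be the list of `float_` vectors taken from every entry in `segments` rather than just `segments[0]`.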

## Semantic Clustering and Auto-Labeling

To achieve "Small Data" curation without manual effort:

1.  **KMeans Clustering**: We cluster the video embeddings into 8 distinct groups based on semantic similarity.

2.  **Pegasus Generation**: For each cluster, we use the **TwelveLabs Pegasus 1.2** model to generate a [descriptive label](https://docs.twelvelabs.io/v1.3/docs/get-started/quickstart/analyze-videos).

3.  **Annotation**: These labels are applied to all samples in the cluster.



In [None]:
from sklearn.cluster import KMeans

CLUSTER_LABEL_PROMPT = """
Analyze this workplace safety video and classify it as exactly ONE of the following labels.

UNSAFE BEHAVIORS (violations):
- Safe Walkway Violation: Worker walking outside the designated safe walkway boundaries, entering restricted or hazardous areas
- Unauthorized Intervention: Worker intervening on equipment/machinery WITHOUT wearing required safety gear (intervention vest)
- Opened Panel Cover: Electrical/machinery panel cover left open after intervention
- Carrying Overload with Forklift: Forklift carrying 3 or more blocks

SAFE BEHAVIORS (compliance):
- Safe Walkway: Worker staying within the designated safe walkway boundaries
- Authorized Intervention: Worker intervening on equipment/machinery while wearing proper safety gear (intervention vest)
- Closed Panel Cover: Electrical/machinery panel cover properly closed after intervention
- Safe Carrying: Forklift carrying 2 or fewer blocks

Return ONLY the exact label name, nothing else.
"""

def generate_label(video_id: str) -> str:
    """
    Generate a semantic label for a video using TwelveLabs Pegasus 1.2.
    
    This function uses TwelveLabs' video-to-text analysis to generate a 
    descriptive label for a video. The label is intended to represent the 
    video's content in the context of workplace safety (violations or 
    good practices).
    
    We use this to auto-label clusters: one representative video from each 
    cluster is analyzed, and the generated label is applied to all videos 
    in that cluster.
    
    Args:
        video_id: The TwelveLabs video ID to analyze.
        
    Returns:
        The label text generated by Pegasus, with surrounding whitespace
        removed. The prompt asks for one of the eight label names exactly
        (e.g., "Safe Walkway Violation").
        
    Note:
        Uses temperature=0.2 for more deterministic/consistent outputs.
        See: https://docs.twelvelabs.io/v1.3/docs/get-started/quickstart/analyze-videos
    """
    result = twelvelabs_client.analyze(
        video_id=video_id,
        prompt=CLUSTER_LABEL_PROMPT,
        temperature=0.2,
    )
    # Strip stray whitespace so labels compare cleanly with ground truth
    return result.data.strip()

# Cluster embeddings using KMeans
num_clusters = 8

kmeans = KMeans(n_clusters=num_clusters, random_state=0)

cluster_labels = kmeans.fit_predict(embeddings)

### Generate Cluster Labels

Now we generate a human-readable label for each cluster using Pegasus 1.2. Rather than labeling every video individually (expensive and slow), we:

1. Pick one video from each cluster to act as its representative (the code below simply uses the first one it encounters)
2. Ask Pegasus to analyze it and generate a descriptive label
3. Apply that label to all videos in the cluster

This is the "zero-shot auto-labeling" that makes this workflow scalable—8 API calls instead of 24.

### Extracting Field Values Efficiently

Since we only need the `tl_video_id` field for this step, we use FiftyOne's [`values()`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.values) method to extract just that field as a Python list:

`video_ids = indexed_view.values("tl_video_id")`


In [None]:
cluster_label_map = {}
cluster_strings = []

video_ids = indexed_view.values("tl_video_id")

for video_id, cluster_idx in zip(video_ids, cluster_labels):
    if cluster_idx not in cluster_label_map:
        print(f"Generating label for cluster {cluster_idx}...")
        cluster_label_map[cluster_idx] = generate_label(video_id)
        print(f"  -> {cluster_label_map[cluster_idx]}")
    cluster_strings.append(cluster_label_map[cluster_idx])
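
The loop above labels each cluster from the first video it meets. A possible refinement is to pick the video whose embedding lies closest to the cluster centroid. The sketch below shows that selection in plain Python with toy data; `nearest_to_centroid` is a hypothetical helper, not part of scikit-learn or FiftyOne.

```python
def nearest_to_centroid(vectors, cluster_ids, centroids):
    """Return {cluster_id: index of the vector closest to that centroid}."""
    best = {}  # cluster_id -> (squared distance, vector index)
    for idx, (vec, cid) in enumerate(zip(vectors, cluster_ids)):
        dist = sum((x - c) ** 2 for x, c in zip(vec, centroids[cid]))
        if cid not in best or dist < best[cid][0]:
            best[cid] = (dist, idx)
    return {cid: idx for cid, (dist, idx) in best.items()}


# Toy 2-dimensional data standing in for embeddings (hypothetical values)
vectors = [[0.0, 0.0], [1.0, 1.0], [0.4, 0.6], [9.0, 9.0], [10.0, 10.0]]
cluster_ids = [0, 0, 0, 1, 1]
centroids = [[0.5, 0.5], [9.8, 9.8]]
print(nearest_to_centroid(vectors, cluster_ids, centroids))  # {0: 2, 1: 4}
```

In this notebook the inputs would be `embeddings`, `cluster_labels`, and `kmeans.cluster_centers_`, and the returned indices would select entries of `video_ids`.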

### Batch Update with `set_values()`

Finally, we use [`set_values()`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.set_values) to assign the cluster labels to all samples in a single operation.

### Storing as Classification Labels

Rather than storing plain strings, we convert our cluster labels to [`Classification`](https://docs.voxel51.com/api/fiftyone.core.labels.html#fiftyone.core.labels.Classification) objects. 

In [None]:
import fiftyone as fo

classifications = [fo.Classification(label=s) for s in cluster_strings]

indexed_view.set_values("pred_cluster", classifications)

We can inspect the results as follows:

In [None]:
indexed_view.first().pred_cluster.label

In [None]:
indexed_view.first().ground_truth.label
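
Because Pegasus returns free text, a generated label can differ from the ground truth names in case, punctuation, or spacing, and the evaluation would then count it as a miss. One way to guard against this is to normalize the output against the allowed label set before storing it. This is a sketch under assumptions: `normalize_label` and the `"Unknown"` fallback are hypothetical choices, while the label names come from the prompt above.

```python
ALLOWED_LABELS = [
    "Safe Walkway Violation",
    "Unauthorized Intervention",
    "Opened Panel Cover",
    "Carrying Overload with Forklift",
    "Safe Walkway",
    "Authorized Intervention",
    "Closed Panel Cover",
    "Safe Carrying",
]


def normalize_label(raw, allowed=ALLOWED_LABELS, fallback="Unknown"):
    """Map free-text model output onto one of the allowed labels."""
    # Treat underscores as spaces, collapse whitespace, drop stray quotes/periods
    cleaned = " ".join(raw.replace("_", " ").split()).strip(" .\"'").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return fallback


print(normalize_label("  safe_walkway violation.\n"))  # Safe Walkway Violation
print(normalize_label("Something else"))               # Unknown
```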

Having the predictions stored as `Classification` labels allows you to use FiftyOne's [built-in methods for evaluation](https://docs.voxel51.com/user_guide/evaluation.html#evaluating-models).

For example, we can [evaluate the results of the classification](https://docs.voxel51.com/user_guide/evaluation.html#classifications) as follows:

In [None]:
# Evaluate the predictions in the `pred_cluster` field with respect to the
# labels in the `ground_truth` field
results = indexed_view.evaluate_classifications(
    "pred_cluster",
    gt_field="ground_truth",
    eval_key="eval_simple",
)

# Print a classification report
results.print_report()
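
To see roughly what such a report summarizes, the sketch below computes overall accuracy and the most frequent mix-ups from two plain label lists. It is an illustration rather than FiftyOne's implementation; `accuracy_and_confusions` and the toy lists are hypothetical.

```python
from collections import Counter


def accuracy_and_confusions(y_true, y_pred):
    """Return overall accuracy and a Counter of (true, predicted) mismatches."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return correct / len(y_true), confusions


y_true = ["Safe Carrying", "Safe Carrying", "Safe Walkway", "Opened Panel Cover"]
y_pred = ["Safe Carrying", "Safe Walkway", "Safe Walkway", "Opened Panel Cover"]
acc, confusions = accuracy_and_confusions(y_true, y_pred)
print(acc)                        # 0.75
print(confusions.most_common(1))  # [(('Safe Carrying', 'Safe Walkway'), 1)]
```

For the real data, the two lists could come from `indexed_view.values("ground_truth.label")` and `indexed_view.values("pred_cluster.label")`.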

## Visualization
We use **FiftyOne Brain** to compute a 2D visualization (UMAP) of the embeddings.
Finally, we launch the **FiftyOne App**, allowing you to explore the clusters, view the auto-generated labels, and analyze the dataset interactively.

In [None]:
# Create 2D UMAP visualization of embeddings
import fiftyone as fo
import fiftyone.brain as fob

results = fob.compute_visualization(
    indexed_view,
    embeddings="tl_embedding",
    num_dims=2,
    brain_key="tl_embeddings_viz",
    method="umap",
    verbose=True,
    seed=51,
)

# Launch the FiftyOne App to explore the dataset
session = fo.launch_app(dataset, auto=False, port=5151)
session.show()

## Export to PyTorch Dataset

Now we convert our FiftyOne dataset into a PyTorch `Dataset` for training a classifier on the embeddings.

### Why `to_torch()`?

FiftyOne provides a [`to_torch()`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.to_torch) method that wraps any view in a PyTorch Dataset interface. This is cleaner than manually extracting lists because:

- **Lazy loading** - Samples are transformed on-access, not all loaded into memory upfront

- **Reusable** - The same `GetItem` works with any view (train/val splits, filtered subsets)

- **Decoupled** - Transformation logic lives in one place, separate from data extraction

### The `GetItem` Pattern

We define a [`GetItem`](https://docs.voxel51.com/api/fiftyone.utils.torch.html#fiftyone.utils.torch.GetItem) subclass that tells FiftyOne:

1. **What fields to extract** (`required_keys`)

2. **How to transform them** (`__call__`)




In [79]:
import torch
from torch.utils.data import DataLoader
from fiftyone.utils.torch import GetItem


# Define label mapping (string label → integer index)
LABEL_TO_IDX = {
    "Safe Walkway Violation": 0,
    "Unauthorized Intervention": 1,
    "Opened Panel Cover": 2,
    "Carrying Overload with Forklift": 3,
    "Safe Walkway": 4,
    "Authorized Intervention": 5,
    "Closed Panel Cover": 6,
    "Safe Carrying": 7,
}


class WorkerSafetyGetItem(GetItem):
    """
    Extracts embeddings and labels from FiftyOne samples for classifier training.
    
    Transforms each sample into:
    - embedding: 512-dim tensor from TwelveLabs Marengo
    - label_idx: integer class index
    - label_str: human-readable label string
    """
    
    def __init__(self, label_to_idx, field_mapping=None):
        self.label_to_idx = label_to_idx
        super().__init__(field_mapping=field_mapping)
    
    @property
    def required_keys(self):
        # Fields we need from each sample
        return ["tl_embedding", "ground_truth"]
    
    def __call__(self, d):
        embedding = d.get("tl_embedding")
        ground_truth = d.get("ground_truth")
        label_str = ground_truth.label
        
        return {
            "embedding": torch.tensor(embedding, dtype=torch.float32),
            "label_idx": torch.tensor(self.label_to_idx.get(label_str, -1), dtype=torch.long),
            "label_str": label_str,
        }

Then `indexed_view.to_torch(getter)` gives us a PyTorch Dataset where `torch_dataset[i]` calls our `GetItem` to load and transform sample `i`.

In [None]:
# Create the PyTorch dataset from FiftyOne
getter = WorkerSafetyGetItem(LABEL_TO_IDX)
torch_dataset = indexed_view.to_torch(getter)

print(f"Created PyTorch dataset with {len(torch_dataset)} samples")

# Verify a sample
sample = torch_dataset[0]
print(f"Embedding shape: {sample['embedding'].shape}")
print(f"Label: {sample['label_str']} (idx: {sample['label_idx']})")

Created PyTorch dataset with 24 samples
Embedding shape: torch.Size([512])
Label: Safe Walkway Violation (idx: 0)


### Custom Collation for DataLoader

When PyTorch's `DataLoader` batches samples together, it needs to know how to combine them. The default collate function can handle a dict like ours, but writing the collation out makes it explicit how each field is combined. Our samples contain a mix of:
- **Tensors** (`embedding`, `label_idx`) - need to be stacked
- **Strings** (`label_str`) - need to stay as a list

A custom [`collate_fn`](https://pytorch.org/docs/stable/data.html#dataloader-collate-fn) tells the DataLoader exactly how to handle each field. This gives us batches ready for training—embeddings as a `[batch_size, 512]` tensor and labels as a `[batch_size]` tensor.


In [80]:
# Create DataLoader for training
def collate_fn(batch):
    return {
        "embedding": torch.stack([b["embedding"] for b in batch]),
        "label_idx": torch.stack([b["label_idx"] for b in batch]),
        "label_str": [b["label_str"] for b in batch],
    }

train_loader = DataLoader(
    torch_dataset, 
    batch_size=4, 
    shuffle=True, 
    collate_fn=collate_fn
)

# Verify a batch
batch = next(iter(train_loader))
print(f"Batch embeddings: {batch['embedding'].shape}")
print(f"Batch labels: {batch['label_idx']}")

Batch embeddings: torch.Size([4, 512])
Batch labels: tensor([3, 7, 2, 2])


You now have a standard PyTorch `DataLoader` that yields batches ready for your training loop.
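
As a rough idea of what such a training loop could look like, the sketch below fits a linear classifier on 512-dimensional vectors. Everything in it is an assumption made for illustration: the data is random and stands in for the real embeddings, and the model, optimizer, learning rate, and epoch count are arbitrary choices rather than recommendations.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the real data: 24 samples, 512 dims, 8 classes
features = torch.randn(24, 512)
labels = torch.arange(24) % 8
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

model = nn.Linear(512, 8)  # one weight matrix plus bias
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(10):
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.size(0)
    # Mean loss per sample over the epoch
    epoch_losses.append(total / len(features))

print(f"loss: {epoch_losses[0]:.3f} -> {epoch_losses[-1]:.3f}")
```

With `train_loader` from this notebook, each batch is a dict, so the inner loop would read `batch["embedding"]` and `batch["label_idx"]` instead of unpacking `x, y`.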

## Summary

This notebook demonstrated an end-to-end workflow from raw video to trainable dataset:

| Step | Tool | What Happens |
|------|------|--------------|
| **1. Ingest** | FiftyOne | Load videos with metadata and ground truth labels |
| **2. Index** | TwelveLabs | Upload videos, generate Marengo 3.0 embeddings |
| **3. Retrieve** | TwelveLabs | Fetch 512-dim embeddings back to FiftyOne samples |
| **4. Cluster** | scikit-learn | Group similar videos using KMeans on embeddings |
| **5. Auto-label** | TwelveLabs Pegasus | Generate semantic labels for each cluster |
| **6. Export** | FiftyOne `to_torch()` | Convert to PyTorch Dataset for training |

The key insight: **"Small Data" doesn't mean "Manual Data."** By combining TwelveLabs' semantic understanding with FiftyOne's data management, you can build high-quality training sets from raw footage without hours of manual video scrubbing.