# WebUOT-1M dataset

WebUOT-1M is the largest million-scale benchmark for underwater object tracking (UOT), designed to address limitations in existing datasets by providing diverse underwater scenarios, rich annotations, and language prompts. 

It comprises 1.1 million frames across 1,500 underwater videos, covering 408 target categories grouped into 12 superclasses (e.g., fish, molluscs, inanimate objects). The dataset includes high-quality bounding box annotations, 23 tracking attributes (e.g., illumination variation, camouflage), and language descriptions for multimodal tracking research.

Note: This dataset, which has been parsed into FiftyOne format, comprises 238 randomly selected videos from the WebUOT-1M test set for a total of 192,000+ frames.

In [None]:
!pip install fiftyone umap-learn timm hiera-transformer einops

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "Voxel51/WebUOT-238-Test",
    name="webuot238",
    overwrite=True,
    )

Once the dataset has been downloaded, you can begin exploring it by launching the FiftyOne app.

There are two ways to use the app:

1. As a cell in your notebook, which you can do by running:

```python
fo.launch_app(dataset)
```

2. In a separate browser window, which you can do by running the following in your terminal:

```bash
fiftyone app launch
```

Once the app is launched, you can explore the dataset by:

• Scrolling through the videos for a visual vibe check of the contents

• Filtering on the labels (the various attributes associated with each video)

• Filtering on the objects (the various ground truth labels)

• Creating a dashboard of plots for the various information fields of the dataset


In [1]:
fo.launch_app(dataset)

![Explore WebUOT](assets/explore-webuot.gif)


## Exploring deeper

We can gain a deeper understanding of this dataset by computing and visualizing embeddings for the videos.

I've built a plugin which allows us to use the [Hiera embedding model](https://github.com/facebookresearch/hiera). FiftyOne's plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you’re interested in learning more about plugins, you might be interested in attending one of our monthly workshops. You can [see the full schedule here](https://voxel51.com/computer-vision-events/) and look for the Advanced Computer Vision Data Curation and Model Evaluation workshop.

The [Hiera Embedding model](https://arxiv.org/abs/2306.00989) from Facebook is a hierarchical vision transformer for efficient image and video understanding tasks. It combines speed with high accuracy by simplifying traditional transformer architectures while maintaining performance through masked autoencoder (MAE) pretraining. This video embedding model was pretrained on the Kinetics-400 (K400) dataset. The masked autoencoder objective forces the model to learn robust spatiotemporal patterns by reconstructing randomly masked video patches. This video-specific pretraining enables temporal understanding capabilities, while still maintaining the core hierarchical architecture developed through image training.

While not guaranteed, Hiera's embeddings frequently retain semantic value even for out-of-distribution (OOD) data (like what we're working with) due to its sparse token hierarchy and MAE's reconstruction-driven learning.

The main point is that we can compute video embeddings [relatively easily with the plugin](https://github.com/harpreetsahota204/hiera-video-embeddings-plugin). Let's start by downloading the plugin and installing its necessary dependencies.



In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/hiera-video-embeddings-plugin

In [None]:
!fiftyone plugins requirements @harpreetsahota/hiera_video_embeddings --install

We'll need to set an environment variable:

In [29]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

I’ll assume that you’re running this in a Jupyter notebook, in which case you can run the model on the dataset from Python. Start by loading the plugin's operator:

In [30]:
import fiftyone.operators as foo

hiera_embeddings = foo.get_operator("@harpreetsahota/hiera_video_embeddings/compute_hiera_video_embeddings")

Alternatively, you can use the app and fill out the operator form. I'll refer you to [the GitHub repo for the plugin](https://github.com/harpreetsahota204/hiera-video-embeddings-plugin) for more details.

This plugin supports all currently released versions and checkpoints of the [Hiera Video Models collection](https://github.com/facebookresearch/hiera):

- `hiera_base_16x224`
- `hiera_base_plus_16x224`
- `hiera_large_16x224`
- `hiera_huge_16x224`

It also supports two types of embeddings:

- **Terminal Embedding (`terminal`)**: A 768-dimensional embedding vector derived from the final layer of the model. This represents the global semantic context of the video sequence. Can optionally be normalized.
  
- **Hierarchical Embedding (`hierarchical`)**: A 1440-dimensional embedding vector that concatenates features across all intermediate outputs (96+192+384+768 = 1440 dimensions). This captures multi-scale representations of the video content. **These embeddings cannot be normalized.**
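
To make the two embedding types concrete, here is a minimal sketch in plain Python. It assumes the per-stage widths quoted above (96, 192, 384, 768), uses placeholder vectors rather than real model outputs, and assumes that "normalized" means L2 (unit-length) scaling, which is a common convention but is not confirmed by the plugin description here. `l2_normalize` is an illustrative helper, not part of the plugin's API.

```python
import math

# Per-stage feature widths quoted above (assumed here for illustration).
STAGE_DIMS = [96, 192, 384, 768]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


# Placeholder stage outputs standing in for real Hiera features.
stage_features = [[0.5] * dim for dim in STAGE_DIMS]

# "terminal": only the final stage's features.
terminal = stage_features[-1]

# "hierarchical": every stage's features concatenated end to end.
hierarchical = [v for stage in stage_features for v in stage]

# Optional normalization, applied here to the terminal embedding only.
terminal_unit = l2_normalize(terminal)

print(len(terminal), len(hierarchical))  # 768 1440
```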

Sadly, the Hiera video embedding model struggles with long-duration videos, so we'll work with only the short ones. I'm not too familiar with many video embedding models, but if you know of one that works well on longer videos and that I should build a plugin for, please let me know. Note: the [V-JEPA model](https://github.com/facebookresearch/jepa) for video embeddings is currently on the roadmap.

Luckily, you can easily filter your dataset as follows:

In [None]:
from fiftyone import ViewField as F

short_videos = dataset.filter_labels(
    "Length", F("label").is_in(["short"])
).clone(name="short_videos")

This leaves us with 147 samples that we will work with going forward:

In [None]:
len(short_videos)

Before running the following cell, you'll need a delegated operations service running to execute the job. You can start one by opening your terminal and running `fiftyone delegated launch`.

In [None]:
await hiera_embeddings(
    short_videos,
    model_name="hiera_base_plus_16x224",
    checkpoint="mae_k400",  # one of mae_k400 OR mae_k400_ft_k400
    embedding_types="terminal",  # or hierarchical
    emb_field="hiera_video_embeddings",
    normalize=True,  # defaults to False, only works with `terminal` embeddings
    delegate=True,
)

In [32]:
short_videos.persistent = True

In [33]:
short_videos.reload()

While we're at it, let's go ahead and compute embeddings for the video captions as well. For this we'll make use of [Jina Embeddings V3](https://huggingface.co/jinaai/jina-embeddings-v3):

In [None]:
import torch

from transformers import AutoModel

jina_embeddings_model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", 
    trust_remote_code=True,
    device_map = "cuda" if torch.cuda.is_available() else "cpu"
    )


We can run the model on our dataset and use the `separation` task, since it's well suited to visualizing clusters.

In [35]:
for sample in short_videos.iter_samples(autosave=True):
    text_embeddings = jina_embeddings_model.encode(
        sentences = [sample["language"]], # model expects a list of strings
        task="separation"
        )
    sample["text_embeddings"] = text_embeddings.squeeze()
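
A quick way to sanity-check text embeddings like these is to compare pairs of them with cosine similarity: captions describing similar scenes should score higher than unrelated ones. The sketch below uses short made-up vectors in place of real Jina embeddings, and `cosine_similarity` is a hypothetical helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for caption embeddings.
caption_a = [0.9, 0.1, 0.0, 0.2]
caption_b = [0.8, 0.2, 0.1, 0.1]  # points in a similar direction to caption_a
caption_c = [0.0, 0.1, 0.9, 0.0]  # nearly orthogonal to caption_a

sim_ab = cosine_similarity(caption_a, caption_b)
sim_ac = cosine_similarity(caption_a, caption_c)
print(sim_ab > sim_ac)  # True
```

With real data, `short_videos.values("text_embeddings")` returns the stored vectors as a list, which you can feed into a helper like this one.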

Now we can compute a 2D representation of our high-dimensional embeddings using UMAP.

In [None]:
import fiftyone.brain as fob

embedding_fields = ["hiera_video_embeddings", "text_embeddings"]

for field in embedding_fields:
    # Brain keys become "hiera_video_viz" and "text_viz"
    _fname = field.split("_embeddings")[0]
    results = fob.compute_visualization(
        short_videos,
        embeddings=field,
        method="umap",
        brain_key=f"{_fname}_viz",
        num_dims=2,
    )

And from here we can visualize our embeddings in the app:

![Visualizing Embeddings](assets/webuot-viz-embeddings.gif)

I think an interesting next step is applying SAM2 to this subset of data and seeing how it performs. To do that, start by installing the required dependencies for SAM2:

In [None]:
!pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"

FiftyOne has an [integration with SAM2](https://voxel51.com/blog/sam-2-is-now-available-in-fiftyone/), and we can make use of that through the [FiftyOne Model Zoo](https://docs.voxel51.com/model_zoo/index.html#fiftyone-model-zoo). The model zoo gives you native access to hundreds of pre-trained models.



In [None]:
import torch 
import fiftyone.zoo as foz

sam_model = foz.load_zoo_model(
    "segment-anything-2-hiera-tiny-video-torch",
    device="cuda" if torch.cuda.is_available() else "cpu"
    )

SAM2 (Segment Anything Model 2) offers powerful video segmentation capabilities. 

Its key features include:

1. Precise object segmentation and tracking across video frames
2. Simple prompting methods:
   - Bounding boxes
   - Point selections
3. Efficient workflow:
   - Only requires prompts on the first frame
   - Automatically propagates segmentation masks to subsequent frames

This means we can identify an object in the first frame of a video, and SAM2 will automatically track and segment that object throughout the entire sequence.


Once you've instantiated the model, the next step is to apply it to your dataset. Note that depending on the type of GPU you're running this on, it can take quite a bit of time. For reference, I ran this on an NVIDIA RTX 6000 Ada and it took a little over an hour. 

In [None]:
short_videos.apply_model(
    sam_model,
    label_field="sam_segmentations",
    prompt_field="frames.gt", # Can be a detections or a keypoint field
)

Once the model has been applied to the dataset, we can look at the results in the app for a heuristics-driven visual vibe check of model performance.

From an initial visual inspection it seems like SAM2 does a fairly good job of segmenting the objects of interest. There are some cases where the masks aren't as tight, but given the fact that this is an underwater dataset it's still quite impressive.

![SAM2 Predictions](assets/webuot-sam2-preds.gif)

However, what's more impressive, at least from my initial visual vibe check, is the quality of the bounding boxes generated by SAM2. It seems the boxes are on point with, and at times tighter than, the ground truth boxes!
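
Boxes like these can be derived directly from a segmentation mask by taking the extent of its foreground pixels. Below is a minimal sketch of that idea, not SAM2's or FiftyOne's actual implementation. `mask_to_box` is a hypothetical helper, and its output follows FiftyOne's relative `[x, y, width, height]` box convention.

```python
def mask_to_box(mask):
    """Return the tight [x, y, w, h] box (relative to the mask size) around
    all truthy pixels of a 2D mask, or None if the mask has no foreground."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    x_min, x_max = cols[0], cols[-1]
    y_min, y_max = rows[0], rows[-1]
    return [
        x_min / width,
        y_min / height,
        (x_max - x_min + 1) / width,
        (y_max - y_min + 1) / height,
    ]


# A 4x5 toy mask whose foreground spans 2 rows and 3 columns.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_box(toy_mask))  # [0.2, 0.25, 0.6, 0.5]
```

Because the box hugs the mask, a mask that follows the object's outline closely can yield a tighter box than a hand-drawn annotation.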

Of course, we can perform a more rigorous evaluation using the [`evaluate_detections`](https://docs.voxel51.com/tutorials/evaluate_detections.html#Evaluating-Object-Detections-with-FiftyOne) method of the dataset and get some concrete numbers for model performance. Since the dataset doesn't have ground truth annotations for segmentation masks, we'll instead evaluate the predicted bounding boxes against the ground truth boxes.

In [None]:
short_videos.evaluate_detections(
    pred_field="frames.sam_segmentations",
    gt_field="frames.gt",
    eval_key="sam_eval",
    iou=0.7
)
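
The `iou=0.7` argument sets the overlap a predicted box must reach with a ground truth box to count as a match. As a reminder of what that number measures, here is a small self-contained sketch of IoU for two boxes in FiftyOne's relative `[x, y, width, height]` format. `box_iou` is an illustrative helper with made-up boxes, not the function FiftyOne uses internally.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = [0.20, 0.20, 0.40, 0.40]
prediction = [0.25, 0.20, 0.40, 0.40]  # same size, shifted slightly right

score = box_iou(ground_truth, prediction)
print(round(score, 3), score >= 0.7)  # 0.778 True
```

Even a small horizontal shift drops the score well below 1.0, which is why 0.7 is a fairly strict threshold.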

You can analyze the results of the evaluation right in the app via the Model Evaluation panel:

![SAM2 Model Eval](assets/webuot-model-eval.gif)

### An important consideration

In this demonstration, we're using SAM2 to showcase basic segmentation capabilities on underwater footage, focusing primarily on the spatial accuracy of masks and bounding boxes.  For this simplified use case, we'll evaluate the model using IoU (Intersection over Union) metrics to assess how well SAM2 can identify objects frame by frame. 
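
For a single video, frame-by-frame IoU can be summarized with two numbers: the mean IoU and the fraction of frames whose IoU clears a threshold. The sketch below uses made-up per-frame values. `summarize_ious` is a hypothetical helper, and the 0.7 default mirrors the threshold passed to `evaluate_detections` earlier.

```python
def summarize_ious(frame_ious, threshold=0.7):
    """Return (mean IoU, fraction of frames with IoU >= threshold)."""
    if not frame_ious:
        return 0.0, 0.0
    mean_iou = sum(frame_ious) / len(frame_ious)
    hit_rate = sum(iou >= threshold for iou in frame_ious) / len(frame_ious)
    return mean_iou, hit_rate


# Made-up IoUs for five consecutive frames; the last frame is a total miss.
frame_ious = [0.9, 0.8, 0.75, 0.5, 0.0]

mean_iou, hit_rate = summarize_ious(frame_ious)
print(f"mean IoU: {mean_iou:.2f}, frames at or above 0.7: {hit_rate:.0%}")
```

Neither number says anything about whether the box stayed on the same individual from frame to frame, which is the gap discussed next.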

However, it's important to note that real-world underwater object tracking presents significantly more complex challenges. 

While this SAM2 demonstration shows promising results for basic segmentation and bounding box tracking, a complete tracking solution would need more sophisticated components to handle advanced tracking requirements. To illustrate this concretely, suppose we're concerned with tracking fish (though the same considerations apply to any object tracking task). Then we need to consider:

   - **Identity Preservation**: Maintaining track of a specific fish among similar-looking ones. For example, when tracking a particular clownfish in a group, the system must maintain its unique identity even when other clownfish cross its path.

   - **Distractor Handling**: Not getting confused by other fish of the same species. The system needs to distinguish the target from visually similar fish that may enter the frame, even when they exhibit similar patterns or behaviors.

   - **Temporal Consistency**: Maintaining the same ID across frames. This involves predicting motion patterns and understanding typical fish behaviors to maintain tracking even during quick movements or direction changes.

   - **Re-identification**: Recognizing the same fish after temporary occlusion. When the target fish temporarily disappears behind coral or other fish, the system must be able to recognize and re-establish tracking when it reappears.

   - **Group Behavior Handling**: Managing scenarios where fish (or other marine life) move in schools or groups, which requires more sophistication to maintain individual tracks within collective movement patterns.
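
To see why identity preservation is hard, consider the simplest possible tracker: match each existing track to the current-frame detection it overlaps most. The sketch below is a deliberately naive illustration with made-up boxes. It is not how SAM2 or any production tracker works, and every name in it is hypothetical. When two similar fish cross paths between frames, the overlap-only rule swaps their identities.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def greedy_match(tracks, detections):
    """Map each track ID to the index of the unclaimed detection it overlaps most."""
    assignments = {}
    taken = set()
    for track_id, track_box in tracks.items():
        best_idx, best_iou = None, 0.0
        for idx, det_box in enumerate(detections):
            if idx in taken:
                continue
            score = box_iou(track_box, det_box)
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None:
            assignments[track_id] = best_idx
            taken.add(best_idx)
    return assignments


# Previous frame: fish A on the left, fish B on the right.
tracks = {
    "fish_A": [0.30, 0.40, 0.20, 0.10],
    "fish_B": [0.50, 0.40, 0.20, 0.10],
}

# Current frame: the fish swam past each other, so detection 0 is really
# fish A (now on the right) and detection 1 is really fish B (now on the left).
detections = [
    [0.48, 0.40, 0.20, 0.10],
    [0.32, 0.40, 0.20, 0.10],
]

# fish_A is matched to detection 1 (really fish B): an identity switch.
print(greedy_match(tracks, detections))  # {'fish_A': 1, 'fish_B': 0}
```

Avoiding this kind of switch takes information beyond box overlap, such as appearance features or motion models, which is what the considerations above are pointing at.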


So while SAM2 is great for demonstrating segmentation capabilities, a production underwater tracking system would need additional components to handle the complex identity tracking challenges. Check out [this blog](https://voxel51.com/blog/tracking-datasets-in-fiftyone/) for more detail.

# Conclusion

In this exploration of the WebUOT dataset, we've demonstrated several key capabilities:

1. Dataset visualization and exploration using FiftyOne

2. Computing and visualizing video embeddings using the Hiera model

3. Generating text embeddings using Jina Embeddings V3

4. Applying SAM2 for object segmentation and detection

Our evaluation of SAM2's performance on underwater footage shows promising results for basic segmentation tasks. However, this demonstration only scratches the surface of what's needed for comprehensive underwater object tracking. Real-world applications require sophisticated systems that can handle identity preservation, temporal consistency, occlusion recovery, and group dynamics. 

These challenges are particularly acute in underwater environments where factors like variable visibility, light refraction, and complex marine life behaviors come into play.
