# Multi-Instance Vehicle Damage Detection Workshop

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/odsc_west_workshop/blob/main/workshop_notebook.ipynb)


## Act 1: Explore & Visualize 

### Goal

Understand what makes multi-instance vehicle damage detection challenging through visual exploration and embedding analysis.

> **Here's what makes this dataset realistic and challenging:**
> 
> **First**, each image can contain MULTIPLE damages of DIFFERENT types. A single accident might cause a dent on the door, scratches on the bumper, AND a cracked taillight - all in one photo. This isn't a simple 'one object per image' scenario.
>
> **Second**, cars are photographed from various perspectives - front, side, rear, close-ups, wide shots. A scratch viewed straight-on looks different than the same scratch at an angle.
>
> **Third**, this is instance segmentation - we need to detect and segment EACH individual damage, not just classify the whole image. Much harder!
>
> Before we rush to train a model, let's actually understand what we're working with. This data-centric approach will save us time and improve our results.

Let's start by loading our data:

In [None]:
from fiftyone.utils.huggingface import load_from_hub
dataset = load_from_hub(
    "harpreetsahota/CarDD", 
    persistent=True,
    overwrite=True
    )

Make sure you install the [Dashboard Plugin](https://docs.voxel51.com/plugins/plugins_ecosystem/dashboard.html), which can be installed by running the following in your terminal:

```bash
fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard
```

## Profiling the dataset

### Dataset size

- `len(dataset)` returns the number of samples in the dataset.  
  [Reference](https://docs.voxel51.com/user_guide/using_datasets.html#using-datasets)


In [None]:
print(f"Dataset size: {len(dataset)} samples (images)")

### Total damage instances

- `dataset.count("detections.detections")` counts the total number of detection objects across all samples.  
  [Reference](https://github.com/voxel51/fiftyone/blob/develop/docs/source/tutorials/pandas_comparison.ipynb)


In [None]:
print(f"Total damage instances: {dataset.count('detections.detections')}")

### Damage types

- `dataset.distinct("detections.detections.label")` returns all unique labels for detections.  
  [Reference](https://github.com/voxel51/fiftyone/blob/develop/docs/source/tutorials/pandas_comparison.ipynb)


In [None]:
print(f"Damage types: {dataset.distinct('detections.detections.label')}")

### Average damages per image

- This computes the mean number of detections per image. 


In [None]:
avg_damages = dataset.count('detections.detections') / len(dataset)

print(f"\nAverage damages per image: {avg_damages:.2f}")

Alternatively, you can use:

[Reference](https://docs.voxel51.com/user_guide/using_aggregations.html#aggregating-expressions)

In [None]:
from fiftyone import ViewField as F

avg_damages = dataset.mean(F("detections.detections").length())

### Distribution of number of damages per image

The idiomatic FiftyOne way to count the number of detection labels in a sample is to use a [`ViewField`](https://docs.voxel51.com/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewField) expression to access the list of labels and then use `.length()` to count them. 

To add the number of damages per image as a field on each sample in your dataset, you can use FiftyOne's [`set_values()`](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.set_values). This will efficiently compute and store the count for each sample.

**References:**  
- [Transforming fields with set_field](https://docs.voxel51.com/user_guide/using_views.html#transforming-fields)
- [Example: Counting detections](https://docs.voxel51.com/api/fiftyone.core.dataset.html)  
- [Pandas comparison: Adding new columns](https://docs.voxel51.com/tutorials/pandas_comparison.html#Add-a-new-column/frame-from-existing-columns/fields)


In [None]:
import fiftyone as fo
from fiftyone import ViewField as F

# Add a field "num_damages" to each sample with the count of detection labels

num_damages = dataset.values(F("detections.detections").length())

dataset.set_values("num_damages", num_damages)

dataset.save()

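`set_values()` writes the new values to the database right away. If you instead compute a field with `set_field()`, which returns a view, remember to call `save()` on that view to persist the changes to the dataset itself.  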

[Reference: Add new column/field](https://github.com/voxel51/fiftyone/blob/develop/docs/source/tutorials/pandas_comparison.ipynb)  

[Reference: Transforming fields](https://docs.voxel51.com/user_guide/using_views.html#transforming-fields)

### Images with multiple damages

- This counts images with more than one detection.  
  [Reference](https://docs.voxel51.com/api/fiftyone.core.view.html)


In [None]:
multi_damage = len(dataset.match(F("detections.detections").length() > 1))

print(f"\nImages with multiple damages: {multi_damage} ({multi_damage/len(dataset)*100:.1f}%)")

To find images that contain **both** "scratch" and "crack", pass `all=True` to `contains()`, which requires every listed value to be present in the list of labels.

[Reference](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html#detections)

For example:


In [None]:
# How many images have both scratch AND crack?
has_scratch_and_crack = dataset.match(
    F("detections.detections.label").contains(["scratch", "crack"], all=True)
)

dataset.save_view("has_scratch_and_crack", has_scratch_and_crack)

print(f"Images with BOTH scratch and crack: {len(has_scratch_and_crack)}")

If you want to select images where **all** detections are "scratch" (i.e., the set of labels is a subset of `["scratch"]`), you should use the `is_subset()` method:

In [None]:
# Images with ONLY scratch
only_scratch = dataset.match(
    F("detections.detections.label").is_subset(["scratch"])
)

dataset.save_view("only_scratch", only_scratch)

This will ensure that the only label present in the detections is "scratch" (or the list is empty). The same logic applies for "crack":

In [None]:
only_crack = dataset.match(
    F("detections.detections.label").is_subset(["crack"])
)

dataset.save_view("only_crack", only_crack)

This approach is documented in the [FiftyOne filtering cheat sheet](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html#detections), where `is_subset()` is used to match samples that only contain a specific label and no others.

**Summary:**  
- Use `is_subset(["scratch"])` to match images where all detections are "scratch" and no other labels are present.
- By contrast, `contains()` only checks that the given labels are present; it does not exclude images that also carry other label types.

[Reference: Filtering Cheat Sheet](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html#detections)
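
The difference between the two filters can be sketched with plain Python sets. This only illustrates the matching semantics described above and is not FiftyOne code; the helper names `contains_all` and `is_subset` and the per-image label lists are hypothetical.

```python
def contains_all(labels, required):
    """True when every required label appears at least once (like all=True)."""
    return set(required).issubset(labels)


def is_subset(labels, allowed):
    """True when no label falls outside `allowed`; an empty list also passes."""
    return set(labels).issubset(allowed)


# Hypothetical per-image label lists
images = {
    "a": ["scratch", "crack", "dent"],
    "b": ["scratch", "scratch"],
    "c": ["crack"],
    "d": [],
}

both = [k for k, v in images.items() if contains_all(v, ["scratch", "crack"])]
only_scratch_ids = [k for k, v in images.items() if is_subset(v, ["scratch"])]

print(both)              # ['a']
print(only_scratch_ids)  # ['b', 'd']
```

Image `d` has no detections yet still passes the subset check, which is why a view like `only_scratch` can include images without any labels. Combine it with a length filter if that matters for your analysis.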

#### Add a "complexity_score" field: number of damages + number of unique damage types

The complexity score tries to capture "how hard is this image to process?"

- **More damages = harder** (finding 4 damages is harder than finding 1)

- **More damage types = harder** (distinguishing scratch vs dent vs crack in one image is harder than just finding 3 scratches)

So it adds them: `num_damages + num_unique_types`
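
As a quick worked example, here is the same arithmetic in plain Python. The function name `complexity_score` and the example label lists are hypothetical; the notebook itself computes the two terms with FiftyOne and stores them as fields.

```python
def complexity_score(labels):
    """Number of damage instances plus number of distinct damage types."""
    return len(labels) + len(set(labels))


# Hypothetical label lists for four images
examples = {
    "one scratch": ["scratch"],
    "three scratches": ["scratch", "scratch", "scratch"],
    "scratch + dent + crack": ["scratch", "dent", "crack"],
    "no damage": [],
}

scores = {name: complexity_score(labels) for name, labels in examples.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Note that any image with at least one damage scores 2 or more, and that three damages of the same type (score 4) rank below three damages of different types (score 6).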


##### **Why This Is Useful**

**1. It captures TWO kinds of difficulty:**

**Quantity Difficulty**: More damages = harder to detect all of them

- Finding 1 damage is easier than finding 4 damages

**Diversity Difficulty**: More damage types = harder to classify correctly

- Distinguishing between 3 different damage types in one image is harder than identifying 3 of the same type

**2. It helps you stratify your data for training:**

```python
# Start training on simple cases
simple = dataset.match(F("complexity_score") <= 3)  # at most 1 damage, or 2 damages of the same type

# Then add moderate complexity
moderate = dataset.match((F("complexity_score") > 3) & (F("complexity_score") <= 5))

# Finally tackle hard cases
hard = dataset.match(F("complexity_score") > 5)  # Many damages + high diversity
```

**3. It correlates with model performance:**

You'll want to understand if:
- Low complexity → higher precision/recall
- High complexity → lower precision/recall

This tells you WHERE to focus improvement efforts.

In [None]:
from fiftyone import ViewField as F

labels_per_sample = dataset.values("detections.detections.label")
num_distinct_labels_per_sample = [len(set(labels)) if labels else 0 for labels in labels_per_sample]
dataset.set_values("num_unique_labels", num_distinct_labels_per_sample)

In [None]:
unique_label_counts = dataset.values("num_unique_labels")

# Compute complexity scores for all samples
num_damages_values = dataset.values("num_damages")

complexity_scores = [nd + nul for nd, nul in zip(num_damages_values, unique_label_counts)]

# Set the values
dataset.set_values("complexity_score", complexity_scores)

dataset.save()

### Visual Similarity Through Embeddings

> Embeddings are powerful: they convert images into high-dimensional vectors where visually similar images are close together. We'll use CLIP - a vision-language model trained on 400M image-text pairs - to embed our damage images.
>
> **Important note**: These embeddings capture the ENTIRE image - the car, the perspective, ALL the damages together. An image with 'dent + scratch' gets one embedding that represents that whole scene. This is different from patch embeddings, which we could compute per damage instance.
>
> Let's see what patterns emerge when we map these complex, multi-damage images into embedding space.

- **`foz.load_zoo_model()`** loads a pre-trained model from the FiftyOne Model Zoo. In this case, it loads the CLIP ViT-B/32 model, which is commonly used for generating image embeddings.  
  [Reference](https://github.com/voxel51/fiftyone/blob/develop/docs/source/getting_started/manufacturing/02_embeddings.ipynb)

In [None]:
import fiftyone.zoo as foz

model = foz.load_zoo_model("clip-vit-base32-torch")

- **`dataset.compute_embeddings()`** computes and stores embeddings for each image in the dataset using the specified model. The `embeddings_field` argument specifies the sample field where the resulting embedding vectors will be stored. Each image receives a single embedding vector representing the whole image.  
  [Reference](https://github.com/voxel51/fiftyone/blob/develop/docs/source/getting_started/manufacturing/02_embeddings.ipynb)

In [None]:
# Compute one CLIP embedding per image and store it on each sample
dataset.compute_embeddings(
    model,
    embeddings_field="clip_embeddings",
)


- **`fob.compute_visualization()`** performs dimensionality reduction (here, UMAP) on the stored embeddings to create a 2D representation for visualization. The `embeddings` argument specifies which field to use, `method="umap"` selects the UMAP algorithm, and `brain_key` is used to store and retrieve the visualization results.  
  [Reference](https://github.com/voxel51/fiftyone/blob/develop/docs/source/getting_started/manufacturing/02_embeddings.ipynb)


In [None]:
# Create 2D visualization using UMAP
import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    embeddings="clip_embeddings",
    method="umap",
    brain_key="clip_embeddings_viz",
    num_dims=2
)

#### Semantic Search Across Damage Scenarios 

We can search by complex queries that span multiple damages, perspectives, and contexts.

Semantic search makes this possible - and it's incredibly valuable for building training sets, searching images with natural language, and quality control.

In [None]:
text_img_index = fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="text_img_sim"
)

### Representativeness: Finding "Typical" Damage Scenarios

> Now let's ask: What does a 'typical' car damage image look like? And what are the edge cases?



- **Definition:** Representativeness measures how well a sample typifies or summarizes the main patterns in your dataset. A highly representative sample is similar to many other samples—it sits near the center of a cluster in embedding space.

- **Use case:** Useful for finding prototypical or "easy" examples, or for visualizing the main modes of your data. Representative samples are good for understanding the core structure of your dataset.

- **How it's computed:** FiftyOne computes a scalar-valued `representativeness` field for each sample, normalized to [0, 1], with 1 being the most representative. This is based on clustering: samples close to cluster centers are more representative.

- **Example:** If you want to quickly understand the main types of data you have, or select examples that best summarize your dataset, you would look for high-representativeness samples.

[Learn more about representativeness in the docs](https://docs.voxel51.com/brain.html#image-representativeness).


In [None]:
# Compute representativeness
print("Computing representativeness scores...")

fob.compute_representativeness(
    dataset,
    embeddings="clip_embeddings",
    representativeness_field="representativeness"
)

## Cross-analyzing Complexity and Representativeness

We want to identify samples that face **two challenges simultaneously**:

1. **High complexity** - Many damages to detect (harder task)
2. **Low representativeness** - Rare/unusual scenarios (less training data)

### Why This Matters

When a model struggles on complex images, it could be for two reasons:
- **Inherent difficulty**: More objects = harder detection task
- **Data scarcity**: Fewer training examples = fewer learned patterns

The worst-case scenarios are images that suffer from **both** problems - they're hard to detect AND the model has seen few similar examples during training.

### The Edge Case Score

We compute an **edge case score** for each sample by:

1. **Standardizing both metrics** (z-scores) to put them on the same scale
2. **Combining them**: `edge_case_score = complexity_z - representativeness_z`

**High scores** indicate samples that are:
- More complex than average (high complexity_z)
- Less representative than average (low representativeness, so subtracting it increases the score)
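
To see how the score behaves, here is a minimal sketch using only the standard library. The four complexity and representativeness values are hypothetical, and `statistics.pstdev` is used because it follows the same population-standard-deviation convention as NumPy's default `std()`.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical values for four images
complexity = [2, 2, 4, 8]
representativeness = [0.9, 0.7, 0.5, 0.1]

complexity_z = z_scores(complexity)
representativeness_z = z_scores(representativeness)

# High complexity pushes the score up, high representativeness pulls it down
edge_case = [c - r for c, r in zip(complexity_z, representativeness_z)]

# Indices of the images, from most to least "edge case"
ranked = sorted(range(len(edge_case)), key=lambda i: edge_case[i], reverse=True)
print(ranked)  # [3, 2, 1, 0]
```

The last image is both the most complex and the least representative, so it ranks first. Because both inputs are standardized, the scores sum to zero across the dataset: a positive score means "more of an edge case than average", not an absolute level of difficulty.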

### What We Learn

- **Overall correlation**: Does complexity correlate with representativeness across the dataset? A negative correlation would suggest that complex scenarios are indeed rarer.

- **Per-sample scores**: Which specific images are the "double whammy" cases that deserve special attention for data collection or model improvement?

### Actionable Insights

1. **Prioritize data collection**: Find more examples similar to high edge-case-score samples
2. **Explain performance gaps**: Model struggles aren't just about task difficulty; they're also about data availability
3. **Strategic evaluation**: Separate "hard because complex" from "hard because rare"


In [None]:
import numpy as np

complexity = np.array(dataset.values("complexity_score"))
representativeness = np.array(dataset.values("representativeness"))

# Standardize both metrics (z-scores)
complexity_z = (complexity - complexity.mean()) / complexity.std()
representativeness_z = (representativeness - representativeness.mean()) / representativeness.std()

# Compute "edge case score": high complexity + low representativeness
# Flip representativeness so low values become high scores
edge_case_score = complexity_z - representativeness_z

dataset.set_values("edge_case_score", edge_case_score.tolist())

# Also store the overall correlation
correlation = np.corrcoef(complexity, representativeness)[0,1]
dataset.info["complexity_representativeness_correlation"] = float(correlation)
dataset.save()


### Scenario complexity

**Not all data is created equal.** Some images are easy, some are hard - but for DIFFERENT reasons.

By understanding WHAT makes data hard (unusual conditions vs. lots of stuff), you can make smarter decisions about training order, data collection priorities, and production routing.

It's the difference between "some images are hard" (vague) and "images are hard because of X or Y" (actionable).

### **Two Independent Axes of Difficulty:**

1. **Representativeness:** "Is this a common scenario or a weird one?"
   - Think: normal lighting vs weird angle

2. **Complexity:** "How much stuff is in this image?"
   - Think: 1 damage vs 5 damages

**An image can be hard for DIFFERENT reasons:**

- **Hard because unusual:** Rare angle you haven't seen much (low rep, low complexity)

- **Hard because complex:** Many damages to track (high complexity, high rep)

- **Hard for BOTH:** Rare situation + many damages = nightmare scenario


### **For Machine Learning:**

The same idea gives a natural training order:
1. **Train on simple typical** → Model learns the basics

2. **Add complex typical** → Model learns to handle multiple objects

3. **Add simple edge** → Model learns robustness to conditions

4. **Add complex edge** → Model becomes expert
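
The staged ordering above can be sketched in plain Python. The `scenario` function mirrors the thresholds of the FiftyOne expression used in this section; the sample rows and the `STAGES` list are hypothetical, and this is an illustration of the bucketing rather than a training recipe.

```python
def scenario(rep, complexity):
    """Plain-Python mirror of the nested if_else bucketing used in this section."""
    if rep > 0.7 and complexity <= 3:
        return "simple_typical"
    if rep > 0.6 and complexity > 3:
        return "complex_typical"
    if rep < 0.4 and complexity <= 3:
        return "simple_edge"
    if rep < 0.4 and complexity > 3:
        return "complex_edge"
    return "other"


# Training order suggested by the list above
STAGES = ["simple_typical", "complex_typical", "simple_edge", "complex_edge"]

# Hypothetical (sample_id, representativeness, complexity_score) rows
rows = [
    ("img1", 0.9, 2),
    ("img2", 0.2, 6),
    ("img3", 0.8, 5),
    ("img4", 0.3, 2),
    ("img5", 0.5, 4),
]

curriculum = {stage: [] for stage in STAGES}
leftover = []
for sample_id, rep, cx in rows:
    bucket = scenario(rep, cx)
    (curriculum[bucket] if bucket in curriculum else leftover).append(sample_id)

for stage in STAGES:
    print(stage, curriculum[stage])
print("other", leftover)
```

Samples with mid-range representativeness (like `img5`) land in `"other"`, so they are not part of any stage unless you decide where they belong.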


In [None]:
from fiftyone import ViewField as F

dataset.add_sample_field("scenario_complexity", fo.StringField)

rep_scenario_expr = (
    ((F("representativeness") > 0.7) & (F("complexity_score") <= 3)).if_else(
        "simple_typical",
        ((F("representativeness") > 0.6) & (F("complexity_score") > 3)).if_else(
            "complex_typical",
            ((F("representativeness") < 0.4) & (F("complexity_score") <= 3)).if_else(
                "simple_edge",
                ((F("representativeness") < 0.4) & (F("complexity_score") > 3)).if_else(
                    "complex_edge",
                    "other"
                )
            )
        )
    )
)

# set_field() returns a view; calling save() on it writes the values to the dataset
dataset.set_field("scenario_complexity", rep_scenario_expr).save()

dataset.save()

### Uniqueness: Finding Outlier Scenarios


> Uniqueness tells us which images represent truly unusual scenarios - ones that don't fit into any common pattern.

- **Definition:** Uniqueness measures how different a sample is from all other samples in the dataset. A higher uniqueness score means the sample is less similar to others—it's an outlier or rare example.

- **Use case:** Useful for identifying and removing near-duplicate images, or for selecting the most unique samples to bootstrap model training. Unique samples help your model learn efficiently by exposing it to the full diversity of your data.

- **How it's computed:** FiftyOne computes a scalar-valued `uniqueness` field for each sample, normalized to [0, 1], with 1 being the most unique sample in the dataset. This is based on the distance in embedding space to other samples—samples far from others are more unique.

- **Example:** If you want to avoid bias or redundancy in your training data, you might filter for high-uniqueness samples to ensure diversity.
  
[Learn more about uniqueness in the docs](https://docs.voxel51.com/brain.html#image-uniqueness) and see the [tutorial](https://docs.voxel51.com/tutorials/uniqueness.html) for practical examples.
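
For intuition about "distance to other samples", here is a toy score: the mean distance to the `k` nearest neighbors, min-max scaled to [0, 1]. This is explicitly not FiftyOne's implementation; the function name `knn_uniqueness`, the choice of `k`, and the 2D vectors are all hypothetical stand-ins for real embeddings.

```python
import math


def knn_uniqueness(vectors, k=2):
    """Mean distance to the k nearest neighbors, min-max scaled to [0, 1]."""
    raw = []
    for i, v in enumerate(vectors):
        dists = sorted(math.dist(v, u) for j, u in enumerate(vectors) if j != i)
        raw.append(sum(dists[:k]) / k)
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    return [(r - lo) / span for r in raw]


# Hypothetical 2D "embeddings": three near-duplicates and one far-away point
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = knn_uniqueness(vectors)

for vec, score in zip(vectors, scores):
    print(vec, round(score, 3))
```

The three clustered points score close to 0 and the isolated point scores 1, which matches the intuition that near-duplicates are the least unique samples.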



In [None]:
fob.compute_uniqueness(
    dataset,
    embeddings="clip_embeddings",
    uniqueness_field="uniqueness"
)

**In summary:**  
- **Uniqueness** finds outliers and rare examples.  
- **Representativeness** finds prototypical, common examples.


#### **When an image is an outlier (unique), WHY is it an outlier?**

Are your outliers outliers because they're **genuinely rare situations** (complex) or just **photographed weirdly** (simple but unusual angle)?

Computing the following score tells you whether to collect more complex scenarios vs. just augment for viewpoint/lighting variations.

**Two Possibilities:**

1. **Unique AND Complex:** Outlier because it has tons of damages (4+)
   - Example: Car completely wrecked, 5 different damage types
   - **Why outlier:** Legitimately rare scenario

2. **Unique BUT Simple:** Outlier despite having few damages (1-2)
   - Example: Single scratch, but photographed from underneath the car
   - **Why outlier:** Weird angle/lighting, not the damage itself

##### **Different causes = different solutions:**

- **Unique + Complex:** Probably need more training data for multi-damage scenarios

- **Unique + Simple:** Probably need better augmentation (weird angles, lighting)


In [None]:
from fiftyone import ViewField as F

dataset.add_sample_field("uniqueness_complexity_scenario", fo.StringField)

unq_scenario_expr = (
    ((F("uniqueness") > 0.7) & (F("complexity_score") > 4)).if_else(
        "unique_and_complex",
        ((F("uniqueness") > 0.7) & (F("complexity_score") <= 2)).if_else(
            "unique_but_simple",
            "other"
        )
    )
)


# set_field() returns a view; calling save() on it writes the values to the dataset
dataset.set_field("uniqueness_complexity_scenario", unq_scenario_expr).save()

dataset.save()

### The Two-Dimensional Framework for Multi-Instance Data

Let's bring this together with the representativeness-uniqueness framework, adapted for our multi-instance, multi-perspective challenge.

```markdown
        High Representativeness
                │
    "Mainstream │   "Suspicious"
    Scenarios"  │   (investigate)
   (Train here) │
────────────────┼────────────────
                │
    "Niche      │   "True
    Clusters"   │   Outliers"
   (After base) │  (Review/Exclude)
                │
        Low Representativeness

        Uniqueness →

```

For multi-instance detection, this framework helps us handle scenario diversity:

- **Mainstream**: Train foundation model here - covers most insurance claims

- **Niche**: Specialized handling - maybe fine-tune separately for rear-view damages

- **Outliers**: Human review - too unusual for automated processing

- **Suspicious**: Investigate - might be duplicates or errors


In [None]:
from fiftyone import ViewField as F

dataset.add_sample_field("twod_scenario_analysis", fo.StringField)

twod_scenario_expr = (
    ((F("representativeness") > 0.6) & (F("uniqueness") < 0.4)).if_else(
        "mainstream",
        ((F("representativeness") < 0.4) & (F("uniqueness") < 0.5)).if_else(
            "niche",
            ((F("representativeness") < 0.4) & (F("uniqueness") > 0.7)).if_else(
                "outlier",
                ((F("representativeness") > 0.6) & (F("uniqueness") > 0.7)).if_else(
                    "suspicious",
                    "other"
                )
            )
        )
    )
)

# set_field() returns a view; calling save() on it writes the values to the dataset
dataset.set_field("twod_scenario_analysis", twod_scenario_expr).save()

dataset.save()

### Act 1 Wrap-Up


**What Embeddings Revealed**:
- The visual structure of our data - what looks similar, what's distinct

- Which damage types will be easy to separate (glass shatter) vs confusing (scratch/crack)

- How complexity and perspective affect clustering

**What Representativeness Showed**:
- What "typical" looks like in our dataset vs what's an edge case

- How to stratify data by scenario difficulty

- Where we might have training data gaps

**What Uniqueness Identified**:

- True outliers that don't fit any pattern

- Which unusual samples are worth handling vs excluding

- Different types of "difficult" (rare situation vs weird capture)

**The Power of This Approach:**

- We understand our data BEFORE spending time/money on training

- We can discover where models will struggle and why

- We can make strategic decisions about training order, data collection, and deployment

This is data-centric AI for complex, real-world problems.


### Let's go ahead and turn to the FiftyOne App to see all of these results on our dataset

It will be helpful to have the [Dashboard Plugin](https://docs.voxel51.com/plugins/plugins_ecosystem/dashboard.html) installed. You can install it by running the following in your terminal:

```bash
fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard
```
## Act 2: Enrich

Now let's add even richer context with Vision Language Models...

### VLMs for Comprehensive Damage Reports

Our current annotations tell us: 'This image has a dent at [x,y,w,h], a scratch at [x2,y2,w2,h2], and a crack at [x3,y3,w3,h3].'

But for real-world applications - insurance claims, repair estimation, fleet management - we need more:
 
 - HOW do the damages relate? Is the scratch connected to the dent?
 
 - WHICH parts of the car are affected? Door? Bumper? Multiple panels?
 
 - What's the OVERALL severity? Minor cosmetic or structural concern?
 
 - Are there SECONDARY effects? Paint chipping? Rust? Deformation?

Vision Language Models can generate these holistic damage assessments that consider ALL damages in context, plus the perspective and car condition.

Note: we're using moondream3 here, which is a gated model. Follow the instructions [here](https://github.com/harpreetsahota204/moondream3) to get access. In short: sign in to Hugging Face and [request access to the model](https://huggingface.co/moondream/moondream3-preview) (access is granted immediately), then run `hf auth login` in your terminal and enter your HF token.

In [None]:
# Register and download the remotely-sourced zoo model
import fiftyone.zoo as foz

foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/moondream3", 
    overwrite=True
)

foz.download_zoo_model(
    "https://github.com/harpreetsahota204/moondream3",
    model_name="moondream/moondream3-preview"
)

moondream_model = foz.load_zoo_model("moondream/moondream3-preview")

In [None]:
moondream_model.operation = "query"

moondream_model.prompt = """Complete a comprehensive damage report for this vehicle. 
Include:
1. All visible damages and their locations on the vehicle
2. How the damages relate to each other (if applicable)
3. Overall severity assessment
4. Any secondary effects (paint damage, deformation, etc.)"""

dataset.apply_model(moondream_model, "damage_report")

We can use the Keyword Search Plugin to search the damage reports by keyword. To install the plugin:

```bash
fiftyone plugins download https://github.com/jacobmarks/keyword-search-plugin
```

We can also create a view programmatically:

In [None]:
# Find high-severity cases
from fiftyone import ViewField as F
high_severity = dataset.match(
    # contains_str() does substring matching on string fields
    F("damage_report").contains_str(
        ["high severity", "severe", "safety-critical"],
        case_sensitive=False,
    )
)
dataset.save_view("high_severity_reports", high_severity)

You can also compute embeddings for the `damage_report` field we generated using the VLM and visualize those in the App:

In [None]:
import os
import torch
import fiftyone.brain as fob
from transformers import AutoModel

# Set an environment variable to silence the tokenizers parallelism warning.
# Note: this relates to the `transformers` and `tokenizers` libraries; it is not a FiftyOne setting.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

jina_embeddings_model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", 
    trust_remote_code=True,
    device_map = "auto"
    )

for sample in dataset.iter_samples(autosave=True):
    text_embeddings = jina_embeddings_model.encode(
        sentences = [sample["damage_report"]], # model expects a list of strings
        task="separation"
        )
    sample["text_embeddings"] = text_embeddings.squeeze()


results = fob.compute_visualization(
    dataset,
    embeddings="text_embeddings",
    method="umap",
    brain_key="text_embeddings",
    num_dims=2,
    skip_failures=True,
    create_index=True
)

Before launching the App, make sure the [Caption Viewer plugin](https://docs.voxel51.com/plugins/plugins_ecosystem/caption_viewer.html) is installed. You can install it by running the following in your terminal:

```bash
# Install from GitHub
fiftyone plugins download https://github.com/harpreetsahota204/caption-viewer
```

**Multi-perspective captures**: Cars photographed from various angles
   - Front view, side view, rear view, close-ups, wide shots
   - Same damage type looks different from different perspectives
   - Adds complexity: model must recognize "scratch" from any angle

In [None]:
moondream_model.operation = "classify"

moondream_model.prompt = ["Front view", "Side view", "Rear view", "Close-ups", "Wide angle shots"]

dataset.apply_model(moondream_model, "camera_perspective")

In [None]:
moondream_model.operation = "classify"

moondream_model.prompt = [
    "Front bumper damage",
    "Rear bumper damage",
    "Hood damage",
    "Roof damage",
    "Trunk damage",
    "Door damage",
    "Fender damage",
    "Quarter panel damage",
    "Windshield damage",
    "Window damage",
    "Headlight damage",
    "Taillight damage",
    "Mirror damage",
    "Wheel damage",
    "Tire damage",
    "Grille damage"
]

dataset.apply_model(moondream_model, "damage_location")

## Act 3: Build & Evaluate

The goal here is to demonstrate that multi-instance detection evaluation requires going beyond simple metrics to understand per-instance performance, multi-damage confusion, and perspective invariance.


In a longer-form version of this workshop, [I talk about using zero-shot models and how to fine-tune a model on this dataset](https://github.com/harpreetsahota204/car_dd_dataset_workshop/blob/main/03_model_evaluation.ipynb). We're going to skip those details in the interest of time, and take a model that I've already fine-tuned on this dataset and evaluate its performance.

In [None]:
!wget https://huggingface.co/harpreetsahota/car-dd-segmentation-yolov11/resolve/main/best.pt -O yolov11-seg-cardd.pt

In [None]:
import fiftyone.zoo as foz
from ultralytics import YOLO

model_path = "yolov11-seg-cardd.pt"

yolo_model = YOLO(model_path)

dataset.apply_model(yolo_model, label_field="yolo_predictions")

In [None]:
# Quick stats on what was detected
total_gt = dataset.count("detections.detections")

total_pred = dataset.count("yolo_predictions.detections")

print(f"\nGround truth instances: {total_gt}")

print(f"Predicted instances: {total_pred}")

print(f"Detection ratio: {total_pred/total_gt:.2f}")

In [None]:
# Images where we detected fewer instances than GT
under_detected = dataset.match(
    F("yolo_predictions.detections").length() < 
    F("detections.detections").length()
)

dataset.save_view("under_detected", under_detected)

print(f"Under-detected (missed some damages): {len(under_detected)} images")

In [None]:
over_detected = dataset.match(
    F("yolo_predictions.detections").length() > 
    F("detections.detections").length()
)

dataset.save_view("over_detected", over_detected)

print(f"Over-detected: {len(over_detected)} images")

In [None]:
# Perfect instance count match
perfect_count = dataset.match(
    F("yolo_predictions.detections").length() == 
    F("detections.detections").length()
)

dataset.save_view("perfect_count", perfect_count)

print(f"Perfect instance count: {len(perfect_count)} images")

Evaluating multi-instance detection is complex. We need to:
 - Match predicted instances to GT instances (COCO-style evaluation matches greedily by IoU, in descending order of prediction confidence)
 - Compute precision/recall at the instance level
 - Account for images with varying numbers of damages
 - Understand per-class performance when damages co-occur

FiftyOne's evaluation framework handles all of this with COCO-style evaluation.
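
To make the matching step concrete, here is a simplified, class-agnostic sketch of IoU-based greedy matching at a single IoU threshold. It is an illustration only, not FiftyOne's implementation: the real evaluation also accounts for class labels and sweeps IoU thresholds to compute mAP. The function names, the pixel-coordinate boxes in `[x, y, w, h]` format, and the confidences are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x, y, w, h] (top-left corner, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def greedy_match(gts, preds, iou_thresh=0.5):
    """Match predictions to ground truth in descending confidence order."""
    unmatched_gt = set(range(len(gts)))
    tp = fp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best, best_iou = None, iou_thresh
        for g in unmatched_gt:
            iou = box_iou(gts[g], box)
            if iou >= best_iou:
                best, best_iou = g, iou
        if best is None:
            fp += 1  # no available ground truth overlaps enough
        else:
            unmatched_gt.remove(best)  # each ground truth is matched at most once
            tp += 1
    return tp, fp, len(unmatched_gt)  # leftover ground truths are false negatives


# Hypothetical image with two damages and three predictions (box, confidence)
gts = [[10, 10, 20, 20], [60, 60, 20, 20]]
preds = [
    ([10, 10, 20, 20], 0.9),  # exact hit on the first damage
    ([12, 12, 20, 20], 0.6),  # duplicate of the first damage
    ([80, 10, 10, 10], 0.8),  # overlaps nothing
]

tp, fp, fn = greedy_match(gts, preds)
print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```

The duplicate prediction counts as a false positive because its ground truth was already claimed by a higher-confidence prediction. This is one reason multi-instance images can show low precision even when every damage is found.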

In [None]:
results = dataset.evaluate_detections(
    "yolo_predictions",
    gt_field="detections",
    eval_key="eval",
    compute_mAP=True,
    method="coco",
    use_boxes=True  # Can also use use_masks=True for mask IoU
)

In [None]:
# Overall metrics
print(f"\n{'='*70}")
print(f"OVERALL MULTI-INSTANCE DETECTION PERFORMANCE")
print(f"{'='*70}")
print(f"mAP @ IoU=0.50:0.95: {results.mAP():.3f}")
print(f"mAP @ IoU=0.50:      {results.mAP(iou=0.5):.3f}")
print(f"mAP @ IoU=0.75:      {results.mAP(iou=0.75):.3f}")

# Per-class breakdown
print(f"\n{'='*70}")
print(f"PER-CLASS BREAKDOWN (Remember: multi-damage images affect all classes)")
print(f"{'='*70}")
results.print_report()

You should also install the [Model Evaluation Panel](https://docs.voxel51.com/user_guide/app.html#app-model-evaluation-panel), which can be installed by running the following in your terminal:

```bash
fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/evaluation
```

This will let us do some more fine-grained analysis in the FiftyOne App.

We can also create some views that we can inspect in the App.

> "In multi-damage scenarios (2+ damages), what is the model confidently predicting that's actually wrong?"

**Step 1** (`.match()`): "Give me multi-damage images that have at least one false positive"

**Step 2** (`.filter_labels()`): "Now hide everything except the high-confidence FPs"

You see images with multiple damages, but only the false positive predictions are visible. This makes it easier to examine what the model got wrong without distraction from correct predictions.


- `.match()` with `.contains("fp")` = **select which images to show**

- `.filter_labels()` with `F("eval") == "fp"` = **select which detections to show within those images**
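
The difference between selecting images and selecting detections can be sketched on plain Python dictionaries. This is only a mental model under simplifying assumptions: the `match` and `filter_labels` functions below are hypothetical stand-ins, not FiftyOne's implementation, and the sample data is made up. The `only_matches` flag mimics FiftyOne's default of dropping samples that end up with no labels after filtering.

```python
samples = [
    {"id": "img1", "preds": [
        {"label": "dent", "eval": "tp", "confidence": 0.9},
        {"label": "scratch", "eval": "fp", "confidence": 0.8},
    ]},
    {"id": "img2", "preds": [
        {"label": "crack", "eval": "tp", "confidence": 0.95},
    ]},
    {"id": "img3", "preds": [
        {"label": "dent", "eval": "fp", "confidence": 0.4},
    ]},
]


def match(samples, sample_predicate):
    """Keep or drop whole samples; the labels inside are left untouched."""
    return [s for s in samples if sample_predicate(s)]


def filter_labels(samples, label_predicate, only_matches=True):
    """Keep only the labels that pass the predicate within each sample."""
    out = []
    for s in samples:
        kept = [p for p in s["preds"] if label_predicate(p)]
        if kept or not only_matches:
            out.append({**s, "preds": kept})
    return out


# Step 1: images with at least one FP (all of their predictions still visible)
step1 = match(samples, lambda s: any(p["eval"] == "fp" for p in s["preds"]))
print([s["id"] for s in step1])  # ['img1', 'img3']

# Step 2: within those images, show only the high-confidence FPs
step2 = filter_labels(
    step1, lambda p: p["eval"] == "fp" and p["confidence"] > 0.7
)
print([(s["id"], [p["label"] for p in s["preds"]]) for s in step2])
# [('img1', ['scratch'])]
```

Note that `img3` survives step 1 but disappears after step 2, because its only false positive is below the confidence threshold.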

In [None]:

# === FALSE POSITIVES (predicted damage that's not there) ===
# High-confidence FPs in multi-damage scenarios
multi_damage_fps = dataset.match(
    (F("detections.detections").length() > 1) &
    (F("yolo_predictions.detections.eval").contains("fp"))  # Step 1: Get images with FPs
).filter_labels(
    "yolo_predictions",
    (F("eval") == "fp") & (F("confidence") > 0.7)  # Step 2: Show only those FPs
)

> "In complex scenarios with many damages, which specific damages did the model miss?"

**Step 1** (`.match()`): Get images with 3+ damages (complex scenarios)

**Step 2** (`.filter_labels()`): Show only the ground truth detections that were missed (false negatives)

**Result:** You see complex multi-damage images, but only the damages the model **failed to detect** are visible.

In [None]:
# === FALSE NEGATIVES (missed damage) ===

# Missed damages in complex scenarios
missed_in_complex = dataset.match(
    F("detections.detections").length() > 2
).filter_labels(
    "detections",
    F("eval") == "fn"
)
dataset.save_view("missed_in_complex", missed_in_complex)

print(f"Missed damages in complex scenes: {missed_in_complex.count('detections.detections')}")

We can now open the App, look more closely at the Model Evaluation panel, and do some scenario analysis.
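
One form of scenario analysis is to stratify precision and recall by how many damages an image contains. The sketch below does this in plain Python on made-up per-image counts; the `bucket` and `stratified_metrics` helpers and the bucket boundaries are assumptions chosen for illustration, not part of FiftyOne.

```python
from collections import defaultdict


def bucket(n_damages: int) -> str:
    """Hypothetical complexity buckets based on ground truth damage count."""
    if n_damages <= 1:
        return "single"
    if n_damages == 2:
        return "double"
    return "complex (3+)"


def stratified_metrics(records):
    """Aggregate per-image TP/FP/FN counts into per-bucket precision/recall."""
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in records:
        counts = totals[bucket(r["n_damages"])]
        for key in ("tp", "fp", "fn"):
            counts[key] += r[key]

    report = {}
    for name, c in totals.items():
        p_den = c["tp"] + c["fp"]
        r_den = c["tp"] + c["fn"]
        report[name] = {
            "precision": c["tp"] / p_den if p_den else 0.0,
            "recall": c["tp"] / r_den if r_den else 0.0,
        }
    return report


# Made-up per-image evaluation counts
records = [
    {"n_damages": 1, "tp": 1, "fp": 0, "fn": 0},
    {"n_damages": 1, "tp": 0, "fp": 1, "fn": 1},
    {"n_damages": 2, "tp": 2, "fp": 0, "fn": 0},
    {"n_damages": 4, "tp": 2, "fp": 1, "fn": 2},
]
report = stratified_metrics(records)
for name, m in report.items():
    print(f"{name:13s} precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

With `eval_key="eval"`, FiftyOne should store per-sample counts in the `eval_tp`, `eval_fp`, and `eval_fn` fields, so records like these could be built from `dataset.values(...)` on those fields together with the ground truth detection count.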

## Act 4: Deployment Strategy

The goal here is to show how to validate label quality for multi-instance annotations and build risk-stratified deployment strategies that account for scenario complexity.

Multi-instance annotations are prone to errors:

 - **Missed instances**: Annotator overlooked a damage
 - **Wrong labels**: Scratch vs crack confusion
 - **Poor localization**: Bounding box doesn't capture full damage
 - **Merged instances**: Two damages annotated as one

[FiftyOne's mistakenness](https://docs.voxel51.com/brain.html#brain-label-mistakes) helps find these issues by flagging cases where a confident model prediction disagrees with the ground truth label.

The core idea is simple: if the model is very confident about a prediction but the ground truth says something different, the label may be the mistake rather than the model. The algorithm assigns a "mistakenness score" that is high in exactly those cases. It covers both classification errors (wrong class label) and localization errors (wrong bounding box position).

This helps you clean up datasets by automatically flagging suspicious annotations for human review.

You can read more about exactly how this works in the "Mistakenness" section of this [notebook](https://github.com/harpreetsahota204/car_dd_dataset_workshop/blob/main/03_model_evaluation.ipynb).
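
As a rough intuition only, the idea can be sketched with a toy score. This is explicitly not FiftyOne's formula: `toy_mistakenness` is a hypothetical function that assumes each ground truth object has already been paired with one prediction, and it simply grows with model confidence when the labels disagree or when the boxes overlap poorly.

```python
def toy_mistakenness(pred_label, pred_conf, gt_label, iou):
    """Toy score in [0, 1]; higher means the annotation deserves a second look.

    Illustrative only, NOT the formula used by fiftyone.brain.
    """
    if pred_label != gt_label:
        # Confident disagreement on the class: the label itself is suspect
        return pred_conf
    # Same class: a confident prediction with loose overlap hints at a poorly
    # localized ground truth box
    return pred_conf * (1.0 - iou)


# Hypothetical (ground truth, matched prediction) pairs
annotations = [
    {"gt": "scratch", "pred": "crack", "conf": 0.92, "iou": 0.80},
    {"gt": "dent", "pred": "dent", "conf": 0.90, "iou": 0.95},
    {"gt": "dent", "pred": "dent", "conf": 0.90, "iou": 0.50},
    {"gt": "crack", "pred": "scratch", "conf": 0.20, "iou": 0.70},
]

for a in annotations:
    a["score"] = toy_mistakenness(a["pred"], a["conf"], a["gt"], a["iou"])

# Review the most suspicious annotations first
for a in sorted(annotations, key=lambda a: a["score"], reverse=True):
    print(f"{a['gt']:8s} vs {a['pred']:8s} score={a['score']:.2f}")
```

The confident scratch/crack disagreement lands at the top of the review queue, while the well-localized dent that the model agrees with lands at the bottom.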


In [None]:
fob.compute_mistakenness(
    dataset,
    "yolo_predictions",
    label_field="detections"
)

In [None]:
# Analyze mistakenness patterns
from fiftyone import ViewField as F

# High mistakenness samples
high_mistake_samples = dataset.match(F("mistakenness") > 0.7)
print(f"\nHigh mistakenness samples: {len(high_mistake_samples)}")
print(f"Avg complexity: {high_mistake_samples.mean('complexity_score'):.2f}")

# High mistakenness instances
high_mistake_instances = dataset.filter_labels(
    "detections",
    F("mistakenness") > 0.8
)
print(f"High mistakenness instances: {high_mistake_instances.count('detections.detections')}")


In [None]:
print(f"\nMistakenness by damage type:")
for label in dataset.distinct("detections.detections.label"):
    view = dataset.filter_labels(
        "detections",
        (F("label") == label) & (F("mistakenness") > 0.7)
    )
    count = view.count("detections.detections")
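    # Count instances (not images) so numerator and denominator use the same unit
    total = dataset.filter_labels(
        "detections", F("label") == label
    ).count("detections.detections")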
    pct = (count / total * 100) if total > 0 else 0
    print(f"  {label:15s}: {count:3d} / {total:4d} ({pct:.1f}%)")

In [None]:
# Possible missing annotations (model found damage, but no GT)
possible_missing = dataset.filter_labels(
    "yolo_predictions",
    F("possible_missing") == True
)
print(f"\nPossible missing annotations: {possible_missing.count('yolo_predictions.detections')}")

# Possible spurious annotations (GT exists but model never found it)
possible_spurious = dataset.filter_labels(
    "detections",
    F("possible_spurious") == True
)
print(f"Possible spurious annotations: {possible_spurious.count('detections.detections')}")

### Multi-Instance Deployment Strategy

Let's design a production deployment strategy that accounts for:
 - Scenario complexity (# of damages)
 - Representativeness (typical vs edge case)
 - Model confidence
 - Damage type (cracks need special handling)

We'll create risk tiers for automotive insurance workflows.
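
Before building the tiers as dataset views, it can help to picture the routing as a single function applied to one image at a time. The sketch below is a hypothetical illustration, not the notebook's method: the `ImageSummary` class and `route` function are invented for this example, the thresholds are borrowed from the tier definitions in this section, and the ground-truth count check is left out because ground truth is not available at inference time. Checking the senior tier before the expert tier, and falling back to expert review, are assumptions made here to resolve overlaps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageSummary:
    """Hypothetical per-image summary; names mirror fields used in this notebook."""
    complexity_score: float
    representativeness: float
    confidences: List[float] = field(default_factory=list)
    has_crack: bool = False
    mistakenness: float = 0.0


def route(img: ImageSummary, conf_thresh: float = 0.7) -> str:
    """Assign one tier per image; the first matching rule wins."""
    # all([]) is True, so images with no predictions count as "confident",
    # mirroring the "no predictions OR min confidence > 0.7" clause
    confident = all(c > conf_thresh for c in img.confidences)

    # Tier 1 takes priority, as in the notebook (later tiers exclude it)
    if (img.representativeness > 0.6 or img.complexity_score <= 2) and confident:
        return "auto_approve"
    if img.has_crack or img.complexity_score > 4 or img.mistakenness > 0.7:
        return "senior_adjuster"
    if img.complexity_score > 3 or img.representativeness < 0.4 or not confident:
        return "expert_review"
    # Assumption: anything unclassified goes to a human rather than auto-approval
    return "expert_review"


examples = {
    "typical, confident": ImageSummary(1, 0.8, [0.9, 0.85]),
    "very complex outlier": ImageSummary(5, 0.3, [0.9]),
    "moderately complex": ImageSummary(3.5, 0.5, [0.9]),
    "low confidence": ImageSummary(2, 0.5, [0.5]),
}
for name, img in examples.items():
    print(f"{name:22s} -> {route(img)}")
```

A single-assignment router like this guarantees every image lands in exactly one queue, whereas the saved views below may overlap.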

#### **Tier 1 - Auto-Approve:**

Images where:
- **Scenario is manageable:** Typical conditions OR few damages (≤2)
- **Model is confident:** All predictions >70% confidence
- **Detection is complete:** Found the right number of damages

**Translation:** "Model handled this well and we trust it" → Process automatically without human review.

In [None]:
from fiftyone import ViewField as F

# === TIER 1: AUTO-APPROVE (High confidence, simple/typical scenarios) ===

tier1_auto = dataset.match(
    # Typical OR simple scenarios
    ((F("representativeness") > 0.6) | (F("complexity_score") <= 2)) &
    # High model confidence (if has predictions)
    ((F("yolo_predictions.detections.confidence").length() == 0) |
     (F("yolo_predictions.detections.confidence").min() > 0.7)) &
    # Complete detection (predicted count matches GT count)
    (F("yolo_predictions.detections").length() == 
     F("detections.detections").length())
)
dataset.save_view("tier1_auto_approve", tier1_auto)

auto_count = len(tier1_auto)
auto_instances = tier1_auto.count("detections.detections")

print(f"TIER 1 - AUTO-APPROVE:")
print(f"  Criteria: Typical/simple + complete detection + high confidence")
print(f"  Samples: {auto_count} ({auto_count/len(dataset)*100:.1f}%)")
print(f"  Instances: {auto_instances}")
print(f"  Avg complexity: {tier1_auto.mean('complexity_score'):.2f}")


#### **Tier 2 - Expert Review:**

Images where:
- **Scenario is challenging:** High complexity (`complexity_score` > 3) OR unusual conditions OR model missed/added damages
- **Not confident enough for auto-approval**

**Translation:** "Something is tricky here" → Route to human expert for review before processing.

In [None]:
# === TIER 2: EXPERT REVIEW (Complex OR edge cases OR partial detection) ===

tier2_expert = dataset.exclude(
    # Not already in tier 1 (exclude() takes sample IDs directly)
    tier1_auto.values("id")
).match(
    # Complex scenarios
    (F("complexity_score") > 3) |
    # Edge cases
    (F("representativeness") < 0.4) |
    # Partial detection
    (F("yolo_predictions.detections").length() != 
     F("detections.detections").length())
)
dataset.save_view("tier2_expert_review", tier2_expert)

expert_count = len(tier2_expert)
expert_instances = tier2_expert.count("detections.detections")

print(f"\nTIER 2 - EXPERT REVIEW:")
print(f"  Criteria: Complex OR edge OR partial detection")
print(f"  Samples: {expert_count} ({expert_count/len(dataset)*100:.1f}%)")
print(f"  Instances: {expert_instances}")
print(f"  Avg complexity: {tier2_expert.mean('complexity_score'):.2f}")


#### **Tier 3 - Senior Adjuster:**

Images where:
- **High risk of errors:** Contains cracks (ambiguous damage type) OR very complex (`complexity_score` > 4) OR labels look suspicious
- **Requires expertise:** Beyond standard review

**Translation:** "This is ambiguous/complex and could be wrong" → Route to senior expert with specialized damage assessment skills.

In [None]:
# === TIER 3: SENIOR ADJUSTER (Cracks OR very high complexity OR high mistakenness) ===

tier3_senior = dataset.exclude(
    # Not in tier 1 (exclude() takes sample IDs directly)
    tier1_auto.values("id")
).match(
    # Contains cracks (high confusion damage type)
    F("detections.detections.label").contains("crack") |
    # Very high complexity
    (F("complexity_score") > 4) |
    # High mistakenness
    (F("mistakenness") > 0.7)
)
dataset.save_view("tier3_senior_review", tier3_senior)

senior_count = len(tier3_senior)
senior_instances = tier3_senior.count("detections.detections")

print(f"\nTIER 3 - SENIOR ADJUSTER:")
print(f"  Criteria: Contains cracks OR very complex OR suspicious labels")
print(f"  Samples: {senior_count} ({senior_count/len(dataset)*100:.1f}%)")
print(f"  Instances: {senior_instances}")
print(f"  Avg complexity: {tier3_senior.mean('complexity_score'):.2f}")




#### **Tier 4 - Data Collection Gaps:**

Images where:
- **Underrepresented scenarios:** Unusual crack cases OR unique complex situations (outliers with many damages)
- **Not enough training examples**

**Translation:** "Model struggles here because we lack training data" → Prioritize collecting more similar examples to improve model performance.

In [None]:
# === TIER 4: DATA COLLECTION GAPS (Need more training data) ===

tier4_gaps = dataset.match(
    # Cracks in edge cases
    ((F("detections.detections.label").contains("crack")) &
     (F("representativeness") < 0.5)) |
    # High uniqueness complex scenarios
    ((F("uniqueness") > 0.7) & (F("complexity_score") > 3))
)
dataset.save_view("tier4_data_gaps", tier4_gaps)

gaps_count = len(tier4_gaps)

print(f"\nTIER 4 - DATA COLLECTION GAPS:")
print(f"  Criteria: Edge-case cracks OR unique complex scenarios")
print(f"  Samples: {gaps_count}")
print(f"  Avg complexity: {tier4_gaps.mean('complexity_score'):.2f}")



## Workshop Wrap-Up: Key Takeaways 

In this workshop, we covered a complete workflow for multi-instance detection problems:

**1. Explore Before Training**

- Quantify dataset complexity (objects per sample, co-occurrence patterns)

- Use embeddings to reveal visual structure

- Identify typical vs edge cases with representativeness metrics

**2. Evaluate at Instance Level**

- Per-object metrics, not just per-image accuracy

- Understand partial detection (finding some but not all objects)

- Stratify performance by scenario complexity

**3. Add Semantic Context**

- VLMs capture relationships between objects

- Generate searchable descriptions from visual data

- Bridge gaps between bounding boxes and business needs

**4. Validate Quality Systematically**

- Use model confidence to find annotation errors

- Focus review on ambiguous object types

- Distinguish model failures from labeling mistakes

**5. Deploy with Risk Awareness**

- Auto-process high-confidence typical cases

- Route complex scenarios to human experts

- Build systems that degrade gracefully

### Key Principles

**Multi-instance problems require multi-dimensional thinking:**

 - Complexity is measurable (count objects, diversity, unusualness)

 - Partial success is normal (models find some objects, miss others)

 - Context matters (relationships between objects)

 - Stratification is essential (simple ≠ complex)

 - Instance-level evaluation required

**This methodology generalizes to any multi-object detection problem.**

---

### Next Steps

- Join one of our upcoming virtual events: https://voxel51.com/events

- Join our community Discord where you can ask any questions you may have: https://discord.com/invite/fiftyone-community
