[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/car_dd_dataset_workshop/blob/main/03_model_evaluation.ipynb)

Note: If using in Google Colab, make sure you [install all the requirements listed here](https://github.com/harpreetsahota204/car_dd_dataset_workshop/blob/main/requirements.txt).

# Model evaluation in FiftyOne

Before we can evaluate model performance, we need to make some predictions!

Let's start by loading the dataset:

In [1]:
import fiftyone as fo

dataset = fo.load_dataset("cardd_from_hub")

In [None]:
# Or, if you are in a new notebook:
# import fiftyone as fo
# from fiftyone.utils.huggingface import load_from_hub

# dataset = load_from_hub(
#     "harpreetsahota/CarDD",
#     name="cardd_from_hub",
#     max_samples=100, # if you want to work with a subset of the dataset
#     persistent=True,
#     overwrite=True,
#     )

Let's grab the distinct ground truth class labels from our bounding boxes:

In [None]:
classes = dataset.distinct("detections.detections.label")
print(classes)

# Start with some zero-shot models

### Using the Hugging Face Integration

We can use our [integration with Hugging Face](https://docs.voxel51.com/integrations/huggingface.html#image-classification) to generate some predictions using a zero-shot object detection model.

In this case, let's use a version of [the OWL model by Google](https://huggingface.co/google/owlv2-base-patch16-finetuned).

In [None]:
import fiftyone.zoo as foz

owl_model = foz.load_zoo_model(
    "zero-shot-detection-transformer-torch",
    name_or_path="google/owlv2-base-patch16-finetuned",  # HF model name or path
    classes=classes,
)

In [None]:
dataset.apply_model(
    owl_model, 
    label_field="owlvit_pred", 
    confidence_thresh=0.2,
    )

# Ultralytics Integration

You can also use FiftyOne's [integration with Ultralytics](https://docs.voxel51.com/integrations/ultralytics.html) to generate some predictions.

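Let's use an open-vocabulary model for segmentation. The cell below loads `yoloe11l-seg-torch`, a YOLOE segmentation checkpoint from the FiftyOne Model Zoo (see also [YOLO-World](https://docs.ultralytics.com/models/yolo-world), a related open-vocabulary model):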

In [None]:
import fiftyone.zoo as foz

yolo_model = foz.load_zoo_model(
    "yoloe11l-seg-torch",
    classes=classes,
)

dataset.apply_model(
    yolo_model, 
    label_field="yolo_seg"
    )

# Use a Vision Language Model - PaliGemma2-Mix


Note: PaliGemma requires `flax` and `jax`.

> You can also use Florence2! Check out the [GitHub repo for the integration here](https://github.com/harpreetsahota204/florence2)

In [None]:
import fiftyone.zoo as foz
foz.register_zoo_model_source("https://github.com/harpreetsahota204/paligemma2", overwrite=True)

foz.download_zoo_model(
    "https://github.com/harpreetsahota204/paligemma2",
    model_name="google/paligemma2-3b-mix-224", 
)

In [None]:
paligemma_model = foz.load_zoo_model(
    "google/paligemma2-3b-mix-224",
    # install_requirements=True,  # if you are using the model for the first time and need to install its requirements
    # ensure_requirements=True,  # ensure any requirements are installed before loading the model
)

In [None]:
# Set operation and classes to detect
paligemma_model.operation = "detection"
paligemma_model.prompt = classes

# Apply to dataset
dataset.apply_model(paligemma_model, label_field="pg_detection")

In [None]:
# Set operation and classes to segment
paligemma_model.operation = "segment"
paligemma_model.prompt = classes

# Apply to dataset
dataset.apply_model(paligemma_model, label_field="pg_segmentations")

### Using `evaluate_detections()` to Analyze Your Object Detection Models

FiftyOne's [`evaluate_detections()`](https://docs.voxel51.com/user_guide/evaluation.html#supported-types) method gives you powerful insights into how well your object detection models are performing.

This method compares your model's predicted bounding boxes against ground truth annotations, calculating metrics like precision, recall, and mAP while also providing sample-level statistics that help you identify exactly where your model struggles. You can evaluate any detection task including standard object detection, instance segmentation, polygon detection, keypoints, and even temporal detections in videos. 

The evaluation uses COCO-style evaluation by default, but you can also switch to Open Images-style or ActivityNet-style evaluation depending on your needs.

When you specify an `eval_key` parameter it populates helpful fields on each sample and detection that you can explore interactively in the FiftyOne App.
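To make the box comparison concrete, here is a minimal sketch of the intersection-over-union (IoU) calculation that box-based matching relies on, followed by precision and recall computed from match counts. It assumes boxes in FiftyOne's relative `[top-left-x, top-left-y, width, height]` format. `box_iou` and `precision_recall` are illustrative helpers written for this notebook, not FiftyOne functions, and the counts passed in at the end are made up.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [x, y, width, height] in relative coordinates."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


ground_truth = [0.10, 0.10, 0.40, 0.40]
prediction = [0.20, 0.20, 0.40, 0.40]

# A prediction only counts as a match if its IoU reaches the chosen threshold
print(f"IoU: {box_iou(ground_truth, prediction):.3f}")

# Made-up counts, just to show the arithmetic
print(precision_recall(tp=8, fp=2, fn=4))
```

A prediction shifted by a quarter of the box size already loses a lot of overlap, which is why the IoU threshold you pick has such a strong effect on the reported metrics.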


In [None]:
# Evaluate predictions against ground truth
box_results = dataset.evaluate_detections(
    "yolo_seg",           # field containing model predictions  
    gt_field="detections", # field containing ground truth
    eval_key="eval",        # key to store evaluation results
    compute_mAP=True,        # compute mean Average Precision
    use_boxes=True,
    method="coco"
)


In [None]:
# Print detailed metrics
box_results.print_report()
print(f"mAP: {box_results.mAP():.3f}")

# View false positives in the app
from fiftyone import ViewField as F
fp_view = dataset.filter_labels("yolo_seg", F("eval") == "fp")
dataset.save_view(view=fp_view, name="fp")

### Using `evaluate_detections()` for Instance Segmentation Analysis

FiftyOne's `evaluate_detections()` method seamlessly handles instance segmentation when your masks are stored as `Detections` with populated mask fields.

When your segmentation masks are stored within `Detection` objects, passing `use_masks=True` makes the evaluation compute IoU from the instance masks instead of the bounding boxes. This gives you precise pixel-level overlap measurements between predicted and ground truth instance masks while still providing all the standard detection evaluation metrics like precision, recall, and mAP. The method follows the same COCO-style evaluation protocol but calculates IoU based on the actual segmented regions rather than just rectangular bounding boxes.

You get the best of both worlds - instance-level evaluation with pixel level accuracy measurements.
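As a rough illustration of what "mask-based IoU" means, the sketch below counts overlapping pixels between two binary masks. It uses plain nested lists so it runs without extra dependencies, and it assumes both masks have already been rendered onto the same pixel grid. `mask_iou` is a hypothetical helper, not part of FiftyOne.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0  # pixel set in both masks
            union += 1 if (a or b) else 0          # pixel set in at least one mask
    return intersection / union if union else 0.0


ground_truth_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
predicted_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]

print(f"Mask IoU: {mask_iou(ground_truth_mask, predicted_mask):.2f}")
```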


In [None]:
# Evaluate predictions against ground truth
mask_results = dataset.evaluate_detections(
    "yolo_seg",           # field containing model predictions  
    gt_field="segmentations", # field containing ground truth
    eval_key="seg_evals",     # key to store evaluation results
    use_masks=True,         # Use pixel masks for IoU calculation
    compute_mAP=True,        # compute mean Average Precision
    method="coco",
    # iou=0.50,          # you can set this threshold; it defaults to 0.50
)


In [None]:
# Print segmentation-specific metrics
mask_results.print_report()
print(f"Mask-based mAP: {mask_results.mAP():.3f}")

In [None]:
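from fiftyone import ViewField as F

# Use the fields written by the mask evaluation (eval_key="seg_evals").
# True positives always have IoU >= the matching threshold (0.50 by default),
# so look just above it to surface the weakest matches.
poor_masks = dataset.filter_labels(
    "yolo_seg",
    (F("seg_evals") == "tp") & (F("seg_evals_iou") < 0.6),
)

dataset.save_view(view=poor_masks, name="poor_masks")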

### Comparing Mask Quality vs Bounding Box Performance

You can run both mask-based and box-based evaluations on the same dataset to understand where your model excels at detection versus precise segmentation.

Running separate evaluations with `use_masks=True` and `use_masks=False` reveals interesting insights about your instance segmentation model's behavior. A detection might be correctly localized (good bounding box IoU) but have poor mask quality, or conversely, have excellent mask precision within a slightly misplaced bounding box. This dual analysis helps you understand whether your model's weaknesses lie in object localization, boundary delineation, or both.

The sample-level fields from both evaluations let you identify specific failure modes and edge cases.
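Here is a small, self-contained sketch of how the two measures can disagree. It represents masks as sets of `(row, col)` pixels and derives each bounding box from its mask. The helper names and the toy "scratch" shapes are assumptions made up for illustration.

```python
def bounding_box(pixels):
    """Tight (min_row, min_col, max_row, max_col) box around a set of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def box_area(box):
    """Pixel area of an inclusive (min_row, min_col, max_row, max_col) box."""
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def box_iou(box_a, box_b):
    """IoU of two inclusive pixel boxes."""
    height = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]) + 1
    width = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]) + 1
    intersection = max(0, height) * max(0, width)
    return intersection / (box_area(box_a) + box_area(box_b) - intersection)


def mask_iou(pixels_a, pixels_b):
    """IoU of two masks represented as sets of (row, col) pixels."""
    return len(pixels_a & pixels_b) / len(pixels_a | pixels_b)


# A thin diagonal scratch, and a prediction that runs along the opposite diagonal
ground_truth = {(i, i) for i in range(4)}
prediction = {(i, 3 - i) for i in range(4)}

print("Box IoU: ", box_iou(bounding_box(ground_truth), bounding_box(prediction)))
print("Mask IoU:", mask_iou(ground_truth, prediction))
```

Both shapes fill the same 4x4 box, so the box comparison sees a perfect match, while the masks share no pixels at all. Thin, elongated damage such as scratches is where this gap tends to matter most.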



In [None]:
# Compare the two approaches
print("Mask-based mAP:", mask_results.mAP())
print("Box-based mAP:", box_results.mAP())

# Find predictions that matched on boxes (eval_key="eval") but not on masks (eval_key="seg_evals")
from fiftyone import ViewField as F
mask_failures = dataset.filter_labels(
    "yolo_seg",
    (F("eval") == "tp") & (F("seg_evals") == "fp")
)

# save_view() takes the view name first, then the view
dataset.save_view("mask_failures", mask_failures)

print(f"Found {mask_failures.count('yolo_seg.detections')} instances with good boxes but poor masks")

Now let's install a panel to make it easy for us to understand overall model performance:

In [None]:
!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/evaluation

In [None]:
fo.launch_app(dataset)

# Fine-tune and evaluate a model

### Split dataset

Let's start by splitting our dataset, which we can do with the [built-in utility functions in FiftyOne](https://docs.voxel51.com/api/fiftyone.utils.random.html#fiftyone.utils.random.random_split).

In [43]:
import fiftyone as fo
import fiftyone.utils.random as four
import fiftyone.zoo as foz

four.random_split(dataset, {"train": 0.8, "val": 0.2})

In [None]:
dataset.skip(51).first()

Let's also go ahead and load in the test dataset:

In [None]:
import fiftyone as fo

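test_dataset = fo.Dataset.from_dir(
    data_path="CarDD_release/CarDD_COCO/test2017",
    labels_path="CarDD_release/CarDD_COCO/annotations/instances_test2017.json",
    dataset_type=fo.types.COCODetectionDataset,
    # Load boxes and instance masks so the `segmentations` field used as
    # ground truth later in this notebook exists
    label_types=["detections", "segmentations"],
    name="car_dd_test",
    overwrite=True,
    include_id=True,
)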

FiftyOne's [`match_tags()` method](https://docs.voxel51.com/api/fiftyone.core.dataset.html#fiftyone.core.dataset.Dataset.match_tags) lets you easily create views for each split after using `random_split()`.

When you use `random_split()`, FiftyOne automatically adds tags like "train", "test", and "val" to your samples based on the split ratios you specified. You can then create filtered views of your dataset by matching these tags, which is perfect for running separate evaluations on your training, validation, and test sets. This approach lets you analyze how your model performs across different data splits and catch issues like overfitting or distribution shifts between splits.

Each view acts like a subset of your original dataset while maintaining all the same functionality.
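If it helps to see the idea without FiftyOne, here is a minimal sketch of a tagged random split and the tag-matching that follows it. This is not FiftyOne's implementation. `tag_random_split` is a hypothetical helper, and the sample IDs are just integers.

```python
import random


def tag_random_split(sample_ids, split_fracs, seed=None):
    """Assign each sample ID exactly one split tag according to the given fractions."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    tags = {}
    start = 0
    names = list(split_fracs)
    for i, name in enumerate(names):
        # The last split takes whatever remains so that no sample is left untagged
        if i == len(names) - 1:
            end = len(shuffled)
        else:
            end = start + round(split_fracs[name] * len(shuffled))
        for sample_id in shuffled[start:end]:
            tags[sample_id] = name
        start = end
    return tags


tags = tag_random_split(range(100), {"train": 0.8, "val": 0.2}, seed=51)

# Equivalent of a tag-matching view: keep only the samples carrying a given tag
train_ids = [sample_id for sample_id, tag in tags.items() if tag == "train"]
val_ids = [sample_id for sample_id, tag in tags.items() if tag == "val"]
print(len(train_ids), len(val_ids))
```

Because every sample carries exactly one tag, the resulting subsets are disjoint by construction, which is what you want before training and evaluating on them.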


In [45]:
train_view = dataset.match_tags("train")
val_view = dataset.match_tags("val")

### Leaky Split Detector

This code finds data contamination between training/validation/test splits in machine learning datasets.

When you're building ML models, you split your data into different buckets, like training, validation, and test sets. But sometimes the same image (or a very similar one) accidentally ends up in multiple buckets. This is called "data leakage", and it can make your model performance metrics totally bogus. [This tool uses image similarity to hunt down these sneaky duplicates that are hiding across your splits](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_leaky_splits).

This method computes embeddings for all your images and builds a similarity index to find near-duplicates.

First, it runs your images through a neural network (default is ResNet-18, but you can change it) to get feature embeddings that capture what each image looks like.  Then, it uses these embeddings to build a similarity index that can quickly find which images are suspiciously similar to each other.  When you call `find_leaks()` with a threshold, it identifies pairs of similar images that live in different splits and flags them as potential leaks.

Once the leaks are found, you can automatically tag all the problematic samples, create filtered views that exclude the leaks entirely, or investigate specific samples to see all their similar neighbors across splits. It also warns you about overlapping splits and samples that couldn't be processed, so you know exactly what's going on with your data quality.

You can learn more about [this here](https://voxel51.com/blog/on-leaky-datasets-and-a-clever-horse).
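The core idea can be sketched in a few lines of plain Python: compare embeddings pairwise and report near-duplicates that sit in different splits. This is only an illustration of the concept, using tiny hand-written embeddings and cosine similarity with an arbitrary threshold. It is an assumption that this resembles what happens internally; the real index and its thresholding are more sophisticated.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_cross_split_leaks(samples, threshold=0.95):
    """Return pairs of sample IDs from different splits whose embeddings are near-duplicates."""
    leaks = []
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if a["split"] == b["split"]:
                continue  # similar images within one split are not leaks
            if cosine_similarity(a["embedding"], b["embedding"]) >= threshold:
                leaks.append((a["id"], b["id"]))
    return leaks


# Toy embeddings; real ones would come from a model such as CLIP
samples = [
    {"id": "img_001", "split": "train", "embedding": [0.90, 0.10, 0.00]},
    {"id": "img_002", "split": "train", "embedding": [0.00, 1.00, 0.00]},
    {"id": "img_003", "split": "val", "embedding": [0.89, 0.12, 0.01]},
    {"id": "img_004", "split": "val", "embedding": [0.00, 0.00, 1.00]},
]

print(find_cross_split_leaks(samples))
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example but is exactly why a proper similarity index is used on real datasets.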

In [None]:
import fiftyone as fo
from fiftyone.brain import compute_leaky_splits

leaks_index = compute_leaky_splits(
    dataset,
    splits=["train", "val"],
    model="clip-vit-base32-torch",
)

leaks = leaks_index.leaks_view()

### Export dataset to YOLO format

FiftyOne supports [exporting datasets to disk in various common formats](https://docs.voxel51.com/user_guide/export_datasets.html) and can be extended for custom formats.

You can export entire datasets or subsets using the Python library or CLI. The `export()` method automatically coerces data types to match the requested export types. Several dataset types are supported, including:

* `ImageDirectory`
* `VideoDirectory`
* `FiftyOneImageClassificationDataset`
* `COCODetectionDataset`

You can also export datasets in custom formats by defining your own `Dataset` or `DatasetExporter` class.


We're going to fine-tune a YOLO model, so we'll make use of the [YOLO exporter.](https://docs.voxel51.com/user_guide/export_datasets.html#yolov5)
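For orientation, a YOLO segmentation label file has one row per object: the class index followed by the polygon's vertices as normalized `x y` pairs. The sketch below formats one such row by hand. `yolo_segmentation_row`, the two-item class list, and the polygon are all made up for illustration, and the six-decimal precision is an arbitrary choice rather than what the exporter necessarily writes.

```python
def yolo_segmentation_row(class_index, points):
    """Format one polygon as a YOLO segmentation label row.

    `points` is a list of (x, y) vertices already normalized to [0, 1].
    """
    coords = " ".join(f"{x:.6f} {y:.6f}" for x, y in points)
    return f"{class_index} {coords}"


classes = ["dent", "scratch"]  # illustrative class list, not read from the dataset
polygon = [(0.25, 0.40), (0.60, 0.42), (0.58, 0.55), (0.27, 0.52)]

# The class is written as its index in the class list, not as a string
print(yolo_segmentation_row(classes.index("scratch"), polygon))
```

This is also why the same `classes` list is passed to both export calls in the next cell: the indices written for the train and val splits have to agree.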

In [None]:
import os
import fiftyone as fo
import fiftyone.utils.ultralytics as fou
import fiftyone.utils.labels as fol

EXPORT_DIR = "/tmp/car_dd_yolo"

YAML_FILE = os.path.join(EXPORT_DIR, "dataset.yaml")

fol.instances_to_polylines(dataset, "segmentations", "polylines")

train_view.export(
    export_dir=EXPORT_DIR,
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="polylines",
    split="train",
    classes=classes,
)

val_view.export(
    export_dir=EXPORT_DIR,
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="polylines",
    split="val",  # Ultralytics uses 'val'
    classes=classes,
)

In [49]:
from ultralytics import YOLO

model = YOLO('yolo11s-seg.pt')  # Pre-trained YOLO11 segmentation model

# Set up training configuration
model.train(
    data=YAML_FILE,
    epochs=100,                # Increased for effective learning
    batch=8,                   # Adjust if increasing imgsz
    imgsz=1024,                # Higher for small object segmentation
    val=True,
    save=True,
    save_period=10,
    project='car_damage_segmentation',
    name='yolov11_seg_run',
    workers=8,                 # If hardware allows
    device='cuda',
    patience=15,               # Slightly more patience for complex data
    verbose=True
    )

Now, let's apply this fine-tuned model to our test dataset:

In [None]:
model_path = "./car_damage_segmentation/yolov11_seg_run/weights/best.pt"

model = YOLO(model_path)

test_dataset.apply_model(model, label_field="fine_tuned_yolo")

In [None]:
# Evaluate predictions against ground truth
ft_results = test_dataset.evaluate_detections(
    "fine_tuned_yolo",           # field containing model predictions  
    gt_field="segmentations", # field containing ground truth
    eval_key="ft_evals",     # key to store evaluation results
    use_masks=True,         # Use pixel masks for IoU calculation
    compute_mAP=True,        # compute mean Average Precision
    method="coco",
    # iou=0.50,          # you can set this threshold; it defaults to 0.50
)


In [None]:
from fiftyone import ViewField as F

# Create a view that contains only high-confidence false positive
# predictions, with samples containing the most false positives first.
# The evaluation above used eval_key="ft_evals", so filter on that field.
most_fp_view = (
    test_dataset
    .filter_labels("fine_tuned_yolo", (F("confidence") > 0.8) & (F("ft_evals") == "fp"))
    .sort_by(F("fine_tuned_yolo.detections").length(), reverse=True)
)

test_dataset.save_view("most_fp_view", most_fp_view)

### Mistakenness

You can [use FiftyOne to calculate a mistakenness score for your ground truth labels](https://docs.voxel51.com/brain.html#brain-label-mistakes). This algorithm finds potential annotation errors by checking when confident model predictions disagree with your ground truth labels.

The core idea is simple: if your model is really confident about a prediction but your ground truth says something different, there's probably a labeling mistake. The algorithm calculates a "mistakenness score" that's high when the model is confident and wrong, suggesting the ground truth might be incorrect rather than the model. It works for both classification (wrong class label) and localization (wrong bounding box position).

This helps you clean up datasets by automatically flagging suspicious annotations for human review.

### How the Scoring Works

The mistakenness score combines model confidence with whether the prediction matches the ground truth label.

For classification, the formula is `(m * exp(-entropy(logits)) + 1) / 2` where `m` equals -1 for correct predictions and +1 for incorrect ones. The entropy part measures model confidence - lower entropy means higher confidence. So when the model is confident and wrong (high confidence, m=+1), you get a high mistakenness score. When the model is confident and correct (high confidence, m=-1), you get a low score.

For **localization mistakenness**, it uses a different formula that considers both confidence and IoU: `(c * ((2 * i) - 1) + 1) / 2` where `c` is the model confidence and `i` represents how bad the localization is (0 for perfect IoU, 1 for terrible IoU at the 0.5 threshold). This means objects with poor bounding box alignment get higher localization mistakenness scores, especially when the model was confident about the detection.
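To get a feel for the numbers, here is a minimal sketch that evaluates both formulas exactly as quoted above. The entropy is taken over the softmax of the logits, in nats. The linear mapping from IoU to `i` (0 at IoU 1.0, 1 at the 0.5 threshold) is my own assumption based on the description, so treat the localization helper as illustrative rather than as the library's code.

```python
import math


def softmax_entropy(logits):
    """Entropy (in nats) of the softmax distribution over the logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def classification_mistakenness(logits, prediction_is_correct):
    """(m * exp(-entropy(logits)) + 1) / 2 with m = -1 if correct, +1 if incorrect."""
    m = -1.0 if prediction_is_correct else 1.0
    return (m * math.exp(-softmax_entropy(logits)) + 1.0) / 2.0


def localization_mistakenness(confidence, iou, iou_threshold=0.5):
    """(c * ((2 * i) - 1) + 1) / 2, with i derived from the IoU."""
    # Assumption: i = 0 at IoU 1.0 and i = 1 at the matching threshold
    i = (1.0 - iou) / (1.0 - iou_threshold)
    return (confidence * ((2.0 * i) - 1.0) + 1.0) / 2.0


confident = [10.0, 0.0, 0.0]  # one class dominates, so entropy is low
unsure = [0.0, 0.0, 0.0]      # uniform over three classes, so entropy is high

print(classification_mistakenness(confident, prediction_is_correct=False))  # close to 1
print(classification_mistakenness(confident, prediction_is_correct=True))   # close to 0
print(classification_mistakenness(unsure, prediction_is_correct=False))     # nearer to 0.5
print(localization_mistakenness(confidence=0.9, iou=0.55))
```

Notice that an unsure model pulls the classification score toward 0.5 whether it agrees with the label or not, so only confident disagreements rise to the top of the ranking.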

### Three Types of Issues It Finds

This will help you identify three categories of potential problems: incorrect labels, missing annotations, and spurious annotations.

1) **Incorrect labels** happen when ground truth objects have matching predictions but disagree on class or location - these get mistakenness scores to rank how suspicious they are. 

2) **Missing annotations** are high-confidence predictions (above 95% confidence) that don't match any ground truth object, suggesting the annotator missed something obvious. 

3) **Spurious annotations** are ground truth objects that don't match any prediction, indicating they might be phantom labels or incorrectly placed.

Each category helps you focus on different types of annotation quality issues.

### The Detection Pipeline

[For object detection](https://docs.voxel51.com/tutorials/detection_mistakes.html), it first matches predictions to ground truth using IoU overlap, then computes mistakenness for each matched pair.

The process starts by running standard detection evaluation with a 0.5 IoU threshold to pair up predicted and ground truth objects. For each matched pair, it calculates both classification mistakenness (wrong class) and localization mistakenness (wrong position). Unmatched high-confidence predictions become missing annotation candidates, while unmatched ground truth objects become spurious annotation candidates. The final sample-level score is just the highest mistakenness of any object in that image.

This matching step is crucial because it establishes which predictions correspond to which annotations before judging their quality.
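The matching-then-flagging flow can be sketched as follows. This is a simplified, class-agnostic greedy matcher written for illustration, not the evaluation code FiftyOne actually runs. The helper names, the toy boxes, and the labels are all assumptions. The 0.5 IoU threshold and 0.95 confidence cutoff follow the description above.

```python
def box_iou(a, b):
    """IoU of two [x, y, width, height] boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_and_flag(ground_truth, predictions, iou_thresh=0.5, missing_conf=0.95):
    """Greedily pair predictions with ground truth, then flag whatever is left over."""
    matches = []
    possible_missing = []
    unmatched_gt = set(range(len(ground_truth)))

    # Visit the most confident predictions first
    for pred in sorted(predictions, key=lambda p: p["confidence"], reverse=True):
        best_iou, best_gt = 0.0, None
        for gt_idx in unmatched_gt:
            iou = box_iou(pred["box"], ground_truth[gt_idx]["box"])
            if iou >= iou_thresh and iou > best_iou:
                best_iou, best_gt = iou, gt_idx
        if best_gt is not None:
            # Matched pair: class and localization can now be compared
            unmatched_gt.remove(best_gt)
            matches.append((ground_truth[best_gt]["label"], pred["label"], best_iou))
        elif pred["confidence"] > missing_conf:
            # Confident prediction with no ground truth: maybe the annotator missed it
            possible_missing.append(pred["label"])

    # Ground truth that no prediction matched: maybe it should not be there
    possible_spurious = [ground_truth[i]["label"] for i in sorted(unmatched_gt)]
    return matches, possible_missing, possible_spurious


ground_truth = [
    {"label": "dent", "box": [0.10, 0.10, 0.30, 0.30]},
    {"label": "scratch", "box": [0.70, 0.70, 0.20, 0.20]},
]
predictions = [
    {"label": "scratch", "box": [0.12, 0.10, 0.30, 0.30], "confidence": 0.97},
    {"label": "crack", "box": [0.40, 0.60, 0.15, 0.15], "confidence": 0.98},
]

matches, possible_missing, possible_spurious = match_and_flag(ground_truth, predictions)
print("matched (gt label, predicted label, IoU):", matches)
print("possible missing:", possible_missing)
print("possible spurious:", possible_spurious)
```

In this toy example the first ground truth box is matched by a prediction with a different class, which is the kind of pair that would then receive a classification mistakenness score.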


### Note: We are now going to apply our fine-tuned model to the training dataset

In [None]:
# model_path = "./car_damage_segmentation/yolov11_seg_run/weights/best.pt"

# model = YOLO(model_path)

dataset.apply_model(model, label_field="fine_tuned_yolo")

In [None]:
import fiftyone.brain as fob

# Compute mistakenness of annotations in the `segmentations` field, using
# predictions from the `fine_tuned_yolo` field as the point of reference
fob.compute_mistakenness(dataset, "fine_tuned_yolo", label_field="segmentations")

### What Gets Added to Your Dataset

The mistakenness analysis adds new fields to your ground truth objects, predicted objects, and samples to flag potential annotation issues.

**Ground truth objects** get three new attributes: 

- `mistakenness` scores how likely the class label is wrong
- `mistakenness_loc` scores how likely the bounding box position is wrong
- `possible_spurious` marks objects that probably shouldn't exist (no matching prediction found). 

**Predicted objects** get one new attribute: 

- `possible_missing` flags high-confidence predictions that have no ground truth match, suggesting the annotator missed labeling an obvious object.

**Sample-level fields** summarize the issues: 

- `mistakenness` takes the highest mistakenness score from any ground truth object in that image
- `possible_spurious` and `possible_missing` count how many problematic objects were found in each sample.


This gives you both fine-grained object-level feedback and high-level sample summaries to prioritize your annotation review.

### How to Use These Flags

You can sort and filter your dataset using these new fields to focus on the most problematic annotations first.

Sort samples by the `mistakenness` field to see images with the most suspicious ground truth labels at the top. Filter for samples with high `possible_missing` counts to find images where your annotators likely missed obvious objects. Use `possible_spurious` counts to identify samples with phantom annotations that probably don't belong.

At the object level, you can filter ground truth objects by `mistakenness > 0.8` to see the most questionable class labels, or `mistakenness_loc > 0.7` to find bounding boxes that are probably in the wrong place.

This approach helps you tackle annotation quality issues in order of severity rather than randomly reviewing your entire dataset.

In [None]:
from fiftyone import ViewField as F

# Sort by likelihood of mistake (most likely first)
mistake_view = dataset.sort_by("mistakenness", reverse=True)

dataset.save_view(mistake_view, "sorted_mistake_view")

# Print some information about the view
print(mistake_view)

In [None]:
high_mistaken_score = dataset.filter_labels("segmentations", F("mistakenness") > 0.7)

dataset.save_view(high_mistaken_score, "high_mistaken_scores")

In [None]:
poorly_localized = dataset.filter_labels("segmentations", F("mistakenness_loc") > 0.85)
dataset.save_view(poorly_localized, "poorly_localized")

We can take this a step further and fix the errors in the labels!

To do this, we will use [FiftyOne's integration with CVAT](https://docs.voxel51.com/integrations/cvat.html). You can sign up [here](https://app.cvat.ai). I recommend signing up with an email and setting your password explicitly, rather than using a third-party sign-on.

Once you've done that, set the following environment variables:
- `FIFTYONE_CVAT_USERNAME`
- `FIFTYONE_CVAT_PASSWORD`


Note: FiftyOne also integrates with [Label Studio](https://docs.voxel51.com/integrations/index.html), [V7](https://docs.voxel51.com/integrations/v7.html), and [Labelbox](https://docs.voxel51.com/integrations/labelbox.html).

In [None]:
import os
from getpass import getpass

os.environ['FIFTYONE_CVAT_USERNAME'] = getpass("Enter your CVAT user name:")

In [None]:
os.environ['FIFTYONE_CVAT_PASSWORD'] = getpass("Enter your CVAT password:")

You can [learn more about annotation in FiftyOne here](https://docs.voxel51.com/user_guide/annotation.html?).


In [None]:
high_mistaken_score.annotate(
    anno_key="high_mistaken_score",
    backend="cvat",
    label_field="segmentations",
    label_type="instances",
    classes=dataset.default_classes,
    allow_additions=True,
    allow_deletions=True,
    allow_label_edits=True,
    launch_editor=True
)

Once we've done the annotation work, we can load the edited labels back into FiftyOne!

In [None]:
high_mistaken_score.load_annotations("high_mistaken_score")