In [None]:
# pip install -U fiftyone sahi ultralytics huggingface_hub --quiet

In [1]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load the dataset from Hugging Face if it's your first time using it

# train_dataset = fouh.load_from_hub(
#     "Voxel51/Coursera_lecture_dataset_train", 
#     dataset_name="lecture_dataset_train", 
#     persistent=True
#     )

# test_dataset = fouh.load_from_hub(
#     "Voxel51/Coursera_lecture_dataset_test", 
#     dataset_name="lecture_dataset_test", 
#     persistent=True
#     )

In [2]:
dataset = fo.load_dataset("lecture_dataset_test_clone")

dataset = dataset.clone()

## Challenges in Small Object Detection

Detecting small objects in images and video presents unique challenges for computer vision systems. Several factors contribute to the difficulty of this task:

### Limited Visual Information

The primary challenge lies in the scarcity of visual data. 

Small objects occupy only a few pixels in an image, providing minimal information for detection models to work with. Just as humans struggle to discern distant objects, AI models have difficulty identifying small objects that lack clearly visible features like wheels on a car or facial details on a person.

### Dataset Bias

The quality of a model's performance is intrinsically tied to its training data. 

Unfortunately, many standard object detection datasets and benchmarks predominantly feature medium to large objects. This bias results in off-the-shelf detection models that are not optimized for identifying small objects, as they haven't been adequately exposed to such examples during training.

### Fixed Input Dimensions

Most object detection models operate on fixed-size inputs. 

For instance, YOLOv8 by default resizes images so that the longer side is 640 pixels. When presented with a larger image, such as one with dimensions of 1920×1080, the model downsamples it to 640×360 before analysis. This reduction in resolution can lead to a loss of crucial details, particularly for small objects.
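
To make the arithmetic concrete, here is a minimal sketch of how an aspect-preserving resize to a 640-pixel longer side shrinks both the image and a small object inside it. The helper `resized_dims` is illustrative and not part of any library, and the 24-pixel object is a hypothetical example.

```python
def resized_dims(width, height, max_side=640):
    """Scale (width, height) so the longer side equals max_side, keeping the aspect ratio."""
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale), scale


new_w, new_h, scale = resized_dims(1920, 1080)
print(new_w, new_h)  # 640 360

# Hypothetical object that is 24 pixels wide in the original image
object_px = 24
print(round(object_px * scale, 1))  # 8.0
```

An object that spans 24 pixels in the original frame is left with roughly 8 pixels after resizing, which is very little for a detector to work with.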

### Additional Complexities

Several other factors compound the difficulty of small object detection:

1. **Class Imbalance**: Datasets often contain fewer instances of small objects, leading to training biases.

2. **Scale Variation**: The significant size difference between small and large objects in the same image can challenge detection algorithms.

3. **Occlusion and Clutter**: Small objects are more susceptible to being partially hidden or lost in busy backgrounds.

4. **Computational Demands**: Processing high-resolution images to detect small objects can be computationally intensive.


You could train a model on larger images to improve the detection of small objects. 

But this requires more memory, more computational power, and datasets that are more labor-intensive to create. An alternative is to take an existing object detection model, apply it to fixed-size patches or slices of the image, and then stitch the results together. 

This is the idea behind [Slicing-Aided Hyper Inference (SAHI)](https://github.com/obss/sahi)!

## SAHI

<img src="https://raw.githubusercontent.com/obss/sahi/main/resources/sliced_inference.gif">

Image source: [SAHI GitHub Repo](https://github.com/obss/sahi)

SAHI divides large images into smaller, overlapping slices. This makes small objects appear relatively larger within each slice, and therefore easier for the model to detect. The model performs detection on each slice independently, potentially capturing small objects that would go undetected in the full image.

In a nutshell, here's how it works:

1. SAHI takes the full image as input.

2. The input image is divided into smaller, overlapping slices. The slice size and overlap ratio are configurable parameters.

3. The chosen object detection model performs inference on each slice independently. Note: SAHI can be integrated with various object detection models, including the YOLO series, without requiring modifications to the underlying detector.

4. The coordinates of detected objects in each slice are transformed back to the original image's coordinate system.

5. Detections from all slices are collected and combined.

6. Duplicate detections from overlapping slices are merged or filtered, often using non-maximum suppression (NMS).

7. A consolidated list of detections for the original image is produced.
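
The slicing and coordinate-mapping steps above can be sketched in plain Python. This is a simplified illustration, not SAHI's actual implementation: the helper names `slice_boxes` and `to_full_image` are hypothetical, the sketch assumes the image is at least as large as one slice, and the 1920×1080 size is just the example used earlier.

```python
def slice_boxes(image_w, image_h, slice_w, slice_h, overlap_ratio):
    """Return (x_min, y_min, x_max, y_max) slices that cover the whole image."""
    step_x = slice_w - int(slice_w * overlap_ratio)
    step_y = slice_h - int(slice_h * overlap_ratio)
    boxes = []
    y = 0
    while True:
        # Clamp the last row to the image border so every slice keeps its full size
        y_max = min(y + slice_h, image_h)
        y_min = max(0, y_max - slice_h)
        x = 0
        while True:
            x_max = min(x + slice_w, image_w)
            x_min = max(0, x_max - slice_w)
            boxes.append((x_min, y_min, x_max, y_max))
            if x_max >= image_w:
                break
            x += step_x
        if y_max >= image_h:
            break
        y += step_y
    return boxes


def to_full_image(box, slice_box):
    """Shift a box from slice coordinates into full-image coordinates."""
    x1, y1, x2, y2 = box
    off_x, off_y = slice_box[0], slice_box[1]
    return (x1 + off_x, y1 + off_y, x2 + off_x, y2 + off_y)


boxes = slice_boxes(1920, 1080, 320, 320, overlap_ratio=0.2)
print(len(boxes))   # 32
print(boxes[1])     # (256, 0, 576, 320)

# A hypothetical detection found inside the second slice
print(to_full_image((10, 20, 30, 40), boxes[1]))  # (266, 20, 286, 40)
```

Because adjacent slices overlap, an object near a slice border can be detected twice, which is why the merging step is needed afterwards.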

Keep in mind that sliced inference takes longer than standard inference on the full image. 

This is because we're running the model on multiple slices *per* image, which increases the number of forward passes the model has to make. This is a trade-off we're making to improve the detection of small objects.
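
As a rough way to estimate this cost, the sketch below counts how many slices are needed to cover an image for a given slice size and overlap. The function names are hypothetical, the 1920×1080 image size is only an example, and the count assumes a simple sliding-window layout, so the exact number of forward passes in SAHI may differ (for instance, it may also run one pass on the full image, depending on settings).

```python
import math


def slices_per_axis(image_len, slice_len, overlap_ratio):
    """Slices needed to cover one image axis (assumes image_len >= slice_len)."""
    step = slice_len - int(slice_len * overlap_ratio)
    return math.ceil((image_len - slice_len) / step) + 1


def num_slices(image_w, image_h, slice_size, overlap_ratio=0.2):
    """Total number of square slices needed to cover the image."""
    return (
        slices_per_axis(image_w, slice_size, overlap_ratio)
        * slices_per_axis(image_h, slice_size, overlap_ratio)
    )


print(num_slices(1920, 1080, 320))  # 32
print(num_slices(1920, 1080, 480))  # 15
```

Halving the slice side length roughly quadruples the number of slices, so smaller slices buy small-object sensitivity at a steep cost in inference time.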

To use SAHI, start by running `pip install sahi` in your terminal or notebook. 

Then pass the path of your trained detection model to `AutoDetectionModel.from_pretrained()` to create a SAHI detection model:


In [None]:
from sahi import AutoDetectionModel
from sahi.predict import get_prediction, get_sliced_prediction

ckpt_path = "..."  # path to the best model you trained in the previous module

detection_model = AutoDetectionModel.from_pretrained(
    model_type='yolov8',
    model_path=ckpt_path,
    confidence_threshold=0.25,
    image_size=640,
    # device="cuda", # if you have a GPU
)

To get a sense of what the output looks like, use SAHI's `get_prediction` function:

In [None]:
result = get_prediction(dataset.first().filepath, detection_model, verbose=0)
print(result)

SAHI results objects have a `to_fiftyone_detections()` method, which converts the results to FiftyOne detections:

In [None]:
print(result.to_fiftyone_detections())

SAHI's `get_sliced_prediction()` function works in the same way as `get_prediction()`, with a few additional hyperparameters that let us configure how the image is sliced. In particular, we can specify the slice height and width, and the overlap between slices. Here's an example:

In [None]:
sliced_result = get_sliced_prediction(
    dataset.skip(40).first().filepath,
    detection_model,
    slice_height = 320,
    slice_width = 320,
    overlap_height_ratio = 0.2,
    overlap_width_ratio = 0.2,
)

Now compare the number of detections in the sliced predictions to the number of detections in the original predictions:

In [None]:
num_sliced_dets = len(sliced_result.to_fiftyone_detections())
num_orig_dets = len(result.to_fiftyone_detections())

print(f"Detections predicted without slicing: {num_orig_dets}")
print(f"Detections predicted with slicing: {num_sliced_dets}")

Notice the change in the number of predictions.

Later in this notebook we'll determine if the additional predictions are valid, or if we just have more false positives, with [FiftyOne's Evaluation API](https://docs.voxel51.com/user_guide/evaluation.html). 

The task now is to find a good set of hyperparameters for our slicing. For this we can apply SAHI to the entire dataset. 

The function below passes a sample's filepath and the slicing hyperparameters to `get_sliced_prediction()`, then stores the resulting predictions on the sample in the specified label field. We will then iterate over the dataset and apply it to each sample:

In [None]:
def predict_with_slicing(sample, label_field, **kwargs):
    """
    Perform sliced prediction on a sample and add the results to a specified label field.

    This function uses SAHI's get_sliced_prediction to perform object detection on
    slices of the image, then converts the results to FiftyOne Detections and adds
    them to the sample.

    Args:
        sample (fiftyone.core.sample.Sample): The FiftyOne sample to process.
        label_field (str): The name of the field to store the predictions in.
        **kwargs: Additional keyword arguments to pass to get_sliced_prediction.

    Returns:
        None. The function modifies the sample in-place.

    Note:
        This function assumes that a global 'detection_model' object is available,
        which should be an instance of a SAHI-compatible detection model.
    """
    result = get_sliced_prediction(
        sample.filepath, detection_model, verbose=0, **kwargs
    )
    sample[label_field] = fo.Detections(detections=result.to_fiftyone_detections())

We'll keep the slice overlap fixed at $0.2$, and see how the slice height and width affect the quality of the predictions:

In [None]:
kwargs = {"overlap_height_ratio": 0.2, "overlap_width_ratio": 0.2}

for sample in dataset.iter_samples(progress=True, autosave=True):
    predict_with_slicing(sample, label_field="small_slices", slice_height=320, slice_width=320, **kwargs)
    predict_with_slicing(sample, label_field="large_slices", slice_height=480, slice_width=480, **kwargs)

Let's run an evaluation routine comparing our predictions from each of the prediction label fields to the ground truth labels. 

Using the `evaluate_detections()` method will mark each detection as a true positive, false positive, or false negative. Here we use the default IoU threshold of $0.5$, but you can adjust this as needed.

Note that this will take some time!
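
For intuition about the IoU threshold, here is a minimal sketch of how intersection over union is computed for two boxes in the `[x, y, width, height]` format. The `iou` helper and the two boxes are hypothetical examples; FiftyOne's own matching logic handles more than this, such as comparing class labels.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes (top-left origin)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground truth and prediction, in relative coordinates
gt = [0.10, 0.10, 0.20, 0.20]
pred = [0.15, 0.10, 0.20, 0.20]
score = iou(gt, pred)
print(f"IoU = {score:.2f}, counts as a match at 0.5: {score >= 0.5}")
```

For small objects, a shift of just a few pixels can push the IoU below the threshold, which is worth keeping in mind when interpreting the results.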

In [None]:
base_results = dataset.evaluate_detections("base_model", gt_field="ground_truth", eval_key="eval_base_model")

large_slice_results = dataset.evaluate_detections("large_slices", gt_field="ground_truth", eval_key="eval_large_slices")

small_slice_results = dataset.evaluate_detections("small_slices", gt_field="ground_truth", eval_key="eval_small_slices")

Comparing the three sets of results (for example, with each result's `print_report()` method), we can see that as we introduce more slices, the number of false positives increases, while the number of false negatives decreases. This is expected, as the model is able to detect more objects with more slices, but also makes more mistakes! You could apply more aggressive confidence thresholding to combat this increase in false positives, but even without doing this the $F_1$-score has significantly improved.
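
To see how more false positives and fewer false negatives can still add up to a better $F_1$-score, here is a small worked example. The counts below are hypothetical and chosen only for illustration; they are not results from this dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, NOT measured on the lecture dataset
base = precision_recall_f1(tp=40, fp=10, fn=60)
sliced = precision_recall_f1(tp=70, fp=30, fn=30)

print("base:   P=%.2f R=%.2f F1=%.2f" % base)
print("sliced: P=%.2f R=%.2f F1=%.2f" % sliced)
```

In this made-up case, precision drops from 0.80 to 0.70, but recall rises from 0.40 to 0.70, so the $F_1$-score goes up.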

Let's dive a little bit deeper into these results. We noted earlier that the model struggles with small objects, so let's see how these three approaches fare on objects smaller than $32 \times 32$ pixels. We can perform this filtering using FiftyOne's [ViewField](https://docs.voxel51.com/recipes/creating_views.html#View-expressions):

In [None]:
## Filtering for only small boxes
from fiftyone import ViewField as F

box_width, box_height = F("bounding_box")[2], F("bounding_box")[3]
rel_bbox_area = box_width * box_height

im_width, im_height = F("$metadata.width"), F("$metadata.height")
abs_area = rel_bbox_area * im_width * im_height

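# Apply the size filter to the ground truth and to every prediction field
small_boxes_view = dataset
for lf in ["ground_truth", "base_model", "large_slices", "small_slices"]:
    small_boxes_view = small_boxes_view.filter_labels(lf, abs_area < 32**2, only_matches=False)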
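
The area expression above relies on bounding boxes being stored in relative `[x, y, width, height]` coordinates, so the pixel area is the relative area multiplied by the image width and height. A plain-Python version of the same arithmetic, with a hypothetical box and image size, looks like this:

```python
def abs_box_area(bounding_box, image_width, image_height):
    """Pixel area of a box given in relative [x, y, width, height] format."""
    _, _, rel_w, rel_h = bounding_box
    return (rel_w * image_width) * (rel_h * image_height)


# Hypothetical box in a 1920x1080 image: about 19.2 x 21.6 pixels
area = abs_box_area([0.5, 0.5, 0.01, 0.02], 1920, 1080)
print(round(area, 2), area < 32**2)
```

This box covers about 415 square pixels, well under the $32 \times 32 = 1024$ cutoff, so it would be kept by the small-box filter.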

In [None]:
small_boxes_base_results = small_boxes_view.evaluate_detections("base_model", gt_field="ground_truth", eval_key="eval_small_boxes_base_model")

small_boxes_large_slice_results = small_boxes_view.evaluate_detections("large_slices", gt_field="ground_truth", eval_key="eval_small_boxes_large_slices")

small_boxes_small_slice_results = small_boxes_view.evaluate_detections("small_slices", gt_field="ground_truth", eval_key="eval_small_boxes_small_slices")

In [None]:
print("Small Box — Base model results:")
small_boxes_base_results.print_report()

print("-" * 50)
print("Small Box — Large slice results:")
small_boxes_large_slice_results.print_report()

print("-" * 50)
print("Small Box — Small slice results:")
small_boxes_small_slice_results.print_report()

This makes the value of SAHI crystal clear! The recall when using SAHI is much higher for small objects without a significant drop-off in precision, leading to an improved $F_1$-score. This is especially pronounced for `` detections, where the $F_1$-score is tripled!


### Edge cases
Now that we know SAHI is effective at detecting small objects, let's look at the places where our predictions are most confident but do not align with the ground truth labels. We can do this by creating an evaluation patches view, filtering for predictions tagged as false positives and sorting by confidence:

In [None]:
# Keep only false positive patches and show the most confident ones first
high_conf_fp_view = (
    dataset.to_evaluation_patches(eval_key="eval_small_slices")
    .match(F("type") == "fp")
    .sort_by("small_slices.detections.confidence", reverse=True)
)

Our predictions are mostly accurate, but some ground truth labels are missing. Implementing human-in-the-loop (HITL) workflows can help correct this. We can then re-evaluate our models and train new ones with the updated data.


To maximize the effectiveness of SAHI, you may want to experiment with the following:

- Slicing hyperparameters, such as slice height and width, and overlap. 
- Base object detection models, as SAHI is compatible with many models, including YOLOv5 and Hugging Face Transformers models.
- Confidence thresholding (potentially on a class-by-class basis), to reduce the number of false positives.
- Post-processing techniques, such as [non-maximum suppression (NMS)](https://docs.voxel51.com/api/fiftyone.utils.labels.html#fiftyone.utils.labels.perform_nms), to reduce the number of overlapping detections.
- Human-in-the-loop (HITL) workflows, to correct ground truth labels.
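
As one illustration of class-by-class confidence thresholding, the sketch below filters a list of detections using a per-class threshold with a default fallback. The detections, class names, and threshold values are hypothetical, and the plain dictionaries stand in for real detection objects.

```python
def filter_by_class_threshold(detections, thresholds, default=0.25):
    """Keep detections whose confidence meets the threshold for their class."""
    return [
        det
        for det in detections
        if det["confidence"] >= thresholds.get(det["label"], default)
    ]


# Hypothetical detections and class names, for illustration only
detections = [
    {"label": "car", "confidence": 0.30},
    {"label": "car", "confidence": 0.80},
    {"label": "person", "confidence": 0.30},
]

# Stricter threshold for a class that produces many false positives
thresholds = {"car": 0.5}

kept = filter_by_class_threshold(detections, thresholds)
print([(d["label"], d["confidence"]) for d in kept])
```

Within FiftyOne, a similar effect can be achieved on a label field with `filter_labels()` and an expression on `F("confidence")`, applied once per class.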

You will also want to determine which evaluation metrics make the most sense for your use case!

##### Additional resources

- [Tutorial on Evaluating Object Detections](https://docs.voxel51.com/tutorials/evaluate_detections.html)
- [Tutorial on Finding Object Detection Mistakes](https://docs.voxel51.com/tutorials/detection_mistakes.html)
- [Tutorial on Fine-tuning YOLOv8 on Custom Data](https://docs.voxel51.com/tutorials/yolov8.html)
- [FiftyOne Plugin for Comparing Models on Specific Detections](https://github.com/allenleetc/model-comparison)


If you ever need assistance, have more complex questions, or want to keep in touch, feel free to join the Voxel51 community Discord server [here](https://discord.gg/QAyfnUhfpw).