# Dataset Analysis: Box Count Distribution & Outlier Identification

## Purpose

This notebook provides **offline, pre-training** analysis of the iSAID dataset to:

1. **Understand box-count distribution** per image across train and validation splits
2. **Identify outlier images** with unusually high numbers of bounding boxes
3. **Visualize** the distribution and most extreme samples
4. **Summarize class balance and bounding-box sizes** across both splits (Sections 11 and 12)

## Why This Analysis Matters

### Performance Impact of High Box-Count Images

Images with many bounding boxes can cause:

- **Memory spikes**: Ground-truth boxes feed into anchor and proposal matching in the RPN
  and ROI head, so per-image memory grows with box count. An image with 100+ boxes can need
  several times the GPU memory of an image with 10 boxes; the exact factor depends on the
  model, input size, and batch size.

- **Slow batches**: Training time per batch is dominated by the image with the most boxes.
  A batch containing one extreme outlier (e.g., 500 boxes) will be much slower than
  a batch of typical images (10-30 boxes).

- **Training instability**: Very dense images may produce noisy gradients, especially
  if they represent unusual scenes (e.g., parking lots with 200+ cars).

### Why Analyze Offline (Not During Training)

- **No runtime overhead**: Dataset analysis during training adds latency to every epoch.
- **Informed decisions**: Review outliers manually before deciding whether to exclude them.
- **Reproducibility**: Document which images are outliers without automatically modifying
  the training set.

---

**Important**: This notebook is **diagnostic only**. It does NOT automatically remove
or filter any images. The decision to exclude outliers is left to the user.


In [None]:
import os
import sys
import json
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path
project_root = Path(".").resolve().parent
sys.path.insert(0, str(project_root))

# Set plotting style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")

print(f"Project root: {project_root}")

## 1. Configuration

Set the path to your iSAID dataset and configure analysis parameters.


In [None]:
# Dataset configuration
DATA_ROOT = project_root / "iSAID_patches"

# Analysis parameters
TOP_N_OUTLIERS = 100  # Number of top outliers to identify
EXPORT_RESULTS = True  # Whether to export results to CSV/JSON
OUTPUT_DIR = project_root / "analysis_results"

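# Verify dataset exists (explicit raise, since asserts are skipped under `python -O`)
if not DATA_ROOT.exists():
    raise FileNotFoundError(f"Dataset not found at {DATA_ROOT}")

# Create the output directory up front: the plotting cells save figures
# before the export cell (Section 8) runs.
if EXPORT_RESULTS:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Dataset path: {DATA_ROOT}")
print(f"Output directory: {OUTPUT_DIR}")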

## 2. Load Annotation Files

Load the COCO-format annotation files for train and validation splits.
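
For orientation, the cells in this notebook read only a handful of COCO fields. The sketch below builds a tiny in-memory example and checks for those keys. `REQUIRED_KEYS` and `check_coco_keys` are hypothetical names introduced for illustration, and the toy data is invented; neither is part of the notebook or of any COCO tooling.

```python
# Fields this notebook reads from a COCO-style annotation file.
# REQUIRED_KEYS and check_coco_keys are illustrative names (assumptions).
REQUIRED_KEYS = {
    "images": {"id", "file_name"},
    "annotations": {"image_id", "category_id", "bbox"},
    "categories": {"id", "name"},
}


def check_coco_keys(annotations: dict) -> list:
    """Return a list of readable problems; an empty list means the layout looks usable."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in annotations:
            problems.append(f"missing top-level key: {section}")
            continue
        for i, entry in enumerate(annotations[section]):
            missing = keys - set(entry)
            if missing:
                problems.append(f"{section}[{i}] missing {sorted(missing)}")
    return problems


# Invented toy data with one image, one box, one category.
toy = {
    "images": [{"id": 1, "file_name": "a.png", "width": 800, "height": 800}],
    "annotations": [{"image_id": 1, "category_id": 3, "bbox": [10, 20, 30, 40]}],
    "categories": [{"id": 3, "name": "ship"}],
}
print(check_coco_keys(toy))  # []
```

Running a check like this on `train_anns` and `val_anns` right after loading would surface malformed entries before the later cells fail with a `KeyError`.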


In [None]:
def load_annotations(data_root: Path, split: str) -> dict:
    """
    Load COCO-format annotations for a given split.

    Args:
        data_root: Path to iSAID_patches directory
        split: 'train' or 'val'

    Returns:
        Dictionary containing annotations
    """
    ann_file = data_root / split / f"instances_only_filtered_{split}.json"

    if not ann_file.exists():
        raise FileNotFoundError(f"Annotation file not found: {ann_file}")

    print(f"Loading {split} annotations from {ann_file}...")
    with open(ann_file, "r") as f:
        annotations = json.load(f)

    print(f"  - Images: {len(annotations['images'])}")
    print(f"  - Annotations: {len(annotations['annotations'])}")
    print(f"  - Categories: {len(annotations['categories'])}")

    return annotations


# Load annotations for both splits
train_anns = load_annotations(DATA_ROOT, "train")
val_anns = load_annotations(DATA_ROOT, "val")

## 3. Count Bounding Boxes Per Image

Iterate over all annotations and count boxes per image.
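
As a small worked example of the counting step (on invented data, not iSAID), note that a `Counter` built from the annotation list only contains image ids that appear in at least one annotation. Looking the ids up from the image list is what keeps zero-box images in the result.

```python
from collections import Counter

# Toy data (invented for illustration): image 3 has no annotations.
images = [{"id": 1}, {"id": 2}, {"id": 3}]
annotations = [
    {"image_id": 1},
    {"image_id": 1},
    {"image_id": 1},
    {"image_id": 2},
]

counts = Counter(ann["image_id"] for ann in annotations)

# Iterate over the image list so images without boxes get an explicit 0.
boxes_per_image = {img["id"]: counts.get(img["id"], 0) for img in images}
print(boxes_per_image)  # {1: 3, 2: 1, 3: 0}
```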


In [None]:
def count_boxes_per_image(annotations: dict) -> pd.DataFrame:
    """
    Count the number of bounding boxes per image.

    Args:
        annotations: COCO-format annotations dictionary

    Returns:
        DataFrame with columns: image_id, file_name, num_boxes, width, height
    """
    # Count boxes per image (one annotation == one box)
    box_counts = Counter(ann["image_id"] for ann in annotations["annotations"])

    # Build DataFrame with all images (including those with 0 boxes)
    records = []
    for img in annotations["images"]:
        img_id = img["id"]
        records.append(
            {
                "image_id": img_id,
                "file_name": img["file_name"],
                "num_boxes": box_counts.get(img_id, 0),
                "width": img.get("width", 0),
                "height": img.get("height", 0),
            }
        )

    df = pd.DataFrame(records)
    df = df.sort_values("num_boxes", ascending=False).reset_index(drop=True)

    return df


# Count boxes for both splits
train_box_counts = count_boxes_per_image(train_anns)
val_box_counts = count_boxes_per_image(val_anns)

print(f"\nTrain split: {len(train_box_counts)} images")
print(f"Val split: {len(val_box_counts)} images")

## 4. Statistical Summary

Compute and display statistics for box counts in each split.


In [None]:
def compute_statistics(df: pd.DataFrame, split_name: str) -> dict:
    """
    Compute statistics for box counts.

    Args:
        df: DataFrame with 'num_boxes' column
        split_name: Name of the split for display

    Returns:
        Dictionary of statistics
    """
    box_counts = df["num_boxes"]

    stats = {
        "Split": split_name,
        "Total Images": len(df),
        "Total Boxes": box_counts.sum(),
        "Min": box_counts.min(),
        "Max": box_counts.max(),
        "Mean": box_counts.mean(),
        "Median": box_counts.median(),
        "Std": box_counts.std(),
        "Q25 (25th percentile)": box_counts.quantile(0.25),
        "Q75 (75th percentile)": box_counts.quantile(0.75),
        "Q90 (90th percentile)": box_counts.quantile(0.90),
        "Q95 (95th percentile)": box_counts.quantile(0.95),
        "Q99 (99th percentile)": box_counts.quantile(0.99),
        "Images with 0 boxes": (box_counts == 0).sum(),
        "Images with >50 boxes": (box_counts > 50).sum(),
        "Images with >100 boxes": (box_counts > 100).sum(),
        "Images with >200 boxes": (box_counts > 200).sum(),
    }

    return stats


# Compute statistics for both splits
train_stats = compute_statistics(train_box_counts, "Train")
val_stats = compute_statistics(val_box_counts, "Validation")

# Display as DataFrame for easy comparison
stats_df = pd.DataFrame([train_stats, val_stats]).set_index("Split").T
print("\n" + "=" * 60)
print("BOX COUNT STATISTICS")
print("=" * 60)
display(stats_df)

## 5. Distribution Visualization

Visualize the distribution of box counts per image.


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Training set histogram
ax1 = axes[0, 0]
ax1.hist(
    train_box_counts["num_boxes"],
    bins=50,
    edgecolor="black",
    alpha=0.7,
    color="steelblue",
)
ax1.axvline(
    train_box_counts["num_boxes"].mean(),
    color="red",
    linestyle="--",
    label=f"Mean: {train_box_counts['num_boxes'].mean():.1f}",
)
ax1.axvline(
    train_box_counts["num_boxes"].median(),
    color="orange",
    linestyle="--",
    label=f"Median: {train_box_counts['num_boxes'].median():.1f}",
)
ax1.set_xlabel("Number of Boxes per Image")
ax1.set_ylabel("Frequency")
ax1.set_title("Train Set: Box Count Distribution")
ax1.legend()

# Validation set histogram
ax2 = axes[0, 1]
ax2.hist(
    val_box_counts["num_boxes"], bins=50, edgecolor="black", alpha=0.7, color="seagreen"
)
ax2.axvline(
    val_box_counts["num_boxes"].mean(),
    color="red",
    linestyle="--",
    label=f"Mean: {val_box_counts['num_boxes'].mean():.1f}",
)
ax2.axvline(
    val_box_counts["num_boxes"].median(),
    color="orange",
    linestyle="--",
    label=f"Median: {val_box_counts['num_boxes'].median():.1f}",
)
ax2.set_xlabel("Number of Boxes per Image")
ax2.set_ylabel("Frequency")
ax2.set_title("Validation Set: Box Count Distribution")
ax2.legend()

# Log-scale histogram (combined)
ax3 = axes[1, 0]
ax3.hist(
    train_box_counts["num_boxes"], bins=50, alpha=0.6, label="Train", color="steelblue"
)
ax3.hist(
    val_box_counts["num_boxes"],
    bins=50,
    alpha=0.6,
    label="Validation",
    color="seagreen",
)
ax3.set_xlabel("Number of Boxes per Image")
ax3.set_ylabel("Frequency (log scale)")
ax3.set_yscale("log")
ax3.set_title("Combined Distribution (Log Scale)")
ax3.legend()

# Box plot comparison
ax4 = axes[1, 1]
combined_data = pd.DataFrame(
    {
        "Split": ["Train"] * len(train_box_counts)
        + ["Validation"] * len(val_box_counts),
        "Box Count": list(train_box_counts["num_boxes"])
        + list(val_box_counts["num_boxes"]),
    }
)
sns.boxplot(
    data=combined_data,
    x="Split",
    y="Box Count",
    ax=ax4,
    palette=["steelblue", "seagreen"],
)
ax4.set_title("Box Count Distribution by Split")

plt.tight_layout()
if EXPORT_RESULTS:
    plt.savefig(OUTPUT_DIR / "box_count_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

## 6. Outlier Identification

Identify the top N images with the highest number of bounding boxes.

### Why These Images May Cause Issues

- **GPU Memory**: Each ground-truth box adds matching targets and ROI features. Images with
  300+ boxes can need substantially more GPU memory than typical images; the exact overhead
  depends on the model and input size, so measure it on your own hardware.
- **Batch Time**: With `batch_size=4`, if one image has 500 boxes while the others have 20,
  the forward pass time is dominated by the 500-box image.
- **Gradient Noise**: Very dense images may not be representative of the test distribution,
  potentially degrading model generalization.
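
A fixed top-N list always returns N images, even when the tail of the distribution is short. As a complementary view, a cut-off derived from the distribution itself can be used. The sketch below applies the common Tukey rule (Q3 + k × IQR) alongside a percentile threshold on invented counts. `flag_outliers`, `k`, and `pct` are illustrative assumptions and are not used by this notebook's cells.

```python
import pandas as pd


def flag_outliers(num_boxes: pd.Series, k: float = 1.5, pct: float = 0.99) -> pd.DataFrame:
    """Flag values above Q3 + k*IQR and above the pct-quantile (illustrative helper)."""
    q1 = num_boxes.quantile(0.25)
    q3 = num_boxes.quantile(0.75)
    iqr_cutoff = q3 + k * (q3 - q1)
    pct_cutoff = num_boxes.quantile(pct)
    return pd.DataFrame(
        {
            "num_boxes": num_boxes,
            "above_iqr": num_boxes > iqr_cutoff,
            "above_pct": num_boxes > pct_cutoff,
        }
    )


# Invented box counts: seven typical images and one very dense image.
toy = pd.Series([5, 8, 10, 12, 15, 18, 20, 400])
flags = flag_outliers(toy)
print(flags[flags["above_iqr"]])
```

On heavily skewed data the IQR rule may flag a large share of images, so the top-N list and a threshold are best read together rather than treated as interchangeable.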


In [None]:
def get_top_outliers(df: pd.DataFrame, top_n: int = 100) -> pd.DataFrame:
    """
    Get the top N images with the most bounding boxes.

    Args:
        df: DataFrame with box counts, sorted by `num_boxes` in descending order
            (as returned by `count_boxes_per_image`)
        top_n: Number of outliers to return

    Returns:
        DataFrame with top outliers
    """
    return df.head(top_n).copy()


# Get top outliers for both splits
train_outliers = get_top_outliers(train_box_counts, TOP_N_OUTLIERS)
val_outliers = get_top_outliers(val_box_counts, TOP_N_OUTLIERS)

print("\n" + "=" * 60)
print(f"TOP {TOP_N_OUTLIERS} TRAIN IMAGES (HIGHEST BOX COUNTS)")
print("=" * 60)
display(train_outliers.head(20))  # Show first 20

print("\n" + "=" * 60)
print(f"TOP {TOP_N_OUTLIERS} VALIDATION IMAGES (HIGHEST BOX COUNTS)")
print("=" * 60)
display(val_outliers.head(20))  # Show first 20

## 7. Bar Plot of Top Outliers


In [None]:
def plot_top_outliers(df: pd.DataFrame, top_n: int = 30, title: str = ""):
    """
    Create a bar plot of top N outliers by box count.
    """
    top_df = df.head(top_n)

    fig, ax = plt.subplots(figsize=(14, 8))

    bars = ax.barh(
        range(len(top_df)),
        top_df["num_boxes"],
        color="coral",
        edgecolor="darkred",
        alpha=0.8,
    )

    # Add value labels next to each bar
    for i, v in enumerate(top_df["num_boxes"]):
        ax.text(v + 2, i, f"{v}", va="center", fontsize=8)

    ax.set_yticks(range(len(top_df)))
    ax.set_yticklabels(
        [f[:30] + "..." if len(f) > 30 else f for f in top_df["file_name"]], fontsize=8
    )
    ax.invert_yaxis()
    ax.set_xlabel("Number of Bounding Boxes")
    ax.set_ylabel("Image File")
    ax.set_title(title)
    ax.axvline(
        df["num_boxes"].mean(),
        color="blue",
        linestyle="--",
        alpha=0.7,
        label=f'Mean: {df["num_boxes"].mean():.1f}',
    )
    ax.legend(loc="lower right")

    plt.tight_layout()
    return fig


# Plot top 30 outliers for each split
fig1 = plot_top_outliers(
    train_box_counts, top_n=30, title="Train Set: Top 30 Images by Box Count"
)
if EXPORT_RESULTS:
    fig1.savefig(OUTPUT_DIR / "train_top_outliers.png", dpi=150, bbox_inches="tight")
plt.show()

fig2 = plot_top_outliers(
    val_box_counts, top_n=30, title="Validation Set: Top 30 Images by Box Count"
)
if EXPORT_RESULTS:
    fig2.savefig(OUTPUT_DIR / "val_top_outliers.png", dpi=150, bbox_inches="tight")
plt.show()

## 8. Export Results

Save outlier lists and statistics for future reference.


In [None]:
if EXPORT_RESULTS:
    # Create output directory
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # Export train outliers (file name follows TOP_N_OUTLIERS)
    train_outliers_path = OUTPUT_DIR / f"train_outliers_top{TOP_N_OUTLIERS}.csv"
    train_outliers.to_csv(train_outliers_path, index=False)
    print(f"Saved train outliers to: {train_outliers_path}")

    # Export validation outliers
    val_outliers_path = OUTPUT_DIR / f"val_outliers_top{TOP_N_OUTLIERS}.csv"
    val_outliers.to_csv(val_outliers_path, index=False)
    print(f"Saved validation outliers to: {val_outliers_path}")

    # Export full box counts
    train_box_counts.to_csv(OUTPUT_DIR / "train_all_box_counts.csv", index=False)
    val_box_counts.to_csv(OUTPUT_DIR / "val_all_box_counts.csv", index=False)
    print(f"Saved full box counts to: {OUTPUT_DIR}")

    # Export statistics as JSON
    stats_export = {
        "train": train_stats,
        "validation": val_stats,
    }
    with open(OUTPUT_DIR / "box_count_statistics.json", "w") as f:
        json.dump(stats_export, f, indent=2, default=float)
    print(f"Saved statistics to: {OUTPUT_DIR / 'box_count_statistics.json'}")

    print(f"\n✓ All results exported to: {OUTPUT_DIR}")
else:
    print("Export disabled. Set EXPORT_RESULTS = True to save results.")

## 9. Visualize Extreme Outliers (Optional)

View a few of the most extreme images with their bounding box overlays.

**Note**: This is read-only visualization - no modifications are made to the dataset.


In [None]:
from PIL import Image
import cv2


def visualize_image_with_boxes(
    data_root: Path, split: str, image_info: dict, annotations: dict, figsize=(12, 10)
):
    """
    Visualize an image with its bounding boxes overlaid.

    This is read-only visualization - the dataset is not modified.

    Args:
        data_root: Path to iSAID_patches
        split: 'train' or 'val'
        image_info: Dict with 'image_id', 'file_name', 'num_boxes'
        annotations: Full annotations dict
    """
    img_path = data_root / split / "images" / image_info["file_name"]

    if not img_path.exists():
        print(f"Image not found: {img_path}")
        return

    # Load image
    img = np.array(Image.open(img_path).convert("RGB"))

    # Get annotations for this image
    img_id = image_info["image_id"]
    img_anns = [ann for ann in annotations["annotations"] if ann["image_id"] == img_id]

    # Draw boxes
    img_with_boxes = img.copy()
    for ann in img_anns:
        x, y, w, h = [int(v) for v in ann["bbox"]]
        cv2.rectangle(img_with_boxes, (x, y), (x + w, y + h), (255, 0, 0), 2)

    # Display
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    axes[0].imshow(img)
    axes[0].set_title("Original Image")
    axes[0].axis("off")

    axes[1].imshow(img_with_boxes)
    axes[1].set_title(f"With Boxes (n={image_info['num_boxes']})")
    axes[1].axis("off")

    plt.suptitle(
        f"File: {image_info['file_name']}\nImage ID: {image_info['image_id']} | Boxes: {image_info['num_boxes']}",
        fontsize=12,
    )
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize top 3 most extreme train images
print("\n" + "=" * 60)
print("VISUALIZING TOP 3 TRAIN OUTLIERS")
print("=" * 60)

for i in range(min(3, len(train_outliers))):
    row = train_outliers.iloc[i]
    visualize_image_with_boxes(
        DATA_ROOT,
        "train",
        {
            "image_id": row["image_id"],
            "file_name": row["file_name"],
            "num_boxes": row["num_boxes"],
        },
        train_anns,
    )

In [None]:
# Visualize top 3 most extreme validation images
print("\n" + "=" * 60)
print("VISUALIZING TOP 3 VALIDATION OUTLIERS")
print("=" * 60)

for i in range(min(3, len(val_outliers))):
    row = val_outliers.iloc[i]
    visualize_image_with_boxes(
        DATA_ROOT,
        "val",
        {
            "image_id": row["image_id"],
            "file_name": row["file_name"],
            "num_boxes": row["num_boxes"],
        },
        val_anns,
    )

## 10. Summary & Recommendations

Based on the analysis above, you may consider the following actions:

### Potential Actions (User Decision Required)

1. **Do Nothing**: If outliers are few and memory/time is acceptable, keep all images.

2. **Exclude Extreme Outliers**: Remove images with >N boxes (e.g., N=200 or N=300)
   from training. This can be done by:
   - Creating a custom dataset wrapper that filters by image ID
   - Or modifying the annotation file to exclude specific images

3. **Dynamic Batching**: Use a custom sampler that groups images by box count
   to avoid mixing extreme outliers with normal images in the same batch.

4. **Increase GPU Memory**: If possible, use a GPU with more VRAM to handle
   dense images without OOM errors.
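
Options 2 and 3 above can be sketched in plain Python. `filter_by_max_boxes` builds a new COCO-style dict in memory (no file is touched), and `bucket_batches` orders image ids by box count before chunking them into batches. Both function names, `max_boxes`, and the toy data are hypothetical; this is a minimal sketch, not an integration with any particular training framework.

```python
from collections import Counter


def filter_by_max_boxes(annotations: dict, max_boxes: int) -> dict:
    """Return a new COCO-style dict without images that exceed max_boxes boxes."""
    counts = Counter(ann["image_id"] for ann in annotations["annotations"])
    keep = {
        img["id"]
        for img in annotations["images"]
        if counts.get(img["id"], 0) <= max_boxes
    }
    filtered = dict(annotations)  # shallow copy; the input dict is left untouched
    filtered["images"] = [img for img in annotations["images"] if img["id"] in keep]
    filtered["annotations"] = [
        ann for ann in annotations["annotations"] if ann["image_id"] in keep
    ]
    return filtered


def bucket_batches(box_counts: dict, batch_size: int) -> list:
    """Sort image ids by box count and chunk them, so dense images share batches."""
    ordered = sorted(box_counts, key=box_counts.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Invented toy data: image 1 has 2 boxes, image 2 has 5, image 3 has none.
toy = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [{"image_id": 1}] * 2 + [{"image_id": 2}] * 5,
    "categories": [],
}
filtered = filter_by_max_boxes(toy, max_boxes=3)
print([img["id"] for img in filtered["images"]])  # [1, 3]
print(bucket_batches({1: 2, 2: 5, 3: 0, 4: 300}, batch_size=2))  # [[3, 1], [2, 4]]
```

A fully sorted order removes most of the randomness between epochs, so a practical sampler would likely shuffle the order of the resulting batches, or bucket into coarse ranges and shuffle within each bucket.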

### Files Generated

If `EXPORT_RESULTS = True`, the following files are created in `analysis_results/`:

- `train_outliers_top100.csv`: Top 100 train images with highest box counts
- `val_outliers_top100.csv`: Top 100 validation images with highest box counts
- `train_all_box_counts.csv`: Full box counts for all train images
- `val_all_box_counts.csv`: Full box counts for all validation images
- `box_count_statistics.json`: Statistical summary for both splits
- `box_count_distribution.png`: Distribution visualization
- `train_top_outliers.png`: Bar plot of train outliers
- `val_top_outliers.png`: Bar plot of validation outliers
- `class_distribution.png`: Class distribution plots (Section 11)
- `bbox_size_distribution.png`: Bounding box size plots (Section 12)
- `bbox_scale_distribution.png`: Small/medium/large object counts (Section 12)


In [None]:
# Print final summary
print("\n" + "=" * 60)
print("ANALYSIS COMPLETE")
print("=" * 60)
print(f"\nTrain Set:")
print(f"  - Total images: {len(train_box_counts)}")
print(
    f"  - Box count range: {train_box_counts['num_boxes'].min()} - {train_box_counts['num_boxes'].max()}"
)
print(f"  - Mean boxes/image: {train_box_counts['num_boxes'].mean():.1f}")
print(f"  - Images with >100 boxes: {(train_box_counts['num_boxes'] > 100).sum()}")

print(f"\nValidation Set:")
print(f"  - Total images: {len(val_box_counts)}")
print(
    f"  - Box count range: {val_box_counts['num_boxes'].min()} - {val_box_counts['num_boxes'].max()}"
)
print(f"  - Mean boxes/image: {val_box_counts['num_boxes'].mean():.1f}")
print(f"  - Images with >100 boxes: {(val_box_counts['num_boxes'] > 100).sum()}")

print(f"\nRemember: This analysis is diagnostic only.")
print(f"   The decision to exclude outliers is left to the user.")

## 11. Class Distribution Analysis

Analyze the distribution of object classes across train and validation splits to identify class imbalance.
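
One detail worth checking: counting from the annotation list only surfaces categories that occur at least once, so a class with no instances in a split would silently drop out of the table. The sketch below, on invented data, reindexes the counts against the category list so that empty classes appear with a count of 0. It is an illustration under that assumption, not part of the analysis cells.

```python
import pandas as pd

# Invented toy data: category 3 is declared but never annotated.
categories = [
    {"id": 1, "name": "ship"},
    {"id": 2, "name": "plane"},
    {"id": 3, "name": "bridge"},
]
annotations = [{"category_id": 1}] * 6 + [{"category_id": 2}] * 2

names = {cat["id"]: cat["name"] for cat in categories}
counts = (
    pd.Series([ann["category_id"] for ann in annotations])
    .value_counts()
    .reindex(list(names), fill_value=0)  # keep categories with no instances
    .rename(index=names)
)
share = (counts / counts.sum() * 100).round(1)

print(counts.to_dict())  # {'ship': 6, 'plane': 2, 'bridge': 0}
print(share.to_dict())   # {'ship': 75.0, 'plane': 25.0, 'bridge': 0.0}
```

With an empty class present, a max/min imbalance ratio is undefined (division by zero), so empty classes are better reported separately from the ratio.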


In [None]:
def analyze_class_distribution(annotations: dict, split_name: str) -> pd.DataFrame:
    """
    Analyze the distribution of object classes in the dataset.

    Args:
        annotations: COCO-format annotations dictionary
        split_name: Name of the split ('train' or 'val')

    Returns:
        DataFrame with class counts and percentages
    """
    # Create category mapping
    cat_id_to_name = {cat["id"]: cat["name"] for cat in annotations["categories"]}

    # Count instances per category
    category_counts = Counter()
    for ann in annotations["annotations"]:
        category_counts[ann["category_id"]] += 1

    # Build DataFrame
    records = []
    for cat_id, count in category_counts.items():
        records.append(
            {
                "category_id": cat_id,
                "category_name": cat_id_to_name.get(cat_id, f"Unknown_{cat_id}"),
                "count": count,
                "split": split_name,
            }
        )

    df = (
        pd.DataFrame(records)
        .sort_values("count", ascending=False)
        .reset_index(drop=True)
    )
    df["percentage"] = (df["count"] / df["count"].sum() * 100).round(2)

    return df


# Analyze both splits
train_class_dist = analyze_class_distribution(train_anns, "Train")
val_class_dist = analyze_class_distribution(val_anns, "Validation")

# Combine for comparison
combined_class_dist = pd.concat([train_class_dist, val_class_dist], ignore_index=True)

print("\n" + "=" * 60)
print("CLASS DISTRIBUTION - TRAIN SET")
print("=" * 60)
print(train_class_dist.to_string(index=False))

print("\n" + "=" * 60)
print("CLASS DISTRIBUTION - VALIDATION SET")
print("=" * 60)
print(val_class_dist.to_string(index=False))

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Train set bar chart
ax1 = axes[0, 0]
train_plot_data = train_class_dist.sort_values("count", ascending=True)
ax1.barh(
    train_plot_data["category_name"],
    train_plot_data["count"],
    color="steelblue",
    alpha=0.8,
)
ax1.set_xlabel("Number of Instances")
ax1.set_title("Train Set: Instance Count by Class")
ax1.grid(axis="x", alpha=0.3)

# Validation set bar chart
ax2 = axes[0, 1]
val_plot_data = val_class_dist.sort_values("count", ascending=True)
ax2.barh(
    val_plot_data["category_name"], val_plot_data["count"], color="seagreen", alpha=0.8
)
ax2.set_xlabel("Number of Instances")
ax2.set_title("Validation Set: Instance Count by Class")
ax2.grid(axis="x", alpha=0.3)

# Grouped (side-by-side) comparison
ax3 = axes[1, 0]
# Merge train and val data for comparison
comparison_df = (
    train_class_dist[["category_name", "count"]]
    .merge(
        val_class_dist[["category_name", "count"]],
        on="category_name",
        suffixes=("_train", "_val"),
        how="outer",
    )
    .fillna(0)
)
comparison_df = comparison_df.sort_values("count_train", ascending=True)

x = np.arange(len(comparison_df))
width = 0.35
ax3.barh(
    x - width / 2,
    comparison_df["count_train"],
    width,
    label="Train",
    color="steelblue",
    alpha=0.8,
)
ax3.barh(
    x + width / 2,
    comparison_df["count_val"],
    width,
    label="Validation",
    color="seagreen",
    alpha=0.8,
)
ax3.set_yticks(x)
ax3.set_yticklabels(comparison_df["category_name"])
ax3.set_xlabel("Number of Instances")
ax3.set_title("Class Distribution Comparison: Train vs Validation")
ax3.legend()
ax3.grid(axis="x", alpha=0.3)

# Percentage distribution pie chart (train set only)
ax4 = axes[1, 1]
# Show the top 8 classes and group the rest as "Others"
top_n_classes = 8
train_top = train_class_dist.head(top_n_classes).copy()
train_others = pd.DataFrame(
    [
        {
            "category_name": "Others",
            "percentage": train_class_dist.iloc[top_n_classes:]["percentage"].sum(),
        }
    ]
)
train_pie_data = pd.concat([train_top, train_others], ignore_index=True)

colors = plt.cm.Set3(np.linspace(0, 1, len(train_pie_data)))
wedges, texts, autotexts = ax4.pie(
    train_pie_data["percentage"],
    labels=None,  # Remove labels from pie chart to avoid overlap
    autopct="%1.1f%%",
    startangle=90,
    colors=colors,
    pctdistance=0.85,
)
ax4.set_title("Train Set: Class Distribution (Top 8 + Others)", pad=20)

# Make percentage text more readable
for autotext in autotexts:
    autotext.set_color("white")
    autotext.set_fontsize(10)
    autotext.set_weight("bold")

# Add legend outside the pie chart with class names
ax4.legend(
    wedges,
    train_pie_data["category_name"],
    loc="center left",
    bbox_to_anchor=(1, 0, 0.5, 1),
    title="Classes",
    fontsize=10,
)

plt.tight_layout()
if EXPORT_RESULTS:
    plt.savefig(OUTPUT_DIR / "class_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

# Print class imbalance statistics
print("\n" + "=" * 60)
print("CLASS IMBALANCE ANALYSIS")
print("=" * 60)
max_train = train_class_dist["count"].max()
min_train = train_class_dist["count"].min()
print(f"\nTrain Set:")
print(
    f"  - Most frequent class: {train_class_dist.iloc[0]['category_name']} ({max_train} instances)"
)
print(
    f"  - Least frequent class: {train_class_dist.iloc[-1]['category_name']} ({min_train} instances)"
)
print(f"  - Imbalance ratio: {max_train/min_train:.2f}:1")

max_val = val_class_dist["count"].max()
min_val = val_class_dist["count"].min()
print(f"\nValidation Set:")
print(
    f"  - Most frequent class: {val_class_dist.iloc[0]['category_name']} ({max_val} instances)"
)
print(
    f"  - Least frequent class: {val_class_dist.iloc[-1]['category_name']} ({min_val} instances)"
)
print(f"  - Imbalance ratio: {max_val/min_val:.2f}:1")

## 12. Bounding Box Size Distribution

Analyze the distribution of bounding box sizes (width, height, and area) to understand object scale variation.
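
COCO stores each box as `[x, y, width, height]`, with `(x, y)` the top-left corner. The sketch below derives the quantities analyzed in this section from a few made-up boxes using NumPy; the variable names are illustrative.

```python
import numpy as np

# Made-up boxes in COCO layout: [x, y, width, height]
boxes = np.array(
    [
        [10, 20, 30, 40],
        [0, 0, 100, 50],
        [5, 5, 16, 16],
    ],
    dtype=float,
)

width, height = boxes[:, 2], boxes[:, 3]
area = width * height

# Guard against degenerate zero-height boxes instead of dividing by zero;
# such boxes get an aspect ratio of 0.
aspect_ratio = np.divide(width, height, out=np.zeros_like(width), where=height > 0)

# Corner format [x1, y1, x2, y2], which many drawing and detection APIs expect.
corners = np.column_stack(
    [boxes[:, 0], boxes[:, 1], boxes[:, 0] + width, boxes[:, 1] + height]
)

print(area.tolist())          # [1200.0, 5000.0, 256.0]
print(aspect_ratio.tolist())  # [0.75, 2.0, 1.0]
print(corners[0].tolist())    # [10.0, 20.0, 40.0, 60.0]
```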


In [None]:
def analyze_bbox_sizes(annotations: dict, split_name: str) -> pd.DataFrame:
    """
    Analyze bounding box dimensions and areas.

    Args:
        annotations: COCO-format annotations dictionary
        split_name: Name of the split ('train' or 'val')

    Returns:
        DataFrame with bbox dimensions, areas, and aspect ratios
    """
    # Extract bbox information
    records = []
    for ann in annotations["annotations"]:
        bbox = ann["bbox"]  # [x, y, width, height]
        width = bbox[2]
        height = bbox[3]
        area = width * height
        aspect_ratio = width / height if height > 0 else 0

        records.append(
            {
                "width": width,
                "height": height,
                "area": area,
                "aspect_ratio": aspect_ratio,
                "split": split_name,
                "category_id": ann["category_id"],
            }
        )

    return pd.DataFrame(records)


# Analyze bbox sizes for both splits
train_bbox_sizes = analyze_bbox_sizes(train_anns, "Train")
val_bbox_sizes = analyze_bbox_sizes(val_anns, "Validation")

print("\n" + "=" * 60)
print("BOUNDING BOX SIZE STATISTICS")
print("=" * 60)

for name, df in [("Train", train_bbox_sizes), ("Validation", val_bbox_sizes)]:
    print(f"\n{name} Set:")
    print(
        f"  Width  - Mean: {df['width'].mean():.1f}, Median: {df['width'].median():.1f}, "
        f"Range: [{df['width'].min():.1f}, {df['width'].max():.1f}]"
    )
    print(
        f"  Height - Mean: {df['height'].mean():.1f}, Median: {df['height'].median():.1f}, "
        f"Range: [{df['height'].min():.1f}, {df['height'].max():.1f}]"
    )
    print(
        f"  Area   - Mean: {df['area'].mean():.1f}, Median: {df['area'].median():.1f}, "
        f"Range: [{df['area'].min():.1f}, {df['area'].max():.1f}]"
    )
    print(
        f"  Aspect Ratio - Mean: {df['aspect_ratio'].mean():.2f}, Median: {df['aspect_ratio'].median():.2f}"
    )

In [None]:
# Visualize bounding box size distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Width distribution
ax1 = axes[0, 0]
ax1.hist(
    train_bbox_sizes["width"],
    bins=50,
    alpha=0.6,
    label="Train",
    color="steelblue",
    density=True,
)
ax1.hist(
    val_bbox_sizes["width"],
    bins=50,
    alpha=0.6,
    label="Validation",
    color="seagreen",
    density=True,
)
ax1.set_xlabel("Width (pixels)")
ax1.set_ylabel("Density")
ax1.set_title("Bounding Box Width Distribution")
ax1.legend()
ax1.grid(alpha=0.3)

# Height distribution
ax2 = axes[0, 1]
ax2.hist(
    train_bbox_sizes["height"],
    bins=50,
    alpha=0.6,
    label="Train",
    color="steelblue",
    density=True,
)
ax2.hist(
    val_bbox_sizes["height"],
    bins=50,
    alpha=0.6,
    label="Validation",
    color="seagreen",
    density=True,
)
ax2.set_xlabel("Height (pixels)")
ax2.set_ylabel("Density")
ax2.set_title("Bounding Box Height Distribution")
ax2.legend()
ax2.grid(alpha=0.3)

# Area distribution (log scale)
ax3 = axes[0, 2]
ax3.hist(
    np.log10(train_bbox_sizes["area"] + 1),
    bins=50,
    alpha=0.6,
    label="Train",
    color="steelblue",
    density=True,
)
ax3.hist(
    np.log10(val_bbox_sizes["area"] + 1),
    bins=50,
    alpha=0.6,
    label="Validation",
    color="seagreen",
    density=True,
)
ax3.set_xlabel("Log10(Area) (pixels²)")
ax3.set_ylabel("Density")
ax3.set_title("Bounding Box Area Distribution (Log Scale)")
ax3.legend()
ax3.grid(alpha=0.3)

# Aspect ratio distribution
ax4 = axes[1, 0]
# Filter extreme aspect ratios for better visualization
train_ar = train_bbox_sizes[train_bbox_sizes["aspect_ratio"] < 10]["aspect_ratio"]
val_ar = val_bbox_sizes[val_bbox_sizes["aspect_ratio"] < 10]["aspect_ratio"]
ax4.hist(train_ar, bins=50, alpha=0.6, label="Train", color="steelblue", density=True)
ax4.hist(val_ar, bins=50, alpha=0.6, label="Validation", color="seagreen", density=True)
ax4.set_xlabel("Aspect Ratio (width/height)")
ax4.set_ylabel("Density")
ax4.set_title("Aspect Ratio Distribution (filtered < 10)")
ax4.legend()
ax4.grid(alpha=0.3)

# Width vs Height scatter (sampled for performance)
ax5 = axes[1, 1]
sample_size = min(5000, len(train_bbox_sizes))
train_sample = train_bbox_sizes.sample(n=sample_size, random_state=42)
val_sample = val_bbox_sizes.sample(
    n=min(sample_size, len(val_bbox_sizes)), random_state=42
)
ax5.scatter(
    train_sample["width"],
    train_sample["height"],
    alpha=0.3,
    s=1,
    label="Train",
    color="steelblue",
)
ax5.scatter(
    val_sample["width"],
    val_sample["height"],
    alpha=0.3,
    s=1,
    label="Validation",
    color="seagreen",
)
ax5.plot([0, 800], [0, 800], "r--", alpha=0.5, linewidth=1, label="Square (1:1)")
ax5.set_xlabel("Width (pixels)")
ax5.set_ylabel("Height (pixels)")
ax5.set_title("Width vs Height (sampled)")
ax5.legend()
ax5.grid(alpha=0.3)
ax5.set_xlim(0, 800)
ax5.set_ylim(0, 800)

# Box plot for area comparison
ax6 = axes[1, 2]
combined_bbox = pd.concat([train_bbox_sizes, val_bbox_sizes], ignore_index=True)
sns.boxplot(
    data=combined_bbox, x="split", y="area", ax=ax6, palette=["steelblue", "seagreen"]
)
ax6.set_yscale("log")
ax6.set_ylabel("Area (pixels²) - Log Scale")
ax6.set_xlabel("Split")
ax6.set_title("Bounding Box Area Distribution by Split")
ax6.grid(alpha=0.3)

plt.tight_layout()
if EXPORT_RESULTS:
    plt.savefig(OUTPUT_DIR / "bbox_size_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

In [None]:
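# Analyze bbox sizes by object scale categories (COCO-style thresholds).
# Note: COCO evaluation buckets objects by the annotation's `area` field (mask area
# for segmentation annotations); this cell uses bounding-box area (width * height).
def categorize_by_scale(area):
    """Categorize a box by COCO's small/medium/large area thresholds."""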
    if area < 32**2:
        return "small"
    elif area < 96**2:
        return "medium"
    else:
        return "large"


train_bbox_sizes["scale"] = train_bbox_sizes["area"].apply(categorize_by_scale)
val_bbox_sizes["scale"] = val_bbox_sizes["area"].apply(categorize_by_scale)

# Count by scale
train_scale_counts = train_bbox_sizes["scale"].value_counts()
val_scale_counts = val_bbox_sizes["scale"].value_counts()

print("\n" + "=" * 60)
print("OBJECT SCALE DISTRIBUTION (COCO Categories)")
print("=" * 60)
print("\nScale Definitions:")
print("  - Small:  area < 32²  (< 1,024 pixels²)")
print("  - Medium: 32² ≤ area < 96² (1,024 - 9,216 pixels²)")
print("  - Large:  area ≥ 96² (≥ 9,216 pixels²)")

print(f"\nTrain Set:")
for scale in ["small", "medium", "large"]:
    count = train_scale_counts.get(scale, 0)
    pct = count / len(train_bbox_sizes) * 100
    print(f"  {scale.capitalize()}: {count:,} ({pct:.1f}%)")

print(f"\nValidation Set:")
for scale in ["small", "medium", "large"]:
    count = val_scale_counts.get(scale, 0)
    pct = count / len(val_bbox_sizes) * 100
    print(f"  {scale.capitalize()}: {count:,} ({pct:.1f}%)")

# Visualize scale distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Train set
ax1 = axes[0]
scale_order = ["small", "medium", "large"]
train_counts = [train_scale_counts.get(s, 0) for s in scale_order]
ax1.bar(
    scale_order, train_counts, color=["lightcoral", "skyblue", "lightgreen"], alpha=0.8
)
ax1.set_ylabel("Number of Instances")
ax1.set_title("Train Set: Object Scale Distribution")
ax1.grid(axis="y", alpha=0.3)
for i, (scale, count) in enumerate(zip(scale_order, train_counts)):
    pct = count / len(train_bbox_sizes) * 100
    ax1.text(i, count, f"{count:,}\n({pct:.1f}%)", ha="center", va="bottom")

# Validation set
ax2 = axes[1]
val_counts = [val_scale_counts.get(s, 0) for s in scale_order]
ax2.bar(
    scale_order, val_counts, color=["lightcoral", "skyblue", "lightgreen"], alpha=0.8
)
ax2.set_ylabel("Number of Instances")
ax2.set_title("Validation Set: Object Scale Distribution")
ax2.grid(axis="y", alpha=0.3)
for i, (scale, count) in enumerate(zip(scale_order, val_counts)):
    pct = count / len(val_bbox_sizes) * 100
    ax2.text(i, count, f"{count:,}\n({pct:.1f}%)", ha="center", va="bottom")

plt.tight_layout()
if EXPORT_RESULTS:
    plt.savefig(
        OUTPUT_DIR / "bbox_scale_distribution.png", dpi=150, bbox_inches="tight"
    )
plt.show()