# **تمكّن - Tamakkan** Graduation project

## Intelligent platform for driver behavior analysis, training support, and automated license testing.

## **Group Members:**

- ### **Samar Rafat Kintab 443003122**
- ### **Lina Mohammad Bader 444000417**
- ### **Lamar Bandar Felemban 444003576**
- ### **Bashair Fahad Al-jabri 444004184**

## **Supervised By: Dr. Eiman Talal Al-Harby**


# Pipeline
25 FEB 2026
Video Input
    ├── YOLO (BDD100K + BSTLD fine-tune) ──▶ objects, lights, signs
    ├── ByteTrack ──▶ object trajectories ──▶ near-miss logic
    ├── UFLD or LaneATT (CULane) ──▶ lane masks ──▶ deviation logic
    ├── Depth Anything / MiDaS (KITTI fine-tune) ──▶ distance estimation
    └── Optical Flow (RAFT) ──▶ motion estimation [optional v2]

Phone GPS/IMU
    └── Signal processing ──▶ speed, harsh braking

Aggregation Layer  ← you need to design this
    └── Event classifier + Scoring engine ──▶ session report

# **BDD100K Dataset**


### Imports


In [None]:
import os
import json
import glob
import random
import shutil
from pathlib import Path
from collections import Counter, defaultdict
import numpy as np
import pandas as pd
from tqdm import tqdm 
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import Rectangle
from PIL import Image





## EDA + Preprocessing

### Configurations


In [2]:
# !! CHANGE THESE PATHS TO MATCH YOUR LAPTOP !!
CONFIG = {
    # ── Image folders (one parent, with train/ val/ test/ inside) ──
    "train_img_root": r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_images_100k\100k\train",
    "val_img_root":   r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_images_100k\100k\val",
    "test_img_root":  r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_images_100k\100k\test",

    # ── Label (JSON) folders (separate location, same train/val/test structure) ──
    "train_lbl_root": r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_labels\100k\train",
    "val_lbl_root":   r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_labels\100k\val",
    "test_lbl_root":  r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\bdd100k_labels\100k\test",

    # ── Output YOLO dataset folder ──
    "output_dir": r"C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset",

    # ── What we remove + final target data size ──
    "class_to_remove": "train",
    "target_total_size": 30000,   # total across train+val+test

    # Keep split proportions stable
    "keep_split_ratio": True,

    # Safety floors for val/test (so we don't end up with too few images)
    "min_val":  2000,
    "min_test": 2000,

    # Image types
    "image_extensions": [".jpg", ".jpeg", ".png"],

    # Random seed (reproducible)
    "seed": 42,

    # Sampling strategy
    "strategy": "information_maximizing",

    # Scoring knobs
    "score_weights": {
        "rarity":      6.0,
        "diversity":   3.0,
        "density":     0.08,
        "night":       1.0,
        "dawn_dusk":   2.0,
        "bad_weather": 1.5,
        "rare_scene":  1.0,
    },

    "scene_bonus_list":   ["residential", "parking lot", "tunnel", "gas stations"],
    "weather_bonus_list": ["rainy", "snowy", "foggy"],
}

random.seed(CONFIG["seed"])
np.random.seed(CONFIG["seed"])

print("Filter train-class images + undersample ALL to 30k")
print(f"Remove class : {CONFIG['class_to_remove']}")
print(f"Target total : {CONFIG['target_total_size']:,}")
print(f"Output dir   : {CONFIG['output_dir']}")


Filter train-class images + undersample ALL to 30k
Remove class : train
Target total : 30,000
Output dir   : C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset
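
The paths in `CONFIG` are tied to one machine. As a minimal sketch, assuming the same folder layout under a single datasets folder, the six image/label roots and the output folder could be derived from one base directory. `build_path_config` and the `"Datasets"` argument are hypothetical names used for illustration; they are not part of the notebook.

```python
from pathlib import Path

def build_path_config(data_root):
    """Derive image/label roots and the output folder from one base folder.

    Hypothetical helper: assumes the layout used in CONFIG above
    (bdd100k_images_100k/100k/<split> and bdd100k_labels/100k/<split>).
    """
    root = Path(data_root)
    cfg = {}
    for split in ("train", "val", "test"):
        cfg[f"{split}_img_root"] = str(root / "bdd100k_images_100k" / "100k" / split)
        cfg[f"{split}_lbl_root"] = str(root / "bdd100k_labels" / "100k" / split)
    cfg["output_dir"] = str(root / "down sampled Bdd100k Dataset")
    return cfg

derived_paths = build_path_config("Datasets")
for key, value in derived_paths.items():
    print(f"{key:15s} -> {value}")
```

Calling `CONFIG.update(build_path_config(...))` with the local datasets folder would then overwrite only the path entries and leave the sampling settings untouched.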


### Utilities


In [3]:
# Note: these helpers are used throughout the notebook, so they are defined once here to avoid duplication.

_JSON_CACHE = {}

def load_json_cached(json_path: str):
    """Load JSON once and cache it."""
    if json_path in _JSON_CACHE:
        return _JSON_CACHE[json_path]
    try:
        with open(json_path, "r") as f:
            data = json.load(f)
    except Exception:
        data = None
    _JSON_CACHE[json_path] = data
    return data


def extract_box2d_objects(label_data):
    """Return list of objects that have box2d annotations."""
    if not label_data or "frames" not in label_data:
        return []
    objs = []
    for frame in label_data.get("frames", []):
        for obj in frame.get("objects", []):
            if obj.get("box2d") is not None:
                objs.append(obj)
    return objs


def find_images(root_dir: str):
    """Collect all image paths under root_dir (non-recursive, flat folder)."""
    images = []
    for ext in CONFIG["image_extensions"]:
        images.extend(glob.glob(os.path.join(root_dir, f"*{ext}")))
    return images


def build_pairs(img_root: str, lbl_root: str, split: str):
    """
    Match images with their JSON label file using separate image and label folders.
    Image:  img_root/<stem>.jpg
    Label:  lbl_root/<stem>.json
    Returns list of dicts: {split, img_path, json_path}
    """
    if not os.path.exists(img_root):
        print(f"[WARNING] Missing image folder: {img_root}")
        return []
    if not os.path.exists(lbl_root):
        print(f"[WARNING] Missing label folder: {lbl_root}")
        return []

    images = find_images(img_root)
    samples = []

    for img_path in tqdm(images, desc=f"Matching {split}"):
        stem = Path(img_path).stem
        json_path = os.path.join(lbl_root, stem + ".json")
        if os.path.exists(json_path):
            samples.append({
                "split":     split,
                "img_path":  img_path,
                "json_path": json_path,
            })
        # else: image has no matching label — silently skip

    return samples
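
The helpers above assume each label file holds a top-level `frames` list whose entries contain an `objects` list, and that only box-annotated objects carry a `box2d` key. The sketch below runs the same filter on a toy label. The dictionary is made up for illustration and is not copied from the dataset.

```python
def extract_box2d_objects(label_data):
    """Return list of objects that have box2d annotations (same logic as above)."""
    if not label_data or "frames" not in label_data:
        return []
    objs = []
    for frame in label_data.get("frames", []):
        for obj in frame.get("objects", []):
            if obj.get("box2d") is not None:
                objs.append(obj)
    return objs

# Made-up label: two boxed objects and one polygon-only object
toy_label = {
    "name": "example",
    "attributes": {"weather": "clear", "scene": "city street", "timeofday": "daytime"},
    "frames": [{
        "objects": [
            {"category": "car", "box2d": {"x1": 10, "y1": 20, "x2": 110, "y2": 80}},
            {"category": "lane/single white", "poly2d": [[0, 0], [5, 5]]},
            {"category": "person", "box2d": {"x1": 200, "y1": 50, "x2": 230, "y2": 140}},
        ]
    }],
}

toy_boxes = extract_box2d_objects(toy_label)
print([o["category"] for o in toy_boxes])   # the polygon-only object is skipped
print(extract_box2d_objects(None))          # unreadable/missing label -> empty list
```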

### Step 1: Scan & match


In [4]:
# We create one master list of matched samples across all splits.
print("\nStep 1: Scanning & matching dataset")

train_samples = build_pairs(CONFIG["train_img_root"], CONFIG["train_lbl_root"], "train")
val_samples   = build_pairs(CONFIG["val_img_root"],   CONFIG["val_lbl_root"],   "val")
test_samples  = build_pairs(CONFIG["test_img_root"],  CONFIG["test_lbl_root"],  "test")

all_samples = train_samples + val_samples + test_samples

print("\nDataset matched pairs:")
print(f"  Train: {len(train_samples):,}")
print(f"  Val:   {len(val_samples):,}")
print(f"  Test:  {len(test_samples):,}")
print(f"  TOTAL: {len(all_samples):,}\n")


Step 1: Scanning & matching dataset


Matching train: 100%|██████████| 70000/70000 [00:09<00:00, 7331.04it/s]
Matching val: 100%|██████████| 10000/10000 [00:01<00:00, 7732.42it/s]
Matching test: 100%|██████████| 20000/20000 [00:02<00:00, 7826.18it/s]


Dataset matched pairs:
  Train: 70,000
  Val:   10,000
  Test:  20,000
  TOTAL: 100,000






### Step 2: remove images containing the 'train' class

(Here `train` is the BDD100K object category, not the training split; the whole image is dropped.)


In [5]:
# If an image contains a 'train' box anywhere, we drop the entire sample.
def sample_has_class(sample, class_name: str) -> bool:
    label = load_json_cached(sample["json_path"])
    objs = extract_box2d_objects(label)
    for o in objs:
        if o.get("category") == class_name:
            return True
    return False


print(f"Step 2: Removing any sample that contains class '{CONFIG['class_to_remove']}'")

kept    = []
removed = []

for s in tqdm(all_samples, desc="Filtering train-class samples"):
    if sample_has_class(s, CONFIG["class_to_remove"]):
        removed.append(s)
    else:
        kept.append(s)

print("\nFiltering result:")
print(f"  Original: {len(all_samples):,}")
print(f"  Removed:  {len(removed):,} ({len(removed)/max(1,len(all_samples))*100:.2f}%)")
print(f"  Kept:     {len(kept):,}\n")

kept_by_split = {
    "train": [s for s in kept if s["split"] == "train"],
    "val":   [s for s in kept if s["split"] == "val"],
    "test":  [s for s in kept if s["split"] == "test"],
}

print("Kept breakdown (after removing train-class images):")
for sp in ["train", "val", "test"]:
    print(f"  {sp.upper():5s}: {len(kept_by_split[sp]):,}")
print()

Step 2: Removing any sample that contains class 'train'


Filtering train-class samples: 100%|██████████| 100000/100000 [46:37<00:00, 35.74it/s] 


Filtering result:
  Original: 100,000
  Removed:  145 (0.14%)
  Kept:     99,855

Kept breakdown (after removing train-class images):
  TRAIN: 69,895
  VAL  : 9,986
  TEST : 19,974






### Step 3: analyze data


In [6]:
# We store:
# categories + counts
# metadata: timeofday/scene/weather
# num_objects, diversity score
def shannon_diversity(categories):
    """Normalized Shannon entropy (0..1) as a diversity score."""
    if not categories:
        return 0.0
    counts = Counter(categories)
    total  = sum(counts.values())
    probs  = np.array([c / total for c in counts.values()], dtype=np.float64)
    entropy     = -np.sum(probs * np.log2(probs + 1e-12))
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 1.0
    return float(entropy / max_entropy) if max_entropy > 0 else 0.0


print("Step 3: Analyzing samples (cached)")

analyzed         = []
all_class_counts = Counter()

for s in tqdm(kept, desc="Analyzing kept samples"):
    label = load_json_cached(s["json_path"])
    if not label:
        continue

    objs = extract_box2d_objects(label)
    cats = [o.get("category", "unknown") for o in objs]

    attrs     = label.get("attributes", {}) if isinstance(label, dict) else {}
    timeofday = attrs.get("timeofday", "unknown")
    scene     = attrs.get("scene",     "unknown")
    weather   = attrs.get("weather",   "unknown")

    all_class_counts.update(cats)

    analyzed.append({
        **s,
        "categories":  cats,
        "num_objects": len(cats),
        "diversity":   shannon_diversity(cats),
        "timeofday":   timeofday,
        "scene":       scene,
        "weather":     weather,
    })

print(f"\n  Analyzed samples    : {len(analyzed):,}")
print(f"  Unique classes      : {len(all_class_counts):,}\n")

Step 3: Analyzing samples (cached)


Analyzing kept samples: 100%|██████████| 99855/99855 [00:02<00:00, 39250.48it/s]


  Analyzed samples    : 99,855
  Unique classes      : 9
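
To make the diversity score concrete, here is a small standard-library sketch of the same normalized Shannon entropy. It omits the `1e-12` epsilon used in the notebook version, so results can differ in the last decimal places. `normalized_entropy` is a hypothetical name used only here.

```python
import math
from collections import Counter

def normalized_entropy(categories):
    """Shannon entropy of the class mix, divided by log2(number of distinct classes)."""
    counts = Counter(categories)
    if len(counts) < 2:
        return 0.0  # empty or single-class image: no diversity
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))

single_class = normalized_entropy(["car", "car", "car"])
skewed_mix   = normalized_entropy(["car", "car", "car", "person"])
even_mix     = normalized_entropy(["car", "person", "car", "person"])
print(single_class, round(skewed_mix, 3), even_mix)
```

The score rewards a balanced mix rather than a large number of objects: four objects split 2/2 score higher than four objects split 3/1.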






### Step 4: score samples (Intelligent undersampling)


In [7]:
# Compute a score per image so we keep the images with the best learning signal:
# rarity: favor classes that appear less often (inverse square-root frequency)
# diversity: favor images with a balanced mix of classes
# density: favor images with more objects
# hard conditions: keep some night / dawn-dusk / bad-weather / rare-scene images

# Rarity is computed from the filtered dataset, not hard-coded.
def compute_rarity_bonus(categories, class_freq: Counter):
    """Reward rare classes using inverse square-root frequency (1 / sqrt(count))."""
    if not categories:
        return 0.0
    bonus = 0.0
    for c in set(categories):
        f = class_freq.get(c, 1)
        bonus += 1.0 / np.sqrt(f)
    return float(bonus)


W = CONFIG["score_weights"]

def compute_score(sample, class_freq: Counter):
    score = 0.0
    score += W["rarity"]    * compute_rarity_bonus(sample["categories"], class_freq)
    score += W["diversity"] * sample["diversity"]
    score += W["density"]   * sample["num_objects"]

    if sample["timeofday"] == "night":
        score += W["night"]
    elif sample["timeofday"] == "dawn/dusk":
        score += W["dawn_dusk"]

    if sample["weather"] in CONFIG["weather_bonus_list"]:
        score += W["bad_weather"]

    if sample["scene"] in CONFIG["scene_bonus_list"]:
        score += W["rare_scene"]

    return float(score)


print("Step 4: Scoring images for intelligent undersampling")

for s in tqdm(analyzed, desc="Scoring"):
    s["score"] = compute_score(s, all_class_counts)

Step 4: Scoring images for intelligent undersampling


Scoring: 100%|██████████| 99855/99855 [00:00<00:00, 125018.69it/s]
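
As a worked example of how the score terms combine, the sketch below applies the weights from `CONFIG["score_weights"]` to two hypothetical images. The class frequencies in `toy_freq` are made-up numbers chosen to keep the arithmetic easy, and `toy_score` is an illustrative re-implementation, not the notebook's function.

```python
import math

WEIGHTS = {"rarity": 6.0, "diversity": 3.0, "density": 0.08,
           "night": 1.0, "dawn_dusk": 2.0, "bad_weather": 1.5, "rare_scene": 1.0}

# Made-up instance counts, for illustration only
toy_freq = {"car": 10000, "rider": 100}

def toy_score(categories, diversity, timeofday, weather, scene):
    """Same additive structure as compute_score, with inputs passed explicitly."""
    rarity = sum(1.0 / math.sqrt(toy_freq.get(c, 1)) for c in set(categories))
    score = WEIGHTS["rarity"] * rarity
    score += WEIGHTS["diversity"] * diversity
    score += WEIGHTS["density"] * len(categories)
    if timeofday == "night":
        score += WEIGHTS["night"]
    elif timeofday == "dawn/dusk":
        score += WEIGHTS["dawn_dusk"]
    if weather in ("rainy", "snowy", "foggy"):
        score += WEIGHTS["bad_weather"]
    if scene in ("residential", "parking lot", "tunnel", "gas stations"):
        score += WEIGHTS["rare_scene"]
    return score

# Five cars, daytime, clear, city street: 6*0.01 + 0 + 0.08*5
plain_score = toy_score(["car"] * 5, 0.0, "daytime", "clear", "city street")
# Car + rider, night, rainy, tunnel: 6*0.11 + 3*1.0 + 0.08*2 + 1.0 + 1.5 + 1.0
hard_score = toy_score(["car", "rider"], 1.0, "night", "rainy", "tunnel")
print(round(plain_score, 2), round(hard_score, 2))
```

With these toy numbers the rarity term contributes well under one point even for the rarer class, so the diversity and condition bonuses carry most of the score. Whether that balance is desirable on the real class counts is a judgment call for tuning the weights.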


### Step 5: undersample dataset to 30k
(split-aware)


In [8]:
# split ratios are stable by default
def choose_split_targets(total_target: int, split_counts: dict):
    """Compute how many samples to keep per split."""
    total_available = sum(split_counts.values())
    if total_target >= total_available:
        return {k: split_counts[k] for k in split_counts}

    if not CONFIG["keep_split_ratio"]:
        train_target = max(0, total_target - CONFIG["min_val"] - CONFIG["min_test"])
        return {
            "train": min(train_target, split_counts["train"]),
            "val":   min(CONFIG["min_val"],  split_counts["val"]),
            "test":  min(CONFIG["min_test"], split_counts["test"]),
        }

    raw = {sp: int(round(total_target * (split_counts[sp] / total_available))) for sp in split_counts}
    raw["val"]  = max(raw["val"],  CONFIG["min_val"])
    raw["test"] = max(raw["test"], CONFIG["min_test"])
    raw["train"] = max(0, total_target - raw["val"] - raw["test"])

    for sp in raw:
        raw[sp] = min(raw[sp], split_counts[sp])

    current_total = sum(raw.values())
    remaining = total_target - current_total
    if remaining > 0:
        can_add = min(remaining, split_counts["train"] - raw["train"])
        raw["train"] += max(0, can_add)

    return raw


split_counts = {
    "train": sum(1 for s in analyzed if s["split"] == "train"),
    "val":   sum(1 for s in analyzed if s["split"] == "val"),
    "test":  sum(1 for s in analyzed if s["split"] == "test"),
}

targets = choose_split_targets(CONFIG["target_total_size"], split_counts)

print("\nTarget sizes (after filtering train-class images):")
for sp in ["train", "val", "test"]:
    print(f"  {sp.upper():5s}: target {targets[sp]:,} / available {split_counts[sp]:,}")
print(f"  TOTAL: {sum(targets.values()):,}\n")


def select_top_by_score(samples, k):
    if k >= len(samples):
        return samples
    return sorted(samples, key=lambda x: x["score"], reverse=True)[:k]


print("Step 5: Selecting top scored samples per split")

selected          = []
selected_by_split = {}

for sp in ["train", "val", "test"]:
    pool   = [s for s in analyzed if s["split"] == sp]
    chosen = select_top_by_score(pool, targets[sp])
    selected_by_split[sp] = chosen
    selected.extend(chosen)

print("\nSelection done:")
for sp in ["train", "val", "test"]:
    print(f"  {sp.upper():5s}: {len(selected_by_split[sp]):,}")
print(f"  TOTAL: {len(selected):,}\n")


Target sizes (after filtering train-class images):
  TRAIN: target 20,999 / available 69,895
  VAL  : target 3,000 / available 9,986
  TEST : target 6,001 / available 19,974
  TOTAL: 30,000

Step 5: Selecting top scored samples per split

Selection done:
  TRAIN: 20,999
  VAL  : 3,000
  TEST : 6,001
  TOTAL: 30,000
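
The targets printed above follow from simple proportional arithmetic. The sketch below reproduces them from the reported split counts; it is a simplified version of `choose_split_targets` that leaves out the capping and top-up steps, and `proportional_targets` is a hypothetical name.

```python
def proportional_targets(total_target, split_counts, min_val=2000, min_test=2000):
    """Proportional targets with floors for val/test; train takes the remainder."""
    total_available = sum(split_counts.values())
    raw = {sp: int(round(total_target * n / total_available))
           for sp, n in split_counts.items()}
    raw["val"] = max(raw["val"], min_val)
    raw["test"] = max(raw["test"], min_test)
    # Train absorbs any rounding difference so the total stays exact
    raw["train"] = max(0, total_target - raw["val"] - raw["test"])
    return raw

# Counts reported after filtering in Step 2
reported_counts = {"train": 69895, "val": 9986, "test": 19974}
split_targets_check = proportional_targets(30000, reported_counts)
print(split_targets_check)
```

Because both proportional values (about 3,000 and 6,001) are above the 2,000 floors, the floors have no effect in this run.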



### Step 6: build class list and YOLO label conversion


In [9]:
# YOLO needs:
# labels as .txt in format: <class_id> <x_center> <y_center> <w> <h>
# all values normalized by image width/height

# We create the class list from the dataset automatically.

def build_class_list(samples):
    all_cats = Counter()
    for s in samples:
        all_cats.update(s["categories"])
    classes = sorted(all_cats.keys())
    return classes, all_cats


classes, class_freq_selected = build_class_list(selected)
class_to_id = {c: i for i, c in enumerate(classes)}

print("Step 6: Building YOLO class mapping")
print(f"Classes used (count={len(classes)}):")
print(classes[:30], "..." if len(classes) > 30 else "")
print()


def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def json_to_yolo_lines(json_path, img_w, img_h, class_to_id):
    """Convert one BDD100K label JSON to YOLO-format lines."""
    label = load_json_cached(json_path)
    if not label:
        return []

    objs  = extract_box2d_objects(label)
    lines = []

    for o in objs:
        cat = o.get("category", "unknown")
        if cat not in class_to_id:
            continue

        b = o.get("box2d", {})
        x1, y1, x2, y2 = b.get("x1"), b.get("y1"), b.get("x2"), b.get("y2")
        if None in (x1, y1, x2, y2):
            continue

        x1 = clamp(float(x1), 0.0, float(img_w))
        x2 = clamp(float(x2), 0.0, float(img_w))
        y1 = clamp(float(y1), 0.0, float(img_h))
        y2 = clamp(float(y2), 0.0, float(img_h))

        if x2 <= x1 or y2 <= y1:
            continue

        xc  = ((x1 + x2) / 2.0) / img_w
        yc  = ((y1 + y2) / 2.0) / img_h
        w   = (x2 - x1) / img_w
        h   = (y2 - y1) / img_h
        cid = class_to_id[cat]
        lines.append(f"{cid} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")

    return lines

Step 6: Building YOLO class mapping
Classes used (count=9):
['bike', 'bus', 'car', 'motor', 'person', 'rider', 'traffic light', 'traffic sign', 'truck'] 
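
A worked example of the coordinate conversion may help when checking label files by hand. The sketch assumes a 1280×720 frame and a made-up box; `box_to_yolo` is a hypothetical helper that repeats only the normalization step of `json_to_yolo_lines`.

```python
def box_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corner coordinates to normalized (x_center, y_center, w, h)."""
    x_center = ((x1 + x2) / 2.0) / img_w
    y_center = ((y1 + y2) / 2.0) / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height

# Made-up box on an assumed 1280x720 frame
yolo_box = box_to_yolo(100, 200, 300, 400, 1280, 720)
yolo_line = "2 " + " ".join(f"{v:.6f}" for v in yolo_box)  # class id 2 = 'car' in the list above
print(yolo_line)
```

Every value in a valid line lies in the range 0 to 1, which is a quick sanity check for exported files.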



### Step 7: create YOLO folder structure + export images/labels + data.yaml


In [10]:
# This makes a fully trainable YOLO dataset:
# output_dir/
#   images/train, images/val, images/test
#   labels/train, labels/val, labels/test
#   data.yaml

from PIL import Image

def ensure_dir(p):
    os.makedirs(p, exist_ok=True)


out     = CONFIG["output_dir"]
img_out = {sp: os.path.join(out, "images", sp) for sp in ["train", "val", "test"]}
lab_out = {sp: os.path.join(out, "labels", sp) for sp in ["train", "val", "test"]}

for sp in ["train", "val", "test"]:
    ensure_dir(img_out[sp])
    ensure_dir(lab_out[sp])

print("Step 7: Exporting YOLO dataset (copy images + write labels)")


def export_split(samples, split):
    for s in tqdm(samples, desc=f"Export {split}"):
        img_src  = s["img_path"]
        json_src = s["json_path"]

        stem    = Path(img_src).stem
        img_dst = os.path.join(img_out[split], Path(img_src).name)
        lab_dst = os.path.join(lab_out[split], stem + ".txt")

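        # Read the image size first; skip unreadable images before copying
        # so that no image is exported without a label file.
        try:
            with Image.open(img_src) as im:
                w, h = im.size
        except Exception:
            continue

        # Copy only if not already exported (resume-safe)
        if not os.path.exists(img_dst):
            shutil.copy2(img_src, img_dst)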

        yolo_lines = json_to_yolo_lines(json_src, w, h, class_to_id)
        with open(lab_dst, "w") as f:
            f.write("\n".join(yolo_lines))


export_split(selected_by_split["train"], "train")
export_split(selected_by_split["val"],   "val")
export_split(selected_by_split["test"],  "test")


# data.yaml for Ultralytics YOLO
yaml_path = os.path.join(out, "data.yaml")
with open(yaml_path, "w") as f:
    f.write(f"path: {out}\n")
    f.write("train: images/train\n")
    f.write("val: images/val\n")
    f.write("test: images/test\n\n")
    f.write(f"nc: {len(classes)}\n")
    f.write("names:\n")
    for i, name in enumerate(classes):
        f.write(f"  {i}: {name}\n")

print("\nYOLO dataset is ready!")
print(f"Output folder : {out}")
print(f"data.yaml     : {yaml_path}")

Step 7: Exporting YOLO dataset (copy images + write labels)


Export train: 100%|██████████| 20999/20999 [09:41<00:00, 36.08it/s]
Export val: 100%|██████████| 3000/3000 [01:23<00:00, 35.76it/s]
Export test: 100%|██████████| 6001/6001 [02:43<00:00, 36.71it/s]


YOLO dataset is ready!
Output folder : C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset
data.yaml     : C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\data.yaml
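
After an export like this, a quick integrity check is to confirm that every image has a matching label file. The sketch below shows one way to do that on a throwaway folder; `find_unpaired` is a hypothetical helper, and the temporary files exist only for the demonstration.

```python
import tempfile
from pathlib import Path

def find_unpaired(images_dir, labels_dir, exts=(".jpg", ".jpeg", ".png")):
    """Return sorted stems of images that have no matching .txt label file."""
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    image_stems = {p.stem for p in Path(images_dir).iterdir()
                   if p.suffix.lower() in exts}
    return sorted(image_stems - label_stems)

# Tiny throwaway dataset: two images, only one label
with tempfile.TemporaryDirectory() as tmp:
    demo_img_dir = Path(tmp) / "images" / "train"
    demo_lbl_dir = Path(tmp) / "labels" / "train"
    demo_img_dir.mkdir(parents=True)
    demo_lbl_dir.mkdir(parents=True)
    for stem in ("a", "b"):
        (demo_img_dir / f"{stem}.jpg").touch()
    (demo_lbl_dir / "a.txt").write_text("2 0.5 0.5 0.1 0.1")
    unpaired = find_unpaired(demo_img_dir, demo_lbl_dir)

print(unpaired)
```

On the real dataset the same call could be pointed at `images/<split>` and `labels/<split>` under the output folder; an empty list would mean every exported image has a label file.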





### Final summary


In [11]:

print("\nFINAL SUMMARY")
print(f"After removing '{CONFIG['class_to_remove']}' images:")
print(f"  Kept samples total : {len(analyzed):,}")
print(f"Selected for YOLO:")
print(f"  Train : {len(selected_by_split['train']):,}")
print(f"  Val   : {len(selected_by_split['val']):,}")
print(f"  Test  : {len(selected_by_split['test']):,}")
print(f"  TOTAL : {len(selected):,}")
print(f"Classes : {len(classes)}")



FINAL SUMMARY
After removing 'train' images:
  Kept samples total : 99,855
Selected for YOLO:
  Train : 20,999
  Val   : 3,000
  Test  : 6,001
  TOTAL : 30,000
Classes : 9


## Visualizations


Produces:

1. Random sample frames with bounding boxes
2. Class distribution (full dataset vs selected)
3. Train / Val / Test split sizes
4. Time-of-day distribution
5. Weather distribution
6. Scene distribution
7. Objects-per-image histogram
8. Diversity score distribution
9. Image score distribution (what got selected vs dropped)
10. Top-20 class co-occurrence heatmap


### If Standalone


In [15]:
# If you're running this file standalone (not after the main script),
# set STANDALONE = True and fill in the paths below.
STANDALONE = False   # ← change to True if running alone

if STANDALONE:
    # paste the same paths you used in tamakkan_2_local.py
    import sys
    sys.path.insert(0, r"C:\path\to\tamakkan_2_local.py")   # folder only
    exec(open(r"C:\path\to\tamakkan_2_local.py").read())

### Output folder


In [16]:
# Output folder for saving the plots 
VIZ_DIR = os.path.join(CONFIG["output_dir"], "visualizations")
os.makedirs(VIZ_DIR, exist_ok=True)

### Plotting style


In [17]:
plt.rcParams.update({
    "figure.dpi":      150,
    "axes.spines.top":   False,
    "axes.spines.right": False,
    "font.family":     "DejaVu Sans",
    "axes.titlesize":  13,
    "axes.labelsize":  11,
})

PALETTE = [
    "#2E86AB", "#A23B72", "#F18F01", "#C73E1D",
    "#3B1F2B", "#44BBA4", "#E94F37", "#393E41",
    "#F5A623", "#7B2D8B", "#00A878", "#D64045",
]

def save_fig(name):
    path = os.path.join(VIZ_DIR, name)
    plt.savefig(path, bbox_inches="tight")
    plt.close()
    print(f"  Saved → {path}")

### RANDOM SAMPLE FRAMES WITH BOUNDING BOXES


In [18]:
print("\n[1] Drawing random sample frames with bounding boxes")

N_SAMPLES   = 12                 # total images to show
COLS        = 4
ROWS        = N_SAMPLES // COLS
BOX_ALPHA   = 0.85

# Build a color map per class
unique_classes = sorted(set(c for s in selected for c in s["categories"]))
# plt.cm.get_cmap is deprecated in recent Matplotlib; plt.get_cmap accepts the same arguments
cmap = plt.get_cmap("tab20", len(unique_classes))
class_color = {cls: cmap(i) for i, cls in enumerate(unique_classes)}

random.seed(CONFIG["seed"])
sample_pool = random.sample(selected, min(N_SAMPLES, len(selected)))

fig, axes = plt.subplots(ROWS, COLS, figsize=(COLS * 4, ROWS * 3.2))
axes = axes.flatten()

for ax, s in zip(axes, sample_pool):
    try:
        img = Image.open(s["img_path"]).convert("RGB")
    except Exception:
        ax.axis("off")
        continue

    ax.imshow(img)
    img_w, img_h = img.size

    # Load boxes from JSON (context manager closes the file handle)
    with open(s["json_path"], "r") as f:
        label = json.load(f)
    for frame in label.get("frames", []):
        for obj in frame.get("objects", []):
            b   = obj.get("box2d")
            cat = obj.get("category", "unknown")
            if b is None:
                continue
            x1, y1, x2, y2 = b["x1"], b["y1"], b["x2"], b["y2"]
            color = class_color.get(cat, (1, 1, 0, 1))
            rect  = Rectangle(
                (x1, y1), x2 - x1, y2 - y1,
                linewidth=1.5, edgecolor=color,
                facecolor="none", alpha=BOX_ALPHA
            )
            ax.add_patch(rect)
            ax.text(
                x1, max(y1 - 3, 0), cat,
                fontsize=6, color="white",
                bbox=dict(facecolor=color, alpha=0.7, pad=1, edgecolor="none")
            )

    ax.set_title(
        f"{s['split']} | {s['timeofday']} | {s['weather']}",
        fontsize=7, pad=3
    )
    ax.axis("off")

# hide any unused axes
for ax in axes[len(sample_pool):]:
    ax.axis("off")

fig.suptitle("Random Sample Frames with Bounding Boxes", fontsize=14, y=1.01)
plt.tight_layout()
save_fig("01_sample_frames_bbox.png")


[1] Drawing random sample frames with bounding boxes




  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\01_sample_frames_bbox.png


### CLASS DISTRIBUTION — full kept dataset vs selected 30k


In [19]:
print("[2] Class distribution (full vs selected)")

full_counts     = Counter(c for s in analyzed  for c in s["categories"])
selected_counts = Counter(c for s in selected  for c in s["categories"])

all_cls  = sorted(full_counts.keys(), key=lambda x: -full_counts[x])
x        = np.arange(len(all_cls))
w        = 0.4

fig, ax = plt.subplots(figsize=(max(10, len(all_cls) * 0.55), 5))
ax.bar(x - w/2, [full_counts[c]     for c in all_cls], width=w,
       label="Full kept dataset", color=PALETTE[0], alpha=0.85)
ax.bar(x + w/2, [selected_counts[c] for c in all_cls], width=w,
       label="Selected 30k",       color=PALETTE[2], alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(all_cls, rotation=45, ha="right", fontsize=9)
ax.set_ylabel("Number of instances")
ax.set_title("Class Distribution: Full Kept Dataset vs Selected 30k")
ax.legend()
plt.tight_layout()
save_fig("02_class_distribution.png")

[2] Class distribution (full vs selected)
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\02_class_distribution.png


### SPLIT SIZE BAR CHART


In [20]:
print("[3] Split sizes")

splits      = ["Train", "Val", "Test"]
full_sizes  = [sum(1 for s in analyzed if s["split"] == sp.lower()) for sp in splits]
sel_sizes   = [len(selected_by_split[sp.lower()]) for sp in splits]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, sizes, title in zip(
    axes,
    [full_sizes, sel_sizes],
    ["After Filtering (full kept)", "After Undersampling (selected 30k)"]
):
    bars = ax.bar(splits, sizes, color=PALETTE[:3], width=0.5)
    for bar, v in zip(bars, sizes):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 100,
                f"{v:,}", ha="center", fontsize=10)
    ax.set_title(title)
    ax.set_ylabel("Number of images")
    ax.set_ylim(0, max(sizes) * 1.15)

plt.suptitle("Dataset Split Sizes", fontsize=13)
plt.tight_layout()
save_fig("03_split_sizes.png")

[3] Split sizes
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\03_split_sizes.png


### TIME-OF-DAY DISTRIBUTION


In [21]:
print("[4] Time-of-day distribution")

tod_full = Counter(s["timeofday"] for s in analyzed)
tod_sel  = Counter(s["timeofday"] for s in selected)
tod_keys = sorted(tod_full.keys(), key=lambda x: -tod_full[x])

x = np.arange(len(tod_keys))
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - 0.2, [tod_full[k] for k in tod_keys], width=0.35,
       label="Full kept", color=PALETTE[0], alpha=0.85)
ax.bar(x + 0.2, [tod_sel[k]  for k in tod_keys], width=0.35,
       label="Selected",  color=PALETTE[1], alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(tod_keys, fontsize=10)
ax.set_ylabel("Number of images")
ax.set_title("Time-of-Day Distribution")
ax.legend()
plt.tight_layout()
save_fig("04_timeofday_distribution.png")

[4] Time-of-day distribution
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\04_timeofday_distribution.png


### WEATHER DISTRIBUTION


In [22]:
print("[5] Weather distribution")

wth_full = Counter(s["weather"] for s in analyzed)
wth_sel  = Counter(s["weather"] for s in selected)
wth_keys = sorted(wth_full.keys(), key=lambda x: -wth_full[x])

x = np.arange(len(wth_keys))
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - 0.2, [wth_full[k] for k in wth_keys], width=0.35,
       label="Full kept", color=PALETTE[3], alpha=0.85)
ax.bar(x + 0.2, [wth_sel[k]  for k in wth_keys], width=0.35,
       label="Selected",  color=PALETTE[2], alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(wth_keys, fontsize=10)
ax.set_ylabel("Number of images")
ax.set_title("Weather Distribution")
ax.legend()
plt.tight_layout()
save_fig("05_weather_distribution.png")

[5] Weather distribution
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\05_weather_distribution.png


### SCENE DISTRIBUTION


In [23]:
print("[6] Scene distribution")

scn_full = Counter(s["scene"] for s in analyzed)
scn_sel  = Counter(s["scene"] for s in selected)
scn_keys = sorted(scn_full.keys(), key=lambda x: -scn_full[x])

x = np.arange(len(scn_keys))
fig, ax = plt.subplots(figsize=(max(8, len(scn_keys) * 0.9), 4))
ax.bar(x - 0.2, [scn_full[k] for k in scn_keys], width=0.35,
       label="Full kept", color=PALETTE[4], alpha=0.85)
ax.bar(x + 0.2, [scn_sel[k]  for k in scn_keys], width=0.35,
       label="Selected",  color=PALETTE[5], alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(scn_keys, rotation=30, ha="right", fontsize=9)
ax.set_ylabel("Number of images")
ax.set_title("Scene Type Distribution")
ax.legend()
plt.tight_layout()
save_fig("06_scene_distribution.png")

[6] Scene distribution
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\06_scene_distribution.png


### OBJECTS PER IMAGE HISTOGRAM


In [24]:
print("[7] Objects-per-image histogram")

objs_full = [s["num_objects"] for s in analyzed]
objs_sel  = [s["num_objects"] for s in selected]
max_objs  = max(objs_full) if objs_full else 50
bins      = range(0, min(max_objs + 2, 60))

fig, ax = plt.subplots(figsize=(10, 4))
ax.hist(objs_full, bins=bins, alpha=0.6, label="Full kept",
        color=PALETTE[0], edgecolor="white")
ax.hist(objs_sel,  bins=bins, alpha=0.6, label="Selected",
        color=PALETTE[2], edgecolor="white")
ax.set_xlabel("Number of objects in image")
ax.set_ylabel("Number of images")
ax.set_title("Objects-per-Image Distribution")
ax.legend()
plt.tight_layout()
save_fig("07_objects_per_image.png")

[7] Objects-per-image histogram
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\07_objects_per_image.png


### DIVERSITY SCORE DISTRIBUTION


In [25]:
print("[8] Diversity score distribution")

div_full = [s["diversity"] for s in analyzed]
div_sel  = [s["diversity"] for s in selected]

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(div_full, bins=40, alpha=0.6, label="Full kept",
        color=PALETTE[1], edgecolor="white")
ax.hist(div_sel,  bins=40, alpha=0.6, label="Selected",
        color=PALETTE[3], edgecolor="white")
ax.set_xlabel("Shannon Diversity Score (0 = one class, 1 = perfectly diverse)")
ax.set_ylabel("Number of images")
ax.set_title("Class Diversity Score Distribution")
ax.legend()
plt.tight_layout()
save_fig("08_diversity_distribution.png")

[8] Diversity score distribution
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\08_diversity_distribution.png


### IMAGE SCORE DISTRIBUTION (selected vs dropped)


In [26]:
print("[9] Image score distribution (selected vs dropped)")

selected_set = set(s["img_path"] for s in selected)
scores_sel   = [s["score"] for s in analyzed if s["img_path"] in selected_set]
scores_drop  = [s["score"] for s in analyzed if s["img_path"] not in selected_set]

fig, ax = plt.subplots(figsize=(9, 4))
ax.hist(scores_drop, bins=60, alpha=0.6, label="Dropped",
        color=PALETTE[7], edgecolor="white")
ax.hist(scores_sel,  bins=60, alpha=0.7, label="Selected",
        color=PALETTE[5], edgecolor="white")
ax.axvline(
    min(scores_sel) if scores_sel else 0,
    color="red", linestyle="--", linewidth=1.2, label="Min selected score"
)
ax.set_xlabel("Information Score")
ax.set_ylabel("Number of images")
ax.set_title("Score Distribution: Selected vs Dropped Images")
ax.legend()
plt.tight_layout()
save_fig("09_score_distribution.png")

[9] Image score distribution (selected vs dropped)
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\09_score_distribution.png


### TOP-20 CLASS CO-OCCURRENCE HEATMAP


In [27]:
print("[10] Class co-occurrence heatmap (top 20 classes)")

# Pick top-20 classes by instance count in selected set
top20 = [c for c, _ in Counter(
    c for s in selected for c in s["categories"]
).most_common(20)]

n = len(top20)
cooc = np.zeros((n, n), dtype=int)
cls_idx = {c: i for i, c in enumerate(top20)}

for s in selected:
    present = set(c for c in s["categories"] if c in cls_idx)
    for a in present:
        for b in present:
            cooc[cls_idx[a]][cls_idx[b]] += 1

# Zero the diagonal so self-count doesn't dominate color scale
np.fill_diagonal(cooc, 0)

fig, ax = plt.subplots(figsize=(11, 9))
im = ax.imshow(cooc, cmap="YlOrRd", aspect="auto")
plt.colorbar(im, ax=ax, label="Co-occurrence count")
ax.set_xticks(range(n))
ax.set_yticks(range(n))
ax.set_xticklabels(top20, rotation=45, ha="right", fontsize=8)
ax.set_yticklabels(top20, fontsize=8)
ax.set_title("Top-20 Class Co-occurrence in Selected Dataset\n(how often two classes appear in the same image)", fontsize=12)
plt.tight_layout()
save_fig("10_class_cooccurrence_heatmap.png")

[10] Class co-occurrence heatmap (top 20 classes)
  Saved → C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations\10_class_cooccurrence_heatmap.png
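
The heatmap is easiest to read alongside a ranked list of the strongest pairs. The same per-image presence logic can be expressed with `itertools.combinations`, counting each unordered pair once per image. This is a sketch on toy data; `toy_selected` is a hypothetical stand-in for `selected`, assuming each record carries a `"categories"` list as in the cell above.

```python
from collections import Counter
from itertools import combinations

# Toy records, for illustration only
toy_selected = [
    {"categories": ["car", "person", "car"]},
    {"categories": ["car", "traffic light"]},
    {"categories": ["person", "car"]},
    {"categories": ["person"]},
]

pair_counts = Counter()
for s in toy_selected:
    # Deduplicate so a class is counted once per image, then sort so that
    # ("car", "person") and ("person", "car") map to the same key.
    present = sorted(set(s["categories"]))
    pair_counts.update(combinations(present, 2))

for (a, b), n in pair_counts.most_common(3):
    print(f"{a} + {b}: {n} images")
```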


### Done


In [28]:
print(f"\nAll visualizations saved to:\n  {VIZ_DIR}")
print("""
Files produced:
  01_sample_frames_bbox.png       — random frames with drawn bounding boxes
  02_class_distribution.png       — instance counts per class (full vs selected)
  03_split_sizes.png              — train/val/test image counts
  04_timeofday_distribution.png   — daytime / night / dawn-dusk breakdown
  05_weather_distribution.png     — clear / rainy / snowy / foggy etc.
  06_scene_distribution.png       — city / highway / residential etc.
  07_objects_per_image.png        — histogram of object density
  08_diversity_distribution.png   — Shannon diversity score per image
  09_score_distribution.png       — selected vs dropped by information score
  10_class_cooccurrence_heatmap   — which classes appear together most often
""")



All visualizations saved to:
  C:\Users\samar\OneDrive\Desktop\Grad Project\Datasets\down sampled Bdd100k Dataset\visualizations

Files produced:
  01_sample_frames_bbox.png       — random frames with drawn bounding boxes
  02_class_distribution.png       — instance counts per class (full vs selected)
  03_split_sizes.png              — train/val/test image counts
  04_timeofday_distribution.png   — daytime / night / dawn-dusk breakdown
  05_weather_distribution.png     — clear / rainy / snowy / foggy etc.
  06_scene_distribution.png       — city / highway / residential etc.
  07_objects_per_image.png        — histogram of object density
  08_diversity_distribution.png   — Shannon diversity score per image
  09_score_distribution.png       — selected vs dropped by information score
  10_class_cooccurrence_heatmap   — which classes appear together most often



## YOLO Model Testing

In [1]:
from ultralytics import YOLO
from pathlib import Path
import time

In [2]:
# Path to best.pt downloaded from Colab/Drive
WEIGHTS_PATH = r"C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_runs\yolo11m_bdd100k_v1\weights\best.pt"

# Source: image file, video file, or a folder of images
SOURCE = r"C:\Users\samar\OneDrive\Desktop\Grad Project\Code\Datasets\Dashcam_clips\10_Sec_Clip.mp4"

# Output folder (annotated results saved here)
OUTPUT_DIR = r"C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_inference_results"

In [3]:
CONF_THRESHOLD = 0.25   # minimum confidence to show a detection
IOU_THRESHOLD  = 0.45   # NMS threshold (suppress overlapping boxes)
IMG_SIZE       = 640    # inference image size; should match the size used in training
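
`IOU_THRESHOLD` controls non-maximum suppression: roughly, a box is discarded when its intersection-over-union (IoU) with a higher-confidence box exceeds the threshold. The sketch below computes IoU for two axis-aligned boxes in `(x1, y1, x2, y2)` format. `box_iou` is an illustrative helper written for this note, not the function Ultralytics uses internally.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes that overlap by half their width
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
print(f"IoU = {iou:.3f}")
```

With the 0.45 threshold above, these two boxes (IoU of one third) would both be kept; raising the overlap past 0.45 would suppress the lower-confidence one.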

In [4]:
def run_inference():
    """Run YOLO on SOURCE, save annotated output and labels, and report throughput."""
    # Verify weights exist
    if not Path(WEIGHTS_PATH).exists():
        print(f"[ERROR] Weights not found: {WEIGHTS_PATH}")
        print("        Download best.pt from Google Drive first.")
        return

    if not Path(SOURCE).exists():
        print(f"[ERROR] Source not found: {SOURCE}")
        return

    print(f"Loading model : {WEIGHTS_PATH}")
    model = YOLO(WEIGHTS_PATH)

    print(f"Running inference on: {SOURCE}")
    print("(Running on CPU — video processing will be slower than real-time, that's expected)\n")

    start = time.time()

    results = model.predict(
        source    = SOURCE,
        imgsz     = IMG_SIZE,
        conf      = CONF_THRESHOLD,
        iou       = IOU_THRESHOLD,
        save      = True,               # saves annotated output
        project   = OUTPUT_DIR,
        name      = "run",
        save_txt  = True,               # also saves detections as .txt (useful later for analysis)
        save_conf = True,               # includes confidence scores in .txt files
        stream    = True,               # memory-efficient — important for long videos on CPU
        device    = "cpu",
        verbose   = False,
    )

    # With stream=True, predict() returns a lazy generator: inference runs and
    # outputs are written only as the results are iterated.
    frame_count = 0
    for _ in results:
        frame_count += 1

    elapsed = time.time() - start
    fps     = frame_count / elapsed if elapsed > 0 else 0

    print("\nDone!")
    print(f"  Frames processed : {frame_count}")
    print(f"  Time taken       : {elapsed:.1f}s")
    print(f"  Average FPS      : {fps:.2f}")
    print(f"  Output saved to  : {OUTPUT_DIR}/run/")


if __name__ == "__main__":
    run_inference()

Loading model : C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_runs\yolo11m_bdd100k_v1\weights\best.pt
Running inference on: C:\Users\samar\OneDrive\Desktop\Grad Project\Code\Datasets\Dashcam_clips\10_Sec_Clip.mp4
(Running on CPU — video processing will be slower than real-time, that's expected)

Results saved to C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_inference_results\run
246 labels saved to C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_inference_results\run\labels

Done!
  Frames processed : 246
  Time taken       : 99.3s
  Average FPS      : 2.48
  Output saved to  : C:\Users\samar\OneDrive\Desktop\Grad Project\Code\tamakkan_inference_results/run/
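
With `save_txt=True` and `save_conf=True`, each label file holds one detection per line: class id, normalized x-center, y-center, width, height, then confidence. The sketch below tallies detections per class id from such a folder. `count_detections` and `min_conf` are hypothetical names, and the demo writes a made-up label file to a temporary folder rather than reading the actual run output.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_detections(labels_dir, min_conf=0.0):
    """Count detections per class id across all YOLO .txt label files."""
    counts = Counter()
    for txt in sorted(Path(labels_dir).glob("*.txt")):
        for line in txt.read_text().splitlines():
            parts = line.split()
            if len(parts) < 6:
                continue  # skip blank or malformed lines
            class_id, conf = int(parts[0]), float(parts[5])
            if conf >= min_conf:
                counts[class_id] += 1
    return counts

# Demo on a made-up label file (values are illustrative, not real results)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "clip_1.txt").write_text(
        "2 0.51 0.62 0.10 0.08 0.91\n"
        "2 0.30 0.60 0.07 0.05 0.40\n"
        "0 0.75 0.55 0.03 0.09 0.67\n"
    )
    demo_counts = count_detections(tmp, min_conf=0.5)

print(dict(demo_counts))
```

Pointing `labels_dir` at the `labels` folder reported in the output above would give per-class totals for the clip; class ids can be mapped back to names through the loaded model's `names` attribute.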
