<a href="https://colab.research.google.com/github/haiderdares/DeepLearningFinalProject/blob/main/sp25_dargupshe_data02_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📝Introduction & Dataset Overview

# 📌 Problem Statement
**Autonomous driving systems rely on deep learning-based object detection models to identify vehicles, pedestrians, traffic signs, and signals in real-time. These models are built using Convolutional Neural Networks (CNNs) and advanced architectures like YOLO (You Only Look Once) to process road scene images and make accurate driving decisions.**

**In this project, we focus on training a deep learning-based object detection model using the Berkeley DeepDrive (BDD100K) dataset, one of the largest and most diverse self-driving datasets. By leveraging YOLOv8, a state-of-the-art deep learning model, we aim to develop a system that can efficiently detect multiple objects in complex driving environments.**

# 📌 Dataset Description

**The BDD100K dataset consists of:**

**100,000 driving scenes captured from different locations, times of day, and weather conditions.**

**Annotated bounding boxes for:**

**Cars, pedestrians, traffic signs, traffic lights, bicycles, motorcycles, and more.**

**Metadata: Additional labels such as road conditions, time of day, and weather.**

**For this project, we use a subset of BDD100K containing annotated bounding boxes for object detection.**

# 📌 Machine Learning Problem: Object Detection

**This project addresses an object detection problem, where the goal is to:**

**Detect objects in self-driving car images.**

**Classify objects (e.g., car, person, traffic sign).**

**Localize objects using bounding boxes.**

# Data Loading

In [None]:
import os

import numpy as np
import pandas as pd

# List all datasets in the Kaggle input directory
print("Available datasets in /kaggle/input/:")
print(os.listdir("/kaggle/input/"))


In [None]:

dataset_path = "/kaggle/input/solesensei_bdd100k/"

# List contents
print("Contents of the dataset folder:")
print(os.listdir(dataset_path))


In [None]:


bdd100k_path = os.path.join(dataset_path, "bdd100k")  # Check inside 'bdd100k'
print("Contents of 'bdd100k' folder:")
print(os.listdir(bdd100k_path))


In [None]:
import os

# Define path to second 'bdd100k' directory
bdd100k_deep_path = os.path.join(bdd100k_path, "bdd100k")  # Nested bdd100k
print("Contents of the second 'bdd100k' folder:")
print(os.listdir(bdd100k_deep_path))


In [None]:
image_folder = os.path.join(bdd100k_deep_path, "images")

if os.path.exists(image_folder):
    print("Images folder found! Listing its contents:")
    print(os.listdir(image_folder))
else:
    print("No 'images' folder found. Check dataset structure.")


In [None]:
print("Contents of '100k' folder:", os.listdir(os.path.join(image_folder, "100k")))
print("Contents of '10k' folder:", os.listdir(os.path.join(image_folder, "10k")))


In [None]:
train_image_folder = os.path.join(image_folder, "100k", "train")  # Change to "10k" if needed

# Verify image files exist
print("Sample images in train folder:", os.listdir(train_image_folder)[:5])


In [None]:
import cv2
import matplotlib.pyplot as plt

# Select a sample image
sample_image_path = os.path.join(train_image_folder, os.listdir(train_image_folder)[0])

# Load and display the image
img = cv2.imread(sample_image_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

plt.imshow(img)
plt.axis("off")
plt.title("Sample Image from BDD100K")
plt.show()


In [None]:
train_dataset_path = "/kaggle/input/solesensei_bdd100k/bdd100k/bdd100k/images/100k/train/"
val_dataset_path = "/kaggle/input/solesensei_bdd100k/bdd100k/bdd100k/images/100k/val/"
test_dataset_path = "/kaggle/input/solesensei_bdd100k/bdd100k/bdd100k/images/100k/test/"


In [None]:


# Define the path to annotations (Check inside labels folder)
label_folder = os.path.join(dataset_path, "bdd100k_labels_release")
print("Contents of 'bdd100k_labels_release':", os.listdir(label_folder))


In [None]:
label_subfolder = os.path.join(label_folder, "bdd100k")  # Navigate deeper
print("Contents of 'bdd100k_labels_release/bdd100k':", os.listdir(label_subfolder))


In [None]:
labels_path = os.path.join(label_subfolder, "labels")  # Navigate into labels folder
print("Contents of 'bdd100k_labels_release/bdd100k/labels':", os.listdir(labels_path))


**Load Training Annotations**

In [None]:
import json

# Define the path to the training annotation file
train_annotation_file = os.path.join(labels_path, "bdd100k_labels_images_train.json")

# Load JSON file
with open(train_annotation_file, "r") as f:
    train_annotations = json.load(f)

# Display first annotation sample
print(json.dumps(train_annotations[0], indent=4))


**Extract and Display Bounding Box Information**

In [None]:
# Extract first image annotation
first_annotation = train_annotations[0]

# Print image name
print("Image Name:", first_annotation["name"])

# Print image attributes (weather, scene, time of day)
print("Attributes:", first_annotation["attributes"])

# Print bounding boxes for objects
print("\nDetected Objects:")
for obj in first_annotation["labels"]:
    if "box2d" in obj:  # Ensure it's a valid bounding box annotation
        category = obj["category"]
        bbox = obj["box2d"]
        print(f"Object: {category}, Bounding Box: {bbox}")

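Once a single annotation has been inspected, a quick tally of categories across many annotations helps reveal class imbalance. The sketch below assumes the annotation structure printed above (a list of entries with "name" and "labels", where box annotations carry "category" and "box2d"). The helper name `count_box_categories` and the two sample entries are made up for illustration; they are not real BDD100K records.

```python
from collections import Counter


def count_box_categories(annotations):
    """Count how often each category appears among box2d labels."""
    counts = Counter()
    for ann in annotations:
        # .get(..., []) guards against entries that have no "labels" key
        for obj in ann.get("labels", []):
            if "box2d" in obj:  # skip labels without a box (e.g., polygon annotations)
                counts[obj["category"]] += 1
    return counts


# Hypothetical sample entries that mimic the structure shown above
sample_annotations = [
    {"name": "a.jpg", "labels": [
        {"category": "car", "box2d": {"x1": 10, "y1": 20, "x2": 110, "y2": 90}},
        {"category": "traffic light", "box2d": {"x1": 300, "y1": 40, "x2": 320, "y2": 80}},
        {"category": "lane/single white", "poly2d": []},
    ]},
    {"name": "b.jpg", "labels": [
        {"category": "car", "box2d": {"x1": 500, "y1": 300, "x2": 700, "y2": 420}},
    ]},
]

category_counts = count_box_categories(sample_annotations)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

In the notebook, the same helper could be called on `train_annotations` to see which classes dominate before choosing the class mapping.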

**Load and Display the Image with Bounding Boxes**

**Find a Matching Image Filename**

In [None]:
# Collect the training image filenames. The image files live in images/100k/train,
# not directly in the 'images' folder (which only holds '100k' and '10k').
dataset_filenames = set(os.listdir(train_image_folder))  # Set for faster lookup

# Keep the annotations whose image file exists in the training folder
matched_files = [ann for ann in train_annotations if ann["name"] in dataset_filenames]

print(f"Matched {len(matched_files)} images directly from annotations.")




In [None]:
from difflib import get_close_matches

# Get all training image filenames (os.listdir already returns a list)
dataset_filenames = os.listdir(train_image_folder)

# Try to find the closest filename match for each annotation
matched_images = {}

for annotation in train_annotations[:10]:  # Limit for testing
    annotation_name = annotation["name"]
    matched = get_close_matches(annotation_name, dataset_filenames, n=1, cutoff=0.4)
    matched_images[annotation_name] = matched[0] if matched else "No Match"

# Display 5 sample matches
for annotation, matched in list(matched_images.items())[:5]:
    print(f"Annotation: {annotation} → Matched: {matched}")


**Load & Display an Image with Bounding Boxes**

In [None]:
import cv2
import matplotlib.pyplot as plt

# Select an annotation and its matched image
sample_annotation_name = list(matched_images.keys())[0]  # First annotation name
matched_image_name = matched_images[sample_annotation_name]  # Corresponding dataset image

# Define full image path (images are stored in the train subfolder)
image_path = os.path.join(train_image_folder, matched_image_name)

# Load and display the image
img = cv2.imread(image_path)
if img is not None:
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Find corresponding annotation
    annotation_data = next(item for item in train_annotations if item["name"] == sample_annotation_name)

    # Draw bounding boxes
    for obj in annotation_data["labels"]:
        if "box2d" in obj:
            category = obj["category"]
            bbox = obj["box2d"]
            x1, y1, x2, y2 = int(bbox["x1"]), int(bbox["y1"]), int(bbox["x2"]), int(bbox["y2"])

            # Draw rectangle
            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(img, category, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    # Show the image
    plt.figure(figsize=(10, 6))
    plt.imshow(img)
    plt.axis("off")
    plt.title(f"Bounding Boxes for {matched_image_name}")
    plt.show()
else:
    print(f" Error: Unable to load image {matched_image_name}. Check dataset paths.")


# Data Preprocessing

# 1. Handle Missing Data & Corrupted Images

**Step 1 - To Detect and Remove Corrupt Images**

In [None]:
import os
import cv2
from tqdm import tqdm

# Initialize counters
corrupt_images = []

# Check all images in the training folder
for img_name in tqdm(os.listdir(train_image_folder)):
    img_path = os.path.join(train_image_folder, img_name)

    # Try to open the image
    img = cv2.imread(img_path)

    # If image is None, it is likely corrupted
    if img is None:
        print(f" Corrupt image detected: {img_name}")
        corrupt_images.append(img_path)
        os.remove(img_path)  # Delete the corrupt image

# Summary
print(f"\n Removed {len(corrupt_images)} corrupt images.")


Running the cell above on Kaggle fails with "OSError: [Errno 30] Read-only file system", because files under /kaggle/input/ cannot be deleted.

Since the dataset is read-only, we only log the corrupt images and skip them during training instead of deleting them.

In [None]:
import os
import cv2
from tqdm import tqdm

# Initialize counters
corrupt_images = []

# Check all images in the training folder
for img_name in tqdm(os.listdir(train_image_folder)):
    img_path = os.path.join(train_image_folder, img_name)

    # Try to open the image
    img = cv2.imread(img_path)

    # If image is None, it is likely corrupted
    if img is None:
        print(f"❌ Corrupt image detected: {img_name}")
        corrupt_images.append(img_name)

# Summary
print(f"\n⚠️ Found {len(corrupt_images)} corrupt images. Skipping them during training.")


**Step 2 - Check if all images have corresponding annotation files**

In [None]:
import os

# Collect filenames from the training folder and from the annotations
image_filenames = set(os.listdir(train_image_folder))  # All training image filenames
annotation_filenames = {ann["name"] for ann in train_annotations}  # All annotated image names

# Find images without annotations
missing_annotations = image_filenames - annotation_filenames

# Find annotations without corresponding images
missing_images = annotation_filenames - image_filenames

# Print results
print(f"✅ Total Images: {len(image_filenames)}")
print(f"✅ Total Annotated Images: {len(annotation_filenames)}")
print(f"❌ Images Missing Annotations: {len(missing_annotations)}")
print(f"❌ Annotations Without Images: {len(missing_images)}")

# Display sample missing files
print("\n📌 Sample Missing Annotations:", list(missing_annotations)[:5])
print("📌 Sample Missing Images:", list(missing_images)[:5])


In [None]:
# Filter out annotations that don't have corresponding images
filtered_annotations = [ann for ann in train_annotations if ann["name"] in image_filenames]

# Print summary
print(f"✅ Filtered Annotations: {len(filtered_annotations)} (Only keeping annotations with images)")


# 2. Convert Annotations to YOLO Format

**We will convert BDD100K annotations from JSON format to YOLO .txt format.**

In [None]:
import os

# Define class mappings (modify if needed)
class_mapping = {
    "car": 0,
    "person": 1,
    "bike": 2,
    "traffic sign": 3,
    "traffic light": 4
}

# Define the output label directory
yolo_label_dir = "/kaggle/working/yolo_labels/"
os.makedirs(yolo_label_dir, exist_ok=True)

# Convert each annotation
for annotation in filtered_annotations:
    img_name = annotation["name"]
    yolo_label_path = os.path.join(yolo_label_dir, img_name.replace(".jpg", ".txt"))

    with open(yolo_label_path, "w") as f:
        for obj in annotation["labels"]:
            if "box2d" in obj and obj["category"] in class_mapping:
                x1, y1 = obj["box2d"]["x1"], obj["box2d"]["y1"]
                x2, y2 = obj["box2d"]["x2"], obj["box2d"]["y2"]

                # Normalize for YOLO format
                x_center = ((x1 + x2) / 2) / 1280  # Assuming image width = 1280
                y_center = ((y1 + y2) / 2) / 720   # Assuming image height = 720
                width = (x2 - x1) / 1280
                height = (y2 - y1) / 720

                # Write YOLO format
                f.write(f"{class_mapping[obj['category']]} {x_center} {y_center} {width} {height}\n")

print(f" YOLO annotations saved in: {yolo_label_dir}")

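The normalization arithmetic in the conversion loop can be isolated into a small helper, which makes it easier to test on a known box. This is a minimal sketch, assuming 1280x720 frames as in the cell above; the function name `box2d_to_yolo` and the clamping step are additions for illustration, not part of the original conversion.

```python
def box2d_to_yolo(box, img_w=1280, img_h=720):
    """Convert a corner-format box (x1, y1, x2, y2 in pixels) to YOLO format.

    Returns (x_center, y_center, width, height), each normalized to [0, 1].
    The default size assumes 1280x720 frames, as in the conversion above.
    """
    # Clamp corners to the image so rounding noise cannot push values outside [0, 1]
    x1 = min(max(box["x1"], 0.0), img_w)
    y1 = min(max(box["y1"], 0.0), img_h)
    x2 = min(max(box["x2"], 0.0), img_w)
    y2 = min(max(box["y2"], 0.0), img_h)

    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height


# A box covering the left half of a 1280x720 frame
yolo_box = box2d_to_yolo({"x1": 0, "y1": 0, "x2": 640, "y2": 720})
print(yolo_box)
```

Clamping is a design choice: boxes that extend slightly beyond the frame would otherwise produce values outside the 0-1 range, which some augmentation libraries reject.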

**Verify YOLO Annotations**

In [None]:
import os

# List generated YOLO label files
yolo_files = os.listdir(yolo_label_dir)

# Print total files created
print(f"✅ Total YOLO label files created: {len(yolo_files)}")

# Display a sample YOLO label file
sample_yolo_file = os.path.join(yolo_label_dir, yolo_files[0])  # Select first file
print(f"\n📌 Sample YOLO Annotation File: {sample_yolo_file}\n")

# Read and display contents of the sample file
with open(sample_yolo_file, "r") as f:
    yolo_data = f.readlines()

print("🔹 YOLO Format Annotations:")
for line in yolo_data:
    print(line.strip())  # Remove extra spaces


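Each line in a YOLO label file describes one object. It starts with the class ID (e.g., 4 for traffic light, 0 for car), followed by the normalized bounding box coordinates (x_center, y_center, width, height), each in the 0-1 range.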

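To sanity-check a label file, it can help to convert a line back into pixel corners and compare them with the original box2d values. The sketch below is illustrative only: the helper name `yolo_to_pixels` and the sample line are assumptions, and the 1280x720 frame size follows the conversion step.

```python
def yolo_to_pixels(x_center, y_center, width, height, img_w, img_h):
    """Convert a normalized YOLO box back to integer pixel corners (x1, y1, x2, y2)."""
    x1 = (x_center - width / 2) * img_w
    y1 = (y_center - height / 2) * img_h
    x2 = (x_center + width / 2) * img_w
    y2 = (y_center + height / 2) * img_h
    return round(x1), round(y1), round(x2), round(y2)


# Parse one label line: "<class_id> <x_center> <y_center> <width> <height>"
line = "0 0.5 0.5 0.25 0.5"
parts = line.split()
class_id = int(parts[0])
coords = [float(v) for v in parts[1:]]

corners = yolo_to_pixels(*coords, img_w=1280, img_h=720)
print(class_id, corners)
```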
# 3. Resize & Normalize Images for YOLO Training

**Resize all images to 640x640 (YOLO standard input size)**

**Normalize images (scale pixel values to the 0-1 range; this happens later, when the images are converted to tensors)**

**Save preprocessed images for training.**

**Resize & Save Images**

In [None]:
import cv2
import os
from tqdm import tqdm

# Define the output folder for resized images
resized_dir = "/kaggle/working/resized_images/"
os.makedirs(resized_dir, exist_ok=True)

# Define target size
TARGET_SIZE = (640, 640)  # Standard for YOLOv8

# Process each training image
for img_name in tqdm(os.listdir(train_image_folder)):
    img_path = os.path.join(train_image_folder, img_name)
    output_path = os.path.join(resized_dir, img_name)

    # Read the image
    img = cv2.imread(img_path)
    if img is None:
        print(f"❌ Skipping corrupted image: {img_name}")
        continue  # Skip corrupt images

    # Resize image
    img_resized = cv2.resize(img, TARGET_SIZE)

    # Save resized image
    cv2.imwrite(output_path, img_resized)

print(f"\n✅ Resized images saved in: {resized_dir}")

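Resizing a 1280x720 frame to 640x640 stretches it, so objects change shape, but the YOLO labels written earlier do not need to be regenerated. Because the coordinates are normalized by the image size, scaling x and y independently leaves them unchanged. The short check below illustrates this with a hypothetical box; `normalize_box` is a helper introduced only for this sketch.

```python
def normalize_box(x1, y1, x2, y2, img_w, img_h):
    """Return (x_center, y_center, width, height) normalized by the image size."""
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


# Original frame size and the target size used in the resize step
src_w, src_h = 1280, 720
dst_w, dst_h = 640, 640

# A hypothetical pixel box in the original frame
box = (320, 180, 960, 540)

# Stretching scales x by dst_w/src_w and y by dst_h/src_h
sx, sy = dst_w / src_w, dst_h / src_h
stretched_box = (box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy)

before = normalize_box(*box, src_w, src_h)
after = normalize_box(*stretched_box, dst_w, dst_h)
print(before)
print(after)  # matches `before` up to floating-point rounding
```

The trade-off is that the aspect ratio changes from 16:9 to 1:1, so the model sees horizontally compressed objects.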

**Verify Resized Images**

In [None]:
import cv2
import matplotlib.pyplot as plt
import random

# Select a random resized image
sample_img_name = random.choice(os.listdir(resized_dir))
sample_img_path = os.path.join(resized_dir, sample_img_name)

# Load the image
img = cv2.imread(sample_img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Display the image
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.axis("off")
plt.title(f"Sample Resized Image: {sample_img_name}")
plt.show()


# 4. Data Augmentation

**Install & Import Augmentation Library**

In [None]:
!pip install albumentations


In [None]:
import albumentations as A
import cv2
import os
import numpy as np
from tqdm import tqdm

# Define augmentation pipeline
augmentation = A.Compose([
    A.HorizontalFlip(p=0.5),  # 50% chance to flip image
    A.RandomRotate90(p=0.3),  # Random 90-degree rotation
    A.RandomBrightnessContrast(p=0.3),  # Adjust brightness & contrast
    A.MotionBlur(p=0.2, blur_limit=5),  # Apply motion blur
    A.RandomScale(scale_limit=0.2, p=0.3)  # Random zoom in/out
], bbox_params=A.BboxParams(format="yolo", label_fields=["category_ids"]))


**Apply Augmentations & Save New Images**

In [None]:
# Define output folder for augmented images
augmented_images_dir = "/kaggle/working/augmented_images/"
augmented_labels_dir = "/kaggle/working/augmented_labels/"
os.makedirs(augmented_images_dir, exist_ok=True)
os.makedirs(augmented_labels_dir, exist_ok=True)

# Process each image and apply augmentation
for img_name in tqdm(os.listdir(resized_dir)):
    img_path = os.path.join(resized_dir, img_name)
    label_path = os.path.join(yolo_label_dir, img_name.replace(".jpg", ".txt"))

    # Load image
    image = cv2.imread(img_path)
    if image is None:
        print(f"❌ Skipping corrupt image: {img_name}")
        continue

    # Load YOLO labels
    if not os.path.exists(label_path):
        print(f"⚠️ No annotation file found for: {img_name}")
        continue

    with open(label_path, "r") as f:
        labels = f.readlines()

    # Convert bounding boxes to YOLO format for Albumentations
    bboxes = []
    category_ids = []
    for label in labels:
        class_id, x_center, y_center, width, height = map(float, label.strip().split())
        bboxes.append([x_center, y_center, width, height])
        category_ids.append(class_id)

    # Apply augmentation
    augmented = augmentation(image=image, bboxes=bboxes, category_ids=category_ids)

    # Save augmented image
    aug_img_name = f"aug_{img_name}"
    aug_img_path = os.path.join(augmented_images_dir, aug_img_name)
    cv2.imwrite(aug_img_path, augmented["image"])

    # Save new YOLO label file
    aug_label_path = os.path.join(augmented_labels_dir, aug_img_name.replace(".jpg", ".txt"))
    with open(aug_label_path, "w") as f:
        for bbox, class_id in zip(augmented["bboxes"], augmented["category_ids"]):
            f.write(f"{int(class_id)} {' '.join(map(str, bbox))}\n")

print("\n✅ Data Augmentation Completed! Augmented images saved.")

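Passing bbox_params to the pipeline means the boxes are transformed together with the image. To make that concrete, the sketch below shows by hand what a horizontal flip does to normalized YOLO boxes. It is a hand-written illustration with a hypothetical helper name (`hflip_yolo_boxes`), not Albumentations' internal code.

```python
def hflip_yolo_boxes(bboxes):
    """Mirror normalized YOLO boxes left-to-right.

    Only x_center changes (x -> 1 - x); y_center, width, and height are unchanged.
    """
    return [[1.0 - xc, yc, w, h] for xc, yc, w, h in bboxes]


# Hypothetical box in the left part of the frame
boxes = [[0.25, 0.5, 0.5, 0.25]]
flipped = hflip_yolo_boxes(boxes)
print(flipped)
```

Flipping twice returns the original boxes, which is a cheap consistency check when debugging augmented labels.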

**Verify Augmented Images**

In [None]:
import random
import matplotlib.pyplot as plt

# Select a random augmented image
aug_sample = random.choice(os.listdir(augmented_images_dir))
aug_sample_path = os.path.join(augmented_images_dir, aug_sample)

# Load and display the image
aug_img = cv2.imread(aug_sample_path)
aug_img = cv2.cvtColor(aug_img, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(6,6))
plt.imshow(aug_img)
plt.axis("off")
plt.title(f"Sample Augmented Image: {aug_sample}")
plt.show()


# 5. Split Data into Train, Validation & Test Sets

Now, we need to split the augmented dataset into:

80% → Training (train/)

10% → Validation (val/)

10% → Testing (test/)

In [None]:
import os
import shutil
import random

# Define split ratios
train_ratio = 0.80
val_ratio = 0.10
test_ratio = 0.10

# Define paths for train, val, test folders
split_dirs = {
    "train": "/kaggle/working/final_dataset/train/",
    "val": "/kaggle/working/final_dataset/val/",
    "test": "/kaggle/working/final_dataset/test/"
}

# Create directories
for split, path in split_dirs.items():
    os.makedirs(path + "images/", exist_ok=True)
    os.makedirs(path + "labels/", exist_ok=True)

# Get list of all augmented images
all_aug_images = os.listdir(augmented_images_dir)
random.shuffle(all_aug_images)  # Shuffle for randomness

# Split data
train_split = int(len(all_aug_images) * train_ratio)
val_split = int(len(all_aug_images) * (train_ratio + val_ratio))

train_images = all_aug_images[:train_split]
val_images = all_aug_images[train_split:val_split]
test_images = all_aug_images[val_split:]

# Function to copy images & labels into a split folder
def copy_files(image_list, split_type):
    for img_name in image_list:
        src_img = os.path.join(augmented_images_dir, img_name)
        dest_img = os.path.join(split_dirs[split_type] + "images/", img_name)

        src_label = os.path.join(augmented_labels_dir, img_name.replace(".jpg", ".txt"))
        dest_label = os.path.join(split_dirs[split_type] + "labels/", img_name.replace(".jpg", ".txt"))

        shutil.copy(src_img, dest_img)  # Copy image
        if os.path.exists(src_label):  # Copy label only if it exists
            shutil.copy(src_label, dest_label)

# Copy files to respective folders
copy_files(train_images, "train")
copy_files(val_images, "val")
copy_files(test_images, "test")

print(f"✅ Data split completed!")
print(f"📌 Training Images: {len(train_images)}")
print(f"📌 Validation Images: {len(val_images)}")
print(f"📌 Testing Images: {len(test_images)}")

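The split above uses random.shuffle without a seed, so each run produces a different partition. If reproducibility matters, one option is a seeded split like the sketch below. The function name `split_items`, the seed value, and the sample filenames are assumptions for illustration.

```python
import random


def split_items(items, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it into train/val/test."""
    shuffled = sorted(items)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)

    train_end = int(len(shuffled) * train_ratio)
    val_end = int(len(shuffled) * (train_ratio + val_ratio))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Hypothetical filenames standing in for the augmented images
names = [f"aug_{i:03d}.jpg" for i in range(100)]
train, val, test = split_items(names)
print(len(train), len(val), len(test))
```

Using a dedicated random.Random instance keeps the seed local, so other code that relies on the global random state is unaffected.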

In [None]:
print("📌 Training set sample images:", os.listdir(split_dirs["train"] + "images/")[:5])
print("📌 Validation set sample images:", os.listdir(split_dirs["val"] + "images/")[:5])
print("📌 Testing set sample images:", os.listdir(split_dirs["test"] + "images/")[:5])


# 6. Convert Dataset into Tensor Format

In [None]:
!pip install torch torchvision


In [None]:
import torch
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
import cv2
import os
import numpy as np


**Define a PyTorch Dataset Class**

In [None]:
class YOLODataset(Dataset):
    def __init__(self, image_dir, label_dir, transform=None):
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.image_filenames = os.listdir(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.image_filenames)

    def __getitem__(self, index):
        img_name = self.image_filenames[index]
        img_path = os.path.join(self.image_dir, img_name)
        label_path = os.path.join(self.label_dir, img_name.replace(".jpg", ".txt"))

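        # Load image (cv2.imread returns None for unreadable files)
        image = cv2.imread(img_path)
        if image is None:
            raise FileNotFoundError(f"Unable to read image: {img_path}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Convert image to tensor
        if self.transform:
            image = self.transform(image)

        # Load YOLO labels (one row per object: class_id, x_center, y_center, width, height)
        bboxes = []
        if os.path.exists(label_path):
            with open(label_path, "r") as f:
                for line in f:
                    if not line.strip():
                        continue  # Skip blank lines
                    class_id, x_center, y_center, width, height = map(float, line.split())
                    bboxes.append([class_id, x_center, y_center, width, height])

        # Convert bounding boxes to a tensor; the reshape gives images without objects shape (0, 5)
        bboxes = torch.tensor(bboxes, dtype=torch.float32).reshape(-1, 5)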

        return image, bboxes

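The label-reading part of the dataset class can also be written as a standalone parser, which is easier to test without images or PyTorch. The sketch below is one possible version; the name `parse_yolo_labels` and the choice to raise ValueError on malformed lines are assumptions, not behavior of the class above.

```python
def parse_yolo_labels(text):
    """Parse the contents of a YOLO label file into a list of rows.

    Each row is [class_id, x_center, y_center, width, height].
    Blank lines are skipped; malformed lines raise ValueError.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # tolerate blank or whitespace-only lines
        if len(parts) != 5:
            raise ValueError(f"line {line_number}: expected 5 values, got {len(parts)}")
        class_id = int(float(parts[0]))  # accepts "0" as well as "0.0"
        coords = [float(v) for v in parts[1:]]
        rows.append([class_id, *coords])
    return rows


sample_text = "0 0.5 0.5 0.25 0.5\n\n4 0.1 0.2 0.05 0.1\n"
rows = parse_yolo_labels(sample_text)
print(rows)
print(len(parse_yolo_labels("")))  # an empty file yields no rows
```

Failing loudly on malformed lines makes label problems visible early, rather than surfacing later as a shape error during training.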

**Define Image Transformations**

In [None]:
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),
    transforms.ToTensor(),  # This Converts the image to PyTorch tensor (0-1 range)
])


# Save Preprocessed data

**Create Dataset & DataLoader**

In [None]:
# Define dataset paths (training set)
train_images_dir = "/kaggle/working/final_dataset/train/images/"
train_labels_dir = "/kaggle/working/final_dataset/train/labels/"

# Create dataset
train_dataset = YOLODataset(image_dir=train_images_dir, label_dir=train_labels_dir, transform=transform)

# Create DataLoader for batching
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=lambda x: x)

print(f"✅ Dataset loaded with {len(train_dataset)} images in tensor format.")


**Display a Sample Tensor**

In [None]:
import matplotlib.pyplot as plt

# Get a sample from the dataset
image_tensor, bbox_tensor = train_dataset[0]

# Convert image tensor to numpy for visualization
image_np = image_tensor.permute(1, 2, 0).numpy()

# Display the image
plt.figure(figsize=(6,6))
plt.imshow(image_np)
plt.axis("off")
plt.title(f"Sample Tensor Image (Shape: {image_tensor.shape})")
plt.show()

# Print the bounding box tensor
print("Bounding Box Tensor:", bbox_tensor)


# What I Had to Learn to Convert the Data into Tensors


**Why Convert Data into Tensors?**

Deep learning models, especially those built with PyTorch, do not process raw image files directly. Instead, they require numerical data in tensor format.
For this project, I had to convert self-driving car images and YOLO object detection labels into tensors, making them compatible with deep learning models.

**Understanding Tensors in PyTorch**

A tensor is a multi-dimensional numerical structure similar to an array but optimized for GPU acceleration.

Types of Tensors Used in This Project
✅ Image Tensors → Represent pixel values in the format (C, H, W)
✅ Bounding Box Tensors → Store object location in the format (class_id, x_center, y_center, width, height)

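The (C, H, W) layout differs from the (H, W, C) layout that OpenCV returns. The sketch below uses plain NumPy to mirror the two things torchvision's ToTensor does to a uint8 image: move the channel axis first and rescale values from 0-255 to 0-1. The tiny 2x3 image is hypothetical and chosen only so the shapes are easy to read.

```python
import numpy as np

# Hypothetical 2x3 RGB image in OpenCV/NumPy layout: (height, width, channels), uint8
image_hwc = np.zeros((2, 3, 3), dtype=np.uint8)
image_hwc[..., 0] = 255  # fill the red channel

# Move channels first and rescale to [0, 1], mirroring what ToTensor does
image_chw = image_hwc.transpose(2, 0, 1).astype(np.float32) / 255.0

print(image_hwc.shape, image_hwc.dtype)
print(image_chw.shape, image_chw.dtype)
```

This is also why the display cell calls permute(1, 2, 0) before plotting: matplotlib expects the channel axis last.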


**Steps to Convert Data into Tensors**

1️⃣ Load & Transform Images:

Used OpenCV (cv2) to load images and convert them from BGR to RGB.
Resized images to 640x640 and converted them to PyTorch tensors using torchvision.transforms.

2️⃣ Convert YOLO Annotations to Tensors:

Read YOLO .txt labels (<class_id> <x_center> <y_center> <width> <height>).
Converted the bounding boxes, whose coordinates were already normalized to the 0-1 range during YOLO conversion, into PyTorch tensors.

3️⃣ Create a PyTorch Dataset Class:

Built a custom YOLODataset class to dynamically load images & labels.
Applied transformations and returned image & bounding box tensors for training.

✅ Handled challenges like missing annotations, tensor shape mismatches, and bounding box normalization.


By completing this step, I gained hands-on experience in:
Understanding how deep learning models process tensor data instead of raw files.
Using OpenCV, PyTorch, and torchvision to load & transform images.
Efficiently handling YOLO bounding box annotations as tensors.
Building a structured PyTorch dataset for deep learning training.
