## **Problem Statement 1:**
Autonomous vehicles (AV) and intelligent transport systems (ITS) are the future of road transport. Automatic detection of vehicles on the road in real-time helps AV technology and makes ITS more intelligent in terms of vehicle tracking, vehicle counting, and road incident response.

## **Objective 1:**
As the first part of this project, you need to develop an AI model using a deep learning framework that predicts the type of vehicle present in an image and localizes the vehicle with a rectangular bounding box.

1. Create a parent folder for custom model training and child folders to store data.
2. Prepare the dataset for model training, keeping the following point in mind:
   - This dataset contains many images, and depending on the compute power of the VM, it might take a very long time to unzip this much data.
3. Choose a CNN architecture for object detection and train an object detection model with it. Architecture selection is a very important aspect of ML model training, so pick the one that works best for your dataset.
4. Evaluate the model and check the test results.
5. Run inference on sample images and check whether vehicles are detected accurately.

## **Step 1: Create a Class to Prepare Dataset**

With this structure:

- You can automatically feed batches of images + labels + bounding boxes into your model

- You can apply augmentations and transformations on the fly

- It allows clean separation of data logic from training logic

In [None]:
from torch.utils.data import Dataset
from PIL import Image
import os
import torch


class VehicleDataset(Dataset):
    def __init__(self, df, image_dir, transform=None, class_to_idx=None):
        self.df = df                            # DataFrame with image names, labels, and bounding boxes
        self.image_dir = image_dir              # Directory where image files are stored
        self.transform = transform              # Any image transforms (resize, tensor conversion, etc.)
        self.class_to_idx = class_to_idx        # Mapping from class name to integer (e.g., 'car': 0)

    def __len__(self):
        return len(self.df)                     # Total number of samples

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.image_dir, row['image_id'])  # Build the full path to the image file
        image = Image.open(img_path).convert("RGB")    # Load from disk and convert to RGB

        label = self.class_to_idx[row['class']]        # Encode class label as an integer
        # Bounding box as a float tensor, in the original image's pixel coordinates
        # (it is not rescaled if the transform below resizes the image)
        bbox = torch.tensor([row['x_min'], row['y_min'], row['x_max'], row['y_max']], dtype=torch.float32)


        if self.transform:
            image = self.transform(image)              # Apply transformations (resize, normalize, etc.)

        return image, label, bbox                      # Return one sample
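
One thing to watch with this class: the image goes through `transform`, but the bounding box is returned in the original image's pixel coordinates. If the transform resizes the image, the box no longer lines up with it. Below is a minimal sketch of one way to rescale a box, assuming boxes are `[x_min, y_min, x_max, y_max]` in pixels. `scale_bbox` is a hypothetical helper and is not part of the notebook.

```python
def scale_bbox(bbox, orig_size, new_size):
    """Rescale [x_min, y_min, x_max, y_max] from orig_size to new_size.

    Sizes are (width, height) tuples, the same order as PIL's Image.size.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    sx = new_w / orig_w   # horizontal scale factor
    sy = new_h / orig_h   # vertical scale factor
    x_min, y_min, x_max, y_max = bbox
    return [x_min * sx, y_min * sy, x_max * sx, y_max * sy]


# Hypothetical example: an 896x448 image resized to 224x224
scaled = scale_bbox([100, 50, 300, 200], orig_size=(896, 448), new_size=(224, 224))
print(scaled)  # [25.0, 25.0, 75.0, 100.0]
```

Inside `__getitem__`, `image.size` could be read before the transform is applied and passed as `orig_size`. Dividing the result by the new width and height would additionally normalize the box to the range 0 to 1, which tends to keep a regression loss at a manageable scale.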



## **Step 2: Clean CSV and Filter Missing Files**

In [None]:
import pandas as pd
import os

# Load labels
column_names = ['image_id', 'class', 'x_min', 'y_min', 'x_max', 'y_max']
df = pd.read_csv('/content/drive/MyDrive/AV-ITS Capstone Project /labels.csv',header=None,names=column_names)

# Folder where your images are stored
image_dir = '/content/drive/MyDrive/AV-ITS Capstone Project /Images'

# image_id values in labels.csv are not zero-padded, but the image filenames are.
# Pad to 8 digits and append ".jpg" so the two match.
df['image_id'] = df['image_id'].astype(str).str.zfill(8) + ".jpg"

# Some rows in the labels file may have no matching image in the Images folder; keep only rows whose image file exists
df = df[df['image_id'].apply(lambda x: os.path.exists(os.path.join(image_dir, x)))]

# Encode class labels
class_names = df['class'].unique()
class_to_idx = {cls: i for i, cls in enumerate(sorted(class_names))}
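
To make the two transformations above concrete, here is a small standalone sketch of the filename padding and the class-to-index mapping. The IDs and class names are made up for illustration and are not taken from `labels.csv`.

```python
# Zero-pad numeric IDs to 8 digits and append the extension,
# mirroring what .str.zfill(8) + ".jpg" does on the DataFrame column.
toy_ids = [0, 1, 4646]
toy_filenames = [str(i).zfill(8) + ".jpg" for i in toy_ids]
print(toy_filenames)  # ['00000000.jpg', '00000001.jpg', '00004646.jpg']

# Sorting the unique class names first makes the mapping deterministic,
# regardless of the order in which classes appear in the file.
toy_classes = ["truck", "car", "bus", "car"]  # hypothetical labels
toy_class_to_idx = {cls: i for i, cls in enumerate(sorted(set(toy_classes)))}
print(toy_class_to_idx)  # {'bus': 0, 'car': 1, 'truck': 2}
```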


In [None]:
print("Number of rows in df:", len(df))
import os

print("Sample images in folder:")
print(os.listdir(image_dir)[:5])

print("Sample filenames in df:")
print(df['image_id'].head(10))


Number of rows in df: 17967
Sample images in folder:
['00004646.jpg', '00004645.jpg', '00004647.jpg', '00004620.jpg', '00004614.jpg']
Sample filenames in df:
0    00000000.jpg
1    00000000.jpg
2    00000000.jpg
3    00000000.jpg
4    00000000.jpg
5    00000001.jpg
6    00000001.jpg
7    00000001.jpg
8    00000001.jpg
9    00000001.jpg
Name: image_id, dtype: object


## **Step 3: Define CNN Model (Dual Head)**

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class ObjectClassifierAndLocalizer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()

        # Shared CNN backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=1, padding=1),  # Conv layer
            nn.ReLU(),
            nn.MaxPool2d(2),  # Downsample

            nn.Conv2d(16, 32, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        # Flatten and shared dense layer
        self.flatten = nn.Flatten()
        # 64 channels * 28 * 28: three 2x2 max-pools reduce a 224x224 input to 28x28
        self.fc1 = nn.Linear(64 * 28 * 28, 512)

        # Output heads
        self.class_head = nn.Linear(512, num_classes)  # classification output
        self.bbox_head = nn.Linear(512, 4)             # bounding box output

    def forward(self, x):
        x = self.backbone(x)
        x = self.flatten(x)
        x = F.relu(self.fc1(x))

        # Two parallel outputs
        class_output = self.class_head(x)     # class logits (e.g., [0.2, 1.5, -0.6, ...])
        bbox_output = self.bbox_head(x)       # 4 values: [x_min, y_min, x_max, y_max]

        return class_output, bbox_output
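
The input size of `fc1` depends on the image resolution, so it is worth checking the arithmetic whenever the resize changes. The sketch below reproduces the size calculation for the backbone above (3x3 convolutions with stride 1 and padding 1, each followed by a 2x2 max-pool). The helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Standard output-size formula for a convolution along one dimension
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    # MaxPool2d(2) halves each spatial dimension (floor division)
    return size // kernel


def flattened_features(input_size, channels=(16, 32, 64)):
    # One conv + pool stage per entry in `channels`
    size = input_size
    for _ in channels:
        size = pool_out(conv_out(size))
    return channels[-1] * size * size


print(flattened_features(224))  # 50176, i.e. 64 * 28 * 28
print(flattened_features(128))  # 16384, i.e. 64 * 16 * 16
```

With 128x128 inputs, the first linear layer of this custom CNN would therefore need `64 * 16 * 16` input features instead of `64 * 28 * 28`.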


## **Step 3 (Alternative): Pretrained ResNet-Based Dual-Head Model (Classification + Localization) for Faster Results**

In [None]:
import torch.nn as nn
import torchvision.models as models
import torch.nn.functional as F

class ResNetClassifierLocalizer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()

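        # Load ResNet18 with ImageNet-pretrained weights
        # (the `weights` argument requires torchvision >= 0.13; older versions use pretrained=True)
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)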

        # Remove final fully connected layer (fc) to use as feature extractor
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # output: (batch, 512, 1, 1)

        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(512, 256)

        # Classification head
        self.class_head = nn.Linear(256, num_classes)

        # Bounding box head (x_min, y_min, x_max, y_max)
        self.bbox_head = nn.Linear(256, 4)

    def forward(self, x):
        x = self.backbone(x)             # (B, 512, 1, 1)
        x = self.flatten(x)              # (B, 512)
        x = F.relu(self.fc1(x))          # (B, 256)

        class_output = self.class_head(x)
        bbox_output = self.bbox_head(x)

        return class_output, bbox_output


## **Step 4: Transforms, Train/Test Split, and DataLoaders**

In [None]:
from torchvision import transforms
from torch.utils.data import DataLoader
import torch

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    # transforms.Resize((128, 128)),  # Smaller input size, tried to check whether it reduces training time
    transforms.ToTensor(),
])

# Dataset over the full DataFrame (not used below; the train and test datasets are built after the split)
dataset = VehicleDataset(df, image_dir=image_dir, transform=transform, class_to_idx=class_to_idx)

# Split train/test
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['class'])

train_dataset = VehicleDataset(train_df, image_dir, transform, class_to_idx)
test_dataset = VehicleDataset(test_df, image_dir, transform, class_to_idx)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)


## **Step 5: Train the Model (loss = classification + localization)**

In [None]:
#model = ObjectClassifierAndLocalizer(num_classes=len(class_names))
model = ResNetClassifierLocalizer(num_classes=len(class_names))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
cls_loss_fn = nn.CrossEntropyLoss()
bbox_loss_fn = nn.MSELoss()

# Training loop
for epoch in range(3):
    model.train()
    total_loss = 0
    for images, labels, bboxes in train_loader:
        preds_cls, preds_bbox = model(images)

        loss_cls = cls_loss_fn(preds_cls, labels)
        loss_bbox = bbox_loss_fn(preds_bbox, bboxes)

        loss = loss_cls + loss_bbox

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()   # Running sum of per-batch losses (not an average)

    print(f"Epoch {epoch+1}: Loss = {total_loss:.4f}")


Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 172MB/s]


Epoch 1: Loss = 24373996.2422
Epoch 2: Loss = 8283502.4795
Epoch 3: Loss = 7955033.1855


## **Step 6: Put the Model in Evaluation Mode and Evaluate**

In [None]:
model.eval()

In [None]:
import torch

correct = 0
total = 0
test_loss = 0

model.eval()
with torch.no_grad():
    for images, labels, bboxes in test_loader:
        preds_cls, preds_bbox = model(images)

        # Classification accuracy
        _, predicted = torch.max(preds_cls, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

        # Combined test loss (classification + bounding box), summed over batches
        loss_cls = cls_loss_fn(preds_cls, labels)
        loss_bbox = bbox_loss_fn(preds_bbox, bboxes)
        test_loss += (loss_cls + loss_bbox).item()


In [None]:
print(f"Test Accuracy (Classification): {100 * correct / total:.2f}%")
print(f"Test Loss (Classification + BBox): {test_loss:.4f}")
