# **URBE *Perception*** 🚘 - *real-time vehicle detection for self-driving cars in Rome*

> *Refer to the notebook* [📙](https://colab.research.google.com/drive/1sCqnwYm9Dodk1YodD1asVpRMBBdT-8r1?usp=share_link) *on **dataset** creation if you haven't already, and to the one where official YOLOv5 models are **finetuned** on my custom dataset* [📗](https://colab.research.google.com/drive/1Kb_M6O7NMIdwVFk4EDGM-1NH7fwyMK1p?usp=share_link).

## Introduction

**Why edge computing and self-driving cars?**

The significance of edge computing in self-driving cars lies in its ability to process data and make decisions in real-time at the edge of the network, near the source of the data. This is imperative for self-driving cars, as they require prompt and precise decision-making based on information obtained from sensors such as cameras, radar, and lidar.

In a cloud computing software architecture, data is sent to a central server for processing and then the results are sent back to the device. This approach doesn't work for self-driving cars, as the latency, or delay, of transmitting data back and forth between the car and a central server can be dangerous in a real-time driving scenario. With edge computing, the data is processed locally on the car, reducing latency and improving response times. It is indeed a key technology for enabling self-driving cars to operate safely and effectively.

This approach to building intelligent systems is growing in popularity, especially in light of a potentially challenging future in which the resources needed for extensive computation may become scarce. For this reason, many researchers in the AI field are pursuing this direction (https://news.mit.edu/2023/autonomous-vehicles-carbon-emissions-0113).

**The idea**

As one can imagine, the computational system behind a self-driving car is huge and extremely complex; it integrates many technologies, including sensing (lidars, cameras, radars), localization, decision making and **perception**, on which my work is focused. <br> **Urbe** means "*city*" and is used to refer to the city of Rome. My final goal is to build a real-time system that runs on embedded devices (such as *Nvidia Jetson Nano* or *Google Coral*) and detects *vehicles*, *pedestrians* and *motorbikes* on the streets of Rome: in effect, a real submodule for autonomous cars. For this project, however, I only built an object detection system with an eye toward inference time and, most importantly, toward the application scenario, Rome.

## Imports & Downloads

In [None]:
# install the requirements
%pip install -r requirements.txt > /dev/null
# set to false if you already have the dataset
download_dataset = False 
if download_dataset:
    %cd dataset
    !bash download_dataset.sh
    %cd ..

In [None]:
from src.hyperparameters import Hparams
from src.data_module import URBE_DataModule
from src.model import URBE_Perception
from src.loss import YOLO_Loss
from src.train import train_model

from dataclasses import asdict
import matplotlib.pyplot as plt
import wandb
import json
import torchvision.transforms as T
import pytorch_lightning as pl
import gc
from collections import Counter
from tqdm import tqdm
import os
import numpy as np
import random
from PIL import Image
from torchvision import transforms
from torch import nn
import cv2
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# reproducibility stuff (numpy and random are already imported above)
import torch
np.random.seed(0)
random.seed(0)
torch.cuda.manual_seed(0)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True  # Note that this Deterministic mode can have a performance impact
torch.backends.cudnn.benchmark = False
_ = pl.seed_everything(0)
# to have a better workflow using notebook https://stackoverflow.com/questions/5364050/reloading-submodules-in-ipython
# these commands allow to update the .py codes imported instead of re-importing everything every time.
%load_ext autoreload
%autoreload 2
#%env WANDB_NOTEBOOK_NAME = ./notebook.ipynb
gc.collect()

In [None]:
# login wandb to have the online logger. It is really useful since it stores all the plots and evolution of the model
# check also https://docs.wandb.ai/guides/integrations/lightning
wandb.login()

## Dataset

In [None]:
hparams = asdict(Hparams())
URBE_Data = URBE_DataModule(hparams)
URBE_Data.setup()
print(len(URBE_Data.data_train))
print(len(URBE_Data.data_val))
print(len(URBE_Data.data_test))
print("TOTAL: "+str(len(URBE_Data.data_train)+len(URBE_Data.data_val)+len(URBE_Data.data_test))+" images")

### Bounding Boxes Visualization

Visualizing bounding boxes is of course needed to show the results at the end of the project and to inspect the predictions on the validation set during training, but it was also essential in the *data processing* phase to assess the quality of the datasets' bounding box annotations and, more generally, to understand the characteristics of the data. <br> *(I tried **Scalabel** and **FiftyOne**, but **WandB** was the best choice.)* 

> Let's test the *dataloaders* and see some samples from a training batch!

In [None]:
def draw_bbox(label):
  ris = { "predictions" : {"box_data" : [] , "class_labels" : {0 : "vehicle" , 1 : "person", 2 : "motorbike"}} }
  for ann in label: # for each bbox of the particular image
    if ann.sum()==0: # all-zero rows [0,0,0,0,0] are padding, appended so that every sample in a batch has the same number of label rows
      break
    position = {"minX": ((ann[1]-(ann[3]/2))*1280).item(), "maxX": ((ann[1]+(ann[3]/2))*1280).item(), "minY": ((ann[2]-(ann[4]/2))*720).item(), "maxY": ((ann[2]+(ann[4]/2))*720).item()}
    class_id = int(ann[0])
    box_caption = ris["predictions"]["class_labels"][class_id]
    x = {"position" : position, "domain" : "pixel", "class_id" : class_id, "box_caption" : box_caption}
    ris["predictions"]["box_data"].append(x)
  return ris
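
The geometry used inside `draw_bbox` can be checked in isolation. The sketch below is a plain-Python illustration (the helper name `yolo_to_pixel_corners` and the toy labels are hypothetical, not part of the project) of how a normalized `[class, cx, cy, w, h]` row maps to pixel corners on a 1280x720 frame, and how the all-zero padding rows end the loop.

```python
def yolo_to_pixel_corners(row, img_w=1280, img_h=720):
    """Convert a normalized [class, cx, cy, w, h] row to pixel corners (hypothetical helper)."""
    _, cx, cy, w, h = row
    return {
        "minX": (cx - w / 2) * img_w,
        "maxX": (cx + w / 2) * img_w,
        "minY": (cy - h / 2) * img_h,
        "maxY": (cy + h / 2) * img_h,
    }

# labels of one image: two real boxes followed by a zero padding row (toy values)
labels = [
    [0, 0.5, 0.5, 0.25, 0.5],
    [1, 0.25, 0.25, 0.125, 0.25],
    [0, 0, 0, 0, 0],
]

corners = []
for row in labels:
    if sum(row) == 0:  # padding row: no more boxes for this image
        break
    corners.append(yolo_to_pixel_corners(row))

print(corners[0])  # {'minX': 480.0, 'maxX': 800.0, 'minY': 180.0, 'maxY': 540.0}
```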

In [None]:
# we take one batch from the training set
batch = next(iter(URBE_Data.train_dataloader()))

user_name = "lavallone"
project_name = "VISIOPE_project"
version_name = "dataset"
run = wandb.init(entity=user_name, project=project_name, name = version_name, mode = "online")

transform = T.ToPILImage()
images_list = [transform(img) for img in batch["img"]]
images_list = [img.resize((1280, 720)) for img in images_list]

my_data = []
for i,label in enumerate(batch["labels"]):
    bbox_list = draw_bbox(label) # label is a list of lists
    my_data.append([batch["id"][i], wandb.Image(images_list[i], boxes=bbox_list)])
table = wandb.Table(columns=['ID', 'Image'], data=my_data)
print("logging the table...")
wandb.log({"dataloaders testing": table})

### Statistics 📊

Before starting with the real development of the detection system, we want to plot the statistics of our data. 
> Since iterating over the dataloaders for the whole dataset is costly and painful, we use the "*annotations.json*" file as the source for the dataset statistics.
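
As a minimal sketch of the kind of statistic computed from the annotations file, the snippet below derives a per-class percentage from COCO-style annotation dictionaries. The keys `image_id` and `category_id` are the ones used in this notebook; the function name `class_distribution` and the toy data are assumptions made only for illustration. The image ids are turned into a `set`, which should make the membership test much cheaper than scanning a list for every annotation.

```python
from collections import Counter

def class_distribution(annotations, image_ids, num_classes=3):
    """Percentage of boxes per class among the annotations of the given images."""
    image_ids = set(image_ids)  # set membership is O(1) on average
    counts = Counter(a["category_id"] for a in annotations if a["image_id"] in image_ids)
    total = sum(counts[c] for c in range(num_classes))
    return [100 * counts[c] / total for c in range(num_classes)]

# toy COCO-style annotations (made-up values, only for illustration)
toy_annotations = [
    {"image_id": "a", "category_id": 0},
    {"image_id": "a", "category_id": 0},
    {"image_id": "a", "category_id": 1},
    {"image_id": "b", "category_id": 2},
    {"image_id": "c", "category_id": 0},
]
train_distribution = class_distribution(toy_annotations, ["a", "b"])
print(train_distribution)  # [50.0, 25.0, 25.0]
```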



In [None]:
# function for plotting data --> three groups of bars, one each for train/val/test
def three_group_bar(columns, data, title, percentage=True): # columns is a list of labels; data is a list of three lists (train, val, test)
  labels = columns
  
  train = data[0]
  val = data[1]
  test = data[2]
  
  color_list = []
  for _ in range(len(data)):
    color = [random.randrange(0, 255)/255, random.randrange(0, 255)/255, random.randrange(0, 255)/255, 1]
    color_list.append(color)
    
  x = np.arange(len(labels))
  width = 0.15  # the width of the bars
  fig, ax = plt.subplots(figsize=(12, 5), layout='constrained')
  rects1 = ax.bar(x - width, train, width, label='Train', color=color_list[0])
  rects2 = ax.bar(x, val, width, label='Val', color=color_list[1])
  rects3 = ax.bar(x + width, test, width, label='Test', color=color_list[2])
  # Add some text for labels, title and custom x-axis tick labels, etc.
  ax.set_title(title)
  ax.set_xticks(x, labels)
  ax.legend()
  if percentage:
    rects1_labels = [('%.2f' % i) + "%" for i in train]
    rects2_labels = [('%.2f' % i) + "%" for i in val]
    rects3_labels = [('%.2f' % i) + "%" for i in test]
  else:
    rects1_labels = train
    rects2_labels = val
    rects3_labels = test
  
  ax.bar_label(rects1, rects1_labels, padding=3)
  ax.bar_label(rects2, rects2_labels, padding=3)
  ax.bar_label(rects3, rects3_labels, padding=3)

In [None]:
# setup
d = json.load(open("dataset/URBE_dataset/labels/COCO/annotations.json"))
annotations = d["annotations"]
images = d["images"]

train_image_id_list = [f.split("_")[-1][:-4] for f in os.listdir("dataset/URBE_dataset/images/train/")]
val_image_id_list = [f.split("_")[-1][:-4] for f in os.listdir("dataset/URBE_dataset/images/val/")]
test_image_id_list = [f.split("_")[-1][:-4] for f in os.listdir("dataset/URBE_dataset/images/test/")]

**Number of classes**

In [None]:
data = []

# TRAIN
classes_list = [ann["category_id"] for ann in tqdm(annotations) if ann["image_id"] in train_image_id_list]
c = Counter(classes_list)
tot = c[0] + c[1] + c[2]
data.append([(c[0]/tot)*100, (c[1]/tot)*100, (c[2]/tot)*100])

# VAL
classes_list = [ann["category_id"] for ann in tqdm(annotations) if ann["image_id"] in val_image_id_list]
c = Counter(classes_list)
tot = c[0] + c[1] + c[2]
data.append([(c[0]/tot)*100, (c[1]/tot)*100, (c[2]/tot)*100])

# TEST
classes_list = [ann["category_id"] for ann in tqdm(annotations) if ann["image_id"] in test_image_id_list]
c = Counter(classes_list)
tot = c[0] + c[1] + c[2]
data.append([(c[0]/tot)*100, (c[1]/tot)*100, (c[2]/tot)*100])

In [None]:
data = [[82.46814899865309, 17.16686171034909, 0.364989290997814], [82.85404948638728, 16.814104764671193, 0.33184574894152646], [82.80439305749428, 16.843565364727365, 0.35204157777836304]]
columns = ["vehicle", "person", "motorbike"]
three_group_bar(columns, data, "train/val/test Classes Distribution")

**Time of the day**

In [None]:
data = []

# TRAIN
time_list = [img["timeofday"] for img in tqdm(images) if img["id"] in train_image_id_list]
c = Counter(time_list)
tot = c["daytime"] + c["Day"] + c["night"] + c["Night"] + c["dawn/dusk"] + c["Dawn/Dusk"]
data.append([ ((c["daytime"]+c["Day"])/tot)*100, ((c["night"]+c["Night"])/tot)*100, ((c["dawn/dusk"]+c["Dawn/Dusk"])/tot)*100 ])

# VAL
time_list = [img["timeofday"] for img in tqdm(images) if img["id"] in val_image_id_list]
c = Counter(time_list)
tot = c["daytime"] + c["Day"] + c["night"] + c["Night"] + c["dawn/dusk"] + c["Dawn/Dusk"]
data.append([ ((c["daytime"]+c["Day"])/tot)*100, ((c["night"]+c["Night"])/tot)*100, ((c["dawn/dusk"]+c["Dawn/Dusk"])/tot)*100 ])

# TEST
time_list = [img["timeofday"] for img in tqdm(images) if img["id"] in test_image_id_list]
c = Counter(time_list)
tot = c["daytime"] + c["Day"] + c["night"] + c["Night"] + c["dawn/dusk"] + c["Dawn/Dusk"]
data.append([ ((c["daytime"]+c["Day"])/tot)*100, ((c["night"]+c["Night"])/tot)*100, ((c["dawn/dusk"]+c["Dawn/Dusk"])/tot)*100 ])

In [None]:
data = [[57.89522657485811, 35.27379733879222, 6.830976086349678], [58.056361763879785, 34.95927347626627, 6.984364759853946], [58.73741141365162, 34.61395001864976, 6.64863856769862]]
columns = ["day", "night", "dawn/dusk"]
three_group_bar(columns, data, "train/val/test TimeOfDay Distribution")

## Model

We organized the dataset to be compatible with the COCO format. We did so initially because all the *YOLO* architectures were trained and tested on COCO.

We'll now focus on the **YOLOv5** model, considered one of the best at the moment in terms of the *accuracy*/*inference time* trade-off, and with very detailed, PyTorch-based documentation.

Our goal is to achieve the best performance on our custom "*URBE_dataset*". To do so, we perform the following steps:
- build a custom YOLOv5 architecture (based on the official repo), so that we can use the weights *pretrained* on the COCO dataset for the **backbone** and the **neck**.
- compute the anchors that best fit our dataset with the **autoanchor** algorithm implemented by Glenn Jocher (one of the authors of YOLOv5). This contributes significantly to enhancing the overall model.
- add only basic **augmentations** to the images and to the bounding boxes with the ***Albumentations*** library. We decided not to apply *Mosaic Augmentation* (one of the main augmentation techniques suggested for YOLOv5) because the custom dataset already gives the model a large enough generalization capability.
- try attaching a **Decoupled Head** at the end of the model (as introduced in YOLOv6 and subsequent architectures) and see if there is an improvement.
- experiment with different versions of the **IoU loss** (GIoU, DIoU or CIoU).
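
To make the last step more concrete, here is a minimal plain-Python sketch of IoU and GIoU for two boxes in `(x1, y1, x2, y2)` format. It follows the standard definitions rather than the code in `src/loss.py`, which may differ; the function name `iou_and_giou` is hypothetical. The corresponding loss is usually taken as `1 - GIoU`; DIoU and CIoU replace the enclosing-area penalty with terms based on the distance between box centers and, for CIoU, the aspect ratio.

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # intersection rectangle (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # smallest axis-aligned box enclosing both inputs
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose
    return iou, giou

overlapping = iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = iou_and_giou((0, 0, 1, 1), (2, 2, 3, 3))
print(overlapping)
print(disjoint)
```

Unlike plain IoU, GIoU stays informative for disjoint boxes: it becomes negative and decreases as the boxes move apart, which is why it is often preferred as a regression loss.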

The *git repos* from which I took some information about building the model architecture are:
- https://github.com/ultralytics/yolov5
- https://github.com/AlessandroMondin/YOLOV5m
- https://github.com/Iywie/pl_YOLO

### Autoanchor

Anchors in YOLO models are predefined bounding boxes used to represent the shape and size of the objects in an image. These anchors are used as a reference to compare the predicted bounding boxes from the model with the actual bounding boxes around the objects. 

Glenn Jocher introduced the idea of learning anchor boxes based on the distribution of bounding boxes in the custom dataset with *K-means* and *genetic* learning algorithms. This is very important for custom tasks, because the distribution of bounding box sizes and locations may be dramatically different than the preset bounding box anchors in the COCO dataset. 

> *The autoanchor algorithm runs automatically before training starts (the training code is made publicly available by the YOLOv5 authors).* 

We therefore first make the annotations (which are in COCO format) compatible with the "*YOLOv5 text format*". Then we "train" a YOLOv5 architecture on our dataset, although we actually only leverage the autoanchor functionality. 
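
The format change boils down to one formula per box: COCO stores `[x_min, y_min, width, height]` in pixels, while the YOLOv5 text format expects `class cx cy w h` with all four values normalized to `[0, 1]`. A minimal sketch, assuming 1280x720 images as in this dataset (the helper names `coco_to_yolo` and `yolo_label_line` are hypothetical):

```python
def coco_to_yolo(bbox, img_w=1280, img_h=720):
    """COCO [x_min, y_min, width, height] in pixels -> (cx, cy, w, h) normalized to [0, 1]."""
    x_min, y_min, w, h = bbox
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)

def yolo_label_line(category_id, bbox, img_w=1280, img_h=720, ndigits=6):
    """One line of a YOLOv5 .txt label file: 'class cx cy w h'."""
    values = coco_to_yolo(bbox, img_w, img_h)
    return " ".join([str(category_id)] + [f"{v:.{ndigits}f}" for v in values])

line = yolo_label_line(0, [320, 180, 640, 360])
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

The number of decimals is a design choice: rounding to two decimals corresponds to a horizontal step of 12.8 pixels on a 1280-pixel-wide image, which may matter for small boxes, so the sketch keeps it as a parameter.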

In [None]:
import json
import os
from tqdm import tqdm
annotations = json.load(open("dataset/URBE_dataset/labels/COCO/annotations.json", "r"))["annotations"]

# save image_id for each images in dataset/YOLOv5_format/train and also the .txt name where the labels will be written!
image_id_list = []
txt_labels_list = []
for f in os.listdir("dataset/YOLOv5_format/images/1")+os.listdir("dataset/YOLOv5_format/images/2")+os.listdir("dataset/YOLOv5_format/images/3")+os.listdir("dataset/YOLOv5_format/images/4"):
    image_id = f.split("_")[-1][:-4]
    txt_labels_name = f[:-4]+".txt"
    image_id_list.append(image_id)
    txt_labels_list.append(txt_labels_name)

# filtering of only the annotations of the images in dataset/YOLOv5_format/train
filter_annotations = list(filter(lambda x: x["image_id"] in image_id_list, annotations))

# for each images in dataset/YOLOv5_format/train
for image_id, txt_labels_name in list(zip(image_id_list, txt_labels_list)):
    image_labels = list( map(lambda x: [x["category_id"], x["bbox"][0], x["bbox"][1], x["bbox"][2], x["bbox"][3]], list(filter(lambda x: x["image_id"] == image_id, filter_annotations)) ) )
    
    txt_file_name = "dataset/YOLOv5_format/labels/" + txt_labels_name

    line_to_write = []
    for line in image_labels:
        x1 = float(line[1])
        y1 = float(line[2])
        w = float(line[3])
        h = float(line[4])
        c1 = round(((x1 + w/2) / 1280), 2)
        c2 = round(((y1 + h/2) / 720), 2)
        w = round(w/1280, 2)
        h = round(h/720, 2)
        line_to_write.append(" ".join([str(line[0]), str(c1), str(c2), str(w), str(h)]))
    with open(txt_file_name, 'w') as f:
        f.write("\n".join(line_to_write))

> <a href="https://imgur.com/Iebkt2Y"><img src="https://i.imgur.com/Iebkt2Y.png" width=65 height=25 title="source: imgur.com" /></a> After downloading the YOLOv5 format dataset on *Roboflow*, we "train" it using the YOLOv5 official code.

We followed the [colab Roboflow tutorial](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/train-yolov5-object-detection-on-custom-data.ipynb) about training YOLOv5 with a dataset already uploaded in Roboflow in order to perform the autoanchor algorithm and see if the default given anchors fit the dataset. The answer was positive and therefore we didn't have to change them. 
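
To give an idea of what "the anchors fit the dataset" means, the sketch below computes a *best possible recall* (BPR) style score in plain Python: the fraction of ground-truth boxes whose width and height are both within a factor `thr` of at least one anchor. This is my simplified reading of the check performed by autoanchor, not the official code; the threshold `thr=4.0` and the function name are assumptions, and the box sizes are made up. To my understanding, the official implementation keeps the current anchors when this score is very close to 1.

```python
def best_possible_recall(boxes_wh, anchors_wh, thr=4.0):
    """Fraction of boxes whose width and height are both within a factor `thr` of some anchor."""
    matched = 0
    for bw, bh in boxes_wh:
        best = 0.0
        for aw, ah in anchors_wh:
            rw, rh = bw / aw, bh / ah
            # worst-dimension agreement in (0, 1]; 1.0 means identical shape
            ratio = min(rw, 1 / rw, rh, 1 / rh)
            best = max(best, ratio)
        if best > 1 / thr:
            matched += 1
    return matched / len(boxes_wh)

# default YOLOv5 anchors (width, height) in pixels for a 640x640 input
DEFAULT_ANCHORS = [
    (10, 13), (16, 30), (33, 23),
    (30, 61), (62, 45), (59, 119),
    (116, 90), (156, 198), (373, 326),
]

# made-up box sizes in pixels, only for illustration
toy_boxes = [(20, 40), (60, 50), (300, 300), (2, 2)]
bpr = best_possible_recall(toy_boxes, DEFAULT_ANCHORS)
print(bpr)  # 0.75: the 2x2 box is too small for every anchor
```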

<a href="https://imgur.com/zy6z9o9"><img src="https://i.imgur.com/zy6z9o9.png" title="source: imgur.com" /></a>

### Pretrained weights

The aim is to load the *pretrained* weights of the YOLOv5 architecture into our models. We create four "*.pt*" files because we deal with two versions of the YOLOv5 model (**medium** and **nano**) and two different types of HEAD (**Simple** and **Decoupled**). In every case, we load only the pretrained weights of the BACKBONE and the NECK; the HEAD weights are not loaded. 
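
Because the layer names in the official checkpoint differ from the ones in our architecture, the weights are matched by tensor shape rather than by name. The following plain-Python sketch illustrates that greedy matching on dictionaries of shapes only (no tensors); the function name and all layer names and shapes are hypothetical.

```python
def match_by_shape(pretrained_shapes, target_shapes):
    """Greedy, order-preserving matching: each pretrained tensor goes to the first
    still-unused target tensor with an identical shape."""
    assigned = {}  # target layer name -> pretrained layer name
    for src_name, src_shape in pretrained_shapes.items():
        for dst_name, dst_shape in target_shapes.items():
            if dst_shape == src_shape and dst_name not in assigned:
                assigned[dst_name] = src_name
                break
    return assigned

# hypothetical layer names and shapes, only for illustration
pretrained = {
    "model.0.conv.weight": (48, 3, 6, 6),
    "model.0.bn.weight": (48,),
    "model.0.bn.bias": (48,),
}
mine = {
    "backbone.0.conv.weight": (48, 3, 6, 6),
    "backbone.0.bn.weight": (48,),
    "backbone.0.bn.bias": (48,),
    "head.cls.weight": (3, 48, 1, 1),
}

mapping = match_by_shape(pretrained, mine)
unmatched = [name for name in mine if name not in mapping]
print(mapping)
print(unmatched)  # ['head.cls.weight']
```

This strategy only works if both state dicts list their layers in the same order: two tensors with the same shape but different roles would otherwise be swapped silently, so it is worth checking how many layers were actually loaded.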

In [None]:
import sys
sys.path.insert(0, '../yolov5')

**YOLOv5m**

In [None]:
# YOLOv5m - Simple HEAD
hparams = asdict(Hparams())
model = URBE_Perception(hparams)

my_weights = model.state_dict()
pretrained_weights = torch.load("pretrained/ultralytics_yolov5m.pt")["model"].state_dict()

# manually loading ultralytics weights in my architecture
state_dict = model.state_dict()
layers_loaded = []
for layer, weight in list(pretrained_weights.items())[:-7]:
    for my_layer, my_weight in list(state_dict.items())[:-7]:
        if weight.shape == my_weight.shape:
            if my_layer not in layers_loaded:
                state_dict[my_layer] = weight
                layers_loaded.append(my_layer)
                break

torch.save(state_dict, "pretrained/yolov5m_nh_simple.pt")
#model.load_state_dict(torch.load("pretrained/yolov5m_nh_simple.pt"))

In [None]:
# YOLOv5m - Decoupled HEAD
hparams = asdict(Hparams())
model = URBE_Perception(hparams)

my_weights = model.state_dict()
pretrained_weights = torch.load("pretrained/ultralytics_yolov5m.pt")["model"].state_dict()

# manually loading ultralytics weights in my architecture
state_dict = model.state_dict()
layers_loaded = []
for layer, weight in list(pretrained_weights.items())[:-7]:
    for my_layer, my_weight in list(state_dict.items())[:-109]:
        if weight.shape == my_weight.shape:
            if my_layer not in layers_loaded:
                state_dict[my_layer] = weight
                layers_loaded.append(my_layer)
                break

torch.save(state_dict, "pretrained/yolov5m_nh_decoupled.pt")
model.load_state_dict(torch.load("pretrained/yolov5m_nh_decoupled.pt"))

**YOLOv5n**

In [None]:
# YOLOv5n - Simple HEAD
hparams = asdict(Hparams())
model = URBE_Perception(hparams)

my_weights = model.state_dict()
pretrained_weights = torch.load("pretrained/ultralytics_yolov5n.pt")["model"].state_dict()

# manually loading ultralytics weights in my architecture
state_dict = model.state_dict()
layers_loaded = []
for layer, weight in list(pretrained_weights.items())[:-7]:
    for my_layer, my_weight in list(state_dict.items())[:-7]:
        if weight.shape == my_weight.shape:
            if my_layer not in layers_loaded:
                state_dict[my_layer] = weight
                layers_loaded.append(my_layer)
                break

torch.save(state_dict, "pretrained/yolov5n_nh_simple.pt")
#model.load_state_dict(torch.load("pretrained/yolov5n_nh_simple.pt"))

In [None]:
# YOLOv5n - Decoupled HEAD
hparams = asdict(Hparams())
model = URBE_Perception(hparams)

my_weights = model.state_dict()
pretrained_weights = torch.load("pretrained/ultralytics_yolov5n.pt")["model"].state_dict()

# manually loading ultralytics weights in my architecture
state_dict = model.state_dict()
layers_loaded = []
for layer, weight in list(pretrained_weights.items())[:-7]:
    for my_layer, my_weight in list(state_dict.items())[:-109]:
        if weight.shape == my_weight.shape:
            if my_layer not in layers_loaded:
                state_dict[my_layer] = weight
                layers_loaded.append(my_layer)
                break

torch.save(state_dict, "pretrained/yolov5n_nh_decoupled.pt")
#model.load_state_dict(torch.load("pretrained/yolov5n_nh_decoupled.pt"))

### Training

In [None]:
user_name = "lavallone"
project_name = "VISIOPE_project"
version_name = "yolov5n_decoupled"
run = wandb.init(entity=user_name, project=project_name, name = version_name, mode = "online")

hparams = asdict(Hparams())
data = URBE_DataModule(hparams)
model = URBE_Perception(hparams)

if hparams["load_pretrained"]:
    if hparams["first_out"] == 48:
        if hparams["head"] == "simple":
            model.load_state_dict(torch.load("pretrained/yolov5m_nh_simple.pt"))
        else:
            model.load_state_dict(torch.load("pretrained/yolov5m_nh_decoupled.pt"))
    else:
        if hparams["head"] == "simple":
            model.load_state_dict(torch.load("pretrained/yolov5n_nh_simple.pt"))
        else:
            model.load_state_dict(torch.load("pretrained/yolov5n_nh_decoupled.pt"))
            
# RESUME logic is embedded within the trainer
trainer = train_model(data, model, experiment_name = version_name, \
   patience=15, metric_to_monitor="map_50", mode="max", epochs = 100)

wandb.finish()

## Inference

The ultimate objective is to create a *real-time detection model* that can be integrated into the intricate self-driving system software architecture. In addition to performance metrics, the inference time of the model is also crucial, perhaps even more so.

### Strategies to decrease Inference Time

We mainly leverage three methods:

- **Floating Point 16 precision (FP16)**. Lowering the precision of the network weights, for example to 16-bit floating point, enables the model to process inputs faster. This capability can also be leveraged during training, enabling the deployment of large neural networks, since they require less memory and run faster (achieving up to 3X speedups on modern GPUs).
- **Pruning**. We use the *l1_unstructured* pruning method which acts by zeroing out the units with the lowest L1-norm. We simply apply it after the training phase (*post-pruning method*) by applying *0.3* and *0.5* sparsity amounts.
- **Quantization**. Model quantization is another performance optimization technique that allows speeding up inference and decreasing memory requirements by performing computations and storing tensors at lower bitwidths than floating-point precision. This is particularly beneficial during model deployment. We use *Quantization Aware Training (QAT)*, which mimics the effects of quantization during training: the computations are carried-out in floating-point precision but the subsequent quantization effect is taken into account. The weights and activations are quantized into lower precision only for inference, when training is completed.
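
As an illustration of the pruning step, the sketch below applies the L1 unstructured criterion to a plain Python list: the `amount` fraction of entries with the smallest absolute value is set to zero. It is a simplified stand-in for `torch.nn.utils.prune.l1_unstructured`, not its implementation; the function name and the weights are made up.

```python
def l1_unstructured_prune(weights, amount):
    """Zero out the `amount` fraction of entries with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    # indices sorted from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.5, -0.1, 0.3, -0.8, 0.05, 0.2, -0.4, 0.9, 0.01, -0.6]
pruned = l1_unstructured_prune(weights, amount=0.3)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned)    # [0.5, 0.0, 0.3, -0.8, 0.0, 0.2, -0.4, 0.9, 0.0, -0.6]
print(sparsity)  # 0.3
```

Note that the pruned entries are still stored: the tensor keeps its shape, so a parameter count based on `numel()` does not change after pruning.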

>  My idea was to apply **QAT** to my custom trained models, because it is easy to use: it only requires adding the built-in pytorch-lightning *QuantizationAwareTraining()* callback to the pl.Trainer class. However, due to an internal problem of the callback (see https://github.com/Lightning-AI/lightning/issues/16609), this was not possible.

<a href="https://imgur.com/VtrnS2h"><img src="https://i.imgur.com/VtrnS2h.png" width=750 height=80 title="source: imgur.com" /></a>

> For this reason, both for my *custom* trained models and for the official YOLOv5 models *finetuned* on my *URBE dataset*, only the **fp16** and **post-pruning** techniques are used to reduce the inference time.
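
To get a feeling for what is lost when moving to fp16, the standard-library sketch below rounds a few Python floats to IEEE 754 half precision (format code `"e"` of the `struct` module). It only illustrates the number format, with 16 bits per value instead of 32, and says nothing about the speed-up; the helper name `to_fp16` is hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

for value in (1.0, 0.1, 2049.0):
    print(value, "->", to_fp16(value))
```

Values such as `1.0` survive unchanged, while `0.1` and `2049.0` are rounded: half precision keeps only about three significant decimal digits, which is usually enough for inference but worth verifying with the mAP evaluation.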

### How to compute *inference* time

In [None]:
# input
hparams = asdict(Hparams())
hparams["max_number_images"] = 16
hparams["batch_size"] = 1
URBE_Data = URBE_DataModule(hparams)
URBE_Data.setup()
batch = next(iter(URBE_Data.train_dataloader()))
device = torch.device("cuda")
img_input = batch["img"].to(device)

# model
finetuned = False
if finetuned:
   model = torch.hub.load('ultralytics/yolov5', 'yolov5n')
else:
   hparams = asdict(Hparams())
   model = URBE_Perception(hparams)
   model_ckpt = "models/yolov5n_decoupled-epoch=47-map_50=0.2797.ckpt"
   model = URBE_Perception.load_from_checkpoint(model_ckpt, strict=False, device = "cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

fp16 = False
if fp16:
   model = model.half()
   img_input = img_input.half()
   
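apply_pruning = False # separate flag, so that the imported `prune` module does not overwrite it
amount = 0.3
if apply_pruning:
   import torch.nn.utils.prune as prune
   for name, m in model.named_modules():
      if isinstance(m, nn.Conv2d):
         prune.l1_unstructured(m, name='weight', amount=amount)
         prune.remove(m, 'weight') # make the pruning permanent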

# how to correctly compute inference time (therefore fps) for a model
# https://towardsdatascience.com/the-correct-way-to-measure-inference-time-of-deep-neural-networks-304a54e5187f
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions,1))

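# GPU-WARM-UP (in eval mode and inside no_grad, so that no autograd graph is built)
model.eval()
with torch.no_grad():
   for _ in range(10):
      _ = model(img_input)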
   
# MEASURE PERFORMANCE
with torch.no_grad():
  for rep in range(repetitions):
     starter.record()
     _ = model(img_input)
     ender.record()
     # WAIT FOR GPU SYNC
     torch.cuda.synchronize()
     curr_time = starter.elapsed_time(ender)
     timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
inference_time = mean_syn
fps = 1 / (inference_time/1000)
num_param = sum(p.numel() for p in model.parameters())

print("---------------------------------------")
print(f"Parameters: {num_param}")
print(f"Inference time: {inference_time:.3f} ms")
print(f"Frame Per Second: {fps:.3f}")
print("---------------------------------------")
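
The same measurement pattern (warm-up, repeated timed runs, mean latency, FPS) can be sketched without a GPU. The snippet below is a CPU-only stand-in that uses `time.perf_counter` and a dummy workload instead of a model; the function name `benchmark` is hypothetical. On a GPU, CUDA events and `torch.cuda.synchronize()` are needed as in the cell above, because kernels are launched asynchronously and a host-side timer alone would stop too early.

```python
import time

def benchmark(fn, warmup=10, repetitions=300):
    """Return (mean latency in ms, frames per second) for a zero-argument callable."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    mean_ms = sum(timings_ms) / repetitions
    return mean_ms, 1000 / mean_ms  # fps = 1 / (latency in seconds)

# dummy workload standing in for a forward pass
mean_ms, fps = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"Inference time: {mean_ms:.3f} ms")
print(f"Frame Per Second: {fps:.3f}")
```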

### TensorRT <a href="https://imgur.com/Unhn3jE"><img width=55 height= 25 src="https://i.imgur.com/Unhn3jE.png" title="source: imgur.com" /></a>

TensorRT is a deep learning inference optimizer and runtime library developed by NVIDIA. It is designed to optimize and accelerate the inference of deep learning models on NVIDIA GPUs, making it suitable for deployment in various embedded environments. Considering our application for self-driving cars and the need for real-time inference, this is a step we cannot do without.


> We tried to convert our trained models to TensorRT format and run inference on them!

In [None]:
%cd ../torch2trt
!python setup.py install
%cd ../VISIOPE_project

In [None]:
%cd ../torch2trt
import torch
from torch2trt import torch2trt

model.to("cuda")
# create example data
x = torch.zeros((1, 3, 640, 640)).cuda()

# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x]) # we can now execute the returned TRTModule just like the original PyTorch model :)
%cd ../VISIOPE_project

In [None]:
# SAVE 
torch.save(model_trt.state_dict(), 'model_trt.pth')

# and LOAD
%cd ../torch2trt
from torch2trt import TRTModule
model_trt = TRTModule()
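# load the same file that was saved above (it was written from the VISIOPE_project directory)
model_trt.load_state_dict(torch.load('../VISIOPE_project/model_trt.pth'))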
%cd ../VISIOPE_project

## Evaluation

### Mean Average Precision

In [None]:
def evaluate_performance(model, data, device, fp16):
    model.eval()
    dataset = data.test_dataloader() # TEST SET
    
    mAP = MeanAveragePrecision() # in this way the IoU thresholds are taken from the stepped range [0.5,...,0.95] with step 0.05

    with torch.no_grad():
        mAP_list = []
        for i, batch in enumerate(tqdm(iter(dataset))): # tqdm lets us visualize the progress over the dataset
            imgs = batch["img"]
            imgs = imgs.to(device)
            if fp16:
                imgs = imgs.half()
            out = model(imgs)
            
            targets = [YOLO_Loss.transform_targets(out, bboxes, torch.tensor(URBE_Perception.ANCHORS), URBE_Perception.STRIDE) for bboxes in batch["labels"]]
            # I want targets to be the same shape as predictions --> (bs, 3 , 80/40/20, 80/40/20, 6)
            t1 = torch.stack([target[0] for target in targets], dim=0).to(device,non_blocking=True)
            t2 = torch.stack([target[1] for target in targets], dim=0).to(device,non_blocking=True)
            t3 = torch.stack([target[2] for target in targets], dim=0).to(device,non_blocking=True)
            targets = [t1, t2, t3]
            
            pred_boxes = model.cells_to_bboxes(out, model.head.anchors, model.head.stride, device, is_pred=True)
            true_boxes = model.cells_to_bboxes(targets, model.head.anchors, model.head.stride, device, is_pred=False)
            _, _, pred_boxes = model.non_max_suppression(pred_boxes, iou_threshold=model.hparams.nms_iou_thresh, threshold=model.hparams.conf_threshold, max_detections=50, is_pred=True, filenames=batch["file_name"])
            true_boxes = model.non_max_suppression(true_boxes, iou_threshold=model.hparams.nms_iou_thresh, threshold=model.hparams.conf_threshold, max_detections=50, is_pred=False)
            
            pred_dict_list = []
            for b in range(len(pred_boxes)):
                if pred_boxes[b].numel() == 0: # if the model hasn't predicted any bboxes
                    pred_dict_list.append( dict(boxes=torch.tensor([]).to(device), scores=torch.tensor([]).to(device), labels=torch.tensor([]).to(device),) )
                else:
                    pred_dict_list.append( dict(boxes=pred_boxes[b][..., 2:], scores=pred_boxes[b][..., 1], labels=pred_boxes[b][..., 0],) )
            true_dict_list = [ dict(boxes=true_boxes[i][..., 2:], labels=true_boxes[i][..., 0],) for i in range(len(true_boxes)) ]
            
            mAP.update(pred_dict_list, true_dict_list)
            
            # update() accumulates the state, so compute() returns the metric over all the batches seen so far;
            # we call it once per batch and reuse the result
            metrics = mAP.compute()
            mAP_50_95 = metrics["map"]
            mAP_50 = metrics["map_50"]
            
            print(f"Batch: {i+1}")
            print(f"running mAP_50_95: {mAP_50_95:.3f}")
            print(f"running mAP_50: {mAP_50:.3f}")
            mAP_list.append((mAP_50_95, mAP_50))
        
        # the last running value already covers the whole test set; averaging the running values would bias the result
        print()
        print(f"TEST SET mAP@0.5: {mAP_list[-1][1].item()}")
        print()
        print(f"TEST SET mAP@[0.5:0.95]: {mAP_list[-1][0].item()}")
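
Before the metric is updated, the predictions go through non-maximum suppression. The project's `non_max_suppression` lives in `src/model.py` and may differ in its details; the sketch below is a generic, plain-Python greedy NMS on detections laid out as `(class_id, score, x1, y1, x2, y2)`, the same column order used above. Suppressing only boxes of the same class is an assumption of this sketch, and the detections are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(detections, iou_threshold, conf_threshold, max_detections=50):
    """Keep the highest-scoring boxes, dropping same-class boxes that overlap a kept one."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # a candidate survives if no kept box of the same class overlaps it too much
        if all(k[0] != det[0] or iou(k[2:], det[2:]) <= iou_threshold for k in kept):
            kept.append(det)
        if len(kept) == max_detections:
            break
    return kept

# made-up detections, only for illustration
dets = [
    (0, 0.90, 100, 100, 200, 200),
    (0, 0.80, 105, 105, 205, 205),  # overlaps the first box heavily
    (1, 0.70, 100, 100, 200, 200),  # same place, different class
    (0, 0.30, 400, 400, 450, 450),  # below the confidence threshold
]
kept = greedy_nms(dets, iou_threshold=0.5, conf_threshold=0.5)
print(kept)
```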

In [None]:
# define which model we want to test
hparams = asdict(Hparams())
model = URBE_Perception(hparams)
model_ckpt = "models/yolov5n_decoupled-epoch=47-map_50=0.2797.ckpt"
model = URBE_Perception.load_from_checkpoint(model_ckpt, strict=False, device = "cuda" if torch.cuda.is_available() else "cpu")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

fp16 = True
if fp16:
   model = model.half()
   
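apply_pruning = True # separate flag, so that the imported `prune` module does not overwrite it
amount = 0.5
if apply_pruning:
   import torch.nn.utils.prune as prune
   for name, m in model.named_modules():
      if isinstance(m, nn.Conv2d):
         prune.l1_unstructured(m, name='weight', amount=amount)
         prune.remove(m, 'weight') # make the pruning permanent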

# if we want to test without training before we need to setup the data
trained = True
if not trained:
    hparams = asdict(Hparams())
    data = URBE_DataModule(hparams)
    data.setup()

evaluate_performance(model, data, device, fp16)

### Visualization

In [None]:
# define which model we want to test
hparams = asdict(Hparams())
model = URBE_Perception(hparams)
best_ckpt = "models/yolov5m_decoupled-epoch=27-map_50=0.4074.ckpt"
model = URBE_Perception.load_from_checkpoint(best_ckpt, strict=False, device = "cuda" if torch.cuda.is_available() else "cpu")
model.hparams.nms_iou_thresh = 0.15
model.hparams.conf_threshold = 0.65

In [None]:
COLORS = np.array([
                    [173, 255, 47],
                    [186, 85, 211],
                    [255, 215, 0]
                  ])

def vis(img, boxes, scores, cls_ids, class_names={0 : "vehicle", 1 : "person", 2 : "motorbike"}):

    for i in range(len(boxes)):
        box = boxes[i]
        cls_id = int(cls_ids[i])
        score = scores[i]
        
        x0 = int(box[0]*(1280/640))
        y0 = int(box[1]*(720/640))
        x1 = int(box[2]*(1280/640))
        y1 = int(box[3]*(720/640))

        color = (COLORS[cls_id]).astype(np.uint8).tolist()
        text = '{} : {:.1f}'.format(class_names[cls_id], score * 100)
        txt_color = (0, 0, 0)
        font = cv2.FONT_HERSHEY_SIMPLEX
        
        txt_size = cv2.getTextSize(text, font, 0.4, 1)[0]
        cv2.rectangle(img, (x0, y0), (x1, y1), color, 2)

        txt_bk_color = (COLORS[cls_id] * 0.7).astype(np.uint8).tolist()
        cv2.rectangle(
                        img,
                        (x0, y0 + 1),
                        (x0 + txt_size[0] + 1, y0 + int(1.5*txt_size[1])),
                        txt_bk_color,
                        -1
                     )
        cv2.putText(img, text, (x0, y0 + txt_size[1]), font, 0.4, txt_color, thickness=1)

    return img
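
The scale factors in `vis` are hard-coded for a 640x640 network input and a 1280x720 frame. As a hedged sketch, the same mapping can be written for any frame size; `rescale_box` is a hypothetical helper that is not part of the notebook, and it assumes the box is given as `(x0, y0, x1, y1)` on a square input of side `net_size`.

```python
def rescale_box(box, net_size, frame_width, frame_height):
    """Map (x0, y0, x1, y1) from the square network input to frame pixels."""
    sx = frame_width / net_size   # horizontal scale factor
    sy = frame_height / net_size  # vertical scale factor
    x0, y0, x1, y1 = box
    return int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy)


# same numbers as the hard-coded version: 640x640 input, 1280x720 frame
print(rescale_box((320, 320, 640, 640), 640, 1280, 720))  # (640, 360, 1280, 720)
```

With the `width` and `height` stored in `img_info`, a helper like this would let the visualization work on videos of a different resolution.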

In [None]:
def visualize_frame(output, img_info):
    img = img_info["raw_img"]
    if output is None: # no predictions for the frame
        return img
    
    output = output.cpu()
    bboxes = output[:, 0:4]
    cls = output[:, 5].int()
    scores = output[:, 4].float()

    vis_res = vis(img, bboxes, scores, cls)
    return vis_res

In [None]:
def make_inference(model, img):
    img_info = {}
    if isinstance(img, str):
        img_info["file_name"] = os.path.basename(img)
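        img = Image.fromarray(cv2.imread(img)) # PIL Image (keeps OpenCV's BGR channel order, like the video frames)
    else:
        img_info["file_name"] = None
    
    width, height = img.size # PIL reports (width, height)
    img_info["height"] = height # 720
    img_info["width"] = width # 1280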
    img_info["raw_img"] = img # PIL Image

    transform = transforms.Compose([
                                    transforms.Resize((model.hparams.img_size, model.hparams.img_size)),
                                    transforms.ToTensor()
                                   ])
    img = transform(img).unsqueeze(0)
    img = img.float()
    img = img.to(model.device)

    with torch.no_grad():
        #### FORWARD PHASE ####
        outputs = model(img)
        pred_boxes = model.cells_to_bboxes(outputs, model.head.anchors, model.head.stride, model.device, is_pred=True)
        _, _, pred_boxes = model.non_max_suppression(pred_boxes, iou_threshold=model.hparams.nms_iou_thresh, threshold=model.hparams.conf_threshold, max_detections=20, is_pred=True, filenames=["frame"])
        
        if pred_boxes[0].numel() == 0: # the model has not predicted any bboxes
            outputs = [None]
        else:
            outputs = torch.cat((pred_boxes[0][..., 2:], pred_boxes[0][..., 1:2], pred_boxes[0][..., 0:1],), dim=-1).unsqueeze(0)
        #### ------------- ####
    return outputs, img_info
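
The `torch.cat` call in `make_inference` only reorders the columns of each detection. The sketch below reproduces that reordering on plain Python lists. It assumes that each row coming out of `non_max_suppression` is laid out as `[class, score, x0, y0, x1, y1]`; this layout is inferred from how `visualize_frame` indexes the result and is not confirmed elsewhere. `to_vis_layout` and the sample detections are hypothetical.

```python
def to_vis_layout(row):
    """[cls, score, x0, y0, x1, y1] -> [x0, y0, x1, y1, score, cls]."""
    return row[2:] + row[1:2] + row[0:1]


# two made-up detections in the assumed NMS layout
detections = [
    [0, 0.91, 100.0, 200.0, 300.0, 400.0],
    [2, 0.72, 10.0, 20.0, 30.0, 40.0],
]
reordered = [to_vis_layout(r) for r in detections]

for r in reordered:
    box, score, cls_id = r[0:4], r[4], r[5]  # same slices used by visualize_frame
    print(box, score, cls_id)
```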

In [None]:
save_results = True
cap = cv2.VideoCapture("video/Streets_of_Rome.mp4")

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

if save_results:
    video_writer = cv2.VideoWriter("video/ris.mp4", cv2.VideoWriter_fourcc("m","p","4","v"), 30, (1280, 720)) # output video: 30 fps, 1280x720 frames
# process the video frame by frame
while True:
    ret_val, frame = cap.read()
    if ret_val:
        # convert from np.array to PIL Image for inference
        # (OpenCV frames are BGR; the channel order is left untouched here)
        img = Image.fromarray(frame) # PIL Image
        outputs, img_info = make_inference(model, img)
        # the boxes are drawn on the original np.array, not on the PIL Image
        img_info["raw_img"] = frame # np.array
        result_frame = visualize_frame(outputs[0], img_info)
        if save_results:
            video_writer.write(result_frame)
        else:
            cv2.namedWindow("Urbe Perception", cv2.WINDOW_NORMAL)
            cv2.imshow("Urbe Perception", result_frame)
            ch = cv2.waitKey(30) # wait 30 milliseconds per frame --> at most ~33 FPS
            if ch == 27 or ch == ord("q") or ch == ord("Q"):
                break
    else:
        break

cap.release()
if save_results:
    video_writer.release()
cv2.destroyAllWindows()

## Results

**Custom models**

 <table>
  <tr>
    <th><center> Model </center></th>
    <th><center>  </center></th>
    <th><center> mAP<br><t style="font-size:12px;">50</t> </center></th>
    <th><center> mAP<br><t style="font-size:12px;">50-95</t> </center></th>
    <th><center> Speed<br><t style="font-size:12px;">RTX3060 b1</t> <br><t style="font-size:12px;">(ms)</t> </center></th>
    <th><center> Speed<br><t style="font-size:12px;">RTX3060 b16</t> <br><t style="font-size:12px;">(ms)</t> </center></th>
    <th><center> params <br><t style="font-size:12px;">(M)</t> </center></th>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_m</b><br>simple head</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center>0.416<br><b>0.417</b><br>0.401<br>0.305</center></td>
    <td><center><b>0.206</b><br>0.205<br>0.197<br>0.150</center></td>
    <td><center>17.448 <i>~ 57 FPS<br><b>17.081 <i>~ 58 FPS</b><br>17.211 <i>~ 58 FPS<br>16.909 <i>~ 59 FPS</center></td>
    <td><center>264.478<br><b>194.542</b><br>197.870<br>195.424</center></td>
    <td><center><t style="font-size:18px;">20.87</t></center></td>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_m</b><br>decoupled head</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.473</b><br>0.472<br>0.448<br>0.347</center></td>
    <td><center>0.248<br><b>0.251</b><br>0.230<br>0.149</center></td>
    <td><center>27.580 <i>~ 36 FPS<br><b>26.122 <i>~ 38 FPS</b><br>26.772 <i>~ 37 FPS<br>26.323 <i>~ 38 FPS</center></td>
    <td><center>404.726<br>290.056<br><b>268.198</b><br>293.334</center></td>
    <td><center><t style="font-size:18px;">25.1</t></center></td>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_n</b><br>simple head</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.338</b><br><b>0.338</b><br>0.213<br>0.002</center></td>
    <td><center><b>0.151</b><br>0.150<br>0.096<br>0.001</center></td>
    <td><center><b>16.269 <i>~ 61 FPS</b><br>18.148 <i>~ 55 FPS<br>18.300 <i>~ 55 FPS<br>17.848 <i>~ 56 FPS</center></td>
    <td><center>99.638<br>72.208<br>74.088<br><b>70.801</b></center></td>
    <td><center><t style="font-size:18px;">2.33</t></center></td>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_n</b><br>decoupled head</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.353</b><br>0.349<br>0.250<br>0.006</center></td>
    <td><center><b>0.165</b><br><b>0.165</b><br>0.102<br>0.002</center></td>
    <td><center>21.629 <i>~ 46 FPS<br>23.407 <i>~ 43 FPS<br><b>20.935 <i>~ 48 FPS</b><br>21.515 <i>~ 46 FPS</center></td>
    <td><center>119.691<br>89.465<br><b>82.417</b><br>90.662</center></td>
    <td><center><t style="font-size:18px;">2.8</t></center></td>
  </tr>
</table>

**Finetuned official models**

<table>
  <tr>
    <th><center> Model </center></th>
    <th><center>  </center></th>
    <th><center> mAP<br><t style="font-size:12px;">50</t> </center></th>
    <th><center> mAP<br><t style="font-size:12px;">50-95</t> </center></th>
    <th><center> Speed<br><t style="font-size:12px;">RTX3060 b1</t> <br><t style="font-size:12px;">(ms)</t> </center></th>
    <th><center> Speed<br><t style="font-size:12px;">RTX3060 b16</t> <br><t style="font-size:12px;">(ms)</t> </center></th>
    <th><center> params <br><t style="font-size:12px;">(M)</t> </center></th>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_m</b><br>(overfit)</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.707</b><br>0.706<br>0.642<br>0.346</center></td>
    <td><center><b>0.376</b><br><b>0.376</b><br>0.334<br>0.164</center></td>
    <td><center>17.057 <i>~ 58 FPS<br><b>14.213 <i>~ 70 FPS</b><br>14.229 <i>~ 70 FPS<br>14.714 <i>~ 68 FPS</center></td>
    <td><center>242.669<br>145.945<br><b>143.637</b><br>150.257</center></td>
    <td><center><t style="font-size:18px;">21.17</t></center></td>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_m</b><br>(best)</center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.556</b><br><b>0.556</b><br>0.539<br>0.320</center></td>
    <td><center>0.275<br><b>0.276</b><br>0.261<br>0.137</center></td>
    <td><center>*</center></td>
    <td><center>*</center></td>
    <td><center>*</center></td>
  </tr>
  <tr>
    <td><center><i><b>YOLOv5_n</b></center></td>
    <td> <i>base<br>fp16<br>0.3 pruning<br>0.5 pruning</td>
    <td><center><b>0.469</b><br><b>0.469</b><br>0.293<br>0.063</center></td>
    <td><center><b>0.225</b><br><b>0.225</b><br>0.133<br>0.020</center></td>
    <td><center><b>7.559 <i>~ 132 FPS</b><br>8.528 <i>~ 117 FPS<br>8.623 <i>~ 116 FPS<br>9.214 <i>~ 108 FPS</center></td>
    <td><center>50.021<br><b>33.307</b><br>34.161<br>33.778</center></td>
    <td><center><t style="font-size:18px;">1.86</t></center></td>
  </tr>
</table>
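
The approximate FPS figures next to the batch-1 latencies follow from a simple conversion, sketched below. `fps_from_latency` is a hypothetical helper. For the batch-16 column, the sketch assumes that the reported time covers the whole batch of 16 images, which the tables do not state explicitly.

```python
def fps_from_latency(latency_ms, batch_size=1):
    """Images per second, given the time in ms needed to process one batch."""
    return batch_size * 1000.0 / latency_ms


# batch-1 latencies taken from the tables above
print(round(fps_from_latency(17.448)))  # 57  (custom YOLOv5_m, simple head, base)
print(round(fps_from_latency(27.580)))  # 36  (custom YOLOv5_m, decoupled head, base)
print(round(fps_from_latency(7.559)))   # 132 (finetuned YOLOv5_n, base)

# batch-16 throughput, ASSUMING 264.478 ms is the time for the whole batch
print(fps_from_latency(264.478, batch_size=16))
```

Under that assumption, batching raises the throughput of the custom YOLOv5_m with simple head only from about 57 to about 60 images per second.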