# Introduction

Welcome to the guide for speeding up your YOLO model on your specific hardware. 

The aim of this guide is to show how the nebullvm library can be successfully used to speed up your YOLO model
on whatever hardware you own (no NVIDIA GPU needed!).

This notebook was created by the team at Nebuly. For any questions about it, please contact info@nebuly.ai

The notebook has been tested with PyTorch 1.11. It may not work with earlier versions.

# Standard YOLO

Install the YOLOv5 dependencies

In [None]:
! pip install -r https://raw.githubusercontent.com/ultralytics/yolov5/master/requirements.txt

Let's start by downloading the model from PyTorch Hub and running a first inference on an image.

In [None]:
import copy
import time
import types

import torch

In [None]:
# Load Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True, force_reload=True)

# Images
imgs = ['https://ultralytics.com/images/zidane.jpg']  # batch of images

In [None]:
times = []
for _ in range(100):
    starting_time = time.time()
    # Inference
    results = model(imgs)
    times.append((time.time()-starting_time)*1000)
yolo_vanilla_time = sum(times) / len(times)
print(f"{yolo_vanilla_time} ms")

In [None]:
type(model)

In [None]:
#results.print()
results.show()

Here we are! We got a good prediction, but it took a while :) Let's see if we can speed up the model a bit without giving up prediction quality.
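A note on the measurement itself: the loop above averages every call, including the first one, and since `imgs` holds a URL each call likely also includes fetching and decoding the image. The sketch below is one possible way to get steadier numbers, assuming a hypothetical helper named `benchmark` (it is not part of nebullvm or YOLOv5) that discards a few warm-up calls and uses `time.perf_counter`.

```python
import statistics
import time


def benchmark(fn, n_warmup=10, n_runs=100):
    """Return (mean_ms, median_ms) of fn() over n_runs timed calls.

    Hypothetical helper for illustration only.
    """
    for _ in range(n_warmup):
        fn()  # warm-up calls are executed but not timed
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms), statistics.median(timings_ms)


# Stand-in workload so the sketch runs on its own
mean_ms, median_ms = benchmark(lambda: sum(range(10_000)), n_warmup=5, n_runs=20)
print(f"mean: {mean_ms:.3f} ms, median: {median_ms:.3f} ms")
```

In this notebook the call would look like `benchmark(lambda: model(imgs))`. On a CUDA device, calling `torch.cuda.synchronize()` before reading the clock may also be needed, since GPU kernels are launched asynchronously.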

## Optimization with nebullvm

In [None]:
from nebullvm import optimize_torch_model

First, we need to slightly modify the forward method of YOLO, because the last layer of the YOLOv5 implementation causes problems for some of the DL compilers at the core of nebullvm. On some hardware, nebullvm can optimize the whole network without any issue. However, since this notebook aims to be hardware agnostic, we will split the body of the network from its head (the last layer) and optimize just the body.

In [None]:
core_model = copy.deepcopy(model.model.model)

In [None]:
def _forward_once(self, x, profile=False, visualize=False):
    y, dt = [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        x = m(x)  # run
        y.append(x if m.i in self.save else None)  # save output
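        if visualize:
            # feature_visualization is not imported in this notebook (in YOLOv5 it lives in utils.plots);
            # this branch is only reached when visualize is set
            feature_visualization(x, m.type, m.i, save_dir=visualize)
    self.last_y = y  # keep the saved outputs so the wrapper can pick the head inputs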
    return x
core_model._forward_once = types.MethodType(_forward_once, core_model)

We reimplement the forward method because we need to store the intermediate outputs (`y`): the head takes several of them as input, not just the output of the last body layer.
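To make the routing logic easier to follow, here is a minimal pure-Python sketch of the same idea. `ToyLayer`, `forward_and_cache` and `select_outputs` are hypothetical names used only for illustration, and the layers are plain functions on numbers rather than real YOLOv5 modules. Unlike the forward method above, the sketch caches every output instead of only those listed in `self.save`.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToyLayer:
    fn: Callable                    # the computation performed by the layer
    f: Union[int, List[int]] = -1   # index (or indices) of the input layer(s); -1 means "previous layer"


def forward_and_cache(layers, x):
    """Run the layers in order and keep every intermediate output."""
    y = []
    for layer in layers:
        if layer.f != -1:  # input does not come (only) from the previous layer
            x = y[layer.f] if isinstance(layer.f, int) else [x if j == -1 else y[j] for j in layer.f]
        x = layer.fn(x)
        y.append(x)
    return x, y


def select_outputs(x, y, idxs):
    """Pick the tensors the head needs, mirroring CoreModelWrapper.forward."""
    return tuple(x if j == -1 else y[j] for j in idxs)


layers = [
    ToyLayer(lambda v: v + 1),              # layer 0
    ToyLayer(lambda v: v * 10),             # layer 1
    ToyLayer(lambda v: v - 3, f=0),         # layer 2 reads the output of layer 0
    ToyLayer(lambda v: sum(v), f=[-1, 1]),  # layer 3 combines the previous output with layer 1
]
last, cache = forward_and_cache(layers, 1)
head_inputs = select_outputs(last, cache, [1, 3])
print(head_inputs)
```

The body returns a tuple because the head receives several feature maps at once, which is why the cached outputs have to survive the forward pass.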

In [None]:
class CoreModelWrapper(torch.nn.Module):
    def __init__(self, core_model, output_idxs):
        super().__init__()
        self.core = core_model
        self.idxs = output_idxs
        
    def forward(self, *args, **kwargs):
        x = self.core(*args, **kwargs)
        return tuple(x if j == -1 else self.core.last_y[j] for j in self.idxs)

In [None]:
list_of_layers = list(core_model.model.children())
last_layer = list_of_layers.pop(-1)

core_model.model = torch.nn.Sequential(*list_of_layers)
core_wrapper = CoreModelWrapper(core_model, last_layer.f)

Now we are ready to optimize the body of YOLOv5 using the `nebullvm` function `optimize_torch_model`.

In [None]:
# Optimize without Quantization
# model_optimized = optimize_torch_model(
#     model=core_wrapper,
#     batch_size=1,
#     input_sizes=[(3, 384, 640)],
#     save_dir=".",
#     use_torch_api=True,
# )

In [None]:
from PIL import Image
import requests
import numpy as np

In [None]:
img_name = "zidane.png"
Image.open(requests.get(imgs[0], stream=True).raw).save(img_name)

In [None]:
def read_and_crop(im, original_model, img_size):
    # img_size is (height, width); the crop position is chosen at random
    p = next(original_model.parameters())  # reference parameter for device and dtype
    im = Image.open(requests.get(im, stream=True).raw if str(im).startswith('http') else im)
    width, height = im.size  # PIL reports (width, height)
    top = np.random.choice(height - img_size[0])
    left = np.random.choice(width - img_size[1])
    im = np.array(im.crop((left, top, left + img_size[1], top + img_size[0])))
    x = np.expand_dims(im, axis=0)
    x = np.ascontiguousarray(np.array(x).transpose((0, 3, 1, 2)))  # add batch dim, BHWC to BCHW
    x = torch.from_numpy(x).to(p.device).type_as(p) / 255  # uint8 to fp16/32
    return x

In [None]:
input_data = [((read_and_crop(img_name, core_model, (384, 640)),), None) for _ in range(500)]
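The calibration samples built above are random crops of the test image. The axis order is easy to get wrong: PIL reports `Image.size` as `(width, height)` and `Image.crop` expects `(left, upper, right, lower)`, while the crop size here is given as `(height, width)`. The sketch below isolates that geometry in a hypothetical helper, `random_crop_box`, which is not part of the notebook. It uses inclusive bounds, so a crop exactly as large as the image is allowed. The 1280x720 size is an assumed example value.

```python
import random


def random_crop_box(image_size, crop_hw, rng=random):
    """Pick a random crop box for an image.

    image_size: (width, height), the order used by PIL's Image.size
    crop_hw:    (height, width) of the crop, as in (384, 640)
    Returns (left, upper, right, lower), the order expected by PIL's Image.crop.
    """
    width, height = image_size
    crop_h, crop_w = crop_hw
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    upper = rng.randint(0, height - crop_h)  # randint bounds are inclusive
    left = rng.randint(0, width - crop_w)
    return left, upper, left + crop_w, upper + crop_h


# Assumed example: a 1280x720 image cropped to 384 (height) x 640 (width)
box = random_crop_box((1280, 720), (384, 640))
print(box)
```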

In [None]:
# Optimize with Quantization
model_optimized = optimize_torch_model(
    model=core_wrapper,
    save_dir=".",
    dataloader=input_data,
    use_torch_api=True,
    perf_loss_ths=3,
)

Now let's put the optimized body and the head of YOLO back together.

In [None]:
class OptimizedYolo(torch.nn.Module):
    def __init__(self, optimized_core, head_layer):
        super().__init__()
        self.core = optimized_core
        self.head = head_layer
    
    def forward(self, x, *args, **kwargs):
        x = list(self.core(x)) # it's a tuple
        return self.head(x)

In [None]:
final_core = OptimizedYolo(model_optimized, last_layer)

In [None]:
model.model.model = final_core

Finally we can check the speedup.

In [None]:
times = []
for _ in range(100):
    st = time.time()
    results = model(imgs)
    times.append((time.time() - st)*1000)
yolo_optimized_time = sum(times) / len(times)
print(f"{yolo_optimized_time} ms")

In [None]:
results.show()
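The two averages can be condensed into a speedup factor and a relative latency reduction. Here is a small sketch, assuming a hypothetical helper `summarize_speedup`; the 100 ms and 40 ms inputs are illustrative placeholders, not measurements.

```python
def summarize_speedup(baseline_ms, optimized_ms):
    """Return (speedup factor, latency reduction in percent)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    speedup = baseline_ms / optimized_ms
    reduction_pct = (1 - optimized_ms / baseline_ms) * 100
    return speedup, reduction_pct


# Placeholder values; in this notebook you would pass
# yolo_vanilla_time and yolo_optimized_time instead
speedup, reduction_pct = summarize_speedup(100.0, 40.0)
print(f"{speedup:.1f}x faster, {reduction_pct:.0f}% lower latency")
```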

What an amazing result, right?!? Stay tuned for more cool content from the Nebuly team :) 

## Post the result

You can now share the result you got with the whole community. Just copy the text printed below and paste it into the community main channel!

In [None]:
your_username = "Put here your username"

In [None]:
# Uncomment the following line to install gputil (if you are running on an NVIDIA GPU)
#!pip install gputil

In [None]:
import cpuinfo
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
cpu_info = cpuinfo.get_cpu_info()['brand_raw']
gpu_info = "no"
if torch.cuda.is_available():
    import GPUtil
    gpus = GPUtil.getGPUs()
    gpu_info = list(gpus)[0].name
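If you would rather not install extra packages, the sketch below is one possible alternative that relies only on the standard library and, when available, on PyTorch's own `torch.cuda.get_device_name`. The helper name `describe_hardware` is hypothetical, and `platform.processor()` usually gives a less descriptive CPU name than `cpuinfo`.

```python
import platform


def describe_hardware():
    """Return (cpu_name, gpu_name) without cpuinfo or GPUtil."""
    cpu_name = platform.processor() or platform.machine() or "unknown"
    gpu_name = "no"
    try:
        import torch  # optional here so the sketch also runs without PyTorch

        if torch.cuda.is_available():
            gpu_name = torch.cuda.get_device_name(0)  # name of the first CUDA device
    except ImportError:
        pass
    return cpu_name, gpu_name


cpu_name, gpu_name = describe_hardware()
print(f"{cpu_name} CPU and {gpu_name} GPU")
```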

In [None]:
message = f"""
Hello, I'm {your_username}!
I've tested nebullvm on the following setup:
Hardware: {cpu_info} CPU and {gpu_info} GPU.
Model: YOLOv5s
Vanilla performance: {round(yolo_vanilla_time, 2)}ms
Optimized performance: {round(yolo_optimized_time, 2)}ms
Acceleration: {round(yolo_vanilla_time/yolo_optimized_time, 1)}x
"""
print(message)