![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate PyTorch YOLO with Speedster



Hi and welcome 👋

In this notebook, we will see how to speed up deep learning inference in just a few steps using Speedster, an app from the open-source library nebullvm.

With Speedster's latest API, you can speed up models up to 10 times without any loss of accuracy (option A), or up to 20-30 times by specifying how much accuracy/precision you are willing to trade for a lower response time (option B). To accelerate your model, Speedster combines several optimization techniques: deep learning compilers (in both options), plus quantization, half precision, and similar techniques (option B only).

Let's jump to the code.

In [None]:
%env CUDA_VISIBLE_DEVICES=0

### Install Speedster

Install Speedster:

In [None]:
!pip install speedster

Install deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer  --backends torch-full --compilers all
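After the installers finish, it can be useful to confirm which packages are actually importable in the current environment. The snippet below is a minimal sketch that uses only the standard library; the list of module names is an assumption for illustration, since the compilers that end up installed depend on your platform and hardware.

```python
import importlib.util

# Import names to probe. This list is an assumption for illustration:
# which compilers are installed depends on your platform and hardware.
CANDIDATE_MODULES = ["torch", "speedster", "nebullvm", "onnxruntime", "tensorrt", "openvino", "tvm"]


def installed_modules(names):
    """Return a {module_name: bool} map telling which modules can be found."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised packages
            status[name] = False
    return status


for name, found in installed_modules(CANDIDATE_MODULES).items():
    print(f"{name:12s} {'found' if found else 'missing'}")
```

A missing entry is not necessarily a problem: a compiler that is not available on your machine is simply one fewer candidate to try.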

### Install and test YOLO

Let's install the YOLOv5 requirements.

In [None]:
! pip install -r https://raw.githubusercontent.com/ultralytics/yolov5/master/requirements.txt

We start by downloading the model from PyTorch Hub.

In [None]:
import copy
import time
import types

import torch

In [None]:
# Load Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True, force_reload=True)

# Images
imgs = ['https://ultralytics.com/images/zidane.jpg']  # batch of images

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

## Optimization with Speedster

In [None]:
from speedster import optimize_model, save_model, load_model

First, we need to slightly modify YOLO's forward method. 

The last layer of the YOLOv5 implementation can create problems on certain hardware for some deep learning compilers that run on the Speedster core. Since Speedster aims to be hardware agnostic, we circumvent any potential obstacles by splitting the network body from the head (the last layer) and optimizing only the body.

In [None]:
core_model = copy.deepcopy(model.model.model)

In [None]:
def _forward_once(self, x, profile=False, visualize=False):
    y, dt = [], []  # outputs
    for m in self.model:
        if m.f != -1:  # if not from previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]  # from earlier layers
        if profile:
            self._profile_one_layer(m, x, dt)
        x = m(x)  # run
        y.append(x if m.i in self.save else None)  # save output
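        if visualize:
            # feature_visualization is defined in the YOLOv5 repo and is not imported in this notebook,
            # so keep visualize=False
            feature_visualization(x, m.type, m.i, save_dir=visualize)
    self.last_y = y  # keep the intermediate outputs so the head can read them later
    return x

core_model._forward_once = types.MethodType(_forward_once, core_model)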

We reimplement the forward method because we need to store the intermediate outputs `y`, so that the head receives the right tensors as input.
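To make the routing rule concrete, here is a toy, pure-Python sketch of the same loop. It runs on plain numbers instead of tensors, and `ToyLayer`, `forward_once`, and the layer indices are hypothetical names and values chosen for illustration; only the `f` / `i` / `save` logic mirrors `_forward_once` above.

```python
class ToyLayer:
    """Stand-in for a YOLOv5 module: `i` is its index, `f` says where its input comes from."""

    def __init__(self, i, f, fn):
        self.i, self.f, self.fn = i, f, fn

    def __call__(self, x):
        return self.fn(x)


def forward_once(layers, save, x):
    """Run the layers in order, keeping only the outputs whose index is in `save`."""
    y = []
    for m in layers:
        if m.f != -1:  # input does not come from the previous layer
            x = y[m.f] if isinstance(m.f, int) else [x if j == -1 else y[j] for j in m.f]
        x = m(x)
        y.append(x if m.i in save else None)
    return x, y


# Toy body with three layers.
body = [
    ToyLayer(0, -1, lambda x: x + 1),
    ToyLayer(1, -1, lambda x: x * 2),
    ToyLayer(2, [-1, 0], lambda xs: sum(xs)),  # combines the previous output with layer 0's output
]
save = {0, 2}        # indices that some later layer (or the head) needs
head_inputs = [0, 2]  # hypothetical: the head reads the outputs of layers 0 and 2

x, last_y = forward_once(body, save, 3)
head_args = tuple(x if j == -1 else last_y[j] for j in head_inputs)
print(last_y)     # [4, None, 12]
print(head_args)  # (4, 12)
```

Without the stored list, only the final value `12` would be available, and the head could not be fed the output of layer 0.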

In [None]:
class CoreModelWrapper(torch.nn.Module):
    def __init__(self, core_model, output_idxs):
        super().__init__()
        self.core = core_model
        self.idxs = output_idxs
        
    def forward(self, *args, **kwargs):
        x = self.core(*args, **kwargs)
        return tuple(x if j == -1 else self.core.last_y[j] for j in self.idxs)

In [None]:
list_of_layers = list(core_model.model.children())
last_layer = list_of_layers.pop(-1)

core_model.model = torch.nn.Sequential(*list_of_layers)
core_wrapper = CoreModelWrapper(core_model, last_layer.f)

Now we are ready to optimize the body of YOLOv5 using Speedster's `optimize_model` function.

Speedster was built to be very easy to use. To optimize a model, you only need to pass the model and some sample input data. In this example, we also set `optimization_time` and `metric_drop_ths`.

With the latest API, there are two ways to use Speedster:

- Option A: Accelerate the model up to ~10 times without any loss in performance (accuracy, precision, etc.)
- Option B: Accelerate the model up to ~30 times with a predefined maximum loss in performance
    
To learn more about how to use Speedster, check out the <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#-speedster" target="_blank" style="text-decoration: none;"> readme on GitHub </a>.

In this example, we provide the code to run option B.
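Before running it, the difference between the two options can be pictured as a selection problem: among all candidate optimizations, keep the fastest one whose measured metric drop stays within the allowed threshold. The sketch below only illustrates that idea. It is not Speedster's implementation, and the `Candidate` class, the technique names, and all numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical label for an optimization technique
    latency_ms: float   # measured latency (made-up numbers below)
    metric_drop: float  # measured loss relative to the original model


def pick_fastest(candidates, metric_drop_ths=None):
    """Return the fastest candidate whose metric drop is within the threshold.

    With no threshold (option A), only loss-free candidates qualify.
    """
    limit = 0.0 if metric_drop_ths is None else metric_drop_ths
    allowed = [c for c in candidates if c.metric_drop <= limit]
    return min(allowed, key=lambda c: c.latency_ms)


candidates = [
    Candidate("original", 40.0, 0.0),
    Candidate("compiled_fp32", 12.0, 0.0),
    Candidate("compiled_fp16", 6.0, 0.01),
    Candidate("compiled_int8", 3.0, 0.4),
]

print(pick_fastest(candidates).name)                       # compiled_fp32
print(pick_fastest(candidates, metric_drop_ths=0.3).name)  # compiled_fp16
```

In this toy setup, raising the threshold above 0.4 would let the int8 candidate win, which is the trade-off option B exposes.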

In [None]:
from PIL import Image
import requests
import numpy as np

In [None]:
img_name = "zidane.png"
Image.open(requests.get(imgs[0], stream=True).raw).save(img_name)

In [None]:
def read_and_crop(im, original_model, img_size):
    # img_size is (height, width); returns a random crop of that size as a BCHW tensor
    p = next(original_model.parameters())
    im = Image.open(requests.get(im, stream=True).raw if str(im).startswith('http') else im)
    width, height = im.size  # PIL reports (width, height)
    top = np.random.choice(height - img_size[0])
    left = np.random.choice(width - img_size[1])
    im = np.array(im.crop((left, top, left + img_size[1], top + img_size[0])))
    x = np.expand_dims(im, axis=0)
    x = np.ascontiguousarray(x.transpose((0, 3, 1, 2)))  # BHWC to BCHW
    x = torch.from_numpy(x).to(p.device).type_as(p) / 255  # uint8 to fp16/32
    return x
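The cropping step in `read_and_crop` depends on two conventions that are easy to mix up: PIL's `Image.size` is `(width, height)`, and `crop` takes a `(left, upper, right, lower)` box. The following standard-library sketch computes such a box on its own, so the arithmetic can be checked without loading an image. `random_crop_box` is a hypothetical helper, and the 1280x720 image size is an assumed example value.

```python
import random


def random_crop_box(image_size, crop_hw, rng=random):
    """Return a PIL-style (left, upper, right, lower) box for a random crop.

    image_size is (width, height), as returned by PIL's Image.size;
    crop_hw is (height, width), matching the img_size argument of read_and_crop.
    """
    width, height = image_size
    crop_h, crop_w = crop_hw
    if crop_w > width or crop_h > height:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, width - crop_w)    # randint is inclusive on both ends
    upper = rng.randint(0, height - crop_h)
    return (left, upper, left + crop_w, upper + crop_h)


box = random_crop_box((1280, 720), (384, 640))
print(box, "->", box[2] - box[0], "x", box[3] - box[1])
```

Unlike `np.random.choice(n)`, which fails when `n` is 0, the inclusive `randint` also handles an image that is exactly the crop size.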

In [None]:
input_data = [((read_and_crop(img_name, core_model, (384, 640)),), None) for _ in range(500)]

In [None]:
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.3
)

Now let's put the optimized body and the head of YOLO back together.

In [None]:
class OptimizedYolo(torch.nn.Module):
    def __init__(self, optimized_core, head_layer):
        super().__init__()
        self.core = optimized_core
        self.head = head_layer
    
    def forward(self, x, *args, **kwargs):
        x = list(self.core(x)) # it's a tuple
        return self.head(x)

In [None]:
final_core = OptimizedYolo(model_optimized, last_layer)

Let's compare the original model performance with the optimized one:

In [None]:
from nebullvm.tools.benchmark import benchmark

original_model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True, force_reload=True)
original_core = original_model.model.model

print("Benchmark original model")
benchmark(original_core, input_data)

print("Benchmark optimized model")
benchmark(final_core, input_data)
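If you want to cross-check the numbers by hand, a simple timing loop is enough. The sketch below uses only the standard library and times two stand-in functions; `measure_latency`, `slow_fn`, and `fast_fn` are hypothetical names, and you would replace the stand-ins with calls to the original and optimized models.

```python
import statistics
import time


def measure_latency(fn, inputs, warmup=3):
    """Return the median wall-clock latency of fn over inputs, in milliseconds."""
    for x in inputs[:warmup]:  # warm-up runs are not timed
        fn(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in workloads that compute the same result at very different speeds.
def slow_fn(n):
    return sum(i * i for i in range(n))


def fast_fn(n):
    return (n - 1) * n * (2 * n - 1) // 6  # closed form of the same sum


inputs = [20_000] * 20
slow_ms = measure_latency(slow_fn, inputs)
fast_ms = measure_latency(fast_fn, inputs)
speedup = slow_ms / fast_ms if fast_ms > 0 else float("inf")
print(f"baseline: {slow_ms:.3f} ms, optimized: {fast_ms:.3f} ms, speedup: {speedup:.1f}x")
```

When timing models on a GPU, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously and the timer would otherwise stop too early.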

Finally, we can replace the original model with the optimized one inside the original model object, and check that it works properly by running a prediction on the sample image:

In [None]:
model.model.model = final_core
results = model(imgs)
results.show()

## Save and reload the optimized model

We can save the optimized model to disk with the following line:

In [None]:
save_model(model_optimized, "model_save_path")

We can then load the model again:

In [None]:
model_optimized = load_model("model_save_path")
final_core = OptimizedYolo(model_optimized, last_layer)
model.model.model = final_core

What an amazing result, right?!? Stay tuned for more cool content from the Nebuly team :) 

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts" target="_blank" style="text-decoration: none;"> How speedster works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start" target="_blank" style="text-decoration: none;"> Quick start </a> 
</center>