<a href="https://colab.research.google.com/github/peace-and-harmony/image-matting/blob/main/notebooks/modnet_demo_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MODNet benchmark demo

This demo compares the inference runtime of MODNet under PyTorch and ONNX Runtime.

**Note:** Use a CPU runtime as the baseline for this benchmark:
- Runtime -> Change runtime type -> Hardware accelerator -> None

 ---

The table lists each model file, its inference backend, and the mean runtime per image measured in this notebook:

Model file | Inference backend | Mean runtime per image (ms)
---|---|---
checkpoint.pth | PyTorch | 1208.5716
modnet-cpu.onnx | ONNX Runtime | 834.5359
modnet-cpu-simplified.onnx | ONNX Runtime | 824.7673
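
To make the comparison easier to read, the measured runtimes can be turned into speedup factors relative to PyTorch. The snippet below is a minimal sketch that uses only the three numbers from the table; the dictionary keys are illustrative labels, not file names.

```python
# Mean runtime per image in milliseconds, copied from the table above.
runtimes_ms = {
    "PyTorch": 1208.5716,
    "ONNX Runtime": 834.5359,
    "ONNX Runtime (simplified)": 824.7673,
}

baseline = runtimes_ms["PyTorch"]

# Speedup = baseline time / backend time; values above 1 mean faster than PyTorch.
speedups = {name: baseline / ms for name, ms in runtimes_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

By these numbers, ONNX Runtime is roughly 1.45 times faster than PyTorch on CPU, and the simplified model improves on the unsimplified ONNX model by about 1%.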

---

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Load the test dataset for inference runtime benchmark

In [None]:
%cd /content
!ls /content/drive/MyDrive/Cropper
!cp /content/drive/MyDrive/Cropper/cropper_validation.zip /content
!unzip cropper_validation.zip

In [None]:
%cd /content/valid_validation
!mkdir test_run
!cp -r image test_run

/content/valid_validation


In [None]:
import glob
len(glob.glob('/content/valid_validation/mask/*.jpeg'))

1127

## Clone MODNet repository

In [None]:
%cd /content
!git clone https://github.com/ZHKKKe/MODNet.git

/content
Cloning into 'MODNet'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 249 (delta 21), reused 24 (delta 9), pack-reused 206
Receiving objects: 100% (249/249), 60.76 MiB | 32.56 MiB/s, done.
Resolving deltas: 100% (82/82), done.


In [None]:
import os
# download the pre-trained checkpoint.pth for image matting
pretrained_pth = '/content/MODNet/pretrained/checkpoint.pth'
if not os.path.exists(pretrained_pth):
  !gdown --id 1-5PaqUxnZdJil9tKETllhVE6T9uH1O36 \
          -O /content/MODNet/pretrained/checkpoint.pth

Downloading...
From: https://drive.google.com/uc?id=1-5PaqUxnZdJil9tKETllhVE6T9uH1O36
To: /content/MODNet/pretrained/checkpoint.pth
52.3MB [00:00, 143MB/s]


## Generate .onnx


### Install the requirements for converting the pretrained MODNet model to .onnx format

In [None]:
%cd /content/MODNet
!pip install -r onnx/requirements.txt

### Export to CPU-based .onnx

In [None]:
%ls /content/MODNet/pretrained/

checkpoint.pth  README.md


In [None]:
%cd /content/MODNet
import os
import torch
import torch.nn as nn
from torch.autograd import Variable

from MODNet.src.models.modnet import MODNet
from MODNet.onnx import modnet_onnx


# general input
input_name = '/content/MODNet/pretrained/checkpoint.pth'
# check that the checkpoint file exists
if not os.path.exists(input_name):
    raise FileNotFoundError('Cannot find checkpoint path: {0}'.format(input_name))

# define model & load checkpoint
modnet = modnet_onnx.MODNet(backbone_pretrained=False)

# prepare dummy_input
batch_size = 1
height = 512
width = 512

# dummy_input: input tensor x. The values in this can be random as long as it is the right type and size
if torch.cuda.is_available():
  device = torch.device('cuda')
  print('using gpu!')
  dummy_input = Variable(torch.randn(batch_size, 3, height, width)).cuda()
  modnet = nn.DataParallel(modnet).cuda()

else:
  device = torch.device('cpu')
  print('using cpu')
  dummy_input = Variable(torch.randn(batch_size, 3, height, width))
  modnet = nn.DataParallel(modnet)

state_dict = torch.load(input_name, map_location=device)
modnet.load_state_dict(state_dict['state_dict'])
modnet.eval() # set the model to inference mode

if torch.cuda.is_available():
  output_name = '/content/MODNet/pretrained/modnet-gpu.onnx'
else:
  output_name = '/content/MODNet/pretrained/modnet-cpu.onnx'


# export to onnx model
torch.onnx.export(
    modnet.module, dummy_input, output_name, export_params = True, 
    input_names = ['input'], output_names = ['output'], 
    dynamic_axes = {'input': {0:'batch_size', 2:'height', 3:'width'}, 'output': {0: 'batch_size', 2: 'height', 3: 'width'}}, opset_version=12)

%ls /content/MODNet/pretrained/

/content/MODNet
using cpu




checkpoint.pth  modnet-cpu.onnx  README.md


An .onnx file is a binary protobuf file that contains both the network structure and the parameters of the model.

If you generate modnet-gpu.onnx, save it to Google Drive for later TensorRT inference; see the other notebooks in this repository for details.

In [None]:
# !cp /content/MODNet/pretrained/modnet-gpu.onnx /content/drive/MyDrive/

### Output match test between PyTorch and ONNX Runtime
  - The PyTorch and ONNX Runtime outputs match numerically within the given tolerance (rtol=1e-03 and atol=1e-05).
  - A passing check indicates that the ONNX export is correct.
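
For reference, `np.testing.assert_allclose(actual, desired, rtol, atol)` passes when every element satisfies `|actual - desired| <= atol + rtol * |desired|`. The sketch below checks that criterion by hand on small made-up arrays; `torch_like` and `ort_like` are illustrative stand-ins, not real model outputs.

```python
import numpy as np

rtol, atol = 1e-03, 1e-05

# Made-up stand-ins: ort_like plays the "desired" role, torch_like the "actual" role.
ort_like = np.array([0.5, 1.0, 0.0])
torch_like = ort_like + np.array([4e-4, -9e-4, 5e-6])

# Element-wise criterion used by np.testing.assert_allclose:
#   |actual - desired| <= atol + rtol * |desired|
within_tolerance = np.abs(torch_like - ort_like) <= atol + rtol * np.abs(ort_like)
print(within_tolerance)

# Raises AssertionError if any element falls outside the tolerance.
np.testing.assert_allclose(torch_like, ort_like, rtol=rtol, atol=atol)
```

Note that for values near zero the absolute tolerance `atol` dominates, while for larger values the relative term `rtol * |desired|` sets the bound.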

In [None]:
import onnxruntime
import numpy as np


# Input to the model
x = torch.randn(1, 3, 512, 512, requires_grad=True)
torch_out = modnet(x)

ort_session = onnxruntime.InferenceSession("/content/MODNet/pretrained/modnet-cpu.onnx")

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

# compute ONNX Runtime output prediction
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)

# compare ONNX Runtime and PyTorch results
np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)

print("Exported model has been tested with ONNXRuntime, and the result looks good!")



Exported model has been tested with ONNXRuntime, and the result looks good!


## ONNX Runtime inference

Use the converted modnet-cpu.onnx for inference.

In [None]:
import glob
import os

%cd /content/MODNet/
img_name_list = glob.glob(os.path.join('/content/valid_validation/image', '*'))

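# torchvision's ImageFolder expects images inside a class subdirectory,
# so copy them into image/test_run
!mkdir -p /content/valid_validation/image/test_run
!cp /content/valid_validation/image/* /content/valid_validation/image/test_run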

/content/MODNet


In [None]:
import torchvision
from torchvision import transforms

val_data = "/content/valid_validation/image"

TRANSFORM_IMG = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

val = torchvision.datasets.ImageFolder(val_data, transform=TRANSFORM_IMG)
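
The transform above resizes each image to 512x512, scales pixel values to [0, 1], and then normalizes them to [-1, 1]. As a rough sketch, the scaling and layout steps can be reproduced with NumPy alone, which may be handy when feeding ONNX Runtime without torchvision. The `preprocess` helper is hypothetical, and it assumes the image has already been resized.

```python
import numpy as np

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to a 1x3xHxW float32 array in [-1, 1]."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # like ToTensor: scale to [0, 1]
    x = (x - 0.5) / 0.5                              # like Normalize(mean=0.5, std=0.5)
    x = np.transpose(x, (2, 0, 1))                   # HWC -> CHW
    return x[np.newaxis, ...]                        # add the batch dimension

# Synthetic stand-in for an image that was already resized to 512x512.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
onnx_input = preprocess(image)
print(onnx_input.shape, onnx_input.dtype, onnx_input.min(), onnx_input.max())
```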

In [None]:
# ONNX Runtime Inference

from torch.utils.data import Dataset, DataLoader

import onnxruntime as rt  
import time
from tqdm.notebook import tqdm

n_runs = 50

sess_options = rt.SessionOptions()

sess_options.intra_op_num_threads = 4
sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

session = rt.InferenceSession('/content/MODNet/pretrained/modnet-cpu.onnx', sess_options=sess_options)

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

dynamic = False
if dynamic:
  bsize = (1,2,4,8,16,32,64)
else:
  bsize = (1,)

start_full = time.time()
for batch_size in bsize:
    runtimes = []
    for _ in tqdm(range(n_runs)):
        dataloader = DataLoader(dataset=val, batch_size=batch_size, shuffle=True, num_workers=2)
        batch = next(iter(dataloader))
        batch = tuple(t.to('cpu') for t in batch)

        start = time.time()
        pred = session.run([output_name], {input_name: batch[0].numpy()})[0]
        end = time.time()
        runtimes.append((end-start)*1000)

    print(f"inference cost for batch_size {batch_size}: {round(sum(runtimes)/len(runtimes), 4)} ms")

end_full = time.time()
overall_cost = (end_full - start_full)
print(f"overall inference execution cost: {round(overall_cost, 4)} seconds")

  0%|          | 0/50 [00:00<?, ?it/s]

inference cost for batch_size 1: 834.5359 ms
overall inference execution cost: 52.8482 seconds


## PyTorch comparison
Use the pretrained checkpoint.pth with PyTorch for inference.




In [None]:
import time
from tqdm.notebook import tqdm

dynamic = False

# load MODNet and pretrained checkpoint
input_name = '/content/MODNet/pretrained/checkpoint.pth'
# check that the checkpoint file exists
if not os.path.exists(input_name):
    raise FileNotFoundError('Cannot find checkpoint path: {0}'.format(input_name))

# define model & load checkpoint
modnet = modnet_onnx.MODNet(backbone_pretrained=False)

if torch.cuda.is_available():
  device = torch.device('cuda')
  print('using gpu!')
  modnet = nn.DataParallel(modnet).cuda()

else:
  device = torch.device('cpu')
  print('using cpu')
  modnet = nn.DataParallel(modnet)

state_dict = torch.load(input_name, map_location=device)
modnet.load_state_dict(state_dict['state_dict'])
modnet.eval() # set the model to inference mode

if dynamic:
  bsize = (1,2,4,8,16,32,64)
else:
  bsize = (1,)

n_runs = 50

start_full = time.time()
for batch_size in bsize:
    runtimes = []
    for _ in tqdm(range(n_runs)):
        dataloader = DataLoader(dataset=val, batch_size=batch_size, shuffle=True, num_workers=2)
        batch = next(iter(dataloader))
        batch = tuple(t.to('cpu') for t in batch)

        start = time.time()
        matte = modnet(batch[0])
        end = time.time()
        runtimes.append((end-start)*1000)

    print(f"inference cost for batch_size {batch_size}: {round(sum(runtimes)/len(runtimes), 4)}ms")

end_full = time.time()
overall_cost = (end_full - start_full)
print(f"overall inference execution cost: {round(overall_cost, 4)} seconds")

using cpu


  0%|          | 0/50 [00:00<?, ?it/s]



inference cost for batch_size 1: 1208.5716ms
overall inference execution cost: 74.6324 seconds
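
Both timing loops above use `time.time()` and include the first call in the average. A possible refinement, sketched below under those assumptions, is a small helper that performs a few untimed warm-up calls, times with `time.perf_counter()`, and reports the median alongside the mean. The `benchmark` function and its parameters are hypothetical and not part of MODNet or ONNX Runtime.

```python
import statistics
import time

def benchmark(fn, n_runs=50, n_warmup=5):
    """Call fn() n_warmup times untimed, then return (mean_ms, median_ms) over n_runs."""
    for _ in range(n_warmup):
        fn()
    runtimes_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        runtimes_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(runtimes_ms), statistics.median(runtimes_ms)

# Stand-in workload; in the notebook this would be a closure around
# session.run(...) or modnet(...).
mean_ms, median_ms = benchmark(lambda: sum(range(100_000)), n_runs=20, n_warmup=2)
print(f"mean: {mean_ms:.4f} ms, median: {median_ms:.4f} ms")
```

The PyTorch loop also runs with autograd tracking enabled; wrapping the model call in `torch.no_grad()` may reduce its overhead and make the comparison with ONNX Runtime fairer.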


## ONNX Simplifier
ONNX Simplifier replaces redundant operators with their constant outputs to simplify the ONNX model.

In [None]:
%cd /content/
!pip3 install -U pip && pip3 install onnx-simplifier

In [None]:
!python3 -m onnxsim /content/MODNet/pretrained/modnet-cpu.onnx /content/MODNet/pretrained/modnet-cpu-simplified.onnx --input-shape 1,3,512,512

Simplifying...
Note: The input shape of the simplified model will be overwritten by the value of '--input-shape' argument. Pass '--dynamic-input-shape' if it is not what you want. Run 'python3 -m onnxsim -h' for details.
Checking 0/3...
Checking 1/3...
Checking 2/3...
Ok!


In [None]:
!du /content/MODNet/pretrained/modnet-cpu.onnx

25284	/content/MODNet/pretrained/modnet-cpu.onnx


In [None]:
!du /content/MODNet/pretrained/modnet-cpu-simplified.onnx

25276	/content/MODNet/pretrained/modnet-cpu-simplified.onnx
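
The two `du` results above can be compared directly. The short calculation below assumes `du` reports sizes in 1 KiB blocks, which is the usual GNU default.

```python
# Sizes reported by `du` above, assumed to be in 1 KiB blocks.
original_kib = 25284
simplified_kib = 25276

saved_kib = original_kib - simplified_kib
saved_pct = 100 * saved_kib / original_kib
print(f"saved {saved_kib} KiB ({saved_pct:.3f}% of the original size)")
```

The file shrinks by only a few KiB, which is consistent with the modest runtime difference between the two ONNX models.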


In [None]:
# ONNX Runtime Inference for simplified: modnet-cpu-simplified.onnx

from torch.utils.data import Dataset, DataLoader

import onnxruntime as rt  
import time
from tqdm.notebook import tqdm

n_runs = 50

sess_options = rt.SessionOptions()

sess_options.intra_op_num_threads = 4
sess_options.execution_mode = rt.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

session = rt.InferenceSession('/content/MODNet/pretrained/modnet-cpu-simplified.onnx', sess_options=sess_options)

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

dynamic = False
if dynamic:
  bsize = (1,2,4,8,16,32,64)
else:
  bsize = (1,)

start_full = time.time()
for batch_size in bsize:
    runtimes = []
    for _ in tqdm(range(n_runs)):
        dataloader = DataLoader(dataset=val, batch_size=batch_size, shuffle=True, num_workers=2)
        batch = next(iter(dataloader))
        batch = tuple(t.to('cpu') for t in batch)

        start = time.time()
        pred = session.run([output_name], {input_name: batch[0].numpy()})[0]
        end = time.time()
        runtimes.append((end-start)*1000)

        # print(pred.shape)
    print(f"inference cost for batch_size {batch_size}: {round(sum(runtimes)/len(runtimes), 4)} ms")

end_full = time.time()
overall_cost = (end_full - start_full)
print(f"overall inference execution cost: {round(overall_cost, 4)} seconds")

  0%|          | 0/50 [00:00<?, ?it/s]

inference cost for batch_size 1: 824.7673 ms
overall inference execution cost: 54.7681 seconds


## Model Visualization via Netron

In [None]:
!pip install -q netron

[?25l[K     |▎                               | 10 kB 19.7 MB/s eta 0:00:01[K     |▌                               | 20 kB 12.5 MB/s eta 0:00:01[K     |▊                               | 30 kB 9.6 MB/s eta 0:00:01[K     |█                               | 40 kB 8.5 MB/s eta 0:00:01[K     |█▎                              | 51 kB 5.0 MB/s eta 0:00:01[K     |█▌                              | 61 kB 5.6 MB/s eta 0:00:01[K     |█▊                              | 71 kB 5.6 MB/s eta 0:00:01[K     |██                              | 81 kB 6.2 MB/s eta 0:00:01[K     |██▎                             | 92 kB 4.7 MB/s eta 0:00:01[K     |██▌                             | 102 kB 5.1 MB/s eta 0:00:01[K     |██▊                             | 112 kB 5.1 MB/s eta 0:00:01[K     |███                             | 122 kB 5.1 MB/s eta 0:00:01[K     |███▎                            | 133 kB 5.1 MB/s eta 0:00:01[K     |███▌                            | 143 kB 5.1 MB/s eta 0:00:01[K   

In [None]:
import netron
import portpicker
from google.colab import output

port = portpicker.pick_unused_port()

# Read the model file and start the netron browser.
with output.temporary():
  netron.start('/content/MODNet/pretrained/modnet-cpu.onnx', address=port, browse=True)

output.serve_kernel_port_as_iframe(port, height='800')


In [None]:
import netron
import portpicker
from google.colab import output

port = portpicker.pick_unused_port()

# Read the model file and start the netron browser.
with output.temporary():
  netron.start('/content/MODNet/pretrained/modnet-cpu-simplified.onnx', address=port, browse=True)

output.serve_kernel_port_as_iframe(port, height='800')
