## How to analyze, benchmark, and compare models in PyTorch

To develop a new model in PyTorch, you first need to analyze existing models. I could not find a single all-in-one package for this. To analyze a model, we need to see its inputs/outputs, parameter count, FLOPs, latency, and throughput, both at the model level and at the layer level. I spent a lot of time searching, and these are the tools I decided to use for each purpose. I hope this helps.

- Want to see a model's structure in a clean view? Use `torchsummary`.
- Want to analyze a model's latency/params/FLOPs per layer? Use `deepspeed.profiling.flops_profiler.get_model_profile()`. The other libraries I looked at do not seem to cover all three of these factors.
    - `torch.profiler`: This library supports latency/params/FLOPs/memory, but at the level of CUDA kernel operations. It seems well suited to those who want to optimize a module at a lower level.
    - `fvcore.nn`: This library supports FLOPs/params only. To me, it looks like an upgraded version of `torchsummary`.

- Want a simple latency comparison against other methods? Use `torch.utils.benchmark`.
    - Per-sample latency is `1/throughput` (per-batch latency is `batch_size/throughput`), so the measured latency is affected by throughput factors such as batch size.
- Want to compare overall performance (params/FLOPs/latency/throughput)? Use `benchmark.py` in the `timm` GitHub repository.
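
To make the latency/throughput relation concrete, here is a small worked sketch. The latencies are made-up numbers for illustration (not measurements), and `throughput` is a hypothetical helper, not part of any library.

```python
def throughput(batch_size: int, latency_s: float) -> float:
    """Samples processed per second when one batch takes `latency_s` seconds."""
    return batch_size / latency_s


# Hypothetical per-batch latencies in seconds, for illustration only.
measurements = {1: 0.004, 8: 0.010, 32: 0.032}

for batch_size, latency_s in measurements.items():
    print(
        f"batch={batch_size:>2}  "
        f"latency={latency_s * 1e3:.1f} ms  "
        f"throughput={throughput(batch_size, latency_s):.0f} samples/s"
    )
```

In this toy data, a larger batch takes longer per batch but processes more samples per second, which is why latency numbers are only comparable at the same batch size.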

### Procedure to analyze a model
1. Check that the model works as you intended by using `torchsummary`.
2. Check the model's params/FLOPs/latency and find its bottleneck by using a profiler.
3. Compare the model with other baseline models.
    - Find how to ease the bottleneck by comparing different methods.
4. Finally, check your model's params/FLOPs/latency/throughput and start experiments on your dataset.
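
As one possible way to look for a bottleneck in step 2, here is a minimal sketch using `torch.profiler` on the CPU, so it runs without a GPU. The tiny `nn.Sequential` model and the input size are stand-ins chosen for illustration, not the models used in this notebook.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Toy stand-in model (an assumption for illustration only).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
).eval()
x = torch.randn(2, 3, 32, 32)

# Record CPU operator timings for a single forward pass.
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Operators sorted by total CPU time; the top rows are the bottleneck candidates.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

The table is grouped by operator rather than by layer, which is the trade-off mentioned in the tool list.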

### 1. Model Architecture 

In [1]:
import os

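# Order GPUs by PCI bus ID so the index below matches `nvidia-smi`.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
# Expose only one GPU to this process; '4' is specific to the author's machine.
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torchvision.models as models
from torchsummary import summary

model = models.resnet50().cuda()

# input_size excludes the batch dimension: (channels, height, width).
summary(model, input_size=(3, 224, 224))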

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]           4,096
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]          16,384
      BatchNorm2d-12          [-1, 256, 56, 56]             512
           Conv2d-13          [-1, 256, 56, 56]          16,384
      BatchNorm2d-14          [-1, 256,
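
The `Param #` column can be checked by hand. Here is a sketch, assuming bias-free convolutions and counting only the learnable weight and bias of each BatchNorm layer; `conv2d_params` and `batchnorm_params` are hypothetical helpers written for this check.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = False, groups: int = 1) -> int:
    """Learnable parameters of a square-kernel Conv2d."""
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)


def batchnorm_params(channels: int) -> int:
    """Learnable parameters of BatchNorm2d: one scale and one shift per channel."""
    return 2 * channels


print(conv2d_params(3, 64, 7))    # Conv2d-1:      9408
print(batchnorm_params(64))       # BatchNorm2d-2: 128
print(conv2d_params(64, 64, 1))   # Conv2d-5:      4096
print(conv2d_params(64, 64, 3))   # Conv2d-8:      36864
print(conv2d_params(64, 256, 1))  # Conv2d-11:     16384
print(batchnorm_params(256))      # BatchNorm2d-12: 512
```

If a number in the summary does not match a hand calculation like this, the layer is probably not configured the way you intended.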

### 2. Param/Flops/Latency in layer

In [8]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
import torchvision.models as models
from deepspeed.profiling.flops_profiler import get_model_profile

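model = models.resnet18().cuda()

# Note: the signature depends on the DeepSpeed version. Newer releases take
# `input_shape` instead of `input_res` and return (flops, macs, params).
macs, _ = get_model_profile(
    model=model,
    input_res=(1, 3, 224, 224),  # (batch, channels, height, width)
    print_profile=True,          # print the profile summary
    detailed=True,               # include the per-module breakdown
    warm_up=10,                  # warm-up iterations before profiling
    as_string=False,             # return numbers instead of formatted strings
    output_file=None,            # None prints to stdout
    ignore_modules=None
)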


-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating point operations (flops), floating point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per gpu:                                               11.69 M 
params of model = params per GPU * mp_size:                   1       
fwd MACs per GPU:                                             182.22 G
fwd flops per GPU = 2 * fwd MACs per GPU:                     364.44 G
fwd flops of model = fwd flops per GPU * mp_size:             1       
fwd latency:                                                  8.52 ms 
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:      
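
The summary lines above are simple arithmetic on MACs and latency. Here is a sketch of those relations, following the formulas printed in the summary. The helper names are hypothetical, the convolution MAC count ignores bias, and the last value is computed here from the two printed numbers rather than copied from the profiler output.

```python
def conv2d_macs(in_ch: int, out_ch: int, kernel: int, out_h: int, out_w: int,
                groups: int = 1, batch: int = 1) -> int:
    """Multiply-accumulates of a square-kernel Conv2d (bias ignored)."""
    return batch * out_ch * out_h * out_w * (in_ch // groups) * kernel * kernel


def flops_from_macs(macs: float) -> float:
    """One multiply-accumulate counts as two floating point operations."""
    return 2 * macs


def flops_per_second(flops: float, latency_s: float) -> float:
    """Achieved FLOPS: operations divided by the time they took."""
    return flops / latency_s


# First ResNet convolution: 3 -> 64 channels, 7x7 kernel, 112x112 output.
stem_macs = conv2d_macs(3, 64, 7, 112, 112)
print(stem_macs)  # 118013952

# Relations from the summary, applied to the printed MACs and latency.
fwd_flops = flops_from_macs(182.22e9)
print(f"{fwd_flops / 1e9:.2f} G flops")
print(f"{flops_per_second(fwd_flops, 8.52e-3) / 1e12:.2f} TFLOPS")
```

Note the two spellings: lowercase "flops" is an operation count, while uppercase "FLOPS" is a rate (operations per second).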

### 3. Benchmark in torch

In [3]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark
import torchvision.models as models
from itertools import product


def get_upsample_module(mode, upsample, ch):
    if mode == 'deconv':
        return nn.Sequential(
            torch.nn.ConvTranspose2d(ch, ch, upsample, stride=upsample, dilation=1, groups=ch, bias=False),
            torch.nn.BatchNorm2d(ch)
        ).cuda()
    else:
        return nn.Upsample(scale_factor=upsample, mode=mode).cuda()

def get_downsample_module(kernel, ch):
    return nn.Sequential(
        torch.nn.Conv2d(ch, ch, kernel, stride=1, padding=1, dilation=1, groups=ch, bias=False),
        torch.nn.BatchNorm2d(ch)
    ).cuda()

use_benchmark = True
amp = True
channel_last = True

torch.backends.cudnn.benchmark = use_benchmark

results = []

batch_size = [8, 16, 32, 64]
channel_size = [1024, 2048]
image_size = [16]
mode = ['nearest', 'conv', 'deconv', 'bilinear']
scale_factors = [2]

for b, c, n in product(batch_size, channel_size, image_size):
    label = f'Upsample (benchmark={use_benchmark}, amp={amp}, channel_last={channel_last})'
    sub_label = f'[{b}, {c}, {n}, {n}]'
    x = torch.rand((b, c, n, n)).cuda()

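    for method, upsample in product(mode, scale_factors):
        if method == 'conv':
            # Baseline: a depthwise conv with kernel size `upsample + 1` and stride 1.
            # With scale_factors = [2] this is a 3x3 conv that keeps the spatial size,
            # so "scale" in its description is really the kernel size.
            upsample += 1
            model = get_downsample_module(upsample, c)
        else:
            model = get_upsample_module(method, upsample, c)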

        if channel_last:
            x = x.to(memory_format=torch.channels_last)
            model = model.to(memory_format=torch.channels_last)
        
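        # Pass `enabled` by keyword: the second positional argument of
        # torch.autocast is `dtype` in recent PyTorch versions.
        with torch.autocast('cuda', enabled=amp), torch.no_grad():
            results.append(benchmark.Timer(
                stmt='model(x)',
                # Pass both objects explicitly instead of importing from __main__.
                globals={'model': model, 'x': x},
                label=label,
                sub_label=sub_label,
                description=f"{method}(scale={upsample})",
                num_threads=4
            ).blocked_autorange(min_run_time=1))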

compare = benchmark.Compare(results)
compare.colorize()
compare.print()
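
Besides the `Compare` table, each entry in `results` is a `Measurement` object whose statistics can be read directly. Here is a minimal CPU-only sketch, with a toy `nn.Upsample` module and a small input chosen just for illustration (not the configuration benchmarked above).

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

# Toy module and input (assumptions for illustration only).
model = nn.Upsample(scale_factor=2, mode="nearest")
x = torch.rand(1, 4, 8, 8)

with torch.no_grad():
    measurement = benchmark.Timer(
        stmt="model(x)",
        globals={"model": model, "x": x},
        label="Upsample",
        description="nearest(scale=2)",
        num_threads=1,
    ).blocked_autorange(min_run_time=0.2)

# Times are reported in seconds per call of `stmt`.
print(f"median: {measurement.median * 1e6:.1f} us")
print(f"IQR:    {measurement.iqr * 1e6:.1f} us")
```

Reading `median` and `iqr` directly is handy when you want to log results or plot them instead of printing the comparison table.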