## How to analyze, benchmark, and compare models in PyTorch

To develop a new model in PyTorch, analyzing existing models is a necessary first step. However, I could not find a single all-in-one package for this purpose. To analyze a model, we need to see its inputs/outputs, parameters, FLOPs, latency, and throughput, at both the model level and the layer level. After spending a lot of time searching, these are the tools I decided to use for each purpose. I hope this helps.

- Want to see a model's structure in a clean view? Use `torchsummary`.
- Want to analyze a model's latency/params/FLOPs per layer? Use `deepspeed.profiling.flops_profiler.get_model_profile()`. The other libraries I tried do not seem to report all three of these per layer.
    - `torch.profiler`: This library supports latency/params/FLOPs/memory, but at the level of CUDA kernel operations. It seems best suited to those who want to optimize a module at a lower level.
    - `fvcore` (its `fvcore.nn` utilities): This library supports FLOPs/params only. To me, it looks like an upgraded version of `torchsummary`.

- Want a simple latency comparison between implementations? Use `torch.utils.benchmark`.
- Want to compare overall performance (params/FLOPs/latency/throughput)? Use `benchmark.py` in the `timm` GitHub repository.
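
For the operator-level view mentioned in the list above, a minimal sketch of `torch.profiler` might look like the following. It assumes a CPU-only run on a tiny stand-in model (`toy_model` is a hypothetical name, not part of the original notebook); on a GPU you would also pass `ProfilerActivity.CUDA` in `activities`.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in model, kept tiny so the sketch runs quickly on CPU.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()
x = torch.randn(1, 3, 32, 32)

# Record CPU operator events for a single forward pass.
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        toy_model(x)

# Aggregate events by operator name and list the most expensive ones.
op_table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(op_table)
```

The table is keyed by operator (for example `aten::conv2d`), not by `nn.Module`, which is why it suits low-level tuning more than layer-by-layer model comparison.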

### 1. Model Architecture 

In [1]:
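import os

# Pin the notebook to a single GPU. Index '4' is specific to the author's machine;
# set these before any CUDA initialization and adjust the index for your setup.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torchvision.models as models
from torchsummary import summary

model = models.resnet50().cuda()

# input_size excludes the batch dimension: (channels, height, width).
summary(model, input_size=(3, 224, 224))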

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]           4,096
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]          16,384
      BatchNorm2d-12          [-1, 256, 56, 56]             512
           Conv2d-13          [-1, 256, 56, 56]          16,384
      BatchNorm2d-14          [-1, 256,
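
The `Param #` column can be checked by hand. The sketch below assumes bias-free, square-kernel convolutions (as in torchvision's ResNet) and BatchNorm layers with one scale and one shift per channel. `conv2d_params` and `batchnorm2d_params` are hypothetical helper names used only for illustration.

```python
def conv2d_params(in_ch, out_ch, kernel, bias=False):
    """Weights of a square-kernel Conv2d: out_ch * in_ch * k * k (+ out_ch biases)."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)


def batchnorm2d_params(channels):
    """Learnable BatchNorm2d parameters: one scale and one shift per channel."""
    return 2 * channels


# Compare against the rows of the summary table above.
print(conv2d_params(3, 64, 7))     # Conv2d-1:       9408
print(batchnorm2d_params(64))      # BatchNorm2d-2:  128
print(conv2d_params(64, 64, 1))    # Conv2d-5:       4096
print(conv2d_params(64, 64, 3))    # Conv2d-8:       36864
print(conv2d_params(64, 256, 1))   # Conv2d-11:      16384
print(batchnorm2d_params(256))     # BatchNorm2d-12: 512
```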

### 2. Params/FLOPs/Latency per layer

In [2]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
import torchvision.models as models
from deepspeed.profiling.flops_profiler import get_model_profile

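model = models.resnet18().cuda()

# Profile the forward pass layer by layer. This call matches the DeepSpeed version
# used here; newer releases appear to name the shape argument `input_shape` and to
# return (flops, macs, params), so check the signature of your installed version.
macs, _ = get_model_profile(
    model=model,
    input_res=(1, 3, 224, 224),   # (batch, channels, height, width)
    print_profile=True,           # print the report shown below
    detailed=True,                # include the per-module breakdown
    warm_up=10,                   # warm-up iterations before measuring
    as_string=False,              # return raw numbers, not formatted strings
    output_file=None,             # None prints to stdout
    ignore_modules=None
)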


-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating point operations (flops), floating point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per gpu:                                               11.69 M 
params of model = params per GPU * mp_size:                   1       
fwd MACs per GPU:                                             1.82 G  
fwd flops per GPU = 2 * fwd MACs per GPU:                     3.64 G  
fwd flops of model = fwd flops per GPU * mp_size:             1       
fwd latency:                                                  7.5 ms  
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:      
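
The relationships in the summary are simple arithmetic. The sketch below plugs in the rounded figures printed above (1.82 G MACs, 7.5 ms), so its results are approximations and not the profiler's own output. The function names are hypothetical.

```python
def flops_from_macs(macs):
    """One multiply-accumulate is counted as two floating point operations."""
    return 2 * macs


def flops_per_second(flops, latency_s):
    """Achieved rate: operations divided by the time they took."""
    return flops / latency_s


fwd_macs = 1.82e9        # "fwd MACs per GPU" from the summary (rounded)
fwd_latency_s = 7.5e-3   # "fwd latency" from the summary (rounded)

fwd_flops = flops_from_macs(fwd_macs)
rate = flops_per_second(fwd_flops, fwd_latency_s)

print(f"fwd flops: {fwd_flops / 1e9:.2f} G")
print(f"fwd FLOPS: {rate / 1e9:.1f} GFLOPS (approximate, from rounded inputs)")
```

Note the difference between lowercase "flops" (a count of operations) and uppercase "FLOPS" (a rate per second), which the profiler's notation section also distinguishes.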

### 3. Benchmark in torch

In [3]:
import torch
import torch.utils.benchmark as benchmark
from itertools import product


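def batched_dot_mul_sum(a, b):
    # Row-wise dot product: elementwise multiply, then sum over the last dim.
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    # Same result via batched matmul: [B, 1, N] @ [B, N, 1] -> [B, 1, 1] -> [B]
    a = a.unsqueeze(1)
    b = b.unsqueeze(-1)
    return a.bmm(b).flatten(-3)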


if __name__ == '__main__':
    x = torch.rand([10, 64])
    assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))

    results = []

    sizes = [1, 64, 256]

    # Sweep every (batch, feature) size combination.
    for b, n in product(sizes, sizes):
        label = 'batched dot product'
        sub_label = f'[{b}, {n}]'
        x = torch.rand((b, n))

        for num_threads in [1, 4, 8, 16]:
            results.append(benchmark.Timer(
                stmt='batched_dot_mul_sum(x, x)',
                setup='from __main__ import batched_dot_mul_sum',
                globals={'x': x},
                label=label,
                sub_label=sub_label,
                description='mul',
                num_threads=num_threads
            ).blocked_autorange(min_run_time=1))

            results.append(benchmark.Timer(
                stmt='batched_dot_bmm(x, x)',
                setup='from __main__ import batched_dot_bmm',
                globals={'x': x},
                label=label,
                sub_label=sub_label,
                description='bmm',
                num_threads=num_threads
            ).blocked_autorange(min_run_time=1))

    compare = benchmark.Compare(results)
    compare.print()
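
The same `Timer` can also time a model's forward pass, from which throughput follows as batch size divided by latency. This is a minimal sketch on CPU with a hypothetical `toy_model` and a hypothetical `forward` helper, neither of which comes from the original notebook. Objects are passed through `globals` so no `from __main__ import` is needed.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

batch_size = 8
# Hypothetical stand-in model; swap in the model you actually want to measure.
toy_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()
x = torch.randn(batch_size, 64)


def forward(model, inputs):
    # Inference only: skip autograd bookkeeping while timing.
    with torch.no_grad():
        return model(inputs)


measurement = benchmark.Timer(
    stmt='forward(model, inputs)',
    globals={'forward': forward, 'model': toy_model, 'inputs': x},
    num_threads=1,
    label='toy model forward',
).blocked_autorange(min_run_time=0.5)

latency_s = measurement.median         # seconds per forward pass
throughput = batch_size / latency_s    # samples per second
print(f'latency: {latency_s * 1e3:.3f} ms, throughput: {throughput:.0f} samples/s')
```

Using the median instead of the mean makes the figure less sensitive to occasional slow runs. The numbers depend heavily on hardware and thread count, so they are only comparable within one machine and configuration.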