# Benchmarks

## Setup

Requires CUDA and a GPU.

In [None]:
!pip install datasets transformers

In [None]:
%%bash
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c 
pip install -r requirements.txt
# Use the commented `pip install` line instead if the pip version < 24
# pip install -v --no-build-isolation --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
cd ..

In [None]:
from apex.optimizers.fused_lamb import FusedLAMB
import torch
from torch.utils.benchmark import Compare, Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias, Timer
from transformers import set_seed

from pytorch_fused_lamb import Lamb

from tests.reference import Lamb as ReferenceLamb

In [2]:
torch.__version__

'2.4.0a0+git1346ebf'

In [3]:
torch.version.cuda

'11.8'

## Description

Optimizers update a list of parameters in place. In a naive implementation, the optimizer loops over the list of parameter tensors and performs the update ops for each tensor individually. If the list of parameters is large, this introduces significant overhead, since kernel launches are costly. The `torch._foreach_*` ops fuse this procedure horizontally across the list of parameters. Still, every update op launches its own kernel, which is inefficient. `torch.compile` can be used to fuse the update ops vertically as well.
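As a minimal sketch of horizontal fusion, the cell below applies a toy update `p <- p - lr * g` (not the LAMB update) once with a per-tensor loop and once with a single `torch._foreach_add_` call. The tensors, shapes, and learning rate are made up for illustration, and the example runs on CPU, so it only demonstrates that both variants compute the same result, not the speed difference.

```python
import torch

# Hypothetical toy setup: three small "parameters" and their "gradients".
params = [torch.ones(4) for _ in range(3)]
grads = [torch.full((4,), 0.5) for _ in range(3)]
lr = 0.1

# Naive variant: one op per tensor (on a GPU, one kernel launch each).
naive = [p.clone() for p in params]
for p, g in zip(naive, grads):
    p.add_(g, alpha=-lr)

# Horizontally fused variant: one call handles the whole list of tensors.
fused = [p.clone() for p in params]
torch._foreach_add_(fused, grads, alpha=-lr)

# Both variants should produce identical parameters.
same = all(torch.allclose(a, b) for a, b in zip(naive, fused))
print(same)
```

A real optimizer step chains several such ops (multiply, add, square root, divide). Each `_foreach` call still runs as its own op, which is the part that vertical fusion with `torch.compile` targets.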

The following benchmark compares a vertically and horizontally fused implementation of the LAMB optimizer with a reference implementation in PyTorch and the fused CUDA kernel of the nvidia/apex library.

Note: currently, torch optimizers do not support the `fullgraph=True` and `mode="max-autotune"` options of `torch.compile` (see https://github.com/pytorch/pytorch/pull/118987). Therefore, the fused LAMB implementation does not get the full benefit of fusion (e.g. reduced kernel launch overhead through CUDA graphs).

## Result

Canonical benchmark code from: https://pytorch.org/tutorials/recipes/compiling_optimizer.html

In [4]:
device = "cuda"
seed = 123
set_seed(seed)

In [5]:
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(100)]
)
inputs = torch.rand(8, 1024, device="cuda")
output = model(inputs)
output.sum().backward()

In [6]:
opt = Lamb(model.parameters(), lr=1e-3)
reference_opt = ReferenceLamb(model.parameters(), lr=1e-3)
fused_opt = FusedLAMB(model.parameters(), lr=1e-3)


@torch.compile(fullgraph=False)
def fn():
    opt.step()


# Benchmarking helper. Despite its name, it returns a Measurement object;
# Compare chooses the display unit (milliseconds in the output below).

def benchmark_torch_function_in_microseconds(f, sub_label, *args, **kwargs):
    t0 = Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f},
        sub_label=sub_label,
        description="runtime",
    )
    return t0.blocked_autorange()


# Warmup runs to compile the function
for _ in range(5):
    fn()

reference_runtime = benchmark_torch_function_in_microseconds(reference_opt.step, sub_label="reference")
fused_runtime = benchmark_torch_function_in_microseconds(fused_opt.step, sub_label="apex")
compiled_runtime = benchmark_torch_function_in_microseconds(fn, sub_label="compiled")


compare = Compare([reference_runtime, fused_runtime, compiled_runtime])

W0412 00:05:49.738000 140064798246720 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored


In [7]:
compare.print()

[------------  -----------]
                 |  runtime
1 threads: ----------------
      reference  |    26.2 
      apex       |    10.4 
      compiled   |    24.1 

Times are in milliseconds (ms).



The results show that the compiled (fused) LAMB implementation is about 8% faster than the reference implementation (24.1 ms vs. 26.2 ms), but takes about 2.3 times as long as the nvidia/apex CUDA kernel (10.4 ms).
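The relative numbers can be recomputed from the table above. The short sketch below uses only the three measured LAMB runtimes; the `speedup` helper is a hypothetical name introduced for illustration.

```python
# Measured LAMB runtimes in milliseconds, copied from the table above.
lamb_ms = {"reference": 26.2, "apex": 10.4, "compiled": 24.1}


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


compiled_vs_reference = speedup(lamb_ms["reference"], lamb_ms["compiled"])
apex_vs_compiled = speedup(lamb_ms["compiled"], lamb_ms["apex"])
time_saved = 1 - lamb_ms["compiled"] / lamb_ms["reference"]

print(f"compiled vs. reference: {compiled_vs_reference:.2f}x")
print(f"apex vs. compiled:      {apex_vs_compiled:.2f}x")
print(f"runtime saved by compiling: {time_saved:.0%}")
```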

## Results with Adam

For comparison, the same benchmark is run for a different optimizer (Adam), for which a vertically and horizontally fused PyTorch implementation already exists.

In [13]:
from apex.optimizers.fused_adam import FusedAdam

In [14]:
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
fused_opt = FusedAdam(model.parameters(), lr=1e-3)


@torch.compile(fullgraph=False)
def fn():
    opt.step()


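# Benchmarking helper (same as above). Despite its name, it returns a Measurement object;
# Compare chooses the display unit (milliseconds in the output below).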

def benchmark_torch_function_in_microseconds(f, sub_label, *args, **kwargs):
    t0 = Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f},
        sub_label=sub_label,
        description="runtime",
    )
    return t0.blocked_autorange()


# Warmup runs to compile the function
for _ in range(5):
    fn()

reference_runtime = benchmark_torch_function_in_microseconds(opt.step, sub_label="reference")
fused_runtime = benchmark_torch_function_in_microseconds(fused_opt.step, sub_label="apex")
compiled_runtime = benchmark_torch_function_in_microseconds(fn, sub_label="compiled")


compare = Compare([reference_runtime, fused_runtime, compiled_runtime])

In [16]:
compare.print()

[------------  -----------]
                 |  runtime
1 threads: ----------------
      reference  |    63.5 
      apex       |     7.1 
      compiled   |     5.9 

Times are in milliseconds (ms).



Interestingly, in the Adam case the compiled implementation (5.9 ms) is even faster than the nvidia/apex CUDA kernel (7.1 ms). This suggests that the fused LAMB implementation can be improved further to reach similar results. One significant improvement would be to replace the following line once `torch._foreach_where` is available (https://github.com/pytorch/pytorch/issues/117884):

```
        trust_ratio = tuple(torch.where(torch.isinf(ratio), 1., ratio) for ratio in trust_ratio)
```
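To illustrate what this line does, the following sketch applies it to a few made-up trust ratios. The values are hypothetical, with `inf` standing in for a ratio whose denominator was zero. The generator expression calls `torch.where` and `torch.isinf` once per tensor, which is the kind of per-tensor loop that a horizontally fused variant would avoid.

```python
import torch

# Hypothetical per-parameter trust ratios (0-dim tensors), one of them infinite.
trust_ratio = (
    torch.tensor(0.5),
    torch.tensor(float("inf")),
    torch.tensor(2.0),
)

# Replace infinite ratios with 1.0, one tensor at a time.
trust_ratio = tuple(torch.where(torch.isinf(ratio), 1.0, ratio) for ratio in trust_ratio)

values = [ratio.item() for ratio in trust_ratio]
print(values)
```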