In [1]:
# Matthew Giarra <matthew.giarra@jhuapl.edu>

import torch
import torchvision
import apex.amp as amp
import time # for timing execution

# Number of images per batch
batch_size = 100

# Number of iterations
niter = 30

# Results vectors
results_list = []
results_names = []

# Make data
input_batch_cpu = torch.rand([batch_size, 3, 224, 224], dtype=torch.float32)
input_batch_gpu_full  = input_batch_cpu.to('cuda')
input_batch_gpu_half  = input_batch_gpu_full.half()

# # # # # Inference with automatic mixed precision (AMP) via APEX  # # # # #  

# Run each of the APEX AMP optimization levels
for opt_level in ["O3", "O2", "O1", "O0"]:
    model = torchvision.models.resnet50(pretrained=False).eval().to('cuda')
    model_amp = amp.initialize(model, opt_level=opt_level)
    
    # Warm up
    with torch.no_grad():
        for t in range(3):
            output_gpu = model_amp(input_batch_gpu_half)

    # Run inference on the batch of images
    # torch.no_grad() turns off gradient calculations for faster performance
    # CUDA kernels launch asynchronously, so synchronize before reading the clock
    torch.cuda.synchronize()
    tic = time.perf_counter()
    with torch.no_grad():
        for t in range(niter):
            output_gpu = model_amp(input_batch_gpu_half)
    torch.cuda.synchronize()
    # Execution time
    toc = time.perf_counter()
    print("AMP (opt level %s): %0.2f seconds" % (opt_level, toc-tic))
    
    # Results
    results_list.append(toc-tic)
    results_names.append('AMP ' + opt_level)
        
    
    
# # # # # Inference with full precision (Float32) # # # # #        

# Load the model
model = torchvision.models.resnet50(pretrained=False).eval().to('cuda')

# Warm up
with torch.no_grad():
    for t in range(3):
        output_gpu = model(input_batch_gpu_full)
      
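# Run inference on the batch of images
# torch.no_grad() turns off gradient calculations for faster performance
# CUDA kernels launch asynchronously, so synchronize before reading the clock
torch.cuda.synchronize()
tic = time.perf_counter()
with torch.no_grad():
    for t in range(niter):
        output_gpu = model(input_batch_gpu_full)
torch.cuda.synchronize()
toc = time.perf_counter()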
print("Float32: %0.2f seconds" % (toc-tic))

# Results
results_list.append(toc-tic)
results_names.append('Float32')

print("\nResults summary (%d images)\n===============" % (batch_size * niter) )
for name, result in zip(results_names, results_list):
    print("%s: %0.2f seconds  (%0.2fx full precision speed)" % (name, result, results_list[-1]/result))

Selected optimization level O3:  Pure FP16 training.
Defaults for this optimization level are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
AMP (opt level O3): 9.62 seconds
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights      

## Current results using Jetson AGX Xavier

### Configuration
- Platform: NVIDIA Jetson AGX Xavier
- JetPack SDK: 4.4 ([L4T R32.4.3](https://developer.nvidia.com/embedded/jetpack))
- Power mode: [MAXN](https://www.jetsonhacks.com/2018/10/07/nvpmodel-nvidia-jetson-agx-xavier-developer-kit/) (`$ sudo nvpmodel -m 0`) 
- Docker source image: `nvcr.io/nvidia/l4t-ml:r32.4.3-py3`  ([link](https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml))
- Apex build: [full build](https://github.com/NVIDIA/apex#quick-start)
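
The software side of this configuration can also be recorded from inside the container. The snippet below is a minimal sketch and was not part of the original benchmark; `describe_environment` is a hypothetical helper name, and it relies only on standard PyTorch query functions. It assumes nothing about the hardware and simply skips the device fields when no CUDA device is visible.

```python
import platform

import torch


def describe_environment():
    """Collect version info relevant to FP16 / tensor core performance."""
    info = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,       # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # Compute capability as (major, minor); tensor cores need 7.0 or higher
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("%-20s: %s" % (key, value))
```

Pasting this output alongside the list above makes it easier to compare runs across containers or JetPack releases.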

### Results
| Precision | Execution time for 3000 images (sec) | Throughput (FPS) | Speed-up vs. Float32 |
|:----------:|:----------------------:|:----------:|:--------:|
|   AMP O3 |        9.72        |   308   |     1.00 |
|   AMP O2 |        9.67        |   310   |     1.01 |
|   AMP O1 |        9.65        |   310   |     1.01 |
|   AMP O0 |        9.72        |   308   |     1.00 |
|   Float32 |        9.70        |   309   |     1.00 |
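
The throughput and speed-up columns follow directly from the timings: throughput is the total number of images (`batch_size * niter`, i.e. 3000) divided by the execution time, and speed-up is the `Float32` time divided by each run's time. The sketch below recomputes them from the rounded times in the table; the function names are my own, and because the inputs are already rounded to two decimals, the last digit of a recomputed value may differ slightly from the table.

```python
# Timings copied from the results table (seconds for the whole run)
batch_size = 100
niter = 30
times = {
    "AMP O3": 9.72,
    "AMP O2": 9.67,
    "AMP O1": 9.65,
    "AMP O0": 9.72,
    "Float32": 9.70,
}


def throughput_fps(seconds, n_images=batch_size * niter):
    """Images processed per second over the whole run."""
    return n_images / seconds


def speedup(seconds, baseline_seconds):
    """How many times faster a run is than the baseline (>1 means faster)."""
    return baseline_seconds / seconds


baseline = times["Float32"]
for name, seconds in times.items():
    print("%-8s %6.1f FPS  %0.3fx" % (name, throughput_fps(seconds), speedup(seconds, baseline)))
```

Every configuration lands within about 1% of the `Float32` baseline, which is the observation the rest of this post tries to explain.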

## Comparison with published results
These results are much worse than those posted on the [NVIDIA Developer Blog](https://developer.nvidia.com/blog/jetson-xavier-nx-the-worlds-smallest-ai-supercomputer/). Specifically, my measured throughput (about 310 FPS) is around 16% of the throughput they report for ResNet50 on images of the same size (1941 FPS for 224x224 images).


![Image](https://developer.download.nvidia.com/devblogs/inferencing-performance.png)

## Why I think it should be faster

My results indicate that inference on `ResNet50` using mixed precision (or even pure `Float16`) yields no speed-up compared to inference using `Float32` precision. I've experimented with various batch sizes, and the results are not substantially different from those above. This is contrary to my expectation. I think we should see a speed-up because the tensor cores should be invoked under the following circumstances, which I believe I've met:

1. The device has tensor cores ([the NVIDIA Jetson AGX Xavier uses the Volta GPU architecture](http://info.nvidia.com/rs/156-OFN-742/images/Jetson_AGX_Xavier_New_Era_Autonomous_Machines.pdf)).
2. Much of the work in the forward pass consists of convolutional layers, which [should invoke tensor cores for FP16 operations](https://nvidia.github.io/apex/amp.html#o1-mixed-precision-recommended-for-typical-use).

3. I'm using cuDNN 8.0, and [for "cudnn 7.3 and later, convolutions should use TensorCores for FP16 inputs"](https://discuss.pytorch.org/t/cnn-fp16-slower-than-fp32-on-tesla-p100/12146/4):

    ```python
    >>> import torch
    >>> print(torch.backends.cudnn.version())
    8000
    ```      


4. The number of input and output channels in each `Conv2d` layer is a multiple of 8 (except the 3-channel input to the first layer), which is a [requirement for tensor cores](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9926-tensor-core-performance-the-ultimate-guide.pdf). So are the dimensions of the fully connected `Linear` layer (2048 inputs, 1000 outputs). You can verify this by inspecting the output of the following commands:

    ```python
    >>> import torchvision
    >>> print(torchvision.models.resnet50())

    ResNet(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): Bottleneck(
          (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace=True)
        )
          .
          .
          .

      (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
      (fc): Linear(in_features=2048, out_features=1000, bias=True)
    )
    ```
etc. 


