# Inference in Deep Learning Models (PyTorch)

Code and explanations taken from [here](https://deci.ai/blog/measure-inference-time-deep-neural-networks/)

GPU execution is asynchronous by default. More specifically, when you call a function that uses a GPU, the operations are enqueued to that specific device but not necessarily executed until later, and control returns to the CPU right away. This allows us to execute other computations in parallel on the CPU or on another GPU.

<img src='https://deci.ai//wp-content/uploads/2020/05/Figure-1_white.png'>

Asynchronous execution offers huge advantages for deep learning, such as the ability to decrease run-time by a large factor. For example, when running inference on multiple batches, the second batch can be preprocessed on the CPU while the first batch is fed forward through the network on the GPU. It is therefore beneficial to use asynchronous execution whenever possible at inference time.
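The overlap between preprocessing and inference can be imitated without a GPU. The following is a minimal sketch, not PyTorch code: `preprocess` and `infer` are hypothetical stand-ins that only sleep, the durations are made-up values, and a background thread plays the role of the CPU preparing the next batch while the current one is being processed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

PREPROCESS_S = 0.05  # assumed CPU cost per batch (made-up value)
INFER_S = 0.05       # assumed device cost per batch (made-up value)


def preprocess(batch_id):
    """Hypothetical stand-in for CPU-side preprocessing."""
    time.sleep(PREPROCESS_S)
    return batch_id


def infer(batch):
    """Hypothetical stand-in for the forward pass on the device."""
    time.sleep(INFER_S)
    return batch


def run_sequential(num_batches):
    """Preprocess and infer one batch at a time, with no overlap."""
    start = time.perf_counter()
    for i in range(num_batches):
        infer(preprocess(i))
    return time.perf_counter() - start


def run_overlapped(num_batches):
    """Prepare batch i + 1 in the background while batch i is inferred."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, 0)
        for i in range(num_batches):
            batch = pending.result()  # wait for the batch prepared earlier
            if i + 1 < num_batches:
                # Start preparing the next batch before running this one.
                pending = pool.submit(preprocess, i + 1)
            infer(batch)
    return time.perf_counter() - start


sequential_s = run_sequential(4)
overlapped_s = run_overlapped(4)
print(f"sequential: {sequential_s:.2f} s, overlapped: {overlapped_s:.2f} s")
```

With these assumed costs, the overlapped loop hides all but the first preprocessing step behind the inference time.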

When you measure time with Python's "time" library, the measurement is taken on the CPU. Because GPU execution is asynchronous, the line of code that stops the timer can run before the GPU has finished its work, so the reported time is shorter than the real inference time.
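The pitfall can be reproduced with a CPU-only analogy. This is a sketch under stated assumptions, not GPU code: a single-worker thread pool stands in for the device queue, `fake_kernel` is a hypothetical workload that sleeps for a made-up duration, and waiting on the returned handle plays the role of a synchronization point.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORK_S = 0.2  # assumed duration of the device work (made-up value)


def fake_kernel():
    """Hypothetical stand-in for work that runs on the device."""
    time.sleep(WORK_S)


with ThreadPoolExecutor(max_workers=1) as device:
    # Wrong: stop the clock as soon as the launch call returns.
    start = time.perf_counter()
    handle = device.submit(fake_kernel)  # returns immediately, like a GPU launch
    unsynced_s = time.perf_counter() - start
    handle.result()  # drain the queue before the second measurement

    # Right: wait for the work to finish before stopping the clock.
    start = time.perf_counter()
    handle = device.submit(fake_kernel)
    handle.result()  # plays the role of a synchronization point
    synced_s = time.perf_counter() - start

print(f"without waiting: {unsynced_s * 1000:.1f} ms")
print(f"with waiting:    {synced_s * 1000:.1f} ms")
```

The first measurement only captures the cost of enqueuing the work, which is the same mistake as stopping a CPU timer right after a GPU forward pass.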

### GPU warm-up

**A GPU tends to sit in a passive, low-power state when idle, so it needs some wake-up time before it can start working.**

A modern GPU device can exist in one of several different power states. When the GPU is not being used for any purpose and persistence mode (which keeps the GPU on) is not enabled, the GPU automatically reduces its power state to a very low level, sometimes even a complete shutdown. In a lower power state, the GPU shuts down different pieces of hardware, including memory subsystems, internal subsystems, or even compute cores and caches.

**The invocation of any program that attempts to interact with the GPU will cause the driver to load and/or initialize the GPU.**
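On NVIDIA hardware, the current power state and the persistence mode setting can be inspected with the `nvidia-smi` command-line tool. The helper below is a minimal sketch: the function name `query_gpu_power_state` is made up for illustration, and it assumes `nvidia-smi` is on the PATH, returning `None` otherwise.

```python
import shutil
import subprocess


def query_gpu_power_state():
    """Return nvidia-smi's CSV report of power state and persistence mode, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # NVIDIA driver tools are not installed on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,pstate,persistence_mode", "--format=csv"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None  # the tool is present but could not talk to a GPU
    return result.stdout.strip()


report = query_gpu_power_state()
print(report if report is not None else "nvidia-smi not available")
```

The `pstate` column reports a performance state from P0 (maximum performance) upward, so comparing it before and after a warm-up run gives a rough picture of whether the GPU was idling.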

#### In the code below
Before we make any time measurements, we run some dummy examples through the network to do a "GPU warm-up". This initializes the GPU and prevents it from dropping into a power-saving state while we measure time.

Next, we use torch.cuda.Event to measure time on the GPU. It is crucial to call torch.cuda.synchronize() before reading the result. This call synchronizes the host and the device (i.e., the CPU and the GPU), so the elapsed time is read only after the work running on the GPU has finished. This overcomes the issue of unsynchronized execution.

In [None]:
import torch
import numpy as np
from pathlib import Path
segformer_path = 
model = torch.

In [25]:

# Load a pretrained EfficientNet-B0 from NVIDIA's torch.hub repository.
EfficientNet = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_efficientnet_b0', pretrained=True)

model = EfficientNet
device = torch.device("cuda")
model.to(device)
model.eval()  # inference behaviour for dropout and batch norm layers
dummy_input = torch.randn(1, 3, 224, 224, dtype=torch.float).to(device)

# INIT LOGGERS
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions, 1))

with torch.no_grad():
    # GPU WARM-UP
    for _ in range(10):
        _ = model(dummy_input)
    # MEASURE PERFORMANCE
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)  # milliseconds
        timings[rep] = curr_time

mean_syn = np.mean(timings)
std_syn = np.std(timings)
print("Mean Inference Time: {} ms for 1 single batch".format(mean_syn))

Using cache found in /home/helldiver/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


Mean Inference Time: 19.99448523203532 ms for 1 single batch
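The cell above computes `std_syn` but prints only the mean. A mean alone can hide occasional slow runs, so it may help to report the spread as well. Here is a minimal sketch using only the standard library, where the list `timings_ms` holds made-up latencies standing in for the measured `timings` array:

```python
import statistics

# Hypothetical per-run latencies in milliseconds (made-up values).
timings_ms = [19.8, 20.1, 20.0, 20.3, 19.8, 24.5, 19.9, 20.0]

mean_ms = statistics.mean(timings_ms)
std_ms = statistics.pstdev(timings_ms)  # population std, the same default as np.std
median_ms = statistics.median(timings_ms)
worst_ms = max(timings_ms)

print(f"mean {mean_ms:.2f} ms, std {std_ms:.2f} ms, "
      f"median {median_ms:.2f} ms, max {worst_ms:.2f} ms")
```

In this made-up sample a single slow run pulls the mean above the median, which is the kind of effect the extra statistics are meant to expose.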


# Measuring Throughput

**How many inputs (images, in computer vision) can the model process per second at a given batch size?**

The throughput of a neural network is defined as the maximal number of input instances the network can process in a unit of time (e.g., a second). The inputs might be the frames of a video, for example, or any other stream of data. Measuring it takes two steps: first determine the optimal batch size, then measure how many instances the network processes per second at that batch size.

In [24]:
model = EfficientNet
device = torch.device("cuda")
model.to(device)
model.eval()  # inference behaviour for dropout and batch norm layers
optimal_batch_size = 16
dummy_input = torch.randn(optimal_batch_size, 3, 224, 224, dtype=torch.float).to(device)

# Create the timing events once and reuse them for every batch.
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
total_batches = 100
total_time = 0  # seconds
with torch.no_grad():
    for rep in range(total_batches):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender) / 1000  # ms -> s
        total_time += curr_time
throughput = (total_batches * optimal_batch_size) / total_time
print('Final Throughput: {} ---> The number of examples our network can process in one second'.format(throughput))
print("\nGiven that the network works with a batch size of {}".format(optimal_batch_size))
print("\nTotal time: {} s".format(total_time))

Final Throughput: 131.8970836079158 ---> The number of examples our network can process in one second

Given that the network works with a batch size of 16

Total time: 12.130670036315921 s
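The reported throughput follows directly from the measured totals: 100 batches of 16 inputs in about 12.13 seconds. The sketch below redoes that arithmetic and compares it with the batch-size-1 latency of about 19.99 ms measured earlier. The helper name `throughput` is chosen here for illustration, and the comparison assumes both runs used the same GPU and model.

```python
def throughput(num_batches, batch_size, total_seconds):
    """Inputs processed per second over the whole measurement."""
    return num_batches * batch_size / total_seconds


# Figures reported in this notebook.
batched = throughput(num_batches=100, batch_size=16, total_seconds=12.130670036315921)
# One input per batch, with the mean latency converted from ms to s.
single = throughput(num_batches=1, batch_size=1, total_seconds=19.99448523203532 / 1000)

print(f"batch size 16: {batched:.1f} inputs/s")
print(f"batch size 1:  {single:.1f} inputs/s")
print(f"speed-up from batching: {batched / single:.2f}x")
```

Repeating the measurement for several batch sizes and comparing the resulting values is one way to look for the batch size that gives the highest throughput on a given device.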
