# Optimize PyTorch Models for Inference

In this notebook, we generate synthetic data and use it for inference with sample computer vision and NLP workloads. We first use stock PyTorch models to generate predictions. Then, with minimal code changes using Intel® Extension for PyTorch (IPEX), we see how speedups can be gained over stock PyTorch on Intel® hardware. We also see how the quantization features of Intel® Extension for PyTorch (IPEX) can be used to reduce the inference time of a model.

# Key Takeaways

- Get started with Intel® Extension for PyTorch (IPEX) for drop-in acceleration
- Learn how to use the *optimize* method from Intel® Extension for PyTorch (IPEX) to apply optimizations at the Python frontend to a given model (nn.Module)
- Learn how to use the quantization features of Intel® Extension for PyTorch (IPEX) to convert a model to INT8
- Learn how to use the Intel® Extension for PyTorch (IPEX) Launch Script module to set additional configurations on top of the previously mentioned optimizations to boost performance

## 1. Computer Vision Workload - Faster R-CNN, ResNet50 Backbone

Faster R-CNN is a convolutional neural network used for object detection. We are going to use the **optimize** method from Intel® Extension for PyTorch (IPEX) to apply optimizations. Following this, we will also use TorchScript to obtain performance gains.

Let's start by importing all the necessary packages and modules

In [None]:
import time
import torch
import torchvision
import os
import matplotlib.pyplot as plt

**Prepare Sample Data**

Let's generate a random image using torch to test performance

In [None]:
# set the device to cpu
device = 'cpu'
# generate a random image to observe speedup on
image = torch.randn(1, 3, 1200, 1200)

In [None]:
# explore image shape

print(image.shape)

**Helper Functions**

The functions below help us load the model, record the time taken to run it, and plot comparison charts that summarize the optimizations.

In [None]:
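def load_model_eval_mode():
    """
    Loads model and returns it in eval mode.
    Relies on the globals `weights`, `weights_backbone` and `device`;
    `weights` and `weights_backbone` are set in the "model configs" cell,
    which must run before this function is called.
    """
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights, progress=True,
        num_classes=91, weights_backbone=weights_backbone).to(device)
    model = model.eval()
    
    return model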

def get_average_inference_time(model, image):
    """
    Warms up the model, then returns the average forward-pass time in milliseconds
    """
    num_runs = 25
    with torch.no_grad():
        # warm up (not timed)
        for _ in range(num_runs):
            model(image)

        # measure; `time` is already imported at the top of the notebook
        start = time.perf_counter()
        for _ in range(num_runs):
            model(image)
        end = time.perf_counter()
        average_inference_time = (end - start) / num_runs * 1000

    return average_inference_time

def plot_speedup(inference_time_stock, inference_time_optimized):
    """
    Plots a bar chart comparing the time taken by stock PyTorch model and the time taken by
    the model optimized by Intel® Extension for PyTorch (IPEX)
    """
    data = {'stock_pytorch_time': inference_time_stock, 'optimized_time': inference_time_optimized}
    model_type = list(data.keys())
    times = list(data.values())

    fig = plt.figure(figsize = (10, 5))

    # creating the bar plot
    plt.bar(model_type, times, color ='blue',
            width = 0.4)

    plt.ylabel("Runtime (ms)")
    plt.title(f"Speedup achieved - {inference_time_stock/inference_time_optimized:.2f}x")
    plt.show()
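
The warm-up-then-measure pattern used for timing is not specific to PyTorch. The following is a minimal sketch of that pattern with only the standard library. The `benchmark` helper and `stand_in_workload` are hypothetical names introduced here for illustration and are not part of the notebook. The sketch assumes any zero-argument callable can stand in for `model(image)`, and it reports the median alongside the mean because the median is less sensitive to occasional slow runs.

```python
import statistics
import time


def benchmark(fn, warmup=5, iters=25):
    """Hypothetical helper: time fn() after a warm-up phase.

    Returns (mean_ms, median_ms) over `iters` timed calls.
    """
    # Warm-up calls are executed but not timed, so one-off costs
    # (lazy initialization, cache population) do not skew the result.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)

    return statistics.mean(samples_ms), statistics.median(samples_ms)


def stand_in_workload():
    # Placeholder for model(image); any callable works.
    return sum(i * i for i in range(10_000))


mean_ms, median_ms = benchmark(stand_in_workload)
print(f"mean: {mean_ms:.3f} ms, median: {median_ms:.3f} ms")
```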
    



**Baseline PyTorch Model**

A baseline model is the stock, unoptimized version of the model, loaded here through torchvision. Let's load the baseline Faster R-CNN model and get predictions.

In [None]:
# model configs
weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
weights_backbone = torchvision.models.ResNet50_Weights.DEFAULT

**Input Image Memory Format**

There are two ways to represent image data that are inputs to a CNN model: Channels-First and Channels-Last. In Channels-First, the channels dimension comes first, followed by height and width. For example, (3, 224, 224), or NCHW for a batch, where N is batch size, C is channels, H is height, and W is width. In Channels-Last, the channels dimension comes last. For example, (224, 224, 3), or NHWC for a batch.
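
To make the difference concrete, the following sketch computes, in plain Python, where an element lands in memory under each layout. The helper names (`contiguous_strides`, `nchw_strides`, `nhwc_strides`, `offset`) are hypothetical and introduced only for this illustration. Strides are counted in elements and are always reported for the logical index order (n, c, h, w), which is also how a channels-last tensor keeps its NCHW shape while changing its physical ordering.

```python
def contiguous_strides(dims):
    """Strides (in elements) for dimensions stored in the given order, last one fastest."""
    strides = []
    step = 1
    for size in reversed(dims):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def nchw_strides(n, c, h, w):
    # Physical order N, C, H, W: W varies fastest in memory.
    return contiguous_strides((n, c, h, w))


def nhwc_strides(n, c, h, w):
    # Physical order N, H, W, C, reported for the logical index (n, c, h, w).
    sn, sh, sw, sc = contiguous_strides((n, h, w, c))
    return (sn, sc, sh, sw)


def offset(index, strides):
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (1, 3, 224, 224)
first = nchw_strides(*shape)
last = nhwc_strides(*shape)
print(first)  # (150528, 50176, 224, 1)
print(last)   # (150528, 1, 672, 3)

# Distance in memory between channel 0 and channel 1 of the same pixel.
pixel = (0, 0, 10, 20)
next_channel = (0, 1, 10, 20)
print(offset(next_channel, first) - offset(pixel, first))  # 50176
print(offset(next_channel, last) - offset(pixel, last))    # 1
```

In the channels-last layout, all channel values of one pixel sit next to each other in memory, whereas in channels-first they are a full image plane apart.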

**Channels-First**

PyTorch uses channels-first by default

In [None]:
# send the input to the device and pass it through the network to
# get the detections and predictions

model = load_model_eval_mode()

inference_time_stock = get_average_inference_time(model, image)

print(f"time taken for forward pass: {inference_time_stock} ms")

**Channels-Last**

Channels-Last memory format is an alternative way of ordering NCHW tensors in memory that keeps the logical dimension order unchanged. It allows us to use Channels-Last memory format optimizations on Intel® hardware.

In [None]:
model = load_model_eval_mode()
model = model.to(memory_format=torch.channels_last)
image_channels_last = image.to(memory_format=torch.channels_last)

inference_time_stock = get_average_inference_time(model, image_channels_last)

print(f"time taken for forward pass: {inference_time_stock} ms")

Now that we have timed the stock PyTorch model, let's add minimal code changes from Intel® Extension for PyTorch (IPEX) to obtain speedups. The minimal code changes are highlighted in the following cell

**Intel® Extension for PyTorch (IPEX)**

As described above, Intel® Extension for PyTorch (IPEX) provides us with the ability to make minimal code changes to apply optimizations over stock PyTorch models using Intel® hardware. The simple code changes are indicated below.

In [None]:
model = load_model_eval_mode()
model = model.to(memory_format=torch.channels_last)
image_channels_last = image.to(memory_format=torch.channels_last)
#################### code changes ####################
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)
######################################################

In [None]:
inference_time_optimized = get_average_inference_time(model, image_channels_last)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

> **_NOTE:_**  If a below par performance is observed, please restart the notebook kernel.

**TorchScript**

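TorchScript is a way to create serializable and optimizable models from PyTorch code. In the cell below, only the model's backbone is traced with `torch.jit.trace` and then frozen with `torch.jit.freeze`.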

In [None]:
model = load_model_eval_mode()
model = model.to(memory_format=torch.channels_last)
with torch.no_grad():
    model.backbone = torch.jit.trace(model.backbone, image_channels_last, strict=False)
    model.backbone = torch.jit.freeze(model.backbone)
    inference_time_optimized = get_average_inference_time(model, image_channels_last)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

## 2. NLP Workload - DistilBERT Base Uncased

DistilBERT is a transformer model, smaller and faster than BERT. We will use the Quantization feature from Intel® Extension for PyTorch (IPEX) to convert the model into INT8 for faster inference.

In [None]:
from transformers import DistilBertTokenizer, DistilBertModel, logging
logging.set_verbosity_error()

**Helper Functions**

Similar functions as before to help us load the model and summarize the optimizations

In [None]:
def load_model_eval_mode():
    """
    Loads model and returns it in eval mode
    """
    model = DistilBertModel.from_pretrained('distilbert-base-uncased-distilled-squad')
    model.eval()
    
    return model

def get_average_inference_time(model, inputs):
    """
    Warms up the model, then returns the average forward-pass time in milliseconds
    """
    num_runs = 25
    with torch.no_grad():
        # warm up (not timed)
        for _ in range(num_runs):
            model(**inputs)

        # measure; `time` is already imported at the top of the notebook
        start = time.perf_counter()
        for _ in range(num_runs):
            model(**inputs)
        end = time.perf_counter()
        average_inference_time = (end - start) / num_runs * 1000

    return average_inference_time

Generate sample text and tokenize using the transformers tokenizer

In [None]:
# tokenizer for distilbert
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')

# sample data
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

In [None]:
model = load_model_eval_mode()

inputs = tokenizer(question, text, return_tensors="pt")

inference_time_stock = get_average_inference_time(model, inputs)

print(f"time taken for forward pass: {inference_time_stock} ms")

**Quantization**

Quantization allows us to perform operations and store tensors at a lower precision than FP32, like INT8 for example. This compact model and data representation results in a lower memory requirement.
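
As a rough illustration of what storing values at lower precision means, the sketch below maps a few floating-point values to INT8 and back using an affine (scale and zero-point) scheme in plain Python. The function names are hypothetical and the scheme is a generic textbook formulation, not a description of what Intel® Extension for PyTorch (IPEX) does internally.

```python
def choose_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Scale and zero-point mapping [x_min, x_max] onto [qmin, qmax]."""
    # The representable range must include 0.0 so that zero is exact.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the INT8 range


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


values = [-1.5, -0.2, 0.0, 0.7, 2.3]
scale, zp = choose_qparams(min(values), max(values))
q_values = [quantize(v, scale, zp) for v in values]
restored = [dequantize(q, scale, zp) for q in q_values]

# Each value is stored as one 8-bit integer instead of a 32-bit float.
for v, q, r in zip(values, q_values, restored):
    print(f"{v:+.3f} -> {q:4d} -> {r:+.3f}")
```

The restored values differ slightly from the originals; that rounding error is the price paid for the smaller representation.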

Let's import the quantization modules

In [None]:
from intel_extension_for_pytorch.quantization import prepare, convert
import intel_extension_for_pytorch as ipex

**Static Quantization**  
 Static quantization quantizes the weights and activations of the model. It fuses activations into preceding layers where possible. It requires calibration with a representative dataset to determine optimal quantization parameters for activations.
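
The role of calibration can be sketched without any framework: an observer watches the activations produced by representative inputs, and the recorded range is then turned into fixed quantization parameters. The `MinMaxObserver` class below is a hypothetical, simplified stand-in written for this illustration; the observers actually used by a given qconfig may be more elaborate (for example, histogram-based).

```python
class MinMaxObserver:
    """Hypothetical observer: tracks the running min/max of activations."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, activations):
        self.min_val = min(self.min_val, min(activations))
        self.max_val = max(self.max_val, max(activations))

    def qparams(self, qmin=0, qmax=255):
        # Include 0.0 in the range so that zero maps to an exact integer.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = round(qmin - lo / scale)
        return scale, zero_point


# Stand-in for activations produced by a few calibration batches.
calibration_batches = [
    [0.1, 0.8, -0.3],
    [1.9, 0.4, -0.6],
    [0.0, 1.2, -1.1],
]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

# After calibration the parameters are fixed and reused for every input.
scale, zero_point = observer.qparams()
print(f"observed range: [{observer.min_val}, {observer.max_val}]")
print(f"fixed activation scale={scale:.5f}, zero_point={zero_point}")
```

This is why the calibration data should be representative: a range that is too narrow clips real activations, and one that is too wide wastes integer resolution.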

In [None]:
model = load_model_eval_mode()

inputs = tokenizer(question, text, return_tensors="pt")

jit_inputs = (inputs['input_ids'], inputs['attention_mask'])

qconfig = ipex.quantization.default_static_qconfig
prepared_model = prepare(model, qconfig, example_inputs=jit_inputs, inplace=False)

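# calibration: run a few forward passes so the observers can record activation ranges
for _ in range(2):
    calibration_output = prepared_model(**inputs)

model = convert(prepared_model)
with torch.no_grad():
    model = torch.jit.trace(model, jit_inputs, strict=False)
    model = torch.jit.freeze(model)
    # two initial runs of the frozen model, so one-off JIT work happens before timing
    y = model(**inputs)
    y = model(**inputs)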

In [None]:
inference_time_optimized = get_average_inference_time(model, inputs)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

**Dynamic Quantization**  
In dynamic quantization, the weights are quantized ahead of time, but the activations are dynamically quantized during inference.
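
The split between "ahead of time" and "during inference" can be sketched with a single dot product in plain Python. This is a simplified illustration under stated assumptions: symmetric per-tensor INT8 quantization is assumed, and the helper names are hypothetical. It is not a description of the kernels that Intel® Extension for PyTorch (IPEX) runs.

```python
def symmetric_scale(values, qmax=127):
    """Scale such that the largest magnitude maps to qmax."""
    return max(abs(v) for v in values) / qmax or 1.0


def quantize_symmetric(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Ahead of time: weights are quantized once and stored as integers.
weights = [0.42, -1.30, 0.07, 0.95]
w_scale = symmetric_scale(weights)
w_q = quantize_symmetric(weights, w_scale)


def dynamic_quantized_dot(activations):
    # At inference time: the activation scale is derived from this input.
    a_scale = symmetric_scale(activations)
    a_q = quantize_symmetric(activations, a_scale)
    # Integer multiply-accumulate, then one rescale back to floating point.
    acc = sum(a * w for a, w in zip(a_q, w_q))
    return acc * a_scale * w_scale


for activations in ([0.5, 0.1, -0.2, 0.3], [12.0, -7.5, 3.3, 0.9]):
    exact = sum(a * w for a, w in zip(activations, weights))
    approx = dynamic_quantized_dot(activations)
    print(f"float: {exact:+.4f}  dynamic INT8: {approx:+.4f}")
```

Because the activation scale is recomputed for each input, no calibration dataset is needed, at the cost of some extra work on every call.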

In [None]:
model = load_model_eval_mode()

inputs = tokenizer(question, text, return_tensors="pt")

jit_inputs = (inputs['input_ids'], inputs['attention_mask'])

qconfig = ipex.quantization.default_dynamic_qconfig
prepared_model = prepare(model, qconfig, example_inputs=jit_inputs, inplace=False)

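# no calibration pass here: activation ranges are computed at inference time
model = convert(prepared_model)
with torch.no_grad():
    model = torch.jit.trace(model, jit_inputs, strict=False)
    model = torch.jit.freeze(model)
    # two initial runs of the frozen model, so one-off JIT work happens before timing
    y = model(**inputs)
    y = model(**inputs)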

In [None]:
inference_time_optimized = get_average_inference_time(model, inputs)

print(f"time taken for forward pass: {inference_time_optimized} ms")

In [None]:
# plot performance gain bar chart

plot_speedup(inference_time_stock, inference_time_optimized)

**Intel® Extension for PyTorch (IPEX) Launch Script**

Although the default primitives of PyTorch and Intel® Extension for PyTorch (IPEX) are highly optimized, there are still things users can do to improve performance. Setting configuration options properly contributes to a performance boost. However, no single configuration is optimal for all topologies, so users need to try different combinations themselves.

**Single instance for inference**

The launch script is provided as a module of Intel® Extension for PyTorch (IPEX). Below are some of those configurations that can be set using the launch script for a single instance. The launch script can be run as a shell command from a Jupyter notebook or from the shell itself.

To explore the features of the launch script module, we will use a ResNet-50 model, which is a convolutional neural network that is 50 layers deep. The model script is in the scripts folder.

It is recommended that the user check the output of [htop](https://htop.dev/) in an accompanying terminal to check the usage of cores while running the cells below. The output from htop looks as shown below.

![htop](https://intel.github.io/intel-extension-for-pytorch/latest/_images/1ins_phy.gif)

When the command below is run, one main worker thread is launched, which then launches threads on 2 other physical cores.

In [None]:
!source /opt/intel/oneapi/setvars.sh;conda activate pytorch;python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 3 --log_path ./logs ./scripts/resnet50.py

Similarly, increasing the number of cores can improve the inference time, as shown below.

In [None]:
!source /opt/intel/oneapi/setvars.sh;conda activate pytorch;python -m intel_extension_for_pytorch.cpu.launch --ninstances 1 --ncore_per_instance 6 --log_path ./logs ./scripts/resnet50.py

We saw a small example usage of the launch script module. This [documentation](https://intel.github.io/intel-extension-for-pytorch/cpu/1.12.100+cpu/tutorials/performance_tuning/launch_script.html) provides many more examples to use the launch script. As mentioned earlier, each deep learning topology can benefit from custom tuning to achieve the best performance on top of the optimizations we have discussed so far.
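
Since trying different combinations is left to the user, a small sweep can be scripted. The sketch below only builds and prints launch commands for several core counts; it does not run them. It uses only the flags that appear in the cells above (`--ninstances`, `--ncore_per_instance`, `--log_path`). The helper names and the list of candidate core counts are assumptions made for illustration. Note also that `available_cpus` counts logical CPUs, which can exceed the number of physical cores, so it serves only as an upper bound.

```python
import os
import shlex

# Script path as used in the cells above.
SCRIPT = "./scripts/resnet50.py"


def available_cpus():
    """Logical CPUs usable by this process (may exceed physical cores)."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def launch_command(ncores, log_path="./logs"):
    """Build a single-instance launch command for the given core count."""
    args = [
        "python", "-m", "intel_extension_for_pytorch.cpu.launch",
        "--ninstances", "1",
        "--ncore_per_instance", str(ncores),
        "--log_path", log_path,
        SCRIPT,
    ]
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical candidate core counts, capped by what this machine offers.
candidates = [n for n in (1, 2, 3, 6, 12) if n <= available_cpus()]
for n in candidates:
    print(launch_command(n))
```

Each printed command can be run from a notebook cell or a shell, and the resulting inference times compared to pick a setting for the topology at hand.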