# Transformers Inference Optimization for NLP

Here we look at three things that can be done after training to improve inference speed:
* [TorchScript](https://pytorch.org/docs/stable/jit.html)
* [Dynamic Quantization](https://pytorch.org/docs/stable/quantization.html)
* [ONNX](https://pytorch.org/docs/stable/onnx.html) and [ONNX Runtime](https://github.com/microsoft/onnxruntime)

In [2]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
from transformers import AutoTokenizer
from scipy.special import softmax

# TorchScript

TorchScript is a way to create serializable and optimizable models from PyTorch code. Such models can be run independently of Python, for example from a C++ program.
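As a rough illustration of what tracing does, the sketch below traces a toy module, round-trips it through an in-memory buffer, and checks that the outputs match. `TinyClassifier` and its sizes are hypothetical stand-ins for the real model and are not part of this notebook. Tracing records the operations executed for the example input, so any data-dependent Python control flow is frozen to the path taken during the trace.

```python
import io

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical two-input toy model used only for illustration."""

    def __init__(self, vocab_size: int = 100, dim: int = 8, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.embedding(input_ids)                      # (bs, seq_len, dim)
        mask = attention_mask.unsqueeze(-1).float()             # (bs, seq_len, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean, (bs, dim)
        return self.classifier(pooled)                          # (bs, num_classes)


toy_model = TinyClassifier().eval()
example_input = (torch.randint(0, 100, (1, 16)),
                 torch.ones(1, 16, dtype=torch.long))

# Trace with an example input, then serialize to a buffer and reload from it.
traced = torch.jit.trace(toy_model, example_input)
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

with torch.no_grad():
    outputs_match = torch.allclose(toy_model(*example_input),
                                   reloaded(*example_input))
print(outputs_match)
```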

To trace our model, we must define model input first. 

*Note: For GPU inference we must change device to 'cuda'.*

In [3]:
MODEL_NAME = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def sentence_input(sentence: str, max_len: int = 512, device = 'cpu'):
    encoded = tokenizer.encode_plus(sentence, add_special_tokens=True, 
                                    pad_to_max_length=True, max_length=max_len, 
                                    return_tensors="pt",).to(device)
    model_input = (encoded['input_ids'], encoded['attention_mask'])
    return model_input

In [4]:
test_sentence = "Super Cute: First of all, I LOVE this product. When I bought it my husband jokingly said that it looked cute and small in the picture, but was really HUGE in real life. Don't tell him I said so, but he was right. It is huge and the cord is really long. Although I wish it was smaller, I still love it. It works really well when we travel and need to plug a lot of things in and although the length is annoying, it's very useful."
model_input = sentence_input(test_sentence)

### Converting model - CPU and GPU

In [9]:
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel

class DistilBert(nn.Module):
    def __init__(self, pretrained_model_name: str, num_classes: int = None):

        super().__init__()

        config = AutoConfig.from_pretrained(
             pretrained_model_name)

        self.distilbert = AutoModel.from_pretrained(pretrained_model_name,
                                                    config=config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, num_classes)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

    def forward(self, features, attention_mask=None, head_mask=None):

        assert attention_mask is not None, "attention mask is none"
        distilbert_output = self.distilbert(input_ids=features,
                                            attention_mask=attention_mask,
                                            head_mask=head_mask)

        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)

        pooled_output = hidden_state[:, 0]  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output)  # (bs, dim)
        logits = self.classifier(pooled_output)  # (bs, num_classes)

        return logits

In [11]:
from transformers import AutoConfig, AutoTokenizer, AutoModel
model = DistilBert(pretrained_model_name=MODEL_NAME,
                   num_classes=2)

In [12]:
from catalyst.dl.utils import trace

def load_checkpoint(model, path):
    # Load the saved checkpoint and copy its weights into the model.
    checkpoint = trace.load_checkpoint(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model

In [13]:
model = load_checkpoint(model, '../input/sentiment-all-models/last 0.9622.pth')

In [14]:
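model.eval()  # switch off dropout before tracing

# Trace the model with the example input and save the result
traced_cpu = torch.jit.trace(model, model_input)
torch.jit.save(traced_cpu, "cpu.pth")

# Load the traced model back
cpu_model = torch.jit.load("cpu.pth")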

In [15]:
print(cpu_model.graph)

graph(%self.1 : __torch__.DistilBert,
      %input_ids.1 : Tensor,
      %argument_2.1 : Tensor):
  %15 : int = prim::Constant[value=0]() # <ipython-input-9-d0a4bb00ec17>:49:0
  %16 : int = prim::Constant[value=9223372036854775807]() # <ipython-input-9-d0a4bb00ec17>:49:0
  %17 : int = prim::Constant[value=1]() # <ipython-input-9-d0a4bb00ec17>:49:0
  %4 : __torch__.torch.nn.modules.linear.___torch_mangle_82.Linear = prim::GetAttr[name="classifier"](%self.1)
  %6 : __torch__.torch.nn.modules.dropout.___torch_mangle_83.Dropout = prim::GetAttr[name="dropout"](%self.1)
  %8 : __torch__.torch.nn.modules.linear.___torch_mangle_81.Linear = prim::GetAttr[name="pre_classifier"](%self.1)
  %10 : __torch__.transformers.modeling_distilbert.DistilBertModel = prim::GetAttr[name="distilbert"](%self.1)
  %13 : Tensor = prim::CallMethod[name="forward"](%10, %input_ids.1, %argument_2.1) # :0:0
  %18 : Tensor = aten::slice(%13, %15, %15, %16, %17) # <ipython-input-9-d0a4bb00ec17>:49:0
  %input.1 : Tensor 

# Dynamic Quantization

Post-training dynamic quantization is the form of quantization in which the weights are quantized ahead of time, while the activations are quantized dynamically during inference.

Dynamic quantization support in PyTorch converts a float model to a quantized model with static int8 or float16 data types for the weights and dynamic quantization for the activations. The activations are quantized dynamically (per batch) to int8 when the weights are quantized to int8.

The mapping is performed by converting the floating point tensors using:

![](https://pytorch.org/docs/stable/_images/math-quantizer-equation.png)
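The mapping in the PyTorch documentation has the form `Q(x, scale, zero_point) = round(x / scale + zero_point)`. A minimal pure-Python sketch of that mapping follows. The helper names, the int8 range [-128, 127], the min/max choice of `scale` and `zero_point`, and the clamping step are assumptions made for illustration; this is not PyTorch's internal implementation.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick scale and zero_point so the data range maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)   # include 0 so that 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 for all-zero data
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    # Q(x) = round(x / scale + zero_point), clamped to the integer range
    return [max(qmin, min(qmax, round(x / scale + zero_point))) for x in values]


def dequantize(q_values, scale, zero_point):
    # Approximate inverse: x ~ (q - zero_point) * scale
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
scale, zero_point = choose_qparams(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q_weights)
print(f"scale={scale:.6f}, zero_point={zero_point}, max_error={max_error:.6f}")
```

The round-trip error per value stays within about half of `scale`, which is why a narrow value range quantizes more accurately than a wide one.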

In [16]:
quantized_model = torch.quantization.quantize_dynamic(model)

In [17]:
print(quantized_model)

DistilBert(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
            (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
            (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
            (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
          )
          (sa_layer_norm): LayerNorm

The model size decreased from 255 MB to 132 MB. The word embedding table is not quantized and takes about 4 (bytes per FP32 value) * 30522 (vocabulary size) * 768 (embedding size) ≈ 90 MB. Excluding it, the rest of the model shrank from 165 MB to 42 MB (INT8 model).
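The arithmetic above can be checked in a few lines. The 255 MB and 132 MB figures are the sizes reported above; the variable names are illustrative.

```python
VOCAB_SIZE = 30522      # rows in the word embedding table
EMBEDDING_DIM = 768     # columns in the word embedding table
BYTES_PER_FP32 = 4

embedding_bytes = VOCAB_SIZE * EMBEDDING_DIM * BYTES_PER_FP32
embedding_mib = embedding_bytes / 2**20   # roughly 89.4, rounded to 90 in the text

# Sizes reported above, in MB, with the unquantized embedding table subtracted.
fp32_rest = 255 - 90
int8_rest = 132 - 90
compression_ratio = fp32_rest / int8_rest

print(f"embedding table: {embedding_mib:.1f} MiB")
print(f"rest of model: {fp32_rest} MB -> {int8_rest} MB "
      f"({compression_ratio:.2f}x smaller)")
```

The ratio is close to the 4x expected when 32-bit floats are replaced by 8-bit integers.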

# ONNX Runtime

ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.

ONNX Runtime is a performance-focused engine (written in C++) for ONNX models, which runs inference efficiently across multiple platforms and hardware.

In [19]:
torch.onnx.export(model, model_input, "model_512.onnx",
                  export_params=True,
                  input_names=["input_ids", "attention_mask"],
                  output_names=["targets"],
                  dynamic_axes={
                      "input_ids": {0: "batch_size"},
                      "attention_mask": {0: "batch_size"},
                      "targets": {0: "batch_size"}
                  },
                  verbose=True)

graph(%input_ids : Long(1, 512),
      %attention_mask : Long(1, 512),
      %distilbert.embeddings.word_embeddings.weight : Float(30522, 768),
      %distilbert.embeddings.position_embeddings.weight : Float(512, 768),
      %distilbert.embeddings.LayerNorm.weight : Float(768),
      %distilbert.embeddings.LayerNorm.bias : Float(768),
      %distilbert.transformer.layer.0.attention.q_lin.bias : Float(768),
      %distilbert.transformer.layer.0.attention.k_lin.bias : Float(768),
      %distilbert.transformer.layer.0.attention.v_lin.bias : Float(768),
      %distilbert.transformer.layer.0.attention.out_lin.bias : Float(768),
      %distilbert.transformer.layer.0.sa_layer_norm.weight : Float(768),
      %distilbert.transformer.layer.0.sa_layer_norm.bias : Float(768),
      %distilbert.transformer.layer.0.ffn.lin1.bias : Float(3072),
      %distilbert.transformer.layer.0.ffn.lin2.bias : Float(768),
      %distilbert.transformer.layer.0.output_layer_norm.weight : Float(768),
      %distilbe

To check that the model is well formed:

In [20]:
import onnx
onnx_model = onnx.load('model_512.onnx')
onnx.checker.check_model(onnx_model, full_check=True)
onnx.helper.printable_graph(onnx_model.graph)

'graph torch-jit-export (\n  %input_ids[INT64, batch_sizex512]\n  %attention_mask[INT64, batch_sizex512]\n) initializers (\n  %821[FLOAT, 768x768]\n  %822[INT64, 1]\n  %823[INT64, 1]\n  %824[INT64, 1]\n  %825[FLOAT, 768x768]\n  %826[INT64, 1]\n  %827[INT64, 1]\n  %828[INT64, 1]\n  %829[FLOAT, 768x768]\n  %830[INT64, 1]\n  %831[INT64, 1]\n  %832[INT64, 1]\n  %833[INT64, 1]\n  %834[INT64, 1]\n  %835[INT64, 1]\n  %836[INT64, 1]\n  %837[FLOAT, 768x768]\n  %838[FLOAT, 768x3072]\n  %839[FLOAT, 3072x768]\n  %840[FLOAT, 768x768]\n  %841[INT64, 1]\n  %842[INT64, 1]\n  %843[INT64, 1]\n  %844[FLOAT, 768x768]\n  %845[INT64, 1]\n  %846[INT64, 1]\n  %847[INT64, 1]\n  %848[FLOAT, 768x768]\n  %849[INT64, 1]\n  %850[INT64, 1]\n  %851[INT64, 1]\n  %852[INT64, 1]\n  %853[INT64, 1]\n  %854[INT64, 1]\n  %855[INT64, 1]\n  %856[FLOAT, 768x768]\n  %857[FLOAT, 768x3072]\n  %858[FLOAT, 3072x768]\n  %859[FLOAT, 768x768]\n  %860[INT64, 1]\n  %861[INT64, 1]\n  %862[INT64, 1]\n  %863[FLOAT, 768x768]\n  %864[INT64, 

In [21]:
from onnxruntime_tools import optimizer
optimized_model_512 = optimizer.optimize_model("model_512.onnx", model_type='bert',
                                               num_heads=12, hidden_size=768,
                                               use_gpu=False, opt_level=99)

optimized_model_512.save_model_to_file("optimized_512.onnx")

For GPU inference, we can use the following methods:
* change_input_to_int32() - uses int32 inputs, which can give better performance.
* change_input_output_float32_to_float16() - uses half precision in the computation.
* convert_model_float32_to_float16() - reduces the model size (255 MB -> 128 MB).

In order to run the model with ONNX Runtime, we need to create an inference session for the model.

In [22]:
import onnxruntime as ort
print(ort.get_device())
OPTIMIZED_512 = ort.InferenceSession('./optimized_512.onnx')

CPU


In [23]:
def to_numpy(tensor):
    if tensor.requires_grad:
        return tensor.detach().cpu().numpy()
    return tensor.cpu().numpy()

def prediction_onnx(model, sentence: str, max_len: int = 512):
    encoded = tokenizer.encode_plus(sentence, add_special_tokens=True, 
                                    pad_to_max_length=True, max_length=max_len,
                                    return_tensors="pt",)
    # compute ONNX Runtime output prediction
    input_ids = to_numpy(encoded['input_ids'])
    attention_mask = to_numpy(encoded['attention_mask'])
    onnx_input = {"input_ids": input_ids, "attention_mask": attention_mask}
    logits = model.run(None, onnx_input)
    preds = softmax(logits[0][0])
    label = 'Negative' if preds.argmax() == 0 else 'Positive'
    print(f"Class: {label}, Probability: {preds.max():.4f}")

In [24]:
prediction_onnx(OPTIMIZED_512, test_sentence)

Class: Positive, Probability: 0.9975
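Since the goal is inference speed, it helps to time each variant the same way. The helper below is a minimal sketch; the `benchmark` name and its defaults are assumptions and not part of the original notebook. It runs a few untimed warm-up calls, then reports latency statistics in milliseconds.

```python
import statistics
import time


def benchmark(fn, *args, warmup: int = 3, runs: int = 20):
    """Call fn(*args) repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn(*args)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Example with a stand-in workload; replace it with a real inference call.
stats = benchmark(sum, range(100_000), warmup=1, runs=5)
print({name: round(value, 3) for name, value in stats.items()})
```

For the session above, a call such as `benchmark(OPTIMIZED_512.run, None, onnx_input)` would time the ONNX Runtime step alone, given an `onnx_input` dictionary built as in `prediction_onnx`. Timing `prediction_onnx` itself would also include tokenization and printing.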
