### Overview 

Inference speed of large models is a recurring problem in deep learning, and there are many ways to improve it. This notebook discusses a few of the many possibilities.

Based on our survey, three methods stand out for increasing inference speed:

1. **Quantization**
2. **ONNX Runtime**
3. **Pruning**

#### 1.Quantization

1. Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision.

2. A quantized model executes some or all of the operations on tensors with integers rather than floating point values.

3. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. 

4. `PyTorch supports INT8 quantization. Compared to typical FP32 models, this allows for a 4x reduction in model size and a 4x reduction in memory bandwidth requirements.`
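The arithmetic behind the points above can be sketched in a few lines. This is a minimal illustration of affine (scale and zero-point) INT8 quantization using only the standard library, not PyTorch's actual implementation; the helper names `quantize_int8` and `dequantize_int8` and the sample weights are made up for the example.

```python
from array import array


def quantize_int8(values):
    """Map floats to int8 with an affine (scale, zero-point) scheme."""
    lo, hi = min(values), max(values)
    # Include 0.0 in the range so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    # Fall back to a scale of 1.0 when every value is zero.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return array("b", q), scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return [scale * (x - zero_point) for x in q]


weights = array("f", [-1.5, -0.2, 0.0, 0.4, 2.3])  # stored as 32-bit floats
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes, max error: {max_error:.4f}")
```

Each value shrinks from four bytes to one, which is where the 4x figure comes from, at the cost of a rounding error of at most half the scale per value.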

#### 2.ONNX Runtime

ONNX Runtime, built around the `Open Neural Network Exchange (ONNX)` format, provides an easy way to run machine-learned models with high performance on CPU or GPU without dependencies on the training framework. Machine learning frameworks are usually optimized for batch training rather than for prediction, which is the more common scenario in applications, sites, and services. At a high level, you can:

1. Train a model using your favorite framework.

2. Convert or export the model into ONNX format.

3. Load and run the model using ONNX Runtime.

#### 3.Pruning

1. If we look inside a trained network, a few neurons never activate at all, which means they have no impact on the output. A few others have a very low activation strength and hence only a very small impact on the final output.

2. We can choose a weight threshold range, and if the weights associated with a neuron fall within that range, we simply prune that neuron.

3. To prune a neuron, we set its incoming and outgoing weights to zero, so that the neuron no longer has any impact on the final output. `Pruning makes the model sparse, and a sparse model is easier to compress than a dense one.`

4. `After pruning, it’s advisable to fine-tune the neural network a bit before using it in production, particularly when the threshold is set higher in order to remove more weights.`

5. Pruning slightly decreases model accuracy but reduces inference latency. `According to the Deep Compression research, pruning can decrease the number of connections by 9 to 13 times.`
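The neuron-pruning steps above can be sketched on a toy two-layer network stored as plain Python lists. This is only an illustration under stated assumptions: the function names, the rule "prune when the largest incoming weight magnitude is below the threshold", and the weight values are all hypothetical choices, not the method of any particular library.

```python
def prune_neurons(w_in, w_out, threshold):
    """Zero out hidden neurons whose largest incoming |weight| is below threshold.

    w_in[j]  : incoming weights of hidden neuron j (one row per neuron)
    w_out[k] : weights from every hidden neuron to output k
    Returns pruned copies of both matrices and the indices of removed neurons.
    """
    pruned = [j for j, row in enumerate(w_in)
              if max(abs(w) for w in row) < threshold]
    # Incoming weights: zero the whole row of each pruned neuron.
    new_in = [[0.0] * len(row) if j in pruned else list(row)
              for j, row in enumerate(w_in)]
    # Outgoing weights: zero the column of each pruned neuron.
    new_out = [[0.0 if j in pruned else w for j, w in enumerate(row)]
               for row in w_out]
    return new_in, new_out, pruned


def sparsity(*matrices):
    """Fraction of weights that are exactly zero."""
    values = [w for m in matrices for row in m for w in row]
    return sum(1 for w in values if w == 0.0) / len(values)


w_in = [[0.8, -0.5], [0.01, -0.02], [0.3, 0.9]]  # 3 hidden neurons, 2 inputs
w_out = [[0.7, 0.4, -0.6]]                       # 1 output, 3 hidden neurons
new_in, new_out, pruned = prune_neurons(w_in, w_out, threshold=0.05)
print("pruned neurons:", pruned)
print("sparsity:", sparsity(new_in, new_out))
```

For real PyTorch models, the `torch.nn.utils.prune` module imported later in this notebook offers ready-made utilities such as `prune.l1_unstructured`, which prunes individual weights by magnitude rather than whole neurons.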

#### Quick Demo

We will demonstrate the inference speed boost on a pretrained Hugging Face model, using quantization and ONNX Runtime:

1. ONNX Runtime
2. Quantization
3. Quantization + ONNX Runtime

##### ONNX Runtime
1. Conversion of models to ONNX format
2. Inference on ONNX Runtime

##### Loading necessary packages

In [None]:
from pathlib import Path
from transformers.convert_graph_to_onnx import convert
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, AutoModel
import onnxruntime as ort
import torch.nn as nn
import torch.nn.utils.prune as prune

#### Creating ONNX versions of both Longformer and PubMedBERT

In [None]:
# Load the tokenizers for the Longformer model and the biomedical PubMedBERT model.
tokenizer = AutoTokenizer.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
tokenizer_pubmed = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
# Convert both models to ONNX format to speed up inference:
# 1. Longformer  2. PubMedBERT
# Note: convert() is expected to abort if the output folder already contains
# files, so each export may need its own empty folder.
convert(framework="pt", tokenizer=tokenizer, model="mrm8488/longformer-base-4096-finetuned-squadv2",
        output=Path("path/longformer.onnx"), opset=12)
convert(framework="pt", tokenizer=tokenizer_pubmed, model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
        output=Path("path/pubmedbert.onnx"), opset=12)

##### Quantizing the ONNX models

In [None]:
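from transformers.convert_graph_to_onnx import quantize
# quantize() writes a new file next to the original with a "-quantized" suffix,
# e.g. path/pubmedbert-quantized.onnx, and returns its path.
quantize(Path("path/longformer.onnx"))
quantize(Path("path/pubmedbert.onnx"))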

#### Inference Session

In [None]:
# Use the same paths that convert() and quantize() wrote to above.
session = ort.InferenceSession("path/pubmedbert.onnx")
session_quantized = ort.InferenceSession("path/pubmedbert-quantized.onnx")

In [None]:
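# corpus_creation is a helper defined elsewhere; it is assumed to return a list of strings.
corpus = corpus_creation("data_path")
encoded_input = tokenizer_pubmed(corpus, padding=True, truncation=True, max_length=500, return_tensors='pt')
# ONNX Runtime expects NumPy arrays, so convert each PyTorch tensor.
encoded_inputs_onnx = {k: v.cpu().detach().numpy() for k, v in encoded_input.items()}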

#### Action Time

In [None]:
from datetime import datetime

# Time a single forward pass of the plain ONNX model.
t3 = datetime.now()
sequence, pooled = session.run(None, encoded_inputs_onnx)
t4 = datetime.now()
print("model with ONNX Runtime : ", (t4 - t3).total_seconds())

# Time a single forward pass of the quantized ONNX model.
t5 = datetime.now()
sequence, pooled = session_quantized.run(None, encoded_inputs_onnx)
t6 = datetime.now()
print("model with ONNX Runtime and quantization enabled : ", (t6 - t5).total_seconds())
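A single timed run can be noisy, since the first call often includes one-off setup cost. A slightly more robust measurement might look like the sketch below, which adds warm-up runs and averages over several repeats. The `benchmark` helper and its default parameters are hypothetical, and a stand-in workload is used so that the block runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=10):
    """Return (mean, stdev) of the wall-clock seconds taken by fn() over `repeats` runs."""
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; in this notebook it would be something like
# lambda: session.run(None, encoded_inputs_onnx)
mean_s, std_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"mean {mean_s:.6f}s, stdev {std_s:.6f}s")
```

Calling `benchmark` once per session (plain and quantized) with the same inputs gives two means that can be compared directly.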

#### How about the space occupied by the models?

In [None]:
import os


def convert_bytes(num):
    """
    Convert a size in bytes to a human-readable string (KB, MB, GB, ...).
    """
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if num < 1024.0:
            return "%3.1f %s" % (num, x)
        num /= 1024.0


def file_size(file_path):
    """
    Return the human-readable size of a file, or None if the file does not exist.
    """
    if os.path.isfile(file_path):
        file_info = os.stat(file_path)
        return convert_bytes(file_info.st_size)


# Use the same paths that convert() and quantize() wrote to above.
print("ONNX model size : ", file_size("path/pubmedbert.onnx"))
print("ONNX quantized model size : ", file_size("path/pubmedbert-quantized.onnx"))

#### Conclusion 

The best bet in terms of both space and time is the quantized ONNX model!

#### Further research
1. Knowledge Distillation
2. Pruning Attention Heads