Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# MobileNet Quantization with ONNX Runtime on CPU

In this tutorial, we will load a MobileNetV3 model pretrained with [PyTorch](https://pytorch.org/), export the model to ONNX, and then quantize it and run it with ONNX Runtime.

## 0. Prerequisites ##

If you have Jupyter Notebook, you can run this notebook directly with it. You may need to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.6
conda activate cpu_env
conda install jupyter
jupyter notebook
```
The last command will launch Jupyter Notebook, and we can open this notebook in a browser to continue.

### 0.1 Install packages
Let's install the necessary packages to start the tutorial. We will install PyTorch 1.8, OnnxRuntime 1.7, and the latest ONNX.

In [6]:
# Install or upgrade PyTorch 1.8.0 and OnnxRuntime 1.7.0 for CPU-only.
import sys
!{sys.executable} -m pip install --upgrade torch==1.8.0+cpu torchvision==0.9.0+cpu torchaudio===0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
!{sys.executable} -m pip install --upgrade onnxruntime==1.7.0


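Before moving on, it can help to confirm which package versions the notebook kernel actually sees. The following is a minimal sketch: the helper name `installed_version` is hypothetical, and it relies on the common (but not universal) convention that a package exposes a `__version__` attribute.

```python
import importlib


def installed_version(module_name):
    """Return a module's __version__, "unknown" if it has none, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A broken install can raise errors other than ImportError, so catch broadly.
        return None
    return getattr(module, "__version__", "unknown")


# Report the packages this tutorial depends on.
for name in ("torch", "torchvision", "onnxruntime", "onnx"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

If any package is reported as not installed, rerun the install cell in the same kernel before continuing.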

### 0.2 Download pretrained model

In [3]:
from torchvision import models, datasets, transforms as T
mobilenet_v3_small = models.mobilenet_v3_small(pretrained=True)

#transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
#dataset = datasets.ImageNet(".", split="", transform=transform)
#
#means = []
#stds = []
#for img in subset(dataset):
#    means.append(torch.mean(img))
#    stds.append(torch.std(img))
#
#mean = torch.mean(torch.tensor(means))
#std = torch.mean(torch.tensor(stds))


## 2. Quantization and Inference with ORT ##
In this section, we will demonstrate how to export the PyTorch model to ONNX, quantize the exported ONNX model, and run inference on the quantized model with ONNXRuntime.

### 2.1 Export to ONNX model and optimize with ONNXRuntime-tools
This step exports the PyTorch model to ONNX and then optimizes the ONNX model with [ONNXRuntime-tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), an offline optimizer for transformer-based models.

In [7]:
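import logging
import os
import time

import numpy as np
import onnxruntime
import torch

# Module-level logger used by the helper functions below.
logger = logging.getLogger(__name__)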

def export_onnx_model(args, model, tokenizer, onnx_model_path):
    with torch.no_grad():
        inputs = {'input_ids':      torch.ones(1,128, dtype=torch.int64),
                    'attention_mask': torch.ones(1,128, dtype=torch.int64),
                    'token_type_ids': torch.ones(1,128, dtype=torch.int64)}
        outputs = model(**inputs)

        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                            # model being run
                    (inputs['input_ids'],                             # model input (or a tuple for multiple inputs)
                    inputs['attention_mask'], 
                    inputs['token_type_ids']),                                         # model input (or a tuple for multiple inputs)
                    onnx_model_path,                                # where to save the model (can be a file or file-like object)
                    opset_version=11,                                 # the ONNX version to export the model to
                    do_constant_folding=True,                         # whether to execute constant folding for optimization
                    input_names=['input_ids',                         # the model's input names
                                'input_mask', 
                                'segment_ids'],
                    output_names=['output'],                    # the model's output names
                    dynamic_axes={'input_ids': symbolic_names,        # variable length axes
                                'input_mask' : symbolic_names,
                                'segment_ids' : symbolic_names})
        logger.info("ONNX Model exported to {0}".format(onnx_model_path))

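# NOTE: `configs`, `model`, and `tokenizer` are not defined in this notebook; they are
# assumed to come from a BERT fine-tuning setup that is not shown here.
export_onnx_model(configs, model, tokenizer, "bert.onnx")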

# optimize transformer-based models with onnxruntime-tools
from onnxruntime_tools import optimizer
from onnxruntime_tools.transformers.onnx_model_bert import BertOptimizationOptions

# disable embedding layer norm optimization for better model size reduction
opt_options = BertOptimizationOptions('bert')
opt_options.enable_embed_layer_norm = False

opt_model = optimizer.optimize_model(
    'bert.onnx',
    'bert', 
    num_heads=12,
    hidden_size=768,
    optimization_options=opt_options)
opt_model.save_model_to_file('bert.opt.onnx')
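
The cell above exports a BERT model. For the MobileNetV3 model loaded in section 0.2, the export could look like the following sketch. The function names, the tensor names `input` and `output`, the 1x3x224x224 dummy input, and the output file name are all assumptions made for illustration, and the transformer-specific optimization step is skipped. Whether every MobileNetV3 operator exports cleanly at opset 11 depends on the installed PyTorch version.

```python
def build_dynamic_axes(input_names, output_names):
    """Mark dimension 0 of every named tensor as a variable batch size."""
    return {name: {0: "batch_size"} for name in list(input_names) + list(output_names)}


def export_mobilenet_to_onnx(model, onnx_model_path, opset_version=11):
    """Export an image classification model to ONNX (hypothetical helper)."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch

    model.eval()  # export in inference mode (affects BatchNorm and Dropout)
    dummy_input = torch.randn(1, 3, 224, 224)  # assumed input: one 224x224 RGB image
    input_names, output_names = ["input"], ["output"]
    with torch.no_grad():
        torch.onnx.export(
            model,
            dummy_input,
            onnx_model_path,
            opset_version=opset_version,
            do_constant_folding=True,
            input_names=input_names,
            output_names=output_names,
            dynamic_axes=build_dynamic_axes(input_names, output_names),
        )


# Example call (requires torch and torchvision):
# export_mobilenet_to_onnx(mobilenet_v3_small, "mobilenet_v3_small.onnx")
```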



### 2.2 Quantize ONNX model
We will call `quantize_dynamic` from [onnxruntime.quantization](https://github.com/microsoft/onnxruntime/blob/fe0b2b2abd494b7ff14c00c0f2c51e0ccf2a3094/onnxruntime/python/tools/quantization/README.md) to quantize the HuggingFace BERT model. The tool supports dynamic quantization with IntegerOps and static quantization with QLinearOps. For activations, ONNXRuntime currently supports only the uint8 format; for weights, it supports both int8 and uint8.
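
To make the uint8 and int8 formats concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) quantization. It is a simplified illustration of the general technique, not ONNXRuntime's actual implementation; the function names and the example range of -1.0 to 3.0 are assumptions.

```python
def quantization_params(rmin, rmax, qmin, qmax):
    """Compute (scale, zero_point) mapping the real range [rmin, rmax] onto integers [qmin, qmax]."""
    # The real range must contain 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a real value to the nearest integer level, clamped to [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximation of the original real value."""
    return (q - zero_point) * scale


# uint8 uses the integer range [0, 255]; int8 would use [-128, 127].
scale, zero_point = quantization_params(-1.0, 3.0, qmin=0, qmax=255)
for x in (-1.0, -0.3, 0.0, 0.7, 3.0):
    q = quantize(x, scale, zero_point, 0, 255)
    print(f"{x:+.2f} -> {q:3d} -> {dequantize(q, scale, zero_point):+.4f}")
```

Each in-range value is recovered to within half a quantization step (`scale / 2`), which is the rounding error the quantized model trades for its smaller size.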

We apply dynamic quantization to the BERT model and use int8 for the weights.

In [8]:
def quantize_onnx_model(onnx_model_path, quantized_model_path):
    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Dynamic quantization: weights are converted to int8 ahead of time, while
    # activation quantization parameters are computed at inference time.
    quantize_dynamic(onnx_model_path,
                     quantized_model_path,
                     weight_type=QuantType.QInt8)

    logger.info(f"quantized model saved to: {quantized_model_path}")

quantize_onnx_model('bert.opt.onnx', 'bert.opt.quant.onnx')

print('ONNX full precision model size (MB):', os.path.getsize("bert.opt.onnx")/(1024*1024))
print('ONNX quantized model size (MB):', os.path.getsize("bert.opt.quant.onnx")/(1024*1024))

ONNX full precision model size (MB): 417.6690320968628
ONNX quantized model size (MB): 106.49767780303955
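
The quantize-then-compare steps above can be folded into one reusable helper. The sketch below is model-agnostic; the helper names are hypothetical, and the `.quant` naming simply follows the `bert.opt.onnx` to `bert.opt.quant.onnx` convention used in this notebook.

```python
import os


def quantized_path(onnx_model_path):
    """Insert '.quant' before the file extension of the model path."""
    root, ext = os.path.splitext(onnx_model_path)
    return f"{root}.quant{ext}"


def size_in_mib(path):
    """File size in units of 1024 * 1024 bytes, as printed by the cell above."""
    return os.path.getsize(path) / (1024 * 1024)


def quantize_and_report(onnx_model_path):
    """Dynamically quantize a model with int8 weights and print both file sizes."""
    # Imported lazily so the path helpers work without onnxruntime installed.
    from onnxruntime.quantization import quantize_dynamic, QuantType

    output_path = quantized_path(onnx_model_path)
    quantize_dynamic(onnx_model_path, output_path, weight_type=QuantType.QInt8)
    print(f"full precision: {size_in_mib(onnx_model_path):.1f}")
    print(f"quantized:      {size_in_mib(output_path):.1f}")
    return output_path


# Example call (requires onnxruntime and an existing model file):
# quantize_and_report("bert.opt.onnx")
```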


### 2.3 Evaluate ONNX quantization performance and accuracy

In this step, we will evaluate OnnxRuntime quantization with the GLUE data set.

In [11]:
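from torch.utils.data import DataLoader, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm

# NOTE: `load_and_cache_examples` and `compute_metrics` are not defined in this notebook;
# they are assumed to come from a GLUE evaluation setup that is not shown here.
def evaluate_onnx(args, model_path, tokenizer, prefix=""):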

    sess_options = onnxruntime.SessionOptions()
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.intra_op_num_threads=1
    session = onnxruntime.InferenceSession(model_path, sess_options)

    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # No torch.nn.DataParallel wrapping is needed: inference runs in the ONNX Runtime session.

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        #eval_loss = 0.0
        #nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = tuple(t.detach().cpu().numpy() for t in batch)
            ort_inputs = {
                                'input_ids':  batch[0],
                                'input_mask': batch[1],
                                'segment_ids': batch[2]
                            }
            logits = np.reshape(session.run(None, ort_inputs), (-1,2))
            if preds is None:
                preds = logits
                #print(preds.shape)
                out_label_ids = batch[3]
            else:
                preds = np.append(preds, logits, axis=0)
                out_label_ids = np.append(out_label_ids, batch[3], axis=0)

        #print(preds.shape)
        #eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        #print(preds)
        #print(out_label_ids)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix + "_eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def time_ort_model_evaluation(model_path, configs, tokenizer, prefix=""):
    eval_start_time = time.time()
    result = evaluate_onnx(configs, model_path, tokenizer, prefix=prefix)
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

print('Evaluating ONNXRuntime full precision accuracy and performance:')
time_ort_model_evaluation('bert.opt.onnx', configs, tokenizer, "onnx.opt")
    
print('Evaluating ONNXRuntime quantization accuracy and performance:')
time_ort_model_evaluation('bert.opt.quant.onnx', configs, tokenizer, "onnx.opt.quant")

Evaluating ONNXRuntime full precision accuracy and performance:
Evaluating: 100%|██████████| 408/408 [00:21<00:00, 18.74it/s]
{'acc': 0.8602941176470589, 'f1': 0.9018932874354562, 'acc_and_f1': 0.8810937025412575}
Evaluate total time (seconds): 22.5

Evaluating ONNXRuntime quantization accuracy and performance:
Evaluating: 100%|██████████| 408/408 [00:19<00:00, 21.07it/s]
{'acc': 0.8578431372549019, 'f1': 0.902027027027027, 'acc_and_f1': 0.8799350821409644}
Evaluate total time (seconds): 19.5
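
`time_ort_model_evaluation` measures a single wall-clock pass, which also includes data loading and metric computation. To time only the model call, one option is to average several runs after a warm-up. The helper below is a hypothetical sketch using only the standard library; its name and default arguments are assumptions.

```python
import time


def mean_latency(fn, *args, warmup=1, repeats=5):
    """Return the mean wall-clock time in seconds of fn(*args) over `repeats` timed calls."""
    for _ in range(warmup):
        fn(*args)  # untimed calls so one-time setup cost does not skew the result
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

For example, `mean_latency(session.run, None, ort_inputs)` would time one inference call, using a `session` and `ort_inputs` built the same way as inside `evaluate_onnx`.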





## 3. Summary ##
In this tutorial, we demonstrated how to quantize a fine-tuned BERT model for the MRPC task on the GLUE data set. Let's summarize the main quantization metrics.

### Model Size
PyTorch quantizes only torch.nn.Linear modules and reduces the model from 438 MB to 181 MB (the table below lists the same sizes as 417.7 and 173.1 because it divides bytes by 1024*1024). OnnxRuntime quantizes not only Linear (MatMul) but also the embedding layer, so it achieves almost the ideal model size reduction from quantization.

| Engine | Full Precision(MB) | Quantized(MB) |
| --- | --- | --- |
| PyTorch 1.6 | 417.7 | 173.1 |
| ORT 1.5 | 417.7 | 106.5 |
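
The size reduction can be recomputed from the table. Converting 32-bit floating-point weights to 8-bit integers gives an ideal ratio of 4x; the short sketch below (variable and function names are illustrative) shows how close each engine gets.

```python
# Sizes copied from the table above (bytes divided by 1024 * 1024).
full_precision = 417.7
quantized = {"PyTorch 1.6": 173.1, "ORT 1.5": 106.5}


def compression_ratio(original_size, quantized_size):
    """How many times smaller the quantized model is than the original."""
    return original_size / quantized_size


for engine, size in quantized.items():
    print(f"{engine}: {compression_ratio(full_precision, size):.2f}x smaller")
```

This gives about 2.41x for PyTorch and about 3.92x for ORT.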

### Accuracy
OnnxRuntime matches the accuracy of PyTorch quantization and achieves a slightly better F1 score, even though its quantized model is smaller.

| Metrics | Full Precision | PyTorch 1.6 Quantization | ORT 1.5 Quantization |
| --- | --- | --- | --- |
| Accuracy | 0.86029 | 0.85784 | 0.85784 |
| F1 score | 0.90189 | 0.89931 | 0.90203 |
| Acc and F1 | 0.88109 | 0.87857 | 0.87994 |

### Performance

The evaluation data set has 408 samples. The table below shows the performance on my machine with an Intel(R) Xeon(R) E5-1650 v4 @ 3.60GHz CPU. Compared with PyTorch full precision, PyTorch quantization achieves a ~1.50x speedup and ORT quantization achieves a ~1.73x speedup. ORT quantization is ~1.15x faster than PyTorch quantization and ~1.33x faster than ORT full precision.
You can run the [benchmark.py](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/benchmark.py) for comparison on more models.

| Engine | Full Precision Latency (s) | Quantized Latency (s) |
| --- | --- | --- |
| PyTorch 1.6 | 33.8 | 22.5 |
| ORT 1.5 | 26.0 | 19.5 |
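
The speedups follow directly from the latencies in the table. A small sketch for recomputing them (the dictionary layout and the `speedup` helper are illustrative):

```python
# Total evaluation time in seconds, copied from the table above.
latency_s = {
    ("PyTorch 1.6", "full"): 33.8,
    ("PyTorch 1.6", "quantized"): 22.5,
    ("ORT 1.5", "full"): 26.0,
    ("ORT 1.5", "quantized"): 19.5,
}


def speedup(baseline, candidate):
    """Ratio of baseline latency to candidate latency; values above 1 mean the candidate is faster."""
    return latency_s[baseline] / latency_s[candidate]


comparisons = [
    (("PyTorch 1.6", "full"), ("PyTorch 1.6", "quantized")),
    (("PyTorch 1.6", "full"), ("ORT 1.5", "quantized")),
    (("PyTorch 1.6", "quantized"), ("ORT 1.5", "quantized")),
    (("ORT 1.5", "full"), ("ORT 1.5", "quantized")),
]
for baseline, candidate in comparisons:
    print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.2f}x")
```

With the numbers in the table, these ratios come out to roughly 1.50x, 1.73x, 1.15x, and 1.33x respectively.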