Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# BERT Quantization with ONNX Runtime on CPU

In this tutorial, we load a fine-tuned [HuggingFace BERT](https://huggingface.co/transformers/) model trained with [PyTorch](https://pytorch.org/) for the [Microsoft Research Paraphrase Corpus (MRPC)](https://www.microsoft.com/en-us/download/details.aspx?id=52398) task, convert the model to ONNX, and then quantize both the PyTorch model and the ONNX model. Finally, we compare the performance, accuracy, and model size of the quantized PyTorch and ONNX Runtime models on the [General Language Understanding Evaluation benchmark (GLUE)](https://gluebenchmark.com/).

## 0. Prerequisites ##

If you have Jupyter Notebook, you can run this notebook directly with it. You may need to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/), [transformers](https://huggingface.co/transformers/), and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.6
conda activate cpu_env
conda install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook, and we can open this notebook in the browser to continue.

### 0.1 Install packages
Let's first install the packages needed for the tutorial. We will install PyTorch 1.5, OnnxRuntime 1.4.0, the latest ONNX, OnnxRuntime-tools, transformers, and sklearn.

In [1]:
# Install or upgrade PyTorch 1.5.0 and OnnxRuntime 1.4.0 for CPU-only.
import sys
!{sys.executable} -m pip install --upgrade torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
!{sys.executable} -m pip uninstall -y onnxruntime
!{sys.executable} -m pip install -i https://test.pypi.org/simple/ ort-nightly
!{sys.executable} -m pip install --upgrade onnxruntime-tools

# Install other packages used in this notebook.
!{sys.executable} -m pip install --upgrade transformers
# The PyPI package that provides the `sklearn` module is named scikit-learn.
!{sys.executable} -m pip install onnx scikit-learn

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Requirement already up-to-date: torch==1.5.0+cpu in /home/yufeng/anaconda3/envs/pytorch/lib/python3.6/site-packages (1.5.0+cpu)
Requirement already up-to-date: torchvision==0.6.0+cpu in /home/yufeng/anaconda3/envs/pytorch/lib/python3.6/site-packages (0.6.0+cpu)
Uninstalling onnxruntime-1.3.0:
  Successfully uninstalled onnxruntime-1.3.0
Looking in indexes: https://test.pypi.org/simple/
Collecting ort-nightly
  Downloading https://test-files.pythonhosted.org/packages/12/77/87db37443dfe57245f3a2535b3217143df90436608bb8e1ce318f87177a2/ort_nightly-1.4.0.dev202007152-cp36-cp36m-manylinux2010_x86_64.whl (4.5MB)
     |████████████████████████████████| 4.5MB 4.9MB/s eta 0:00:01
Installing collected packages: ort-nightly
Successfully installed ort-nightly-1.4.0.dev202007152
Requirement already up-to-date: onnxruntime-tools in /home/yufeng/anaconda3/envs/pytorch/lib/python3.6/site-packages (1.3.0.1009)
Requirement alre

### 0.2 Download data and fine-tune the BERT model for the MRPC task
HuggingFace's [text-classification examples](https://github.com/huggingface/transformers/tree/master/examples/text-classification) show in detail how to fine-tune a model for the MRPC task with GLUE data.

#### First, let's download the GLUE data with the [script](https://github.com/huggingface/transformers/blob/master/utils/download_glue_data.py) and unpack it into the directory glue_data under the current directory.

In [10]:
# !python download_glue_data.py --data_dir='glue_data' --tasks='MRPC' --test_labels=True
#!pwd&&ls
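# Fetch the raw script; the GitHub "blob" URL would return an HTML page instead of Python code.
!wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py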
!python download_glue_data.py --data_dir='glue_data' --tasks='MRPC'
!ls glue_data/MRPC
!curl https://download.pytorch.org/tutorial/MRPC.zip --output MPRC.zip
!unzip -n MPRC.zip

/home/yufeng/project/onnxruntime/onnxruntime/python/tools/quantization/notebooks
bert.onnx		 download_glue_data.py.5
download_glue_data.py	 glue_data
download_glue_data.py.1  MPRC.zip
download_glue_data.py.2  MRPC
download_glue_data.py.3  PyTorch_Bert-Squad_OnnxRuntime_CPU.ipynb
download_glue_data.py.4
--2020-07-16 14:40:26--  https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.40.133
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8225 (8.0K) [text/plain]
Saving to: ‘download_glue_data.py.6’


2020-07-16 14:40:26 (25.6 MB/s) - ‘download_glue_data.py.6’ saved [8225/8225]

Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase

#### Next, we can fine-tune the model based on the [MRPC example](https://github.com/huggingface/transformers/tree/master/examples/text-classification#mrpc) with a command like:

```console
export GLUE_DIR=./glue_data
export TASK_NAME=MRPC
export OUT_DIR=./$TASK_NAME/
python ./run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --save_steps 100000 \
    --output_dir $OUT_DIR
```

To save time, we download a BERT model that has already been fine-tuned with PyTorch for the MRPC task from https://download.pytorch.org/tutorial/MRPC.zip.

In [None]:
!curl https://download.pytorch.org/tutorial/MRPC.zip --output MPRC.zip
!unzip -n MPRC.zip

## 1. Load and quantize the model with PyTorch

In this section, we load the fine-tuned model with PyTorch, quantize it, and measure the performance. We reuse the code from PyTorch's [BERT quantization tutorial](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html) for this section.

### 1.1 Import modules and set global configurations

In this step we import the necessary Python and transformers modules for the tutorial, and then set up the global configurations, such as the data and model folders, GLUE task settings, thread settings, and warning settings.
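
The thread settings matter because all timings in this tutorial are meant to be single-threaded. As an alternative to calling into libgomp through `ctypes`, which only works where that library can be found, thread counts are often requested through environment variables set before the numerical libraries are imported. The sketch below shows that idea; the helper name `limit_threads` is hypothetical and not part of the original notebook, and whether a given backend honors these variables depends on how it was built.

```python
import os


def limit_threads(num_threads=1):
    """Request a fixed thread count through environment variables.

    Assumption: this runs before torch/numpy/onnxruntime are imported,
    because OpenMP and MKL read these variables when they initialize.
    """
    names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")
    for name in names:
        os.environ[name] = str(num_threads)
    return {name: os.environ[name] for name in names}


settings = limit_threads(1)
print(settings)
```

Calling `torch.set_num_threads(1)` afterwards, as the cell below does, is still worthwhile, since it sets PyTorch's own intra-op thread pool explicitly.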

In [11]:
from __future__ import absolute_import, division, print_function


import logging
import numpy as np
import os
import random
import sys
import time
import torch

from argparse import Namespace
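from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
# Used by the evaluation functions when configs.local_rank != -1.
from torch.utils.data.distributed import DistributedSampler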
from tqdm import tqdm
from transformers import (BertConfig, BertForSequenceClassification, BertTokenizer,)
from transformers import glue_compute_metrics as compute_metrics
from transformers import glue_output_modes as output_modes
from transformers import glue_processors as processors
from transformers import glue_convert_examples_to_features as convert_examples_to_features

# Setup warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.quantization'
)

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
# logger.warning("started")
# logging.getLogger("transformers.modeling_utils").setLevel(
#    logging.WARN)  # Reduce logging

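# Limit OpenMP and PyTorch to one thread for a single-threaded comparison;
# the quantized models are also run single-threaded.
import ctypes
import ctypes.util  # find_library lives in the ctypes.util submodule
gomp_file = ctypes.util.find_library('gomp')
if gomp_file is not None:  # libgomp may be absent on some platforms
    gomp = ctypes.cdll.LoadLibrary(gomp_file)
    gomp.omp_set_num_threads(1)
torch.set_num_threads(1)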

print(torch.__version__)


configs = Namespace()

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "bert-base-uncased"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "bert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.per_gpu_eval_batch_size = 8
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False


# Set random seed for reproducibility.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
set_seed(42)

1.5.0+cpu


### 1.2 Load and quantize the fine-tuned BERT model
In this step, we load the fine-tuned BERT model and quantize it with PyTorch's dynamic quantization. We also compare the model size of the full-precision and the quantized model.
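
To make the INT8 conversion concrete, here is a minimal pure-Python sketch of per-tensor affine quantization, the scheme the quantized `Linear` layers report as `qscheme=torch.per_tensor_affine`. It only illustrates the arithmetic (a scale and a zero point map floats to 8-bit integers and back); the function names are hypothetical, and PyTorch's own observers may choose ranges and rounding differently.

```python
def quantize_affine(values, num_bits=8):
    """Quantize a list of floats to signed integers with one scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Extend the range to include 0.0 so that zero is represented exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    quantized = [
        max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.5, 0.0, 0.3, 1.0]
q, scale, zp = quantize_affine(weights)
restored = dequantize_affine(q, scale, zp)
print(q, scale, zp)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the source of the small accuracy change measured later in this section.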

In [12]:
# load model
model = BertForSequenceClassification.from_pretrained(configs.output_dir)
model.to(configs.device)

# quantize model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)

07/16/2020 14:41:15 - INFO - transformers.tokenization_utils_base -   Model name './MRPC/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming './MRPC/' is a path, a model identifier, or url to a directory containing tokenizer files.
07/16/2020 14:41:15 - INFO - transformers.tokenization_utils_base -   Didn't find file ./MRPC/tokenizer.json. We won't load it.
07/16/2020 14:41:15 - INFO - transformers.tokenization_utils_base -   loadin

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### 1.3 Evaluate the accuracy and performance of PyTorch quantization
This section reuses the tokenization and evaluation functions from [HuggingFace](https://github.com/huggingface/transformers/blob/45e26125de1b9fbae46837856b1f518a4b56eb65/examples/movement-pruning/masked_run_glue.py).
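
The evaluation reports three numbers for MRPC: `acc`, `f1`, and `acc_and_f1`. The sketch below recomputes them in plain Python on a toy example so the definitions are explicit. It is an illustration, not the library code: the function names are hypothetical, and the notebook itself relies on `glue_compute_metrics` from transformers.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def f1_binary(preds, labels, positive=1):
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mrpc_metrics(preds, labels):
    """Return the three values reported for MRPC; acc_and_f1 is their mean."""
    acc = accuracy(preds, labels)
    f1 = f1_binary(preds, labels)
    return {"acc": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2}


labels = [1, 1, 1, 0, 0, 1]  # toy data, not MRPC
preds = [1, 1, 0, 0, 1, 1]
metrics = mrpc_metrics(preds, labels)
print(metrics)
```

The same relationship holds for the values logged in this notebook: for the FP32 model, the mean of `acc = 0.8602941176470589` and `f1 = 0.9018932874354562` gives the reported `acc_and_f1` of about 0.8811.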

In [13]:
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def evaluate(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
        if args.n_gpu > 1:
            model = torch.nn.DataParallel(model)

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)

            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1],
                          'labels':         batch[3]}
                if args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['labels'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    processor = processors[task]()
    output_mode = output_modes[task]
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_{}'.format(
        'dev' if evaluate else 'train',
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        label_list = processor.get_labels()
        if task in ['mnli', 'mnli-mm'] and args.model_type in ['roberta']:
            # HACK(label indices are swapped in RoBERTa pretrained model)
            label_list[1], label_list[2] = label_list[2], label_list[1]
        examples = processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
        features = convert_examples_to_features(examples,
                                                tokenizer,
                                                label_list=label_list,
                                                max_length=args.max_seq_length,
                                                output_mode=output_mode,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    if output_mode == "classification":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
    elif output_mode == "regression":
        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
    return dataset

def time_model_evaluation(model, configs, tokenizer):
    eval_start_time = time.time()
    result = evaluate(configs, model, tokenizer, prefix="")
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

# define the tokenizer
tokenizer = BertTokenizer.from_pretrained(
    configs.output_dir, do_lower_case=configs.do_lower_case)
    
# Evaluate the original FP32 BERT model
time_model_evaluation(model, configs, tokenizer)

# Evaluate the INT8 BERT model after the dynamic quantization
time_model_evaluation(quantized_model, configs, tokenizer)

# Serialize the quantized model (save it even if the directory already exists)
quantized_output_dir = configs.output_dir + "quantized/"
os.makedirs(quantized_output_dir, exist_ok=True)
quantized_model.save_pretrained(quantized_output_dir)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
              (key): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
              (value): DynamicQuantizedLinear(in_features=768, out_features=768, qscheme=torch.per_tensor_affine)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Dy

Size (MB): 437.982584
Size (MB): 181.443657


07/16/2020 14:12:25 - INFO - __main__ -   Creating features from dataset file at ./glue_data/MRPC
07/16/2020 14:12:25 - INFO - transformers.data.processors.glue -   *** Example ***
07/16/2020 14:12:25 - INFO - transformers.data.processors.glue -   guid: dev-1
07/16/2020 14:12:25 - INFO - transformers.data.processors.glue -   features: InputFeatures(input_ids=[101, 2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012, 102, 1000, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2515, 2025, 4906, 2256, 2146, 1011, 2744, 3930, 5656, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

07/16/2020 14:12:25 - INFO - __main__ -   ***** Running evaluation  *****
07/16/2020 14:12:25 - INFO - __main__ -     Num examples = 408
07/16/2020 14:12:25 - INFO - __main__ -     Batch size = 8
Evaluating: 100%|██████████| 51/51 [01:44<00:00,  2.05s/it]
07/16/2020 14:14:10 - INFO - __main__ -   ***** Eval results  *****
07/16/2020 14:14:10 - INFO - __main__ -     acc = 0.8602941176470589
07/16/2020 14:14:10 - INFO - __main__ -     acc_and_f1 = 0.8810937025412575
07/16/2020 14:14:10 - INFO - __main__ -     f1 = 0.9018932874354562
07/16/2020 14:14:10 - INFO - __main__ -   Loading features from cached file ./glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
07/16/2020 14:14:10 - INFO - __main__ -   ***** Running evaluation  *****
07/16/2020 14:14:10 - INFO - __main__ -     Num examples = 408
07/16/2020 14:14:10 - INFO - __main__ -     Batch size = 8
Evaluating:   0%|          | 0/51 [00:00<?, ?it/s]

{'acc': 0.8602941176470589, 'f1': 0.9018932874354562, 'acc_and_f1': 0.8810937025412575}
Evaluate total time (seconds): 105.0


Evaluating: 100%|██████████| 51/51 [01:43<00:00,  2.04s/it]
07/16/2020 14:15:54 - INFO - __main__ -   ***** Eval results  *****
07/16/2020 14:15:54 - INFO - __main__ -     acc = 0.8480392156862745
07/16/2020 14:15:54 - INFO - __main__ -     acc_and_f1 = 0.8714772349617813
07/16/2020 14:15:54 - INFO - __main__ -     f1 = 0.8949152542372881
07/16/2020 14:15:54 - INFO - transformers.configuration_utils -   Configuration saved in ./MRPC/quantized/config.json


{'acc': 0.8480392156862745, 'f1': 0.8949152542372881, 'acc_and_f1': 0.8714772349617813}
Evaluate total time (seconds): 104.0


07/16/2020 14:15:54 - INFO - transformers.modeling_utils -   Model weights saved in ./MRPC/quantized/pytorch_model.bin
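
The two `Size (MB)` values printed by `print_size_of_model` (437.98 MB for FP32 and 181.44 MB after quantization) correspond to a reduction of roughly 2.4x rather than 4x, because only the `torch.nn.Linear` weights are converted to INT8 while the embeddings stay in FP32. The back-of-the-envelope check below estimates the saving from the linear weights alone. The hidden size of 768 appears in the model printout; the feed-forward size of 3072 and the 12 encoder layers are assumptions based on the standard bert-base configuration.

```python
HIDDEN = 768         # from the model printout
INTERMEDIATE = 3072  # assumed: standard bert-base feed-forward size
LAYERS = 12          # assumed: standard bert-base depth
NUM_LABELS = 2       # MRPC is a binary task

# Per encoder layer: query, key, value, attention output, and two feed-forward matrices.
per_layer = 4 * HIDDEN * HIDDEN + 2 * HIDDEN * INTERMEDIATE
# All layers plus the pooler and the classification head.
linear_weights = LAYERS * per_layer + HIDDEN * HIDDEN + HIDDEN * NUM_LABELS

# FP32 uses 4 bytes per weight, INT8 uses 1 byte, so 3 bytes are saved per weight.
estimated_saving_mb = linear_weights * 3 / 1e6
observed_saving_mb = 437.982584 - 181.443657

print("Linear weights:", linear_weights)
print("Estimated saving (MB): {:.2f}".format(estimated_saving_mb))
print("Observed saving (MB):  {:.2f}".format(observed_saving_mb))
```

The estimate and the observed difference agree to within a small fraction of a megabyte, which supports the reading that the quantized linear layers account for essentially all of the size reduction.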


## 2. Quantization and Inference with ORT ##
In this section, we demonstrate how to export the PyTorch model to ONNX, quantize the exported ONNX model, and run inference on the quantized model with ONNX Runtime.

### 2.1 Export to ONNX model and optimize with ONNXRuntime-tools
This step exports the PyTorch model to ONNX and then optimizes the ONNX model with [ONNXRuntime-tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), which is an offline optimizer tool for transformer-based models.
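
The cell below also quantizes the exported model with `QuantizationMode.IntegerOps` and `symmetric_weight=True`. As a rough illustration of what integer matrix multiplication involves, the sketch below quantizes two small matrices symmetrically (zero point 0), multiplies them with exact integer arithmetic, and rescales the result by the product of the two scales. This is a simplified sketch under stated assumptions: the function names are hypothetical, both operands are treated symmetrically for brevity, and the runtime's own kernels may handle activations differently (for example with unsigned values and a nonzero zero point).

```python
def quantize_symmetric(matrix, num_bits=8):
    """Symmetric quantization: zero point is 0 and max |x| maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / qmax if max_abs else 1.0
    return [[int(round(x / scale)) for x in row] for row in matrix], scale


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


activations = [[0.5, -1.0], [2.0, 0.25]]  # toy values
weights = [[0.1, -0.2], [0.4, 0.3]]

qa, scale_a = quantize_symmetric(activations)
qw, scale_w = quantize_symmetric(weights)

int_result = matmul(qa, qw)  # integer accumulation, no rounding here
approx = [[v * scale_a * scale_w for v in row] for row in int_result]
exact = matmul(activations, weights)

max_error = max(
    abs(p - q) for row_p, row_q in zip(approx, exact) for p, q in zip(row_p, row_q)
)
print(approx)
print(exact)
print("max abs error:", max_error)
```

The only approximation comes from rounding the inputs to 8-bit integers; the multiplication itself is exact, which is why the quantized model can stay close to the FP32 accuracy.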

In [15]:
import onnxruntime

def export_onnx_model(args, model, tokenizer, onnx_model_path):
    with torch.no_grad():
        inputs = {'input_ids':      torch.ones(1, 128, dtype=torch.int64),
                  'attention_mask': torch.ones(1, 128, dtype=torch.int64),
                  'token_type_ids': torch.ones(1, 128, dtype=torch.int64)}
        outputs = model(**inputs)

        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                # model being run
                          (inputs['input_ids'],                 # model inputs, passed as a tuple
                           inputs['attention_mask'],
                           inputs['token_type_ids']),
                          onnx_model_path,                      # where to save the model (can be a file or file-like object)
                          opset_version=11,                     # the ONNX opset version to export the model to
                          do_constant_folding=True,             # whether to execute constant folding for optimization
                          input_names=['input_ids',             # the model's input names
                                       'input_mask',
                                       'segment_ids'],
                          output_names=['output'],              # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,    # variable-length axes
                                        'input_mask': symbolic_names,
                                        'segment_ids': symbolic_names})
        logger.info("ONNX Model exported to {0}".format(onnx_model_path))

def evaluate_onnx(args, model_path, tokenizer, prefix=""):

    sess_options = onnxruntime.SessionOptions()
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.intra_op_num_threads=1
    session = onnxruntime.InferenceSession(model_path, sess_options)

    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    eval_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)

        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(eval_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # No DataParallel wrapper here: inference runs through the ONNX Runtime session,
        # and this function has no PyTorch `model` object to wrap.

        # Eval!
        logger.info("***** Running evaluation {} *****".format(prefix))
        logger.info("  Num examples = %d", len(eval_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        #eval_loss = 0.0
        #nb_eval_steps = 0
        preds = None
        out_label_ids = None
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = tuple(t.detach().cpu().numpy() for t in batch)
            ort_inputs = {
                                'input_ids':  batch[0],
                                'input_mask': batch[1],
                                'segment_ids': batch[2]
                            }
            logits = np.reshape(session.run(None, ort_inputs), (-1,2))
            if preds is None:
                preds = logits
                #print(preds.shape)
                out_label_ids = batch[3]
            else:
                preds = np.append(preds, logits, axis=0)
                out_label_ids = np.append(out_label_ids, batch[3], axis=0)

        #print(preds.shap)
        #eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        #print(preds)
        #print(out_label_ids)
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

        output_eval_file = os.path.join(eval_output_dir, prefix + "_eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
                logger.info("  %s = %s", key, str(result[key]))
                writer.write("%s = %s\n" % (key, str(result[key])))

    return results


def time_ort_model_evaluation(model_path, configs, tokenizer, prefix=""):
    eval_start_time = time.time()
    result = evaluate_onnx(configs, model_path, tokenizer, prefix=prefix)
    eval_end_time = time.time()
    eval_duration_time = eval_end_time - eval_start_time
    print(result)
    print("Evaluate total time (seconds): {0:.1f}".format(eval_duration_time))

export_onnx_model(configs, model, tokenizer, "bert.onnx")

logger.info("****ORT float32***")
time_ort_model_evaluation('bert.onnx', configs, tokenizer, "onnx")
logger.info("*************************")
logger.info("")

def quantize_onnx_model(onnx_model_path, quantized_model_path):
    from onnxruntime.quantization import quantize, QuantizationMode
    import onnx
    onnx_opt_model = onnx.load(onnx_model_path)
    quantized_onnx_model = quantize(onnx_opt_model,
                                    quantization_mode=QuantizationMode.IntegerOps,
                                    symmetric_weight=True,
                                    force_fusions=True)

    onnx.save(quantized_onnx_model, quantized_model_path)
    logger.info(f"quantized model saved to:{quantized_model_path}")

logger.info("****ORT quantization***")
quantize_onnx_model('bert.onnx', 'bert.quant.onnx')
time_ort_model_evaluation('bert.quant.onnx', configs, tokenizer, "onnx.quant")
logger.info("*************************")
logger.info("")

from onnxruntime_tools import optimizer
opt_model = optimizer.optimize_model(
    'bert.onnx',
    'bert', 
    num_heads=12,
    hidden_size=768)
opt_model.save_model_to_file('bert.opt.onnx')

quantize_onnx_model('bert.opt.onnx', 'bert.opt.quant.onnx')

logger.info("****ORT opt quantization***")
time_ort_model_evaluation('bert.opt.quant.onnx', configs, tokenizer, "onnx.opt.quant")
logger.info("*************************")
logger.info("")

07/16/2020 14:47:28 - INFO - __main__ -   Loading features from cached file ./glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
07/16/2020 14:47:28 - INFO - __main__ -   ***** Exporting ONNX {} *****
Evaluating:   0%|          | 0/51 [00:00<?, ?it/s]

Namespace(data_dir='./glue_data/MRPC', device='cpu', do_lower_case=True, eval_batch_size=8, label_list=['0', '1'], local_rank=-1, max_seq_length=128, model_name_or_path='bert-base-uncased', model_type='bert', n_gpu=0, output_dir='./MRPC/', output_mode='classification', overwrite_cache=False, per_gpu_eval_batch_size=8, processor=<transformers.data.processors.glue.MrpcProcessor object at 0x7f4ac06a26d8>, task_name='mrpc')


07/16/2020 14:47:36 - INFO - __main__ -   ONNX Model exported to bert.onnx
Evaluating:   0%|          | 0/51 [00:08<?, ?it/s]
07/16/2020 14:47:36 - INFO - __main__ -   ****ORT float32***
07/16/2020 14:47:36 - INFO - __main__ -   Loading features from cached file ./glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
07/16/2020 14:47:36 - INFO - __main__ -   ***** Running evaluation onnx *****
07/16/2020 14:47:36 - INFO - __main__ -     Num examples = 408
07/16/2020 14:47:36 - INFO - __main__ -     Batch size = 8
Evaluating: 100%|██████████| 51/51 [00:21<00:00,  2.32it/s]
07/16/2020 14:47:58 - INFO - __main__ -   ***** Eval results onnx *****
07/16/2020 14:47:58 - INFO - __main__ -     acc = 0.8602941176470589
07/16/2020 14:47:58 - INFO - __main__ -     acc_and_f1 = 0.8810937025412575
07/16/2020 14:47:58 - INFO - __main__ -     f1 = 0.9018932874354562
07/16/2020 14:47:58 - INFO - __main__ -   *************************
07/16/2020 14:47:58 - INFO - __main__ -   
07/16/2020 14:47:58 - INFO

{'acc': 0.8602941176470589, 'f1': 0.9018932874354562, 'acc_and_f1': 0.8810937025412575}
Evaluate total time (seconds): 22.5


07/16/2020 14:48:10 - INFO - __main__ -   quantized model saved to:bert.quant.onnx
07/16/2020 14:48:10 - INFO - __main__ -   Loading features from cached file ./glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
07/16/2020 14:48:10 - INFO - __main__ -   ***** Running evaluation onnx.quant *****
07/16/2020 14:48:10 - INFO - __main__ -     Num examples = 408
07/16/2020 14:48:10 - INFO - __main__ -     Batch size = 8
Evaluating: 100%|██████████| 51/51 [00:24<00:00,  2.06it/s]
07/16/2020 14:48:35 - INFO - __main__ -   ***** Eval results onnx.quant *****
07/16/2020 14:48:35 - INFO - __main__ -     acc = 0.8504901960784313
07/16/2020 14:48:35 - INFO - __main__ -     acc_and_f1 = 0.8734624155264822
07/16/2020 14:48:35 - INFO - __main__ -     f1 = 0.896434634974533
07/16/2020 14:48:35 - INFO - __main__ -   *************************
07/16/2020 14:48:35 - INFO - __main__ -   


{'acc': 0.8504901960784313, 'f1': 0.896434634974533, 'acc_and_f1': 0.8734624155264822}
Evaluate total time (seconds): 25.0


07/16/2020 14:48:37 - INFO - onnxruntime_tools.transformers.optimizer -   Save optimized model by onnxruntime to bert_o1_cpu.onnx
07/16/2020 14:48:38 - INFO - fusion_base -   Fused LayerNormalization count: 25
07/16/2020 14:48:38 - INFO - fusion_base -   Fused Gelu count: 12
07/16/2020 14:48:38 - INFO - fusion_base -   Fused SkipLayerNormalization count: 25
07/16/2020 14:48:39 - INFO - fusion_base -   Fused Attention count: 12
07/16/2020 14:48:39 - INFO - onnx_model -   Graph pruned: 0 inputs, 0 outputs and 5 nodes are removed
07/16/2020 14:48:39 - INFO - fusion_base -   Fused EmbedLayerNormalization(with mask) count: 1
07/16/2020 14:48:39 - INFO - onnx_model -   Graph pruned: 0 inputs, 0 outputs and 12 nodes are removed
07/16/2020 14:48:39 - INFO - onnx_model -   Graph pruned: 0 inputs, 0 outputs and 0 nodes are removed
07/16/2020 14:48:39 - INFO - fusion_base -   Fused BiasGelu count: 12
07/16/2020 14:48:39 - INFO - fusion_base -   Fused SkipLayerNormalization(add bias) count: 24
07/

{'acc': 0.8602941176470589, 'f1': 0.9038785834738617, 'acc_and_f1': 0.8820863505604604}
Evaluate total time (seconds): 20.1
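
To compare the five runs side by side, the numbers logged above can be collected into a small table. The sketch below copies the accuracy, F1, and total evaluation time from the outputs in this notebook and derives the speedup and accuracy change relative to the PyTorch FP32 baseline. The run labels and the helper name `summarize` are illustrative; the timings come from a single single-threaded pass over the 408 MRPC dev examples and will vary with hardware.

```python
# (label, accuracy, F1, evaluation time in seconds), copied from the logs above
RUNS = [
    ("PyTorch FP32", 0.8602941176470589, 0.9018932874354562, 105.0),
    ("PyTorch INT8", 0.8480392156862745, 0.8949152542372881, 104.0),
    ("ORT FP32", 0.8602941176470589, 0.9018932874354562, 22.5),
    ("ORT INT8", 0.8504901960784313, 0.896434634974533, 25.0),
    ("ORT optimized + INT8", 0.8602941176470589, 0.9038785834738617, 20.1),
]


def summarize(runs):
    """Return one row per run with speedup and accuracy change vs. the first run."""
    _, base_acc, _, base_seconds = runs[0]
    rows = []
    for label, acc, f1, seconds in runs:
        rows.append({
            "label": label,
            "acc": acc,
            "f1": f1,
            "seconds": seconds,
            "speedup": base_seconds / seconds,
            "acc_delta": acc - base_acc,
        })
    return rows


summary = summarize(RUNS)
for row in summary:
    print("{label:<22} acc={acc:.4f} f1={f1:.4f} time={seconds:6.1f}s "
          "speedup={speedup:.2f}x acc_delta={acc_delta:+.4f}".format(**row))
```

Because each configuration was timed only once, small differences between runs (for example 105.0 s versus 104.0 s) should not be over-interpreted.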
