Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

## ONNX Runtime Question Answering with MobileBert model


In this tutorial, you will learn the end-to-end steps to obtain a HuggingFace model, convert it to ONNX format, and add pre- and post-processing steps to the ONNX model using the onnxruntime-extensions library. The resulting model can then be used directly in a sample Android or iOS mobile application, where applicable.

### 0. Prerequisites

Install the required libraries `onnx`, `onnxruntime`, `onnxruntime_extensions`, and `transformers` with pip. The export step in section 1 also relies on PyTorch (`torch`), which needs to be available in the same environment.

```sh
pip install onnx onnxruntime onnxruntime_extensions transformers
```

To work with Python in Jupyter Notebooks, you must activate an [Anaconda](https://www.anaconda.com/) environment or another Python environment in which you've installed the [Jupyter package](https://pypi.org/project/jupyter/). 
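
Before running the notebook, it can help to confirm that the packages above are visible to the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `REQUIRED` list are illustrative and not part of the original tutorial. Depending on the Python version, the distribution may need to be looked up as `onnxruntime-extensions` rather than `onnxruntime_extensions`.

```python
from importlib import metadata

# Distribution names taken from the pip command in this section.
REQUIRED = ["onnx", "onnxruntime", "onnxruntime_extensions", "transformers"]


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before continuing.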

In [27]:
import io
import numpy as np
import onnx
import onnxruntime as ort

### 1. Prepare ONNX Model from HuggingFace MobileBert model

In [18]:
import transformers
from transformers.onnx import FeaturesManager
from pathlib import Path
from onnxruntime.quantization import quantize_dynamic, QuantType

In [19]:
def create_onnx_model_from_huggingface(hf_model_name, onnx_model_path):
    """
        Load the model from huggingface and export it to onnx
    """
    tokenizer = transformers.AutoTokenizer.from_pretrained(hf_model_name)
    model = transformers.MobileBertForQuestionAnswering.from_pretrained(hf_model_name)
    
    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature="question-answering")
    onnx_config = model_onnx_config(model.config)

    onnx_inputs, onnx_outputs = transformers.onnx.export(
        tokenizer,        # preprocessor: the pretrained tokenizer for the model
        model,            # the pretrained HuggingFace model
        onnx_config,      # ONNX config describing input/output names and types
        16,               # opset: the ONNX opset version to export with
        onnx_model_path,  # where to save the exported ONNX model
    )

In [20]:
onnx_model_path = Path('mobilebert_uncased_squad_v2.onnx')
if not onnx_model_path.exists():
    print("Creating ONNX model from huggingface model...")
    create_onnx_model_from_huggingface('csarron/mobilebert-uncased-squad-v2', onnx_model_path)

Creating ONNX model from huggingface model...





Check if the output ONNX model is exported successfully.

In [21]:
assert onnx_model_path.exists()

Quantize the output model.

In [22]:
def quantize_model(model_path: Path):
    """
    Quantize the model so that it runs on mobile devices with a smaller memory footprint.
    """
    # Append "_quant" to the file stem and keep the original extension.
    quantized_model_path = model_path.with_name(model_path.stem + "_quant" + model_path.suffix)
    quantize_dynamic(model_path, quantized_model_path, weight_type=QuantType.QInt8)
    # Delete the unquantized model; only the quantized copy is used from here on.
    model_path.unlink()
    return quantized_model_path

In [23]:
quantized_model = quantize_model(onnx_model_path)

Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.0/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.0/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.1/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.1/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.2/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.2/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.3/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.3/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.4/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.4/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/mobilebert/encoder/layer.5/attention/self/MatMul]
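
To make the idea of int8 weight quantization more concrete, here is a small NumPy sketch of a symmetric, per-tensor scheme. It is a simplified illustration only: the function names are hypothetical, and the exact scheme `quantize_dynamic` applies internally may differ (for example in how scales and zero points are chosen).

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 using one symmetric scale for the whole tensor."""
    max_abs = float(np.max(np.abs(weights)))
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize_int8(quantized: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return quantized.astype(np.float32) * scale


weights = np.array([[0.5, -1.0], [0.25, 0.0]], dtype=np.float32)
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

print("bytes before:", weights.nbytes, "bytes after:", quantized.nbytes)
print("max abs error:", float(np.max(np.abs(restored - weights))))
```

Storing one byte per weight instead of four is where the smaller footprint comes from; the price is a rounding error of roughly half a quantization step per weight.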


### 2. Add pre and post processing steps to ONNX model

In [24]:
from onnxruntime_extensions.tools.pre_post_processing import *
from onnxruntime_extensions.tools import add_pre_post_processing_to_model as add_ppp
from contextlib import contextmanager

In [25]:
def add_pre_post_processing(input_model_path: Path, output_model_path: str, model_name: str = "csarron/mobilebert-uncased-squad-v2"):
    """
    Add pre and post processing to the model, for tokenization and post processing
    """
    onnx_opset = 16
    model = onnx.load(str(input_model_path.resolve(strict=True)))
    inputs = [create_named_value("input_text", onnx.TensorProto.STRING, [1, "num_sentences"])]  # Fix the batch size to be 1
    
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

    @contextmanager
    def temp_vocab_file():
        vocab_file = Path.cwd()/ "vocab.txt"
        yield vocab_file

    with temp_vocab_file() as vocab_file:
        import json
        with open(str(vocab_file), 'w') as f:
            f.write(json.dumps(tokenizer.vocab))

        pipeline = PrePostProcessor(inputs, onnx_opset)
        
        tokenizer_args = TokenizerParam(
            vocab_or_file=vocab_file,
            do_lower_case=True,
            tweaked_bos_id=0,
            is_sentence_pair=True,
        )
        
        pipeline.add_pre_processing(
            [
                BertTokenizer(tokenizer_args), # convert input_text into input_ids, attention_masks, token_type_ids
            ]
        )
        
        pipeline.add_post_processing(
            [
                (BertTokenizerQADecoder(tokenizer_args), # decode the input_ids to text
                [utils.IoMapEntry("BertTokenizer", producer_idx=0, consumer_idx=2)]) # input_ids
            ]
        )

    new_model = pipeline.run(model)
    onnx.save_model(new_model, output_model_path)
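
Conceptually, the post-processing step turns the model's start and end logits into an answer string by picking the most likely span of `input_ids` and decoding it. The sketch below mimics that idea on toy data with plain NumPy. It is an assumption-laden illustration, not the implementation of `BertTokenizerQADecoder`: the function name, the toy vocabulary, and the handling of an inverted span are all hypothetical, and the real decoder also takes care of details such as punctuation spacing.

```python
import numpy as np


def decode_answer(start_logits, end_logits, input_ids, id_to_token):
    """Pick the highest-scoring start/end positions and join the tokens in between."""
    start = int(np.argmax(start_logits))
    end = int(np.argmax(end_logits))
    if end < start:
        # Hypothetical choice: treat an inverted span as "no answer".
        return ""
    tokens = [id_to_token[token_id] for token_id in input_ids[start:end + 1]]

    # Merge WordPiece continuation tokens ("##xyz") into the preceding word.
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# Toy vocabulary and inputs, made up for this example.
id_to_token = {0: "[CLS]", 1: "what", 2: "day", 3: "[SEP]", 4: "feb", 5: "##ruary", 6: "7"}
input_ids = [0, 1, 2, 3, 4, 5, 6, 3]
start_logits = np.array([0.1, 0.0, 0.0, 0.0, 5.0, 0.2, 0.3, 0.0])
end_logits = np.array([0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 5.0, 0.0])

answer = decode_answer(start_logits, end_logits, input_ids, id_to_token)
print(answer)
```

Folding this logic into the ONNX graph means a mobile application only has to pass in text and read text back, without reimplementing tokenization or decoding.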

In [26]:
output_model_path = str(quantized_model).replace(".onnx", "_with_pre_post_processing.onnx")
add_pre_post_processing(quantized_model, output_model_path)
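
The file names in this tutorial are built by appending a tag to the model's file stem. A small `pathlib` helper can do this in one place; the name `with_stem_tag` is hypothetical and not part of the original notebook. One pitfall worth knowing: chaining `with_name(...)` and `with_suffix(...)` on a stem that itself contains a dot can drop part of the name, and `str.replace(".onnx", ...)` replaces every occurrence of `.onnx` in the path, so building the new name from `stem + tag + suffix` is the more predictable option.

```python
from pathlib import Path


def with_stem_tag(path: Path, tag: str) -> Path:
    """Return `path` with `tag` appended to the file stem, keeping the extension."""
    return path.with_name(path.stem + tag + path.suffix)


quantized = with_stem_tag(Path("mobilebert_uncased_squad_v2.onnx"), "_quant")
final = with_stem_tag(quantized, "_with_pre_post_processing")

print(quantized)
print(final)
```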

### 3. Test output ONNX model

In [28]:
from onnxruntime_extensions import get_library_path

In [36]:
def test_onnx_model(model_path: str):
    
    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

    # Note: register the custom operators that onnxruntime-extensions provides for the tokenizer pre-processing
    # and answer-decoding post-processing with onnxruntime. If we do not do this, we'll get an error on model load about the operators not being found.
    ortext_lib_path = get_library_path()
    so.register_custom_ops_library(ortext_lib_path)
    inference_session = ort.InferenceSession(model_path, so)
    

    test_context = "The game was played on February 7, 2016 at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."
    test_question = "What day was the game played on?"
    outputs = inference_session.run(['text'], {'input_text': [[test_question, test_context]]})
    output_answer = outputs[0][0]
    print("Answer:  " + output_answer)

In [37]:
test_onnx_model(output_model_path)

Answer:  february 7, 2016


2023-10-12 12:33:10.145313 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer '_ppp8_i64_0'. It is not used by any node and should be removed from the model.
2023-10-12 12:33:10.204589 [W:onnxruntime:, unsqueeze_elimination.cc:20 Apply] UnsqueezeElimination cannot remove node post_process_7
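
The model input declared in section 2 has shape `[1, "num_sentences"]`, which for question answering means one row holding the question followed by the context. The sketch below only makes that shape visible with NumPy; the helper name `build_qa_input` is hypothetical, and the test function above passes an equivalent nested list instead of an array.

```python
import numpy as np


def build_qa_input(question: str, context: str) -> np.ndarray:
    """Arrange a question/context pair as a 1 x 2 string array: [[question, context]]."""
    if not question or not context:
        raise ValueError("question and context must both be non-empty")
    return np.array([[question, context]])


batch = build_qa_input(
    "What day was the game played on?",
    "The game was played on February 7, 2016 at Levi's Stadium.",
)
print(batch.shape)
```

The order matters: the question comes first and the context second, matching the call in `test_onnx_model`.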


### 4. Build and run inference with the output model in a mobile application

- Android

- iOS