In [98]:
# https://github.com/onnx/tensorflow-onnx/blob/master/tutorials/huggingface-bert.ipynb
# https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/notebooks/bert/Bert-GLUE_OnnxRuntime_quantization.ipynb

# Converting a Huggingface model to ONNX with tf2onnx

This is a simple example of how to convert a [Hugging Face](https://huggingface.co/) model to ONNX using [tf2onnx](https://github.com/onnx/tensorflow-onnx).

This notebook uses the `TFDistilBertForSequenceClassification` model from Hugging Face, loaded from the `distilbert-base-uncased` checkpoint.

Other models work similarly. Additional examples for other models are in the tf2onnx unit tests [here](https://github.com/onnx/tensorflow-onnx/blob/master/tests/huggingface.py).

## Install dependencies

In [99]:
!pip install tensorflow transformers tf2onnx onnxruntime psutil



## The Keras code

In [100]:
import os
# Hide all GPUs so that TensorFlow runs on the CPU
os.environ['CUDA_VISIBLE_DEVICES'] = ""

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import onnxruntime as rt
import tensorflow as tf
import tf2onnx
from transformers import BertTokenizer, TFBertForSequenceClassification, TFDistilBertForSequenceClassification
from tokenizers import BertWordPieceTokenizer

In [101]:
from os import environ
from psutil import cpu_count

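# Constants from the performance optimization available in onnxruntime.
# They must be set before onnxruntime is imported. The previous cell already
# imports it, so restart the kernel and run this cell first for them to take effect.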
environ["OMP_NUM_THREADS"] = str(cpu_count(logical=True))
environ["OMP_WAIT_POLICY"] = 'ACTIVE'
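
The ordering requirement noted in the comment above can be sketched as follows. This is a minimal illustration that uses the standard library's `os.cpu_count()` in place of `psutil.cpu_count(logical=True)` (an assumption, made so the snippet has no third-party dependency). The helper name `configure_openmp` is hypothetical.

```python
import os


def configure_openmp(num_threads=None):
    """Set OpenMP-related variables; call this before importing onnxruntime."""
    # os.cpu_count() can return None, so fall back to a single thread.
    threads = num_threads or os.cpu_count() or 1
    os.environ["OMP_NUM_THREADS"] = str(threads)
    os.environ["OMP_WAIT_POLICY"] = "ACTIVE"
    return threads


threads = configure_openmp()
# Only after the environment is prepared would the runtime be imported, e.g.:
# import onnxruntime as rt
print(f"OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```

Wrapping the assignments in one function makes it easier to keep them at the very top of a notebook, ahead of any import that loads the runtime.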

In [102]:
! wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt



--2021-12-07 08:38:53--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.83.198
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.83.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt.1’


2021-12-07 08:38:53 (2.76 MB/s) - ‘bert-base-uncased-vocab.txt.1’ saved [231508/231508]



In [103]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=1)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use i

In [104]:
# text = """
# Voted Boston Globe's 2018 best New England beach town! Narragansett is the perfect seaside community for a laid back coastal retreat. Immaculate, highly ranked by VRBO, 4+ bedroom three and a half bath Pottery Barn inspired shingle style beach home designed for families and entertaining. Main part of the house has four bedrooms two and a half baths with an outdoor shower and changing room. Additionally, there is a large, finished lower level GUEST SUITE with a kitchenette, fourth bathroom and queen sized bed in the sleeping area. Perfect for family getaways, reunions, anniversaries, graduations. and local wedding accommodations. Comfortable for as little as 4 to as many as 12.\n \n\nThe Harbour Island community is set on a peninsula that jets out into the salt pond. Enjoy barbecuing on our expanded deck leading onto one of the largest back yards on the island, or simply walk to kayaking, clamming, and paddle boarding on Point Judith.\n\nOur property is surrounded by four of Narragansett's finest beaches just minutes away, and just up the street form a quaint private association beach/swim area (dock is for members only). Close to restaurants, shopping, Block Island ferry, and the center of town. Available primarily in the summer months weekly, monthly, or seasonally with advanced notice. \n\n*** HOUSE AMENITIES ***\n\nTwo Story Lofted Great Room, Central AC, WiFi, Custom Kitchen, Stainless Steel Appliances, Stunning Guest Suite With Private Bath and Kitchenette, Outdoor Dining Area, Large Yard, Gas Fire Pit With Seating, Weber Grill, Quality Pots and Pans, Keurig and Conventional Coffee, Mosquito and Tick Controlled, Stone Patio, Hammock, Front Porch, Multiple TV's, Outdoor Shower/Changing Area, Town Beach Passes, Walk to Private Quaint Association Beach (Dock is For Members Only), Off Street Parking For Up To Six Cars, Walk To Paddle Boarding, Kayaking, Clamming and Sunsets\n\nImmaculate, 3,000 sq/ft Pottery Barn inspired beach house...
# Featured in the style section of "SO RI" magazine! Centrally located on Harbour Island. Four bedrooms with additional lower level Guest Suite for accommodating even more guests. Open floor plan for entertaining, custom kitchen, huge yard for grilling and relaxing, central AC, deck, loft, outdoor shower, stone patio, close to beaches, kayaking,paddle boarding, B. I. ferry, shopping, & restaurants.\n\nAbout the area...\nHarbour Island is centrally located in Narragansett. Our private Association Beach and swim area is just down the street from our front door (dock is for members only). It's a great place to take a dip, catch some sun, or launch a kayak or paddle board. Narragansett Town beach passes provided...Sand Hill Cove, Galilee, Scarborough Beach, and the Block Island ferry are all within 2.5 to 4 miles of the property. Harbour Island is close to shops, restaurants, center of town, and market yet you feel like your in a quaint coastal setting. Newport, Providence, and Foxwoods casino are all short drives away.\nAbout the home...We have one of the largest yards on the island for grilling and relaxing with a new expanded patio and fire pit off our main deck so there is plenty of room for the whole gang! Home was built by owner to entertain with a wide open floor plan, lofted great room w/ 20 foot ceilings, top of the line kitchen, central air, porch, deck, custom built-ins, fire pit and distant water views from upper level. There is an emphasis on open, airy, light filled rooms.\nMain part of the home has four bedrooms two and one half baths, and a hot and cold outdoor shower with changing room. Additionally, there is a large family room with fifth queen bed, fourth bath, microwave and fridge in the lower level to accommodate even more guests. Comfortable for as little as 4 to as many as 12.
# We have many returning guests year after...We do our best to make your stay as comfortable as possible providing incidentals such as stainless cookware, blankets, comforters, pillows, beach towels, bath towels, lawn chairs, lawn games, trash bags, spare propane for Weber grill, laundry/dish detergent, etc. Fresh sheets and pillow cases arrive at your front door from Vacation Beach House Linens the day of your arrival. Smaller, hypoallergenic pets will be considered only after speaking with host.
# """
text="""
Just one block to the sand & surf at Venice Beach. Two blocks from shopping & dining on Abbot Kinney Blvd. Charming, historic, ground floor in duplex with three bedrooms. The main level features: one bathroom with shower, kitchen, dining room, living room, large entry with desk, and two bedrooms (full beds). With a staircase off the dining area, the lower level opens to sleeping area (king bed).\n\nThe outdoor patio is shared with the upper unit.\n\nHost lives full time in the upper unit. I am available if needed, please let me know\n\nThis is a ground floor flat with some stairs\n\nIf driving, please be aware of parking restrictions and street cleaning, as posted.\n\nBike rentals available at the beach\n\nBird/Lime/Jump scooters are plentiful in the area\n\nThis unit is street parking only. Please be aware of parking restrictions and street cleaning; there are signs posted on every block. Right outside the home, both non-metered metered street parking is available on a first come, first served basis. Meters are free after 6 PM. In addition there are several daily paid lots nearby. Feel free to search online for area paid lots, if curious (ZIP code 90291). Parking in this tourist destination is limited and in high demand.\n\nPublic transportation is nearby to take you to all LA has to offer	
"""


In [105]:
%%timeit
predict_input = tokenizer.encode(text,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
tf_output = model.predict(predict_input)[0]
# print("tf_output:  ", tf_output)
# Note: with num_labels=1 there is a single logit, so this softmax is always 1.0
tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()[0]
# print("tf_prediction: ",  tf_prediction)

The slowest run took 4.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 573 ms per loop
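
Because the model is loaded with `num_labels=1`, it emits a single logit per input, and a softmax over a single value is always 1.0 regardless of the logit. A small pure-Python sketch (the logit values are made up for illustration) shows this and contrasts it with a sigmoid, which is one common way to squash a single logit into a score. Whether a sigmoid suits this model depends on how it is fine-tuned, which this notebook does not cover.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for logit in (-3.2, 0.0, 4.7):  # illustrative values only
    print(logit, softmax([logit]), round(sigmoid(logit), 4))
```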


## Convert to ONNX

In [106]:
# describe the inputs
input_spec = (
    tf.TensorSpec((None,  None), tf.int32, name="input_ids"),
)

# and convert
_, _ = tf2onnx.convert.from_keras(model, input_signature=input_spec, opset=13, output_path="distilbert.onnx")

In [107]:
!du -hs /content/distilbert.onnx

256M	/content/distilbert.onnx
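
Roughly the same figure can be read from Python, which is handy for comparing the exported file against a quantized one later. A minimal sketch using only the standard library; the helper name `file_size_mib` is hypothetical, and the demo writes a throwaway temporary file rather than touching the real model.

```python
import os
import tempfile


def file_size_mib(path):
    """Return the size of a file in MiB (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Demo on a throwaway 2 MiB file; in the notebook you would pass
# "distilbert.onnx" instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    demo_path = tmp.name

demo_size = file_size_mib(demo_path)
os.remove(demo_path)
print(f"{demo_size:.1f} MiB")
```

Note that `du` reports disk usage while `os.path.getsize` reports the file length, so the two can differ slightly.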


## Dynamic quantization

In [108]:
from contextlib import contextmanager
from dataclasses import dataclass
from time import time
from typing import List

from transformers import BertTokenizerFast
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_all_providers


def create_model_for_provider(model_path: str, provider: str) -> InferenceSession:
    """Create an ONNX Runtime session for `model_path` on the given execution provider."""
    assert provider in get_all_providers(), f"provider {provider} not found, {get_all_providers()}"

    # A few properties that might affect performance (provided by MS)
    options = SessionOptions()
    options.intra_op_num_threads = 1
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

    # Load the model as a graph and prepare the requested backend
    session = InferenceSession(model_path, options, providers=[provider])
    session.disable_fallback()

    return session


@contextmanager
def track_infer_time(buffer: List[float]):
    """Append the wall-clock duration (in seconds) of the wrapped block to `buffer`."""
    start = time()
    yield
    end = time()

    buffer.append(end - start)


@dataclass
class OnnxInferenceResult:
    model_inference_time: List[float]
    optimized_model_path: str

# tokenizer = BertTokenizerFast.from_pretrained("distilbert-base-cased")
# cpu_model = create_model_for_provider("distilbert.onnx", "CPUExecutionProvider")
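
The `track_infer_time` context manager appends the elapsed wall-clock time of a block to a list. Below is a self-contained sketch of the same pattern, with a dummy workload standing in for `session.run`. It substitutes `time.perf_counter` for `time.time` (a monotonic clock intended for short intervals) and adds a `try`/`finally` so a measurement is recorded even if the block raises. The names `track_time` and `fake_inference` are hypothetical.

```python
from contextlib import contextmanager
from statistics import mean
from time import perf_counter
from typing import Iterator, List


@contextmanager
def track_time(buffer: List[float]) -> Iterator[None]:
    """Append the elapsed seconds of the wrapped block to `buffer`."""
    start = perf_counter()
    try:
        yield
    finally:
        buffer.append(perf_counter() - start)


def fake_inference() -> int:
    # Stand-in for `session.run(None, inputs)`.
    return sum(i * i for i in range(10_000))


timings: List[float] = []
for _ in range(5):
    with track_time(timings):
        fake_inference()

print(f"mean: {mean(timings) * 1e3:.3f} ms over {len(timings)} runs")
```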



In [109]:
# Inputs are provided as NumPy arrays. The exported graph declares `input_ids` as int32,
# and token ids do not fit in int8, so cast to int32.
model_inputs = tokenizer(text, return_tensors="np")
inputs_onnx = {k: v.astype(np.int32) for k, v in model_inputs.items()}
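
The `bert-base-uncased` vocabulary has 30,522 entries, while a signed 8-bit integer only covers -128 to 127, so token ids need a wider integer type than `int8` (the export above declares `input_ids` as `tf.int32`). A minimal sketch using `ctypes` from the standard library to mimic what a C-style narrowing cast does to such ids; the sample ids are illustrative.

```python
import ctypes

INT8_MIN, INT8_MAX = -128, 127


def as_int8(value: int) -> int:
    """Reinterpret the low 8 bits of `value` as a signed 8-bit integer."""
    return ctypes.c_int8(value).value


sample_ids = [101, 2023, 30521, 102]  # illustrative token ids
wrapped = [as_int8(i) for i in sample_ids]
print(wrapped)  # -> [101, -25, 57, 102]

# Ids that do not survive the cast unchanged
lossy = [i for i in sample_ids if not INT8_MIN <= i <= INT8_MAX]
print(lossy)  # -> [2023, 30521]
```

Any id above 127 is silently mapped to a different (possibly negative) value, which would make the model look up the wrong embedding or fail on an out-of-range index.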


In [111]:
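# The exported model takes only `input_ids`. DistilBertTokenizer does not return
# `token_type_ids`, so a plain pop() raises KeyError; a default makes it a no-op.
inputs_onnx.pop('token_type_ids', None)
inputs_onnx.pop('attention_mask', None)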

In [112]:
from transformers.convert_graph_to_onnx import quantize
from pathlib import Path
from tqdm import trange
from dataclasses import dataclass
from time import time

# Transformers lets you convert a float32 model to a quantized int8 model with ONNX Runtime
quantized_model_path = quantize(Path("distilbert.onnx"))

# Then load it through ONNX Runtime as you normally would
quantized_model = create_model_for_provider(quantized_model_path.as_posix(), "CPUExecutionProvider")

# Warm up the model for a fair comparison
# outputs = quantized_model.run(None, inputs_onnx)



         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
Quantized model has been written at distilbert-quantized.onnx: ✔
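
To give a feel for what converting float32 weights to 8-bit integers means, here is a simplified sketch of affine (scale and zero-point) quantization of a small list of weights, followed by dequantization. This is a toy illustration in pure Python under assumed conventions (unsigned 8-bit range, one scale per tensor). It is not ONNX Runtime's actual implementation, and the weight values are made up.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] with a per-tensor scale and zero point."""
    # The represented range must include 0.0 so that zero maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.51, -0.02, 0.0, 0.13, 0.76]  # illustrative values only
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), zp, round(max_err, 5))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step in this example.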


In [113]:
text="""
Just one block to the sand & surf at Venice Beach. Two blocks from shopping & dining on Abbot Kinney Blvd. Charming, historic, ground floor in duplex with three bedrooms. The main level features: one bathroom with shower, kitchen, dining room, living room, large entry with desk, and two bedrooms (full beds). With a staircase off the dining area, the lower level opens to sleeping area (king bed).\n\nThe outdoor patio is shared with the upper unit.\n\nHost lives full time in the upper unit. I am available if needed, please let me know\n\nThis is a ground floor flat with some stairs\n\nIf driving, please be aware of parking restrictions and street cleaning, as posted.\n\nBike rentals available at the beach\n\nBird/Lime/Jump scooters are plentiful in the area\n\nThis unit is street parking only. Please be aware of parking restrictions and street cleaning; there are signs posted on every block. Right outside the home, both non-metered metered street parking is available on a first come, first served basis. Meters are free after 6 PM. In addition there are several daily paid lots nearby. Feel free to search online for area paid lots, if curious (ZIP code 90291). Parking in this tourist destination is limited and in high demand.\n\nPublic transportation is nearby to take you to all LA has to offer	
"""

In [114]:
len(text)
opt = rt.SessionOptions()
ort_session = rt.InferenceSession("distilbert-quantized.onnx", opt, providers=["CPUExecutionProvider"])

In [115]:
%%timeit
tokenized_text = tokenizer.encode(text,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

"""
# Inputs are provided through numpy array
model_inputs = tokenizer(text, return_tensors="pt")
inputs_onnx = {k: v.cpu().detach().numpy().astype(np.int8) for k, v in model_inputs.items()}

"""
input_ids=tokenized_text
input_dict = {"input_ids" : input_ids.numpy() }
ort_session.run(None, input_dict)


1 loop, best of 5: 408 ms per loop
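
The two `%%timeit` results recorded in this notebook (573 ms for the Keras model and 408 ms for the quantized ONNX model) can be turned into a speedup figure. A small worked calculation follows; both cells also time tokenization and report the best of 5 single-loop runs, so the numbers are rough.

```python
tf_ms = 573.0    # best of 5, Keras model (from the recorded cell output)
onnx_ms = 408.0  # best of 5, quantized ONNX model (from the recorded cell output)

speedup = tf_ms / onnx_ms
reduction = (tf_ms - onnx_ms) / tf_ms

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")
# -> speedup: 1.40x, latency reduction: 28.8%
```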


In [116]:
# # Evaluate performances
# time_buffer = []
# for _ in trange(100, desc=f"Tracking inference time on CPUExecutionProvider with quantized model"):
#     with track_infer_time(time_buffer):
#         outputs = quantized_model.run(None, inputs_onnx)

# # Store the result
# results = {}

# results["ONNX CPU Quantized"] = OnnxInferenceResult(
#     time_buffer, 
#     quantized_model_path
# ) 
# results