# Converting a TensorFlow BERT model to ONNX

This tutorial shows how to convert the original TensorFlow BERT model to ONNX.
In this example we fine-tune BERT for SQuAD 1.1 on top of [BERT-Base, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip).

Because this tutorial focuses on the conversion process, we reuse the tokenizer and utilities defined in the BERT source tree as much as possible.

This should work with all versions supported by the [tensorflow-onnx converter](https://github.com/onnx/tensorflow-onnx). We used the following versions while writing the tutorial:
```
tensorflow-gpu: 1.13.1
onnx: 1.5.1
tf2onnx: 1.5.1
onnxruntime: 0.4
```
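
To compare your own environment with the versions above, a small helper can report what is installed. This is a minimal sketch, not part of the original notebook: the function name `installed_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute on its top-level module.

```python
import importlib


def installed_versions(module_names):
    """Return {module name: version string} for each requested module."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            # Assumption: the package exposes __version__; report "unknown" if not.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

In the notebook, `installed_versions(["tensorflow", "onnx", "tf2onnx", "onnxruntime"])` returns one entry per package. Note that `tensorflow-gpu` is imported as `tensorflow`.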

To make fine-tuning fit on a GTX 1080 GPU, we changed MAX_SEQ_LENGTH to 256 and used a training batch size of 8.

## Step 1 - define some environment variables
Before we start, let's set up some variables that say where to find things.

In [1]:
import os
import sys

ROOT = os.getcwd()
BERT_BASE_DIR = os.path.join(ROOT, 'uncased_L-12_H-768_A-12')
SQUAD_DIR = os.path.join(ROOT, 'squad-1.1')
OUT = os.path.join(ROOT, 'out')

sys.path.append(os.path.join(ROOT, "bert"))
    
os.environ['PYTHONPATH'] = os.path.join(ROOT, "bert")
os.environ['BERT_BASE_DIR'] = BERT_BASE_DIR
os.environ['SQUAD_DIR'] = SQUAD_DIR
os.environ['OUT'] = OUT
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

## Step 2 - clone the BERT GitHub repository

In [1]:
!git clone https://github.com/google-research/bert bert

Cloning into 'bert'...
remote: Enumerating objects: 329, done.
remote: Total 329 (delta 0), reused 0 (delta 0), pack-reused 329
Receiving objects: 100% (329/329), 234.38 KiB | 0 bytes/s, done.
Resolving deltas: 100% (189/189), done.
Checking connectivity... done.


## Step 3 - download the pretrained BERT model and the SQuAD 1.1 dataset

In [None]:
!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip

!mkdir squad-1.1 out

!wget -O squad-1.1/train-v1.1.json  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json 
!wget -O squad-1.1/dev-v1.1.json  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json 
!wget -O squad-1.1/evaluate-v1.1.json  https://rajpurkar.github.io/SQuAD-explorer/dataset/evaluate-v1.1.json 
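
Because fine-tuning takes a while, it can help to confirm that the downloads landed where the next step expects them. The sketch below is an illustration only: `missing_files` and `expected_inputs` are hypothetical helper names, and the check for `bert_model.ckpt.index` assumes the usual TensorFlow checkpoint layout, in which `bert_model.ckpt` is a prefix rather than a single file.

```python
import os


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]


def expected_inputs(bert_base_dir, squad_dir):
    """List the files that the fine-tuning command in Step 4 reads."""
    return [
        os.path.join(bert_base_dir, "vocab.txt"),
        os.path.join(bert_base_dir, "bert_config.json"),
        # Assumption: the checkpoint prefix is accompanied by an .index file.
        os.path.join(bert_base_dir, "bert_model.ckpt.index"),
        os.path.join(squad_dir, "train-v1.1.json"),
        os.path.join(squad_dir, "dev-v1.1.json"),
    ]
```

In the notebook, `missing_files(expected_inputs(BERT_BASE_DIR, SQUAD_DIR))` should return an empty list before you move on to Step 4.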

## Step 4 - fine-tune the BERT model for SQuAD 1.1
This is the same as described in the [BERT repository](https://github.com/google-research/bert). You need to do this only once.


In [None]:
#
# finetune bert for squad-1.1
# this may take a bit
#

!cd bert && \
python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=8 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=256 \
  --doc_stride=128 \
  --output_dir=$OUT
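
The number of training steps, and with it the name of the final checkpoint, follows from the flags above. The arithmetic below is a sketch under two assumptions that are not stated in this tutorial: that `run_squad.py` computes the step count as examples / batch size * epochs, truncated to an integer, and that the SQuAD 1.1 training file yields 87,599 examples. Check both against your copy of the code and data.

```python
# Assumption: number of examples read from train-v1.1.json.
NUM_TRAIN_EXAMPLES = 87599
TRAIN_BATCH_SIZE = 8      # --train_batch_size
NUM_TRAIN_EPOCHS = 2.0    # --num_train_epochs


def num_train_steps(num_examples, batch_size, num_epochs):
    """Step count, truncated to an integer (assumed to mirror run_squad.py)."""
    return int(float(num_examples) / batch_size * num_epochs)


print(num_train_steps(NUM_TRAIN_EXAMPLES, TRAIN_BATCH_SIZE, NUM_TRAIN_EPOCHS))  # 21899
```

With a different batch size or epoch count the final checkpoint number changes accordingly.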

## Step 5 - create the inference graph and save it
With a fine-tuned model in hand, we create the inference graph for it and save it in the saved_model format.

***We assume that after 2 epochs the checkpoint is model.ckpt-21899. If the following code does not find it, check the $OUT directory for the highest checkpoint.***
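
If your run produced a different checkpoint number, the highest one can be looked up instead of hard-coded. This is a minimal sketch with hypothetical helper names; it assumes the usual TensorFlow naming, where each checkpoint writes a `model.ckpt-<step>.index` file. TensorFlow's own `tf.train.latest_checkpoint(OUT)` is an alternative that reads the `checkpoint` bookkeeping file.

```python
import os
import re

# Assumption: every saved checkpoint has a "model.ckpt-<step>.index" file.
_CKPT_INDEX = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def highest_checkpoint(filenames):
    """Return the checkpoint prefix with the largest step, or None if there is none."""
    best_step, best_prefix = -1, None
    for name in filenames:
        match = _CKPT_INDEX.match(name)
        if match and int(match.group(2)) > best_step:
            best_step, best_prefix = int(match.group(2)), match.group(1)
    return best_prefix


def highest_checkpoint_in(directory):
    """Return the full checkpoint prefix inside directory, or None."""
    prefix = highest_checkpoint(os.listdir(directory))
    return os.path.join(directory, prefix) if prefix else None
```

In the cell below, `CHECKPOINT = highest_checkpoint_in(OUT)` could then replace the hard-coded path.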

In [19]:
import collections
import json
import math
import os
import random

import numpy as np
import tensorflow as tf

import modeling
import optimization
import run_squad
import tokenization
import six

#
# define some constants used by the model
#
MAX_SEQ_LENGTH = 256
EVAL_BATCH_SIZE = 8
N_BEST_SIZE = 20
MAX_ANSWER_LENGTH = 30
MAX_QUERY_LENGTH = 64
DOC_STRIDE = 128

VOCAB_FILE = os.path.join(BERT_BASE_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_BASE_DIR, 'bert_config.json')
CHECKPOINT = os.path.join(OUT, 'model.ckpt-21899')

tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=True)

tf.logging.set_verbosity("WARN")

# touch flags
FLAGS = tf.flags.FLAGS

Create the model, run predictions on all data, and save the results so we can compare them later with the onnxruntime version.

In [64]:
run_config = tf.contrib.tpu.RunConfig(model_dir=OUT, tpu_config=None)

model_fn = run_squad.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    init_checkpoint=CHECKPOINT,
    learning_rate=0,
    num_train_steps=0,
    num_warmup_steps=0,
    use_tpu=False,
    use_one_hot_embeddings=False)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=run_config,
    predict_batch_size=EVAL_BATCH_SIZE,
    export_to_tpu=False)


eval_examples = run_squad.read_squad_examples(input_file=os.path.join(SQUAD_DIR, "dev-v1.1.json"), is_training=False)
eval_writer = run_squad.FeatureWriter(filename=os.path.join(OUT, "eval.tf_record"), is_training=False)
eval_features = []

def append_feature(feature):
    eval_features.append(feature)
    eval_writer.process_feature(feature)

run_squad.convert_examples_to_features(
    examples=eval_examples,
    tokenizer=tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    doc_stride=DOC_STRIDE,
    max_query_length=MAX_QUERY_LENGTH,
    is_training=False,
    output_fn=append_feature)
eval_writer.close()

predict_input_fn = run_squad.input_fn_builder(
    input_file=eval_writer.filename,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)



In [81]:
# N is the number of examples we are evaluating. On the CPU this might take a bit.
# During development you can set N to a smaller, more practical value.
N = len(eval_features)

all_results = []
for result in estimator.predict(predict_input_fn, yield_single_examples=True):
    if len(all_results) % 1000 == 0:
        print("sample: %d" % (len(all_results)))
    unique_id = int(result["unique_ids"])
    start_logits = [float(x) for x in result["start_logits"].flat]
    end_logits = [float(x) for x in result["end_logits"].flat]
    raw_result = run_squad.RawResult(unique_id=unique_id, start_logits=start_logits, end_logits=end_logits)
    all_results.append(raw_result)
    if len(all_results) >= N:
        break
    
run_squad.write_predictions(eval_examples[:N], eval_features[:N], all_results,
                            N_BEST_SIZE, MAX_ANSWER_LENGTH, True, 
                            os.path.join(OUT, "predictions.json"),
                            os.path.join(OUT, "nbest_predictions.json"), 
                            os.path.join(OUT, "null_odds.json"))

Processing example: 0


Now let's create the inference graph and save it.

In [23]:
# Export the model
def serving_input_fn():
    receiver_tensors = {
        'unique_ids': tf.placeholder(dtype=tf.int64, shape=[None], name='unique_ids'),
        'input_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_ids'),
        'input_mask': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='input_mask'),
        'segment_ids': tf.placeholder(dtype=tf.int64, shape=[None, MAX_SEQ_LENGTH], name='segment_ids')
    }
    return tf.estimator.export.ServingInputReceiver(receiver_tensors, receiver_tensors)

path = estimator.export_savedmodel(os.path.join(OUT, "export"), serving_input_fn)
os.environ['LAST_SAVED_MODEL'] = path.decode('utf-8')
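
The export directory is named after a timestamp (for example `1560197514` in the converter log further down), so `LAST_SAVED_MODEL` is lost when the notebook kernel restarts. The following sketch recovers the most recent export without exporting again. The helper names are hypothetical, and it assumes that finished exports are the sub-directories whose names consist only of digits.

```python
import os


def latest_export(names):
    """Return the numerically largest all-digit name, or None if there is none."""
    stamps = [name for name in names if name.isdigit()]
    return max(stamps, key=int) if stamps else None


def latest_export_dir(export_root):
    """Return the path of the newest timestamped export under export_root, or None."""
    name = latest_export(os.listdir(export_root))
    return os.path.join(export_root, name) if name else None
```

For example, `os.environ['LAST_SAVED_MODEL'] = latest_export_dir(os.path.join(OUT, "export"))` restores the variable used in Step 6, provided at least one export exists.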

## Step 6 - convert to ONNX

Convert the model from TensorFlow to ONNX using [tensorflow-onnx](https://github.com/onnx/tensorflow-onnx).

In [None]:
# install the latest version of tf2onnx if needed
!pip install -U tf2onnx

In [28]:
# convert model
# because we still have a tensorflow session open in this notebook, force the converter to use the CPU.
#
!CUDA_VISIBLE_DEVICES='' python -m tf2onnx.convert --saved-model $LAST_SAVED_MODEL --output $OUT/bert.onnx --opset 8

None
2019-06-10 13:19:47.511598: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-06-10 13:19:54,043 - INFO - Using tensorflow=1.13.1, onnx=1.5.0, tf2onnx=1.6.0/22481c
2019-06-10 13:19:54,043 - INFO - Using opset <onnx, 8>
2019-06-10 13:19:57,219 - INFO - 
2019-06-10 13:19:58,562 - INFO - Optimizing ONNX model
2019-06-10 13:19:59,958 - INFO - After optimization: Cast -4 (70->66), Identity -30 (31->1), Transpose -1 (62->61), Unsqueeze -173 (191->18)
2019-06-10 13:20:00,031 - INFO - 
2019-06-10 13:20:00,031 - INFO - Successfully converted TensorFlow model /home/gs/bert/out/export/1560197514 to ONNX
2019-06-10 13:20:01,241 - INFO - ONNX model is saved at /home/gs/bert/out/bert.onnx
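
Outside a notebook, the same conversion can be launched from a Python script. The sketch below only assembles the command line and environment shown in the cell above; the helper names are hypothetical and nothing beyond the `--saved-model`, `--output` and `--opset` flags used in this tutorial is assumed.

```python
import os
import sys


def build_convert_command(saved_model_dir, output_path, opset=8):
    """Build the argument list for `python -m tf2onnx.convert`."""
    return [sys.executable, "-m", "tf2onnx.convert",
            "--saved-model", saved_model_dir,
            "--output", output_path,
            "--opset", str(opset)]


def cpu_only_env():
    """Copy the current environment and hide all GPUs from the child process."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env
```

The command can then be run with `subprocess.run(build_convert_command(saved_model_dir, onnx_path), env=cpu_only_env(), check=True)`, where `saved_model_dir` and `onnx_path` stand for the export directory and the desired output file.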


## Step 7 - run the ONNX model under onnxruntime


Let's look at the inputs of the ONNX model. The input 'unique_ids' is special: it is passed directly to the output, and in TensorFlow the input and the output share the same name. ONNX does not support that, so the converter creates a new name for the input. We remember that generated name because we need it when feeding the model.

In [33]:
import onnxruntime as ort

sess = ort.InferenceSession(os.path.join(OUT, "bert.onnx"))
for input_meta in sess.get_inputs():
    print(input_meta)

# remember the name of unique_id
unique_id_name = sess.get_inputs()[0].name

NodeArg(name='unique_ids_raw_output___9:0', type='tensor(int64)', shape=[None])
NodeArg(name='segment_ids:0', type='tensor(int64)', shape=[None, 256])
NodeArg(name='input_mask:0', type='tensor(int64)', shape=[None, 256])
NodeArg(name='input_ids:0', type='tensor(int64)', shape=[None, 256])
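
Taking the first input works for the listing above, but the position of an input is not something we want to rely on. A slightly more defensive sketch selects the generated name by its prefix. The helper `find_input_name` is hypothetical, and it assumes the converter keeps `unique_ids` at the start of the generated name, as in the output shown here.

```python
def find_input_name(input_names, prefix):
    """Return the single input name that starts with prefix."""
    matches = [name for name in input_names if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError("expected exactly one input starting with %r, found %r"
                         % (prefix, matches))
    return matches[0]


# Input names copied from the listing above.
names = ["unique_ids_raw_output___9:0", "segment_ids:0", "input_mask:0", "input_ids:0"]
print(find_input_name(names, "unique_ids"))  # unique_ids_raw_output___9:0
```

With a live session the list of names would come from `[i.name for i in sess.get_inputs()]`.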


In [96]:
RawResult = collections.namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])

all_results = []
for idx in range(0, N):
    item = eval_features[idx]
    # this is using batch_size=1
    # feed the input data as int64
    data = {unique_id_name: np.array([item.unique_id], dtype=np.int64),
            "input_ids:0": np.array([item.input_ids], dtype=np.int64),
            "input_mask:0": np.array([item.input_mask], dtype=np.int64),
            "segment_ids:0": np.array([item.segment_ids], dtype=np.int64)}
    result = sess.run(["unique_ids:0", "unstack:0", "unstack:1"], data)
    unique_id = result[0][0]
    start_logits = [float(x) for x in result[1][0].flat]
    end_logits = [float(x) for x in result[2][0].flat]
    all_results.append(RawResult(unique_id=unique_id, start_logits=start_logits, end_logits=end_logits))
    if unique_id % 1000 == 0:
        print("sample: %d" % (len(all_results)))
    if len(all_results) >= N:
        break

run_squad.write_predictions(eval_examples[:N], eval_features[:N], all_results,
                            N_BEST_SIZE, MAX_ANSWER_LENGTH, True, 
                            os.path.join(OUT, "onnx_predictions.json"),
                            os.path.join(OUT, "onnx_nbest_predictions.json"), 
                            os.path.join(OUT, "onnx_null_odds.json"))


example: 1


Compare some results between TensorFlow and ONNX:

In [97]:
!head -20 $OUT/predictions.json

{
    "56be4db0acb8001400a502ec": "Denver Broncos",
    "56be4db0acb8001400a502ed": "Carolina Panthers",
    "56be4db0acb8001400a502ee": "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California",
    "56be4db0acb8001400a502ef": "Denver Broncos",
    "56be4db0acb8001400a502f0": "gold",
    "56be8e613aeaaa14008c90d1": "\"golden anniversary",
    "56be8e613aeaaa14008c90d2": "February 7, 2016",
    "56be8e613aeaaa14008c90d3": "American Football Conference",
    "56bea9923aeaaa14008c91b9": "\"golden anniversary",
    "56bea9923aeaaa14008c91ba": "American Football Conference",
    "56bea9923aeaaa14008c91bb": "February 7, 2016",
    "56beace93aeaaa14008c91df": "Denver Broncos",
    "56beace93aeaaa14008c91e0": "Levi's Stadium",
    "56beace93aeaaa14008c91e1": "San Francisco",
    "56beace93aeaaa14008c91e2": "Super Bowl L",
    "56beace93aeaaa14008c91e3": "2015",
    "56bf10f43aeaaa14008c94fd": "2015",
    "56bf10f43aeaaa14008c94fe": "San Francisco",
    "56bf

In [98]:
!head -20 $OUT/onnx_predictions.json

{
    "56be4db0acb8001400a502ec": "Denver Broncos",
    "56be4db0acb8001400a502ed": "Carolina Panthers",
    "56be4db0acb8001400a502ee": "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California",
    "56be4db0acb8001400a502ef": "Denver Broncos",
    "56be4db0acb8001400a502f0": "gold",
    "56be8e613aeaaa14008c90d1": "\"golden anniversary",
    "56be8e613aeaaa14008c90d2": "February 7, 2016",
    "56be8e613aeaaa14008c90d3": "American Football Conference",
    "56bea9923aeaaa14008c91b9": "\"golden anniversary",
    "56bea9923aeaaa14008c91ba": "American Football Conference",
    "56bea9923aeaaa14008c91bb": "February 7, 2016",
    "56beace93aeaaa14008c91df": "Denver Broncos",
    "56beace93aeaaa14008c91e0": "Levi's Stadium",
    "56beace93aeaaa14008c91e1": "San Francisco",
    "56beace93aeaaa14008c91e2": "Super Bowl L",
    "56beace93aeaaa14008c91e3": "2015",
    "56bf10f43aeaaa14008c94fd": "2015",
    "56bf10f43aeaaa14008c94fe": "San Francisco",
    "56bf

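Looking at the first lines only covers a handful of questions. The two prediction files can also be compared in full. The following is a minimal sketch with hypothetical helper names; it assumes both files are JSON objects that map a question id to an answer string, as in the output above.

```python
import json


def load_predictions(path):
    """Load a predictions file written by run_squad.write_predictions."""
    with open(path) as handle:
        return json.load(handle)


def diff_predictions(expected, actual):
    """Return {question id: (expected answer, actual answer)} for every disagreement."""
    diffs = {}
    for question_id in sorted(set(expected) | set(actual)):
        if expected.get(question_id) != actual.get(question_id):
            diffs[question_id] = (expected.get(question_id), actual.get(question_id))
    return diffs
```

In the notebook, `diff_predictions(load_predictions(os.path.join(OUT, "predictions.json")), load_predictions(os.path.join(OUT, "onnx_predictions.json")))` returns an empty dictionary when both runtimes agree on every answer.
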
## Summary

That is all it takes to convert a relatively complex model from TensorFlow to ONNX.

You can find more documentation about tensorflow-onnx [here](https://github.com/onnx/tensorflow-onnx).