
single example inference seems slow #32

Closed
652994331 opened this issue Jun 3, 2020 · 8 comments

652994331 commented Jun 3, 2020

Hi, my environment is TF 1.13.1. I have already set up FasterTransformer v1 and ran the BERT example. With FasterTransformer inference on an input test file, the prediction time per sample is around 0.0035 s (time spent in estimator.predict divided by the number of samples); the original BERT without FasterTransformer is around 0.007 s.

However, when I used input_fn_builder (not file-based) to run inference on only one sample, the time is 0.009 s, which is the same as a single inference with the original BERT (also 0.009 s). Could you please help with this?

byshiue (Collaborator) commented Jun 3, 2020

Please try the tensorflow_bert demo in v2.

652994331 (Author) commented

@byshiue thank you for the quick reply, but why? Which part is different from v1? Thanks.

byshiue (Collaborator) commented Jun 3, 2020

The encoder in v1 and v2 is the same. v1 also provides the tensorflow_bert demo, but we do not demonstrate how to use it in the README, so I recommend you first run tensorflow_bert following the v2 README.
There are many possible reasons, and I cannot give an answer because we do not have enough information.

652994331 (Author) commented

@byshiue thank you. I went back to check the results again and found that, without FasterTransformer, the original BERT inference time for a single example (not the average over a test file, but one input) is around 15 ms. So FasterTransformer actually reduces the time from 15 ms to 9 ms; it does work.

I was thinking about one thing: for the original BERT without FasterTransformer, we can export the model as a saved .pb and run inference on features, which reduces the time a lot. Can we use an exported model with FasterTransformer? If so, how?

Thanks so much.

byshiue (Collaborator) commented Jun 4, 2020

Yes. There are two ways.
First, you can restore the checkpoint, get the variables with tf.get_default_graph().get_tensor_by_name (or a similar function), and pass them into FasterTransformer. If you pass the variables in tf.Tensor format, the overhead of constructing FasterTransformer is smaller because it does not need to copy the memory.
The other way is to pass the weights in NumPy format. In this case, the construction overhead is larger, but the inference time is unaffected.
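A minimal sketch of these two approaches, assuming TF 1.x. The checkpoint path is a placeholder, the tensor name follows the standard BERT variable naming, and the sketch only shows how the tensors/arrays would be obtained; it is not runnable without a real checkpoint, and how they are then handed to the FasterTransformer op depends on your wrapper.

```python
# Sketch only (TF 1.x): assumes a BERT checkpoint on disk.
# "ckpt_path" is a placeholder; adjust names to your model.
import tensorflow as tf

ckpt_path = "/path/to/bert_model.ckpt"  # hypothetical path

# Way 1: restore the graph and fetch weights as tf.Tensor objects,
# so FasterTransformer can consume them without a memory copy.
saver = tf.train.import_meta_graph(ckpt_path + ".meta")
graph = tf.get_default_graph()
query_kernel = graph.get_tensor_by_name(
    "bert/encoder/layer_0/attention/self/query/kernel:0")

# Way 2: evaluate the variables once to get NumPy arrays. The op
# construction pays an extra copy, but inference time is unaffected.
with tf.Session() as sess:
    saver.restore(sess, ckpt_path)
    weights_np = sess.run(tf.global_variables())
```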

652994331 (Author) commented

@byshiue thank you, but this is a bit overwhelming; I am new to TensorFlow and BERT. Is there any code I can reference?

byshiue (Collaborator) commented Jun 4, 2020

You can start with sample/tensorflow/encoder_sample.py and sample/tensorflow/utils/encoder.py.
They are an easy environment for verifying correctness and inference speed.
For example, you can try replacing "encoder_vars[val_off + 0]" in encoder.py with "tf.get_default_graph().get_tensor_by_name('layer_%d/attention/self/query/kernel:0' % layer_idx)".

Another approach is to use sess.run(all_vars) to get the values of all variables in NumPy format and then pass them into the FasterTransformer op.

Once you understand how to use FasterTransformer, you can modify the tensorflow_bert sample to run the test on BERT.
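For reference, the substitution described above would look roughly like this inside encoder.py (a sketch: it assumes the restored BERT graph is the default graph and that it contains a tensor with exactly the quoted name, which may differ in your checkpoint):

```python
# Sketch (TF 1.x): fetch one attention weight from the restored graph
# instead of reading it from the encoder_vars list in encoder.py.
import tensorflow as tf

layer_idx = 0  # example layer index
# original: q_kernel = encoder_vars[val_off + 0]
q_kernel = tf.get_default_graph().get_tensor_by_name(
    'layer_%d/attention/self/query/kernel:0' % layer_idx)
```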

byshiue (Collaborator) commented Jun 25, 2020

Closing due to inactivity.

byshiue transferred this issue from NVIDIA/DeepLearningExamples on Apr 5, 2021
This issue was closed.