Slow quantized tflite model inference #138

Closed
mieszkokl opened this issue Jun 9, 2020 · 6 comments

@mieszkokl

I have trained a Keras model, a semantic segmentation FPN with a ResNet101 backbone (using https://github.com/qubvel/segmentation_models). I want to deploy it on a Coral Dev Board.

First, I converted it to tflite using:

import tensorflow as tf

# `model` is the trained Keras FPN; `representative_dataset_gen` yields calibration samples.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.representative_dataset = representative_dataset_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tf_lite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tf_lite_model)
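
(representative_dataset_gen is not shown above; for reference, a typical generator for full-integer calibration looks roughly like this. This is only a sketch: calibration_images and the preprocessing are illustrative names, not my actual code.)

import numpy as np

# Hypothetical calibration set: a few hundred preprocessed training images is usually enough.
calibration_images = np.load("calibration_images.npy")  # shape (N, 224, 224, 3), float32

def representative_dataset_gen():
    # Yield one batched sample at a time, matching the model's (1, 224, 224, 3) input shape.
    for image in calibration_images[:200]:
        yield [np.expand_dims(image.astype(np.float32), axis=0)]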

Output from conversion:

2020-06-05 10:53:29.063149: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:29.063233: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:29.080730: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:29.080748: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0.006ms.
2020-06-05 10:53:29.080752: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0ms.
2020-06-05 10:53:32.284115: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:32.284242: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:33.407982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:33.408011: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (-568), 1139 edges (-568), time = 474.12ms.
2020-06-05 10:53:33.408016: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (0), 1139 edges (0), time = 213.886ms.

Then I compiled it for the Edge TPU using edgetpu_compiler:

edgetpu_compiler model.tflite

Logs from compiler:

Edge TPU Compiler version 2.1.302470888
Input: model.tflite
Output: model_edgetpu.tflite

Operator                       Count      Status

ADD                            1          More than one subgraph is not supported
ADD                            71         Mapped to Edge TPU
MAX_POOL_2D                    1          Mapped to Edge TPU
PAD                            35         Mapped to Edge TPU
MUL                            35         Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
QUANTIZE                       3          Mapped to Edge TPU
CONV_2D                        115        Mapped to Edge TPU
CONV_2D                        4          More than one subgraph is not supported
DEQUANTIZE                     1          Operation is working on an unsupported data type
RESIZE_BILINEAR                2          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_BILINEAR                6          Mapped to Edge TPU
SOFTMAX                        1          Max 16000 elements supported

Single-frame inference on the Coral Dev Board using the TPU takes ~4 s. Is it normal that it's so slow? What can I do to make it faster?
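
(For reference, I measure the single-frame latency roughly like this; a sketch, assuming tflite_runtime and libedgetpu are installed, and using dummy uint8 input:)

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the compiled model with the Edge TPU delegate.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
dummy = np.zeros(input_details[0]["shape"], dtype=np.uint8)  # (1, 224, 224, 3)
interpreter.set_tensor(input_details[0]["index"], dummy)

start = time.monotonic()
interpreter.invoke()
print("single-frame inference: %.1f ms" % ((time.monotonic() - start) * 1000))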

I've also created an issue in the TensorFlow repository, because inference with the converted tflite model on a PC CPU also takes a long time, but they eventually forwarded me here.

@Namburger

Hi!
Did the compiler mention how many ops will be mapped to the TPU?
What's the inference time for the CPU model?
This also depends on the size of the model; if it's too big, inference speed may be bound by I/O, and it's best to try our model pipelining API for that (although it is new and only available for C++ at this time).

@mieszkokl
Author

Info from the compiler about ops mapped to the TPU:

Number of operations that will run on Edge TPU: 266
Number of operations that will run on CPU: 11

Inference time using:

  • PC CPU and keras model: ~300ms
  • PC CPU and tflite model: ~58s
  • Coral Dev Board CPU and tflite model: ~6s
  • Coral Dev Board TPU and compiled tflite model: ~4s
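
(For reference, the tflite timings above are measured roughly like this; a sketch, assuming TF 2.2 and dummy input data:)

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # num_threads=... may help on newer TF releases
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

interpreter.invoke()  # warm-up run
start = time.monotonic()
for _ in range(5):
    interpreter.invoke()
print("mean latency: %.0f ms" % ((time.monotonic() - start) / 5 * 1000))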

@Namburger

@mieszkokl
I see. What is the input tensor shape and the size of the model? Our compiler only delegates compatible ops from the CPU tflite model to the Edge TPU; yours shows some speedup, so it looks like it is working as expected on our side.
Can you also link me to your original TensorFlow issue?
Also, which TensorFlow version are you using? It seems odd to me that the Keras model runs at ~300 ms while the tflite model jumps to ~58 s.
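
(You can read both straight from the tflite file; a quick sketch, assuming the Python tf.lite.Interpreter:)

import os
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
print("input shape:", inp["shape"], "dtype:", inp["dtype"])
print("file size: %.1f MB" % (os.path.getsize("model.tflite") / 1e6))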

@mieszkokl
Author

The input tensor shape is (1, 224, 224, 3).
The Keras model is 185 MB, the tflite model is 47.4 MB, and the compiled one is 48.3 MB.
Original TensorFlow issue: tensorflow/tensorflow#40183
I use TensorFlow v2.2.0.

@Namburger


@mieszkokl Sorry, but it just seems really odd to me that the tflite model performs worse than the original graph model. Our compiler can only delegate ops from the CPU tflite model, so we can't really address performance drops relative to the original graph file.

[Edit] I've followed up with some of my own findings on that issue. Closing this issue for now since I don't see this as a bug on our side.

@arun-kumark

Hi Namburger,

I am testing the performance/throughput of FP32 and quantized models on my platform:
Intel(R) Core(TM) i5-6442EQ CPU @ 1.90GHz, 4 logical CPUs

My TF versions are as follows:

tflite-runtime==2.5.0.post1
tensorflow==1.14.0

Results for FP32 on CPU:

-INFO- Running prediction...
-INFO- Acquired 1 file(s) for model 'MobileNet v1.0'
-INFO- Task runtime: 0:00:28.796083
-INFO- Throughput: 35.8 fps
-INFO- Latency: 29.5 ms
-INFO- Target          Workload        H/W   Prec  Batch Conc. Metric       Score    Units
-INFO- -----------------------------------------------------------------------------------
-INFO- tensorflow_lite mobilenet       cpu   fp32      1     1 throughput    35.8      fps
-INFO- tensorflow_lite mobilenet       cpu   fp32      1     1 latency       29.5       ms
-INFO- Total runtime: 0:00:28.830364
-INFO- Done

And results for INT8 on CPU:

google@localhost:~/mlmark$ harness/mlmark.py -c config/tflite-cpu-mobilenet-int8-throughput.json 
-INFO- Running prediction...
-INFO- Acquired 1 file(s) for model 'MobileNet v1.0'
-INFO- Task runtime: 0:01:00.933346
-INFO- Throughput: 16.9 fps
-INFO- Latency: 65. ms
-INFO- Target          Workload        H/W   Prec  Batch Conc. Metric       Score    Units
-INFO- -----------------------------------------------------------------------------------
-INFO- tensorflow_lite mobilenet       cpu   int8      1     1 throughput    16.9      fps
-INFO- tensorflow_lite mobilenet       cpu   int8      1     1 latency       65.        ms
-INFO- Total runtime: 0:01:00.960828
-INFO- Done

Observation: the FP32 model is almost twice as fast as the INT8 model on the CPU, but Google's TensorFlow Lite benchmarks state the opposite:

https://www.tensorflow.org/lite/guide/hosted_models#quantized_models

I also tried replacing the models with the ones from the hosted location above, but the harness gives similar results.
I have also gone through this similar issue: https://github.com/tensorflow/tensorflow/issues/21698

Is this observation expected?
I believed the CPU should perform better with the INT8 model (due to the smaller model size and cheaper arithmetic), but the results say otherwise ;(

Could you let me know where this could go wrong?
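
(One check I can still do is confirm that the INT8 file really contains quantized tensors; a sketch, assuming the tflite-runtime Interpreter and an illustrative file name mobilenet_int8.tflite:)

import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="mobilenet_int8.tflite")  # hypothetical file name
interpreter.allocate_tensors()

# Count tensor dtypes; a fully quantized model should be dominated by int8/uint8 tensors.
dtypes = {}
for t in interpreter.get_tensor_details():
    name = t["dtype"].__name__
    dtypes[name] = dtypes.get(name, 0) + 1
print(dtypes)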

Thanks
Kind Regards
Arun
