
Low inference speed #1

Closed

fischermario opened this issue Jul 1, 2018 · 3 comments

Comments
@fischermario

I have tried to reproduce the benchmark results with the examples from the repository, but the inference speed on my Jetson TX2 is much slower than the results in the table on the front page.

This is the log for classification.ipynb:

2018-07-01 22:18:34.878861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-07-01 22:18:34.879005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.46GiB
2018-07-01 22:18:34.879066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 22:18:35.940353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 22:18:35.940441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-07-01 22:18:35.940466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-07-01 22:18:35.940661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4002 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Converted 230 variables to const ops.
2018-07-01 22:18:49.301345: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2018-07-01 22:18:50.402393: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2660] Max batch size= 1 max workspace size= 33554432
2018-07-01 22:18:50.402478: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2664] Using FP16 precision mode
2018-07-01 22:18:50.402500: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2666] starting build engine
2018-07-01 22:19:11.072290: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2671] Built network
2018-07-01 22:19:11.308241: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2676] Serialized engine
2018-07-01 22:19:11.318361: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2684] finished engine InceptionV1/my_trt_op0 containing 493 nodes
2018-07-01 22:19:11.318499: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2704] Finished op preparation
2018-07-01 22:19:11.339604: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2712] OK finished op building
2018-07-01 22:19:11.392810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 22:19:11.392929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 22:19:11.392958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-07-01 22:19:11.392980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-07-01 22:19:11.393077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4002 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
(0.374037) golden retriever

(0.048114) miniature poodle

(0.042460) toy poodle

(0.036036) cocker spaniel, English cocker spaniel, cocker

(0.017122) standard poodle

Inference finished in 2712 ms

My only modification to the example code is a time measurement around

output = tf_sess.run(tf_output, feed_dict={
    tf_input: image[None, ...]
})
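
For reference, a minimal sketch of that wrapper, assuming Python's standard time module (tf_sess, tf_output, tf_input, and image come from the example notebook):

import time

start = time.time()
# Single session run, timed with wall-clock time
output = tf_sess.run(tf_output, feed_dict={
    tf_input: image[None, ...]
})
# Elapsed time for this one call, in milliseconds
print("Inference finished in %d ms" % int((time.time() - start) * 1000))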

I ran my tests after a reboot with

sudo nvpmodel -m 0
sudo ~/jetson_clocks.sh

Without those commands the inference time is ~200 ms higher.

What am I missing here?

@ghost commented Jul 2, 2018

The first call of tf_sess.run takes significantly longer than subsequent calls due to initialization. For the benchmark timings we reported, we averaged over several calls to tf_sess.run, excluding the first call. Are you excluding the first call in your timing?
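
A minimal sketch of that timing scheme, assuming the same notebook variables as above (num_runs is a hypothetical sample count, not necessarily what the published benchmarks used):

import time

# Warm-up: the first session run triggers CUDA context and TensorRT
# engine initialization, so it is excluded from the timed loop.
tf_sess.run(tf_output, feed_dict={tf_input: image[None, ...]})

num_runs = 50  # hypothetical; pick enough runs for a stable average
start = time.time()
for _ in range(num_runs):
    output = tf_sess.run(tf_output, feed_dict={tf_input: image[None, ...]})
avg_ms = (time.time() - start) * 1000.0 / num_runs
print("Average inference time: %.1f ms" % avg_ms)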

@fischermario (Author)

That was it. I did not take the initialization time into account. Thanks for the insight 😊

@ghost commented Jul 2, 2018

No problem, glad to hear it worked :).

Closing this.

@ghost closed this as completed Jul 2, 2018