Up to 50% longer inference time for ONNX model on the same hardware, compiled the same way. What can be the reason? #23223
Docker and WSL are not perfect candidates for performance benchmarking. Neither has full access to the system resources. So this is not an OpenCV issue. |
I guess you missed the part where I wrote "I've tested it also on Windows. This time I wrote a Python script. I have two Python environments, in one I get 25 FPS in the other it is 15 FPS." Both times I compiled the OpenCV Python bindings the same way (the same set of options). So I've already excluded the system, compiler, language, OpenCV version, and CUDA version. I will work on reproducing, I just hoped that maybe someone has an idea what the issue can be here. |
Hi @TrueWodzu, can you try to reproduce this issue with the OpenCV 4.x branch? |
Hi @zihaomu, thank you for your interest. The same issue happens on the 4.x branch. I did some more tests: I took the compiled application which runs slower on Docker and moved it to WSL, and the application was running faster under WSL. I've measured GPU clocks and I can see that under Docker the GPU is less utilized while the application runs:
(GPU clock screenshots for WSL and Docker were attached here.)
See how on WSL the clock rises to 1875 MHz while on Docker it stays at 1005 MHz? However, I want to stress that in my opinion it is not a Docker issue. I've run Nvidia examples on both systems and their performance is exactly the same. |
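For reference, a minimal way to log the SM clock and utilization while the benchmark runs is sketched below; it uses the pynvml package and is an illustration only, not the exact tooling used for the readings above.

```python
# Sketch: poll GPU SM clock and utilization once per second (pynvml package
# assumed installed; not the exact tool used for the readings in this thread).
import time
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetClockInfo, nvmlDeviceGetUtilizationRates,
                    NVML_CLOCK_SM)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # first GPU
try:
    for _ in range(30):  # sample for ~30 s while inference is running
        sm_clock = nvmlDeviceGetClockInfo(handle, NVML_CLOCK_SM)   # MHz
        gpu_util = nvmlDeviceGetUtilizationRates(handle).gpu       # percent
        print(f"SM clock: {sm_clock} MHz, GPU utilization: {gpu_util}%")
        time.sleep(1)
finally:
    nvmlShutdown()
```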
Question 1: Could you share the result of the following script? (WSL2 and Docker)

```python
import cv2
print(cv2.getBuildInformation())
```

Question 2: Could you share the benchmark script and ONNX model? |
Answer 1
I won't be sending the second config because it is identical; I diffed it.
Answer 2
See the updated "steps to reproduce" section. |
@alalek Since I've completed the missing steps, is this issue now viable to look into? |
I was able to recreate this on a different machine. My laptop has an RTX 2060; I've also recreated this on a GTX 1650. |
I could reproduce on the same hardware.
Software
Hardware
Execution time (TrueWodzu's code)
Ubuntu 22.04 (native)
Ubuntu 22.04 (docker container)
|
@atinfinity Thank you so much for your time. Just to make sure, you did this without the Windows and WSL layer? You did this on a pure Ubuntu system? |
I also checked inference of YOLOv4 (Darknet).
In this case, there is no difference in execution time between "Ubuntu 22.04 (native)" and "Ubuntu 22.04 (docker container)". |
I tried on a pure Ubuntu system. |
I tried to use the cuDNN sample (mnistCUDNN):

```
cp -r /usr/src/cudnn_samples_v8/ $HOME
cd $HOME/cudnn_samples_v8/mnistCUDNN
make
./mnistCUDNN
```

As a result, CUDA kernel processing time is slower than on Ubuntu 22.04 (native). |
@atinfinity I've run the mnistCUDNN example but I don't see any time differences. Also, I think these tests are way too short to draw any conclusions. |
@TrueWodzu The 50% performance degradation is very similar to the phenomenon in my issue. |
@ZJDATY I've read your post, and I see you have put a lot of energy into finding/reproducing the problem. The thing is, in my case this happens on the GPU, while your case is on the CPU. |
I just stumbled across an issue that might be very similar, or might be the reason. The same inference on 4.8.0 compared to 4.5.2 is sometimes slower. I traced it down to 4.5.2 using a maximum of 553.6 MB RAM while 4.8.0 is using 1.89 GB RAM. Could anybody seeing the problem be swapping? Is the very high RAM usage a known issue? |
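To check the swapping theory, the process footprint can be logged around the forward pass; a minimal sketch using the psutil package (the model path is a placeholder, and this is not code from the report above):

```python
# Sketch: log resident memory and swap usage around a DNN forward pass
# (psutil package assumed; the model path is a placeholder).
import psutil
import numpy as np
import cv2

proc = psutil.Process()
print("RSS before load: %.1f MB" % (proc.memory_info().rss / 1e6))

net = cv2.dnn.readNetFromONNX("model.onnx")   # placeholder model
blob = cv2.dnn.blobFromImage(np.zeros((640, 640, 3), np.uint8),
                             1 / 255.0, (640, 640))
net.setInput(blob)
net.forward()
print("RSS after first forward: %.1f MB" % (proc.memory_info().rss / 1e6))
print("System swap in use: %.1f MB" % (psutil.swap_memory().used / 1e6))
```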
@ukoehler, please add a link to the model. If possible, provide per-layer timings using getPerfProfile on both 4.5.2 and 4.8.0. This might help determine which layer has a regression.

```cpp
std::vector<double> timings;
net.getPerfProfile(timings);
std::vector<String> names = net.getLayerNames();
CV_Assert(names.size() == timings.size());
for (int i = 0; i < names.size(); ++i)
{
    Ptr<dnn::Layer> l = net.getLayer(net.getLayerId(names[i]));
    std::cout << names[i] << " " << l->type << " " << timings[i] << std::endl;
}
```
I just ran more tests and see an increase from 1.368 s for version 4.5.2 to 4.351 s for version 4.8.0. This version is just collecting show-stopper bugs. |
System Information
OpenCV version: 4.6.0
Operating System / Platform: Ubuntu 20.04 / Windows / WSL2 / Docker
Compiler & compiler version: GCC 9/11, MSVC 2017/2019; Python 3.10
CUDA: 11.4, 11.8
Detailed description
A word of preface:
I am observing up to 50% longer inference times in one environment compared to the other. Both environments run on the same hardware and OpenCV has been compiled the same way. What is interesting is that the behaviour is consistent regardless of operating system.
For example: I wrote a C++ executable and ran it on a WSL2 image (Ubuntu 20.04/CUDA 11.8) where my inference rate is 25 FPS, and I have a Docker image (Ubuntu 20.04/CUDA 11.8) where I compiled OpenCV exactly the same way and the inference rate is 15 FPS.
I've also tested it on Windows. This time I wrote a Python script. I have two Python environments; in one I get 25 FPS, in the other 15 FPS.
So the problem is not within my code, and it does not depend on the operating system. Each time I used the same OpenCV version with the same set of options. I suspect that maybe during compilation of OpenCV something is sometimes compiled differently?
I've dug deeper and profiled OpenCV; here are the results:
This is the major difference, so I know the problem lies within dnn. I looked at the source and discovered that I can time "layers", so I did that.
Here are the top 20 worst times (seconds) where inference is slower:
And here is where it is faster.
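For reference, per-layer timings of this kind can also be collected from the Python bindings; a minimal sketch (the model path, input size, and CUDA target are assumptions based on the setup described in this issue):

```python
# Sketch: print the 20 slowest layers reported by getPerfProfile
# (model path and input size are placeholders).
import numpy as np
import cv2

net = cv2.dnn.readNetFromONNX("yolov5m.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

blob = cv2.dnn.blobFromImage(np.zeros((640, 640, 3), np.uint8),
                             1 / 255.0, (640, 640))
net.setInput(blob)
net.forward()  # run once so per-layer timings are populated

total, timings = net.getPerfProfile()          # values are in ticks
names = net.getLayerNames()
freq = cv2.getTickFrequency()
worst = sorted(zip(names, timings.ravel()), key=lambda x: -x[1])[:20]
for name, ticks in worst:
    print(f"{name}: {ticks / freq:.6f} s")
print(f"total: {total / freq:.6f} s")
```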
I am not sure if this is a bug or not, but the drop in performance is quite serious and it would be good to know what can cause it, so it can be documented.
Steps to reproduce
This happens for any YOLOv5 model converted to ONNX. The difference can be observed on any model; the bigger the model, the bigger the difference. On my machine I can reproduce this every time by installing a fresh WSL2 image and a fresh Docker image based on Ubuntu 20.04. However, as said earlier, Docker is not the problem here, nor is the operating system or compiler.
Build settings:
Benchmarking code:
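A minimal sketch of this kind of FPS measurement (not the exact script from the report; the model path, input size, and iteration counts are placeholders):

```python
# Sketch of the FPS measurement described above (placeholder model path,
# input size, and iteration counts; not the original benchmarking script).
import time
import numpy as np
import cv2

net = cv2.dnn.readNetFromONNX("yolov5m.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

blob = cv2.dnn.blobFromImage(np.zeros((640, 640, 3), np.uint8),
                             1 / 255.0, (640, 640))
net.setInput(blob)
for _ in range(10):      # warm-up so CUDA kernel selection does not skew timing
    net.forward()

runs = 200
start = time.perf_counter()
for _ in range(runs):
    net.forward()
elapsed = time.perf_counter() - start
print(f"{runs / elapsed:.1f} FPS ({elapsed / runs * 1000.0:.2f} ms per inference)")
```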
Of course there will be some fluctuations in the times, but they will be very small.
You can obtain the model from here.