CUDA backend for the DNN module #14827
Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM
Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d
Current Support Matrix: (not updated)
alalek left a comment
Please note that we usually do not merge large code parts without corresponding tests.
So, consider working on this GSoC task in a single PR (if you don't have another agreement with your mentor).
Some build-related comments are below.
Do I have to use
Can I use
Is it fine to force push occasionally when there isn't any dependent stuff like reviews in between?
It is used to avoid excessive merge issues from
Feel free to use
In master branch it is just a wrapper, so it should do the same things.
It is OK.
Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined
You have to select the device before creating the
@YashasSamaga but still there is another problem.
However, the process still took some memory on GPU 0.
Devices are associated with each CPU thread. When you set the device using
You want the entire process to use a single GPU? You have to set the device at the very beginning before any device is used.
You can also control the device that is used externally by setting the
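One such external mechanism (assuming the comment refers to it) is the `CUDA_VISIBLE_DEVICES` environment variable, which restricts which GPUs the CUDA runtime enumerates. A minimal Python sketch; the key point is that it must be set before the runtime initializes, i.e. before the first CUDA call in the process:

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the CUDA runtime initializes,
# i.e. before the first CUDA call (in Python, before importing a
# CUDA-using module such as cv2).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only the second GPU

# From here on, the CUDA runtime sees a single device and enumerates it
# as device 0; the process allocates no memory on the hidden GPUs.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting the variable in the shell that launches the process (`CUDA_VISIBLE_DEVICES=1 ./app`) achieves the same without touching the code.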
Can anyone help me install OpenCV with CUDA enabled on Ubuntu 18.04, step by step? I downloaded the OpenCV project from master. Then I created a folder called binary and inside it I opened cmake-gui. I checked WITH_CUDA and WITH_CUDNN. I am following these instructions: https://docs.opencv.org/master/d7/d9f/tutorial_linux_install.html
But after compiling and installing it successfully, I cannot pass the tests for dnn, and in my Python app the CUDA backends are not recognized.
Which parameters exactly do I have to check to compile it? I'm running the latest CUDA 10.1 and cuDNN 7.6.
By the tests, I mean running the program ./opencv/build/bin/opencv_test_dnn; on some page I read that this program should run fine if everything is working. I have an extra question about the CUDA_ARCH_BIN option. When I'm building, it says the minimum to run is 5.3, so I'm putting 5.3, 6.0 and 7.5. Does this number refer to my GPU or to the CUDA drivers? The NVIDIA website says my GPU is 5.0 for compatibility, but my installed CUDA is 10.1 and cuDNN is 7.6.
About the option you mention, I already had it turned on. I'm copying what I obtained from CMake when generating. I didn't have any errors while compiling it or installing it afterwards. (I run all the commands with sudo.)
```
Detected processor: x86_64
General configuration for OpenCV 4.1.2-dev =====================================
  Parallel framework:            pthreads
  Trace:                         YES (with Intel ITT)
  Other third-party libraries:
    NVIDIA CUDA:                 YES (ver 10.1, CUFFT CUBLAS)
    cuDNN:                       YES (ver 7.6.2)
  OpenCL:                        YES (no extra features)
  Python (for build):            /usr/bin/python2.7
  Install to:                    /usr/local
```
I'm running the Jetson Nano and getting outstanding performance on the classification networks running on the GPU. However, I'm not seeing any improvement on the object detection networks over CPU inference (unlike what was seen in the x86 benchmarks published here). I'm using jtop to monitor the system and I can see the models getting loaded to the GPU and the system using the GPU. Any suggestions or thoughts, like pre-processing the object detection models? I notice that JetPack uses the UFF model format; that's why I ask.
There are several issues:
It depends on which model you are running. MobileNet SSD suffers from issue 1. All Faster RCNN based detectors suffer from issue 2. Issue 2 will be resolved before Christmas.
You can set
Do not take this post very seriously. My end-semester exams are going on, and I just scrambled something together to test Darknet and the OpenCV CUDA backend. I will be doing this again next week on different devices.
YOLO v3 Investigation
The OpenCV CUDA backend uses the CPU to perform NMS. In region layer, the data is moved from the GPU to the host for performing NMS.
If we look at the layerwise timings, the OpenCV CUDA backend beats Darknet on the route, resize, shortcut, etc. layers. The region layer performed very badly compared to Darknet, probably due to the NMS being performed on the CPU and the device-to-host data transfer involved.
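For reference, the host-side work the region layer hands off is greedy non-maximum suppression. A minimal pure-Python sketch of the algorithm (an illustration only, not the backend's actual implementation; boxes are hypothetical `(x1, y1, x2, y2)` tuples):

```python
# Greedy NMS sketch: keep the highest-scoring box, drop any box that
# overlaps a kept box beyond the IoU threshold, and repeat.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the second box overlaps the first heavily
```

The per-pair IoU loop is what makes this O(n²) in the worst case and non-trivial to parallelize naively, which is part of why it often stays on the CPU.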
INVALID BENCHMARK: NMS not included in Darknet bench
I have forced synchronization after every layer in both darknet and the OpenCV CUDA backend to allow the layerwise timings to be measured. Both suffer equally (GPU goes idle during layer switch). The code used to obtain layerwise timings of the CUDA backend is similar to: YashasSamaga@55ad843. I uncommented the timing code for darknet.
Darknet measures timings using
Darknet mostly includes the ReLU timings in the convolution timings. The convolution time plus ReLU time from the CUDA backend correlates with Darknet's convolution time.
The CUDA backend does not support tensor cores yet. It's trivial to enable them for cuDNN. Darknet will benefit from tensor cores on 7.x compute-capability GPUs.
@charlestamz on what device did you benchmark and how?
INVALID BENCHMARK: NMS not included in Darknet bench
@charlestamz The FPS reported by the object detector sample depends on the camera FPS and what not.
I've tested your cuda backend on jetson nano and it worked flawlessly at 28fps when executed with tiny-yolov3 (320x320) in normal c++ code!
Now I've integrated it with the dnn_detect ROS node, which works fine for about 100 images but then crashes the Jetson entirely.
edit: found out the 5A power supply wasn't adequate! Switching the jetson nano to 5W mode fixed the crashing!
I think I had earlier made a mistake in reporting the YOLOv3 timings for the CUDA backend and Darknet. I hadn't included NMS timings in the Darknet bench. I also hadn't taken the average of several runs using a loop (instead I took the average of what Darknet reported in its output across several runs).
Warmup runs: 3
Benchmark code (for both darknet and opencv): https://gist.github.com/YashasSamaga/26eb2eb16be2cc749e3394d300a7585e
DISCLAIMER: I am not very comfortable editing darknet code but I hope the benchmark is fair and correct (would be great if somebody could validate).
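The warmup-plus-averaged-loop protocol above can be sketched as follows (the `run_once` workload here is a hypothetical stand-in; in a real bench it would be a full forward pass including NMS, so both frameworks are measured end to end):

```python
import time

def benchmark(fn, warmup=3, runs=100):
    """Call fn a few times to warm up, then return the mean wall time in ms."""
    for _ in range(warmup):
        fn()  # warmup: caches, lazy allocations, cuDNN autotuning, etc.
    start = time.perf_counter()
    for _ in range(runs):
        fn()  # the timed region includes everything fn does, NMS included
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in workload for illustration only.
def run_once():
    sum(i * i for i in range(10000))

mean_ms = benchmark(run_once)
print(f"{mean_ms:.3f} ms per run")
```

Averaging a timed loop this way avoids the earlier mistake of averaging per-run numbers printed by the framework, which may exclude pieces of the pipeline such as NMS.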
NOTE: I have an experimental patch (part of a more general experimental graph patch) which can save another 2ms. YOLOv3 has three region layers. The output of the first two are copied to the CPU for NMS simultaneously as the GPU continues computing the remaining layers. This way the GPU to CPU memory transfer of the two region layers can be completely hidden. The NMS for the first two region layers begin on CPU even before the forward pass on GPU finishes fully.
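A toy sketch of that overlap idea, with sleeps standing in for GPU compute and for host-side NMS (the layer names and timings are illustrative, not the patch's actual code):

```python
import threading
import time

results = []

def cpu_nms(layer):
    # Stand-in for host-side NMS on one region layer's output.
    time.sleep(0.01)
    results.append(f"nms:{layer}")

# The "GPU" produces region-layer outputs one after another; as soon as an
# output is copied to the host, its NMS starts on a CPU thread while the
# "GPU" keeps computing the remaining layers, hiding the transfer latency.
nms_threads = []
for layer in ["region1", "region2", "region3"]:
    time.sleep(0.01)  # stand-in for GPU compute producing this output
    t = threading.Thread(target=cpu_nms, args=(layer,))
    t.start()
    nms_threads.append(t)

for t in nms_threads:
    t.join()
print(sorted(results))  # → ['nms:region1', 'nms:region2', 'nms:region3']
```

The win comes from the first two NMS calls running concurrently with the tail of the forward pass rather than after it.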
There are a few open PRs for ROI pooling and CropAndResize. These should improve the performance of many detection models.