CUDA backend for the DNN module #14827
Conversation
Force-pushed from 5717c7f to 359bf93
Good progress! Please note that we usually do not merge large code parts without corresponding tests, so consider working on this GSoC task in a single PR (unless you have another agreement with your mentor). Some build-related comments are below.
Force-pushed from 46db2b1 to fbd05d3
This comment has been minimized.
Do I have to use …? Can I use …? Is it fine to force push occasionally when there isn't any dependent stuff, like reviews, in between?
This comment has been minimized.
It is used to avoid excessive merge issues from …. Feel free to use …. In the master branch it is just a wrapper, so it should do the same things. It is OK.
Force-pushed from c8fd75b to 30b294e
Force-pushed from 79c65f0 to 2941d74
Force-pushed from 39837c8 to b89d7e0
This comment has been minimized.
davisking
commented
Jul 21, 2019
Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined …
Force-pushed from a818297 to 3584d72
This comment has been minimized.
Yes, I will be attempting it in the last week of November unless somebody beats me to it. It will mostly require API changes; I am not sure what the API should look like.
This comment has been minimized.
charlestamz
commented
Nov 13, 2019
@YashasSamaga you did an awesome job.
This comment has been minimized.
charlestamz
commented
Nov 13, 2019
Hi all, …
This comment has been minimized.
You can use …. You have to select the device before creating the ….
This comment has been minimized.
charlestamz
commented
Nov 14, 2019
@YashasSamaga great, it worked. Thank you.
This comment has been minimized.
charlestamz
commented
Nov 15, 2019
@YashasSamaga but there is still another problem.
This comment has been minimized.
charlestamz
commented
Nov 15, 2019
•
@YashasSamaga However, the process still took some memory on GPU 0.
This comment has been minimized.
Devices are associated with each CPU thread. When you set the device using …, it applies to the current thread. You want the entire process to use a single GPU? Then you have to set the device at the very beginning, before any device is used. You can also control the device that is used externally by setting ….
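To illustrate the "externally" option: in standard CUDA, the `CUDA_VISIBLE_DEVICES` environment variable restricts which GPUs a process can see (the truncated comment above does not name the mechanism, so treating it as this variable is an assumption). It must be set before the first CUDA call in the process, e.g. before any library initializes a CUDA context:

```python
import os

# Expose only physical GPU 1 to this process; CUDA then enumerates it
# as device 0. This must happen before any CUDA context is created
# (i.e. before importing/initializing CUDA-backed libraries).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting it from the shell (`CUDA_VISIBLE_DEVICES=1 ./app`) achieves the same effect without touching the code.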
This comment has been minimized.
Scolymus
commented
Nov 17, 2019
Can anyone help me install OpenCV with CUDA enabled on Ubuntu 18.04, step by step? I downloaded the OpenCV project from master, created a folder called binary, and opened cmake-gui inside it. I checked WITH_CUDA and WITH_CUDNN. I am following these instructions: https://docs.opencv.org/master/d7/d9f/tutorial_linux_install.html But after compiling and installing successfully, I cannot pass the tests for dnn, and in my Python app the CUDA backends are not recognized. Which parameters exactly do I have to check to compile it? I'm running the latest CUDA 10.1 and cuDNN 7.6.
This comment has been minimized.
@Scolymus You have to mark ….
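The exact option the truncated reply refers to is not preserved, but in current OpenCV the DNN CUDA backend is enabled by the `OPENCV_DNN_CUDA` CMake option alongside `WITH_CUDA` and `WITH_CUDNN`. A sketch of a typical configure step (the architecture list and build directory are assumptions to adapt to your GPU and checkout):

```shell
# Run from an empty build directory inside the OpenCV source tree.
# CUDA_ARCH_BIN must list your GPU's compute capability; the DNN
# backend's FP16 target additionally requires CC >= 5.3.
cmake -D CMAKE_BUILD_TYPE=Release \
      -D WITH_CUDA=ON \
      -D WITH_CUDNN=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D CUDA_ARCH_BIN=7.5 \
      ..
```

If `OPENCV_DNN_CUDA` is off, the build still links CUDA for other modules, but the DNN module silently falls back to CPU-only backends.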
This comment has been minimized.
Scolymus
commented
Nov 18, 2019
By the tests I'm referring to running ./opencv/build/bin/opencv_test_dnn; at some page I read this program should run fine if everything is working. I have an extra question about the CUDA_ARCH_BIN option. When building, it says the minimum to run is 5.3, and I'm putting 5.3, 6.0 and 7.5. Does this number refer to my GPU or to the CUDA drivers? The NVIDIA website says my GPU is 5.0 for compatibility, but my installed CUDA is 10.1 and cuDNN is 7.6. About the option you mention, I already had it turned on. Here is the relevant part of the CMake output (I had no errors while compiling or installing; I run all the commands with sudo):
Detected processor: x86_64
General configuration for OpenCV 4.1.2-dev =====================================
  NVIDIA CUDA: YES (ver 10.1, CUFFT CUBLAS)
  cuDNN: YES (ver 7.6.2)
  OpenCL: YES (no extra features)
  Python (for build): /usr/bin/python2.7
  Install to: /usr/local
Configuring done
This comment has been minimized.
@Scolymus GPUs with CC below 5.3 are not supported. The CUDA backend provides a half-precision target which requires features present in devices with CC 5.3 and above.
This comment has been minimized.
sgriset
commented
Nov 19, 2019
I'm running the Jetson Nano and getting outstanding performance on the classification networks running on the GPU. However, I'm not seeing any improvement on the object detection networks over CPU inferencing (unlike what was seen in the x86 benchmarks published here). I'm using jtop to monitor the system and I can see the models getting loaded to the GPU and the system using the GPU. Any suggestions or thoughts, like pre-processing the object detection models? I notice that JetPack uses the UFF model format; that's why I ask.
This comment has been minimized.
There are several issues: …
It depends on which model you are running. MobileNet SSD suffers from issue 1; all Faster RCNN based detectors suffer from issue 2. Issue 2 will be resolved before Christmas. You can set ….
This comment has been minimized.
isra60
commented
Nov 19, 2019
What about the YOLO detector? Have you tested performance against Darknet? https://github.com/AlexeyAB/darknet
This comment has been minimized.
charlestamz
commented
Nov 22, 2019
@isra60 I didn't run a full test; in my test, the new DNN module (the master branch) runs at about 2/3 the FPS of https://github.com/AlexeyAB/darknet.
This comment has been minimized.
Do not take this post very seriously. I have my end-semester exams going on and just scrambled something together to test Darknet and the OpenCV CUDA backend. I will be doing this again next week on different devices.
YOLO v3 Investigation
The OpenCV CUDA backend uses the CPU to perform NMS. In the region layer, the data is moved from the GPU to the host for performing NMS. If we look at the layerwise timings, the OpenCV CUDA backend beats Darknet in the route, resize, shortcut, etc. layers. The region layer performed very badly compared to Darknet, probably due to the NMS being performed on the CPU and the D2H data transfer involved.
Timings: INVALID BENCHMARK: NMS not included in Darknet bench
Layerwise timings
I have forced synchronization after every layer in both Darknet and the OpenCV CUDA backend so that layerwise timings can be measured. Both suffer equally (the GPU goes idle during layer switches). The code used to obtain layerwise timings of the CUDA backend is similar to YashasSamaga@55ad843. I uncommented the timing code for Darknet. Darknet measures timings using …. Update: OpenCV layerwise timings (using …); OpenCV layerwise timings (using the CUDA event API). Darknet mostly includes the ReLU timings in the convolution timings; the convolution time + ReLU time from the CUDA backend correlate with Darknet's convolution time.
Notes:
The CUDA backend does not support tensor cores. It's trivial to enable them for cuDNN. Darknet will benefit from tensor cores on 7.x CC GPUs.
@charlestamz on what device did you benchmark and how?
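For context on the CPU-side NMS being discussed: non-maximum suppression is the classic greedy IoU-based filter. A minimal pure-Python sketch (not OpenCV's actual implementation, just the idea) looks like:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    discard remaining boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Box 1 heavily overlaps box 0 and has a lower score, so it is dropped.
print(nms([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # -> [0, 2]
```

In application code one would normally call cv2.dnn.NMSBoxes rather than hand-rolling this; the sketch is only to show why the step is sequential and CPU-friendly rather than a natural fit for the GPU.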
This comment has been minimized.
charlestamz
commented
Nov 23, 2019
@YashasSamaga …
This comment has been minimized.
INVALID BENCHMARK: NMS not included in Darknet bench
@charlestamz The FPS reported by the object detector sample depends on the camera FPS, among other things.
This comment has been minimized.
phil-ludewig
commented
Dec 3, 2019
•
I've tested your CUDA backend on the Jetson Nano and it worked flawlessly at 28 FPS when executed with tiny-YOLOv3 (320x320) in plain C++ code! I've now integrated it with the dnn_detect ROS node, which works fine for about 100 images but then crashes the Jetson entirely.
edit: found out the 5A power supply wasn't adequate! Switching the Jetson Nano to 5W mode fixed the crashing!
This comment has been minimized.
albertchristianto
commented
Dec 6, 2019
@charlestamz So OpenCV supports the CUDA backend in the latest release, doesn't it?
This comment has been minimized.
@albertchristianto, no, it was merged after the 4.1.2 release. Check the dates.
This comment has been minimized.
thedevleon
commented
Dec 7, 2019
Does someone have a precompiled binary for Windows x64? I'm having trouble building OpenCV with CUDA because of clashing CUDA versions and various other issues.
This comment has been minimized.
I think I had earlier made a mistake in reporting the YOLOv3 timings for the CUDA backend and Darknet: I hadn't included NMS timings in the Darknet bench, and I also hadn't taken the average of several runs using a loop (instead I took the average of what Darknet said in its output across several runs).
YOLO Benchmark
Warmup runs: 3
Benchmark code (for both Darknet and OpenCV): https://gist.github.com/YashasSamaga/26eb2eb16be2cc749e3394d300a7585e
DISCLAIMER: I am not very comfortable editing Darknet code, but I hope the benchmark is fair and correct (it would be great if somebody could validate it).
NOTE: I have an experimental patch (part of a more general experimental graph patch) which can save another 2 ms. YOLOv3 has three region layers. The output of the first two is copied to the CPU for NMS while the GPU continues computing the remaining layers. This way the GPU-to-CPU memory transfer of the two region layers can be completely hidden, and the NMS for the first two region layers begins on the CPU even before the forward pass on the GPU fully finishes.
Please note the corrections @crackwitz @charlestamz @isra60 @sgriset
There are a few open PRs for ROI pooling and CropAndResize. These should improve the performance of many detection models.
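The methodology described above (a few warmup runs, then an average over a timed loop rather than trusting per-run printed numbers) can be sketched generically; the workload below is a placeholder for something like a net.forward() call:

```python
import time

def benchmark(fn, warmup=3, runs=100):
    """Call `fn` a few times untimed (to warm caches, lazy allocations,
    GPU clocks, etc.), then return the mean wall-clock seconds per call
    over `runs` timed iterations."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Example with a trivial CPU workload standing in for inference.
mean_s = benchmark(lambda: sum(range(1000)))
print(f"mean time per run: {mean_s * 1e6:.1f} us")
```

Averaging over a loop this way also naturally folds post-processing such as NMS into the measured time, which is exactly the correction made to the Darknet numbers above.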
YashasSamaga commented Jun 18, 2019
•
edited
How to build and use the CUDA backend?
How to use multiple GPUs?
Benchmarks
Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM
Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d
Current Support Matrix: (not updated)
Known issues:
Ideas:
CUDNN_FUSED_CONV_SCALE_BIAS_ADD_ACTIVATION
References: #14585
Results: