
CUDA backend for the DNN module #14827

Merged
129 commits merged into master from YashasSamaga:cuda4dnn-csl-low on Oct 21, 2019

Conversation


YashasSamaga commented Jun 18, 2019

How to build and use the CUDA backend?

How to use multiple GPUs?

Benchmarks

Demo Video: https://www.youtube.com/watch?v=ljCfluWYymM

Project summary/benchmarks: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d
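For quick reference, selecting the new backend from user code looks like the sketch below (a minimal sketch: the model path, input image, and blob parameters are placeholders, and the build must have the CUDA backend enabled):

```cpp
// Minimal usage sketch: run a network on the CUDA backend.
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    cv::dnn::Net net = cv::dnn::readNet("model.onnx");   // placeholder model
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);   // FP32 CUDA target

    cv::Mat image = cv::imread("image.jpg");             // placeholder input
    cv::Mat blob = cv::dnn::blobFromImage(image, 1.0, cv::Size(224, 224));
    net.setInput(blob);
    cv::Mat out = net.forward();
}
```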

Current Support Matrix: (not updated)

| Blip | Meaning |
| --- | --- |
| ✔️ | supports all the configurations that are supported by all the existing backends (and might support more than what's currently supported) |
| 🔵 | partially supported (falls back to CPU for unsupported configurations) |
| ❌ | not supported (falls back to CPU) |

| Layer | Status | Constraints | Notes |
| --- | --- | --- | --- |
| Activations | ✔️ | | |
| Batch Normalization | ✔️ | | |
| Blank Layer | ✔️ | | |
| Concat Layer | ✔️ | | |
| Const Layer | ✔️ | | |
| Convolution 2d | ✔️ | | asymmetric padding is disabled in the layer constructor but the backend supports it |
| Convolution 3d | ✔️ | | asymmetric padding is disabled in the layer constructor but the backend supports it |
| Crop and resize | ❌ | | |
| Crop Layer | ✔️ | | forwarded to Slice Layer |
| Detection Output Layer | ❌ | | |
| Deconvolution 2d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Deconvolution 3d | 🔵 | padding configuration should not lead to extra uneven padding | |
| Elementwise Layers | ✔️ | | |
| Eltwise Layer | ✔️ | | |
| Flatten Layer | ✔️ | | |
| Fully Connected Layer | ✔️ | | |
| Input Layer | ❌ | | |
| Interp Layer | ✔️ | | |
| Local Response Normalization | ✔️ | | |
| Max Unpooling 2d | ✔️ | | |
| Max Unpooling 3d | ✔️ | | |
| MVN Layer | ❌ | | |
| Normalize Layer | 🔵 | only L1 and L2 norms are supported | |
| Padding Layer | ✔️ | | |
| Permute Layer | ✔️ | | |
| Pooling 2d | 🔵 | only max and average pooling are supported | supports asymmetric padding |
| Pooling 3d | 🔵 | only max and average pooling are supported | supports asymmetric padding |
| Prior Box Layer | ✔️ | | |
| Proposal Layer | ❌ | | |
| Region Layer | ✔️ | | NMS is performed on the CPU |
| Reorg Layer | ✔️ | | |
| Reshape Layer | ✔️ | | |
| Resize Layer | ✔️ | | |
| Scale Layer | ✔️ | | |
| Shift Layer | ✔️ | | forwarded to Scale Layer |
| Shuffle Channel Layer | ✔️ | | |
| Slice Layer | ✔️ | | |
| Softmax Layer | ✔️ | | |
| Split Layer | ✔️ | | |
| LSTM Layer | ❌ | | |

Known issues:

  1. Tests for some of the SSD-based networks fail on the Jetson Nano

Ideas:

  1. Fuse conv, scale, bias, ReLU and eltwise (CUDNN_FUSED_CONV_SCALE_BIAS_ADD_ACTIVATION)
  2. Enable concat fusion
  3. Fuse ReLU with the bias addition in the convolution layer

References: #14585

Results:

force_builders_only=Custom,linux,docs
buildworker:Custom=linux-4
docker_image:Custom=ubuntu-cuda:18.04
@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch 2 times, most recently from 5717c7f to 359bf93 Jun 18, 2019

alalek left a comment

Good progress!

Please note that we usually do not merge large code contributions without corresponding tests.
Also, we prefer to merge completed tasks rather than helper parts.

So, consider working on this GSoC task in a single PR (unless you have a different agreement with your mentor).

Some build-related comments are below.

modules/dnn/src/cuda4dnn/csl/cudnn.cpp (review comments resolved)
modules/dnn/src/cuda4dnn/csl/stream.cpp (review comments resolved)
modules/dnn/include/opencv2/dnn/csl/cublas.hpp (review comments resolved)
@YashasSamaga YashasSamaga changed the title from "add low-level CSL components for cuda4dnnn" to "[WIP] CUDA backend for the DNN module" Jun 21, 2019
@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch from 46db2b1 to fbd05d3 Jun 21, 2019

YashasSamaga commented Jun 21, 2019

Do I have to use CV_OVERRIDE and CV_FINAL? I presume they were added for portability, but now that both final and override are keywords in C++11, should the macros still be used?

Can I use std::shared_ptr instead of cv::Ptr? There isn't a make_shared equivalent, and makePtr doesn't do what std::make_shared does.

Is it fine to force-push occasionally when there isn't anything dependent, like reviews, in between?


alalek commented Jun 21, 2019

> CV_OVERRIDE and CV_FINAL

They are used to avoid excessive merge issues from the 3.4 branch.
Since your code is in the master branch only, this problem does not apply, so you can use the C++ keywords/modifiers.

> use std::shared_ptr instead of cv::Ptr

Feel free to use std::shared_ptr (but it is not supported by the bindings generator, so be careful with the public API).

> makePtr doesn't do what std::make_shared does.

In the master branch it is just a wrapper, so it should do the same thing.

> Is it fine to force push

It is OK.
Also, rebasing is preferred over "merge" commits (it is easy to do with one squashed commit: squash first, then rebase).

@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch from c8fd75b to 30b294e Jun 25, 2019
@alalek alalek mentioned this pull request Jun 26, 2019
@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch 3 times, most recently from 79c65f0 to 2941d74 Jun 28, 2019
@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch from 39837c8 to b89d7e0 Jul 16, 2019

davisking commented Jul 21, 2019

Seems like it would be implementation-defined at worst, rather than UB. Are you sure it's UB? If it's OK in C++17 and works in our case, I think it's fine. I would be surprised if some compilers defined std::iterator_traits<T>::iterator_category for non-iterators in C++11.

@YashasSamaga YashasSamaga force-pushed the YashasSamaga:cuda4dnn-csl-low branch 2 times, most recently from a818297 to 3584d72 Jul 25, 2019

YashasSamaga commented Nov 3, 2019

Yes, I will be attempting it in the last week of November unless somebody beats me to it. It will mostly require API changes. I am not sure what the API should look like.


charlestamz commented Nov 13, 2019

@YashasSamaga you did an awesome job.
I have a question: with OPENCV_DNN_CUDA, how can I choose the GPU id in the module?


charlestamz commented Nov 13, 2019

Hi all,
I have 2 GPUs on my PC. Is it possible to choose which one dnn forward runs on? I found that all the load was put on the first GPU.


YashasSamaga commented Nov 13, 2019

You can use cudaSetDevice() from cuda_runtime.h (CUDA API headers) or you can use cv::cuda::setDevice().

You have to select the device before creating the cv::dnn::Net object. The cv::dnn::Net object will be associated with that device. You should also ensure that the same device is set before calling cv::dnn::Net::forward. In my opinion, it's best to have separate threads for each device. You can find more information here.
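A minimal sketch of that per-thread pattern (assumes two GPUs; the worker function and model path are placeholders, not part of any API):

```cpp
// Sketch: one CPU thread per GPU; each thread selects its device before
// creating its Net, so the Net gets associated with that device.
#include <opencv2/core/cuda.hpp>
#include <opencv2/dnn.hpp>
#include <string>
#include <thread>

static void worker(int device_id, const std::string& model)  // hypothetical helper
{
    cv::cuda::setDevice(device_id);               // bind this CPU thread to the GPU
    cv::dnn::Net net = cv::dnn::readNet(model);   // Net is now tied to device_id
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
    // ... setInput()/forward() on this same thread, with the same device set ...
}

int main()
{
    std::thread t0(worker, 0, "model.onnx");      // placeholder model path
    std::thread t1(worker, 1, "model.onnx");
    t0.join();
    t1.join();
}
```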


charlestamz commented Nov 14, 2019

@YashasSamaga great, it worked. Thank you.


charlestamz commented Nov 15, 2019

@YashasSamaga but there is still another problem.
After I added cv::cuda::setDevice() before creating a cv::dnn::Net object, the object was indeed associated with the device I specified.
However, the process also put load on the default GPU (GPU 0), and I did run some imgproc functions before using dnn forward.
I built OpenCV with CUDA, so maybe OpenCV did some magic in the imgproc functions.
Is it possible to run the whole process only on the GPU I specified?


YashasSamaga commented Nov 15, 2019

@charlestamz

```cpp
cv::cuda::setDevice(0);
auto net = cv::dnn::readNet("", "");

cv::cuda::setDevice(1);
// imgproc stuff

cv::cuda::setDevice(0);
net.setInput(blob);
auto output = net.forward();
```

charlestamz commented Nov 15, 2019

@YashasSamaga
my code looks like:

```cpp
cv::cuda::setDevice(1);
auto net = cv::dnn::readNet("", "");
cv::cuda::setDevice(1);

// videoio stuff
// imgproc stuff

cv::cuda::setDevice(1);
net.setInput(blob);
auto output = net.forward();
```

However, the process still took some memory on GPU 0.

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1325      G   /usr/lib/xorg/Xorg                            69MiB |
|    0     32235      C   ./vision3                                    107MiB |
|    1     32235      C   ./vision3                                    519MiB |
+-----------------------------------------------------------------------------+
```

YashasSamaga commented Nov 15, 2019

Devices are associated with each CPU thread. When you set the device using cv::cuda::setDevice(int), you are setting the device that the calling CPU thread must use for all operations.

If you want the entire process to use a single GPU, you have to set the device at the very beginning, before any device is used.

You can also control which device is used externally by setting the CUDA_VISIBLE_DEVICES environment variable.
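A minimal sketch of the first approach, under the assumption that no CUDA call has happened earlier in the process:

```cpp
// Sketch: pin this thread (and everything it runs) to GPU 1 by selecting
// the device before any other CUDA call.
#include <opencv2/core/cuda.hpp>

int main()
{
    cv::cuda::setDevice(1);
    // ... videoio / imgproc / dnn code follows; it all uses GPU 1 ...
}
```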


Scolymus commented Nov 17, 2019

Can anyone help me install OpenCV with CUDA enabled on Ubuntu 18.04, step by step? I downloaded the OpenCV project from master. Then I created a folder called binary, and inside it I opened cmake-gui. I checked WITH_CUDA and WITH_CUDNN. I am following these instructions: https://docs.opencv.org/master/d7/d9f/tutorial_linux_install.html

But after compiling and installing it successfully, I cannot pass the tests for dnn, and in my Python app the CUDA backends are not recognized.

Which are the exact parameters I have to check to compile it? I'm running the latest CUDA 10.1 and cuDNN 7.6.


YashasSamaga commented Nov 18, 2019

@Scolymus You have to enable OPENCV_DNN_CUDA as well. What do you mean by "cannot pass the tests for dnn"?

@alalek alalek removed their request for review Nov 18, 2019

Scolymus commented Nov 18, 2019

@YashasSamaga

By tests I mean running the program ./opencv/build/bin/opencv_test_dnn; somewhere I read that this program should run fine if everything is working. I also have an extra question about the CUDA_ARCH_BIN option: when building, it says the minimum is 5.3, and I'm putting 5.3, 6.0 and 7.5. Does this number refer to my GPU or to the CUDA drivers? The NVIDIA website says my GPU's compute capability is 5.0, but my installed CUDA is 10.1 and cuDNN is 7.6.

About the option you mention, I already had it turned on. I'm copying what I obtained from CMake when generating; I didn't have any errors while compiling or installing afterwards. (I run all the commands with sudo.)

```
Detected processor: x86_64
Looking for ccache - not found
Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found suitable version "1.2.11", minimum required is "1.2.3")
Could NOT find Jasper (missing: JASPER_LIBRARIES JASPER_INCLUDE_DIR)
Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
Checking for module 'gtk+-3.0'
No package 'gtk+-3.0' found
found Intel IPP (ICV version): 2019.0.0 [2019.0.0 Gold]
at: /home/scolymus/Downloads/opencv-master/build/3rdparty/ippicv/ippicv_lnx/icv
found Intel IPP Integration Wrappers sources: 2019.0.0
at: /home/scolymus/Downloads/opencv-master/build/3rdparty/ippicv/ippicv_lnx/iw
CUDA detected: 10.1
CUDA NVCC target flags: -gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-D_FORCE_INLINES
Could not find OpenBLAS include. Turning OpenBLAS_FOUND off
Could not find OpenBLAS lib. Turning OpenBLAS_FOUND off
Could NOT find Atlas (missing: Atlas_CLAPACK_INCLUDE_DIR)
A library with BLAS API found.
A library with LAPACK API found.
Could NOT find JNI (missing: JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
VTK is not found. Please set -DVTK_DIR in CMake to VTK build directory, or to VTK install subdirectory with VTKConfig.cmake file
OpenCV Python: during development append to PYTHONPATH: /home/scolymus/Downloads/opencv-master/build/python_loader
Checking for module 'gstreamer-base-1.0'
No package 'gstreamer-base-1.0' found
Checking for module 'gstreamer-app-1.0'
No package 'gstreamer-app-1.0' found
Checking for module 'gstreamer-riff-1.0'
No package 'gstreamer-riff-1.0' found
Checking for module 'gstreamer-pbutils-1.0'
No package 'gstreamer-pbutils-1.0' found
Caffe: NO
Protobuf: NO
Glog: NO
freetype2: YES (ver 21.0.15)
harfbuzz: YES (ver 1.7.2)
Could NOT find HDF5 (missing: HDF5_LIBRARIES HDF5_INCLUDE_DIRS) (found version "")
Module opencv_ovis disabled because OGRE3D was not found
No preference for use of exported gflags CMake configuration set, and no hints for include/library directories provided. Defaulting to preferring an installed/exported gflags CMake configuration if available.
Failed to find installed gflags CMake configuration, searching for gflags build directories exported with CMake.
Failed to find gflags - Failed to find an installed/exported CMake configuration for gflags, will perform search for installed gflags components.
Failed to find gflags - Could not find gflags include directory, set GFLAGS_INCLUDE_DIR to directory containing gflags/gflags.h
Failed to find glog - Could not find glog include directory, set GLOG_INCLUDE_DIR to directory containing glog/logging.h
Module opencv_sfm disabled because the following dependencies are not found: Eigen Glog/Gflags
Checking for module 'tesseract'
No package 'tesseract' found
Tesseract: NO
Registering hook 'INIT_MODULE_SOURCES_opencv_dnn': /home/scolymus/Downloads/opencv-master/modules/dnn/cmake/hooks/INIT_MODULE_SOURCES_opencv_dnn.cmake
xfeatures2d/boostdesc: Download: boostdesc_bgm.i
xfeatures2d/boostdesc: Download: boostdesc_bgm_bi.i
xfeatures2d/boostdesc: Download: boostdesc_bgm_hd.i
xfeatures2d/boostdesc: Download: boostdesc_binboost_064.i
xfeatures2d/boostdesc: Download: boostdesc_binboost_128.i
xfeatures2d/boostdesc: Download: boostdesc_binboost_256.i
xfeatures2d/boostdesc: Download: boostdesc_lbgm.i
xfeatures2d/vgg: Download: vgg_generated_48.i
xfeatures2d/vgg: Download: vgg_generated_64.i
xfeatures2d/vgg: Download: vgg_generated_80.i
xfeatures2d/vgg: Download: vgg_generated_120.i
data: Download: face_landmark_model.dat
NVIDIA_OPTICAL_FLOW: Download: 79c6cee80a2df9a196f20afd6b598a9810964c32.zip

General configuration for OpenCV 4.1.2-dev =====================================
Version control: unknown

Extra modules:
Location (extra): /home/scolymus/Downloads/opencv-master/modules/opencv_contrib-master/modules
Version control (extra): unknown

Platform:
Timestamp: 2019-11-18T10:58:33Z
Host: Linux 4.15.0-64-generic x86_64
CMake: 3.10.2
CMake generator: Unix Makefiles
CMake build tool: /usr/bin/make
Configuration: Release

CPU/HW features:
Baseline: SSE SSE2 SSE3
requested: SSE3
Dispatched code generation: SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
requested: SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
SSE4_1 (16 files): + SSSE3 SSE4_1
SSE4_2 (2 files): + SSSE3 SSE4_1 POPCNT SSE4_2
FP16 (1 files): + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
AVX (5 files): + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
AVX2 (29 files): + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
AVX512_SKX (6 files): + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

C/C++:
Built as dynamic libs?: YES
C++ Compiler: /usr/bin/c++ (ver 7.4.0)
C++ flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG -DNDEBUG
C++ flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g -O0 -DDEBUG -D_DEBUG
C Compiler: /usr/bin/cc
C flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG -DNDEBUG
C flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections -msse -msse2 -msse3 -fvisibility=hidden -g -O0 -DDEBUG -D_DEBUG
Linker flags (Release): -Wl,--gc-sections
Linker flags (Debug): -Wl,--gc-sections
ccache: NO
Precompiled headers: NO
Extra dependencies: m pthread cudart_static dl rt nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda-10.1/lib64 -L/usr/lib/x86_64-linux-gnu
3rdparty dependencies:

OpenCV modules:
To be built: aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hfs highgui img_hash imgcodecs imgproc line_descriptor ml objdetect optflow phase_unwrapping photo plot python2 python3 quality reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab xfeatures2d ximgproc xobjdetect xphoto
Disabled: world
Disabled by dependency: -
Unavailable: cnn_3dobj cvv hdf java js matlab ovis sfm viz
Applications: tests perf_tests apps
Documentation: NO
Non-free algorithms: NO

GUI:
GTK+: YES (ver 2.24.32)
GThread : YES (ver 2.56.4)
GtkGlExt: NO
VTK support: NO

Media I/O:
ZLib: /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.11)
JPEG: /usr/lib/x86_64-linux-gnu/libjpeg.so (ver 80)
WEBP: /usr/lib/x86_64-linux-gnu/libwebp.so (ver encoder: 0x020e)
PNG: /usr/lib/x86_64-linux-gnu/libpng.so (ver 1.6.34)
TIFF: /usr/lib/x86_64-linux-gnu/libtiff.so (ver 42 / 4.0.9)
JPEG 2000: build (ver 1.900.1)
OpenEXR: build (ver 2.3.0)
HDR: YES
SUNRASTER: YES
PXM: YES
PFM: YES

Video I/O:
DC1394: YES (2.2.5)
FFMPEG: YES
avcodec: YES (57.107.100)
avformat: YES (57.83.100)
avutil: YES (55.78.100)
swscale: YES (4.8.100)
avresample: YES (3.7.0)
GStreamer: NO
v4l/v4l2: YES (linux/videodev2.h)

Parallel framework: pthreads

Trace: YES (with Intel ITT)

Other third-party libraries:
Intel IPP: 2019.0.0 Gold [2019.0.0]
at: /home/scolymus/Downloads/opencv-master/build/3rdparty/ippicv/ippicv_lnx/icv
Intel IPP IW: sources (2019.0.0)
at: /home/scolymus/Downloads/opencv-master/build/3rdparty/ippicv/ippicv_lnx/iw
Lapack: NO
Eigen: NO
Custom HAL: NO
Protobuf: build (3.5.1)

NVIDIA CUDA: YES (ver 10.1, CUFFT CUBLAS)
NVIDIA GPU arch: 53 60 61 70 75
NVIDIA PTX archs:

cuDNN: YES (ver 7.6.2)

OpenCL: YES (no extra features)
Include path: /home/scolymus/Downloads/opencv-master/3rdparty/include/opencl/1.2
Link libraries: Dynamic load

Python 2:
Interpreter: /usr/bin/python2.7 (ver 2.7.15)
Libraries: /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.15+)
numpy: /home/scolymus/.local/lib/python2.7/site-packages/numpy/core/include (ver 1.16.5)
install path: lib/python2.7/dist-packages/cv2/python-2.7

Python 3:
Interpreter: /usr/bin/python3 (ver 3.6.8)
Libraries: /usr/lib/x86_64-linux-gnu/libpython3.6m.so (ver 3.6.8)
numpy: /home/scolymus/.local/lib/python3.6/site-packages/numpy/core/include (ver 1.17.4)
install path: lib/python3.6/dist-packages/cv2/python-3.6

Python (for build): /usr/bin/python2.7

Java:
ant: NO
JNI: NO
Java wrappers: NO
Java tests: NO

Install to: /usr/local

Configuring done
Generating done
```


YashasSamaga commented Nov 18, 2019

@Scolymus GPUs with compute capability (CC) below 5.3 are not supported. The CUDA backend provides a half-precision target, which requires features present in devices with CC 5.3 and above.
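For context, target selection from user code looks roughly like this (a sketch; makeCudaNet is a hypothetical helper, not part of the API):

```cpp
#include <opencv2/dnn.hpp>
#include <string>

// Hypothetical helper: the FP16 target needs compute capability 5.3+,
// the FP32 target is the default CUDA target.
cv::dnn::Net makeCudaNet(const std::string& model, bool fp16)
{
    cv::dnn::Net net = cv::dnn::readNet(model);
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(fp16 ? cv::dnn::DNN_TARGET_CUDA_FP16
                                 : cv::dnn::DNN_TARGET_CUDA);
    return net;
}
```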


sgriset commented Nov 19, 2019

I'm running on a Jetson Nano and getting outstanding performance from the classification networks on the GPU. However, I'm not seeing any improvement on the object detection networks over CPU inferencing (unlike the x86 benchmarks published here). I'm using jtop to monitor the system, and I can see the models getting loaded onto the GPU and the system using it. Any suggestions or thoughts, such as pre-processing the object detection models? I notice that JetPack uses the UFF model format, which is why I ask.


YashasSamaga commented Nov 19, 2019

@sgriset

There are several issues:

  1. Depthwise convolutions are very slow (slower than the CPU) with cuDNN.
  2. The CropAndResize/ROI Pooling/MVN layers do not have implementations yet. They fall back to the CPU, which incurs a high penalty.
  3. More optimizations are being rolled out over time. This was the first PR.

It depends on what model you are running. MobileNet SSD suffers from issue 1. All Faster RCNN based detectors suffer from issue 2. Issue 2 will be resolved before Christmas.

You can set the OPENCV_LOG_LEVEL=INFO environment variable to check whether any fallbacks are being used.
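The same log level can also be raised from code via the core logging utilities (a minimal sketch):

```cpp
#include <opencv2/core/utils/logger.hpp>

int main()
{
    // Same effect as OPENCV_LOG_LEVEL=INFO: fallback notices from the
    // CUDA backend then show up in the log output.
    cv::utils::logging::setLogLevel(cv::utils::logging::LOG_LEVEL_INFO);
    // ... build the Net and run forward() ...
}
```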


isra60 commented Nov 19, 2019

What about the YOLO detector? Have you tested its performance against Darknet? https://github.com/AlexeyAB/darknet


charlestamz commented Nov 22, 2019

@isra60 I didn't run a full test; in my test, the new dnn module (the master branch) reaches about 2/3 the FPS of https://github.com/AlexeyAB/darknet.


YashasSamaga commented Nov 22, 2019

Do not take this post very seriously; my end-semester exams are going on, and I just threw something together to test Darknet and the OpenCV CUDA backend. I will do this again next week on different devices.

YOLO v3 Investigation

The OpenCV CUDA backend uses the CPU to perform NMS. In the region layer, the data is moved from the GPU to the host to perform NMS.

Looking at the layerwise timings, the OpenCV CUDA backend beats Darknet on the route, resize, shortcut, etc. layers. The region layer performed very badly compared to Darknet's, probably due to the NMS being performed on the CPU and the device-to-host transfer involved.

Timings:

INVALID BENCHMARK: NMS not included in Darknet bench

OpenCV benchmark code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
Darknet: ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/dog.jpg

The CUDA timings include the forward pass only. I am not sure whether Darknet includes the preprocessing part.

Device: GTX 1050
Image size: 416 x 416

CUDA backend: 62 ms
Darknet: 60 ms

Both hit 75 ms if synchronization is forced after every layer.

Layerwise timings

I forced synchronization after every layer in both Darknet and the OpenCV CUDA backend so that layerwise timings could be measured. Both suffer equally (the GPU goes idle during layer switches). The code used to obtain the layerwise timings of the CUDA backend is similar to YashasSamaga@55ad843. I uncommented the timing code for Darknet.

Darknet measures timings between layers using std::chrono.

Darknet layerwise timings

Update: OpenCV layerwise timings (using std::chrono)

OpenCV layerwise timings (using CUDA event API)

Darknet mostly includes the ReLU timings in the convolution timings. The convolution time plus the ReLU time from the CUDA backend correlates with the Darknet convolution time.

Notes:

The CUDA backend does not use tensor cores yet; it's trivial to enable them via cuDNN. Darknet will benefit from tensor cores on CC 7.x GPUs.

@charlestamz on what device did you benchmark, and how?


charlestamz commented Nov 23, 2019

@YashasSamaga
On a GTX 1080.
I simply ran the DnnObjectDetector sample of OpenCV and got about 22 FPS.
With Darknet, I just ran the darknet command line, which gave about 38-40 FPS.
The dnn module I'm using is the latest master branch of http://github.com/opencv/opencv, which merged your dnn code; it is not directly from the repository you provided.


YashasSamaga commented Nov 24, 2019

INVALID BENCHMARK: NMS not included in Darknet bench

@charlestamz The FPS reported by the object detector sample depends on the camera FPS, among other things.

YOLO v3 on a GTX 1080 Ti (only the forward pass time is included):

Darknet: 15.5 ms (64 FPS)
OCV CUDA: 17.9 ms (56 FPS)

I am not sure whether Darknet includes the time spent transferring the output from the GPU to the host. It is included in the OCV CUDA timings.


phil-ludewig commented Dec 3, 2019

I've tested your CUDA backend on a Jetson Nano and it worked flawlessly at 28 FPS with tiny-yolov3 (320x320) in plain C++ code!

I've now integrated it with the dnn_detect ROS node, which works fine for about 100 images but then crashes the Jetson entirely. I've compiled all required ROS packages with OpenCV 4.1.2-dev in catkin_make without build errors. Do you have an idea what might be going on here? I uploaded the node code here: https://pastebin.com/hVWKnQ2d

Edit: I found out that the 5A power supply wasn't adequate! Switching the Jetson Nano to 5W mode fixed the crashing!


albertchristianto commented Dec 6, 2019

@charlestamz So OpenCV supports the CUDA backend in the latest release, doesn't it?


dkurt commented Dec 6, 2019

@albertchristianto, no, it was merged after the 4.1.2 release. Check the dates.


thedevleon commented Dec 7, 2019

Does someone have a precompiled binary for Windows x64? I'm having trouble building OpenCV with CUDA because of clashing CUDA versions and various other issues.


YashasSamaga commented Dec 8, 2019

I think I made a mistake earlier when reporting the YOLOv3 timings for the CUDA backend and Darknet. I hadn't included the NMS timings in the Darknet bench, and I hadn't taken the average of several runs in a loop (instead I averaged what Darknet reported in its output across several runs).

YOLO Benchmark

  • GTX 1050 and 7700HQ
  • made attempts to minimize CPU and GPU utilization by other applications

Warmup runs: 3
Benchmark runs: 100 x 3 (ran the benchmark program three times)
Versions: cuDNN 7.6.5, CUDA 10.2

| Model | CUDA backend (PR16096, PR16092, PR16063) | CUDA backend (master) | Darknet |
| --- | --- | --- | --- |
| YOLOv3 | 53.85 ms | 57.42 ms | 56.884 ms |
| YOLOv3 Tiny | 6.95 ms | - | 8.01 ms |
| YOLOv3 Tiny PRN | 5.60 ms | - | 6.492 ms |

Benchmark code (for both darknet and opencv): https://gist.github.com/YashasSamaga/26eb2eb16be2cc749e3394d300a7585e
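Roughly, the measurement loop looks like the sketch below (my reconstruction of the approach, not the linked gist verbatim; the NMS timing that the corrected benchmark includes is omitted here, and the file paths are placeholders):

```cpp
// Sketch: warm up, then average the forward-pass time over many runs.
#include <opencv2/core/utility.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::Mat image = cv::imread("dog.jpg");
    cv::Mat blob = cv::dnn::blobFromImage(image, 1 / 255.0, cv::Size(416, 416),
                                          cv::Scalar(), true, false);
    net.setInput(blob);

    // Compute all three YOLO output layers, not just the first one.
    std::vector<cv::String> names = net.getUnconnectedOutLayersNames();
    std::vector<cv::Mat> outs;

    for (int i = 0; i < 3; i++)      // warmup runs
        net.forward(outs, names);

    const int runs = 100;
    cv::TickMeter tm;
    tm.start();
    for (int i = 0; i < runs; i++)   // timed runs (forward pass only here)
        net.forward(outs, names);
    tm.stop();
    std::cout << tm.getTimeMilli() / runs << " ms per inference\n";
}
```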

DISCLAIMER: I am not very comfortable editing Darknet code, but I hope the benchmark is fair and correct (it would be great if somebody could validate it).

NOTE: I have an experimental patch (part of a more general experimental graph patch) which can save another 2 ms. YOLOv3 has three region layers. The outputs of the first two are copied to the CPU for NMS while the GPU continues computing the remaining layers. This way, the GPU-to-CPU memory transfer for those two region layers can be completely hidden, and their NMS begins on the CPU even before the forward pass on the GPU has fully finished.


Please note the corrections @crackwitz @charlestamz @isra60 @sgriset

There are a few open PRs for ROI pooling and CropAndResize; these should improve the performance of many detection models.
