
cv::cuda::transpose() limitation #22782

Open
chacha21 opened this issue Nov 9, 2022 · 6 comments · May be fixed by opencv/opencv_contrib#3371
Labels: bug, category: gpu/cuda (contrib)

Comments

@chacha21 (Contributor) commented Nov 9, 2022

System Information

OpenCV 4.6.0
Windows 7, 64-bit
Visual Studio 2019 (latest)
NVIDIA CUDA SDK 10.2

Detailed description

As stated in the documentation, CV_16UC1 is currently not supported by cv::cuda::transpose().
Internally, the implementation is limited by CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );

However, this limitation is hard to understand.

Currently, for elemSize == 1, nppiTranspose_8u_C1R() is called.

However, nppiTranspose_16u_C1R() does exist (among others). Looking at old NPP release notes, it was already present in CUDA SDK 8.0 (https://docs.nvidia.com/cuda/archive/8.0/pdf/NPP_Library_Image_Support_And_Data_Exchange.pdf).

Thus, I don't understand why only nppiTranspose_8u_C1R() is used.
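
For illustration only (this is just a sketch, not the code from the PR; stream handling is omitted), the 16-bit single-channel case could be routed to NPP along these lines:

#include <opencv2/core/cuda.hpp>
#include <nppi.h>

// Hypothetical helper, not an OpenCV API: transpose a CV_16UC1 GpuMat via NPP.
static void transpose16u(const cv::cuda::GpuMat& src, cv::cuda::GpuMat& dst)
{
    CV_Assert(src.type() == CV_16UC1);
    dst.create(src.cols, src.rows, src.type()); // transposed dimensions

    const NppiSize srcSize = { src.cols, src.rows };
    const NppStatus status = nppiTranspose_16u_C1R(
        src.ptr<Npp16u>(), static_cast<int>(src.step),
        dst.ptr<Npp16u>(), static_cast<int>(dst.step),
        srcSize);
    CV_Assert(status == NPP_SUCCESS);
}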

Steps to reproduce

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>  // declares cv::cuda::transpose()

cv::cuda::GpuMat src(1, 2, CV_16UC1);
cv::cuda::GpuMat dst;
cv::cuda::transpose(src, dst); // unfortunately not supported: the CV_Assert fires
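
A possible workaround, until the assertion is relaxed, is to go through a 4-byte type that transpose() does accept; 16-bit values round-trip through CV_32F without loss:

cv::cuda::GpuMat src16(1, 2, CV_16UC1), src32, dst32, dst16;
src16.convertTo(src32, CV_32F);     // CV_16U -> CV_32F is exact
cv::cuda::transpose(src32, dst32);  // elemSize == 4, accepted by the assert
dst32.convertTo(dst16, CV_16U);     // back to CV_16UC1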

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
chacha21 added the bug label Nov 9, 2022
@chacha21 (Contributor Author) commented Nov 9, 2022

PR proposal: opencv/opencv_contrib#3371

@cudawarped (Contributor) commented:

It's possible the other data types didn't exist in NPP when this was implemented, or they could have been buggy or slow. What is the performance impact of using NPP for elemSize == 4 and elemSize == 8?

@chacha21 (Contributor Author) commented Nov 10, 2022

> It's possible the other data types didn't exist in NPP when this was implemented, or they could have been buggy or slow. What is the performance impact of using NPP for elemSize == 4 and elemSize == 8?

I don't know how to benchmark that.
I will have a look at the perf tests, but I am limited to a single, old NVIDIA GPU (compute capability 5.0).
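
For a first try, a minimal timing harness (my own sketch, not the official perf test; type, size and iteration count are arbitrary) could look like this:

#include <opencv2/core/utility.hpp>   // cv::TickMeter
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>     // cv::cuda::transpose
#include <iostream>

int main()
{
    cv::cuda::GpuMat src(1024, 1280, CV_32FC1), dst;
    cv::cuda::transpose(src, dst);                // warm-up call, excludes one-time costs
    cv::cuda::Stream::Null().waitForCompletion();

    const int iterations = 1000;
    cv::TickMeter tm;
    tm.start();
    for (int i = 0; i < iterations; ++i)
        cv::cuda::transpose(src, dst);
    cv::cuda::Stream::Null().waitForCompletion(); // wait for all queued work
    tm.stop();
    std::cout << tm.getTimeMilli() << " ms for " << iterations << " iterations" << std::endl;
    return 0;
}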

@cudawarped (Contributor) commented:

> I don't know how to benchmark that.

I would think it should be OK. In the past NPP was notoriously bad under some circumstances (unnecessary start-up costs for some routines, wrong results for others), but I guess it should be OK for a transpose operation on SDK 11+.

The only issue could be that those using SDK 9 or older may see a slowdown going from gridTranspose to NPP, but testing all combinations of SDK versions would be difficult.

If it were me, I would run the test cases before and after the change through NVIDIA Nsight Systems to see if anything is off, or look at the results from the perf tests if they exist.
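
For example, assuming the cudaarithm perf tests were built (the binary name/path may differ per build), the existing transpose cases can be filtered out and run on their own:

./bin/opencv_perf_cudaarithm --gtest_filter=*Transpose*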

@chacha21 (Contributor Author) commented Nov 10, 2022

I have interesting results after a first performance investigation.
This is not a standard perf_test result; I wanted something more customizable first.

The observations tend to show that nppiTranspose always performs better than gridTranspose() for an equivalent elemSize.

My results:

*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: "NVIDIA GeForce GTX 750"
  CUDA Driver Version / Runtime Version          11.40 / 10.20
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  GPU Clock Speed:                               1.14 GHz
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 11.40, CUDA Runtime Version = 10.20, NumDevs = 1

Transposing 1000 times a 1280x1024 matrix.
The same matrix is interpreted as CV_8UC1, CV_8UC4, CV_16UC1, CV_16UC4, CV_32SC1 and CV_32FC1 to get different elemSizes (one way to build such views over a single allocation is sketched after the results).
nppiTranspose_8u_C1R (elemSize == 1) : 247.7ms
nppiTranspose_8u_C4R (elemSize == 4) : 56.043ms
nppiTranspose_16u_C1R (elemSize == 2) : 105.141ms
nppiTranspose_16u_C4R (elemSize == 8) : 35.3116ms
nppiTranspose_32s_C1R (elemSize == 4) : 54.322ms
nppiTranspose_32f_C1R (elemSize == 4) : 54.9982ms
gridTranspose (elemSize == 1) : 539.93ms
gridTranspose (elemSize == 2) : 230.971ms
gridTranspose (elemSize == 4) : 130.77ms
gridTranspose (elemSize == 8) : 77.7393ms
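
For reference, a sketch of how a single device allocation can be viewed under several element types (only an illustration: allocate for the largest elemSize, then build non-owning views over the same buffer):

// Allocate once for the largest elemSize of the tested types (CV_16UC4 == 8 bytes).
cv::cuda::GpuMat buffer(1024, 1280, CV_16UC4);
// Views share buffer's data pointer and step; smaller elemSizes fit within the pitch.
cv::cuda::GpuMat as8u (1024, 1280, CV_8UC1,  buffer.data, buffer.step);  // elemSize == 1
cv::cuda::GpuMat as16u(1024, 1280, CV_16UC1, buffer.data, buffer.step);  // elemSize == 2
cv::cuda::GpuMat as32f(1024, 1280, CV_32FC1, buffer.data, buffer.step);  // elemSize == 4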

@chacha21 (Contributor Author) commented:

After a few more tests, using cuda::transpose() with either the NPPI backend or the gridTranspose() backend:
NPPI still performs better than gridTranspose() in all cases that I tested (with different image sizes, different strides, ...).

However, this is still only tested on a GTX 750, limited to CUDA compute capability 5.0.
