
cv::cuda::transpose() limitation #22782

Open
chacha21 opened this issue Nov 9, 2022 · 6 comments · May be fixed by opencv/opencv_contrib#3371
Labels: bug, category: gpu/cuda (contrib)

Comments

@chacha21 (Contributor) commented Nov 9, 2022

System Information

OpenCV 4.6.0
Windows 7, 64-bit
Visual Studio 2019 (latest)
NVIDIA CUDA SDK 10.2

Detailed description

As stated in the documentation, CV_16UC1 is currently not supported by cv::cuda::transpose().
Internally, the implementation is limited by CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );

However, this limitation is hard to understand.

Currently, for elemSize == 1, nppiTranspose_8u_C1R() is called.

However, nppiTranspose_16u_C1R() does exist (among others). Looking at old NPP release notes, it was already present in CUDA SDK 8.0 (https://docs.nvidia.com/cuda/archive/8.0/pdf/NPP_Library_Image_Support_And_Data_Exchange.pdf).

Thus, I don't understand why only nppiTranspose_8u_C1R() is used.
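
For illustration only (this is just a sketch, not the code from the PR; stream handling is omitted), the 16-bit single-channel case could be routed to NPP along these lines:

#include <opencv2/core/cuda.hpp>
#include <nppi.h>

// Hypothetical helper, not an OpenCV API: transpose a CV_16UC1 GpuMat via NPP.
static void transpose16u(const cv::cuda::GpuMat& src, cv::cuda::GpuMat& dst)
{
    CV_Assert(src.type() == CV_16UC1);
    dst.create(src.cols, src.rows, src.type()); // transposed dimensions

    const NppiSize srcSize = { src.cols, src.rows };
    const NppStatus status = nppiTranspose_16u_C1R(
        src.ptr<Npp16u>(), static_cast<int>(src.step),
        dst.ptr<Npp16u>(), static_cast<int>(dst.step),
        srcSize);
    CV_Assert(status == NPP_SUCCESS);
}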

Steps to reproduce

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>  // declares cv::cuda::transpose()

cv::cuda::GpuMat src(1, 2, CV_16UC1);
cv::cuda::GpuMat dst;
cv::cuda::transpose(src, dst); // unfortunately not supported: the CV_Assert fires
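
A possible workaround, until the assertion is relaxed, is to go through a 4-byte type that transpose() does accept; 16-bit values round-trip through CV_32F without loss:

cv::cuda::GpuMat src16(1, 2, CV_16UC1), src32, dst32, dst16;
src16.convertTo(src32, CV_32F);     // CV_16U -> CV_32F is exact
cv::cuda::transpose(src32, dst32);  // elemSize == 4, accepted by the assert
dst32.convertTo(dst16, CV_16U);     // back to CV_16UC1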

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
chacha21 added the bug label Nov 9, 2022
@chacha21 (Contributor Author) commented Nov 9, 2022

PR proposal: opencv/opencv_contrib#3371

@cudawarped (Contributor) commented:

It's possible the other data types didn't exist in NPP when this was implemented, or they could have been buggy or slow. What is the performance impact of using NPP for elemSize == 4 and elemSize == 8?

@chacha21 (Contributor Author) commented Nov 10, 2022

> It's possible the other data types didn't exist in NPP when this was implemented, or they could have been buggy or slow. What is the performance impact of using NPP for elemSize == 4 and elemSize == 8?

I don't know how to benchmark that.
I will have a look at the perf tests, but I am limited to a single, old NVIDIA GPU (compute capability 5.0).
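
For a first try, a minimal timing harness (my own sketch, not the official perf test; type, size and iteration count are arbitrary) could look like this:

#include <opencv2/core/utility.hpp>   // cv::TickMeter
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>     // cv::cuda::transpose
#include <iostream>

int main()
{
    cv::cuda::GpuMat src(1024, 1280, CV_32FC1), dst;
    cv::cuda::transpose(src, dst);                // warm-up call, excludes one-time costs
    cv::cuda::Stream::Null().waitForCompletion();

    const int iterations = 1000;
    cv::TickMeter tm;
    tm.start();
    for (int i = 0; i < iterations; ++i)
        cv::cuda::transpose(src, dst);
    cv::cuda::Stream::Null().waitForCompletion(); // wait for all queued work
    tm.stop();
    std::cout << tm.getTimeMilli() << " ms for " << iterations << " iterations" << std::endl;
    return 0;
}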

@cudawarped (Contributor) commented:

> I don't know how to benchmark that.

I would think it should be OK. In the past NPP was notoriously bad under some circumstances (unnecessary start-up costs for some routines, wrong results for others), but I guess it should be OK for a transpose operation on SDK 11+.

The only issue could be that those using SDK 9 or older may see a slowdown going from gridTranspose to NPP, but testing all combinations of SDK versions would be difficult.

If it were me, I would run the test cases before and after the change through NVIDIA Nsight Systems to see if anything is off, or look at the results from the perf tests if they exist.
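
For example, assuming the cudaarithm perf tests were built (the binary name/path may differ per build), the existing transpose cases can be filtered out and run on their own:

./bin/opencv_perf_cudaarithm --gtest_filter=*Transpose*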

@chacha21 (Contributor Author) commented Nov 10, 2022

I have interesting results after a first performance investigation.
This is not a standard perf_test result; I wanted something more customizable first.

The observations tend to show that nppiTranspose always performs better than gridTranspose() for an equivalent elemSize.

My results:

*** CUDA Device Query (Runtime API) version (CUDART static linking) ***

Device count: 1

Device 0: "NVIDIA GeForce GTX 750"
  CUDA Driver Version / Runtime Version          11.40 / 10.20
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  GPU Clock Speed:                               1.14 GHz
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version  = 11.40, CUDA Runtime Version = 10.20, NumDevs = 1

Transposing 1000 times a 1280x1024 matrix.
The same matrix is interpreted as CV_8UC1, CV_8UC4, CV_16UC1, CV_16UC4, CV_32SC1 and CV_32FC1 to get different elemSizes (one way to build such views over a single allocation is sketched after the results).
nppiTranspose_8u_C1R (elemSize == 1) : 247.7ms
nppiTranspose_8u_C4R (elemSize == 4) : 56.043ms
nppiTranspose_16u_C1R (elemSize == 2) : 105.141ms
nppiTranspose_16u_C4R (elemSize == 8) : 35.3116ms
nppiTranspose_32s_C1R (elemSize == 4) : 54.322ms
nppiTranspose_32f_C1R (elemSize == 4) : 54.9982ms
gridTranspose (elemSize == 1) : 539.93ms
gridTranspose (elemSize == 2) : 230.971ms
gridTranspose (elemSize == 4) : 130.77ms
gridTranspose (elemSize == 8) : 77.7393ms
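
For reference, a sketch of how a single device allocation can be viewed under several element types (only an illustration: allocate for the largest elemSize, then build non-owning views over the same buffer):

// Allocate once for the largest elemSize of the tested types (CV_16UC4 == 8 bytes).
cv::cuda::GpuMat buffer(1024, 1280, CV_16UC4);
// Views share buffer's data pointer and step; smaller elemSizes fit within the pitch.
cv::cuda::GpuMat as8u (1024, 1280, CV_8UC1,  buffer.data, buffer.step);  // elemSize == 1
cv::cuda::GpuMat as16u(1024, 1280, CV_16UC1, buffer.data, buffer.step);  // elemSize == 2
cv::cuda::GpuMat as32f(1024, 1280, CV_32FC1, buffer.data, buffer.step);  // elemSize == 4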

@chacha21 (Contributor Author) commented:

After a few more tests, using cuda::transpose() with either the NPPI backend or the gridTranspose() backend:
NPPI still performs better than gridTranspose() in all cases that I tested (with different image sizes, different strides, ...).

However, this is still only tested on a GTX 750, limited to CUDA compute capability 5.0.
