
More data types supported in cv::cuda::transpose() #3371

Open · wants to merge 9 commits into base: 4.x
Conversation

@chacha21 commented Nov 9, 2022

Fixes opencv/opencv#22782

Proposal for opencv/opencv#22782

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Comment on lines 68 to 69

CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
//CV_Assert( elemSize == 1 || elemSize == 4 || elemSize == 8 );
Member
What if elemSize == 5?

What kind of error message would we show to the user?

Author

It is handled by

...
else
      CV_Error(Error::StsUnsupportedFormat, "");
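As a plain C++ sketch of the fall-through pattern discussed here (hypothetical function name and messages; the real patch raises CV_Error(Error::StsUnsupportedFormat, "") at that point), any elemSize that matches no branch, such as 5, ends in the final else:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the dispatch in cv::cuda::transpose(): any
// elemSize that matches no branch (e.g. 5) falls through to the final else.
// The real patch throws via CV_Error(Error::StsUnsupportedFormat, "").
std::string dispatchByElemSize(std::size_t elemSize)
{
    if (elemSize == 1 || elemSize == 4 || elemSize == 8)
        return "native nppiTranspose call";
    else if (elemSize == 2 || elemSize == 3 || elemSize == 6 ||
             elemSize == 12 || elemSize == 16)
        return "nppiTranspose call on a reinterpreted type";
    else
        throw std::invalid_argument("unsupported element size: " +
                                    std::to_string(elemSize));
}
```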

Comment on lines 84 to 85
if (src.empty())
dst.release();
Member

Empty input should be an error, as nobody wants to process "nothing" in real use cases: opencv/opencv#8300

Author

Ok, I thought it was better since "Allow empty matrices in most functions" (personally I prefer a no-op rather than an exception)

Author

I will also add a fastpath for size == (1, 1), where transpose is just a copy
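As a host-side sketch of that fast path (hypothetical names, with std::vector standing in for GpuMat): for a single cell, row or column, the transposed data is byte-for-byte identical to the source, so a copy/reshape replaces the kernel launch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the proposed fast path: when the matrix is a single
// cell, row, or column, the transposed data is identical to the source in
// memory, so a plain copy is sufficient. A generic loop is the fallback.
std::vector<int> transpose2D(const std::vector<int>& src,
                             std::size_t rows, std::size_t cols)
{
    if (rows <= 1 || cols <= 1)
        return src;                         // fast path: transpose == copy
    std::vector<int> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)  // generic fallback
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}
```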

(srcType == CV_32SC1) || (srcType == CV_32SC3) || (srcType == CV_32SC4) ||
(srcType == CV_32FC1) || (srcType == CV_32FC3) || (srcType == CV_32FC4);
const bool isElemSizeSupportedByNppi =
((elemSize != 0) && !(elemSize%1) && ((elemSize/1)<=4)) ||
Contributor

Apologies if I have completely misunderstood the logic, but:
  • Isn't elemSize%1 == 0 always?
  • When can elemSize == 0?
  • Isn't this already taken care of by
    (srcType == CV_8UC1) || (srcType == CV_8UC3) || (srcType == CV_8UC4) || (srcType == CV_16UC1)

Author

You are right, but it was on purpose for code clarity/readability: elemSize%1 and elemSize/1 will be optimized out by the compiler; this is just to balance with the 2, 4, 8 cases.

Ok to get rid of the (elemSize != 0) check.

Contributor

For clarity I would use elemSize1 and channels?
For what CV data type will isNppiNativelySupported == false and isElemSizeSupportedByNppi == true or isElemSizeSupportedByGridTranspose == true?

@chacha21 (Author) Nov 10, 2022

There is overlap between isNppiNativelySupported, isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose.
The idea is that if anything goes wrong, they can be disabled independently.
isNppiNativelySupported means that the data type is directly mapped to an nppiTranspose call.
isElemSizeSupportedByNppi means that we can cheat on the real data type and use another data type of the same size (assuming that NPP only performs memory copies, otherwise it would not work).
isElemSizeSupportedByGridTranspose is the fallback if we don't rely on the NPP implementation.

For instance, 16UC2 is not supported by NPP but can be mapped to 32SC1 (isNppiNativelySupported == false and isElemSizeSupportedByNppi == true)
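This "cheat" can be sketched in plain C++ (hypothetical helper names): a transpose only relocates whole elements, so a 16UC2 element (two uint16_t channels, 4 bytes) can be moved as one 32-bit word, which is what mapping it to 32SC1 exploits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack pairs of 16-bit channels into 32-bit words; byte layout is unchanged.
std::vector<std::uint32_t> asWords(const std::vector<std::uint16_t>& v)
{
    std::vector<std::uint32_t> w(v.size() / 2);
    std::memcpy(w.data(), v.data(), v.size() * sizeof(std::uint16_t));
    return w;
}

// Transpose the 32-bit view: each word (i.e. each 16UC2 element) lands in
// the transposed cell with its two channels untouched.
std::vector<std::uint32_t> transposeWords(const std::vector<std::uint32_t>& src,
                                          std::size_t rows, std::size_t cols)
{
    std::vector<std::uint32_t> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    return dst;
}
```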

Contributor

Is there a complete overlap, i.e. are isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose redundant unless isNppiNativelySupported is manually set to false?

@chacha21 (Author) Nov 10, 2022

Is there a complete overlap, i.e. are isElemSizeSupportedByNppi and isElemSizeSupportedByGridTranspose redundant unless isNppiNativelySupported is manually set to false?

There are cases where (isNppiNativelySupported == false) and (isElemSizeSupportedByNppi == true) (e.g. 16UC2).

But there are currently no cases where (isElemSizeSupportedByNppi == false) and (isElemSizeSupportedByGridTranspose == true). It is a fallback implementation in case:

  • a (future) OpenCV global option disables NPP
  • performance tests reveal that some NPP calls are slower than gridTranspose(). Specific cases could then be excluded from isElemSizeSupportedByNppi.

Contributor

Is 16UC2 handled by the block below?

else if (!(elemSize%2) && ((elemSize/2)==2))
  nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
    dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );

If so, wouldn't it be better to have this under

else if (srcType == CV_16UC2)

@chacha21 (Author) Nov 10, 2022

Exactly, this is not a native nppi call (isNppiNativelySupported == false) but it can be handled by cheating on the data type (isElemSizeSupportedByNppi == true).
That's why the tests inside if (isElemSizeSupportedByNppi) explicitly focus on elemSize rather than srcType.

Contributor

I'm not sure if this makes it easier or harder for the next person who looks at it to see what's going on, what do you think @alalek?

Either way I think having overlap between both makes it difficult to understand; maybe if the "cheat" only included the cases where it was applicable it would be more obvious?

@cudawarped (Contributor) commented Nov 10, 2022

Would it be possible to incorporate the newer stream API for newer versions of CUDA as well?
See #3338

Fixed a typo in gridTranspose() usage
Added fast path for single cell/row/col matrix
Throw error for empty matrix instead of no-op
Code style
@chacha21 (Author) commented Nov 10, 2022

Would it be possible to incorporate the newer stream API for newer versions of CUDA as well?
See #3338

I am currently based on the 4.x branch; usage of _Ctx seems to be in master.
I would prefer deferring that work to another PR.

@@ -201,6 +199,9 @@ void cv::cuda::transpose(InputArray _src, OutputArray _dst, Stream& stream)
else if (!(elemSize%8) && ((elemSize/8)==2))
nppSafeCall( nppiTranspose_32f_C4R(src.ptr<Npp32f>(), static_cast<int>(src.step),
dst.ptr<Npp32f>(), static_cast<int>(dst.step), sz) );

if (!stream)
Contributor

Should this be applied to gridTranspose() as well? I can't understand why it was previously missing; surely the result could have been in flight on return from this function when the default stream is passed?

Author

I have the opposite question: why is it ever needed? I left them here, but I don't understand the purpose of this extra synchronization when the default stream (sync by default) is used.

Contributor

Kernel launches are asynchronous with respect to the host. The default stream syncs with other streams by default in legacy mode.

My interpretation is that the OpenCV API works on the assumption that if a stream isn't passed, the user wants synchronization.

@chacha21 (Author) Nov 10, 2022

Do you mean that now that both cudaStreamPerThread and cudaStreamLegacy exist, people using OpenCV always expect the cudaStreamLegacy behaviour with the default stream, thus requiring the cudaDeviceSynchronize()?
If so, I agree.

Contributor

No, users of the API will expect functions to which they don't pass a stream to be synchronous with respect to the host when they return. This would not be the case with either cudaStreamPerThread or cudaStreamLegacy, as the kernel launch is asynchronous in both cases.

In the example from the docs

1) k_1<<<1, 1, 0, s>>>();
2) k_2<<<1, 1>>>();
3) k_3<<<1, 1, 0, s>>>()
4) ...

k_2 waits on k_1 because k_2 is in the legacy default stream, then k_3 waits on the legacy stream. Because of the specific way this has been set up, k_1 and k_2 have finished executing before the call to k_3<<<1, 1, 0, s>>>(); however the result from k_3 may still be in flight after control has returned to the host at line 4.

Now I haven't used per thread default streams (I always use explicit streams) but my understanding is that if the CUDA_API_PER_THREAD_DEFAULT_STREAM macro was used to enable per thread default streams k_1 would run before k_3 but both would be asynchronous with respect to k_2. Either way when control returns to the host on line 4 they may still all be in flight.

On the other hand, if we have the following, which is our case when no stream is passed:

1) k_1<<<1, 1>>>();
2) k_2<<<1, 1>>>();
3) k_3<<<1, 1>>>();
4) 

in either case (cudaStreamPerThread or cudaStreamLegacy) all three kernels may still be in flight when control returns to the host at line 4 if we don't explicitly synchronize.

@chacha21 (Author) Nov 10, 2022

Ok, anyway, I just checked and gridTranspose() does ultimately call cudaDeviceSynchronize() for the null stream (see https://github.com/opencv/opencv_contrib/blob/4.x/modules/cudev/include/opencv2/cudev/grid/detail/transpose.hpp)

Contributor

@cudawarped Do you have open issues with the PR? I want to merge, if you do not mind.

Contributor

@asmorkalov I am not 100% convinced that the isNppiNativelySupported and isElemSizeSupportedByNppi flags are needed. I can understand that @chacha21 wants a fallback option in case something goes wrong and because of that wants to separate out the two logic paths, but I am not sure whether the redundant calls under isElemSizeSupportedByNppi make the function harder to maintain. What do you think?

Additionally, elemSize1 is unused, the isElemSizeSupportedByNppi logic path won't be fully tested, and given the names of the nppi functions it may make more sense to just compare the element size directly instead of decomposing elemSize, i.e.

      else if (!(elemSize%2) && ((elemSize/2)==2))
        nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
          dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );

could be

      else if (elemSize==4)
        nppSafeCall( nppiTranspose_8u_C4R(src.ptr<Npp8u>(), static_cast<int>(src.step),
          dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );

Author

Whatever the decision, I will certainly agree: I explained my initial code structure, but I am OK to adapt to a more "OpenCV style".

Contributor

  1. Conditions like if (!(elemSize%2) && ((elemSize/2)==2)) look very cryptic. I understand the logic behind it, but if (elemSize==4) is definitely more readable.
  2. The only reason for a fallback I see is the 2-channel matrix case. All other cases are handled with regular types. I propose to merge "native" support and mere "support" into a single case and use the relevant Npp call. No duplicated/dead branches, more obvious testing.

if (rows == 1), the stride has no importance for the reshape()

@chacha21 (Author)
I could not test performance on GPUs other than mine (an old GTX 750). I am pretty confident, but it would be good if someone could take a look with more modern ones.

@asmorkalov (Contributor)
Transpose perf test results for GF 1080 with CUDA 10.2, Ubuntu 18.04:

Geometric mean (ms)

             Name of Test               before     after     after   
                                       transpose transpose transpose 
                                                               vs    
                                                             before  
                                                           transpose 
                                                           (x-factor)
Transpose::Sz_Type::(1280x720, 8UC1)     0.024     0.024      1.00   
Transpose::Sz_Type::(1280x720, 32SC1)    0.050     0.037      1.36   
Transpose::Sz_Type::(1280x720, 64FC1)    0.072     0.073      0.99   
Transpose::Sz_Type::(1280x720, 16UC2)    0.050     0.037      1.36   
Transpose::Sz_Type::(1280x720, 16SC2)    0.051     0.037      1.36   
Transpose::Sz_Type::(1280x720, 32SC2)    0.072     0.073      0.98   
Transpose::Sz_Type::(1280x720, 8UC4)     0.050     0.037      1.36   
Transpose::Sz_Type::(1280x1024, 8UC1)    0.034     0.034      1.00   
Transpose::Sz_Type::(1280x1024, 32SC1)   0.076     0.050      1.53   
Transpose::Sz_Type::(1280x1024, 64FC1)   0.101     0.096      1.06   
Transpose::Sz_Type::(1280x1024, 16UC2)   0.076     0.050      1.53   
Transpose::Sz_Type::(1280x1024, 16SC2)   0.076     0.049      1.53   
Transpose::Sz_Type::(1280x1024, 32SC2)   0.101     0.095      1.06   
Transpose::Sz_Type::(1280x1024, 8UC4)    0.076     0.049      1.54   
Transpose::Sz_Type::(1920x1080, 8UC1)    0.050     0.050      1.00   
Transpose::Sz_Type::(1920x1080, 32SC1)   0.124     0.082      1.51   
Transpose::Sz_Type::(1920x1080, 64FC1)   0.170     0.169      1.01   
Transpose::Sz_Type::(1920x1080, 16UC2)   0.124     0.082      1.51   
Transpose::Sz_Type::(1920x1080, 16SC2)   0.124     0.082      1.50   
Transpose::Sz_Type::(1920x1080, 32SC2)   0.170     0.169      1.00   
Transpose::Sz_Type::(1920x1080, 8UC4)    0.124     0.082      1.51

The execution time is very small. Not sure if it's a real speedup or some system side effect. In any case, the PR does not introduce regressions.

more obvious elemSize test logic to dispatch to the right nppiTranspose variant
added modern NPP _ctx stream support
Comment on lines 141 to 164
else if (elemSize == 1)
nppSafeCall( nppiTranspose_8u_C1R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 2)
nppSafeCall( nppiTranspose_16u_C1R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 3)
nppSafeCall( nppiTranspose_8u_C3R(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 4)
nppSafeCall( nppiTranspose_32s_C1R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 6)
nppSafeCall( nppiTranspose_16u_C3R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 8)
nppSafeCall( nppiTranspose_16u_C4R(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 12)
nppSafeCall( nppiTranspose_32s_C3R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
else if (elemSize == 16)
nppSafeCall( nppiTranspose_32s_C4R(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz) );
Contributor

The interpretation block does the transpose a second time if elemSize duplicates some of the srcType cases

Author

The interpretation block does the transpose a second time if elemSize duplicates some of the srcType cases

Well... no? This is still an "else if".

Comment on lines 219 to 242
else if (elemSize == 1)
nppSafeCall( nppiTranspose_8u_C1R_Ctx(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 2)
nppSafeCall( nppiTranspose_16u_C1R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 3)
nppSafeCall( nppiTranspose_8u_C3R_Ctx(src.ptr<Npp8u>(), static_cast<int>(src.step),
dst.ptr<Npp8u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 4)
nppSafeCall( nppiTranspose_32s_C1R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 6)
nppSafeCall( nppiTranspose_16u_C3R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 8)
nppSafeCall( nppiTranspose_16u_C4R_Ctx(src.ptr<Npp16u>(), static_cast<int>(src.step),
dst.ptr<Npp16u>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 12)
nppSafeCall( nppiTranspose_32s_C3R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
else if (elemSize == 16)
nppSafeCall( nppiTranspose_32s_C4R_Ctx(src.ptr<Npp32s>(), static_cast<int>(src.step),
dst.ptr<Npp32s>(), static_cast<int>(dst.step), sz, ctx) );
Contributor

The same here.


if (!stream)
CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
}
CV_CUDEV_SAFE_CALL( cudaDeviceSynchronize() );
Contributor

A synchronous NPP call is used if !stream. No need to sync again, right?

Comment on lines 250 to 258
if (elemSize == 1)
gridTranspose(globPtr<unsigned char>(src), globPtr<unsigned char>(dst), stream);
else if (elemSize == 2)
gridTranspose(globPtr<unsigned short>(src), globPtr<unsigned short>(dst), stream);
else if (elemSize == 4)
{
gridTranspose(globPtr<int>(src), globPtr<int>(dst), stream);
}
else // if (elemSize == 8)
{
gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
}
gridTranspose(globPtr<signed int>(src), globPtr<signed int>(dst), stream);
else if (elemSize == 8)
gridTranspose(globPtr<double>(src), globPtr<double>(dst), stream);
*/
Contributor

Please remove dead code.

@cudawarped (Contributor) commented Dec 24, 2022

@chacha21 If you use the new stream ctx API you should also include the old API for compatibility with pre-CUDA 10.1, see #3338.

but I am OK to adapt for a more "OpenCV style"

I was thinking about this and a more "OpenCV style" would be with a table (e.g. warp.cpp) to reduce the code redundancy.
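A table-driven version could be sketched like this (hypothetical handler names; the returned string stands in for the real nppiTranspose_* call, in the spirit of the func_t tables in warp.cpp):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the table-driven style: elemSize indexes a handler
// table once, replacing the chain of else-if branches. Each handler here
// merely names the nppiTranspose_* variant it would call.
using handler_t = std::string (*)();

static std::string t8uC1()  { return "nppiTranspose_8u_C1R"; }
static std::string t16uC1() { return "nppiTranspose_16u_C1R"; }
static std::string t32sC1() { return "nppiTranspose_32s_C1R"; }

std::string dispatchFromTable(std::size_t elemSize)
{
    static const std::unordered_map<std::size_t, handler_t> table = {
        {1, t8uC1}, {2, t16uC1}, {4, t32sC1},
    };
    const auto it = table.find(elemSize);
    return it == table.end() ? std::string("unsupported") : it->second();
}
```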

#endif
cudaStream_t _stream = StreamAccessor::getStream(stream);

if (!_stream || !useNppStreamCtx)
@cudawarped (Contributor) Jan 5, 2023

Would it be better to always use the old API when the CUDA SDK < 10.1 and the newer one otherwise, regardless of whether there is a stream or not? That way it should be easier to carve out the older code when it is deprecated in newer CUDA versions.

See https://github.com/cudawarped/opencv_contrib/blob/e40c43d96a22edaa3fbb880e957a3753938dc4f0/modules/cudaarithm/src/cuda/transpose.cu

void cv::cuda::transpose(InputArray _src, OutputArray _dst, Stream& stream)
{
GpuMat src = getInputMat(_src, stream);
#define USE_NPP_STREAM_CONTEXT (NPP_VERSION >= (10 * 1000 + 1 * 100 + 0))
Contributor

I think it would be better to define this globally somewhere although not sure where, maybe modules/cudev/include/opencv2/cudev/common.hpp.

@cudawarped (Contributor) commented Jan 5, 2023

@chacha21 I like the changes apart from my comments. For reference I was thinking the modification could look like
cudawarped@e40c43d

I would wait for @asmorkalov to take a look before making any more updates as he will ultimately decide on the required format.


syncOutput(dst, _dst, stream);
const int srcType = src.type();
Contributor

Is this redundant?

(elemSize == 1) || (elemSize == 2) || (elemSize == 3) || (elemSize == 4) ||
(elemSize == 6) || (elemSize == 8) || (elemSize == 12) || (elemSize == 16);

if (!isSupported)
Contributor

Would CV_Assert() be better here?

Successfully merging this pull request may close these issues.

cv::cuda::transpose() limitation
4 participants