Accuracy cudaarithm test fails after CUDA/GpuMat_SetTo.Zero test case #3361

asenyaev · 2022-10-11T13:49:56Z

Accuracy:cudaarithm always fails on a test that runs after the CUDA/GpuMat_SetTo.Zero test case (not the test immediately after it, but one from the next test case). I tried filtering tests out, but it is a never-ending process.

By default, it fails here:

[ RUN      ] CUDA/GpuMat_SetTo.SameVal/8, where GetParam() = (Quadro P2000, 128x128, 8SC1, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.SameVal/8, where GetParam() = (Quadro P2000, 128x128, 8SC1, whole matrix) (224 ms)

Details about the never-ending process

If I disable the CUDA/GpuMat_SetTo.SameVal test case, the next test case fails:

[ RUN      ] CUDA/GpuMat_SetTo.DifferentVal/2, where GetParam() = (Quadro P2000, 128x128, 8UC2, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.DifferentVal/2, where GetParam() = (Quadro P2000, 128x128, 8UC2, whole matrix) (215 ms)

If I disable the CUDA/GpuMat_SetTo.DifferentVal test case, the next test case fails:

[ RUN      ] CUDA/GpuMat_SetTo.Masked/0, where GetParam() = (Quadro P2000, 128x128, 8UC1, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.Masked/0, where GetParam() = (Quadro P2000, 128x128, 8UC1, whole matrix) (220 ms)

System Information

OpenCV version: 4.6.0
Operating System / Platform: Ubuntu 20.04
Compiler & compiler version: GCC 9.4.0
GPU: Quadro P2000

cudawarped · 2022-10-12T17:47:49Z

It looks like a device error caused by CUDA_Event/AsyncEvent.Timing/1, which runs before CUDA/GpuMat_SetTo.Zero.

The call to cudaEventElapsedTime() returns cudaErrorInvalidResourceHandle. Because the error was returned by the call rather than queried with cudaGetLastError(), the last error code is not reset to cudaSuccess, and the next call to CV_CUDEV_SAFE_CALL(cudaGetLastError()); throws an exception. Do you think it would be better for checkCudaError() to call cudaGetLastError() internally when an error is thrown, resetting the last error code to cudaSuccess, or for cv::cuda::Event::elapsedTime() to clear the error itself, since it is not fatal?

It doesn't fail for me if I disable it (--gtest_filter=-CUDA_Event/AsyncEvent.W*). Does it still fail on your end if you disable it?

asmorkalov · 2022-10-13T08:01:46Z

The next CUDA call in the sequence should ignore the last error code and set its own status. If I understand correctly, there is a cudaGetLastError() call that is not preceded by a relevant CUDA call. If that is the case, the test should be updated rather than the error-reporting function.

cudawarped · 2022-10-13T08:13:25Z

> The next CUDA call in the sequence should ignore the last error code and set its own status

There are two cases:

1. cudaSafeCall(cudaGetLastError()), and
2. cudaSafeCall(cudaApiFuncion())

In case 1 there is no problem: if cudaGetLastError() returns an error code, OpenCV throws an exception, and the error code has already been cleared by that call to cudaGetLastError().

In case 2, cudaApiFuncion() returns the error code and OpenCV throws an exception, but the error code remains set, so any later internal call to cudaSafeCall(cudaGetLastError()) made by an OpenCV function could throw an exception.

Therefore the next CUDA call can't ignore the last error code if it calls cudaSafeCall(cudaGetLastError()) or a similar macro anywhere internally.
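To make the two cases concrete, here is a minimal sketch that models the runtime's last-error slot in plain C++ so it can run without a GPU. Every name starting with mock (mockFailingApiCall, mockFailingLaunch, mockGetLastError, mockSafeCall) is a hypothetical stand-in rather than a CUDA or OpenCV API, and the error values are arbitrary. The only assumption is the behavior described above: a failing call records its code, and only cudaGetLastError() resets it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the CUDA runtime so the sketch runs without a GPU.
enum MockError { mockSuccess = 0, mockInvalidResourceHandle = 1 };  // arbitrary values

static MockError g_lastError = mockSuccess;  // models the runtime's "last error" slot

// Models a failing runtime call: the code is returned AND recorded in the slot.
MockError mockFailingApiCall() {
    g_lastError = mockInvalidResourceHandle;
    return g_lastError;
}

// Models a failure that is only visible through the slot (e.g. a kernel launch).
void mockFailingLaunch() { g_lastError = mockInvalidResourceHandle; }

// Models cudaGetLastError(): returns the recorded code and resets it to success.
MockError mockGetLastError() {
    MockError e = g_lastError;
    g_lastError = mockSuccess;
    return e;
}

// Models a cudaSafeCall-style check: throw on any non-success code.
void mockSafeCall(MockError e) {
    if (e != mockSuccess)
        throw std::runtime_error("Gpu API call failed, code " +
                                 std::to_string(static_cast<int>(e)));
}

// Runs `failing` (expected to throw), then reports whether a later,
// unrelated mockSafeCall(mockGetLastError()) still throws.
template <typename F>
bool laterCheckThrows(F failing) {
    g_lastError = mockSuccess;
    try { failing(); } catch (const std::runtime_error&) {}
    try { mockSafeCall(mockGetLastError()); }
    catch (const std::runtime_error&) { return true; }
    return false;
}

// Case 1: the error is fetched through mockGetLastError(), which clears it.
bool case1LeavesStaleError() {
    return laterCheckThrows([] { mockFailingLaunch(); mockSafeCall(mockGetLastError()); });
}

// Case 2: the error is taken from the return value, so the slot stays set.
bool case2LeavesStaleError() {
    return laterCheckThrows([] { mockSafeCall(mockFailingApiCall()); });
}

// Usage: only case 2 poisons the later check.
void checkBothCases() {
    assert(!case1LeavesStaleError());
    assert(case2LeavesStaleError());
}
```

In this model the later check plays the role of the CV_CUDEV_SAFE_CALL(cudaGetLastError()) that fails in the CUDA/GpuMat_SetTo tests.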

Shouldn't OpenCV be responsible for clearing this internally if it throws an exception?

The alternative would be for OpenCV to expose cudaGetLastError(), making it the user's responsibility to call it after an exception is thrown.

asenyaev · 2022-10-13T08:43:38Z

@cudawarped, if I run the tests with the --gtest_filter=-CUDA_Event/AsyncEvent.W* flag, one test after CUDA/GpuMat_SetTo.Zero still fails. However, if I run them with --gtest_filter=-CUDA_Event/AsyncEvent.Timing/*, all tests pass.

cudawarped · 2022-10-13T08:47:38Z

@asenyaev sorry, wrong flag. That one was meant to run only CUDA_Event/AsyncEvent.Timing out of CUDA_Event/AsyncEvent, to confirm that it was the test causing the error.

asmorkalov · 2022-10-13T09:14:26Z

> In case 2, cudaApiFuncion() returns the error code and OpenCV throws an exception, but the error code remains set, so any later internal call to cudaSafeCall(cudaGetLastError()) made by an OpenCV function could throw an exception.

That means we have a sequence like this:
CudaAPICall -> cudaGetLastError -> Exception -> Non-CUDA code -> cudaGetLastError. The last cudaGetLastError raises the issue, which means an out-of-order cudaGetLastError call reports some random state. I propose fixing the sequence rather than forcibly dropping the error state.
The last error code could be useful to the caller for debugging purposes. Not all OpenCV functions print details.

cudawarped · 2022-10-13T09:43:17Z

> That means we have a sequence like this: CudaAPICall -> cudaGetLastError -> Exception -> Non-CUDA code -> cudaGetLastError. The last cudaGetLastError raises the issue, which means an out-of-order cudaGetLastError call reports some random state. I propose fixing the sequence rather than forcibly dropping the error state.

Adding more detail, the sequence is
CudaAPICall -> cudaSafeCall on return code -> Exception -> OpenCV cuda::|cudacodec:: code then calls cudaSafeCall(cudaGetLastError) internally
The only ways I can see to fix the sequence are:

1. Change the way case 2 is handled (CudaAPICall -> Fails -> cudaGetLastError() to clear internal state -> cudaSafeCall on previous state -> Exception), which would probably amount to the same thing as calling cudaGetLastError() internally in the cudaSafeCall macro.
2. Remove all calls to cudaSafeCall(cudaGetLastError).

Anything else, such as clearing the error before CUDA API calls, will most likely fail because the runtime calls can be asynchronous with respect to the host.
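The first option could look roughly like the sketch below. It again mocks the runtime in plain C++ so it runs without a GPU; mockSafeCallClearing, MockGpuError, and the other mock names are hypothetical and are not the actual OpenCV macro or any merged change. The idea shown is only this: on failure, reset the slot first, then throw with the original code so the caller still sees it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the CUDA runtime; values are arbitrary.
enum MockError { mockSuccess = 0, mockInvalidResourceHandle = 1 };

static MockError g_lastError = mockSuccess;  // models the runtime's "last error" slot

// Models a failing runtime call: the code is returned AND recorded in the slot.
MockError mockFailingApiCall() {
    g_lastError = mockInvalidResourceHandle;
    return g_lastError;
}

// Models cudaGetLastError(): returns the recorded code and resets it to success.
MockError mockGetLastError() {
    MockError e = g_lastError;
    g_lastError = mockSuccess;
    return e;
}

// Exception that keeps the failing code, so clearing the slot loses no information.
struct MockGpuError : std::runtime_error {
    MockError code;
    explicit MockGpuError(MockError c)
        : std::runtime_error("Gpu API call failed, code " +
                             std::to_string(static_cast<int>(c))),
          code(c) {}
};

// Safe-call variant: on failure, reset the slot, then throw the original code.
void mockSafeCallClearing(MockError e) {
    if (e != mockSuccess) {
        (void)mockGetLastError();  // drop the stale state before throwing
        throw MockGpuError(e);
    }
}

struct Outcome {
    MockError reported;     // code carried by the first exception
    bool laterCheckThrows;  // whether an unrelated later check was poisoned
};

// A case 2 failure followed by an unrelated check of the last-error slot.
Outcome failThenCheckLater() {
    Outcome out{mockSuccess, false};
    try { mockSafeCallClearing(mockFailingApiCall()); }
    catch (const MockGpuError& ex) { out.reported = ex.code; }
    try { mockSafeCallClearing(mockGetLastError()); }
    catch (const MockGpuError&) { out.laterCheckThrows = true; }
    return out;
}

// Usage: the caller sees the original code, and the later check is clean.
void checkClearing() {
    Outcome o = failThenCheckLater();
    assert(o.reported == mockInvalidResourceHandle);
    assert(!o.laterCheckThrows);
}
```

Because the reset happens after the failing call has already returned, this does not run into the asynchrony problem that clearing before the call would have.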

> The last error code could be useful to the caller for debugging purposes. Not all OpenCV functions print details.

I agree, but if we don't expose a way to clear it through OpenCV, users who don't link against the CUDA toolkit could face exceptions when OpenCV cuda::|cudacodec:: code calls cudaSafeCall(cudaGetLastError) internally, caused by earlier case 2 calls that threw an exception.

If we clear the error when an exception is thrown, then shouldn't the error code always be reported to the user in the exception message?

asenyaev added the category: cuda label Oct 11, 2022

asenyaev mentioned this issue Oct 11, 2022

Ubuntu 20.04 x86_64 with CUDA workflow opencv/ci-gha-workflow#70

Merged

cudawarped mentioned this issue Oct 12, 2022

cv::cudacodec::createVideoReader Not Working as expected #3359

Closed

cudawarped mentioned this issue Oct 13, 2022

Reset cuda runtime error code to cudasuccess on runtime failure. opencv/opencv#22633

Merged

asmorkalov closed this as completed Oct 26, 2022
