Accuracy cudaarithm test fails after CUDA/GpuMat_SetTo.Zero test case #3361

Closed
asenyaev opened this issue Oct 11, 2022 · 7 comments

@asenyaev
Contributor

asenyaev commented Oct 11, 2022

The cudaarithm accuracy test always fails on a test that runs after the CUDA/GpuMat_SetTo.Zero test case (not the test immediately after it, but one from the next test case). I tried filtering tests out, but it's a never-ending process.

By default, it fails here:

[ RUN      ] CUDA/GpuMat_SetTo.SameVal/8, where GetParam() = (Quadro P2000, 128x128, 8SC1, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.SameVal/8, where GetParam() = (Quadro P2000, 128x128, 8SC1, whole matrix) (224 ms)
Details about the never-ending process

If I disable the CUDA/GpuMat_SetTo.SameVal test case, the next test case fails:

[ RUN      ] CUDA/GpuMat_SetTo.DifferentVal/2, where GetParam() = (Quadro P2000, 128x128, 8UC2, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.DifferentVal/2, where GetParam() = (Quadro P2000, 128x128, 8UC2, whole matrix) (215 ms)

If I disable the CUDA/GpuMat_SetTo.DifferentVal test case, the next test case fails:

[ RUN      ] CUDA/GpuMat_SetTo.Masked/0, where GetParam() = (Quadro P2000, 128x128, 8UC1, whole matrix)
unknown file: Failure
C++ exception with description "OpenCV(4.6.0-dev) /home/ci/opencv_contrib/modules/cudev/include/opencv2/cudev/grid/detail/transform.hpp:264: error: (-217:Gpu API call) invalid resource handle in function 'call'
" thrown in the test body.
[  FAILED  ] CUDA/GpuMat_SetTo.Masked/0, where GetParam() = (Quadro P2000, 128x128, 8UC1, whole matrix) (220 ms)

System Information

  • OpenCV version: 4.6.0
  • Operating System / Platform: Ubuntu 20.04
  • Compiler & compiler version: GCC 9.4.0
  • GPU: Quadro P2000
@cudawarped
Contributor

cudawarped commented Oct 12, 2022

It looks like a device error caused by CUDA_Event/AsyncEvent.Timing/1, which comes before CUDA/GpuMat_SetTo.Zero.

The call to cudaEventElapsedTime() returns cudaErrorInvalidResourceHandle. Because the error was returned directly rather than queried through cudaGetLastError(), the last error code is not reset to cudaSuccess, and the next call to CV_CUDEV_SAFE_CALL(cudaGetLastError()); throws an exception. Do you think it would be better for checkCudaError() to call cudaGetLastError() internally when an error is thrown, resetting the last error code to cudaSuccess, or for cv::cuda::Event::elapsedTime() to clear the error, which is not fatal?

It doesn't fail if I disable it (--gtest_filter=-CUDA_Event/AsyncEvent.W*). Does it still fail on your end if you disable it?
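The mechanism described above can be illustrated without a GPU. This is only a minimal sketch that mocks the runtime's last-error slot. The `mock` namespace and everything in it are stand-ins invented for illustration, not the CUDA runtime API. The only behavior assumed is that a failing call records its error and that only a cudaGetLastError()-style query resets it.

```cpp
#include <cassert>  // used by the usage example below

namespace mock {  // hypothetical stand-in for the CUDA runtime, illustration only

enum Error { Success = 0, InvalidResourceHandle = 1 };

// Last error recorded by any call; reset only by getLastError().
static Error g_lastError = Success;

// Stand-in for a failing API call such as cudaEventElapsedTime():
// the error is returned to the caller AND remembered as the last error.
inline Error failingCall() {
    g_lastError = InvalidResourceHandle;
    return g_lastError;
}

// Stand-in for a successful kernel launch: it leaves the last error untouched.
inline void launchKernel() {}

// Stand-in for cudaGetLastError(): returns the last error and resets it.
inline Error getLastError() {
    Error e = g_lastError;
    g_lastError = Success;
    return e;
}

}  // namespace mock

// Replays the sequence suspected in the failing run and returns what the
// check after the unrelated launch observes.
inline mock::Error checkAfterUnrelatedLaunch() {
    mock::getLastError();         // start from a clean state
    mock::failingCall();          // error returned to the caller, never queried
    mock::launchKernel();         // unrelated, successful work
    return mock::getLastError();  // this check reports the stale error
}
```

Under these assumptions, `assert(checkAfterUnrelatedLaunch() == mock::InvalidResourceHandle);` holds, which matches the symptom of a later, unrelated test reporting "invalid resource handle".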

@asmorkalov
Contributor

The next CUDA call in the sequence should ignore the last error code and set its own status. If I understand correctly, there is some cudaGetLastError() call that is not preceded by a relevant CUDA call. If that is the case, the test should be updated, not the error reporting function.

@cudawarped
Contributor

The next CUDA call in the sequence should ignore the last error code and set its own status

There are two cases:

  1. cudaSafeCall(cudaGetLastError()), and
  2. cudaSafeCall(cudaApiFunction())

In case 1 there is no problem: if cudaGetLastError() returns an error code, OpenCV throws an exception and the error code has already been cleared by that call to cudaGetLastError().

In case 2 cudaApiFunction() returns the error code and OpenCV throws an exception, but the error code remains set, so any future internal call to cudaSafeCall(cudaGetLastError()) made by an OpenCV function could throw an exception.

Therefore the next CUDA call can't ignore the last error code if it calls cudaSafeCall(cudaGetLastError()) or a similar macro anywhere internally.
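To make the difference between the two cases concrete, here is a rough sketch with a mocked error slot and a throwing `safeCall()` helper. All names in the `sketch` namespace are hypothetical stand-ins for cudaSafeCall, cudaGetLastError and an arbitrary API function. This is not OpenCV's actual implementation.

```cpp
#include <cassert>  // used by the usage example below
#include <stdexcept>
#include <string>

namespace sketch {  // hypothetical names; not OpenCV or CUDA API

enum Error { Success = 0, InvalidResourceHandle = 1 };
static Error g_lastError = Success;

// Stand-in for an API function: on failure it returns the error and records it.
inline Error apiFunction(bool fail) {
    if (fail) g_lastError = InvalidResourceHandle;
    return fail ? InvalidResourceHandle : Success;
}

// Stand-in for cudaGetLastError(): returns the recorded error and resets it.
inline Error getLastError() {
    Error e = g_lastError;
    g_lastError = Success;
    return e;
}

// Stand-in for cudaSafeCall(): throws on any non-success code.
inline void safeCall(Error e) {
    if (e != Success)
        throw std::runtime_error("Gpu API call failed, code " +
                                 std::to_string(static_cast<int>(e)));
}

// Case 1: the error is fetched through getLastError(), which clears it.
inline bool staleErrorAfterCase1() {
    g_lastError = InvalidResourceHandle;  // pretend earlier work failed
    try { safeCall(getLastError()); } catch (const std::runtime_error&) {}
    return g_lastError != Success;
}

// Case 2: the error comes from the return value, so nothing clears it.
inline bool staleErrorAfterCase2() {
    g_lastError = Success;
    try { safeCall(apiFunction(true)); } catch (const std::runtime_error&) {}
    return g_lastError != Success;
}

// A later internal check, as an unrelated library function might perform.
inline bool laterCheckThrows() {
    try { safeCall(getLastError()); }
    catch (const std::runtime_error&) { return true; }
    return false;
}

}  // namespace sketch
```

With this mock, `assert(!sketch::staleErrorAfterCase1());` and `assert(sketch::staleErrorAfterCase2());` both hold, and a call to `sketch::laterCheckThrows()` right after case 2 returns true even though the later code did nothing wrong.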

Shouldn't OpenCV be responsible for clearing this internally if it throws an exception?

The alternative would be for OpenCV to expose cudaGetLastError() and make it the user's responsibility to call it after exceptions are thrown.
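For the alternative, the caller-side pattern might look something like the sketch below. Note that `resetLastError()` is a hypothetical wrapper made up for this example; OpenCV does not expose such a function in the version discussed here. `elapsedTime()` and `setTo()` are mocked stand-ins, not the real cv::cuda functions.

```cpp
#include <cassert>  // used by the usage example below
#include <stdexcept>

namespace alt {  // hypothetical API, illustration only

enum Error { Success = 0, InvalidResourceHandle = 1 };
static Error g_lastError = Success;

// Hypothetical public wrapper around cudaGetLastError(): returns and resets.
inline Error resetLastError() {
    Error e = g_lastError;
    g_lastError = Success;
    return e;
}

// Stand-in for an operation that fails as in case 2: it throws based on a
// returned error code and leaves the last-error slot set.
inline void elapsedTime() {
    g_lastError = InvalidResourceHandle;
    throw std::runtime_error("invalid resource handle");
}

// Stand-in for a later operation that checks the last error internally.
inline void setTo() {
    if (resetLastError() != Success)
        throw std::runtime_error("stale error reported by unrelated call");
}

// Caller code: clear the error state by hand after catching the exception.
inline bool callerRecovers() {
    try { elapsedTime(); }
    catch (const std::runtime_error&) { resetLastError(); }  // user's responsibility
    try { setTo(); }
    catch (const std::runtime_error&) { return false; }
    return true;
}

}  // namespace alt
```

In this mock, `assert(alt::callerRecovers());` holds. Dropping the `resetLastError()` call in the first catch block makes `setTo()` throw, which is the burden this option places on every caller.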

@asenyaev
Contributor Author

@cudawarped, if I run the tests with the --gtest_filter=-CUDA_Event/AsyncEvent.W* flag, one test after CUDA/GpuMat_SetTo.Zero still fails. However, if I run with --gtest_filter=-CUDA_Event/AsyncEvent.Timing/*, all tests pass.

@cudawarped
Contributor

@asenyaev sorry, wrong flag. That one was meant to run only CUDA_Event/AsyncEvent.Timing from CUDA_Event/AsyncEvent, to make sure that it was the test causing the error.

@asmorkalov
Contributor

In case 2 cudaApiFunction() returns the error code and OpenCV throws an exception, but the error code remains set, so any future internal call to cudaSafeCall(cudaGetLastError()) made by an OpenCV function could throw an exception.

It means that we have a sequence like this:
CudaAPICall -> cudaGetLastError -> Exception -> Non-CUDA code -> cudaGetLastError. The last cudaGetLastError raises the issue. It means that an out-of-order cudaGetLastError call reports some random state. I propose to fix the sequence rather than forcibly drop the error state.
The last error code could be useful on the caller side for debugging purposes. Not all OpenCV functions print details.

@cudawarped
Contributor

cudawarped commented Oct 13, 2022

It means that we have a sequence like this: CudaAPICall -> cudaGetLastError -> Exception -> Non-CUDA code -> cudaGetLastError. The last cudaGetLastError raises the issue. It means that an out-of-order cudaGetLastError call reports some random state. I propose to fix the sequence rather than forcibly drop the error state.

Adding more detail, the sequence is:
CudaAPICall -> cudaSafeCall on return code -> Exception -> OpenCV cuda::|cudacodec:: code that then calls cudaSafeCall(cudaGetLastError) internally
The only ways I can see to fix the sequence are:

  1. Change the way case 2 is handled (CudaAPICall -> Fails -> cudaGetLastError() to clear internal state -> cudaSafeCall on previous state -> Exception) which would probably amount to the same thing as calling cudaGetLastError() internally in the cudaSafeCall macro.
  2. Remove all calls to cudaSafeCall(cudaGetLastError).

Anything else, such as clearing the error before CUDA API calls, will most likely fail because the runtime calls can be asynchronous with respect to the host.
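As a rough idea of what option 1 could look like, the sketch below drains the last-error slot inside the checking helper before it throws. `safeCallClearing()` and the rest of the `fix` namespace are hypothetical names on top of a mocked error slot. This is not a patch against OpenCV's checkCudaError().

```cpp
#include <cassert>  // used by the usage example below
#include <stdexcept>
#include <string>

namespace fix {  // hypothetical names; mocked runtime state

enum Error { Success = 0, InvalidResourceHandle = 1 };
static Error g_lastError = Success;

// Stand-in for an API function that returns and records its error.
inline Error apiFunction(bool fail) {
    if (fail) g_lastError = InvalidResourceHandle;
    return fail ? InvalidResourceHandle : Success;
}

// Stand-in for cudaGetLastError(): returns the recorded error and resets it.
inline Error getLastError() {
    Error e = g_lastError;
    g_lastError = Success;
    return e;
}

// Option 1: on failure, reset the internal state first, then throw with the
// original code in the message so the information is not lost to the caller.
inline void safeCallClearing(Error e) {
    if (e == Success) return;
    getLastError();  // drain the last-error slot so later checks start clean
    throw std::runtime_error("Gpu API call failed, code " +
                             std::to_string(static_cast<int>(e)));
}

// A case 2 failure followed by a later internal check.
inline bool laterCheckThrowsAfterFailure() {
    g_lastError = Success;
    try { safeCallClearing(apiFunction(true)); }
    catch (const std::runtime_error&) {}
    try { safeCallClearing(getLastError()); }
    catch (const std::runtime_error&) { return true; }
    return false;
}

}  // namespace fix
```

Here `assert(!fix::laterCheckThrowsAfterFailure());` holds, since the first exception leaves no stale state behind. The trade-off raised in this thread still applies: once the state is cleared, the exception message is the only place the original error code survives.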

The last error code could be useful on the caller side for debugging purposes. Not all OpenCV functions print details.

I agree, but if we don't expose a way to clear it through OpenCV, users who don't link against the CUDA toolkit could face exceptions from OpenCV cuda::|cudacodec:: code that calls cudaSafeCall(cudaGetLastError) internally, triggered by earlier case 2 calls which threw an exception.

If we clear the error when an exception is thrown, then shouldn't the error code always be reported to the user in the exception message?
