Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm Mali/OpenCL - OpenCL error CL_OUT_OF_RESOURCES #13108

Open
ayaromenok opened this issue Nov 11, 2018 · 16 comments
Open

Arm Mali/OpenCL - OpenCL error CL_OUT_OF_RESOURCES #13108

ayaromenok opened this issue Nov 11, 2018 · 16 comments
Labels
category: ocl platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc RFC

Comments

@ayaromenok
Copy link

ayaromenok commented Nov 11, 2018

_update: this bug is specific to Mali Midgard(T8xx\T7xx\T6xx) GPU only. More resent Mali Bifrost(Gxx) series not affected

System information (version)
  • OpenCV => 4.0beta
  • Operating System / Platform => Ubuntu 18.04/arm64
  • Compiler => Ubuntu/Linaro 7.3.0-27
  • OpenCL => OpenCL 1.2 v1.r14p0-01rel0-git(966ed26)
  • GPU => ARM Mali T860(Midgard 4gen)
Detailed description

It is specific bug to arm\mali hardware in OpenCL, which related to limited resources on this mobile device.
OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('stage1_with_sobel', dims=2, globalsize=4096x4096x1, localsize=32x8x1) sync=false

reducing maxWorkGoupSize_ value of cl::Device from default 256 to 128 (to give more resources to kernel) remove that particular bug in cv::GausianBlur with size 3x3(and may be many more), but bigger parameters, like size '5x5' or other kernel, like described in #11503 issue, still lead to CL_OUT_OF_RESOURCES

I am continue to evaluate another CL-acceletared methods on Arm Mali-t860 to figure out another reasons of CL_OUT_OF_RESOURCES.

Steps to reproduce
#include "opencv2/opencv.hpp"
using namespace cv;
 
int main(int argc, char** argv)
{
    UMat img, gray;
    imread("lena.jpg", IMREAD_COLOR).copyTo(img); //use lena.jpg from CV examples
    cvtColor(img, gray, COLOR_BGR2GRAY);
    GaussianBlur(gray, gray,Size(3, 3), 1.5);
    Canny(gray, gray, 0, 50);   
    return 0;
}
@tomoaki0705
Copy link
Contributor

You're too quickly jumping to the conclusion.
Did you check the result of GaussianBlur and Canny ? Was it really broken ?

You can't expect that the device has an infinite memory.
If the device is not capable of computation, such as insufficient memory, it should notice the user, and that's what exactly you're seeing.
The message OpenCL error CL_OUT_OF_RESOURCES looks like an error, but in general case, OpenCV has a fallback CPU version implementation.
So at the end, you get a correct result, and also know that you gave a task which was too heavy for Mali.

Is there any other reason you think this case is a bug ?

@ayaromenok
Copy link
Author

ayaromenok commented Nov 11, 2018

Well, I believe, that a bug in specific combination of hardware/vendor driver and OpenCV/OpenCL (maybe even worse - graphic system, which is X11 and arm64 architecture/or kernel 4.4).

While result may be correct, fallback from hardware implementation(CL/GPU) to software implementation(CPU or CL over CPU) in case, if GPU hardware able to work with such kernels/algorithms - it's bug.

Proposed solution to decrease a maximum size of maxWorkGroupSize_ for arm/mali works for test case from above (soon I will prepare a patch), but for other test case with CL_OUT_OF_RESOURCES is not.

Creating this ticket, I am trying to evaluate (by receive some feedback from people, who get similar errors on arm/mali/opencl), how many issues can be solve with such solution (my expectation is about 5%, may be even less. looks like for more flexible Nvidia GPU it works much better).

I am continue to evaluate, how many CL kernels have issues on that particular hardware, so will update this ticket, probably for a long time :)

PS: can't assign ticket to myself (I don't won't to disturb core dev's with that issue)

@tomoaki0705
Copy link
Contributor

FYI, if you run entire opencv test suite, you'll get more than thousand lines of CL_OUT_OF_RESOURCES
Here's a grep from my local test result.

$ grep CL_OUT_OF_RESOURCES arm-firefly-rk3399-opencv_test_* | wc -l
2312

Firefly RK3399 has Mali T860.

Just for curiosity, how much memory does your Mali has ?

@ayaromenok
Copy link
Author

ayaromenok commented Nov 11, 2018

number of CL_OUT_OF_RESOURCES errors depends from algorithm - it can happens in every line for 2D algorithm or even at more complicated template. so, it's not representative - only number of tests, which is passed/or not. especially opencv_perf_*.

While my issues happens on RK3399/T860/4 too, with driver/CV version from above, without yours detailed information about driver/window system/OS + OpenCV version, information from your grep is useless for this ticket:(

@tomoaki0705
Copy link
Contributor

I'm not following the situation.
Does the test fails in your situation ?

It passed on my firefly
Here's my log of opencv_test_imgproc and opencv_perf_imgproc

BTW, my Firefly has

  • GCC 5.4.0
  • Ubunntu 16.04
  • Driver version is OpenCL 1.2 v1.r13p0-00rel0-git(a4271c9).04dadb82d3612c978f88c00109101694
  • So, it's different platform. Just Mali T860 is common

Now your explanation is getting uncertain. Are you talking about specific test failure ? or are you just talking about printed log CL_OUT_OF_RESOURCES ?

ayaromenok added a commit to ayaromenok/opencv that referenced this issue Nov 26, 2018
@ayaromenok
Copy link
Author

ayaromenok commented Nov 26, 2018

sorry for delay - just return to RK3399 board ;)

start test with opencv_test_core:

  • with workgroup size 256(default for T860): 598 CL_OUT_OF_RESOURCES errors
  • with workgroup size 128(default/2 for T860): 13 CL_OUT_OF_RESOURCES errors

Just for curiosity, how much memory does your Mali has ?

from my board: Max memory allocation size = 493 MB 19 KB. in your logs it's a 494 MB 34 KB

@ayaromenok
Copy link
Author

ayaromenok commented Nov 26, 2018

removing duplicate error messages(multiply call of FFT, for example) from opencv_test_core:
for (default) 256

62:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('gemm', dims=2, globalsize=16x16x1, localsize=16x16x1) sync=false
6926:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('reduce', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
6946:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('meanStdDev', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=false
7833:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('minmaxloc', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
8251:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('reduce', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
8624:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('minmaxloc', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
8892:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('reduce', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
9780:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('minmaxloc', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true 
9786:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('reduce', dims=1, globalsize=1024x1x1, localsize=256x1x1) sync=true
12761:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('fft_multi_radix_rows', dims=2, globalsize=160x768x1, localsize=160x1x1) sync=false
12899:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('ifft_multi_radix_rows', dims=2, globalsize=160x768x1, localsize=160x1x1) sync=false
13167:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('gemm', dims=2, globalsize=16x80x1, localsize=16x16x1) sync=false

for (default/2) 128

7162:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('meanStdDev', dims=1, globalsize=512x1x1, localsize=128x1x1) sync=false
12370:OpenCL error CL_OUT_OF_RESOURCES (-5) during call: clEnqueueNDRangeKernel('ifft_multi_radix_cols', dims=2, globalsize=641x96x1, localsize=1x96x1) sync=false

so, looks like solution to decrease size of working group is working and useful, but required additional testing

@tomoaki0705
Copy link
Contributor

Are you talking about specific test failure ? or are you just talking about printed log CL_OUT_OF_RESOURCES ?

:(

@ayaromenok
Copy link
Author

ayaromenok commented Nov 27, 2018

Are you talking about specific test failure ? or are you just talking about printed log CL_OUT_OF_RESOURCES ?

CL_OUT_OF_RESOURCES may inform about different issues inside of OpenCL subsystem, like out_of_memory, out_of_stack_memory, out_of_heap_memory, to much nesting loops, out_of_registers and many more. Only detailed evaluation of every error with OpenCL debugger may give certain reason.

So it's not a specific test failure, since it's appear in different algorithms, like reduce, fft, ifft. And it's not ALL issues(only some of them), which described by CL_OUT_OF_RESOURCES in printed log.

Issue, which can not be fixed by proposed reducing WGSize(nesting loops), required more complicated evaluation/fixes, like rewriting OpenCL kernel and/or change data flow in C++ code.

BTW, my board if NanoPI M4 with Ubuntu 18.04/arm64 and slightly new driver, than yours. I will install a Ubuntu 16.04/arm32 to have more comparable results. Also I am trying to build OpenCV with OpenCL for ARM Bifrost(Mali G71) for android to make some test too.

@alalek
Copy link
Member

alalek commented Nov 27, 2018

BTW, There is one place in OpenCL with "dynamic" adjusting of OpenCL group size / kernel re-compilation for selected group size: filter2D implementation.
But this code doesn't look straightforward.

@ayaromenok
Copy link
Author

There is one place in OpenCL with "dynamic" adjusting of OpenCL group size / kernel re-compilation for selected group size: filter2D implementation.

Thank you for this info - it can very helpful in future.

At present moment my idea is to detect a ARM Mali during initialization of CL device, and decrease a WorgGroup Size only for algorithms, where it may help (like gemm and reduce, but not for ifft) - to use "static" setup. Or even use one reduced workgroup size for all (if performance penalty would be small and\or would be even performance boost).

And need a lot of test before decide which solution is better.

@ayaromenok
Copy link
Author

some updates:

  • this bug affect only older ARM Midgard GPU(T6xx/T7xx/T8xx), while latest ARM Bifrost(Gxx)(which have a bit different hardware technically) works just fine;
  • It takes a lot of time for me(due to issue with I/O on my ARM board) to run perf test on OpenCL mode with different WorkGroupSize: 256(default), 192, 128, 64, 32. While I continue to analyze results, looks like 128(default/2) is optimal value.

@ayaromenok
Copy link
Author

ayaromenok commented Jan 2, 2019

finally, I get stability from my board and can return to this issue.
I make a test with different size of working group(default 256, 192, 128, 64, 32) and get follow results:

  • core, imgproc, calib3d, stitch modules tested (with both test and perf);
  • 20 unique OpenCL method(with different arguments) affected;
  • up to 9 of them can fixed by reduced maxWorkGroupSize;
  • while reduced size not really affect a performance on typical configuration of Arm Midgard GPU(4 cores, like on tested Rk3399), it can have bigger performance influence on very wide(8 cores - like Exynos 7420) or very narrow(1 core - like Exynos 7870)

full table

WG size 256 192 128 64 32
total methods 20 19 13 11 11
stereoBM 1 1 1 1 1
stage1_with_sobel 4 4 OK OK OK
fft_multi_radix_rows 4 2 OK OK OK
ifft_multi_radix_rows 4 2 OK OK OK
ifft_multi_radix_cols 3 2 1 OK OK
gemm 6 OK OK OK OK
meanStdDev 1 1 OK OK OK
reduce 1 1 OK OK OK
pyrDown 8 8 8 8 8
row_filter_C1_D0 6 6 6 6 6
morph 5 5 5 5 5
row_filter 6 6 6 6 6
laplacian 5 5 OK OK OK
pyrUp 5 5 5 5 5
minmaxloc 1 1 OK OK OK
medianFilter3 1 1 1 1 1
medianFilter5 5 5 5 5 5
sobel3 6 6 6 OK OK
sobel5 1 1 1 1 1
sobel7 4 4 4 4 4

I am evaluating every method from above on target hardware(with some successful fixes already) an will update with results soon.

ayaromenok added a commit to ayaromenok/opencv that referenced this issue Jan 2, 2019
ayaromenok added a commit to ayaromenok/opencv that referenced this issue Jan 2, 2019
@liyingchi
Copy link

Thank you for your evaluation on this topic.
I would like to ask how to change the maxWorkGroupSize in the system by using python or C++ ?

@tomoaki0705
Copy link
Contributor

@liyingchi #11797 (comment)

@Dithermaster
Copy link

Dithermaster commented Nov 13, 2023

Are these CL_RGB images (and not CL_RGBA)? If so, check clGetSupportedImageFormats for your device, because CL_RGB is not often supported. Strangely, driver will let you create the unsupported image, compile the kernel, set the argument, but when you call enqueueNDRangeKernel it will return a status of -5 "CL_OUT_OF_RESOURCES". I have seen this on Windows NVIDIA and Intel GPU drivers; I did not test on AMD, or Mac.

Sorry for the very late answer, but I just ran into this today and this thread came up in my searching. Hat tip to Pragmataraxia at this link who solved it: https://forums.developer.nvidia.com/t/what-else-causes-cl-out-of-resources/11426/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: ocl platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc RFC
Projects
None yet
Development

No branches or pull requests

5 participants