
[Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts #4255

Open
masterleinad opened this issue Aug 4, 2021 · 22 comments · Fixed by #5115
Labels
bug Something isn't working runtime Runtime library related issue

Comments

@masterleinad
Contributor

masterleinad commented Aug 4, 2021

Describe the bug
Running

#include <climits>
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   cl::sycl::default_selector device_selector;
   cl::sycl::queue queue(device_selector);
   std::cout << "Running on "
             << queue.get_device().get_info<cl::sycl::info::device::name>()
             << "\n";
   size_t N = INT_MAX; //breaks for CUDA
   // size_t N = 5000000000; // breaks for Intel
   sycl::range<1> range(N+1);
   auto parallel_for_event = queue.submit([&](sycl::handler& cgh) {
     cgh.parallel_for(range, [=](sycl::item<1> /*item*/) {});
   });

   return 0;
}

deadlocks on CUDA devices or gives

C++ exception with description "PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)" thrown in the test body.

on Intel GPUs when compiled and run via

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int -fsycl-targets=nvptx64-nvidia-cuda-sycldevice dummy.cc && ./a.out

and, respectively,

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int dummy.cc && ./a.out

Environment:

  • OS: Linux
  • Target device and vendor: Intel GPU, NVIDIA GPU
  • DPC++ version: nightly release 20210621
@zjin-lcf
Contributor

zjin-lcf commented Aug 13, 2021

Running on Intel(R) UHD Graphics P630 [0x3e96]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Provided range is out of integer limits. Pass `-fno-sycl-id-queries-fit-in-int' to disable range check. -30 (CL_INVALID_VALUE)

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

@bader
Contributor

bader commented Aug 16, 2021

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

Yes. It's done for performance reasons and can be relaxed with the -fno-sycl-id-queries-fit-in-int flag if needed.

@masterleinad
Contributor Author

As said in the initial post, I was already using -fno-sycl-id-queries-fit-in-int.

@AerialMantis AerialMantis added cuda CUDA back-end runtime Runtime library related issue labels Aug 23, 2021
@AerialMantis AerialMantis added this to Selected in oneAPI DPC++ Aug 25, 2021
@AerialMantis AerialMantis moved this from Selected to Needs triage in oneAPI DPC++ Aug 25, 2021
@AerialMantis AerialMantis moved this from Needs triage to Selected in oneAPI DPC++ Nov 29, 2021
@AerialMantis AerialMantis moved this from Selected to In review in oneAPI DPC++ Dec 7, 2021
@masterleinad
Contributor Author

With

size_t N = 5000000000lu; // breaks for Intel

the test still fails with

Running on Intel(R) Graphics [0x020a]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

on Intel GPUs with a nightly build from 10/25.

npmiller added a commit to npmiller/llvm that referenced this issue Dec 9, 2021
This is the equivalent for HIP of the changes in intel#5095.

It also fixes intel#4255 for the HIP plugin.
@AerialMantis
Contributor

Now that #5095 is merged this should address the problem for the CUDA backend, so I will remove the CUDA label.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

@AerialMantis AerialMantis removed the cuda CUDA back-end label Dec 13, 2021
@bader
Contributor

bader commented Dec 13, 2021

HIP backend fix is not merged yet.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

I think the exception with the CL_INVALID_WORK_GROUP_SIZE error code might be expected here. Do you think OpenCL/Level Zero should support a work size > 5000000000?

@AerialMantis AerialMantis added the hip Issues related to execution on HIP backend. label Dec 13, 2021
@AerialMantis
Contributor

I'm not sure about Level Zero, but AFAICT OpenCL doesn't have any limitation on the global work size. The only thing I see is the CL_KERNEL_GLOBAL_WORK_SIZE query for clGetKernelWorkGroupInfo, though that is only for custom devices and built-in kernel functions, so I believe in OpenCL any global size is expected to work.

Though 5000000000 is larger than the max value of a 32-bit unsigned integer, so I can see why this could fail.

@bader
Contributor

bader commented Dec 16, 2021

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please?
I see that the Level Zero plug-in tries to set the work-group size by using the zeKernelSuggestGroupSize function, whose global-size parameters are of type uint32_t, i.e. 32-bit integers. So it looks like, although SYCL uses the size_t type to represent the global work size, the Level Zero plugin can only support global work sizes up to UINT32_MAX.
The OpenCL back-end uses clEnqueueNDRangeKernel directly, which accepts size_t global work sizes, so it can potentially support the full range of values allowed for SYCL.
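
For illustration only, a minimal sketch of the two entry points involved (the helper names are mine; assumes the Level Zero and OpenCL headers are available). The Level Zero suggestion call takes uint32_t global sizes, so a 1-D global size above UINT32_MAX cannot even be expressed at that layer, while the OpenCL enqueue takes size_t:

#include <level_zero/ze_api.h>
#include <CL/cl.h>
#include <cstdint>
#include <cstddef>

// Level Zero: global sizes are three uint32_t values, so a 1-D global size
// above UINT32_MAX is not representable here.
ze_result_t suggest_group_size_1d(ze_kernel_handle_t kernel, uint32_t global_x,
                                  uint32_t *gx, uint32_t *gy, uint32_t *gz) {
  return zeKernelSuggestGroupSize(kernel, global_x, 1, 1, gx, gy, gz);
}

// OpenCL: clEnqueueNDRangeKernel takes size_t global sizes, so a full 64-bit
// global work size is representable at the API level.
cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel, size_t global_x) {
  return clEnqueueNDRangeKernel(queue, kernel, 1, /*offset=*/nullptr,
                                &global_x, /*local=*/nullptr, 0, nullptr, nullptr);
}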

oneAPI DPC++ automation moved this from In review to Closed Dec 17, 2021
bader pushed a commit that referenced this issue Dec 17, 2021
…#5115)

This is the equivalent for HIP of the changes in #5095.

It also fixes #4255 for the HIP plugin.
@masterleinad
Contributor Author

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please?

It seems to work with the OpenCL back end.

@bader bader reopened this Dec 18, 2021
oneAPI DPC++ automation moved this from Closed to In progress Dec 18, 2021
@bader bader removed the hip Issues related to execution on HIP backend. label Dec 18, 2021
@bader bader removed this from In progress in oneAPI DPC++ Dec 18, 2021
@bader bader changed the title sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts [Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts Dec 18, 2021
@TApplencourt
Contributor

TApplencourt commented May 26, 2022

The bug is still present in Compiler 2022.1.0 (2022.x.0.20220503) with the L0 backend (agama 449)

$cat master.cpp
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   sycl::queue Q;
   size_t N = 4298000000;
   Q.parallel_for(N, [=](auto i) {}).wait();
}
$dpcpp master.cpp -fno-sycl-id-queries-fit-in-int
$./a.out
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

Also, if we have a workaround for when range >= numeric_limits<int>::max(), can we do the conversion at runtime?
Some of our users got bitten by this limitation. Just to add, Q.fill cannot be used to fill a buffer bigger than an int allows. I assure you that people allocate more than 4GB of memory and will try to set it... So having -fno-sycl-id-queries-fit-in-int by default may streamline the user experience.
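
For reference, a minimal sketch of the Q.fill case mentioned above (an illustrative reproducer, not taken from a real application): filling a USM allocation with more than INT_MAX elements goes through an implicit parallel_for and hits the same limit.

#include <CL/sycl.hpp>
#include <cstddef>

int main() {
   sycl::queue Q;
   std::size_t N = 5000000000ull;                // > INT_MAX elements
   char *ptr = sycl::malloc_device<char>(N, Q);  // > 4GB device allocation
   Q.fill(ptr, char{0}, N).wait();               // implicit parallel_for over N elements
   sycl::free(ptr, Q);
}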

@bader
Contributor

bader commented May 29, 2022

I discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^{32}. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.
Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?
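
As an illustration of that idea only, a minimal sketch of what chunked submission could look like on the application side (chunked_parallel_for is a hypothetical helper, not part of any runtime; the default chunk size is an assumption):

#include <CL/sycl.hpp>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: split a large 1-D range into several submissions so
// each individual launch stays below the back-end's global-size limit.
// The chunk offset is added back inside the kernel, so the functor still
// observes a logical index in [0, total).
template <typename F>
void chunked_parallel_for(sycl::queue &q, std::size_t total, F f,
                          std::size_t chunk = std::size_t{1} << 30) {
  for (std::size_t base = 0; base < total; base += chunk) {
    std::size_t n = std::min(chunk, total - base);
    q.parallel_for(sycl::range<1>(n),
                   [=](sycl::id<1> i) { f(base + i[0]); });
  }
  q.wait();
}

// Usage: chunked_parallel_for(Q, 5000000000ull, [=](std::size_t i) { /* body */ });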

@TApplencourt
Contributor

TApplencourt commented May 31, 2022

I discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^{32}. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.

Oh, I see. Thanks for the update! Let me gather more info and come back to you.
Compiling with -fno-sycl-id-queries-fit-in-int makes the run-time error disappear, but I haven't yet checked the results for correctness.

Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?

Would it maybe be more manageable to do it at the SYCL runtime level?

Indeed, otherwise each and every application will need to do that for each kernel submission (this can be worked around with some nice abstraction). More painfully, the work also needs to be done for each function that implicitly uses "parallel_for", for example Q.fill. That one starts to be more tedious to implement, as it requires an understanding of the DPC++ runtime.

Edit: After talking to @jandres742, the "real" workaround is to set -ze-opt-greater-than-4GB-buffer-required when creating the module.

Edit2: Maybe also related to an IGC bug where get_global_id() only goes up to UINT_MAX.

@bader
Contributor

bader commented Sep 4, 2022

One more work-around idea: I suppose that if we explicitly set a work-group size so that the number of work-groups is < 2^{32}, the code from the issue description should work with the Level Zero back-end. This requires using the parallel_for kernel invocation function with an nd_range argument instead of a range, as sketched below.
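
A minimal sketch of that suggestion (assuming the chosen local size evenly divides N, as SYCL requires for nd_range; whether this actually avoids the Level Zero limit depends on the plugin):

// 5000000000 / 512 = 9765625 work-groups, which is well below 2^32.
size_t N = 5000000000ull;
size_t local = 512;
Q.parallel_for(
    sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(local)},
    [=](sycl::nd_item<1> it) {
      size_t i = it.get_global_id(0);  // logical index in [0, N)
      (void)i;
    }).wait();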

@TApplencourt
Contributor

TApplencourt commented Sep 6, 2022

#include <iostream>
#include <CL/sycl.hpp>
#include <level_zero/ze_api.h>

int main(int, char**) {
   sycl::queue Q;
   sycl::device D = Q.get_device();

   auto zD = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(D);
   ze_device_compute_properties_t device_properties{};
   device_properties.stype = ZE_STRUCTURE_TYPE_DEVICE_COMPUTE_PROPERTIES;
   zeDeviceGetComputeProperties(zD, &device_properties);

   // The L0 spec may need to be changed so these don't return a `uint32_t`
   uint32_t maxGroupCountX = device_properties.maxGroupCountX;
   uint32_t maxGroupSizeX = device_properties.maxGroupSizeX;
   size_t  maxWorkItemX = (size_t) maxGroupSizeX * maxGroupCountX;
   std::cout << "maxGroupSizeX " << maxGroupSizeX << std::endl;
   std::cout << "maxGroupCountX " << maxGroupCountX << std::endl;
   std::cout << "maxGroupSizeX*maxGroupCountX " << maxWorkItemX << std::endl;

   std::cout << "Sumiting kernel..." << std::endl;
   std::cout << "Submiting maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxGroupCountX, [=](sycl::id<1> i) {}).wait();

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();
   // SYCL is a high-level language, that should run independently of any backend restriction
   std::cout<< "Submitting 2*maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(2*maxWorkItemX, [=](sycl::id<1> i) {}).wait();
}

I wrote a simple set of reproducers. I think all of them should pass. Maybe it can help.
Just to be clear, this issue is blocking a lot of applications from running their large problem sizes ;(

My understanding is that SYCL doesn't have any "kernel-wise sync", so we should always be able to split a large work-item range into whatever chunk sizes the backend makes available (assuming the specified local-group size fits, of course).

@bashbaug
Contributor

bashbaug commented Sep 6, 2022

So we should always be able to split a large work-item range into whatever chunk sizes the backend makes available (assuming the specified local-group size fits, of course).

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

Just to be clear this issue is blocking a lot of applications of running their large problem size ;(

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

@TApplencourt
Contributor

TApplencourt commented Sep 6, 2022

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

I see, thanks for the explanation! As always, from the outside everything looks easy :) I guess you will need to add a new kernel argument to handle the offset and the like. Sounds icky indeed.
I hope that this workaround is not mandatory and that the L0 backend can fix this issue. I hear that the CUDA and OpenCL backends handle my reproducer fine.

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :) I think that our priority should be to get

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();

working. We care less about the 2*maxWorkItemX case.

@bashbaug
Contributor

bashbaug commented Sep 6, 2022

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :)

OK thanks, this is helpful.

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.
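
A concrete example with illustrative numbers (not taken from any real device): with max_group_size = 1024 and max_group_count = 2^32 − 1, a global range of 1024 * (2^32 − 1) ≈ 4.4 * 10^12 launches exactly 2^32 − 1 groups when the group size is 1024, which fits; the same global range with a group size of 512 needs 2 * (2^32 − 1) groups, which exceeds the group-count limit.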

@TApplencourt
Contributor

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.

This sounds like a valid limitation to me! If the user specifies an nd_range / group size, they give up some flexibility.
And in the case of a range / no group size, the group-size selection algorithm should choose the "correct" group size for the kernel to run.

@xtian-github

@smaslov-intel @bader do we have an ETA for this issue to be resolved? Thomas/ANL is asking for it. Thanks.

@smaslov-intel
Contributor

A workaround is coming in #7321.
It will allow some work sizes greater than UINT32_MAX (those that are exactly divisible by some legal WG size).

againull pushed a commit that referenced this issue Nov 9, 2022

Workaround for the issue described in
#4255
Signed-off-by: Sergey V Maslov <sergey.v.maslov@intel.com>
@KornevNikita
Contributor

KornevNikita commented May 17, 2024

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@smaslov-intel, could I ask you to take one of the following actions? :)

  • Please provide an update if you have any (or just a small comment if you don't have any yet).
  • OR mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it.
  • OR close the issue if it has been resolved.
  • OR take any other suitable action.

Thanks!

@xtian-github

@KornevNikita SergeyM is on leave. I suggest the SYCL team take a look to see what the right fix is to address this issue. Thanks.
