
[Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts #4255

Open
masterleinad opened this issue Aug 4, 2021 · 22 comments · Fixed by #5115
Labels
bug Something isn't working runtime Runtime library related issue

Comments

@masterleinad
Contributor

masterleinad commented Aug 4, 2021

Describe the bug
Running

#include <climits>
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   cl::sycl::default_selector device_selector;
   cl::sycl::queue queue(device_selector);
   std::cout << "Running on "
             << queue.get_device().get_info<cl::sycl::info::device::name>()
             << "\n";
   size_t N = INT_MAX; //breaks for CUDA
   // size_t N = 5000000000; // breaks for Intel
   sycl::range<1> range(N+1);
   auto parallel_for_event = queue.submit([&](sycl::handler& cgh) {
     cgh.parallel_for(range, [=](sycl::item<1> /*item*/) {});
   });

   return 0;
}

deadlocks on CUDA devices or gives

C++ exception with description "PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)" thrown in the test body.

on Intel GPUs when compiled and run via

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int -fsycl-targets=nvptx64-nvidia-cuda-sycldevice dummy.cc && ./a.out

and, respectively,

clang++ -fsycl -fsycl-unnamed-lambda -fno-sycl-id-queries-fit-in-int dummy.cc && ./a.out

Environment:

  • OS: Linux
  • Target device and vendor: Intel GPU, NVIDIA GPU
  • DPC++ version: nightly release 20210621
@zjin-lcf
Contributor

zjin-lcf commented Aug 13, 2021

Running on Intel(R) UHD Graphics P630 [0x3e96]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Provided range is out of integer limits. Pass `-fno-sycl-id-queries-fit-in-int' to disable range check. -30 (CL_INVALID_VALUE)

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

@bader
Contributor

bader commented Aug 16, 2021

Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.

Yes. It's done for performance reasons and can be relaxed with the -fno-sycl-id-queries-fit-in-int flag if needed.

@masterleinad
Contributor Author

As said in the initial post, I was already using -fno-sycl-id-queries-fit-in-int.

@AerialMantis AerialMantis added cuda CUDA back-end runtime Runtime library related issue labels Aug 23, 2021
@AerialMantis AerialMantis added this to Selected in oneAPI DPC++ Aug 25, 2021
@AerialMantis AerialMantis moved this from Selected to Needs triage in oneAPI DPC++ Aug 25, 2021
@AerialMantis AerialMantis moved this from Needs triage to Selected in oneAPI DPC++ Nov 29, 2021
@AerialMantis AerialMantis moved this from Selected to In review in oneAPI DPC++ Dec 7, 2021
@masterleinad
Contributor Author

With

size_t N = 5000000000lu; // breaks for Intel

the test still fails with

Running on Intel(R) Graphics [0x020a]
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

on Intel GPUs with a nightly build from 10/25.

npmiller added a commit to npmiller/llvm that referenced this issue Dec 9, 2021
This is the equivalent for HIP of the changes in intel#5095.

It also fixes intel#4255 for the HIP plugin.
@AerialMantis
Contributor

Now that #5095 is merged this should address the problem for the CUDA backend, so I will remove the CUDA label.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

@AerialMantis AerialMantis removed the cuda CUDA back-end label Dec 13, 2021
@bader
Contributor

bader commented Dec 13, 2021

HIP backend fix is not merged yet.

@bader I believe the remaining issue here is with the OpenCL/Level Zero backend.

I think the exception with the CL_INVALID_WORK_GROUP_SIZE error code might be expected here. Do you think OpenCL/Level Zero should support a work size > 5000000000?

@AerialMantis AerialMantis added the hip Issues related to execution on HIP backend. label Dec 13, 2021
@AerialMantis
Contributor

I'm not sure about Level Zero, but AFAICT OpenCL doesn't have any limitation on the global work size. The only thing I see is the CL_KERNEL_GLOBAL_WORK_SIZE query for clGetKernelWorkGroupInfo, though that is only for custom devices and built-in kernel functions, so I believe in OpenCL any global size is expected to work.

Though 5000000000 is larger than the max value of a 32-bit unsigned integer, so I can see why this could fail.

@bader
Contributor

bader commented Dec 16, 2021

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please?
I see that the Level Zero plug-in tries to set the work-group size by using the zeKernelSuggestGroupSize function, whose global-size parameters are of type uint32_t, i.e. 32-bit integers. So it looks like, although SYCL uses the size_t type to represent the global work size, the Level Zero plugin can only support global work sizes up to UINT32_MAX.
The OpenCL back-end uses clEnqueueNDRangeKernel directly, which accepts size_t global work sizes, so it can potentially support the full range of values allowed for SYCL.
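
For illustration only, a minimal sketch of the two entry points involved (the helper names are mine; assumes the Level Zero and OpenCL headers are available). The Level Zero suggestion call takes uint32_t global sizes, so a 1-D global size above UINT32_MAX cannot even be expressed at that layer, while the OpenCL enqueue takes size_t:

#include <level_zero/ze_api.h>
#include <CL/cl.h>
#include <cstdint>
#include <cstddef>

// Level Zero: global sizes are three uint32_t values, so a 1-D global size
// above UINT32_MAX is not representable here.
ze_result_t suggest_group_size_1d(ze_kernel_handle_t kernel, uint32_t global_x,
                                  uint32_t *gx, uint32_t *gy, uint32_t *gz) {
  return zeKernelSuggestGroupSize(kernel, global_x, 1, 1, gx, gy, gz);
}

// OpenCL: clEnqueueNDRangeKernel takes size_t global sizes, so a full 64-bit
// global work size is representable at the API level.
cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel, size_t global_x) {
  return clEnqueueNDRangeKernel(queue, kernel, 1, /*offset=*/nullptr,
                                &global_x, /*local=*/nullptr, 0, nullptr, nullptr);
}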

oneAPI DPC++ automation moved this from In review to Closed Dec 17, 2021
bader pushed a commit that referenced this issue Dec 17, 2021
…#5115)

This is the equivalent for HIP of the changes in #5095.

It also fixes #4255 for the HIP plugin.
@masterleinad
Contributor Author

@masterleinad, could you check whether the OpenCL back-end has such a limitation by setting SYCL_DEVICE_FILTER=opencl:gpu, please?

It seems to work with the OpenCL back end.

@bader bader reopened this Dec 18, 2021
oneAPI DPC++ automation moved this from Closed to In progress Dec 18, 2021
@bader bader removed the hip Issues related to execution on HIP backend. label Dec 18, 2021
@bader bader removed this from In progress in oneAPI DPC++ Dec 18, 2021
@bader bader changed the title sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts [Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts Dec 18, 2021
@TApplencourt
Contributor

TApplencourt commented May 26, 2022

The bug is still present in Compiler 2022.1.0 (2022.x.0.20220503) with the L0 backend (agama 449)

$cat master.cpp
#include <iostream>
#include <CL/sycl.hpp>

int main(int, char**) {
   sycl::queue Q;
   size_t N = 4298000000;
   Q.parallel_for(N, [=](auto i) {}).wait();
}
$dpcpp master.cpp -fno-sycl-id-queries-fit-in-int
$./a.out
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  PI backend failed. PI backend returns: -54 (CL_INVALID_WORK_GROUP_SIZE) -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

Also, if we have a workaround for when range >= numeric_limits<int>::max(), can we do the conversion at runtime?
Some of our users got bitten by this limitation. Just to add, Q.fill cannot be used to fill a buffer bigger than an int allows. I assure you that people allocate more than 4GB of memory and will try to set it... So having -fno-sycl-id-queries-fit-in-int by default may streamline the user experience.
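
For reference, a minimal sketch of the Q.fill case mentioned above (an illustrative reproducer, not taken from a real application): filling a USM allocation with more than INT_MAX elements goes through an implicit parallel_for and hits the same limit.

#include <CL/sycl.hpp>
#include <cstddef>

int main() {
   sycl::queue Q;
   std::size_t N = 5000000000ull;                // > INT_MAX elements
   char *ptr = sycl::malloc_device<char>(N, Q);  // > 4GB device allocation
   Q.fill(ptr, char{0}, N).wait();               // implicit parallel_for over N elements
   sycl::free(ptr, Q);
}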

@bader
Contributor

bader commented May 29, 2022

I discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^{32}. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.
Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?
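
As an illustration of that idea only, a minimal sketch of what chunked submission could look like on the application side (chunked_parallel_for is a hypothetical helper, not part of any runtime; the default chunk size is an assumption):

#include <CL/sycl.hpp>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: split a large 1-D range into several submissions so
// each individual launch stays below the back-end's global-size limit.
// The chunk offset is added back inside the kernel, so the functor still
// observes a logical index in [0, total).
template <typename F>
void chunked_parallel_for(sycl::queue &q, std::size_t total, F f,
                          std::size_t chunk = std::size_t{1} << 30) {
  for (std::size_t base = 0; base < total; base += chunk) {
    std::size_t n = std::min(chunk, total - base);
    q.parallel_for(sycl::range<1>(n),
                   [=](sycl::id<1> i) { f(base + i[0]); });
  }
  q.wait();
}

// Usage: chunked_parallel_for(Q, 5000000000ull, [=](std::size_t i) { /* body */ });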

@TApplencourt
Contributor

TApplencourt commented May 31, 2022

I discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^{32}. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.

Oh, I see. Thanks for the update! Let me gather more info and come back to you.
Compiling with -fno-sycl-id-queries-fit-in-int makes the run-time error disappear, but I haven't yet checked the results for correctness.

Is it possible to reduce the work size to meet the low-level runtime requirements (e.g. by enqueuing the kernel multiple times)?

Would it maybe be more manageable to do it at the SYCL runtime level?

Indeed, otherwise each and every application will need to do that for each kernel submission (this can be worked around with some nice abstraction). More painfully, the work also needs to be done for each function that implicitly uses "parallel_for", for example Q.fill. That one starts to be more tedious to implement, as it requires an understanding of the DPC++ runtime.

Edit: After talking to @jandres742, the "real" workaround is to set -ze-opt-greater-than-4GB-buffer-required when creating the module.

Edit2: Maybe also related to an IGC bug where get_global_id() only goes up to UINT_MAX.

@bader
Contributor

bader commented Sep 4, 2022

One more work-around idea: I suppose that if we explicitly set a work-group size so that the number of work-groups is < 2^{32}, the code from the issue description should work with the Level Zero back-end. This requires using the parallel_for kernel invocation function with an nd_range argument instead of a range, as sketched below.
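
A minimal sketch of that suggestion (assuming the chosen local size evenly divides N, as SYCL requires for nd_range; whether this actually avoids the Level Zero limit depends on the plugin):

// 5000000000 / 512 = 9765625 work-groups, which is well below 2^32.
size_t N = 5000000000ull;
size_t local = 512;
Q.parallel_for(
    sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(local)},
    [=](sycl::nd_item<1> it) {
      size_t i = it.get_global_id(0);  // logical index in [0, N)
      (void)i;
    }).wait();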

@TApplencourt
Contributor

TApplencourt commented Sep 6, 2022

#include <iostream>
#include <CL/sycl.hpp>
#include <level_zero/ze_api.h>

int main(int, char**) {
   sycl::queue Q;
   sycl::device D = Q.get_device();

   auto zD = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(D);
   ze_device_compute_properties_t device_properties{};
   device_properties.stype = ZE_STRUCTURE_TYPE_DEVICE_COMPUTE_PROPERTIES;
   zeDeviceGetComputeProperties(zD, &device_properties);

   // The L0 spec may need to be changed so these don't return a `uint32_t`
   uint32_t maxGroupCountX = device_properties.maxGroupCountX;
   uint32_t maxGroupSizeX = device_properties.maxGroupSizeX;
   size_t  maxWorkItemX = (size_t) maxGroupSizeX * maxGroupCountX;
   std::cout << "maxGroupSizeX " << maxGroupSizeX << std::endl;
   std::cout << "maxGroupCountX " << maxGroupCountX << std::endl;
   std::cout << "maxGroupSizeX*maxGroupCountX " << maxWorkItemX << std::endl;

   std::cout << "Sumiting kernel..." << std::endl;
   std::cout << "Submiting maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxGroupCountX, [=](sycl::id<1> i) {}).wait();

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();
   // SYCL is a high-level language, that should run independently of any backend restriction
   std::cout<< "Submitting 2*maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(2*maxWorkItemX, [=](sycl::id<1> i) {}).wait();
}

I wrote a simple set of reproducers. I think all of them should pass. Maybe it can help.
Just to be clear, this issue is blocking a lot of applications from running their large problem sizes ;(

My understanding is that SYCL doesn't have any "kernel-wise sync", so we should always be able to split a large work-item range into whatever chunk sizes the backend makes available (assuming the specified local-group size fits, of course).

@bashbaug
Contributor

bashbaug commented Sep 6, 2022

So we should always be able to split a large work-item range into whatever chunk sizes the backend makes available (assuming the specified local-group size fits, of course).

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

Just to be clear this issue is blocking a lot of applications of running their large problem size ;(

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

@TApplencourt
Contributor

TApplencourt commented Sep 6, 2022

FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).

I see, thanks for the explanation! As always, from the outside everything looks easy :) I guess you will need to add a new kernel argument to handle the offset and the like. Sounds icky indeed.
I hope that this workaround is not mandatory and that the L0 backend can fix this issue. I hear that the CUDA and OpenCL backends handle my reproducer fine.

Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :) I think that our priority should be to get

   std::cout<< "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
   Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();

working. We care less about the 2*maxWorkItemX case.

@bashbaug
Contributor

bashbaug commented Sep 6, 2022

To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that is for sure. And I think less than or equal to maxGroupSizeX * maxGroupCountX :)

OK thanks, this is helpful.

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.
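
A concrete example with illustrative numbers (not taken from any real device): with max_group_size = 1024 and max_group_count = 2^32 − 1, a global range of 1024 * (2^32 − 1) ≈ 4.4 * 10^12 launches exactly 2^32 − 1 groups when the group size is 1024, which fits; the same global range with a group size of 512 needs 2 * (2^32 − 1) groups, which exceeds the group-count limit.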

@TApplencourt
Contributor

HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to max_group_size * max_group_count should work if the group size is equal to max_group_size, but it won't work if the group size is smaller.

This sounds like a valid limitation to me! If the user specifies an nd_range / group size, they give up some flexibility.
And in the case of a range / no group size, the group-size selection algorithm should choose the "correct" group size for the kernel to run.

@xtian-github

@smaslov-intel @bader do we have an ETA for this issue to be resolved? Thomas/ANL is asking for it. Thanks.

@smaslov-intel
Contributor

A workaround is coming in #7321.
It will allow some work sizes greater than UINT32_MAX (those that are exactly divisible by some legal WG size).

againull pushed a commit that referenced this issue Nov 9, 2022

Workaround for the issue described in
#4255
Signed-off-by: Sergey V Maslov <sergey.v.maslov@intel.com>
@KornevNikita
Contributor

KornevNikita commented May 17, 2024

Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s).

@smaslov-intel, could I ask you to take one of the following actions? :)

  • Please provide an update if you have any (or just a small comment if you don't have any yet).
  • OR mark this issue with the 'confirmed' label if you have confirmed the problem/request and our team should work on it.
  • OR close the issue if it has been resolved.
  • OR take any other suitable action.

Thanks!

@xtian-github

@KornevNikita SergeyM is on leave. I suggest the SYCL team take a look to see what the right fix is to address this issue. Thanks.
