[Level Zero] sycl::parallel_for with ranges larger than INT_MAX deadlocks or aborts #4255
Running on Intel(R) UHD Graphics P630 [0x3e96]. Is it right that the real issue is that sycl::range should not be limited to the range of an integer? Thanks.
Yes. It's done for performance reasons and can be relaxed with |
As said in the initial post, I was using |
Enabling
the test still fails with
on Intel GPUs with a nightly build from 10/25. |
This is the equivalent for HIP of the changes in intel#5095. It also fixes intel#4255 for the HIP plugin.
HIP backend fix is not merged yet.
I think exception with |
I'm not sure about Level Zero, but AFAICT OpenCL doesn't have any limitation on the global work size; the only thing I see is there's the Though |
@masterleinad, could you check if OpenCL back-end has such limitation by setting |
It seems to work with the |
The bug is still present in
Also if we have a WA when |
I've discussed that issue with @bashbaug a few months ago and he told me that the Level Zero driver doesn't support work sizes larger than 2^32. The application aborts because it doesn't handle the exception the DPC++ runtime library throws to report the unsupported work size.
Oh, I see. Thanks for the update! Let me gather more info and come back to you.
Maybe it would be more manageable to do it at the SYCL runtime level? Indeed, each and every application would need to do that for each kernel submission (this can be a workaround with some nice abstraction). More painful, the work also needs to be done for each function that implicitly uses parallel_for, for example,
Edit: After talking to @jandres742, the "real" workaround is to set -ze-opt-greater-than-4GB-buffer-required when creating the module.
Edit2: Maybe also related to an IGC bug where |
One more work-around idea: I suppose if we explicitly set a work-group size, so that the number of work-groups will be < 2^32, the code from the issue description should work with the Level Zero back-end. This will require using |
```cpp
#include <iostream>
#include <CL/sycl.hpp>
#include <level_zero/ze_api.h>

int main(int, char**) {
  sycl::queue Q;
  sycl::device D = Q.get_device();
  auto zD = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(D);

  ze_device_compute_properties_t device_properties;
  zeDeviceGetComputeProperties(zD, &device_properties);
  // The L0 spec may need to be changed so these don't return `uint32_t`.
  uint32_t maxGroupCountX = device_properties.maxGroupCountX;
  uint32_t maxGroupSizeX = device_properties.maxGroupSizeX;
  size_t maxWorkItemX = (size_t)maxGroupSizeX * maxGroupCountX;

  std::cout << "maxGroupSizeX " << maxGroupSizeX << std::endl;
  std::cout << "maxGroupCountX " << maxGroupCountX << std::endl;
  std::cout << "maxGroupSizeX*maxGroupCountX " << maxWorkItemX << std::endl;

  std::cout << "Submitting kernels..." << std::endl;

  std::cout << "Submitting maxGroupCountX work-items kernel" << std::endl;
  Q.parallel_for(maxGroupCountX, [=](sycl::id<1> i) {}).wait();

  std::cout << "Submitting maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
  Q.parallel_for(maxWorkItemX, [=](sycl::id<1> i) {}).wait();

  // SYCL is a high-level language that should run independently of any
  // backend restriction.
  std::cout << "Submitting 2*maxGroupSizeX*maxGroupCountX work-items kernel" << std::endl;
  Q.parallel_for(2 * maxWorkItemX, [=](sycl::id<1> i) {}).wait();
}
```

I wrote a simple set of reproducers; I think all of them should pass. Maybe it can help. My understanding is that SYCL doesn't have any "kernel-wide sync", so we should always be able to split a large work-item range into whatever chunk sizes the backend makes available (assuming the specified local-group size fits, of course).
FWIW, this is surprisingly difficult to do in the general case. Note that the "global offset" functionality provided by OpenCL and Level Zero offsets the global ID, not the group ID, so this isn't sufficient by itself to do the splitting in the higher-level runtimes. For CUDA, there is no "global offset" or similar. We could probably figure out a way to make it work, but it'd be complicated (and probably a little icky).
Is there some reasonable upper bound on a "large problem size", or should we plan for a full 64-bit range?
I see, thanks for the explanation! As always, from the outside, everything looks easy :) I guess you will need to add a new kernel argument to handle the offset and the like. Sounds icky indeed.
To be honest, I don't know... I guess my hand-wavy answer is "as much as they are used to running on NVIDIA". More than 32-bit, that's for sure. And I think less than or equal to
working. We care less about the |
OK thanks, this is helpful. HW-wise our limit is on the number of work-groups we can launch and the max work-group size (pretty sure other HW is similar). This means that launching a global range equal to |
This sounds like a valid limitation to me! If the user specifies a |
@smaslov-intel @bader do we have an ETA for this issue to be resolved? Thomas/ANL is asking for it. Thanks.
A workaround is coming in #7321
Workaround for the issue described in #4255. Signed-off-by: Sergey V Maslov <sergey.v.maslov@intel.com>
Hi! There have been no updates for at least the last 60 days, though the ticket has assignee(s). @smaslov-intel, could I ask you to take one of the following actions? :)
Thanks! |
@KornevNikita SergeyM is on leave. I suggest the SYCL team take a look to see what the right fix is to address this issue. Thanks.
Describe the bug
Running
deadlocks on CUDA devices or gives
on Intel GPUs when compiled and run via
resp.
Environment: