[SYCL][PI] New device information descriptors: max_global_work_groups and max_work_groups #4064

Michoumichmich · 2021-07-06T21:08:40Z

SYCL currently does not provide a way to query a device for the maximum number of work-groups that can be submitted in each dimension, or for the number of work-groups that can be submitted across all dimensions.
This query does not exist in OpenCL, but now that GPUs are offered through the PI, it becomes more relevant, as different vendors and devices have their own limits.

This commit implements the feature for the host device, Level Zero, OpenCL, ROCm and CUDA. If the query is not applicable, the maximum acceptable value is returned.

Descriptors added:

ext_oneapi_max_global_work_groups
ext_oneapi_max_work_groups_1d
ext_oneapi_max_work_groups_2d
ext_oneapi_max_work_groups_3d

Feature test macro:

SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY defined to 1

Signed-off-by: Michel Migdal michel.migdal@codeplay.com

mkinsner · 2021-07-07T13:14:59Z

Hello. Thanks for adding this! A few questions/comments:

1. max_global_work_sizes implies to me the maximum number of work-items globally (number of work-items per work-group times number of work-groups in each dimension), not the maximum number of work-groups. Should the query name be something more like max_number_work_groups?
2. Can the actual number of work-groups allowed to be enqueued at runtime be smaller than the value returned by this query if the size of each work-group is large? Or do you expect this maximum number of work-groups to always be possible to enqueue, regardless of the number of work-items within each work-group? Said another way, should the value returned here ever depend on a work-group size for any of the backends?
3. SYCL 2020 now defines namespace and naming requirements for extensions, which I think should be followed here. Specifically, I think the query should be in the sycl::ext::oneapi namespace until it eventually gets folded into the core SYCL spec. The extension should also define a feature test macro, something along the lines of SYCL_EXT_ONEAPI_MAX_GLOBAL_WORK_SIZE.

If you agree with changes falling out from the above but want me to propose the wording for anything, I'm happy to help.

For reference, the wording of the CL_DEVICE_MAX_WORK_GROUP_SIZE query in OpenCL might be useful. It provides different information, but it already factors in sensitivity to other kernel details that might make the maximum not possible to enqueue with.

Michoumichmich · 2021-07-07T13:48:26Z

Hello, thanks for your comments.

1. Indeed, max_number_work_groups seems like a better name. Still, I believe SYCL lacks a name for the space in which work-groups are created (which is no longer unbounded).
2. Good question. This query does not have access to your work-group size, and I haven't (yet) seen information about the work-group size influencing the number of work-groups you can submit. Of course you need enough memory, but I guess that's not the point. Can you enqueue a kernel with all the work-group counts maxed out? No, not in DPC++ I guess. When enqueuing kernels, DPC++ checks that the product of all the dimensions is smaller than std::numeric_limits<int>::max. So the only reasonable thing I found to return from the query is that limit for each dimension. Say someone uses only one dimension: they should get the maximum they can submit, which is min(device_max, std::numeric_limits<int>::max). For an OpenCL device, using the max count in every dimension will certainly overflow, so you can't. Maybe we could add a query to get the maximum size across (product) all the dimensions? That would certainly be better.

   At least if that value were accessible in a header for info queries, it would prevent future errors.
3. Okay, I will move that to the extension namespace. I'm currently opening an issue/discussion on the sycl-spec to get more feedback.
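The per-dimension clamp and the product check described above can be sketched in plain C++. This is only an illustration under assumptions: `clamp_per_dimension` and `total_fits_in_int` are hypothetical helper names, not DPC++ functions, and the limit is taken to be std::numeric_limits<int>::max as stated in the comment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a DPC++ API): the value the comment proposes to
// report for a single dimension, min(device_max, INT_MAX).
inline std::uint64_t clamp_per_dimension(std::uint64_t device_max) {
  const std::uint64_t int_max =
      static_cast<std::uint64_t>(std::numeric_limits<int>::max());
  return device_max < int_max ? device_max : int_max;
}

// Hypothetical helper: returns true when the product of all work-group
// counts does not exceed INT_MAX. The comparison uses a division so the
// running product itself can never overflow.
template <std::size_t Dims>
bool total_fits_in_int(const std::array<std::uint64_t, Dims>& counts) {
  const std::uint64_t int_max =
      static_cast<std::uint64_t>(std::numeric_limits<int>::max());
  std::uint64_t product = 1;
  for (std::uint64_t c : counts) {
    if (c == 0) return true;                  // empty range, nothing to check
    if (product > int_max / c) return false;  // product * c would exceed INT_MAX
    product *= c;
  }
  return true;
}
```

With these helpers, a single dimension at the clamped maximum passes the check, while the clamped maximum in every dimension at once does not, which is the situation the comment describes for OpenCL devices.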

Do you think there could be a way to specialise max_number_work_groups so you get max_number_work_groups<1/2/3>? It turns out that with the CUDA backend (at least) there is an ordering trick which changes the order of the dimensions. So we could get:

id<1> gpu_sizes = gpu.get_info<info::device::max_number_work_groups?>();
range<1>(gpu_sizes[0]); // gpu_sizes[0] = 2**31 - 1

And

id<2> gpu_sizes = gpu.get_info<info::device::max_number_work_groups?>();
range<2>(gpu_sizes[0], gpu_sizes[1]); // gpu_sizes[0] = 65535 & gpu_sizes[1] = 2**31 - 1
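The dimension reordering hinted at above can be sketched in plain C++. This is a minimal illustration under assumptions, not the backend's actual code: `to_sycl_order` is a hypothetical helper, and it assumes that the last SYCL dimension maps to the native x dimension, which is what the two examples above suggest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Native grid limits in the backend's own (x, y, z) order.
using NativeLimits = std::array<std::uint64_t, 3>;

// Hypothetical helper: per-dimension limits for a Dims-dimensional range in
// SYCL order, assuming the last SYCL dimension maps to native x, the one
// before it to native y, and so on.
template <std::size_t Dims>
std::array<std::uint64_t, Dims> to_sycl_order(const NativeLimits& native) {
  static_assert(Dims >= 1 && Dims <= 3, "ranges have 1 to 3 dimensions");
  std::array<std::uint64_t, Dims> result{};
  for (std::size_t i = 0; i < Dims; ++i) {
    result[Dims - 1 - i] = native[i];  // reverse the order of the used dims
  }
  return result;
}
```

For native limits of (2**31 - 1, 65535, 65535), this yields {2**31 - 1} for one dimension and {65535, 2**31 - 1} for two, matching the values in the comments above.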

mkinsner · 2021-07-07T15:42:50Z

There are already some queries that are tied to a specific kernel. Backends seem to have kernel-independent queries for max number of work-groups, but to make sure that you're aware of the possibility, check Table 133 at https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_kernel_information_descriptors. These are queries from the kernel class, so can factor in things like memory utilization.

> Maybe we could add a query to get the maximum size across (product) all the dimensions?

SYCL already has something like this for the number of work-items in a work-group. For individual dimensions one can query info::device::max_work_item_sizes<3>, and to get a scalarized limit one can instead query info::device::max_work_group_size. Something similar could be done here if useful.

> Do you think there could be a way to specialize max_number_work_groups so you get max_number_work_groups <1/2/3>

There has been talk about this before, but I don't think it exists in any spec yet. This capability probably should exist, though. @Pennycook @gmlueck do either of you know of any existing precedent for this? I suspect that we'd want to pass the dimensionality information as part of the param type in template <typename param> typename param::return_type get_info() const;, and that would then impact the query return type.

Michoumichmich · 2021-07-07T15:56:58Z

Yes, that's exactly why I was proposing that. Maybe something like max_global_number_work_groups in addition to max_number_work_groups?

keryell · 2021-07-07T19:09:14Z

By looking at the current spec I realize that there is some lack of uniformity.
There is no use of "number", only 1 case with a "num" in info::device::max_num_sub_groups which probably should be renamed info::device::max_sub_groups...
So what about ext_oneapi_max_global_work_groups and ext_oneapi_max_work_groups instead?

Michoumichmich · 2021-07-07T19:26:31Z

It would make the naming shorter and more consistent, for sure. But the name then becomes (almost) a substring of max_work_group_sizes. Can't that lead to errors? Especially since max_work_group_sizes is (I assume) used more than max_work_groups.
What about nd_range? We would have max_nd_range_sizes and max_global_nd_range[_size].

gmlueck · 2021-07-14T13:20:03Z

> > Do you think there could be a way to specialize max_number_work_groups so you get max_number_work_groups <1/2/3>
>
> There has been talk about this before, but I don't think it exists in any spec yet. This capability probably should exist, though. @Pennycook @gmlueck do either of you know of any existing precedent for this? I suspect that we'd want to pass the dimensionality information as part of the param type in template <typename param> typename param::return_type get_info() const;, and that would then impact the query return type.

Maybe I don't understand the question, but it seems like info::device::max_work_item_sizes is an example. There are three specializations, which return an id<1>, id<2>, or an id<3>:

info::device::max_work_item_sizes<1>
info::device::max_work_item_sizes<2>
info::device::max_work_item_sizes<3>

See: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors

Michoumichmich · 2021-07-14T14:50:14Z

Hello,
Yes, but these are not implemented in DPC++, as it uses enums. If we moved to (templated) structs, they could be implemented as in the spec, but that change would break the ABI.

gmlueck · 2021-07-14T20:08:56Z

Agreed that following the info::device::max_work_item_sizes<> model would need to wait for the rest of the DPC++ info descriptors to be migrated to the SYCL 2020 info descriptor mechanism. If you need to add this extension sooner, I suppose you could just add three enums:

ext_oneapi_max_number_work_groups_1d,
ext_oneapi_max_number_work_groups_2d,
ext_oneapi_max_number_work_groups_3d

It's a little unfortunate, though, to add a temporary extension like this that will end up changing once DPC++ implements the SYCL 2020 info descriptors.
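To make the enum-based suggestion concrete, here is a minimal stand-alone sketch of how three enumerators could each map to a differently sized return type. Everything in the `sketch` namespace is a hypothetical stand-in (including `id`, `device`, `device_info` and `param_traits`); it does not reproduce the actual DPC++ headers, and the query returns zeros rather than real device limits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

// Stand-in for sycl::id<Dims>; holds one value per dimension.
template <int Dims> struct id {
  std::array<std::size_t, static_cast<std::size_t>(Dims)> values{};
};

// One enumerator per dimensionality, as suggested in the comment above.
enum class device_info {
  ext_oneapi_max_number_work_groups_1d,
  ext_oneapi_max_number_work_groups_2d,
  ext_oneapi_max_number_work_groups_3d
};

// Maps each enumerator to the type its query returns.
template <device_info Param> struct param_traits;
template <>
struct param_traits<device_info::ext_oneapi_max_number_work_groups_1d> {
  using return_type = id<1>;
};
template <>
struct param_traits<device_info::ext_oneapi_max_number_work_groups_2d> {
  using return_type = id<2>;
};
template <>
struct param_traits<device_info::ext_oneapi_max_number_work_groups_3d> {
  using return_type = id<3>;
};

// Stand-in device whose query is parameterized by an enumerator value.
struct device {
  template <device_info Param>
  typename param_traits<Param>::return_type get_info() const {
    return typename param_traits<Param>::return_type{};  // placeholder zeros
  }
};

}  // namespace sketch
```

In this shape the dimensionality is encoded in the enumerator name, so moving to a templated descriptor later would mean replacing three names with one parameterized name.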

steffenlarsen · 2021-09-07T10:04:25Z

Good stuff! It is unfortunate that it can't use the template variants of info descriptors yet.

Maybe it would be worth considering having only the 3D variant of info::device::ext_oneapi_max_work_groups for now, a bit like the current version of max_work_item_sizes. It is less user friendly due to the flipping, but the extension could have a note about how 3D maps to 2D and 1D.

When the info descriptors are made SYCL 2020 compliant in the future we can make a template variant of info::device::ext_oneapi_max_work_groups that defaults to 3D. This means existing user-code won't have to adapt immediately as they would still get the 3D variant.

gmlueck · 2021-09-07T14:41:46Z

Why is this better than adding the 3D, 2D, and 1D variations now, and then adding the template version later when the DPC++ info descriptors are made conformant with SYCL 2020? I was thinking that we can deprecate the 3D, 2D, and 1D variations once we have the templated one, and then eventually remove them. Doing it this way avoids the need to document (or support) the 3D version as a way to get info about 2D or 1D loops.

steffenlarsen · 2021-09-07T14:50:18Z

"Better" is such a strong word. W.r.t. ABI it isn't better, but it comes with the benefit of users not having to change their code once the descriptor is changed. Say a user wants to use the 2D variant: they can write their own converter from 3D right now. When templated descriptors are introduced, info::device::ext_oneapi_max_work_groups would be changed to something like:

 template<int dimensions = 3> struct ext_oneapi_max_work_groups;

This means that any code using info::device::ext_oneapi_max_work_groups wouldn't be using a deprecated descriptor, but would still get the 3D version (because 3 is the default dimensionality) and their conversion would still be valid albeit outdated. It wouldn't warn the user that features they want have been added, but it means less deprecated features in the inevitable future.
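A minimal stand-alone sketch of such a templated descriptor with a default dimensionality is shown below. The types in the `sketch` namespace are hypothetical stand-ins, not the DPC++ implementation, and the query returns zeros rather than real limits. Note that in standard C++ the default argument is selected by writing empty angle brackets (`ext_oneapi_max_work_groups<>`) when the descriptor is used as a type argument.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

// Stand-in for sycl::id<Dims>; holds one value per dimension.
template <int Dims> struct id {
  std::array<std::size_t, static_cast<std::size_t>(Dims)> values{};
};

// Descriptor templated on dimensionality, defaulting to 3 as proposed above.
template <int Dimensions = 3> struct ext_oneapi_max_work_groups {
  using return_type = id<Dimensions>;
};

// Stand-in device using the SYCL 2020 style get_info signature quoted
// earlier in the thread.
struct device {
  template <typename Param>
  typename Param::return_type get_info() const {
    return typename Param::return_type{};  // placeholder zeros
  }
};

}  // namespace sketch
```

Removing the `= 3` default would make `ext_oneapi_max_work_groups<>` ill-formed, so every caller would have to state the dimensionality explicitly.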

gmlueck · 2021-09-07T15:35:02Z

I agree that approach allows some user code to continue working even after we move to the template version of the info descriptors. However, I see two downsides:

1. I think it would be better for the long-term API if the template parameter did not have a default. This will cause the compiler to give an error if a size isn't specified, which will force users to think about the dimensionality of their loop and use the appropriate query. This is the strategy we have for the max_work_item_sizes query that's in the spec now.
2. We need to retain documented support for applying the 3D query to 2D and 1D loops into the indefinite future. That seems a bit ugly, and is also inconsistent with the max_work_item_sizes query.

Since this is an experimental API, I thought it would not be problematic if we eventually deprecate and remove the non-templated versions of the queries. (Our definition of "experimental API" means we can change the API even without going through a deprecation process.)

I guess another option is to proceed as you propose, but document the default template parameter as deprecated, and also deprecate the language about using the 3D query for 2D and 1D loops. We would then remove those from the spec at some point after deprecation.

steffenlarsen · 2021-09-07T16:21:21Z

I completely agree, it definitely comes with its own set of drawbacks. I am not sure which of the solutions I think is the best, but I just wanted to throw the spanner in the works before a final conclusion was made. I apologize that it was a bit late in the process.

Michoumichmich · 2021-09-07T16:37:20Z

In all cases the API will be broken, but if we go ahead with the 1/2/3d version, at least the API/query semantics will remain unchanged. Changing the code later will be easier. If we go with one query version, programmers will have to do two index flips: today, and when the ABI freeze is lifted.

steffenlarsen · 2021-09-07T17:48:36Z

> Changing the code later will be easier.

I don't think it will be difficult either way. In the hard-coded dimensionality option you would have two descriptors doing the same job however, until the deprecated version is removed.

> If we go with one query version, programmers will have to do two index flips: today, and when the ABI freeze is lifted.

Should hopefully only be at most one flip. If you have to flip from 3D, then that logic can just be scrapped when moving to <3D. Granted it might be confusing to the user when that happens, but we'll have the same problem with max_work_item_sizes (not that it is an argument for it.) That said, they don't have to adapt while the 3D default stays in place, and when it is removed they are free to use the 3D version and their own flipping logic.

If the consensus is that the _(1|2|3)d variants are the most advantageous, I am not opposed to it, but I think both sides have their benefits and drawbacks.

bader · 2021-10-14T17:33:35Z

Folks, what is the status here? I see that #4563 is pending on these changes, so I'd like to make sure it moves forward.

It looks like we need to resolve merge conflicts at least.

Michoumichmich · 2021-10-14T17:36:45Z

Hello,
I was resolving the conflicts, but I stopped given that this PR wasn't getting merged. If you want, I can resolve them.

bader · 2021-10-14T18:02:44Z

There are quite a lot of comments here already and I'm trying to understand what is the blocker here.
If you are just waiting for this to be merged, I assume we need reviewers to approve this change.
Please, resolve merge conflicts and I'll ping reviewers.

Michoumichmich · 2021-10-14T18:08:04Z

Done!

bader · 2021-10-14T18:14:05Z

@againull, could you take a look, please?

bader

Approving to trigger CI system.

bader · 2021-10-15T14:20:21Z

@Michoumichmich, it looks like we need to update tests checking ABI consistency.

Michoumichmich · 2021-10-15T14:21:32Z

Sure, I will do that! I wasn't sure whether I had the "right to" because of the ABI freeze

bader · 2021-10-15T14:28:46Z

https://github.com/intel/llvm/blob/sycl/CONTRIBUTING.md#development states that "breaking changes are not allowed".

Note (October, 2020): DPC++ runtime and compiler ABI is currently in frozen state. This means that no ABI-breaking changes will be accepted by default. Project maintainers may still approve breaking changes in some cases. Please, see ABI Policy Guide for more information.

The log says that adding new APIs does not break ABI.

There are new symbols in the new library. It is a non-breaking change. Refer to sycl/doc/ABIPolicyGuide.md for further instructions.

As I understand it, the test validates that all symbols are covered, in order to check for "ABI breaking changes".
I think that to fix the test, we need to add the missing symbols to it.
Adding @alexbatashev to confirm.

bader

@againull, ping.

Michoumichmich requested review from smaslov-intel and a team as code owners July 6, 2021 21:08

Merge branch 'intel:sycl' into max_global_work_sizes

f8856e8

Michoumichmich added 3 commits July 7, 2021 18:10

Renaming of the device query, requested fixes (namespace and test macro) and added bound check. The bound check is probably not useful yet for CUDA and ROCm.

942d689

Fixed ROCm

8bedae3

Added ext_oneapi_max_global_number_work_groups and refactoring

4d91d94

Michoumichmich added 3 commits July 8, 2021 13:32

Merge branch 'intel:sycl' into max_global_work_sizes

e46454d

Merge branch 'intel:sycl' into max_global_work_sizes

b9bb42b

Merge branch 'intel:sycl' into max_global_work_sizes

685b71d

gmlueck reviewed Jul 14, 2021

sycl/doc/extensions/DeviceInfoWorkSizes/README.md

sycl/include/CL/sycl/feature_test.hpp

Michoumichmich added 2 commits July 16, 2021 19:30

Merge branch 'sycl' of https://github.com/intel/llvm into max_global_work_sizes

97ebfc7

Correcting the doc and feature-test macros as suggested in remarks.

367ea28

Michoumichmich requested a review from gmlueck July 18, 2021 02:19

Fix to documentation

d6e8171

gmlueck reviewed Jul 19, 2021

sycl/doc/extensions/MaxWorkGroupQueries/max_work_group_query.md

Michoumichmich added 2 commits July 19, 2021 16:07

Adding three descriptors for 1d, 2d and 3d calls

80e38a5

Fixing ordering in enum

f62d7e3

Michoumichmich changed the title ~~[SYCL][PI] New device information descriptor: max_global_work_sizes~~ [SYCL][PI] New device information descriptors: max_global_number_work_groups and max_number_work_groups Jul 19, 2021

Merge branch 'max_global_work_sizes' of github.com:Michoumichmich/llvm into max_global_work_sizes

a149879

Merge branch 'sycl' of https://github.com/intel/llvm into max_global_work_sizes

dc35c7b

Michoumichmich mentioned this pull request Sep 14, 2021

[SYCL][CUDA] report errors #2793

Closed

steffenlarsen mentioned this pull request Sep 14, 2021

[SYCL][CUDA] Improve error message for exceeding CUDA grid limits #4563

Merged

Michoumichmich added 2 commits October 14, 2021 20:04

Merge branch 'sycl' of https://github.com/intel/llvm into intel-sycl

fb81486

Merge branch 'intel-sycl' into max_global_work_sizes

50919bb

Michoumichmich requested a review from againull as a code owner October 14, 2021 18:07

Merge pull request #17 from intel/sycl

3795d0f

Pulldown

bader previously approved these changes Oct 14, 2021

Adding new symbols in ABI dumps

90097ea

Michoumichmich dismissed bader’s stale review via 90097ea October 15, 2021 14:39

bader approved these changes Oct 15, 2021

againull approved these changes Oct 15, 2021

bader merged commit 2fdf940 into intel:sycl Oct 18, 2021

Michoumichmich deleted the max_global_work_sizes branch October 18, 2021 15:17
