Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][PI] New device information descriptors: max_global_work_groups and max_work_groups #4064

Merged
merged 37 commits into from
Oct 18, 2021

Conversation

Michoumichmich
Copy link
Contributor

@Michoumichmich Michoumichmich commented Jul 6, 2021

SYCL currently does not provide a way to query a device to get the maximum number of work groups that can be submitted in each dimension as well as the number of work groups that can be submitted across all the dimensions.
This query does not exist in openCL, but now that GPU are offered through the PI, this query becomes more relevant as different vendors/devices have their own limits.

This commit implements the feature for the host device, level-zero, openCL, ROCm and CUDA. If the query is not applicable, the maximum acceptable value is returned.

Descriptors added:

  • ext_oneapi_max_global_work_groups
  • ext_oneapi_max_work_groups_1d
  • ext_oneapi_max_work_groups_2d
  • ext_oneapi_max_work_groups_3d

Feature test macro:

  • SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY defined to 1

Signed-off-by: Michel Migdal michel.migdal@codeplay.com

SYCL currently does not provide a way to query a device to get the maximum number of work groups that can be submitted in each dimension.
This query does not exist in openCL, but now that GPU are offered through the PI, this query becomes more relevant as different vendors/devices have different limits.

This commit implements the feature for the host device, level-zero, openCL, ROCm and CUDA. If the query is not applicable, the maximum acceptable value is returned.
@Michoumichmich Michoumichmich requested review from smaslov-intel and a team as code owners July 6, 2021 21:08
@mkinsner
Copy link

mkinsner commented Jul 7, 2021

Hello. Thanks for adding this! A few questions/comments:

  1. max_global_work_sizes implies to me the maximum number of work-items globally (num work-items per work-group times number work-groups in each dimension), and not the maximum number of work-groups. Should the query name be something more like max_number_work_groups?
  2. Can the actual number of work-groups allowed to be enqueued at runtime be smaller than the value returned by this query, if the size of each workgroup is large? Or do you expect this maximum number of work-groups to always be possible to enqueue, regardless of the number of work-items within each work-group? Said another way, should the value returned here ever depend on a work-group size for any of the backends?
  3. SYCL 2020 now defines namespace and naming requirements for extensions, that I think should be followed here. Specifically, I think the query should be in the sycl::ext::oneapi namespace, until it eventually gets folded into the core SYCL spec. The extension should also define a feature test macro, something along the lines of SYCL_EXT_ONEAPI_MAX_GLOBAL_WORK_SIZE.

If you agree with changes falling out from the above but want me to propose the wording for anything, I'm happy to help.

For reference, the wording of the CL_DEVICE_MAX_WORK_GROUP_SIZE query in OpenCL might be useful. It provides different information, but it already factors in sensitivity to other kernel details that might make the maximum not possible to enqueue with.

@Michoumichmich
Copy link
Contributor Author

Michoumichmich commented Jul 7, 2021

Hello, thanks for your comments.

  1. Indeed, max_number_work_groups seems like a better name. Still I believe that we miss in SYCL a name for the space where work-groups are created (which is not unbound anymore).
  2. Good question, this query is does not have access to your work-group size. I haven't seen (yet) information about your work-group size influencing the number of work groups you can submit. Well, of course you need to have enough memory, but that's not the point I guess. Can you enqueue a kernel with all the maxed-out work-group counts ? No. Not in DPC++ I guess. When enqueuing kernels, in DPC++ there is a check that bounds the product of all the dimensions to be smaller than std::numeric_limits<int>::max. So the only thing reasonable I found to return from the query is that limit for each dimension. Let's say someone uses only one dimension, he should get the maximum he can submit which is min(device_max, std::numeric_limits<int>::max). For an openCL device, if you use the max count on every dimension it will certainly overflow. So you can't. Maybe we could add a query to get the maximum size across (product) all the dimensions? That would certainly be better.

At least if that value could be accessible in a header for info queries it would prevent future errors.

  1. Okay, I will move that to the extension namespace. I'm currently opening an issue/discussion on the sycl-spec to get more feedback

Do you think there could be a way to specialise max_number_work_groups so you get max_number_work_groups <1/2/3>. Turns out that with the CUDA backend (at least) there is an ordering trick which changes the order of the dimensions. So we could get :

id<1> gpu_sizes = gpu.get_info<info::device::max_number_work_groups?>();
range<1>(gpu_sizes[0]); // gpu_size[0] = 2**31 - 1

And

id<2> gpu_sizes = gpu.get_info<info::device::max_number_work_groups?>();
range<2>(gpu_sizes[0], gpu_sizes[1]);  // gpu_size[0] = 65565 &  gpu_size[1] = 2**31 - 1

@mkinsner
Copy link

mkinsner commented Jul 7, 2021

There are already some queries that are tied to a specific kernel. Backends seem to have kernel-independent queries for max number of work-groups, but to make sure that you're aware of the possibility, check Table 133 at https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_kernel_information_descriptors. These are queries from the kernel class, so can factor in things like memory utilization.

Maybe we could add a query to get the maximum size across (product) all the dimensions?

SYCL already has something like this for the number of work-items in a work-group. For individual dimensions one can query info::device::max_work_item_sizes<3>, and to get a scalarized limit one can instead query info::device::max_work_group_size. Something similar could be done here if useful.

Do you think there could be a way to specialize max_number_work_groups so you get max_number_work_groups <1/2/3>

There has been talk about this before, but I don't think it exists in any spec yet. This capability probably should exist, though. @Pennycook @gmlueck do either of you know of any existing precedent for this? I suspect that we'd want to pass the dimensionality information as part of the param type in template <typename param> typename param::return_type get_info() const;, and that would then impact the query return type.

@Michoumichmich
Copy link
Contributor Author

Maybe we could add a query to get the maximum size across (product) all the dimensions?

SYCL already has something like this for the number of work-items in a work-group. For individual dimensions one can query info::device::max_work_item_sizes<3>, and to get a scalarized limit one can instead query info::device::max_work_group_size. Something similar could be done here if useful.

Yes, that's exactly why I was proposing that, maybe something like max_global_number_work_groups in addition of max_number_work_groups ?

@keryell
Copy link
Contributor

keryell commented Jul 7, 2021

By looking at the current spec I realize that there is some lack of uniformity.
There is no use of "number", only 1 case with a "num" in info::device::max_num_sub_groups which probably should be renamed info::device::max_sub_groups...
So what about ext_oneapi_max_global_work_groups and ext_oneapi_max_work_groups instead?

@Michoumichmich
Copy link
Contributor Author

Michoumichmich commented Jul 7, 2021

By looking at the current spec I realize that there is some lack of uniformity.
There is no use of "number", only 1 case with a "num" in info::device::max_num_sub_groups which probably should be renamed info::device::max_sub_groups...
So what about ext_oneapi_max_global_work_groups and ext_oneapi_max_work_groups instead?
It makes the names shorter and

It would make the naming shorter and more consistent, for sure. But the name then becomes (almost) a substring of max_work_group_sizes ? Can't that lead to errors? Especially since max_work_group_sizes is (I assume) more used than max_work_groups.
What about nd_range ? we would have max_nd_range_sizes and max_global_nd_range[_size].

@gmlueck
Copy link
Contributor

gmlueck commented Jul 14, 2021

Do you think there could be a way to specialize max_number_work_groups so you get max_number_work_groups <1/2/3>

There has been talk about this before, but I don't think it exists in any spec yet. This capability probably should exist, though. @Pennycook @gmlueck do either of you know of any existing precedent for this? I suspect that we'd want to pass the dimensionality information as part of the param type in template <typename param> typename param::return_type get_info() const;, and that would then impact the query return type.

Maybe I don't understand the question, but it seems like info::device::max_work_item_sizes is an example. There are three specializations, which return an id<1>, id<2>, or an id<3>:

  • info::device::max_work_item_sizes<1>
  • info::device::max_work_item_sizes<2>
  • info::device::max_work_item_sizes<3>

See: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors

@Michoumichmich
Copy link
Contributor Author

Do you think there could be a way to specialize max_number_work_groups so you get max_number_work_groups <1/2/3>

There has been talk about this before, but I don't think it exists in any spec yet. This capability probably should exist, though. @Pennycook @gmlueck do either of you know of any existing precedent for this? I suspect that we'd want to pass the dimensionality information as part of the param type in template <typename param> typename param::return_type get_info() const;, and that would then impact the query return type.

Maybe I don't understand the question, but it seems like info::device::max_work_item_sizes is an example. There are three specializations, which return an id<1>, id<2>, or an id<3>:

  • info::device::max_work_item_sizes<1>
  • info::device::max_work_item_sizes<2>
  • info::device::max_work_item_sizes<3>

See: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_device_information_descriptors

Hello,
Yes, but these are not implemented in DPC++ as it is using enums. If we moved to (templated) structs you could implemenent it as in the spec. But that change would break the ABI

@gmlueck
Copy link
Contributor

gmlueck commented Jul 14, 2021

Yes, but these are not implemented in DPC++ as it is using enums. If we moved to (templated) structs you could implemenent it as in the spec. But that change would break the ABI

Agreed, that following the info::device::max_work_item_sizes<> model would need to wait for the rest of the DPC++ info descriptors to be migrated to the SYCL 2020 info descriptor mechanism. If you need to add this extension sooner, I suppose you could just add three enums:

ext_oneapi_max_number_work_groups_1d,
ext_oneapi_max_number_work_groups_2d,
ext_oneapi_max_number_work_groups_3d

It's a little unfortunate, though, to add a temporary extension like this that will end up changing once DPC++ implements the SYCL 2020 info descriptors.

sycl/doc/extensions/DeviceInfoWorkSizes/README.md Outdated Show resolved Hide resolved
sycl/doc/extensions/DeviceInfoWorkSizes/README.md Outdated Show resolved Hide resolved
sycl/include/CL/sycl/feature_test.hpp Outdated Show resolved Hide resolved
@Michoumichmich Michoumichmich changed the title [SYCL][PI] New device information descriptor: max_global_work_sizes [SYCL][PI] New device information descriptors: max_global_number_work_groups and max_number_work_groups Jul 19, 2021
@steffenlarsen
Copy link
Contributor

Good stuff! It is unfortunate that it can't use the template variants of info descriptors yet.

Maybe it would be worth considering having only the 3D variant of info::device::ext_oneapi_max_work_groups for now, a bit like the current version of max_work_item_sizes. It is less user friendly due to the flipping, but the extension could have a note about how 3D maps to 2D and 1D.

When the info descriptors are made SYCL 2020 compliant in the future we can make a template variant of info::device::ext_oneapi_max_work_groups that defaults to 3D. This means existing user-code won't have to adapt immediately as they would still get the 3D variant.

@gmlueck
Copy link
Contributor

gmlueck commented Sep 7, 2021

Maybe it would be worth considering having only the 3D variant of info::device::ext_oneapi_max_work_groups for now, a bit like the current version of max_work_item_sizes. It is less user friendly due to the flipping, but the extension could have a note about how 3D maps to 2D and 1D.

Why is this better than adding the 3D, 2D, and 1D variations now, and then adding the template version later when the DPC++ info descriptors are made conformant with SYCL 2020? I was thinking that we can deprecate the 3D, 2D, and 1D variations once we have the templated one, and then eventually remove them. Doing it this way avoids the need to document (or support) the 3D version as a way to get info about 2D or 1D loops.

@steffenlarsen
Copy link
Contributor

steffenlarsen commented Sep 7, 2021

Why is this better than adding the 3D, 2D, and 1D variations now, and then adding the template version later when the DPC++ info descriptors are made conformant with SYCL 2020? I was thinking that we can deprecate the 3D, 2D, and 1D variations once we have the templated one, and then eventually remove them. Doing it this way avoids the need to document (or support) the 3D version as a way to get info about 2D or 1D loops.

"Better" is such a strong word. W.r.t. ABI it isn't better, but it comes with the benefit of users not having to change their code once the descriptor is changed. Say a user wants to use the 2D variant they can write their own converter from 3D right now. When templated descriptors are introduced, info::device::ext_oneapi_max_work_groups would be changed to something like:

 template<int dimensions = 3> struct ext_oneapi_max_work_groups;

This means that any code using info::device::ext_oneapi_max_work_groups wouldn't be using a deprecated descriptor, but would still get the 3D version (because 3 is the default dimensionality) and their conversion would still be valid albeit outdated. It wouldn't warn the user that features they want have been added, but it means less deprecated features in the inevitable future.

@gmlueck
Copy link
Contributor

gmlueck commented Sep 7, 2021

I agree that approach allows some user code to continue working even after we move to the template version of the info descriptors. However, I see two downsides:

  • I think it would be better for the long-term API if the template parameter did not have a default. This will cause the compiler to give an error if a size isn't specified, which will force users to think about the dimensionality of their loop and use the appropriate query. This is the strategy we have for the max_work_item_sizes query that's in the spec now.

  • We need to retain documented support for applying the 3D query to 2D and 1D loops into the indefinite future. That seems a bit ugly, and is also inconsistent with the max_work_item_sizes query.

Since this is an experimental API, I thought it would not be problematic if we eventually deprecate and remove the non-templated versions of the queries. (Our definition of "experimental API" means we can change the API even without going through a deprecation process.)

I guess another option is to proceed as you propose, but document the default template parameter as deprecated, and also deprecate the language about using the 3D query for 2D and 1D loops. We would then remove those from the spec at some point after deprecation.

@steffenlarsen
Copy link
Contributor

I completely agree, it definitely comes with its own set of drawbacks. I am not sure which of the solutions I think is the best, but I just wanted to throw the spanner in the works before a final conclusion was made. I apologize that it was a bit late in the process.

@Michoumichmich
Copy link
Contributor Author

I all the cases the API will be broken, but if we go ahead with the 1/2/3d version, at least the API/query semantics will remain unchanged. Changing the code later will be easier. If we go with one query version, programmers will have to do two index flips: today, and when the ABI freeze is lifted.

@steffenlarsen
Copy link
Contributor

Changing the code later will be easier.

I don't think it will be difficult either way. In the hard-coded dimensionality option you would have two descriptors doing the same job however, until the deprecated version is removed.

If we go with one query version, programmers will have to do two index flips: today, and when the ABI freeze is lifted.

Should hopefully only be at most one flip. If you have to flip from 3D, then that logic can just be scrapped when moving to <3D. Granted it might be confusing to the user when that happens, but we'll have the same problem with max_work_item_sizes (not that it is an argument for it.) That said, they don't have to adapt while the 3D default stays in place, and when it is removed they are free to use the 3D version and their own flipping logic.

If consensus is that the _(1|2|3)d variants is the most advantageous I am not opposed to it, but I think both sides have their benefits and drawbacks.

@bader
Copy link
Contributor

bader commented Oct 14, 2021

Folks, what is the status here? I see that #4563 is pending on these changes, so I'd like to make sure it moves forward.

It looks like we need to resolve merge conflicts at least.

@Michoumichmich
Copy link
Contributor Author

Folks, what is the status here? I see that #4563 is pending on these changes, so I'd like to make sure it moves forward.

It looks like we need to resolve merge conflicts at least.

Hello,
I was resolving the conflicts, but I stopped given that this PR wasn't getting merged. If you want to, I can solve them

@bader
Copy link
Contributor

bader commented Oct 14, 2021

There are quite a lot of comments here already and I'm trying to understand what is the blocker here.
If you just wait this to be merged, I assume we need reviewers to approve this change.
Please, resolve merge conflicts and I'll ping reviewers.

@Michoumichmich
Copy link
Contributor Author

There are quite a lot of comments here already and I'm trying to understand what is the blocker here.
If you just wait this to be merged, I assume we need reviewers to approve this change.
Please, resolve merge conflicts and I'll ping reviewers.

Done!

@bader
Copy link
Contributor

bader commented Oct 14, 2021

@againull, could you take a look, please?

bader
bader previously approved these changes Oct 14, 2021
Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to trigger CI system.

@bader
Copy link
Contributor

bader commented Oct 15, 2021

@Michoumichmich, it looks like we need to update tests checking ABI consistency.

@Michoumichmich
Copy link
Contributor Author

Michoumichmich commented Oct 15, 2021

@Michoumichmich, it looks like we need to update tests checking ABI consistency.

Sure, I will do that! I wasn't sure whether I had the "right to" because of the ABI freeze

@bader
Copy link
Contributor

bader commented Oct 15, 2021

@Michoumichmich, it looks like we need to update tests checking ABI consistency.

Sure, I will do that! I wasn't sure whether I had the "right to" because of the ABI freeze

https://github.com/intel/llvm/blob/sycl/CONTRIBUTING.md#development states that "breaking changes are not allowed".

Note (October, 2020): DPC++ runtime and compiler ABI is currently in frozen state. This means that no ABI-breaking changes will be accepted by default. Project maintainers may still approve breaking changes in some cases. Please, see ABI Policy Guide for more information.

The log says that adding new APIs does not break ABI.

There are new symbols in the new library. It is a non-breaking change. Refer to sycl/doc/ABIPolicyGuide.md for further instructions.

According to my understanding the test validates that all symbols are covered by the test to check for "ABI breaking changes".
I think to fix the test, we need add missing symbols to this test.
Adding @alexbatashev to confirm.

Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@againull, ping.

@bader bader merged commit 2fdf940 into intel:sycl Oct 18, 2021
@Michoumichmich Michoumichmich deleted the max_global_work_sizes branch October 18, 2021 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants