SYCL: Filter GPU devices #6758

Merged
merged 4 commits into from
Feb 8, 2024

Conversation

masterleinad
Contributor

The discussion in https://kokkosteam.slack.com/archives/C5BGU5NDQ/p1705683106903069 shows that users might accidentally use SYCL backends that we don't actively support. This pull request filters the sycl::devices so that only the sycl::backend::ext_oneapi_* ones are visible if a device architecture is enabled in the configuration.
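
As a rough illustration of the filtering idea (not the code in this pull request), the sketch below models devices as plain structs instead of calling the SYCL runtime. The `Backend` and `DeviceType` enums, the `Device` struct, and `filter_oneapi_gpus` are hypothetical stand-ins. The sketch also ignores the "device architecture is enabled in the configuration" condition and always filters.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-ins for sycl::backend and sycl::device; not the SYCL API.
enum class Backend { opencl, ext_oneapi_level_zero, ext_oneapi_cuda, ext_oneapi_hip };
enum class DeviceType { cpu, gpu };

struct Device {
  Backend backend;
  DeviceType type;
  std::string name;
};

// True for the backends in the ext_oneapi_* family.
inline bool is_ext_oneapi(Backend b) {
  return b == Backend::ext_oneapi_level_zero || b == Backend::ext_oneapi_cuda ||
         b == Backend::ext_oneapi_hip;
}

// Keep only GPU devices that use an ext_oneapi_* backend, preserving order.
inline std::vector<Device> filter_oneapi_gpus(const std::vector<Device>& all) {
  std::vector<Device> visible;
  std::copy_if(all.begin(), all.end(), std::back_inserter(visible),
               [](const Device& d) {
                 return d.type == DeviceType::gpu && is_ext_oneapi(d.backend);
               });
  return visible;
}
```

With a device list like the one `sycl-ls` prints when both OpenCL and Level Zero are available, only the Level Zero GPU entries would remain visible after the filter.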

@masterleinad
Contributor Author

masterleinad commented Jan 29, 2024

@pzehner Would you mind checking whether this works for you?

@pzehner
Contributor

pzehner commented Feb 2, 2024

I tested it successfully on a GPU Max 1550. Everything is good for me.

@masterleinad masterleinad marked this pull request as ready for review February 2, 2024 12:44
@masterleinad masterleinad marked this pull request as draft February 2, 2024 12:45
@masterleinad
Contributor Author

Retest this please.

@masterleinad
Contributor Author

@dalg24 and I discussed this pull request offline.

@dalg24's point of view is:

  • this change messes with the visible devices
  • we should consider what the runtime command line tool gives us when selecting the device to use (sycl-ls)
  • we should print all devices in print_configuration and that should match sycl-ls
  • we should discuss how we want to handle initialization when the user doesn't specify a device id. For other backends, we could take the currently active device and for SYCL possibly the first usable one.

My point of view is:

  • we only care about ext_oneapi_* devices anyway (if we request a GPU device). Only those are maintained by Intel for Aurora (intel/llvm might maintain opencl as well, but we know that, for example, our unit tests do not pass with it).
  • other Kokkos backends can't target multiple low-level backends for the same architecture. Hence, there is no inconsistency.
  • we can still print all devices but again we should only consider ext_oneapi_* ones since only those are feasible.
  • we are already using the first usable device for SYCL if the user doesn't specify a device id.
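
The "first usable device" default mentioned in both lists can be sketched as follows. This is a simplified model, not Kokkos code: `Backend`, `Device`, `is_usable`, and `select_device` are hypothetical names, and the sketch assumes the requested id indexes the unfiltered device list, which is exactly one of the points under discussion.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins; not the SYCL or Kokkos API.
enum class Backend { opencl, ext_oneapi_level_zero, ext_oneapi_cuda, ext_oneapi_hip };

struct Device {
  Backend backend;
  bool is_gpu;
};

// A device counts as usable here if it is a GPU with an ext_oneapi_* backend.
inline bool is_usable(const Device& d) {
  return d.is_gpu && d.backend != Backend::opencl;
}

// Returns an index into `devices`, or std::nullopt if no usable device matches.
inline std::optional<std::size_t> select_device(
    const std::vector<Device>& devices,
    std::optional<std::size_t> requested_id) {
  if (requested_id) {
    // Honor an explicit request only if it names an existing, usable device.
    if (*requested_id < devices.size() && is_usable(devices[*requested_id]))
      return requested_id;
    return std::nullopt;
  }
  // No id given: fall back to the first usable device in enumeration order.
  for (std::size_t i = 0; i < devices.size(); ++i)
    if (is_usable(devices[i])) return i;
  return std::nullopt;
}
```

Returning `std::nullopt` rather than silently picking another device keeps the "nothing usable" case explicit, so the caller decides whether to abort or report an error.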

We agreed that we should check in the backend initialization that we are given an ext_oneapi_* backend anyway. Even with this pull request, execution space instances created from sycl::queues could use an opencl:gpu device.
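
A check of that kind might look roughly like the sketch below. It is only a model under stated assumptions: `Queue` is a hypothetical stand-in for a user-provided sycl::queue that exposes its backend, and `require_oneapi_backend` is an invented name, not a Kokkos function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; not the SYCL or Kokkos API.
enum class Backend { opencl, ext_oneapi_level_zero, ext_oneapi_cuda, ext_oneapi_hip };

struct Queue {
  Backend backend;  // stand-in for the backend a real queue would report
};

inline const char* backend_name(Backend b) {
  switch (b) {
    case Backend::opencl: return "opencl";
    case Backend::ext_oneapi_level_zero: return "ext_oneapi_level_zero";
    case Backend::ext_oneapi_cuda: return "ext_oneapi_cuda";
    case Backend::ext_oneapi_hip: return "ext_oneapi_hip";
  }
  return "unknown";
}

// Throws if the queue does not use one of the ext_oneapi_* backends.
inline void require_oneapi_backend(const Queue& q) {
  const bool ok = q.backend == Backend::ext_oneapi_level_zero ||
                  q.backend == Backend::ext_oneapi_cuda ||
                  q.backend == Backend::ext_oneapi_hip;
  if (!ok)
    throw std::runtime_error(std::string("unsupported SYCL backend: ") +
                             backend_name(q.backend));
}
```

Performing the check at initialization covers the case where device filtering is bypassed because the user constructs the execution space instance from their own queue.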

For reference, on Sunspot/Aurora compute nodes the default for ONEAPI_DEVICE_SELECTOR is level_zero:gpu and sycl-ls shows

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]

unsetting that environment variable gives

[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:cpu:6] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9470C OpenCL 3.0 (Build 0) [2023.16.10.0.09_181109.xmain-eng]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]

so everything discussed here only makes a difference if ONEAPI_DEVICE_SELECTOR isn't already specified.
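
To compare a listing like the ones above against what a program reports, the bracketed `[backend:type:index]` prefix can be split into its parts. The sketch below assumes only the label shape visible in the output above; `DeviceLabel`, `parse_label`, and `is_ext_oneapi_label` are hypothetical helper names, not part of sycl-ls or Kokkos.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Parts of a "[backend:type:index]" label as shown in the listings above.
struct DeviceLabel {
  std::string backend;
  std::string type;
  int index;
};

// Parses the leading label of a line; returns std::nullopt if the line does
// not start with a "[a:b:n]" label.
inline std::optional<DeviceLabel> parse_label(const std::string& line) {
  if (line.empty() || line.front() != '[') return std::nullopt;
  const auto close = line.find(']');
  if (close == std::string::npos) return std::nullopt;
  const std::string inner = line.substr(1, close - 1);
  const auto first = inner.find(':');
  const auto last = inner.rfind(':');
  if (first == std::string::npos || first == last) return std::nullopt;
  DeviceLabel label;
  label.backend = inner.substr(0, first);
  label.type = inner.substr(first + 1, last - first - 1);
  try {
    label.index = std::stoi(inner.substr(last + 1));
  } catch (const std::exception&) {
    return std::nullopt;  // index field is not a number
  }
  return label;
}

// True if the label's backend name starts with "ext_oneapi_".
inline bool is_ext_oneapi_label(const DeviceLabel& label) {
  return label.backend.rfind("ext_oneapi_", 0) == 0;
}
```

Applied to the second listing, the six `opencl:gpu` lines and the `opencl:cpu` line would be classified as not ext_oneapi_*, and the six `ext_oneapi_level_zero:gpu` lines as ext_oneapi_*.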

@dalg24 Please correct this comment if I misrepresented something.

Member

@dalg24 dalg24 left a comment

Would prefer if you applied my suggestions but I am fine with the PR as is if you do not agree with them.

core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)
core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)
@dalg24
Member

dalg24 commented Feb 8, 2024

Ignoring HIP ROCm 5.2 failure

@dalg24 dalg24 merged commit 7ff87a5 into kokkos:develop Feb 8, 2024
30 of 31 checks passed
@aelovikov-intel
Contributor

FYI: I've merged intel/llvm#12719 in the SYCL RT for a different take on this issue.
