SYCL: Filter GPU devices #6758

masterleinad · 2024-01-29T20:34:34Z

The discussion in https://kokkosteam.slack.com/archives/C5BGU5NDQ/p1705683106903069 shows that users might accidentally use SYCL backends that we don't actively support. This pull request filters the list of SYCL devices so that only those with a sycl::backend::ext_oneapi_* backend are visible when a device architecture is enabled in the configuration.
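As a rough illustration of the filtering described above (not the code in this pull request), the sketch below models the idea in plain C++ so that it compiles without a SYCL toolchain. `Backend`, `DeviceType`, `Device`, `is_ext_oneapi` and `filter_devices` are hypothetical stand-ins for sycl::backend, sycl::info::device_type, sycl::device and the actual helper. Restricting the result to GPU devices is an assumption based on the PR title.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Stand-in types, NOT the SYCL API. In real code these roles are played by
// sycl::backend, sycl::info::device_type and sycl::device.
enum class Backend { opencl, ext_oneapi_level_zero, ext_oneapi_cuda, ext_oneapi_hip };
enum class DeviceType { cpu, gpu };

struct Device {
  Backend backend;
  DeviceType type;
  std::string name;
};

// True for the backends this discussion refers to as ext_oneapi_*.
bool is_ext_oneapi(Backend b) {
  return b == Backend::ext_oneapi_level_zero || b == Backend::ext_oneapi_cuda ||
         b == Backend::ext_oneapi_hip;
}

// Hypothetical helper: if a GPU architecture is enabled, keep only GPU
// devices with an ext_oneapi_* backend; otherwise return the list unchanged.
std::vector<Device> filter_devices(const std::vector<Device>& all,
                                   bool gpu_arch_enabled) {
  if (!gpu_arch_enabled) return all;
  std::vector<Device> kept;
  std::copy_if(all.begin(), all.end(), std::back_inserter(kept),
               [](const Device& d) {
                 return d.type == DeviceType::gpu && is_ext_oneapi(d.backend);
               });
  return kept;
}
```

With a mixed list of opencl and ext_oneapi_level_zero entries, only the Level Zero GPU entries would remain once a GPU architecture is enabled, which keeps device IDs from pointing at an unsupported backend.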

core/src/SYCL/Kokkos_SYCL.hpp

masterleinad · 2024-01-29T21:04:16Z

@pzehner Would you mind checking whether this works for you?

pzehner · 2024-02-02T11:16:21Z

I tested it successfully on a GPU Max 1550. Everything is good for me.

masterleinad · 2024-02-02T13:14:12Z

Retest this please.

masterleinad · 2024-02-02T18:12:28Z

@dalg24 and I discussed this pull request offline.

@dalg24's point of view is:

- this change messes with the visible devices
- we should consider what the runtime command line tool gives us when selecting the device to use (sycl-ls)
- we should print all devices in print_configuration and that should match sycl-ls
- we should discuss how we want to handle initialization when the user doesn't specify a device id. For other backends, we could take the currently active device and for SYCL possibly the first usable one.

My point of view is:

- we only care about ext_oneapi_* devices anyway (if we request a GPU device). Only those are maintained by Intel for Aurora (intel/llvm might maintain opencl as well, but we know, for example, that our unit tests do not pass with it).
- other Kokkos backends can't target multiple low-level backends for the same architecture. Hence, there is no inconsistency.
- we can still print all devices, but again we should only consider ext_oneapi_* ones since only those are feasible.
- we are already using the first usable device for SYCL if the user doesn't specify a device id.

We agreed that we should check in the backend initialization that we are given an ext_oneapi_* backend anyway. Even with this pull request, execution space instances created from sycl::queues could use an opencl:gpu device.

For reference, on Sunspot/Aurora compute nodes the default for ONEAPI_DEVICE_SELECTOR is level_zero:gpu and sycl-ls shows

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]

unsetting that environment variable gives

[opencl:gpu:0] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.17.26241.22]
[opencl:cpu:6] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9470C OpenCL 3.0 (Build 0) [2023.16.10.0.09_181109.xmain-eng]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26241]

so everything discussed here only makes a difference if ONEAPI_DEVICE_SELECTOR isn't already specified.
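To make the comparison between the two sycl-ls listings concrete, here is a small sketch that parses the leading `[backend:type:index]` tag of a sycl-ls line and reports whether it names an ext_oneapi_* GPU. The names `SyclLsEntry`, `parse_sycl_ls_line` and `is_supported_gpu` are hypothetical and not part of Kokkos or SYCL. The tag layout is assumed only from the output shown above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// One "[backend:type:index]" tag from a line of sycl-ls output.
struct SyclLsEntry {
  std::string backend;  // e.g. "ext_oneapi_level_zero" or "opencl"
  std::string type;     // e.g. "gpu" or "cpu"
  int index;            // device index as printed by sycl-ls
};

// Hypothetical helper: parse the leading tag of a sycl-ls line.
// Returns std::nullopt if the line does not start with a well-formed tag.
std::optional<SyclLsEntry> parse_sycl_ls_line(const std::string& line) {
  if (line.empty() || line.front() != '[') return std::nullopt;
  const std::size_t close = line.find(']');
  if (close == std::string::npos) return std::nullopt;

  const std::string tag = line.substr(1, close - 1);
  const std::size_t first = tag.find(':');
  if (first == std::string::npos) return std::nullopt;
  const std::size_t second = tag.find(':', first + 1);
  if (second == std::string::npos) return std::nullopt;

  // Accept only a short, purely numeric index so std::stoi cannot throw.
  const std::string digits = tag.substr(second + 1);
  if (digits.empty() || digits.size() > 6) return std::nullopt;
  for (unsigned char c : digits) {
    if (!std::isdigit(c)) return std::nullopt;
  }

  return SyclLsEntry{tag.substr(0, first),
                     tag.substr(first + 1, second - first - 1),
                     std::stoi(digits)};
}

// True if the entry is a GPU exposed through an ext_oneapi_* backend.
bool is_supported_gpu(const SyclLsEntry& e) {
  return e.type == "gpu" && e.backend.rfind("ext_oneapi_", 0) == 0;
}
```

Applied to the second listing, the six opencl:gpu entries and the opencl:cpu entry would be rejected and the six ext_oneapi_level_zero:gpu entries accepted, which matches what ONEAPI_DEVICE_SELECTOR=level_zero:gpu shows in the first listing.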

@dalg24 Please correct this comment if I misrepresented something.

dalg24

I would prefer that you apply my suggestions, but I am fine with the PR as is if you do not agree with them.

core/src/SYCL/Kokkos_SYCL.hpp

dalg24 · 2024-02-08T19:29:45Z

Ignoring HIP ROCm 5.2 failure

aelovikov-intel · 2024-02-21T18:49:47Z

FYI: I've merged intel/llvm#12719 in the SYCL RT for a different take on this issue.

SYCL: Filter GPU devices

e5c0e52

masterleinad mentioned this pull request Jan 29, 2024

Add runtime function to query the number of devices and make device ID consistent with KOKKOS_VISIBLE_DEVICES #6713

Merged

dalg24 reviewed Jan 29, 2024

core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)

dalg24 reviewed Jan 29, 2024

core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)

Error out if no GPU was found

2389f6e

masterleinad force-pushed the sycl_filter_gpu_devices branch from 6f0f170 to 2389f6e on January 29, 2024 21:02

masterleinad marked this pull request as ready for review February 2, 2024 12:44

masterleinad marked this pull request as draft February 2, 2024 12:45

masterleinad mentioned this pull request Feb 2, 2024

SYCL: Error out on initialization if the backend is different from ext_oneapi_* #6784

Merged

masterleinad marked this pull request as ready for review February 7, 2024 19:52

masterleinad added the Backend - SYCL label Feb 7, 2024

dalg24 approved these changes Feb 8, 2024

core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)

core/src/SYCL/Kokkos_SYCL.hpp (outdated, resolved)

masterleinad added 2 commits February 8, 2024 10:54

Move definition of get_sycl_devices() to Kokkos_SYCL.cpp

9a91f15

Don't error out when no GPUs are available

96cb412

masterleinad mentioned this pull request Feb 8, 2024

Add support for rocThrust in sort when using HIP #6793

Merged

Rombur approved these changes Feb 8, 2024

masterleinad mentioned this pull request Feb 8, 2024

SYCL: Improve print_configuration #6795

Merged

dalg24 merged commit 7ff87a5 into kokkos:develop Feb 8, 2024
30 of 31 checks passed

masterleinad mentioned this pull request Feb 9, 2024

SYCL: Cleanup device selection #6800

Merged
