Add runtime function to query the number of devices and make device ID consistent with KOKKOS_VISIBLE_DEVICES
#6713
Looks good. How did you test that the mask for the visible devices works correctly? We don't have a machine in the CI with multiple visible GPUs.
As far as I understand, it was resolved in kokkos#5492.
We have a test for the infamous kokkos/core/unit_test/TestParseCmdLineArgsAndEnvVars.cpp (lines 392 to 439 in d560c47),
and I checked by hand on a system back when we introduced `KOKKOS_VISIBLE_DEVICES`. That is not bulletproof, obviously, but it is something at least. Do you have suggestions on how to improve?
@@ -1235,12 +1235,10 @@ if (NOT KOKKOS_HAS_TRILINOS)
INPUT TestDeviceAndThreads.py
${USE_SOURCE_PERMISSIONS_WHEN_SUPPORTED}
)
if(NOT Kokkos_ENABLE_OPENMPTARGET) # FIXME_OPENMPTARGET does not select the right device
Oversight in #5492
Looks like OpenMPTarget still doesn't do the right thing.
@rgayatri23 please fix.
Withdrawing the commit in the meantime.
No, I was just wondering how you tested it. Maybe it's something Sandia could test in their nightly.
Ignoring the CI that is headed straight to timeout following the massive retrigger of testing all PRs.
`Kokkos::num_devices()` is intended as a replacement for `{Cuda,HIP}::detect_device_count()`, deprecated in #6710. I propose to make it legal to call it before `Kokkos::initialize` so it may be leveraged to select what GPU to use.

This PR also fixes a defect in `Kokkos::device_id()` when the `KOKKOS_VISIBLE_DEVICES` environment variable is set. The Kokkos runtime was returning the device id according to the underlying backend runtime but not taking into consideration whether the devices were masked out. This was an oversight when we introduced it in #5855 (first released in 4.1).
Assume the current system has 4 GPUs. The table covers the value returned by `device_id()` for each of the following settings:

`device_id=1`
`KOKKOS_VISIBLE_DEVICES=0`
`KOKKOS_VISIBLE_DEVICES=3`
`KOKKOS_VISIBLE_DEVICES=1,0`
`device_id=1` together with `KOKKOS_VISIBLE_DEVICES=1,0`
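
To make the masking concrete, here is a minimal standalone sketch of one way to think about it. It assumes that `KOKKOS_VISIBLE_DEVICES` is an ordered, comma-separated list of backend device ids and that the Kokkos-level id is the position in that list. That is my reading of the description above, not something checked against the implementation. The helper names (`parse_visible_devices`, `to_backend_id`, `to_visible_id`) are hypothetical and do not exist in Kokkos.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a Kokkos function): parse a comma-separated
// list of backend device ids such as "1,0".
std::vector<int> parse_visible_devices(const std::string& value) {
  std::vector<int> ids;
  std::stringstream stream(value);
  std::string token;
  while (std::getline(stream, token, ',')) {
    if (!token.empty()) ids.push_back(std::stoi(token));
  }
  return ids;
}

// Assumed model: position in the visible list -> backend device id.
// Returns -1 when the position is out of range.
int to_backend_id(const std::vector<int>& visible, int visible_id) {
  if (visible_id < 0 || visible_id >= static_cast<int>(visible.size())) {
    return -1;
  }
  return visible[visible_id];
}

// Assumed model: backend device id -> position in the visible list.
// Returns -1 when the backend device is masked out.
int to_visible_id(const std::vector<int>& visible, int backend_id) {
  const auto it = std::find(visible.begin(), visible.end(), backend_id);
  if (it == visible.end()) return -1;
  return static_cast<int>(it - visible.begin());
}
```

Under this model, with the list `1,0`, position 1 refers to backend device 0, and backend devices 2 and 3 map to no position at all because they are masked out.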
I also went ahead and masked out devices in `{Cuda,HIP}::print_config`.