Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add runtime function to query the number of devices and make device ID consistent with KOKKOS_VISIBLE_DEVICES #6713

Merged
merged 11 commits into from
Jan 29, 2024

Conversation

dalg24
Copy link
Member

@dalg24 dalg24 commented Jan 12, 2024

Kokkos::num_devices() is intended as a replacement for {Cuda,HIP}::detect_device_count() deprecated in #6710
I propose to make it legal to call it before Kokkos::initialize so it may be leveraged to select what GPU to use.

This PR also fixes a defect in Kokkos::device_id() when the KOKKOS_VISIBLE_DEVICES environment variable is set.
The Kokkos runtime was returning the device id according to the underlying backend runtime but not taking into consideration whether the devices were masked out. This was an oversight when we introduced it in #5855 (first released in 4.1)
Assume the current system has 4 GPUs, the table below shows the value returned by the device_id() runtime

initialization settings current this PR
<none> 0 0
device_id=1 1 1
KOKKOS_VISIBLE_DEVICES=0 0 0
KOKKOS_VISIBLE_DEVICES=3 3 0
KOKKOS_VISIBLE_DEVICES=1,0 1 0
device_id=1 KOKKOS_VISIBLE_DEVICES=1,0 0 1

I also went ahead and masked out devices in {Cuda,HIP}::print_config

@dalg24 dalg24 added the Feature Request Create new capability; will potentially require voting label Jan 12, 2024
@dalg24 dalg24 added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Jan 25, 2024
@dalg24 dalg24 changed the title Add runtime function to query the number of devices Add runtime function to query the number of devices and make device ID consistent with KOKKOS_VISIBLE_DEVICES Jan 25, 2024
@dalg24 dalg24 marked this pull request as ready for review January 25, 2024 16:47
Copy link
Member

@Rombur Rombur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. How did you test that the mask for the visible devices works correctly? We don't have a machine in the CI with multiple visible GPUS.

As far as I understand it was resolved in kokkos#5492
@dalg24
Copy link
Member Author

dalg24 commented Jan 26, 2024

How did you test that the mask for the visible devices works correctly? We don't have a machine in the CI with multiple visible GPUS.

We have a test for the infamous Impl::get_visible_devices(int device_count) -> std::vector<int>

TEST(defaultdevicetype, visible_devices) {
#define KOKKOS_TEST_VISIBLE_DEVICES(ENV, CNT, DEV) \
do { \
EnvVarsHelper ev{ENV}; \
SKIP_IF_ENVIRONMENT_VARIABLE_ALREADY_SET(ev); \
auto computed = Kokkos::Impl::get_visible_devices(CNT); \
std::vector<int> expected = DEV; \
EXPECT_EQ(expected.size(), computed.size()) \
<< ev << "device count: " << CNT; \
auto n = std::min<int>(expected.size(), computed.size()); \
for (int i = 0; i < n; ++i) { \
EXPECT_EQ(expected[i], computed[i]) \
<< "devices differ at index " << i << '\n' \
<< ev << "device count: " << CNT; \
} \
} while (false)
#define DEV(...) \
std::vector<int> { __VA_ARGS__ }
#define ENV(...) std::unordered_map<std::string, std::string>{__VA_ARGS__}
// first test with all environment variables that are involved in determining
// the visible devices so user set var do not mess up the logic below.
// KOKKOS_NUM_DEVICES and KOKKOS_SKIP_DEVICE are deprecated since 3.7 and are
// not taken into account anymore.
KOKKOS_TEST_VISIBLE_DEVICES(
ENV({"KOKKOS_VISIBLE_DEVICES", "2,1"}, {"KOKKOS_NUM_DEVICES", "8"},
{"KOKKOS_SKIP_DEVICE", "1"}),
6, DEV(2, 1));
KOKKOS_TEST_VISIBLE_DEVICES(
ENV({"KOKKOS_VISIBLE_DEVICES", "2,1"}, {"KOKKOS_NUM_DEVICES", "8"}, ), 6,
DEV(2, 1));
KOKKOS_TEST_VISIBLE_DEVICES(ENV({"KOKKOS_NUM_DEVICES", "3"}), 6,
DEV(0, 1, 2, 3, 4, 5));
KOKKOS_TEST_VISIBLE_DEVICES(
ENV({"KOKKOS_NUM_DEVICES", "4"}, {"KOKKOS_SKIP_DEVICE", "1"}, ), 6,
DEV(0, 1, 2, 3, 4, 5));
KOKKOS_TEST_VISIBLE_DEVICES(ENV({"KOKKOS_VISIBLE_DEVICES", "1,3,4"}), 6,
DEV(1, 3, 4));
KOKKOS_TEST_VISIBLE_DEVICES(
ENV({"KOKKOS_VISIBLE_DEVICES", "2,1"}, {"KOKKOS_SKIP_DEVICE", "1"}, ), 6,
DEV(2, 1));
KOKKOS_TEST_VISIBLE_DEVICES(ENV(), 4, DEV(0, 1, 2, 3));
#undef ENV
#undef DEV
#undef KOKKOS_TEST_VISIBLE_DEVICES
}

and I had check by hand on a system back when we introduced KOKKOS_VISIBLE_DEVICES.
That is not bulletproof obviously but it is something at least.
Do you have suggestions on how to improve?

@@ -1235,12 +1235,10 @@ if (NOT KOKKOS_HAS_TRILINOS)
INPUT TestDeviceAndThreads.py
${USE_SOURCE_PERMISSIONS_WHEN_SUPPORTED}
)
if(NOT Kokkos_ENABLE_OPENMPTARGET) # FIXME_OPENMPTARGET does not select the right device
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oversight in #5492

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like OpenMPTarget still doesn't do the right thing.
@rgayatri23 please fix.
Withdrawing the commit in the meantime.

@Rombur
Copy link
Member

Rombur commented Jan 26, 2024

Do you have suggestions on how to improve?

No I was just wondering how you tested it. Maybe it's something Sandia could test in their nightly

core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
@dalg24
Copy link
Member Author

dalg24 commented Jan 29, 2024

Ignoring the CI that is headed straight to timeout following the massive retrigger of testing all PRs.

@dalg24 dalg24 merged commit d2913cb into kokkos:develop Jan 29, 2024
30 of 31 checks passed
@dalg24 dalg24 mentioned this pull request Jan 29, 2024
@dalg24 dalg24 deleted the num_devices branch January 30, 2024 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) Feature Request Create new capability; will potentially require voting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants