Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor device selection at initialization #5211

Merged
merged 12 commits into from
Jul 14, 2022

Conversation

dalg24
Copy link
Member

@dalg24 dalg24 commented Jul 13, 2022

Deprecate num_devices and skip_device settings, --kokkos-num-devices command line argument, KOKKOS_NUM_DEVICES, KOKKOS_SKIP_DEVICES, and KOKKOS_RAND_DEVICES environment variable.

Introduce as a replacement

  • map_device_id_by setting along with --kokkos-map-device-id-by and KOKKOS_MAP_DEVICE_ID_BY which can be set to "random" or "mpi_rank".
  • KOKKOS_VISIBLE_DEVICES environment variable to specify device ids as a comma-separated sequence of integers

@crtrott
Copy link
Member

crtrott commented Jul 13, 2022

Summary:

KOKKOS_VISIBLE_DEVICES -> externally enforced mask

// These also exist as members in the initialization struct
--kokkos-device-id=2
--kokkos-map-device-id-by=random
--kokkos-map-device-id-by=mpi-rank
(--kokkos-map-device-id-by=[specific-device(2)|2])

@crtrott crtrott added the Blocks Promotion Overview issue for release-blocking bugs label Jul 13, 2022
@dalg24 dalg24 force-pushed the refactor_device_selection branch from db45bd3 to ae03cc4 Compare July 14, 2022 02:18
@dalg24 dalg24 marked this pull request as ready for review July 14, 2022 02:18
@dalg24 dalg24 force-pushed the refactor_device_selection branch from ae03cc4 to 963df48 Compare July 14, 2022 02:26
@fnrizzi
Copy link
Contributor

fnrizzi commented Jul 14, 2022

@dalg24 You said KOKKOS_VISIBLE_DEVICES stores a comma separated list of integers, so this is like NVIDIA_VISIBLE_DEVICES env variable right? Since it is a list of integers, what about renaming it KOKKOS_VISIBLE_DEVICE_IDS? I realize this departs from the cuda naming, but IMO it is more expressive of what this env var actually stores.

When we say "device id", it is implied id with respect to enumeration given the by, e.g., nvidia-smi right?

@dalg24
Copy link
Member Author

dalg24 commented Jul 14, 2022

@dalg24 You said KOKKOS_VISIBLE_DEVICES stores a comma separated list of integers, so this is like NVIDIA_VISIBLE_DEVICES env variable right?

I was going for {CUDA,HIP}_VISIBLE_DEVICES. I had actually forgotten about NVIDIA_VISIBLE_DEVICES. That one actually also supports all, none, void or empty or unset but these make little sense for us.

Since it is a list of integers, what about renaming it KOKKOS_VISIBLE_DEVICE_IDS? I realize this departs from the cuda naming, but IMO it is more expressive of what this env var actually stores.

I like KOKKOS_VISIBLE_DEVICES better because it is shorter, aligned with current vendor practice, and somehow making "device ids visible" rubs me wrong. But we will sees if others agree with you.

When we say "device id", it is implied id with respect to enumeration given the by, e.g., nvidia-smi right?

Yes, in whatever order the vendor runtimes detects them.

core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
core/src/impl/Kokkos_Core.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@JBludau JBludau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@@ -780,6 +840,7 @@ void Kokkos::Impl::parse_command_line_arguments(
!kokkos_num_devices_found) {
num_devices = std::stoi(num1_only);
settings.set_num_devices(num_devices);
settings.set_map_device_id_by("mpi_rank");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XzzX @crtrott
I suspect the issue in #5252 comes from this line. Probably needs to move it to L815

Comment on lines -286 to -304
int use_gpu = settings.has_device_id() ? settings.get_device_id() : -1;
const int ndevices = [](int num_devices) -> int {
if (num_devices > 0) return num_devices;
#if defined(KOKKOS_ENABLE_CUDA)
return Cuda::detect_device_count();
#elif defined(KOKKOS_ENABLE_HIP)
return Experimental::HIP::detect_device_count();
#elif defined(KOKKOS_ENABLE_SYCL)
return sycl::device::get_devices(sycl::info::device_type::gpu).size();
#else
return num_devices;
#endif
}(settings.has_num_devices() ? settings.get_num_devices() : -1);
const int skip_device =
settings.has_skip_device() ? settings.get_skip_device() : 9999;

// if the exact device is not set, but ndevices was given, assign round-robin
// using on-node MPI rank
if (use_gpu < 0) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XzzX @crtrott I think this is what changed. Will fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocks Promotion Overview issue for release-blocking bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants