-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenMPTarget: Update hierarchical parallelism. #6043
OpenMPTarget: Update hierarchical parallelism. #6043
Conversation
// Multiply the number of processors with teh SIMD length. | ||
max_threads *= 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Multiply the number of processors with teh SIMD length. | |
max_threads *= 64; | |
// Multiply the number of processors by the SIMD length. | |
max_threads *= 64; |
Where do you get 64 from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the SIMD length to be 64 to create a maximum number of possible "threads" on Intel architectures.
#endif | ||
#elif defined(KOKKOS_ARCH_INTEL_GPU) | ||
#pragma omp target map(max_threads) | ||
{ max_threads = omp_get_num_procs(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this give you the number of EUs? If so, shouldn't the value be a little higher since multiple workgroups could be scheduled by a EU? See https://github.com/intel/llvm/blob/756ba2616111235bba073e481b7f1c8004b34ee6/sycl/source/detail/reduction.cpp#L51-L62.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per OpenMP spec, this should give us the number of procs on the device and each proc can then execute a SIMD instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gives you the number of total hardware-threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that OK to invoke every time? Should we be caching the value after the first call?
#if !defined(KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU) | ||
#pragma omp target teams thread_limit(team_size) firstprivate(a_functor) \ | ||
num_teams(max_active_teams) is_device_ptr(scratch_ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks similar to #6035 and thus I would expect #6035 (comment) to also apply. Why is restricting the number of workgroups/teams here, in general, a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No its not a good idea to restrict the number of teams but unfortunately for the OpenMPTarget backend, we need a tight control over the number of teams generated as we have data structures that depend on the maximum number of in-flight teams.
d3fb726
to
a4c66f0
Compare
#endif | ||
#elif defined(KOKKOS_ARCH_INTEL_GPU) | ||
#pragma omp target map(max_threads) | ||
{ max_threads = omp_get_num_procs(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that OK to invoke every time? Should we be caching the value after the first call?
|
||
// Multiply the number of processors with the SIMD length. | ||
max_threads *= 32; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about AMD? It is OK to fix later but maybe you still need the FIXME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The omp_get_num_proc
is only invoked once for every instance and I think that should be ok.
The AMD currently has the default 2048*80 number of threads which is fine for now. I will add a FIXME line there to fix it in the future. I don't have a way to get the right number for AMD GPUs right now.
// Intel architectures prefer the classical hierarchical parallelism that relies | ||
// on OpenMP. | ||
#if defined(KOKKOS_ARCH_INTEL_GPU) | ||
#define KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that something that it would make sense to define somewhere more "central" and include in print as part of the configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put that macro in a more central place. I will do that. It will be better since we need it in more than one file.
But I don't think we need this in configuration printing since the user is not concerned (IMO) with how we implement hierarchical parallelism.
// max_active_teams is the number of active teams on the given hardware. | ||
// We set the number of teams to be twice the number of max_active_teams for | ||
// the compiler to pick the right number in its case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand this comment. Why are we setting the number to twice the max instead of just once?
Also what do you mean "in its case"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ideal case would be to set a large enough upper bound on the number of teams generated using omp_set_num_teams
and and let compiler pick up the right number of teams for a given target
region. We do that in resize_scratch
where we assign the upper bound to be 2*max_active_teams . However that call is not respected and hence the need to add num_teams
.
The idea is to not hamper compiler's ability to chose the appropriate number of teams (hence a large upper bound) but also to have control over that number so we can allocate data for each team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure 2ac6a07 accidentally disabled all the code paths guarded by #ifdef KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU
if (omp_get_num_teams() > max_active_teams) | ||
Kokkos::abort("`omp_set_num_teams` call was not respected.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of that check? Can this ever fail? That would indicate a bug in the OpenMP implementation wouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thats the intention, that if there is a bug in OpenMP, dont run it because in this case it might lead to race conditions.
I don't understand, the macro is now in |
It is |
Ok now |
That means only one of the files will have the new code path enabled, whichever gets included first. |
Oh right thats true. I need to undef it in a common place only once. |
Removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me (apart from the typo you might want to fix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small things
// nteams should not exceed the maximum in-flight teams possible. | ||
const auto nteams = | ||
league_size < max_active_teams ? league_size : max_active_teams; | ||
int max_active_teams = omp_get_max_teams(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not min(nteams, omp_get_max_team())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That min
is happening in resize_scratch
and the resulting value is set in omp_set_num_teams
.
The value is then being accessed here using omp_get_max_teams
rather than setting another variable that is passed between routines.
Ignoring HIP build that timed out. |
* OpenMPTarget: Update hierarchical parallelism. * OpenMPTarget: Update initialize routine. * OpenMPTarget: Remove num_teams for Intel GPUs. * OpenMPTarget: fix comment. * OpenMPTarget: Oversubscribe number of teams. * OpenMPTarget: Move KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU macro to a central location. * OpenMPTarget: Add num_teams clause for Intel GPUs too. * OpenMPTarget: Moving the undef for Intel GPUs into files that include the macro. * OpenMPTarget: Updated macro name and added to print_configuration. * OpenMPTarget: Adding impl to macro. * OpenMPTarget: Fix typo for Intel GPUs. * OpenMPTarget: Fix print_configuration. * OpenMPTarget: Rename variable names. * OpenMPTarget: clang format. --------- Co-authored-by: Rahulkumar Gayatri <rgayatri@lbl.gov>
The PR does the following:
Todo: