Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMPTarget: Update hierarchical parallelism. #6043

Merged
merged 14 commits into from
May 3, 2023

Conversation

rgayatri23
Copy link
Contributor

@rgayatri23 rgayatri23 commented Apr 10, 2023

The PR does the following:

  • Update concurrency based on underlying architecture
  • Update max_active_teams based on the updated concurrency
  • Introduce a different style of hierarchical parallelism for Intel architectures

Todo:

  • Update concurrency specifically for AMD GPUs.

Comment on lines 87 to 88
// Multiply the number of processors with teh SIMD length.
max_threads *= 64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Multiply the number of processors with teh SIMD length.
max_threads *= 64;
// Multiply the number of processors by the SIMD length.
max_threads *= 64;

Where do you get 64 from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed the SIMD length to be 64 to create a maximum number of possible "threads" on Intel architectures.

#endif
#elif defined(KOKKOS_ARCH_INTEL_GPU)
#pragma omp target map(max_threads)
{ max_threads = omp_get_num_procs(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this give you the number of EUs? If so, shouldn't the value be a little higher since multiple workgroups could be scheduled by a EU? See https://github.com/intel/llvm/blob/756ba2616111235bba073e481b7f1c8004b34ee6/sycl/source/detail/reduction.cpp#L51-L62.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per OpenMP spec, this should give us the number of procs on the device and each proc can then execute a SIMD instruction.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gives you the number of total hardware-threads

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that OK to invoke every time? Should we be caching the value after the first call?

Comment on lines 141 to 143
#if !defined(KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU)
#pragma omp target teams thread_limit(team_size) firstprivate(a_functor) \
num_teams(max_active_teams) is_device_ptr(scratch_ptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks similar to #6035 and thus I would expect #6035 (comment) to also apply. Why is restricting the number of workgroups/teams here, in general, a good idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No its not a good idea to restrict the number of teams but unfortunately for the OpenMPTarget backend, we need a tight control over the number of teams generated as we have data structures that depend on the maximum number of in-flight teams.

@rgayatri23 rgayatri23 force-pushed the OpenMPTarget_update_hierarchical branch from d3fb726 to a4c66f0 Compare April 12, 2023 06:14
#endif
#elif defined(KOKKOS_ARCH_INTEL_GPU)
#pragma omp target map(max_threads)
{ max_threads = omp_get_num_procs(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that OK to invoke every time? Should we be caching the value after the first call?


// Multiply the number of processors with the SIMD length.
max_threads *= 32;
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about AMD? It is OK to fix later but maybe you still need the FIXME

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The omp_get_num_proc is only invoked once for every instance and I think that should be ok.
The AMD currently has the default 2048*80 number of threads which is fine for now. I will add a FIXME line there to fix it in the future. I don't have a way to get the right number for AMD GPUs right now.

Comment on lines 25 to 29
// Intel architectures prefer the classical hierarchical parallelism that relies
// on OpenMP.
#if defined(KOKKOS_ARCH_INTEL_GPU)
#define KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something that it would make sense to define somewhere more "central" and include in print as part of the configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put that macro in a more central place. I will do that. It will be better since we need it in more than one file.
But I don't think we need this in configuration printing since the user is not concerned (IMO) with how we implement hierarchical parallelism.

Comment on lines +113 to +115
// max_active_teams is the number of active teams on the given hardware.
// We set the number of teams to be twice the number of max_active_teams for
// the compiler to pick the right number in its case.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this comment. Why are we setting the number to twice the max instead of just once?
Also what do you mean "in its case"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ideal case would be to set a large enough upper bound on the number of teams generated using omp_set_num_teams and and let compiler pick up the right number of teams for a given target region. We do that in resize_scratch where we assign the upper bound to be 2*max_active_teams . However that call is not respected and hence the need to add num_teams.

The idea is to not hamper compiler's ability to chose the appropriate number of teams (hence a large upper bound) but also to have control over that number so we can allocate data for each team.

Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure 2ac6a07 accidentally disabled all the code paths guarded by #ifdef KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU

Comment on lines +140 to +141
if (omp_get_num_teams() > max_active_teams)
Kokkos::abort("`omp_set_num_teams` call was not respected.\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of that check? Can this ever fail? That would indicate a bug in the OpenMP implementation wouldn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats the intention, that if there is a bug in OpenMP, dont run it because in this case it might lead to race conditions.

@rgayatri23
Copy link
Contributor Author

I am pretty sure 2ac6a07 accidentally disabled all the code paths guarded by #ifdef KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU

I don't understand, the macro is now in core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp .
Am I missing something?

@dalg24
Copy link
Member

dalg24 commented Apr 26, 2023

I am pretty sure 2ac6a07 accidentally disabled all the code paths guarded by #ifdef KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU

I don't understand, the macro is now in core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp . Am I missing something?

It is #undefd at the end of the header and never visible from the parallel constructs implementations

@rgayatri23
Copy link
Contributor Author

I am pretty sure 2ac6a07 accidentally disabled all the code paths guarded by #ifdef KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU

I don't understand, the macro is now in core/src/OpenMPTarget/Kokkos_OpenMPTarget_Parallel.hpp . Am I missing something?

It is #undefd at the end of the header and never visible from the parallel constructs implementations

Ok now undefing from files that include the file that defines the macro.

@dalg24
Copy link
Member

dalg24 commented Apr 26, 2023

Ok now undefing from files that include the file that defines the macro.

That means only one of the files will have the new code path enabled, whichever gets included first.

@rgayatri23
Copy link
Contributor Author

Ok now undefing from files that include the file that defines the macro.

That means only one of the files will have the new code path enabled, whichever gets included first.

Oh right thats true. I need to undef it in a common place only once.

@rgayatri23
Copy link
Contributor Author

Ok now undefing from files that include the file that defines the macro.

That means only one of the files will have the new code path enabled, whichever gets included first.

Oh right thats true. I need to undef it in a common place only once.

Removed the undef and added the macro in print_configuration.

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me (apart from the typo you might want to fix).

core/src/OpenMPTarget/Kokkos_OpenMPTarget_Instance.cpp Outdated Show resolved Hide resolved
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small things

// nteams should not exceed the maximum in-flight teams possible.
const auto nteams =
league_size < max_active_teams ? league_size : max_active_teams;
int max_active_teams = omp_get_max_teams();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not min(nteams, omp_get_max_team())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That min is happening in resize_scratch and the resulting value is set in omp_set_num_teams .
The value is then being accessed here using omp_get_max_teams rather than setting another variable that is passed between routines.

@rgayatri23 rgayatri23 added this to the Release 4.1 milestone May 3, 2023
@dalg24
Copy link
Member

dalg24 commented May 3, 2023

Ignoring HIP build that timed out.

@dalg24 dalg24 merged commit 4b6d971 into kokkos:develop May 3, 2023
26 of 27 checks passed
@rgayatri23 rgayatri23 deleted the OpenMPTarget_update_hierarchical branch May 4, 2023 01:32
@masterleinad masterleinad mentioned this pull request Jun 16, 2023
nliber pushed a commit to nliber/kokkos that referenced this pull request Jun 22, 2023
* OpenMPTarget: Update hierarchical parallelism.

* OpenMPTarget: Update initialize routine.

* OpenMPTarget: Remove num_teams for Intel GPUs.

* OpenMPTarget: fix comment.

* OpenMPTarget: Oversubscribe number of teams.

* OpenMPTarget: Move KOKKOS_IMPL_HIERARCHICAL_INTEL_GPU macro to a central location.

* OpenMPTarget: Add num_teams clause for Intel GPUs too.

* OpenMPTarget: Moving the undef for Intel GPUs into files that include the macro.

* OpenMPTarget: Updated macro name and added to print_configuration.

* OpenMPTarget: Adding impl to macro.

* OpenMPTarget: Fix typo for Intel GPUs.

* OpenMPTarget: Fix print_configuration.

* OpenMPTarget: Rename variable names.

* OpenMPTarget: clang format.

---------

Co-authored-by: Rahulkumar Gayatri <rgayatri@lbl.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants