OpenMPTarget: Changes to the hierarchical parallelism #3808
Conversation
…uce implementations.
The new code is a lot cleaner, but I am not sure it works for every corner case.
// Iterate through the number of teams until league_size and assign the
// league_id accordingly
for (int league_id = blockIdx, track = 0; league_id < league_size;
Does that loop always work? I feel that this works only if league_size < OpenMPTargetExec::MAX_ACTIVE_TEAMS and num_teams = nteams, which the standard does not guarantee. The comment indicates that you want to iterate through the number of teams, but league_size is an upper bound on that number, right?
@Rombur - The standard guarantees that the number of teams specified in the num_teams clause is always less than or equal to the maximum number of concurrently running teams. So I think the loop would work by looping around the generated gridDim. I tried it with a test case where it successfully works when league_size > max_active_teams. I understand that it might not be hitting the corner case that would break the logic.
const int gridDim = omp_get_num_teams();

for (int league_id = blockIdx, track = 0; league_id < league_size;
     league_id += gridDim, ++track) {
Same comment as above.
    m_functor(team);
  }
} else
  printf("`num_teams` clause was not respected. \n");
Abort?
…ts in case for revert.
Looks good
This looks good. I would remove the old code right now unless you have a reason to keep it for now. We can always use git to look at the old implementation.
int* lock_array = OpenMPTargetExec::get_lock_array(max_active_teams);

// Saving the older implementation that uses `atomic_compare_exchange` to
I am not a fan of keeping dead code around. The code is already saved in git.
@Rombur The only reason I have the old code in there is because of a slight apprehension that I might not have covered all corner cases. It would be easier to switch to older code this way. But you are right, we should get rid of it in the future, once all compilers supporting OpenMP offload are mature enough to reach this point.
// Saving the older implementation that uses `atomic_compare_exchange` to
// calculate the shared memory block index and `distribute` clause to distribute
// teams.
#if 0
Please change to
#define KOKKOS_IMPL_USE_NEW_CODE_PATH // choose a more sensible name
#ifdef KOKKOS_IMPL_USE_NEW_CODE_PATH
<new_code>
#else
<old_code>
#endif
...
#ifdef KOKKOS_IMPL_USE_NEW_CODE_PATH
<new_code>
#else
<old_code>
#endif
#undef KOKKOS_IMPL_USE_NEW_CODE_PATH
…itch between hierarchical implementations.
Retest this please.
* OpenMPTarget: Changes to the hierarchical ParallelFor and ParallelReduce implementations.
* OpenMPTarget: Edited comments for better understanding.
* OpenMPTarget: Adding a check for the `num_teams` clause.
* OpenMPTarget: Added back the older hierarchical parallelism in comments in case for revert.
* OpenMPTarget: Adding `KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL` macro to switch between hierarchical implementations.

Co-authored-by: rgayatri <rgayatri@lbl.gov>
Updates:
* Use the `num_teams` clause to restrict the maximum number of OpenMPTarget teams generated to the maximum possible in-flight teams, avoiding the use of `atomic_compare_exchange`.
* Merged `teams-distribute` and `parallel`. Gives a 2x speedup on a performance test case.

Idea provided by Christopher Daley from NERSC. @cdaley