
OpenMPTarget: Changes to the hierarchical parallelism #3808

Merged: 5 commits merged into kokkos:develop on Mar 3, 2021

Conversation

@rgayatri23 (Contributor) commented Feb 24, 2021

Updates:

  1. Use the `num_teams` clause to restrict the number of OpenMPTarget teams generated to the maximum possible number of in-flight teams, avoiding the use of `atomic_compare_exchange`.
  2. Manually workshare the "league" loop in hierarchical parallelism to avoid code between `teams distribute` and `parallel`; this gives a 2x speedup on a performance test case. A sketch of the pattern follows this list.
     Idea provided by Christopher Daley from NERSC. @cdaley
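
A minimal standalone sketch of the pattern described in update 2 (all names here are hypothetical, not the actual Kokkos source): the league loop becomes a grid-stride loop placed inside the `parallel` region, so no serial code sits between `teams` and `parallel`, and `num_teams` caps the launched teams at the number that can be concurrently in flight.

// Sketch only: grid-stride league loop inside the parallel region.
#include <omp.h>
#include <algorithm>

void hierarchical_for(int league_size, int team_size, int max_active_teams) {
  // Never request more teams than can run concurrently.
  const int nteams = std::min(league_size, max_active_teams);
#pragma omp target teams num_teams(nteams) thread_limit(team_size)
#pragma omp parallel
  {
    const int blockIdx = omp_get_team_num();
    const int gridDim  = omp_get_num_teams();
    // Each team visits league ids blockIdx, blockIdx + gridDim, ...
    for (int league_id = blockIdx; league_id < league_size;
         league_id += gridDim) {
      // ... construct the team handle for (league_id, league_size) and
      // invoke the user functor; team-level work inside the functor is
      // shared among the team's threads ...
    }
  }
}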

@Rombur (Member) left a comment:
The new code is a lot cleaner but I am not sure it works for every corner case.


// Iterate through the number of teams until league_size and assign the
// league_id accordingly
for (int league_id = blockIdx, track = 0; league_id < league_size;
     league_id += gridDim, ++track) {
@Rombur (Member) commented:

Does that loop always work? I feel that this works only if league_size < OpenMPTargetExec::MAX_ACTIVE_TEAMS and num_teams = nteams, which the standard does not guarantee. The comment indicates that you want to iterate through the number of teams, but league_size is an upper bound on that number, right?

@rgayatri23 (Contributor, Author) replied:

@Rombur - The standard guarantees that the number of teams specified in the `num_teams` clause is always less than or equal to the maximum number of concurrently running teams. So I think the loop works by looping around the generated gridDim. I tried it with a test case where it successfully works when league_size > max_active_teams. I understand that it might not hit the corner case that could break the logic.

const int gridDim = omp_get_num_teams();

for (int league_id = blockIdx, track = 0; league_id < league_size;
league_id += gridDim, ++track) {
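
To support that claim, a quick host-side sketch (hypothetical values, with gridDim smaller than league_size) showing that the grid-stride loop covers every league id exactly once across all teams:

// Sketch only: verify grid-stride coverage when league_size > gridDim.
#include <cassert>
#include <vector>

int main() {
  const int league_size = 10, gridDim = 4;  // hypothetical values
  std::vector<int> visits(league_size, 0);
  for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
    for (int league_id = blockIdx; league_id < league_size;
         league_id += gridDim)
      ++visits[league_id];
  for (int v : visits) assert(v == 1);  // no id skipped or repeated
  return 0;
}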
@Rombur (Member) commented:

Same comment as above.

m_functor(team);
}
} else
printf("`num_teams` clause was not respected. \n");
@Rombur (Member) commented:

Abort?
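
For reference, a sketch of what the suggested hard failure might look like. `Kokkos::abort` is the usual Kokkos entry point for this; whether it is usable at this exact point inside the target region is an assumption of this sketch.

// Sketch only: replace the diagnostic printf with a hard failure.
// Assumes Kokkos::abort can be called here, which is not verified.
#include <Kokkos_Core.hpp>

inline void fail_if_num_teams_ignored(int nteams, int max_active_teams) {
  if (nteams > max_active_teams)
    Kokkos::abort("`num_teams` clause was not respected.\n");
}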

@crtrott (Member) left a comment:

Looks good

@Rombur (Member) left a comment:

This looks good. I would remove the old code right now unless you have a reason to keep it for now. We can always use git to look at the old implementation.


int* lock_array = OpenMPTargetExec::get_lock_array(max_active_teams);

// Saving the older implementation that uses `atomic_compare_exchange` to
// calculate the shared memory block index and `distribute` clause to distribute
// teams.
@Rombur (Member) commented:

I am not a fan of keeping dead code around. The code is already saved in git.

@rgayatri23 (Contributor, Author) replied:

@Rombur The only reason I kept the old code in there is a slight apprehension that I might not have covered all corner cases; it would be easier to switch back to the older code this way. But you are right, we should get rid of it in the future, once all compilers supporting OpenMP offload are mature enough to reach this point.

@dalg24 previously requested changes Mar 3, 2021
// Saving the older implementation that uses `atomic_compare_exchange` to
// calculate the shared memory block index and `distribute` clause to distribute
// teams.
#if 0
@dalg24 (Member) commented Mar 3, 2021:

Please change to

#define KOKKOS_IMPL_USE_NEW_CODE_PATH // choose a more sensible name
#ifdef KOKKOS_IMPL_USE_NEW_CODE_PATH
<new_code>
#else
<old_code>
#endif

...
#ifdef KOKKOS_IMPL_USE_NEW_CODE_PATH
<new_code>
#else
<old_code>
#endif

#undef KOKKOS_IMPL_USE_NEW_CODE_PATH
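
For reference, the PR ultimately adopted this structure under the name `KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL` (see the commit list below). Roughly, with placeholder bodies rather than the actual Kokkos source:

// Sketch only: the suggested guard with the macro name from the final
// commit; the two code-path bodies are placeholders.
#define KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL

#ifdef KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL
// new path: grid-stride league loop, no atomic_compare_exchange
#else
// old path: `distribute` clause plus atomic_compare_exchange for the
// shared-memory block index
#endif

#undef KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL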

Commit added: OpenMPTarget: Adding `KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL` macro to switch between hierarchical implementations.

@masterleinad (Contributor) commented:
Retest this please.

@crtrott dismissed dalg24’s stale review March 3, 2021 23:47

Was addressed

@crtrott merged commit e17aadb into kokkos:develop Mar 3, 2021
masterleinad pushed a commit to masterleinad/kokkos that referenced this pull request Mar 11, 2021
* OpenMPTarget: Changes to the hierarchical ParallelFor and ParallelReduce implementations.

* OpenMPTarget: Edited comments for better understanding.

* OpenMPTarget: Adding a check for the `num_teams` clause.

* OpenMPTarget: Added back the older hierarchical parallelism in comments in case for revert.

* OpenMPTarget: Adding `KOKKOS_IMPL_LOCK_FREE_HIERARCHICAL` macro to switch between hierarchical implementations.

Co-authored-by: rgayatri <rgayatri@lbl.gov>
@rgayatri23 deleted the OpenMPTarget_optimizations branch March 24, 2023 22:03