
reaxc/qeq optimization - using kokkos hierarchical parallelism #1496

Merged
merged 11 commits into from Jun 11, 2019

Conversation

@akkamesh (Collaborator) commented Jun 5, 2019

Summary
This PR provides several optimizations to the ReaxC QEq matrix computation (i.e., the FixQEqReaxKokkosComputeHFunctor method):

  • uses hierarchical parallelism
  • reduces global memory transactions by computing/loading data once and reusing it via scratch space
  • changes the execution policy from Kokkos::parallel_scan, which uses a multiple-pass algorithm requiring two kernel invocations, to Kokkos::parallel_for, which needs only a single kernel invocation and has fewer data dependencies

With this PR, the overall performance gain for a ReaxC simulation on GPUs is up to ~1.6X, and the FixQEqReaxKokkosComputeHFunctor method itself is 5-12X faster.

Author(s)
Kamesh Arumugam (NVIDIA)

Licensing
By submitting this pull request, I agree that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility
No issues.

Tagging @stanmoore1 for review.

@stanmoore1 (Contributor) commented Jun 5, 2019

Thanks @akkamesh. I will take a look.

@stanmoore1 (Contributor) commented Jun 6, 2019

I ran this on a CPU with 4 OMP threads. There is about a 3% slowdown compared to latest master, probably acceptable.

@stanmoore1 (Contributor) commented Jun 6, 2019

Running this through Kokkos package regression tests.

@stanmoore1 (Contributor) commented Jun 6, 2019

Regression tests look good.

@akkamesh (Collaborator, Author) commented Jun 7, 2019

> I ran this on a CPU with 4 OMP threads. There is about a 3% slowdown compared to latest master, probably acceptable.

Thanks, @stanmoore1. It is possible that the use of scratch space is causing the overhead in the OMP case.

@stanmoore1 (Contributor) commented Jun 7, 2019

I am seeing good speedup on GPUs.

@stanmoore1 (Contributor) commented Jun 7, 2019

For ~1 million atoms on a Broadwell node without any threading, I'm seeing a ~13% slowdown.

@akkamesh akkamesh requested a review from sjplimp as a code owner Jun 7, 2019

@stanmoore1 (Contributor) commented Jun 7, 2019

@akkamesh I made it so the CPU uses the original version and the GPU uses your enhanced team version with shared memory; that should give the best of both performance-wise. I'm testing again.

@stanmoore1 stanmoore1 force-pushed the akkamesh:enh-ext-reaxc branch from 499c185 to 9e3dc26 Jun 7, 2019

@stanmoore1 (Contributor) commented Jun 7, 2019

@akkamesh were there any optimizations for GPUs you found that slowed down the CPU version? If so, feel free to add them in again. Sometimes single-source performance portability is a pipe dream.

@stanmoore1 (Contributor) commented Jun 10, 2019

@akkamesh regression tests pass, and performance looks good. Do you have any other changes before we merge this?

@stanmoore1 stanmoore1 assigned akohlmey and unassigned stanmoore1 Jun 10, 2019

@stanmoore1 stanmoore1 requested review from akohlmey, athomps, rbberger and stanmoore1 and removed request for stanmoore1 Jun 10, 2019

@akkamesh (Collaborator, Author) commented Jun 10, 2019

> @akkamesh were there any optimizations for GPUs you found that slowed down the CPU version? If so, feel free to add them in again. Sometimes single-source performance portability is a pipe dream.

I agree. With single-source performance portability as the primary motivation, I refrained from adding any GPU-specific optimizations. If that is an option you don't mind considering, I will keep it in mind for future contributions.

@akkamesh (Collaborator, Author) commented Jun 10, 2019

> @akkamesh regression tests pass, and performance looks good. Do you have any other changes before we merge this?

Thanks @stanmoore1 for running all the tests and confirming the performance. I don't have any more changes for this PR.

@stanmoore1 (Contributor) commented Jun 10, 2019

@akkamesh hardware specialization is an option when necessary, especially for bottleneck kernels. Sometimes one algorithm works better on GPUs and another works better on CPUs; it is hard to abstract that away in Kokkos.

Merge branch 'master' into enh-ext-reaxc
Resolved Merge Conflict in src/KOKKOS/kokkos.cpp

@akohlmey akohlmey merged commit fe29572 into lammps:master Jun 11, 2019

6 checks passed:

  • lammps/pull-requests/build-docs-pr
  • lammps/pull-requests/cmake/cmake-serial-pr
  • lammps/pull-requests/kokkos-omp-pr
  • lammps/pull-requests/openmpi-pr
  • lammps/pull-requests/serial-pr
  • lammps/pull-requests/shlib-pr