Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant performance regression in LAMMPS after updating Kokkos #1139

Closed
stanmoore1 opened this issue Oct 2, 2017 · 9 comments
Closed
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Milestone

Comments

@stanmoore1
Copy link
Contributor

stanmoore1 commented Oct 2, 2017

I'm running the Kokkos version of LAMMPS and am seeing a ~40% decrease in performance on GPUs when I switched from 2.03.05 (2017-05-27) to Kokkos 2.03.13 (2017-07-27). The performance regression persisted after I updated to version 2.04.00 (2017-08-16). Any known issues that could be related?

@stanmoore1
Copy link
Contributor Author

I had the versions backwards, updated the comment.

@hcedwar
Copy link
Contributor

hcedwar commented Oct 3, 2017

We suspect modification of Cuba launch bounds that went in ~June. On our to-do list to investigate.

@ibaned
Copy link
Contributor

ibaned commented Oct 3, 2017

@stanmoore1 could you post details of a representative LAMMPS case to run including launch command?

@stanmoore1
Copy link
Contributor Author

@hcedwar your guess seems correct. The CUDA launch bounds change is 4e3c6e7 (PR #909) but it doesn't compile so I can't test it directly. Looking at commits close to that, the LAMMPS performance figure of merit for afa25d1 (PR #908) before the launch bounds change is 152 timesteps/s and for c979a6f (PR #912) after the launch bounds change (which fixed the compile error), the LAMMPS figure of merit is 97 timesteps/s, a decrease of ~40%.

@stanmoore1 stanmoore1 mentioned this issue Oct 3, 2017
@stanmoore1
Copy link
Contributor Author

@ibaned the test I'm running is bundled with LAMMPS. I'm running on a single P100 node:

export LMP_ROOT=~/lammps_master
cd ~/lammps_master/src/USER-INTEL/TEST

mpiexec -np 1 ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.intel.tersoff -k on g 1 -sf kk -pk kokkos comm device binsize 4.2 newton on neigh half -v m 0.2

@stanmoore1
Copy link
Contributor Author

stanmoore1 commented Oct 3, 2017

As a final test, I commented out the __launch_bounds__ directives in Kokkos_CudaExec.hpp and the performance regression went away:

diff --git a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp b/lib/kokkos/core/src/
index cae8ecd..079d9f0 100644
--- a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
@@ -164,7 +164,7 @@ static void cuda_parallel_launch_constant_memory()

 template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
 __global__
-__launch_bounds__(maxTperB, minBperSM)
+//__launch_bounds__(maxTperB, minBperSM)
 static void cuda_parallel_launch_constant_memory()
 {
   const DriverType & driver =
@@ -182,7 +182,7 @@ static void cuda_parallel_launch_local_memory( const DriverT

 template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
 __global__
-__launch_bounds__(maxTperB, minBperSM)
+//__launch_bounds__(maxTperB, minBperSM)
 static void cuda_parallel_launch_local_memory( const DriverType driver )
 {
   driver();

@hcedwar
Copy link
Contributor

hcedwar commented Oct 3, 2017

Thank you for generating proof / evidence for the suspected root cause. As soon as we (the whole Kokkos development team) return from travel is a high priority to fix.

@ibaned ibaned added this to the 2017 October milestone Oct 3, 2017
@ibaned ibaned added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Oct 3, 2017
@ibaned
Copy link
Contributor

ibaned commented Oct 5, 2017

PR #1143 should fix this

hcedwar added a commit that referenced this issue Oct 5, 2017
Set default CUDA launch bounds to <0,0> and when do not use
CUDA __launch_bounds__ unless CUDA launch bounds are explicitly specified.
@hcedwar
Copy link
Contributor

hcedwar commented Oct 5, 2017

Using PR #1147 to fix.

crtrott added a commit that referenced this issue Oct 5, 2017
Fix for #1139 performance regression bug (and #1140 for tracking).
@crtrott crtrott closed this as completed Oct 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Projects
None yet
Development

No branches or pull requests

4 participants