Significant performance regression in LAMMPS after updating Kokkos #1139

stanmoore1 · 2017-10-02T23:42:49Z

I'm running the Kokkos version of LAMMPS and am seeing a ~40% decrease in performance on GPUs after switching from Kokkos 2.03.05 (2017-05-27) to Kokkos 2.03.13 (2017-07-27). The regression persists after updating to version 2.04.00 (2017-08-16). Are there any known issues that could be related?

stanmoore1 · 2017-10-02T23:43:43Z

I had the versions backwards, updated the comment.

hcedwar · 2017-10-03T00:15:03Z

We suspect the modification of the CUDA launch bounds that went in around June. Investigating it is on our to-do list.

ibaned · 2017-10-03T06:25:11Z

@stanmoore1 could you post details of a representative LAMMPS case to run including launch command?

stanmoore1 · 2017-10-03T15:38:20Z

@hcedwar your guess seems correct. The CUDA launch bounds change is 4e3c6e7 (PR #909) but it doesn't compile so I can't test it directly. Looking at commits close to that, the LAMMPS performance figure of merit for afa25d1 (PR #908) before the launch bounds change is 152 timesteps/s and for c979a6f (PR #912) after the launch bounds change (which fixed the compile error), the LAMMPS figure of merit is 97 timesteps/s, a decrease of ~40%.
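For reference, the relative change between the two figures of merit quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function and variable names are hypothetical and are not part of LAMMPS or Kokkos.

```cpp
// Relative decrease between two figures of merit (higher is better).
// Hypothetical helper for illustration only.
constexpr double relative_decrease(double before, double after) {
  return (before - after) / before;
}

// Values quoted in the comment above, in timesteps/s.
constexpr double fom_before = 152.0;  // afa25d1 (PR #908), before the launch bounds change
constexpr double fom_after  = 97.0;   // c979a6f (PR #912), after the launch bounds change

// (152 - 97) / 152 is roughly 0.36, which the comment rounds to ~40%.
constexpr double drop = relative_decrease(fom_before, fom_after);
```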

stanmoore1 · 2017-10-03T15:47:10Z

@ibaned the test I'm running is bundled with LAMMPS. I'm running on a single P100 node:

export LMP_ROOT=~/lammps_master
cd ~/lammps_master/src/USER-INTEL/TEST

mpiexec -np 1 ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.intel.tersoff -k on g 1 -sf kk -pk kokkos comm device binsize 4.2 newton on neigh half -v m 0.2

stanmoore1 · 2017-10-03T16:54:19Z

As a final test, I commented out the __launch_bounds__ directives in Kokkos_CudaExec.hpp and the performance regression went away:

diff --git a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp b/lib/kokkos/core/src/
index cae8ecd..079d9f0 100644
--- a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
+++ b/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp
@@ -164,7 +164,7 @@ static void cuda_parallel_launch_constant_memory()

 template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
 __global__
-__launch_bounds__(maxTperB, minBperSM)
+//__launch_bounds__(maxTperB, minBperSM)
 static void cuda_parallel_launch_constant_memory()
 {
   const DriverType & driver =
@@ -182,7 +182,7 @@ static void cuda_parallel_launch_local_memory( const DriverT

 template< class DriverType, unsigned int maxTperB, unsigned int minBperSM >
 __global__
-__launch_bounds__(maxTperB, minBperSM)
+//__launch_bounds__(maxTperB, minBperSM)
 static void cuda_parallel_launch_local_memory( const DriverType driver )
 {
   driver();

hcedwar · 2017-10-03T16:56:14Z

Thank you for generating evidence for the suspected root cause. Fixing this will be a high priority as soon as we (the whole Kokkos development team) return from travel.

ibaned · 2017-10-05T08:03:07Z

PR #1143 should fix this

Set the default CUDA launch bounds to <0,0>, and do not use CUDA __launch_bounds__ unless CUDA launch bounds are explicitly specified.
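The approach described above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from PR #1143 or PR #1147: the names `LaunchBounds`, `KernelSelector`, `kernel_with_bounds`, `kernel_without_bounds`, and `NoOpDriver`, as well as the `SKETCH_*` macros, are hypothetical. The idea is that a default of <0,0> selects a kernel entry point compiled without `__launch_bounds__`, while any explicitly specified bounds select the annotated one. The macros let the same dispatch logic compile as plain C++ when nvcc is not used.

```cpp
// Hypothetical sketch; not the Kokkos source.
// Under nvcc the macros expand to the CUDA qualifiers; otherwise they
// expand to nothing so the dispatch logic compiles as plain C++.
#if defined(__CUDACC__)
  #define SKETCH_GLOBAL __global__
  #define SKETCH_DEVICE __device__
  #define SKETCH_LAUNCH_BOUNDS(t, b) __launch_bounds__(t, b)
#else
  #define SKETCH_GLOBAL
  #define SKETCH_DEVICE
  #define SKETCH_LAUNCH_BOUNDS(t, b)
#endif

// Launch bounds as a type; the default <0,0> means "not specified".
template <unsigned int maxTperB = 0, unsigned int minBperSM = 0>
struct LaunchBounds {};

// Kernel entry point annotated with launch bounds.
template <class DriverType, unsigned int maxTperB, unsigned int minBperSM>
SKETCH_GLOBAL SKETCH_LAUNCH_BOUNDS(maxTperB, minBperSM)
void kernel_with_bounds(const DriverType driver) { driver(); }

// Kernel entry point with no launch bounds; the compiler chooses
// register usage on its own.
template <class DriverType>
SKETCH_GLOBAL
void kernel_without_bounds(const DriverType driver) { driver(); }

template <class DriverType, class Bounds>
struct KernelSelector;

// Explicitly specified bounds: use the annotated kernel.
template <class DriverType, unsigned int maxTperB, unsigned int minBperSM>
struct KernelSelector<DriverType, LaunchBounds<maxTperB, minBperSM>> {
  static constexpr bool uses_launch_bounds = true;
  using kernel_ptr = void (*)(const DriverType);
  static kernel_ptr kernel() {
    return &kernel_with_bounds<DriverType, maxTperB, minBperSM>;
  }
};

// Default <0,0>: use the kernel without __launch_bounds__.
template <class DriverType>
struct KernelSelector<DriverType, LaunchBounds<0, 0>> {
  static constexpr bool uses_launch_bounds = false;
  using kernel_ptr = void (*)(const DriverType);
  static kernel_ptr kernel() { return &kernel_without_bounds<DriverType>; }
};

// Trivial functor standing in for a real parallel driver.
struct NoOpDriver {
  SKETCH_DEVICE void operator()() const {}
};
```

With this layout, code that never mentions launch bounds gets `KernelSelector<NoOpDriver, LaunchBounds<>>`, which resolves to the unannotated kernel, and only callers that opt in (for example `LaunchBounds<256, 2>`) get the `__launch_bounds__` variant.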

hcedwar · 2017-10-05T19:11:02Z

Using PR #1147 to fix.

Fix for #1139 performance regression bug (and #1140 for tracking).

stanmoore1 mentioned this issue Oct 3, 2017

Develop #909

Merged

hcedwar mentioned this issue Oct 3, 2017

Cuda launch bounds performance regression bug #1140

Closed

ibaned added this to the 2017 October milestone Oct 3, 2017

ibaned added the Bug label (Broken / incorrect code; it could be Kokkos' responsibility, or others', e.g., Trilinos) Oct 3, 2017

This was referenced Oct 3, 2017

Update to Kokkos r2.04.04 and add workaround for performance regression lammps/lammps#677

Merged

Allowing calls with truly no __launch_bounds__ #1143

Closed

hcedwar added a commit that referenced this issue Oct 5, 2017

Fix for #1139 performance regression bug (and #1140 for tracking).

be20a09

Set the default CUDA launch bounds to <0,0>, and do not use CUDA __launch_bounds__ unless CUDA launch bounds are explicitly specified.

crtrott added a commit that referenced this issue Oct 5, 2017

Merge pull request #1147 from kokkos/issue-1140

df8a0a4

Fix for #1139 performance regression bug (and #1140 for tracking).

ibaned added the InDevelop label Oct 6, 2017

crtrott closed this as completed Oct 28, 2017

crtrott mentioned this issue Oct 28, 2017

Kokkos promotion 2.04.04 -> 2.04.11 trilinos/Trilinos#1916

Merged
