-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significant performance regression in LAMMPS after updating Kokkos #1139
Comments
I had the versions backwards, updated the comment. |
We suspect modification of Cuba launch bounds that went in ~June. On our to-do list to investigate. |
@stanmoore1 could you post details of a representative LAMMPS case to run including launch command? |
@hcedwar your guess seems correct. The CUDA launch bounds change is 4e3c6e7 (PR #909) but it doesn't compile so I can't test it directly. Looking at commits close to that, the LAMMPS performance figure of merit for afa25d1 (PR #908) before the launch bounds change is 152 timesteps/s and for c979a6f (PR #912) after the launch bounds change (which fixed the compile error), the LAMMPS figure of merit is 97 timesteps/s, a decrease of ~40%. |
@ibaned the test I'm running is bundled with LAMMPS. I'm running on a single P100 node: export LMP_ROOT=~/lammps_master cd ~/lammps_master/src/USER-INTEL/TEST mpiexec -np 1 ~/lammps_master/src/lmp_kokkos_cuda_mpi -in in.intel.tersoff -k on g 1 -sf kk -pk kokkos comm device binsize 4.2 newton on neigh half -v m 0.2 |
As a final test, I commented out the diff --git a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp b/lib/kokkos/core/src/ index cae8ecd..079d9f0 100644 --- a/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp +++ b/lib/kokkos/core/src/Cuda/Kokkos_CudaExec.hpp @@ -164,7 +164,7 @@ static void cuda_parallel_launch_constant_memory() template< class DriverType, unsigned int maxTperB, unsigned int minBperSM > __global__ -__launch_bounds__(maxTperB, minBperSM) +//__launch_bounds__(maxTperB, minBperSM) static void cuda_parallel_launch_constant_memory() { const DriverType & driver = @@ -182,7 +182,7 @@ static void cuda_parallel_launch_local_memory( const DriverT template< class DriverType, unsigned int maxTperB, unsigned int minBperSM > __global__ -__launch_bounds__(maxTperB, minBperSM) +//__launch_bounds__(maxTperB, minBperSM) static void cuda_parallel_launch_local_memory( const DriverType driver ) { driver(); |
Thank you for generating proof / evidence for the suspected root cause. As soon as we (the whole Kokkos development team) return from travel is a high priority to fix. |
PR #1143 should fix this |
Set default CUDA launch bounds to <0,0> and when do not use CUDA __launch_bounds__ unless CUDA launch bounds are explicitly specified.
Using PR #1147 to fix. |
I'm running the Kokkos version of LAMMPS and am seeing a ~40% decrease in performance on GPUs when I switched from 2.03.05 (2017-05-27) to Kokkos 2.03.13 (2017-07-27). The performance regression persisted after I updated to version 2.04.00 (2017-08-16). Any known issues that could be related?
The text was updated successfully, but these errors were encountered: