New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Kokkos support for atom sorting on device #3740
Conversation
For 8M LJ atoms running for 1000 timesteps on a single V100 GPU, this PR gives ~25% speedup when sorting every 100 timesteps with the new code vs the old code and default sorting every 1000 timesteps. Part of the speedup is due to a better sorting binsize, and the other part is due to sorting more often on device which improves cache access. |
This gives a marginal speedup for 256k Rhodo benchmark, but only when using the old default binsize for sorting = 6 Angstrom instead of 12, otherwise it gives a slowdown. Not sure how to get a better heuristic for sorting binsize. |
@weinbe2 @arghdos |
Looks like I need to run performance tests to see if the majority of styles get a speedup or slowdown with the change in default sorting binsize. |
3 regression tests failing on the GPU, will debug those too |
@stanmoore1 My recollection is that the optimal sort frequency was also different on CPU vs GPU? Is that a default |
@sjplimp good point, we should consider changing the default from 1000 to 100, will do some tests. |
Regressions appear to be false positives related to changing binsize affecting the pRNG in |
The sorting binsize change seems to help large LJ but hurts most other benchmarks like ReaxFF, Tersoff, EAM, and Rhodo so I will revert. |
Looks like sorting every 100 helps LJ but slows down other benchmarks too. |
This is ready to merge from my POV. Noted that LJ can get a speedup from using a different binsize and sorting every 100 timesteps, but this won't be the default behavior since it hurts other benchmarks. |
Actually seeing a slowdown for ReaxFF, need to double check |
@stanmoore1 I don't understand if the atom_modify sort setttings affect this new on-GPU sorting. If they do, it seems like there should be info added to the atom_modify doc page to explain how to adjust these settings for both CPUs and GPUs, and also what the defaults are for both. Note that the atom_modify settings are for bin size and frequency. |
@sjplimp no I reverted all the changes to the atom_modify sort settings since they cause performance regressions for some benchmarks, keeping the old defaults for now. So the only change in this PR is to sort atoms on the GPU instead of CPU. |
@stanmoore1 Hi Stan - my point is that the atom_modify sort doc page has no discussion of CPU vs GPU, including for the defaults. If you expect the user to do something different for GPUs, then that should be explained, even if it is just guidance on what to try for GPUs. Are you saying the user should not do anything different in their input script for GPUs (for sorting), and that the code should now simply run faster (for sorting) b/c it is being done on the GPU? However you also seem to be saying that sometimes sorting on the GPU is slower than on the CPU? Which doesn't make that much sense to me. |
The |
If you mean atom_modify sort, then yes, I think that is a good thing to do, and that is best place to do it. |
@sjplimp will do |
The performance regression I saw with ReaxFF is related to #3756. |
This PR is ready to merge from my POV, but should be merged after #3756 since I already merged those changes into this PR. |
Actually forgot to update the docs, will do that now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve.
Summary
Currently atom sorting is done on host CPU when using Kokkos. This is expensive since it must move all the data, and is performed in serial. This PR adds Kokkos support for atom sorting on device.
Related Issue(s)
None
Author(s)
Stan Moore (SNL)
Licensing
By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).
Backward Compatibility
Yes