Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Kokkos support for atom sorting on device #3740

Merged
merged 15 commits into from May 1, 2023

Conversation

stanmoore1
Copy link
Contributor

@stanmoore1 stanmoore1 commented Apr 19, 2023

Summary

Currently atom sorting is done on host CPU when using Kokkos. This is expensive since it must move all the data, and is performed in serial. This PR adds Kokkos support for atom sorting on device.

Related Issue(s)

None

Author(s)

Stan Moore (SNL)

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

Yes

@stanmoore1 stanmoore1 marked this pull request as draft April 19, 2023 19:24
@stanmoore1 stanmoore1 marked this pull request as ready for review April 19, 2023 20:24
@stanmoore1
Copy link
Contributor Author

For 8M LJ atoms running for 1000 timesteps on a single V100 GPU, this PR gives ~25% speedup when sorting every 100 timesteps with the new code vs the old code and default sorting every 1000 timesteps. Part of the speedup is due to a better sorting binsize, and the other part is due to sorting more often on device which improves cache access.

@stanmoore1
Copy link
Contributor Author

This gives a marginal speedup for 256k Rhodo benchmark, but only when using the old default binsize for sorting = 6 Angstrom instead of 12, otherwise it gives a slowdown. Not sure how to get a better heuristic for sorting binsize.

@stanmoore1
Copy link
Contributor Author

@weinbe2 @arghdos

@stanmoore1
Copy link
Contributor Author

Looks like I need to run performance tests to see if the majority of styles get a speedup or slowdown with the change in default sorting binsize.

@stanmoore1
Copy link
Contributor Author

3 regression tests failing on the GPU, will debug those too

@sjplimp
Copy link
Contributor

sjplimp commented Apr 19, 2023

@stanmoore1 My recollection is that the optimal sort frequency was also different on CPU vs GPU? Is that a default
setting that should be different between the two modes?

@stanmoore1
Copy link
Contributor Author

@sjplimp good point, we should consider changing the default from 1000 to 100, will do some tests.

@stanmoore1
Copy link
Contributor Author

Regressions appear to be false positives related to changing binsize affecting the pRNG in fix langevin and fix gle.

@stanmoore1
Copy link
Contributor Author

The sorting binsize change seems to help large LJ but hurts most other benchmarks like ReaxFF, Tersoff, EAM, and Rhodo so I will revert.

@stanmoore1
Copy link
Contributor Author

Looks like sorting every 100 helps LJ but slows down other benchmarks too.

@stanmoore1
Copy link
Contributor Author

This is ready to merge from my POV. Noted that LJ can get a speedup from using a different binsize and sorting every 100 timesteps, but this won't be the default behavior since it hurts other benchmarks.

@stanmoore1 stanmoore1 assigned akohlmey and unassigned stanmoore1 Apr 20, 2023
@stanmoore1 stanmoore1 requested a review from athomps April 20, 2023 20:31
@stanmoore1
Copy link
Contributor Author

Actually seeing a slowdown for ReaxFF, need to double check

@stanmoore1 stanmoore1 assigned stanmoore1 and unassigned akohlmey Apr 20, 2023
@sjplimp
Copy link
Contributor

sjplimp commented Apr 20, 2023

@stanmoore1 I don't understand if the atom_modify sort setttings affect this new on-GPU sorting. If they do, it seems like there should be info added to the atom_modify doc page to explain how to adjust these settings for both CPUs and GPUs, and also what the defaults are for both. Note that the atom_modify settings are for bin size and frequency.

@stanmoore1
Copy link
Contributor Author

@sjplimp no I reverted all the changes to the atom_modify sort settings since they cause performance regressions for some benchmarks, keeping the old defaults for now. So the only change in this PR is to sort atoms on the GPU instead of CPU.

@sjplimp
Copy link
Contributor

sjplimp commented Apr 20, 2023

@stanmoore1 Hi Stan - my point is that the atom_modify sort doc page has no discussion of CPU vs GPU, including for the defaults. If you expect the user to do something different for GPUs, then that should be explained, even if it is just guidance on what to try for GPUs. Are you saying the user should not do anything different in their input script for GPUs (for sorting), and that the code should now simply run faster (for sorting) b/c it is being done on the GPU? However you also seem to be saying that sometimes sorting on the GPU is slower than on the CPU? Which doesn't make that much sense to me.

@stanmoore1
Copy link
Contributor Author

stanmoore1 commented Apr 20, 2023

The atom_modify sort defaults are sortfreq = 1000 and binsize = cutneighmax/2.0. For the LJ benchmark on GPUs, using sortfreq = 100 and binsize = cutneighmax is better but that hurts performance of other benchmarks like Tersoff on the GPU so we can't make that the default for GPUs. I could mention this in the docs.

@sjplimp
Copy link
Contributor

sjplimp commented Apr 20, 2023

If you mean atom_modify sort, then yes, I think that is a good thing to do, and that is best place to do it.
You should probably also mention it on the Kokkos package page under 7.4 Accelerator packages in the manual.

@stanmoore1
Copy link
Contributor Author

@sjplimp will do

@akohlmey akohlmey added this to the Stable Release Summer 2023 milestone Apr 24, 2023
@stanmoore1
Copy link
Contributor Author

The performance regression I saw with ReaxFF is related to #3756.

@stanmoore1
Copy link
Contributor Author

stanmoore1 commented Apr 27, 2023

This PR is ready to merge from my POV, but should be merged after #3756 since I already merged those changes into this PR.

@stanmoore1 stanmoore1 assigned akohlmey and unassigned stanmoore1 Apr 27, 2023
@stanmoore1
Copy link
Contributor Author

Actually forgot to update the docs, will do that now.

@stanmoore1 stanmoore1 assigned stanmoore1 and unassigned akohlmey Apr 27, 2023
@stanmoore1 stanmoore1 assigned akohlmey and unassigned stanmoore1 Apr 27, 2023
Copy link
Contributor

@athomps athomps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve.

@akohlmey akohlmey merged commit 41a0196 into lammps:develop May 1, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants