GPU support for lj/cut/tip4p/long pair style #1776

Merged: 12 commits merged into lammps:master on Jan 8, 2020

Conversation

@Vsevak (Contributor) commented Nov 18, 2019

Summary

This PR adds a GPU-accelerated version of the lj/cut/tip4p/long pair style, implemented as part of the GPU package.

Authors

Vsevolod Nikolskiy (thevsevak@gmail.com, @Vsevak), Vladimir Stegailov (v.stegailov@hse.ru, @vvsteg)
International Laboratory for Supercomputer Atomistic Modelling and Multi-scale Analysis
Higher School of Economics, Moscow, Russia

Licensing

By submitting this pull request, I agree that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

The pull request will not break backward compatibility.

Implementation Notes

This implementation is inspired by the original lj/cut/tip4p/long from the KSPACE package, but adapting it to full neighbor lists made the algorithm significantly more involved. The algorithm is divided into two GPU kernels.

  1. The first kernel does most of the work and computes the forces applied to local real atoms and to the virtual charges M. Since the TIP4P code needs atom IDs to find the indices of the O and H atoms within a water molecule, this kernel takes the tag, sametag and map pointers as arguments. Domain::closest_image and a simplified Atom::map have been reimplemented for this kernel. The kernel also handles the case where the corresponding oxygen is not a local atom.

  2. The second kernel distributes the force, energy and virial accumulated on each virtual charge M back to the real atoms of its molecule (see the sketch after this list).
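
A minimal CUDA-style sketch of what these two steps amount to is given below. All names, the data layout, and the single precision are assumptions made for illustration; the PR's actual kernels are written in the GPU package's portable CUDA/OpenCL dialect and additionally handle neighbor lists, ghost atoms, energy/virial tallies and mixed/double precision.

```cuda
// Hypothetical sketch, not the PR's kernels.
// Step 1 (part of the heavy kernel): place the virtual charge M a fixed
// fraction alpha of the way from O towards the midpoint of the two hydrogens;
// the Coulomb interactions of the water molecule act on M instead of O.
__device__ float3 m_site(float3 xO, float3 xH1, float3 xH2, float alpha)
{
  float3 xM;
  xM.x = xO.x + alpha * (0.5f * (xH1.x + xH2.x) - xO.x);
  xM.y = xO.y + alpha * (0.5f * (xH1.y + xH2.y) - xO.y);
  xM.z = xO.z + alpha * (0.5f * (xH1.z + xH2.z) - xO.z);
  return xM;
}

// Step 2 (second kernel): project the force accumulated on each M site back
// onto the real O and H atoms of that molecule.
__global__ void distribute_m_force(const float3 *fM, const int3 *owner,
                                   float alpha, int nwater, float3 *f)
{
  int w = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per water molecule
  if (w >= nwater) return;
  int iO = owner[w].x, iH1 = owner[w].y, iH2 = owner[w].z;
  float cO = 1.0f - alpha, cH = 0.5f * alpha;
  // A real kernel needs atomics (or a conflict-free layout) for these updates.
  f[iO].x  += cO * fM[w].x;  f[iO].y  += cO * fM[w].y;  f[iO].z  += cO * fM[w].z;
  f[iH1].x += cH * fM[w].x;  f[iH1].y += cH * fM[w].y;  f[iH1].z += cH * fM[w].z;
  f[iH2].x += cH * fM[w].x;  f[iH2].y += cH * fM[w].y;  f[iH2].z += cH * fM[w].z;
}
```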

This code has been verified against the original lj/cut/tip4p/long with both the CUDA and OpenCL backends (and also the new ROCm HIP backend, #1590) on a GTX 1070, a GTX 950M and a Radeon VII. A sample input is included as examples/water. With DOUBLE_DOUBLE precision, forces, per-atom energies and virials match the CPU version to machine epsilon. The GPUs accelerate the Pair computation by about 7x relative to 8 cores of a Xeon processor, which corresponds to roughly a 60% shorter time-to-solution.

The changes to atom.h do not affect any features; they only add a getter for map_maxarray.

Known issues and future work

The current version sometimes produces inaccurate results on Nvidia Volta with CUDA 10.0 and driver 410.104. This may be due to a driver issue or to the Independent Thread Scheduling introduced with sm_70. In the latter case, the synchronization will need to be refined for compute capabilities >= 70. The problem was not observed with sm_61.
The current version requires an atom map array.
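
For context, a minimal illustration of why Independent Thread Scheduling matters is shown below. It is an assumption about the likely class of problem, not code from this PR: on sm_70 and newer, threads of a warp no longer execute in lock-step, so warp-level data exchange must use the explicit *_sync primitives (and __syncwarp() where re-convergence is assumed) instead of relying on implicit warp-synchronous behavior.

```cuda
// Hypothetical illustration, not code from this PR: a warp-level reduction
// written for Volta and newer. The *_sync shuffle variants carry an explicit
// participation mask; legacy warp-synchronous code that omitted it could work
// on sm_61 but can give non-deterministic results under Independent Thread
// Scheduling on sm_70+.
__device__ float warp_reduce_sum(float v)
{
  const unsigned full_mask = 0xffffffffu;
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(full_mask, v, offset);
  return v;   // lane 0 now holds the sum over the warp
}
```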

Post Submission Checklist

  • The feature or features in this pull request are complete
  • Licensing information is complete
  • Corresponding author information is complete
  • The source code follows the LAMMPS formatting guidelines
  • Suitable new documentation files and/or updates to the existing docs are included
  • The added/updated documentation is integrated and tested with the documentation build system
  • The feature has been verified to work with the conventional build system
  • The feature has been verified to work with the CMake based build system
  • A package specific README file has been included or updated
  • One or more example input decks are included
Vsevak added 5 commits Nov 9, 2019
Source code, Makefiles and Install script for the GPU-accelerated TIP4P pair style.
It is implemented as part of the standard GPU package.
The style is compatible with the standard lj/cut/tip4p/long.
This commit also modifies atom.h, just to
add a getter for the 'max_same' variable.
@ndtrung81 (Contributor) commented Nov 18, 2019

@Vsevak @vvsteg thanks for working on this pair style. I approve the PR. I have a couple of questions:

  1. Is it possible to separate the computation of hneigh into another kernel? I expect that this new kernel, which would be called only after the neighbor list is updated, would remove the if checks inside the for loop of the heavy-load kernel.

  2. What are the occasionally incorrect results: forces, energies, and/or virials? Are the incorrect results deterministic? Which kernel(s) do you think might be responsible for them?

@Vsevak (Contributor, Author) commented Nov 18, 2019

@ndtrung81 Thank you!

  1. I think it is possible. My first approach was to leave the computation of hneigh on the CPU side; it worked but was rather slow, so I returned to the original tip4p/long flow. It now makes sense to separate the computation of hneigh into another GPU kernel, and I will try it out.

  2. The incorrect results on Volta are non-deterministic and affect all values. As I understand it, the Coulomb part of the first kernel is responsible. I suspect an influence of Independent Thread Scheduling, so splitting the first kernel may simplify the control flow and resolve the issue.

rbberger self-assigned this on Nov 26, 2019
Vsevak added 4 commits Nov 27, 2019
According to the Khronos OpenCL docs, "The C99 standard headers <...>, float.h, <...> are not available and cannot be included by a program" (see the sketch after these commit notes).
Simplify the main GPU kernel and add another kernel, 'k_pair_reneigh'. It works well on a GTX 1070 (Pascal), but there is still a problem with non-deterministic results on Volta.

I reimplemented the BaseCharge::compute methods in the child class LJ_TIP4PLong to correctly embed the new kernel in the code.

This commit also includes some code-style fixes.
This eliminates the need for a thread fence and makes the calculation stable on a GTX 1070 (CUDA and OpenCL) and a Titan V.
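
On the OpenCL header restriction quoted in the first commit note above: a common portable pattern is to gate the include on the backend and provide the needed constant directly for OpenCL. The snippet below only illustrates the issue; it is an assumption, not necessarily the fix used in the commit.

```cuda
/* Hypothetical illustration, not the commit's actual change: OpenCL C cannot
   include float.h, so a needed constant is defined manually for that backend. */
#if defined(__OPENCL_VERSION__)
  #define MY_FLT_EPSILON 1.19209290e-07f   /* value of FLT_EPSILON for 32-bit floats */
#else
  #include <float.h>
  #define MY_FLT_EPSILON FLT_EPSILON
#endif
```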
rbberger assigned sjplimp and unassigned himself on Dec 10, 2019
@rbberger (Member) commented Dec 10, 2019

@sjplimp please approve the change to atom.h

@Vsevak (Contributor, Author) commented Dec 10, 2019

@ndtrung81 I have separated the computations of 'hneigh' and the new M site into separate kernels, so the heavy-load kernel is now much simpler. I have done some testing on a GTX 1070 (CUDA and OpenCL) and a Titan V, and I did not find any signs of non-deterministic behavior or incorrect results in any configuration. However, I have not done very thorough testing yet, as I am currently traveling.
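
A rough sketch of what the separated bookkeeping kernel can look like is shown below. All names, types and the kernel layout are assumptions made for illustration (only 'hneigh', the map array and 'k_pair_reneigh' are mentioned in the PR); the real kernels are part of the GPU package sources in this PR.

```cuda
// Hypothetical sketch, not the PR's code: resolve the local indices of the two
// hydrogens belonging to each water oxygen once per reneighboring and cache
// them in 'hneigh', so the heavy force kernel can skip the repeated tag/map
// lookups and the associated branches in its inner loop.
typedef long long tagint_t;   // stand-in for LAMMPS tagint (64-bit with -DLAMMPS_BIGBIG)

__global__ void k_fill_hneigh_sketch(const int *type, const tagint_t *tag,
                                     const int *map, int typeO,
                                     int *hneigh, int nlocal)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per local atom
  if (i >= nlocal || type[i] != typeO) return;
  // TIP4P convention in LAMMPS: the two H atoms carry the IDs that directly
  // follow the O atom's ID, so the global-to-local 'map' array resolves them.
  hneigh[2 * i]     = map[tag[i] + 1];
  hneigh[2 * i + 1] = map[tag[i] + 2];
}
```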

@ndtrung81 (Contributor) commented Dec 11, 2019

@Vsevak glad to hear that. Thanks for improving this new feature!

@akohlmey (Member) commented Dec 17, 2019

@Vsevak thanks for your contribution. Just one very minor nitpick: your example input deck should either be removed or moved to a location that clearly indicates it is meant for comparing and debugging GPU-accelerated TIP4P water. Placing it in a folder examples/water would suggest it is a demonstration of (multiple) water potentials. We also have the convention that generic, simple and commonly relevant inputs live in folders with lower-case names, while more specific and complex inputs live in folders with upper-case names. In addition, the file naming convention does not seem to be followed, and there are plenty of commented-out commands. If different things should be shown, it would be better to have multiple inputs (each named in.something, to show that "something") along with reference outputs, at least for the CPU versions.

@Vsevak (Contributor, Author) commented Dec 19, 2019

@akohlmey thank you. I propose moving the input to examples/TIP4P/gpu_dump. It can be used to demonstrate that all coordinates and forces agree during the first steps when computed with the original pair style and with the GPU algorithm. I also found the data.spce file in some other examples, so I propose storing it as a symlink to avoid duplication. I have done the renaming and added a sample output in the commit.

@akohlmey (Member) commented Jan 6, 2020

@sjplimp this is still assigned to you, so you can check and approve the change in atom.h and assign the PR to me for merging (should not be a problem, but this is a core header, so you have to explicitly approve changes to functionality).

Such data transfer is performed at each timestep, so it does not belong in the initialization.
sjplimp approved these changes on Jan 8, 2020
@akohlmey (Member) commented Jan 8, 2020

@ndtrung81 can you quickly check whether this needs similar changes as in #1752? I would like to include both PRs in the patch release planned for tomorrow. Since the memory leak is a minor issue, it would be no big deal if it is not addressed right away; I just want to make certain we are aware of it.

akohlmey assigned himself and unassigned sjplimp on Jan 8, 2020
@ndtrung81 (Contributor) commented Jan 8, 2020

@akohlmey the changes in #1752 are made to the base classes only, so we do not need them incorporated in this PR.

@akohlmey (Member) commented Jan 8, 2020

> @akohlmey the changes in #1752 are made to the base classes only, so we do not need them incorporated in this PR.

Excellent! Thanks for the fast reaction.

akohlmey merged commit 481c647 into lammps:master on Jan 8, 2020
6 checks passed:
lammps/pull-requests/cmake/cmake-kokkos-omp-pr (head run ended)
lammps/pull-requests/cmake/cmake-serial-pr (head run ended)
lammps/pull-requests/kokkos-omp-pr (head run ended)
lammps/pull-requests/openmpi-pr (head run ended)
lammps/pull-requests/serial-pr (head run ended)
lammps/pull-requests/shlib-pr (head run ended)
@Vsevak (Contributor, Author) commented Jan 8, 2020

@akohlmey Thank you for accepting this PR.
I presented a talk about this GPU algorithm for TIP4P at the International Conference on Parallel Computing (ParCo, Prague, Czech Republic, 10-13 September 2019). The corresponding article has been accepted for publication in the conference proceedings. May I add this reference to the log.cite output? Vsevak@5615a39
0001-Add-a-reference-to-the-log.cite.patch.txt
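
For readers unfamiliar with the log.cite mechanism: LAMMPS styles register such references through the CiteMe helper. The snippet below is only a minimal sketch of that pattern; the function name and the abridged citation string are placeholders, and the actual change is in the commit and patch referenced above.

```cuda
// Minimal sketch of how a LAMMPS style typically registers a log.cite entry.
// The function name and the citation text are placeholders for illustration.
#include "lammps.h"
#include "citeme.h"

using namespace LAMMPS_NS;

static const char cite_gpu_tip4p[] =
  "GPU-accelerated TIP4P pair style: (BibTeX entry as in the attached patch)\n\n";

void register_tip4p_gpu_citation(LAMMPS *lmp)
{
  // CiteMe collects these strings and writes them to log.cite at the end of a run
  if (lmp->citeme) lmp->citeme->add(cite_gpu_tip4p);
}
```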

@akohlmey (Member) commented Jan 8, 2020

> @akohlmey I presented a talk about this GPU algorithm for TIP4P at the International Conference on Parallel Computing (ParCo, Prague, Czech Republic, 10-13 September 2019). The corresponding article has been accepted for publication in the conference proceedings. May I add this reference to the log.cite output?
> 0001-Add-a-reference-to-the-log.cite.patch.txt

Yes. I just added the attached patch to PR #1817.

@akohlmey (Member) commented Jan 9, 2020

@Vsevak FYI, your code was not compatible with building with -DLAMMPS_BIGBIG. You had been using int * in several places where you needed to use tagint *. This does not show up with the other settings, because there tagint is 32-bit, but with -DLAMMPS_BIGBIG it is 64-bit and those pointer assignments break compilation. I corrected what the failing compilation revealed, but that does not catch cases where int is used instead of tagint, so I suggest you carefully review everything and, if possible, run tests on some large systems with more than 2 billion particles to check that everything is correct, or otherwise submit a PR with bugfixes.
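
To make the point concrete, a minimal sketch of the kind of change implied here is shown below. The helper function is hypothetical; the tagint type and the Atom::tag member are as declared in the LAMMPS headers.

```cuda
// Sketch of the fix described above (not the exact diff): atom IDs must be
// stored in tagint, whose width follows the LAMMPS build settings.
#include "lmptype.h"   // LAMMPS_NS::tagint: int by default, int64_t with -DLAMMPS_BIGBIG
#include "atom.h"

using namespace LAMMPS_NS;

void fetch_tags(Atom *atom)          // hypothetical helper for illustration
{
  // int *tag = atom->tag;           // breaks compilation with -DLAMMPS_BIGBIG
  tagint *tag = atom->tag;           // correct: matches the declared type of Atom::tag
  (void) tag;                        // silence unused-variable warnings in this sketch
}
```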

Vsevak mentioned this pull request on Jan 20, 2020 (8 of 8 tasks complete)