Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for tiled decompositions in PPPM #2280

Merged
merged 50 commits into from
Aug 21, 2020
Merged

Support for tiled decompositions in PPPM #2280

merged 50 commits into from
Aug 21, 2020

Conversation

sjplimp
Copy link
Contributor

@sjplimp sjplimp commented Aug 5, 2020

Summary

Add support to PPPM for load-balanced decompositions produced by balance and fix balance rcb,
i.e. tiled decompositions. The majority of changes for this are in the GridComm class which now stores a regular or tiled decomposition of the FFT grids used by PPPM and does forward and reverse communication of the charge and efield values on those grids. This change should support both orthogonal and triclinic domains and all the options that the PPPM class supports. By the time it is ready to merge all variant PPPM classes, including accelerator versions, should be supported.

Related Issues

N/A

Author(s)

Steve

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

Non-tiled decompositions should work as before.

Implementation Notes

Post Submission Checklist

  • The feature or features in this pull request is complete
  • Licensing information is complete
  • Corresponding author information is complete
  • The source code follows the LAMMPS formatting guidelines
  • Suitable new documentation files and/or updates to the existing docs are included
  • The added/updated documentation is integrated and tested with the documentation build system
  • The feature has been verified to work with the conventional build system
  • The feature has been verified to work with the CMake based build system
  • Suitable tests have been added to the unittest tree.
  • A package specific README file has been included or updated
  • One or more example input decks are included

Further Information, Files, and Links

@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 5, 2020

@akohlmey @stanmoore1 As we discussed this PR initially has new PPPM2 and GridComm2 classes with the added support for tiling. The original PPPM and GridComm classes are unchanged.

@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 6, 2020

@akohlmey Generalized this so the GridComm class can be used by fixes or computes. Similar to how the Comm methods for forward/reverse comm can be called from pair, fix, compute styles. Maybe this would be better done with a single templated forward_comm() method in GridComm2 which was templated on a ptr to a KSpace, Fix, Compute instance ?

@akohlmey
Copy link
Member

akohlmey commented Aug 7, 2020

@sjplimp ran this with the unittest test programs and found no difference between when switching to comm_style tiled, but those are all serial tests, so I am not sure how relevant those are. What worries me a little bit is that changing from orthogonal to triclinic changes the forces even though all the tilt factors are 0. This is the same for pppm an pppm2 though.

@stanmoore1
Copy link
Contributor

What worries me a little bit is that changing from orthogonal to triclinic changes the forces even though all the tilt factors are 0. This is the same for pppm an pppm2 though.

Are you manually constraining the grid? I believe the logic for choosing the grid size is a little different for triclinic vs not.

@akohlmey
Copy link
Member

akohlmey commented Aug 7, 2020

What worries me a little bit is that changing from orthogonal to triclinic changes the forces even though all the tilt factors are 0. This is the same for pppm an pppm2 though.

Are you manually constraining the grid? I believe the logic for choosing the grid size is a little different for triclinic vs not.

nope. there are the test inputs and lists of reference forces/energies/stresses: https://github.com/lammps/lammps/blob/master/unittest/force-styles/tests/kspace-pppm.yaml and https://github.com/lammps/lammps/blob/master/unittest/force-styles/tests/kspace-pppm_tri.yaml

all kspace related commands are in the "post_commands" section. The tester feeds input fragments, files and settings from these files into a LAMMPS instance and then compares the forces, stresses and energies.

@stanmoore1
Copy link
Contributor

stanmoore1 commented Aug 7, 2020

You will get slightly different forces if triclinic picks say a 34x34x34 grid while orthogonal picks a 32x32x32 grid.

@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 11, 2020

@akohlmey @stanmoore1 Did this difference between orthogonal and triclinic for an orthog geometry get resolved? This was for both the original and this new PPPM2, so not an issue caused by this PR, correct? I am about ready to convert this PPPM2 to PPPM and propagate the changes to other files. Any reason to still wait on doing that?

@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 12, 2020

@akohlmey @stanmoore1 I believe I have made all the needed mods to all the PPPM variants, except Kokkos. I also removed the trial PPPM2 and GridComm2 classes. So now the changes are in the mainstream PPPM and GridComm class. I think more changes are needed in GridCommKokkos than in PPPMKokkos. MSM is more complicated. I think we should disable tiled with MSM for now, but it will take some changes to MSM to make it work with the new GridComm API. If you can test all the PPPM variants, w/out MSM in a good state, please do.

@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 12, 2020

@akohlmey @stanmoore1 ok, with Stan's advice I made MSM compatible with the new GridComm class. So everything but PPPM/Kokkos is now in place. Please test everything you can think of.

@akohlmey
Copy link
Member

@akohlmey @stanmoore1 ok, with Stan's advice I made MSM compatible with the new GridComm class. So everything but PPPM/Kokkos is now in place. Please test everything you can think of.

@sjplimp While I look into this, I need to you to look at PR #2284

@akohlmey
Copy link
Member

@sjplimp this branch current fails to compile kspace style msm/cg/omp.
Kspace style pppm/stagger produces incorrect forces.
It also was erroring out in the unit test for msm with a double free or corruption (!prev) message after MSM initialization
and for kspace style msm/cg it has a segmentation fault at the same location.

[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from PairStyle
[ RUN      ] PairStyle.plain

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff752d7c0 in free () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install brotli-1.0.7-6.fc31.x86_64 cyrus-sasl-lib-2.1.27-3.fc31.x86_64 fftw-libs-double-3.3.8-6.fc31.x86_64 hwloc-libs-2.0.4-1.fc31.x86_64 keyutils-libs-1.6-3.fc31.x86_64 krb5-libs-1.17-46.fc31.x86_64 libcom_err-1.45.5-1.fc31.x86_64 libcurl-7.66.0-2.fc31.x86_64 libgcc-9.3.1-2.fc31.x86_64 libgfortran-9.3.1-2.fc31.x86_64 libgomp-9.3.1-2.fc31.x86_64 libidn2-2.3.0-1.fc31.x86_64 libjpeg-turbo-2.0.2-5.fc31.x86_64 libnghttp2-1.41.0-1.fc31.x86_64 libpng-1.6.37-2.fc31.x86_64 libpsl-0.21.0-2.fc31.x86_64 libquadmath-9.3.1-2.fc31.x86_64 libselinux-2.9-5.fc31.x86_64 libssh-0.9.4-3.fc31.x86_64 libstdc++-9.3.1-2.fc31.x86_64 libtool-ltdl-2.4.6-31.fc31.x86_64 libxcrypt-4.4.16-3.fc31.x86_64 libyaml-0.2.2-2.fc31.x86_64 mpich-3.3.2-1.fc31.x86_64 openssl-libs-1.1.1g-1.fc31.x86_64 zlib-1.2.11-20.fc31.x86_64
(gdb) up
#1  0x0000000000b3a5dc in LAMMPS_NS::MSM::deallocate_levels (this=this@entry=0x257de80) at /home/akohlmey/compile/lammps/src/KSPACE/msm.cpp:858
858	  delete [] gc_buf1;
(gdb) 
#2  0x0000000000b4b91a in LAMMPS_NS::MSM::set_grid_global (this=this@entry=0x257de80) at /home/akohlmey/compile/lammps/src/KSPACE/msm.cpp:1120
1120	  deallocate_levels();

In both cases initializing the affected pointers to NULL in the constructor resolved those issues. This is done in commit bd79179 after that the unittest result is:

The following tests passed:
	KSpaceStyle:ewald
	KSpaceStyle:ewald_disp
	KSpaceStyle:ewald_nozforce
	KSpaceStyle:ewald_slab
	KSpaceStyle:msm
	KSpaceStyle:msm_cg
	KSpaceStyle:pppm
	KSpaceStyle:pppm_ad
	KSpaceStyle:pppm_cg
	KSpaceStyle:pppm_cg_ad
	KSpaceStyle:pppm_disp
	KSpaceStyle:pppm_disp_ad
	KSpaceStyle:pppm_disp_tip4p
	KSpaceStyle:pppm_nozforce
	KSpaceStyle:pppm_slab
	KSpaceStyle:pppm_tip4p
	KSpaceStyle:pppm_tip4p_ad
	KSpaceStyle:pppm_tip4p_nozforce
	KSpaceStyle:pppm_tip4p_slab
	KSpaceStyle:pppm_tri

95% tests passed, 1 tests failed out of 21

Total Test time (real) =  26.38 sec

The following tests FAILED:
	196 - KSpaceStyle:pppm_stagger (Failed)

So only this and the failed compilation of msm/cg/omp remains:

[  8%] Building CXX object CMakeFiles/lammps.dir/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp.o
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp: In member function ‘virtual void LAMMPS_NS::MSMCGOMP::compute(int, int)’:
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:100:5: error: ‘cg_peratom_all’ was not declared in this scope
  100 |     cg_peratom_all->ghost_notify();
      |     ^~~~~~~~~~~~~~
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:104:7: error: ‘cg_peratom’ was not declared in this scope; did you mean ‘direct_peratom’?
  104 |       cg_peratom[n]->ghost_notify();
      |       ^~~~~~~~~~
      |       direct_peratom
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:178:3: error: ‘cg_all’ was not declared in this scope; did you mean ‘gcall’?
  178 |   cg_all->reverse_comm(this,REVERSE_RHO);
      |   ^~~~~~
      |   gcall
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:186:5: error: ‘cg’ was not declared in this scope; did you mean ‘gc’?
  186 |     cg[n]->forward_comm(this,FORWARD_RHO);
      |     ^~
      |     gc
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:199:7: error: ‘cg’ was not declared in this scope; did you mean ‘gc’?
  199 |       cg[levels-1]->forward_comm(this,FORWARD_RHO);
      |       ^~
      |       gc
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:203:9: error: ‘cg_peratom’ was not declared in this scope; did you mean ‘direct_peratom’?
  203 |         cg_peratom[levels-1]->reverse_comm(this,REVERSE_AD_PERATOM);
      |         ^~~~~~~~~~
      |         direct_peratom
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:211:9: error: ‘cg_peratom’ was not declared in this scope; did you mean ‘direct_peratom’?
  211 |         cg_peratom[levels-1]->reverse_comm(this,REVERSE_AD_PERATOM);
      |         ^~~~~~~~~~
      |         direct_peratom
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:223:5: error: ‘cg’ was not declared in this scope; did you mean ‘gc’?
  223 |     cg[n]->reverse_comm(this,REVERSE_AD);
      |     ^~
      |     gc
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:228:7: error: ‘cg_peratom’ was not declared in this scope; did you mean ‘direct_peratom’?
  228 |       cg_peratom[n]->reverse_comm(this,REVERSE_AD_PERATOM);
      |       ^~~~~~~~~~
      |       direct_peratom
/home/akohlmey/compile/lammps/src/USER-OMP/msm_cg_omp.cpp:240:5: error: ‘cg_peratom_all’ was not declared in this scope
  240 |     cg_peratom_all->forward_comm(this,FORWARD_AD_PERATOM);
      |     ^~~~~~~~~~~~~~

@stanmoore1 stanmoore1 self-requested a review August 13, 2020 15:24
@ndtrung81
Copy link
Contributor

@akohlmey I suppose everything is good, and there's no need for my review at this point.

@akohlmey
Copy link
Member

@akohlmey I suppose everything is good, and there's no need for my review at this point.

yes, your review was auto-requested because you are listed as a code owner for one of the affected files in .github/CODEOWNERS, probably pppm/gpu.

@akohlmey
Copy link
Member

@sjplimp i've resolved the merge conflict and added a couple of unit tests for kspace with comm_style tiled (won't really test the communication patterns, but will help us detect when changes have side effects). I will now trigger the full regression tests one more time. Anything else that needs to be done before this is ready to merge?

@akohlmey akohlmey added test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests labels Aug 20, 2020
@stanmoore1
Copy link
Contributor

We figured out how to get correct forces when flipping FFT forward/reverse (that @sjplimp tried to add). Do we just want to make that a new PR and add that later?

@akohlmey
Copy link
Member

We figured out how to get correct forces when flipping FFT forward/reverse (that @sjplimp tried to add). Do we just want to make that a new PR and add that later?

Yes. You could make a feature branch from this branch and then submit this as PR. If you compile with cmake and enable testing, you can then test the kspace styles selectively with ctest -R KSpace.

@akohlmey akohlmey removed test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests labels Aug 20, 2020
@sjplimp
Copy link
Contributor Author

sjplimp commented Aug 20, 2020

@akohlmey @stanmoore1 I'm a bit lost in the git lingo. If you're saying you want to (a) drop any FFT changes from this PR and test/merge it now, and then (b) do another PR later to make the FFT and PPPM changes, then that's fine by me.

@akohlmey
Copy link
Member

@akohlmey @stanmoore1 I'm a bit lost in the git lingo. If you're saying you want to (a) drop any FFT changes from this PR and test/merge it now, and then (b) do another PR later to make the FFT and PPPM changes, then that's fine by me.

yes. All I was saying is that it is possible to create the branch with git checkout -b fft-feature gridcomm-tiled, so that you have all the changes from here already included instead of starting from master.

I am currently trying to identify some issue that was flagged by one of the load balancing regression tests but cannot (yet) reproduce it.

@akohlmey akohlmey added test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests and removed test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests labels Aug 20, 2020
@akohlmey akohlmey added test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests labels Aug 21, 2020
@akohlmey akohlmey merged commit 4c46119 into master Aug 21, 2020
@akohlmey akohlmey deleted the gridcomm-tiled branch August 21, 2020 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement kspace_package ready_for_merge test_for_regression Enable to trigger regression tests test_runs Enable to trigger run tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants