[BUG] Using pair style hybrid with GPU package and neighbor lists on GPU #2621
Comments
@akohlmey I found a flaw in the current implementation (my bad): it misses skipping the pair types for the sub-styles in pair hybrid. Maybe that is the reason why pair hybrid* was not supported with neighbor list builds on the GPU, which is similar to why "neigh_modify exclude" is not supported (yet) for GPU neighbor builds: the current neighbor list build does not have a way to explicitly skip/exclude pair types. Maybe we need a separate code path or a separate neighbor build kernel for pair hybrid and "neigh_modify exclude", so as not to interfere with the performance of the other existing cases. I will take a closer look at this in the coming weeks. @wmbrownIntel do you have a suggestion?
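For illustration, the per-candidate-pair test such a skip-aware build would need could look like the sketch below; the flattened skip table and its name are hypothetical, modeled on the ijskip info carried by the neighbor request:

```cpp
// Hypothetical helper for a skip-aware GPU neighbor build: before adding
// atom j to atom i's list, consult a per-type-pair skip table (a flattened
// device copy of the request's ijskip; LAMMPS types are 1-based, hence the
// ntypes+1 stride).
inline bool skip_type_pair(const int *ijskip_flat, int ntypes,
                           int itype, int jtype) {
  return ijskip_flat[itype * (ntypes + 1) + jtype] != 0;
}
```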
All of my tests with hybrid use 'neigh no', and I honestly don't remember how we have supported this. My recommendation is to test with hybrid overlay to see if it works correctly with GPU neighbor builds. If not, check for the combination of GPU build and hybrid in FixGPU::init and error out. If so, check for the combination of GPU build and neighbor->requests[i]->skip and error out. Sorry I can't give this more attention at the moment; I have to be focused on another project this week...
@wmbrownIntel thanks for your suggestion. I have refactored the code to support GPU neighbor builds for pair hybrid in PR #1430. The issue has been tracked down to the current implementation of eam/gpu, where rho and fp are computed for all pair types but should be computed only for the unskipped types, as done in the CPU version. I will work on the bugfix for this in a separate PR. Other pairwise styles are supposed to work correctly, since cutsq[itype][jtype] is set to zero for skipped pair types with setflag[itype][jtype] == 0. For 3-body styles, "* *" coefficients are enforced, so they should work correctly, too.
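To illustrate why the zeroed cutoff is sufficient for most pairwise styles: the standard distance test in the force kernel can then never pass, so skipped pairs contribute nothing even when they appear in the neighbor list. A minimal sketch:

```cpp
// The standard LAMMPS pairwise cutoff test: with cutsq[itype][jtype] set to
// 0.0 for skipped type pairs (setflag[itype][jtype] == 0), rsq < cutsq is
// never true, so no force is computed for those pairs.
static inline bool pair_in_range(double rsq, double cutsq_ij) {
  return rsq < cutsq_ij;  // always false when cutsq_ij == 0.0, since rsq >= 0
}
```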
The test input was re-checked with commit fff89135fb and it passes.
Correction: it passes with 1 MPI rank but fails with 2 MPI ranks.
@akohlmey FYI, I revisited the issue with the given input script and found that with the current HEAD (commit b44e353) the run completes successfully with "neigh yes" on 2 and 4 MPI procs. This is with the CUDA build via CMake; the OpenCL build fails with multiple MPI procs. I only turned off the error-out line in fix_gpu.cpp as below for this test:
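The disabled guard is of roughly this form (a sketch only; the exact condition and error message in fix_gpu.cpp are assumptions):

```cpp
// Sketch of the kind of check in FixGPU::init() (src/GPU/fix_gpu.cpp) that
// rejects pair hybrid combined with GPU neighbor-list builds; disabling it
// for the test means commenting out these lines.
if (force->pair_match("hybrid", 0) != nullptr && _gpu_mode != GPU_FORCE)
  error->all(FLERR, "Cannot use pair hybrid with GPU neighbor list builds");
```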
Could you reproduce the crash with the OpenCL build with your branch? |
When I run this input with the suggested change with CUDA and 4 MPI ranks on 1 GPU, it crashes in the second run with
Similar for OpenCL:
I see similar behavior with the PR #3485 branch for some inputs even without hybrid. E.g., with an input file:
in the bench folder.
It may be related to having more than one style used with GPU acceleration. I can "fix" the issue with multiple rhodo bench runs |
@akohlmey For some reason I still cannot reproduce the crash with CUDA with 4 MPI ranks on my end (CUDA Toolkit 11.6, Fedora 35, NVIDIA P100 GPU), with the recent commit 05aca2b in the develop branch. I can reproduce the crash with OpenCL. FYI, the runtime errors you got at geryon/nvd_timer.h:76 and geryon/ocl_timer.h:118 are both in the UCL_Timer::sync_stop() function. The difference is that the CUDA backend calls cuEventSynchronize() with start_event, whereas the OpenCL backend calls clWaitForEvents() with stop_event.

A little update on the debugging progress: I generated a restart file from the CPU run with the input script given here, so as to have a well-behaved configuration. Then I read the restart file in another input script, run_from_restart.eam.txt. This script runs a single time step and writes out a dump file with atom forces. On my end, both OpenCL and CUDA on 4 MPI ranks produce thermo output and atom forces consistent with the 4-MPI CPU run, with both "neigh yes" and "neigh no". It looks to me like the neighbor list build and force kernels work properly with "neigh yes" for pair hybrid. The crash with OpenCL on my end might come from the event sync with UCL_Timer in ocl_timer.h, but I am not sure how to debug it further after this revisit.

The issue you're seeing with repeated runs and clear, even without pair hybrid, suggests other root causes not relevant to the present issue. The line reported, ocl_kernel.h:261, is where clSetKernelArg() is called inside add_arg(). Error -36 indicates an invalid command queue (cl.h:164), so it looks like the command queue where the kernel is enqueued is invalid at that point. Any comments and suggestions will be highly appreciated.
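For reference, the backend difference boils down to something like this (a simplified sketch, not the verbatim geryon code; the backend-selection macro is a stand-in for however geryon picks the backend at build time):

```cpp
// Simplified sketch of UCL_Timer::sync_stop() in the two backends
// (cf. lib/gpu/geryon/nvd_timer.h and lib/gpu/geryon/ocl_timer.h).
#ifdef GERYON_USE_CUDA
#include <cuda.h>
inline void sync_stop(CUevent start_event, CUevent /*stop_event*/) {
  cuEventSynchronize(start_event);   // CUDA backend waits on the *start* event
}
#else
#include <CL/cl.h>
inline void sync_stop(cl_event /*start_event*/, cl_event stop_event) {
  clWaitForEvents(1, &stop_event);   // OpenCL backend waits on the *stop* event
}
#endif
```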
@ndtrung81 I've made some more experiments. There is something fishy going on when there are multiple kernels active, and that is exposed most easily when you have multiple run commands. It could very well be in the GPU neighbor list code. So I can "fix" the multiple rhodo runs case by adding

I don't really want to enable the GPU neighbor list for hybrid unless that is resolved. Unfortunately, this is beyond my skills and experience with GPU programming (I never spent much time on it anyway).
@akohlmey I think I fixed a bug in
Can you double-check whether that is the case in your tests? The original issue with
@ndtrung81 yes, this fixes the crash for running the rhodo test case multiple times with clear in between on CUDA. OpenCL seems to be unaffected. |
@akohlmey I have tried the changes below with the original input script in this bug report, and it seems to resolve the crash with

Note to self: when multiple GPU neighbor builds are requested, the Neighbor instances actually share the same Atom object, and using async = true here leads to corrupted particle IDs on the device. We need to switch to async = false only for this particular setting, so as not to affect the more popular use case (only one GPU neighbor build request).
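A minimal sketch of that distinction using geryon's ucl_copy(); the counter and buffer names are hypothetical:

```cpp
// With a single GPU neighbor-build request an asynchronous host->device copy
// can overlap with compute; with several requests sharing one Atom object it
// can race with the next list build and corrupt the particle IDs, so fall
// back to a blocking copy in that case.
const bool async = (num_gpu_neigh_requests == 1);  // hypothetical counter
ucl_copy(dev_particle_id, host_particle_id, async);  // geryon copy routine
```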
@ndtrung81 I see some small improvements, but still have crashes with CUDA and GPU neighbor lists. There also seem to be significant differences in the forces: even with double precision, GPU and CPU forces diverge somewhere, since energies are identical on step 0 but differ from step 1 onward. The run with the neighbor lists on the GPU fails during the second run command with a lost-atoms error.
Further improvements have been difficult to achieve. @ndtrung81 and @akohlmey are happy with what has been fixed so far, so we decided to close this issue and wait for a new bug report.
Hi all, I am currently trying to use the GPU package with a hybrid/overlay potential and am encountering a similar issue. Given that I am using a custom potential, what would be the best way for me to provide you with a bug report? Shall I just fork LAMMPS and make my changes there? Thank you very much for your help!
@pw0908 there is nothing more to report. We have tried very hard to sort this out but did not succeed. Hence this issue is closed and won't be reopened. All you have to do is compute the neighbor lists on the CPU via the package command or the -pk command line flag.
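For example (assuming the executable is called lmp; the input file name is a placeholder):

```
# on the command line:
lmp -sf gpu -pk gpu 0 neigh no -in in.hybrid

# or equivalently inside the input script:
package gpu 0 neigh no
suffix gpu
```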
... or, instead of using pair style hybrid/overlay, you could change your custom pair style to include lj/cut and coul/long in addition to your custom terms. That would be faster than hybrid/overlay even if you could use GPU neighbor lists.
I see. I'll give that a shot. Thanks! Just out of curiosity: I was thinking of opening a PR to merge this implementation into LAMMPS. Would there be much point in also providing the implementation of lj/cut/coul/long/custom? This potential is just an approximate way of modeling ion-dipole interactions in coarse-grained systems; you would normally use it in conjunction with lj/cut and coul/long anyway.
Probably more than for the version that requires using hybrid/overlay.
@pw0908 you can derive your custom pair style classes from those for lj/cut/coul/long in lib/gpu (LJCoulLong) and src/GPU (PairLJCutCoulLongGPU). It will be faster than hybrid/overlay with neighbor builds on the host. Your PR will be very much welcome. |
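A skeletal sketch of that approach; the custom class, style, and file names are hypothetical, and only the base class and the PairStyle registration pattern follow the existing GPU package sources:

```cpp
// Hypothetical header src/GPU/pair_lj_cut_coul_long_custom_gpu.h: a custom
// pair style that reuses the lj/cut/coul/long GPU machinery by inheritance
// instead of combining styles via hybrid/overlay.
#ifdef PAIR_CLASS
// clang-format off
PairStyle(lj/cut/coul/long/custom/gpu, PairLJCutCoulLongCustomGPU);
// clang-format on
#else

#ifndef LMP_PAIR_LJ_CUT_COUL_LONG_CUSTOM_GPU_H
#define LMP_PAIR_LJ_CUT_COUL_LONG_CUSTOM_GPU_H

#include "pair_lj_cut_coul_long_gpu.h"

namespace LAMMPS_NS {

class PairLJCutCoulLongCustomGPU : public PairLJCutCoulLongGPU {
 public:
  PairLJCutCoulLongCustomGPU(class LAMMPS *lmp) : PairLJCutCoulLongGPU(lmp) {}
  // Override only what differs for the custom ion-dipole term, e.g.
  // settings(), coeff(), and the kernel invocation in compute(); neighbor
  // handling and device buffer management stay in the base class.
};

}    // namespace LAMMPS_NS
#endif
#endif
```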
Summary
There appears to be a problem using neighbor lists on the GPU with pair style hybrid inputs. The following input does not work reliably with just "-sf gpu" added, but using "-pk gpu 0 neigh no" avoids the crashes and corrupted data.
This is supposed to work since PR #1430. Below is an abbreviated version of an input posted on lammps-users.
LAMMPS Version and Platform
Current LAMMPS development head revision ce4dc4e
Expected Behavior
Input will run with "-sf gpu" using 1 or multiple MPI processes without having to disable neighbor lists on the GPU.
Actual Behavior
LAMMPS reports "nan" or incorrect results, or crashes with different errors or warnings. (CUDA has device failures; OpenCL asks to boost neigh_one when run on 1 CPU but can work when setting binsize, and fails with more MPI ranks.)
Steps to Reproduce
Use the following input: