
[BUG] Using pair style hybrid with GPU package and neighbor lists on GPU #2621

Closed
akohlmey opened this issue Feb 23, 2021 · 20 comments

@akohlmey (Member)

Summary

There appears to be a problem with using neighbor lists on the GPU for pair style hybrid inputs. The following input does not work reliably with just "-sf gpu" added, but using "-pk gpu 0 neigh no" avoids the crashes and corrupted data. This is supposed to work since PR #1430. Below is an abbreviated version of an input posted on lammps-users.

LAMMPS Version and Platform

Current LAMMPS development head revision ce4dc4e

Expected Behavior

The input should run with "-sf gpu" using one or multiple MPI processes, without having to disable neighbor lists on the GPU.

Actual Behavior

LAMMPS reports "nan" or incorrect results, or crashes with various errors or warnings. (CUDA has device failures; OpenCL asks to boost neigh_one when run on 1 CPU but can work when binsize is set, yet fails with more MPI ranks.)

Steps to Reproduce

Use the following input (a typical launch command line is sketched after the script):

units        metal
dimension    3
boundary    p p p
atom_style    atomic
variable Ar_mass    equal 39.95                        # [aem]
variable Au_mass    equal 196.97                        # [aem]
variable Ar_eps        equal 0.0103                        # epsilon Ar-Ar, [eV]
variable Ar_sigma    equal 3.4                        # sigma Ar-Ar, [A]
variable Ar_crit    equal 3.5*${Ar_sigma}                    # cut-off distance, [A]
variable rho        equal 1.41                        # density [g/cm^3]

# simulated box
region     simbox     block     0 50     0 50     0 50     side in units box
create_box    2     simbox

########################
# Version #1  w/o hybrid
########################
#pair_style    lj/cut ${Ar_crit}
#pair_coeff    * * ${Ar_eps} ${Ar_sigma}  ${Ar_crit}

############################################
# Version #2 fails with neighbor list on GPU
############################################
pair_style     hybrid lj/cut ${Ar_crit} eam
pair_coeff    1 1 eam Au_u3.eam
pair_coeff    1 2 lj/cut ${Ar_eps} ${Ar_sigma}      # Au-Ar
pair_coeff    2 2 lj/cut ${Ar_eps} ${Ar_sigma}        # Ar-Ar

mass    1    ${Au_mass}
mass    2    ${Ar_mass}

variable Coef        equal 66.34     # to gr/cm^3
variable N_L1        equal round(${rho}*(50*50*50)/${Coef})
create_atoms 2 random ${N_L1} 123456 simbox

thermo    100
thermo_style    custom    step    temp    press    pe ke density

timestep    0.001
fix        Lang    all    langevin 85 85 5 3333
fix         NVELIM     all     nve/limit 0.002
run_style    verlet
run         500
unfix        NVELIM

timestep    0.005
fix        NVE_all all    nve
run_style    verlet
run        500
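
A typical launch command line for this reproducer (the executable and input file names are assumptions, not part of the original report):

# fails intermittently: GPU package with its default "neigh yes"
mpirun -np 4 lmp -sf gpu -in in.hybrid_gpu

# works: same input, but with neighbor lists built on the CPU
mpirun -np 4 lmp -sf gpu -pk gpu 0 neigh no -in in.hybrid_gpu
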
@ndtrung81 (Contributor)

@akohlmey I found a flaw in the current implementation (my bad): it misses skipping the pair types assigned to other sub-styles in pair hybrid. Maybe that is the reason why pair hybrid* was not supported with neighbor list builds on the GPU, similar to why "neigh_modify exclude" is not supported (yet) for GPU neighbor builds. The current neighbor list build does not have a way to explicitly skip/exclude pair types. Maybe we need a separate code path or a separate neighbor build kernel for pair hybrid and "neigh_modify exclude", so as not to interfere with the performance of the other existing cases. I will take a closer look at this in the coming weeks. @wmbrownIntel do you have a suggestion?
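
(For illustration, a minimal sketch of the per-type-pair skip step that the CPU skip lists apply and that the GPU neighbor build currently lacks; the ijskip name follows the LAMMPS CPU skip-list code, but this is not the actual implementation.)

// Trim a full neighbor list down to the type pairs a hybrid sub-style handles.
static int trim_for_substyle(const int *jlist_full, int jnum_full,
                             const int *type, int itype,
                             int **ijskip, int *jlist) {
  int n = 0;
  for (int jj = 0; jj < jnum_full; jj++) {
    const int j = jlist_full[jj];
    const int jtype = type[j];
    if (ijskip[itype][jtype]) continue;   // type pair not handled by this sub-style
    jlist[n++] = j;
  }
  return n;                               // number of neighbors kept
}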

@wmbrownIntel (Collaborator)

All of my tests with hybrid use 'neigh no' and I honestly don't remember how we have supported this. My recommendation is to test with hybrid/overlay to see if it works correctly with GPU neighbor builds. If not, check for the combination of GPU neighbor build and hybrid in FixGPU::init and raise an error. If it does work, check for the combination of GPU neighbor build and neighbor->requests[i]->skip and raise an error. Sorry I can't give this more attention at the moment; I have to be focused on another project this week...
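
(A minimal sketch of that suggested guard inside FixGPU::init(), assuming the member names used elsewhere in this thread and in the LAMMPS sources; illustrative and untested, not an actual patch.)

if (_gpu_mode != GPU_FORCE)
  for (int i = 0; i < neighbor->nrequest; i++)
    if (neighbor->requests[i]->skip)
      error->all(FLERR, "Cannot build GPU neighbor lists for skip-list (pair hybrid) requests");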

@ndtrung81 (Contributor)

@wmbrownIntel thanks for your suggestion. I have refactored the code to support GPU neighbor builds for pair hybrid in PR #1430.

The issue has been traced to the current implementation of eam/gpu, where rho and fp are currently computed for all pair types but should be computed only for the unskipped types, as done in the CPU version. I will work on a bugfix for this in a separate PR.

Other pairwise styles are supposed to work correctly, since cutsq[itype][jtype] is set to zero for skipped pair types with setflag[itype][jtype] == 0. For 3-body styles, "* *" coefficients are enforced, so they should work correctly, too.
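
(For illustration only, not the actual GPU kernel code: why a zero cutoff is sufficient for the pairwise styles.)

// With cutsq[itype][jtype] == 0.0 for a skipped type pair, the distance test
// can never pass, so such pairs contribute nothing even if they still appear
// in the (unskipped) neighbor list.
static bool pair_contributes(double rsq, double **cutsq, int itype, int jtype) {
  return rsq < cutsq[itype][jtype];   // always false when cutsq[itype][jtype] == 0.0
}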

@akohlmey (Member Author)

The test input was re-checked with commit fff89135fb and it passes.

@akohlmey (Member Author)

Correction: it passes with 1 MPI rank but fails with 2 MPI ranks.

@ndtrung81 (Contributor)

@akohlmey FYI, I revisited the issue with the given input script and found that with the current HEAD (commit b44e353) the run completes successfully with "neigh yes" with 2 and 4 MPI procs. This is with the CUDA build via CMake; the OpenCL build fails with multiple MPI procs.

For this test, I only disabled the error-out line in fix_gpu.cpp as shown below:

diff --git a/src/GPU/fix_gpu.cpp b/src/GPU/fix_gpu.cpp
index 0de2ab51f8..836bdc0f75 100644
--- a/src/GPU/fix_gpu.cpp
+++ b/src/GPU/fix_gpu.cpp
@@ -287,8 +287,7 @@ void FixGPU::init()
     for (int i = 0; i < hybrid->nstyles; i++)
       if (!utils::strmatch(hybrid->keywords[i],"/gpu$"))
         force->pair->no_virial_fdotr_compute = 1;
-    if (_gpu_mode != GPU_FORCE)
-      error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
+//    if (_gpu_mode != GPU_FORCE) error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
   }
 
   // rRESPA support

Could you reproduce the crash with the OpenCL build with your branch?

@akohlmey (Member Author)

When I run this input with the suggested change with CUDA and 4 MPI ranks on 1 GPU, it crashes in the second run with

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 500
  Time step     : 0.005
Cuda driver error 700 in call at file '/home/akohlmey/compile/lammps/lib/gpu/geryon/nvd_timer.h' in line 76.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

Similar for OpenCL:

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 500
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.754 | 2.755 | 2.757 Mbytes
   Step          Temp          Press          PotEng         KinEng        Density    
       500   62.575673      66535.548      999.64831      21.483164      1.4100918    
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_timer.h' in line 118 : -9999.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

I see similar behavior with the PR #3485 branch for some inputs even without hybrid, e.g. with this input file:

include in.rhodo
clear
include in.rhodo
clear
include in.rhodo
clear
include in.rhodo

in the bench folder.

--------------------------------------------------------------------------
- Using acceleration for pppm:
-  with 4 proc(s) per device.
-  with OpenCL Parameters for: NVIDIA_GPU (203)
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce GTX 1060 6GB, 10 CUs, 5.9 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.
Initializing Device 0 on core 2...Done.
Initializing Device 0 on core 3...Done.


--------------------------------------------------------------------------
- Using acceleration for lj/charmm/coul/long:
-  with 4 proc(s) per device.
-  with OpenCL Parameters for: NVIDIA_GPU (203)
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce GTX 1060 6GB, 10 CUs, 5.9 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.
Initializing Device 0 on core 2...Done.
Initializing Device 0 on core 3...Done.

Generated 2278 of 2278 mixed pair_coeff terms from arithmetic mixing rule
Setting up Verlet run ...
  Unit style    : real
  Current step  : 0
  Time step     : 2
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

It may be related to having more than one style used with GPU acceleration. I can "fix" the issue with multiple rhodo bench runs
by adding -pk gpu 0 pair/only yes.

@ndtrung81 (Contributor)

@akohlmey For some reason I still cannot reproduce the crash with CUDA with 4 MPI ranks on my end (CUDA Toolkit 11.6, Fedora 35, NVIDIA P100 GPU) with the recent commit 05aca2b in the develop branch. I can reproduce the crash with OpenCL. FYI, the runtime errors you got, at geryon/nvd_timer.h:76 and geryon/ocl_timer.h:118, are both in the UCL_Timer::sync_stop() function. The difference is that the CUDA backend calls cuEventSynchronize() with start_event, whereas the OpenCL backend calls clWaitForEvents() with stop_event.
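
(For illustration only, a minimal sketch of the two synchronization paths named above; this is not the actual geryon source and error handling is omitted.)

#include <cuda.h>     // CUDA driver API
#include <CL/cl.h>    // OpenCL

// nvd_timer.h path: blocks until the recorded start event has completed
void sync_cuda_timer(CUevent start_event) {
  cuEventSynchronize(start_event);
}

// ocl_timer.h path: blocks until the recorded stop event has completed
void sync_ocl_timer(cl_event stop_event) {
  clWaitForEvents(1, &stop_event);
}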

A small update on the debugging progress: I generated a restart file from the CPU run with the input script given here, so as to have a well-behaved configuration. I then read the restart file in another input script, run_from_restart.eam.txt. This script runs a single time step and writes out a dump file with atom forces. On my end, both OpenCL and CUDA on 4 MPI ranks produce consistent thermo output and atom forces with "neigh yes" and "neigh no" vs. the 4-MPI-rank CPU run. It looks to me like the neighbor list build and force kernels work properly with neigh yes for pair hybrid. The crash with OpenCL on my end might come from the event sync with UCL_Timer in ocl_timer.h, but I am not sure how to debug it further after this revisit.

The issue you're seeing with repeated runs and clear, even without pair hybrid, suggests other root causes that are not relevant to the present issue. The reported line, ocl_kernel.h:261, is where clSetKernelArg is called from add_arg(). Error -36 indicates an invalid command queue (cl.h:164), so it looks like the command queue where the kernel is enqueued is invalid at that point.

Any comments and suggestions will be highly appreciated.

@akohlmey (Member Author)

akohlmey commented Nov 2, 2022

@ndtrung81 I've run some more experiments. There is something fishy going on when multiple kernels are active, and that is exposed most easily when there are multiple run commands. It could very well be in the GPU neighbor list code. I can "fix" the multiple rhodo runs case by adding -pk gpu 0 pair/only yes and thus avoiding the pppm kernel. This would be consistent with the fact that the hybrid pair style case can be "fixed" by using neighbor lists computed on the CPU.

I don't really want to enable the GPU neighbor list for hybrid unless that is resolved. Unfortunately, this is beyond my skills and experience with GPU programming (I never spent much time on it anyway).

@ndtrung81 (Contributor)

@akohlmey I think I fixed a bug in lal_pppm.cpp where the kernels need to be recompiled after clear. I can run the input file you created with in.rhodo successfully on my end with the patch below (lal_pppm.diff.txt attached):

diff --git a/lib/gpu/lal_pppm.cpp b/lib/gpu/lal_pppm.cpp
index 39249ea350..f957b36a7d 100644
--- a/lib/gpu/lal_pppm.cpp
+++ b/lib/gpu/lal_pppm.cpp
@@ -79,6 +79,8 @@ grdtyp *PPPMT::init(const int nlocal, const int nall, FILE *_screen,
     return nullptr;
   }
 
+  if (ucl_device!=device->gpu) _compiled=false;
+
   ucl_device=device->gpu;
   atom=&device->atom;

Can you double check to see if it is the case on your tests? The original issue with neigh yes and pair hybrid is still unresolved, though.

@akohlmey (Member Author)

Can you double check to see if it is the case on your tests?

@ndtrung81 yes, this fixes the crash for running the rhodo test case multiple times with clear in between on CUDA. OpenCL seems to be unaffected.

@ndtrung81 (Contributor)

@akohlmey I have tried the changes below with the original input script in this bug report, and they seem to resolve the crash with neigh yes on my end with NVIDIA CUDA and OpenCL. At your convenience, could you please try the changes on your end? If this is the right fix for this bug, I will rewrite the relevant parts and submit a separate PR.

Note to self: when multiple GPU neighbor builds are requested, the Neighbor instances actually share the same Atom object, and using async = true here leads to corrupted particle ids on the device. We need to switch to async = false only for this particular setting, so as not to affect the more common use case (only one GPU neighbor build request); a sketch of this idea follows the diff below.

diff --git a/lib/gpu/lal_neighbor.cpp b/lib/gpu/lal_neighbor.cpp
index 10816e2fa6..2d3c1c2b8a 100644
--- a/lib/gpu/lal_neighbor.cpp
+++ b/lib/gpu/lal_neighbor.cpp
@@ -731,7 +731,7 @@ void Neighbor::build_nbor_list(double **x, const int inum, const int host_inum,
       particle_id[ploc]=i;
     }
     time_hybrid2.start();
-    ucl_copy(atom.dev_particle_id,atom.host_particle_id,nall,true);
+    ucl_copy(atom.dev_particle_id,atom.host_particle_id,nall,false);
     time_hybrid2.stop();
     _bin_time+=MPI_Wtime()-stime;
   }
diff --git a/src/GPU/fix_gpu.cpp b/src/GPU/fix_gpu.cpp
index 92f4f256b2..f68543d85e 100644
--- a/src/GPU/fix_gpu.cpp
+++ b/src/GPU/fix_gpu.cpp
@@ -276,8 +276,10 @@ void FixGPU::init()
     for (int i = 0; i < hybrid->nstyles; i++)
       if (!utils::strmatch(hybrid->keywords[i],"/gpu$"))
         force->pair->no_virial_fdotr_compute = 1;
+/*
     if (_gpu_mode != GPU_FORCE)
       error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
+*/
   }
 
   // rRESPA support
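
(For illustration of the note above: a minimal sketch of making the copy mode conditional; the flag name shared_atom_nbor_build is hypothetical, not an existing member.)

// Use a blocking copy only when several GPU neighbor builds share the same
// Atom object; keep the asynchronous copy for the common single-build case.
const bool async_ok = !shared_atom_nbor_build;
ucl_copy(atom.dev_particle_id, atom.host_particle_id, nall, async_ok);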

@akohlmey (Member Author)

@ndtrung81 I see some small improvements, but I still have crashes with CUDA and GPU neighbor lists.

There also seem to be significant differences in the forces. Even with double precision there is a significant difference somewhere between the GPU and CPU forces, since the energies are identical on step 0 but differ from step 1 onward.

The run with the neighbor lists on the GPU fails during the second run command with a lost-atoms error. Depending on the compilation settings, I can also get memory errors inside the kernel, resulting in Xid errors in the Linux kernel log. OpenCL seems to be more forgiving, but the issue persists.

akohlmey removed this from the Stable Release Summer 2023 milestone on Jul 17, 2023
@akohlmey (Member Author)

Further improvements have been difficult to achieve. @ndtrung81 and @akohlmey are happy with what has been fixed so far, so we decided to close this issue and wait for a new bug report.

akohlmey closed this as not planned on Feb 14, 2024
@pw0908

pw0908 commented Mar 29, 2024

Hi all,

I am currently trying to use the GPU package with a hybrid/overlay potential and am encountering a similar issue (ERROR: Must not use GPU neighbor lists with hybrid pair style). I am using an overlay of lj/cut, coul/long and a custom potential (which I've implemented in the GPU package myself).

Given I am using a custom potential, what would be the best way for me to provide you with a bug report? Shall I just fork lammps and make my changes there?

Thank you very much for your help!

@akohlmey (Member Author)

@pw0908 there is nothing more to report. We have tried very hard to sort this out, but did not succeed. Hence this issue is closed and won't be reopened.

All you have to do is compute the neighbor lists on the CPU via the package command or -pk command line flag.
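
For example, either of the following (the input file name is a placeholder):

# in the input script, before the simulation box is defined:
package gpu 0 neigh no

# or, equivalently, on the command line:
lmp -sf gpu -pk gpu 0 neigh no -in in.myscript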

@akohlmey (Member Author)

... or instead of using pair style hybrid/overlay, you can change your custom pair style to include lj/cut and coul/long in addition to your custom interaction. That would be faster than hybrid/overlay, even if you could use GPU neighbor lists.

@pw0908

pw0908 commented Mar 29, 2024

I see. I'll give that a shot. Thanks!

Just out of curiosity, I was thinking of opening a PR to merge this implementation into LAMMPS. Would there be much point in also providing the implementation of lj/cut/coul/long/custom? This potential is just an approximate way of modeling ion-dipole interactions in coarse-grained systems; you would normally use it in conjunction with lj/cut and coul/long anyway.

@akohlmey (Member Author)

Would there be much point in also providing the implementation of lj/cut/coul/long/custom?

Probably more than for the version that requires using hybrid/overlay

@ndtrung81 (Contributor)

@pw0908 you can derive your custom pair style classes from those for lj/cut/coul/long in lib/gpu (LJCoulLong) and src/GPU (PairLJCutCoulLongGPU). It will be faster than hybrid/overlay with neighbor builds on the host. Your PR will be very much welcome.
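
(A minimal header sketch of that suggestion; the class, file, and member names for the custom style are hypothetical, and the GPU bookkeeping is only indicated, not implemented.)

// pair_lj_cut_coul_long_custom_gpu.h (hypothetical)
#include "pair_lj_cut_coul_long.h"

namespace LAMMPS_NS {

class PairLJCutCoulLongCustomGPU : public PairLJCutCoulLong {
 public:
  PairLJCutCoulLongCustomGPU(class LAMMPS *);
  ~PairLJCutCoulLongCustomGPU() override;
  void compute(int, int) override;   // LJ + Coulomb plus the custom ion-dipole term
  void init_style() override;        // set up the lib/gpu side (cf. LJCoulLong)
  double memory_usage() override;

 private:
  int gpu_mode;                      // GPU_FORCE vs. GPU_NEIGH, as in other GPU styles
  double cpu_time;
};

}   // namespace LAMMPS_NS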
