
[BUG] Using pair style hybrid with GPU package and neighbor lists on GPU #2621

Closed
akohlmey opened this issue Feb 23, 2021 · 20 comments

@akohlmey (Member)

Summary

There appears to be a problem with using neighbor lists on the GPU for pair style hybrid inputs. The following input does not work reliably with just "-sf gpu" added, but using "-pk gpu 0 neigh no" avoids the crashes and corrupted data. This is supposed to work since PR #1430. Below is an abbreviated version of an input posted on lammps-users.

LAMMPS Version and Platform

Current LAMMPS development head revision ce4dc4e

Expected Behavior

The input should run with "-sf gpu" using one or multiple MPI processes, without having to disable neighbor lists on the GPU.

Actual Behavior

LAMMPS reports "nan" or incorrect results, or crashes with various errors or warnings. (CUDA has device failures; OpenCL asks to boost neigh_one when run on 1 CPU but can work when binsize is set, yet fails with more MPI ranks.)

Steps to Reproduce

Use the following input (a typical launch command line is sketched after the script):

units        metal
dimension    3
boundary    p p p
atom_style    atomic
variable Ar_mass    equal 39.95                        # [aem]
variable Au_mass    equal 196.97                        # [aem]
variable Ar_eps        equal 0.0103                        # epsilon Ar-Ar, [eV]
variable Ar_sigma    equal 3.4                        # sigma Ar-Ar, [A]
variable Ar_crit    equal 3.5*${Ar_sigma}                    # cut-off distance, [A]
variable rho        equal 1.41                        # density [g/cm^3]

# simulated box
region     simbox     block     0 50     0 50     0 50     side in units box
create_box    2     simbox

########################
# Version #1  w/o hybrid
########################
#pair_style    lj/cut ${Ar_crit}
#pair_coeff    * * ${Ar_eps} ${Ar_sigma}  ${Ar_crit}

############################################
# Version #2 fails with neighbor list on GPU
############################################
pair_style     hybrid lj/cut ${Ar_crit} eam
pair_coeff    1 1 eam Au_u3.eam
pair_coeff    1 2 lj/cut ${Ar_eps} ${Ar_sigma}      # Au-Ar
pair_coeff    2 2 lj/cut ${Ar_eps} ${Ar_sigma}        # Ar-Ar

mass    1    ${Au_mass}
mass    2    ${Ar_mass}

variable Coef        equal 66.34     # to gr/cm^3
variable N_L1        equal round(${rho}*(50*50*50)/${Coef})
create_atoms 2 random ${N_L1} 123456 simbox

thermo    100
thermo_style    custom    step    temp    press    pe ke density

timestep    0.001
fix        Lang    all    langevin 85 85 5 3333
fix         NVELIM     all     nve/limit 0.002
run_style    verlet
run         500
unfix        NVELIM

timestep    0.005
fix        NVE_all all    nve
run_style    verlet
run        500
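
A typical launch command line for this reproducer (the executable and input file names are assumptions, not part of the original report):

# fails intermittently: GPU package with its default "neigh yes"
mpirun -np 4 lmp -sf gpu -in in.hybrid_gpu

# works: same input, but with neighbor lists built on the CPU
mpirun -np 4 lmp -sf gpu -pk gpu 0 neigh no -in in.hybrid_gpu
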
@ndtrung81 (Contributor)

@akohlmey I found a flaw in the current implementation (my bad): it misses skipping the pair types assigned to other sub-styles in pair hybrid. Maybe that is the reason why pair hybrid* was not supported with neighbor list builds on the GPU, similar to why "neigh_modify exclude" is not supported (yet) for GPU neighbor builds. The current neighbor list build does not have a way to explicitly skip/exclude pair types. Maybe we need a separate code path or a separate neighbor build kernel for pair hybrid and "neigh_modify exclude", so as not to interfere with the performance of the other existing cases. I will take a closer look at this in the coming weeks. @wmbrownIntel do you have a suggestion?
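
(For illustration, a minimal sketch of the per-type-pair skip step that the CPU skip lists apply and that the GPU neighbor build currently lacks; the ijskip name follows the LAMMPS CPU skip-list code, but this is not the actual implementation.)

// Trim a full neighbor list down to the type pairs a hybrid sub-style handles.
static int trim_for_substyle(const int *jlist_full, int jnum_full,
                             const int *type, int itype,
                             int **ijskip, int *jlist) {
  int n = 0;
  for (int jj = 0; jj < jnum_full; jj++) {
    const int j = jlist_full[jj];
    const int jtype = type[j];
    if (ijskip[itype][jtype]) continue;   // type pair not handled by this sub-style
    jlist[n++] = j;
  }
  return n;                               // number of neighbors kept
}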

@wmbrownIntel (Collaborator)

All of my tests with hybrid use 'neigh no' and I honestly don't remember how we have supported this. My recommendation is to test with hybrid/overlay to see if it works correctly with GPU neighbor builds. If not, check for the combination of GPU neighbor build and hybrid in FixGPU::init and raise an error. If it does work, check for the combination of GPU neighbor build and neighbor->requests[i]->skip and raise an error. Sorry I can't give this more attention at the moment; I have to be focused on another project this week...
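
(A minimal sketch of that suggested guard inside FixGPU::init(), assuming the member names used elsewhere in this thread and in the LAMMPS sources; illustrative and untested, not an actual patch.)

if (_gpu_mode != GPU_FORCE)
  for (int i = 0; i < neighbor->nrequest; i++)
    if (neighbor->requests[i]->skip)
      error->all(FLERR, "Cannot build GPU neighbor lists for skip-list (pair hybrid) requests");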

@ndtrung81 (Contributor)

@wmbrownIntel thanks for your suggestion. I have refactored the code to support GPU neighbor builds for pair hybrid in PR #1430.

The issue has been traced to the current implementation of eam/gpu, where rho and fp are currently computed for all pair types but should be computed only for the unskipped types, as done in the CPU version. I will work on a bugfix for this in a separate PR.

Other pairwise styles are supposed to work correctly, since cutsq[itype][jtype] is set to zero for skipped pair types with setflag[itype][jtype] == 0. For 3-body styles, "* *" coefficients are enforced, so they should work correctly, too.
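
(For illustration only, not the actual GPU kernel code: why a zero cutoff is sufficient for the pairwise styles.)

// With cutsq[itype][jtype] == 0.0 for a skipped type pair, the distance test
// can never pass, so such pairs contribute nothing even if they still appear
// in the (unskipped) neighbor list.
static bool pair_contributes(double rsq, double **cutsq, int itype, int jtype) {
  return rsq < cutsq[itype][jtype];   // always false when cutsq[itype][jtype] == 0.0
}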

@akohlmey (Member Author)

The test input was re-checked with commit fff89135fb and it passes.

@akohlmey (Member Author)

Correction: it passes with 1 MPI rank but fails with 2 MPI ranks.

@ndtrung81 (Contributor)

@akohlmey FYI, I revisited the issue with the given input script and found that with the current HEAD (commit b44e353) the run completes successfully with "neigh yes" with 2 and 4 MPI procs. This is with the CUDA build via CMake; the OpenCL build fails with multiple MPI procs.

For this test, I only disabled the error-out line in fix_gpu.cpp as shown below:

diff --git a/src/GPU/fix_gpu.cpp b/src/GPU/fix_gpu.cpp
index 0de2ab51f8..836bdc0f75 100644
--- a/src/GPU/fix_gpu.cpp
+++ b/src/GPU/fix_gpu.cpp
@@ -287,8 +287,7 @@ void FixGPU::init()
     for (int i = 0; i < hybrid->nstyles; i++)
       if (!utils::strmatch(hybrid->keywords[i],"/gpu$"))
         force->pair->no_virial_fdotr_compute = 1;
-    if (_gpu_mode != GPU_FORCE)
-      error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
+//    if (_gpu_mode != GPU_FORCE) error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
   }
 
   // rRESPA support

Could you reproduce the crash with the OpenCL build with your branch?

@akohlmey (Member Author)

When I run this input with the suggested change with CUDA and 4 MPI ranks on 1 GPU, it crashes in the second run with

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 500
  Time step     : 0.005
Cuda driver error 700 in call at file '/home/akohlmey/compile/lammps/lib/gpu/geryon/nvd_timer.h' in line 76.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

Similar for OpenCL:

Setting up Verlet run ...
  Unit style    : metal
  Current step  : 500
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.754 | 2.755 | 2.757 Mbytes
   Step          Temp          Press          PotEng         KinEng        Density    
       500   62.575673      66535.548      999.64831      21.483164      1.4100918    
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_timer.h' in line 118 : -9999.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

I see similar behavior with the PR #3485 branch for some inputs even without hybrid, e.g. with this input file:

include in.rhodo
clear
include in.rhodo
clear
include in.rhodo
clear
include in.rhodo

in the bench folder.

--------------------------------------------------------------------------
- Using acceleration for pppm:
-  with 4 proc(s) per device.
-  with OpenCL Parameters for: NVIDIA_GPU (203)
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce GTX 1060 6GB, 10 CUs, 5.9 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.
Initializing Device 0 on core 2...Done.
Initializing Device 0 on core 3...Done.


--------------------------------------------------------------------------
- Using acceleration for lj/charmm/coul/long:
-  with 4 proc(s) per device.
-  with OpenCL Parameters for: NVIDIA_GPU (203)
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce GTX 1060 6GB, 10 CUs, 5.9 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.
Initializing Device 0 on core 2...Done.
Initializing Device 0 on core 3...Done.

Generated 2278 of 2278 mixed pair_coeff terms from arithmetic mixing rule
Setting up Verlet run ...
  Unit style    : real
  Current step  : 0
  Time step     : 2
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
OpenCL error in file '/home/akohlmey/compile/lammps/lib/gpu/geryon/ocl_kernel.h' in line 261 : -38.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

It may be related to having more than one style used with GPU acceleration. I can "fix" the issue with multiple rhodo bench runs
by adding -pk gpu 0 pair/only yes.

@ndtrung81 (Contributor)

@akohlmey For some reason I still cannot reproduce the crash with CUDA with 4 MPI ranks on my end (CUDA Toolkit 11.6, Fedora 35, NVIDIA P100 GPU) with the recent commit 05aca2b in the develop branch. I can reproduce the crash with OpenCL. FYI, the runtime errors you got, at geryon/nvd_timer.h:76 and geryon/ocl_timer.h:118, are both in the UCL_Timer::sync_stop() function. The difference is that the CUDA backend calls cuEventSynchronize() with start_event, whereas the OpenCL backend calls clWaitForEvents() with stop_event.
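
(For illustration only, a minimal sketch of the two synchronization paths named above; this is not the actual geryon source and error handling is omitted.)

#include <cuda.h>     // CUDA driver API
#include <CL/cl.h>    // OpenCL

// nvd_timer.h path: blocks until the recorded start event has completed
void sync_cuda_timer(CUevent start_event) {
  cuEventSynchronize(start_event);
}

// ocl_timer.h path: blocks until the recorded stop event has completed
void sync_ocl_timer(cl_event stop_event) {
  clWaitForEvents(1, &stop_event);
}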

A small update on the debugging progress: I generated a restart file from the CPU run with the input script given here, so as to have a well-behaved configuration. I then read the restart file in another input script, run_from_restart.eam.txt. This script runs a single time step and writes out a dump file with atom forces. On my end, both OpenCL and CUDA on 4 MPI ranks produce consistent thermo output and atom forces with "neigh yes" and "neigh no" vs. the 4-MPI-rank CPU run. It looks to me like the neighbor list build and force kernels work properly with neigh yes for pair hybrid. The crash with OpenCL on my end might come from the event sync with UCL_Timer in ocl_timer.h, but I am not sure how to debug it further after this revisit.

The issue you're seeing with repeated runs and clear, even without pair hybrid, suggests other root causes that are not relevant to the present issue. The reported line, ocl_kernel.h:261, is where clSetKernelArg is called from add_arg(). Error -36 indicates an invalid command queue (cl.h:164), so it looks like the command queue where the kernel is enqueued is invalid at that point.

Any comments and suggestions will be highly appreciated.

@akohlmey (Member Author)

akohlmey commented Nov 2, 2022

@ndtrung81 I've run some more experiments. There is something fishy going on when multiple kernels are active, and that is exposed most easily when there are multiple run commands. It could very well be in the GPU neighbor list code. I can "fix" the multiple rhodo runs case by adding -pk gpu 0 pair/only yes and thus avoiding the pppm kernel. This would be consistent with the fact that the hybrid pair style case can be "fixed" by using neighbor lists computed on the CPU.

I don't really want to enable the GPU neighbor list for hybrid unless that is resolved. Unfortunately, this is beyond my skills and experience with GPU programming (I never spent much time on it anyway).

@ndtrung81 (Contributor)

@akohlmey I think I fixed a bug in lal_pppm.cpp where the kernels need to be recompiled after clear. I can run the input file you created with in.rhodo successfully on my end with the patch below (lal_pppm.diff.txt attached):

diff --git a/lib/gpu/lal_pppm.cpp b/lib/gpu/lal_pppm.cpp
index 39249ea350..f957b36a7d 100644
--- a/lib/gpu/lal_pppm.cpp
+++ b/lib/gpu/lal_pppm.cpp
@@ -79,6 +79,8 @@ grdtyp *PPPMT::init(const int nlocal, const int nall, FILE *_screen,
     return nullptr;
   }
 
+  if (ucl_device!=device->gpu) _compiled=false;
+
   ucl_device=device->gpu;
   atom=&device->atom;

Can you double check to see if it is the case on your tests? The original issue with neigh yes and pair hybrid is still unresolved, though.

@akohlmey (Member Author)

Can you double check to see if it is the case on your tests?

@ndtrung81 yes, this fixes the crash for running the rhodo test case multiple times with clear in between on CUDA. OpenCL seems to be unaffected.

@ndtrung81 (Contributor)

@akohlmey I have tried the changes below with the original input script in this bug report, and they seem to resolve the crash with neigh yes on my end with NVIDIA CUDA and OpenCL. At your convenience, could you please try the changes on your end? If this is the right fix for this bug, I will rewrite the relevant parts and submit a separate PR.

Note to self: when multiple GPU neighbor builds are requested, the Neighbor instances actually share the same Atom object, and using async = true here leads to corrupted particle ids on the device. We need to switch to async = false only for this particular setting, so as not to affect the more common use case (only one GPU neighbor build request); a sketch of this idea follows the diff below.

diff --git a/lib/gpu/lal_neighbor.cpp b/lib/gpu/lal_neighbor.cpp
index 10816e2fa6..2d3c1c2b8a 100644
--- a/lib/gpu/lal_neighbor.cpp
+++ b/lib/gpu/lal_neighbor.cpp
@@ -731,7 +731,7 @@ void Neighbor::build_nbor_list(double **x, const int inum, const int host_inum,
       particle_id[ploc]=i;
     }
     time_hybrid2.start();
-    ucl_copy(atom.dev_particle_id,atom.host_particle_id,nall,true);
+    ucl_copy(atom.dev_particle_id,atom.host_particle_id,nall,false);
     time_hybrid2.stop();
     _bin_time+=MPI_Wtime()-stime;
   }
diff --git a/src/GPU/fix_gpu.cpp b/src/GPU/fix_gpu.cpp
index 92f4f256b2..f68543d85e 100644
--- a/src/GPU/fix_gpu.cpp
+++ b/src/GPU/fix_gpu.cpp
@@ -276,8 +276,10 @@ void FixGPU::init()
     for (int i = 0; i < hybrid->nstyles; i++)
       if (!utils::strmatch(hybrid->keywords[i],"/gpu$"))
         force->pair->no_virial_fdotr_compute = 1;
+/*
     if (_gpu_mode != GPU_FORCE)
       error->all(FLERR, "Must not use GPU neighbor lists with hybrid pair style");
+*/
   }
 
   // rRESPA support
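
(For illustration of the note above: a minimal sketch of making the copy mode conditional; the flag name shared_atom_nbor_build is hypothetical, not an existing member.)

// Use a blocking copy only when several GPU neighbor builds share the same
// Atom object; keep the asynchronous copy for the common single-build case.
const bool async_ok = !shared_atom_nbor_build;
ucl_copy(atom.dev_particle_id, atom.host_particle_id, nall, async_ok);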

@akohlmey (Member Author)

@ndtrung81 I see some small improvements, but I still have crashes with CUDA and GPU neighbor lists.

There also seem to be significant differences in the forces. Even with double precision there is a significant difference somewhere between the GPU and CPU forces, since the energies are identical on step 0 but differ from step 1 onward.

The run with the neighbor lists on the GPU fails during the second run command with a lost-atoms error. Depending on the compilation settings, I can also get memory errors inside the kernel, resulting in Xid errors in the Linux kernel log. OpenCL seems to be more forgiving, but the issue persists.

akohlmey removed this from the Stable Release Summer 2023 milestone on Jul 17, 2023
@akohlmey (Member Author)

Further improvements have been difficult to achieve. @ndtrung81 and @akohlmey are happy with what has been fixed so far, so we decided to close this issue and wait for a new bug report.

akohlmey closed this as not planned on Feb 14, 2024
@pw0908

pw0908 commented Mar 29, 2024

Hi all,

I am currently trying to use the GPU package with a hybrid/overlay potential and am encountering a similar issue (ERROR: Must not use GPU neighbor lists with hybrid pair style). I am using an overlay of lj/cut, coul/long and a custom potential (which I've implemented in the GPU package myself).

Given I am using a custom potential, what would be the best way for me to provide you with a bug report? Shall I just fork lammps and make my changes there?

Thank you very much for your help!

@akohlmey (Member Author)

@pw0908 there is nothing more to report. We have tried very hard to sort this out, but did not succeed. Hence this issue is closed and won't be reopened.

All you have to do is compute the neighbor lists on the CPU via the package command or -pk command line flag.
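
For example, either of the following (the input file name is a placeholder):

# in the input script, before the simulation box is defined:
package gpu 0 neigh no

# or, equivalently, on the command line:
lmp -sf gpu -pk gpu 0 neigh no -in in.myscript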

@akohlmey (Member Author)

... or instead of using pair style hybrid/overlay, you can change your custom pair style to include lj/cut and coul/long in addition to your custom interaction. That would be faster than hybrid/overlay, even if you could use GPU neighbor lists.

@pw0908

pw0908 commented Mar 29, 2024

I see. I'll give that a shot. Thanks!

Just out of curiosity, I was thinking of opening a PR to merge this implementation into LAMMPS. Would there be much point in also providing the implementation of lj/cut/coul/long/custom? This potential is just an approximate way of modeling ion-dipole interactions in coarse-grained systems; you would normally use it in conjunction with lj/cut and coul/long anyway.

@akohlmey (Member Author)

Would there be much point in also providing the implementation of lj/cut/coul/long/custom?

Probably more than for the version that requires using hybrid/overlay

@ndtrung81 (Contributor)

@pw0908 you can derive your custom pair style classes from those for lj/cut/coul/long in lib/gpu (LJCoulLong) and src/GPU (PairLJCutCoulLongGPU). It will be faster than hybrid/overlay with neighbor builds on the host. Your PR will be very much welcome.
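
(A minimal header sketch of that suggestion; the class, file, and member names for the custom style are hypothetical, and the GPU bookkeeping is only indicated, not implemented.)

// pair_lj_cut_coul_long_custom_gpu.h (hypothetical)
#include "pair_lj_cut_coul_long.h"

namespace LAMMPS_NS {

class PairLJCutCoulLongCustomGPU : public PairLJCutCoulLong {
 public:
  PairLJCutCoulLongCustomGPU(class LAMMPS *);
  ~PairLJCutCoulLongCustomGPU() override;
  void compute(int, int) override;   // LJ + Coulomb plus the custom ion-dipole term
  void init_style() override;        // set up the lib/gpu side (cf. LJCoulLong)
  double memory_usage() override;

 private:
  int gpu_mode;                      // GPU_FORCE vs. GPU_NEIGH, as in other GPU styles
  double cpu_time;
};

}   // namespace LAMMPS_NS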
