Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] _Segmentation_Fault_from_npair_half_bin_newton.cpp_ #3835

Closed
aramdbs opened this issue Jun 29, 2023 · 12 comments · Fixed by #3830
Closed

[BUG] _Segmentation_Fault_from_npair_half_bin_newton.cpp_ #3835

aramdbs opened this issue Jun 29, 2023 · 12 comments · Fixed by #3830

Comments

@aramdbs
Copy link

aramdbs commented Jun 29, 2023

Summary

When using a custom (fix) module that requests half pair list and sets a cutoff for the list using set_cutoff() function a segmentation fault occurs either every time or occasionally depending on the system. The segmentation fault is thrown on line 128 (delx = xtmp - x[j][0];), due to ill defined value of j that originates from assignment j = binhead[ibin+stencil[k]] in for command for (j = binhead[ibin+stencil[k]]; j >= 0; j = bins[j]) on line 124. This is most likely due to memory corruption as far as I can tell.

I have reversed any changes and stripped my module of any code except for pair list request and call of set_cutoff() to make sure that the memory corruption does not occure from within the module.

It is also important to note that the latest stable version does not seem to have the issue.

LAMMPS Version and Platform

LAMMPS version: LAMMPS (28 Mar 2023 - Update 1) (the release version that was cloned from github at the beginning of June)
The error always occurs on linux system (#40~20.04.1-Ubuntu SMP Mon Mar 7 09:18:32 UTC 2022) with 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
It also occasionally manifests on MacOS Ventura with M2 processor (approximately once in 20 times)

Steps to Reproduce

To provide more information and reproduce the error I prepared a directory with most basic simulation setup (subfolder simulation_setup), additional code that requests the pair list and sets cutoff (using one of the values the gives error on my side, though it may occurs at smaller values as well) placed in subdirectory src, and log files from LAMMPS itself and Valgrind debug run (in error_logs subdirectory, also please note that line numbers after 128 may be shifted due to print that I added on my side to investigate the issue). The latter clearly reports the issues in npair_half_bin_newton.cpp that originate on line 124.

Further Information, Files, and Links

The archive of the directory described above is attached.
bug_report_npair_half_bin_newton.zip

@aramdbs aramdbs added the bug label Jun 29, 2023
@akohlmey akohlmey added this to the Stable Release Summer 2023 milestone Jun 29, 2023
@akohlmey
Copy link
Member

This is not really a bug in LAMMPS, but an issue with your fix code: when requesting a custom neighbor list cutoff, you also have to make certain that you can actually reach atoms that are that far away and thus you have to check whether your custom cutoff plus neighbor list skin distance is smaller or equal to the communication cutoff. This is by default set to the neighbor list skin plus the largest pair style cutoff.

In your input, you have a neighbor list skin of 13 angstrom plus are requesting a custom cutoff for your fix neighbor list of 21 angstrom, thus the communication cutoff must be at least 34 angstrom. This is not the case. You have a pair style cutoff of 4.5 angstrom, thus your communication cutoff is set to 17.5. Adding the line comm_modify cutoff 34 eliminates the memory access issue and thus the segfaults.

That said, a neighbor list skin of 13 angstrom is extremely wasteful. It adds many unwanted pairs to the neighbor list and thus can result in a significant slowdown. For standard molecular systems, it should be best to stick with the default neighbor list settings (2 angstrom skin, delay 0, every 1, check yes). Depending on the dynamics of the system and the fastest moving atoms as well as the choice of timestep, you may improve performance by using an even smaller skin (1.5 or 1.0) and use delay 1 or delay 2 (after delay 2 the performance gains from increasing the delay are quickly diminishing versus the risk of missing neighbors).
So for this case adding comm_modify cutoff 23 would be sufficient.

You can look at fix srp for some prototype code to check whether the communication cutoff is sufficiently large. Look at the setup_pre_force() function around line 248.

Please also not that LAMMPS prints an important warning with the large neighbor list skin for your system that must not be ignored.

@akohlmey akohlmey self-assigned this Jun 29, 2023
@aramdbs
Copy link
Author

aramdbs commented Jun 29, 2023 via email

@aramdbs
Copy link
Author

aramdbs commented Jun 29, 2023

Plus, wouldn't this raise an error or an exception rather than resulting in seg fault (or promote me to do something else), given that all I do is to change the largest pair list cutoff using provided standard functions intended for that?

@akohlmey
Copy link
Member

However, in my understanding, when requesting a pairlist with larger cutoff within the code as I did, that should result in extension of communication cutoff either automatically or by means of additional request within the code.

No. There is no precedence for this and you are mistaken. The only case where the communication cutoff is automatically extended is when a pair style cutoff requires it. In all other cases, it is the job of the individual code to ensure the communication cutoff requirements are fulfilled. An example for a case that supports requesting a neighbor list with a custom cutoff is compute rdf. That checks for whether a custom communication cutoff is set or uses the default communication cutoff and then requires that the custom cutoff for the g(r) computation is sufficiently small or stops with an error suggesting to change the communication cutoff as needed.

You can extend the communication cutoff only outside of a run. Similarly, you cannot change a neighbor list request during a run. If you want to rather extend the communication cutoff on demand, it also needs to be done by your code during the init() processing. I have not seen this done anywhere so I cannot guarantee that this will work.

@aramdbs
Copy link
Author

aramdbs commented Jun 29, 2023

no no, just to clarify I have no intension to change the cutoff during the run, just at the very beginning when simulation input parameters are given.

Also, just to clarify that I understood correctly, there is no way to change the communication cutoff from within the code, only from the input (*.in) file?

@akohlmey
Copy link
Member

Plus, wouldn't this raise an error or an exception rather than resulting in seg fault (or promote me to do something else), given that all I do is to change the largest pair list cutoff using provided standard functions intended for that?

No. The general philosophy in LAMMPS is that programmers are not supposed to make unreasonable choices when writing their code. Requesting a neighbor list with a cutoff larger than the communication cutoff is an unreasonable choice. It is near impossible and would be a huge effort to protect all function arguments from bad choices. There are plenty of cases where what is a reasonable or unreasonable choice depends on the context and sometimes the base code has no knowledge of that.

@akohlmey
Copy link
Member

Also, just to clarify that I understood correctly, there is no way to change the communication cutoff from within the code, only from the input (*.in) file?

You are misunderstanding again!

When you look through the Input class, you can see the the command comm_modify calls the function Comm::modify_params(). In that function, the class member Comm::cutghostuser is set to the requested value.
There are several cases, where this value is changed, e.g. some of the GPU package pair styles for TIP4P water.

@aramdbs
Copy link
Author

aramdbs commented Jun 29, 2023

Thanks for all the clarifications. I will try to find an example where Comm::cutghostuser is changed to see if that may work in my case.

@aramdbs
Copy link
Author

aramdbs commented Jun 30, 2023

Sorry, can I double check something to make sure that I understand, in compute_rdf.cpp and compute_adf.cpp, the pair list cutoffs are set to mycutneigh = cutoff_user + skin and then, the following check is performed:
mycutneigh<MAX(force->pair->cutforce+skin, comm->cutghostuser). However, according to how you came up with communication cutoff of 34 above, shouldn't it really be mycutneigh+skin<MAX(force->pair->cutforce+skin,comm->cutghostuser)? Or is it somehow different for compute and fix styles?

Or in other words, when setting pair list cutoff in my fix, should I set it to the required cutoff (user_cutoff) or user_cutoff+skin?

@akohlmey
Copy link
Member

shouldn't it really be mycutneigh+skin<MAX(force->pair->cutforce+skin,comm->cutghostuser)?

No. But the request->set_cutoff() call may be using the wrong argument. I need to check that.

@akohlmey
Copy link
Member

Or is it somehow different for compute and fix styles?

The difference is between perpetual and occasional neighbor lists.
For perpetual neighbor lists, they are automatically rebuilt regularly during the run due to the settings of the neighbor and neigh_modify command. For those the neighbor list skin is added automatically. In compute rdf you have an occasional neighbor list request where the neighbor list is built explicitly with a call to Neighbor::build_one(). At that point, no neighbor list skin distance would be needed unless the programmer want to use an additional buffer in the neighbor list to make certain there are no rounding or other similar issues to would exclude a single pair. For a force computation, these rarely make a difference, since the forces at the cutoff are small and mostly cancel for homogeneous systems. However, compute rdf is more sensitive, specifically the coordination number to missing single pairs of atoms. Thus adding a buffer for safety is a good idea and the neighbor list skin is a good value for that, since it is usually adjusted to the active unit settings. Using a fixed number could lead to a very inefficient choice.

@aramdbs
Copy link
Author

aramdbs commented Jun 30, 2023

Thank you very much! That explains a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

2 participants