VR does not run in MPI+hydro mode #54
Comments
@MatthieuSchaller thanks for the report. I don't think I'll be able to look into this until early next week, though. In the meantime, could you please share the data, or push for my cosma application to proceed? I tried to reproduce locally with some of the inputs I had received in the past, but they seem to lack the fields needed by the hydro-enabled code.
If it helps, there is a smaller test case here:
This one crashes about 20s after start, so it might be easier. The config is:
Thanks @MatthieuSchaller for the extra information. I managed to reproduce this locally with the data you pointed to above; I hope I can post an update soon with more information.
The MPI sendrecv operations happening a few lines below require proprecvbuff to be the same size as the corresponding propsendbuff; that is, numrecv * numextrafields. This became clear when running the program under valgrind, and after inspecting it under a debugger. This commit fixes the immediate problem. A separate one will add assertions on the sizes of the send buffers to bring clarity to the code. This addresses #54. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Running under valgrind revealed an invalid write, and in gdb the reason became clear: proprecvbuff was allocated smaller than the data received into it. I also tried out running without OpenMP.
I confirm this works both with and without OpenMP (as expected, since it is unrelated). Thanks for tracking this down! (I am never brave enough to fire up valgrind inside an mpirun call...)
Great! I merged this in.
The different routines exchanging extra information (gas, stars, BH and dark matter) during particle MPI exchange pass around two buffers, indices and property data, whose sizes are related: for N indices there are N * M properties. However, most of the routines were flawed because they allocated a property reception array of N elements, leading to memory corruption. Additionally, this problem affected all these routines (except one) because the code performing these MPI operations was duplicated in all of them. This commit fixes these two problems in one go. Firstly, it adds a new function to perform the data exchange that is then called by the four original routines. Secondly, the new centralised version of the data exchange code correctly sizes the input buffers to avoid memory corruption. This commit addresses the issue reported in #87. A similar situation had been reported in #54, but at the time I hadn't realised this was a problem affecting several similar-looking functions, and therefore the fix at the time (082ff68) was constrained to the routine where the problem was originally reported, leaving the rest still broken. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Describe the bug
Running VR with MPI switched on, OpenMP switched off and hydro switched on breaks on EAGLE boxes.
To Reproduce
Configure with VR_USE_HYDRO=ON, VR_MPI=ON and VR_OPENMP=OFF, then run:
mpirun -np 16 stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036 -o halos_mpi_0036 -I 2
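For reference, the three switches above are CMake options, so a build reproducing this configuration would look roughly as follows. This is a sketch: the out-of-source build directory and make invocation are assumptions, not taken from the report.

```shell
# Assumed CMake build; the VR_* switches are passed as -D options.
mkdir -p build && cd build
cmake .. -DVR_USE_HYDRO=ON -DVR_MPI=ON -DVR_OPENMP=OFF
make -j
# Then run as reported:
mpirun -np 16 ./stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg \
    -i eagle_0036 -o halos_mpi_0036 -I 2
```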
This is a standard XL snapshot with our standard config file.
The code segfaults shortly after its initial output, somewhere in an MPI call trying to clear a vector of size 10^16 (!!).
The input can be found here if necessary:
/snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/eagle_0036.hdf5
and /snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/vrconfig_3dfof_subhalos_SO_hydro.cfg.
The same setup works either:
Note that the snapshot is made of a single file, if that is relevant.