
VR does not run in MPI+hydro mode #54

Closed
MatthieuSchaller opened this issue Dec 10, 2020 · 6 comments


MatthieuSchaller commented Dec 10, 2020

Describe the bug

Running VR with MPI switched on, OpenMP switched off and hydro switched on breaks on EAGLE boxes.

To Reproduce

  1. Compile the latest master (cb4336d) with VR_USE_HYDRO=ON, VR_MPI=ON and VR_OPENMP=OFF.
  2. Run mpirun -np 16 stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036 -o halos_mpi_0036 -I 2

This is a standard XL snapshot with our standard config file.
The code segfaults after printing

[0000] [ 346.430] [ info] search.cxx:352 Finished linking across MPI domains in 1.258 [min]

somewhere in an MPI call trying to clear a vector of size 10^16 (!!).

The input can be found here if necessary: /snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/eagle_0036.hdf5 and /snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/vrconfig_3dfof_subhalos_SO_hydro.cfg.

The same setup works either:

  • On 1 MPI rank
  • Without VR_USE_HYDRO

Note that the snapshot consists of a single file, if that is relevant.


rtobar commented Dec 11, 2020

@MatthieuSchaller thanks for the report. I don't think I'll be able to look into this until early next week though. In the meantime, could you please share the data, or push for my cosma application to proceed? I tried to reproduce locally with some of the inputs I had received in the past, but they seem to lack the fields needed by the hydro-enabled code.

MatthieuSchaller (Author) commented

If it helps, there is a smaller test case here:

  • /snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_low_z/EAGLE_6/eagle_0000.hdf5
  • /snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_low_z/EAGLE_6/vrconfig_3dfof_subhalos_SO_hydro.cfg

This one crashes about 20 s after starting, so it might be easier to work with.

Config is: cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_BUILD_TYPE=Debug
Run command line is: stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0000 -o halos_0000 -I 2

  • The problem happens with GCC or ICC.
  • Running with -DVR_OPENMP=OFF also crashes in the same way.
  • Running without VR_MPI_REDUCE crashes in a different way; there the crash happens while reading in the data.


rtobar commented Dec 16, 2020

Thanks @MatthieuSchaller for the extra information. I managed to reproduce this locally with the data you pointed to above. I hope I can post an update with more information soon.

rtobar added a commit that referenced this issue Dec 16, 2020
The MPI sendrecv operations happening a few lines below require
proprecvbuff to be the same size as the corresponding propsendbuff; that
is, numrecv * numextrafields. This became clear when running the program
under valgrind and then inspecting it under a debugger.

This commit fixes the immediate problem. A separate commit will add
assertions on the sizes of the send buffers to make the code clearer.

This addresses #54.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>

rtobar commented Dec 16, 2020

Running under valgrind revealed an invalid write:

==14074==    by 0x2D2CC6: MPISendReceiveFOFHydroInfoBetweenThreads(Options&, fofid_in*, std::vector<long long, std::allocator<long long> >&, std::vector<float, std::allocator<float> >&, int, int, ompi_communicator_t*&) (mpiroutines.cxx:2697)

Then in gdb the reason became clear: the proprecvbuff vector needed to be properly sized before receiving data. Its size should be the same as that of propsendbuff, which in turn should be the size of indicessend times numextrafields.
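In rough terms, the required relationship looks like the sketch below. This is schematic only, not the actual code in mpiroutines.cxx: the wrapping function, the tag and the use of MPI_FLOAT for the property data are assumptions; the buffer and count names follow the discussion above.

#include <mpi.h>
#include <cassert>
#include <vector>

// Schematic sketch: size the property reception buffer before the exchange.
void receive_properties(const std::vector<long long> &indicessend,
                        const std::vector<float> &propsendbuff,
                        std::vector<float> &proprecvbuff,
                        size_t numrecv, size_t numextrafields,
                        int otherrank, MPI_Comm comm)
{
    // The send side always holds indicessend.size() * numextrafields values.
    assert(propsendbuff.size() == indicessend.size() * numextrafields);
    // The fix: the receive side must hold numrecv * numextrafields values.
    proprecvbuff.resize(numrecv * numextrafields);
    MPI_Sendrecv(propsendbuff.data(), (int)propsendbuff.size(), MPI_FLOAT, otherrank, 0,
                 proprecvbuff.data(), (int)proprecvbuff.size(), MPI_FLOAT, otherrank, 0,
                 comm, MPI_STATUS_IGNORE);
}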

On a new issue-54 branch I fixed the resizing of proprecvbuff to accommodate all incoming data, and also added an assertion to make sure the sizes of the input vectors are related to each other as expected. After these changes I can run the code to completion. @MatthieuSchaller please confirm that this is working for you too.

I also tried running without VR_MPI_REDUCE and got a crash as well. That looks like a separate issue though, so I'll create a separate ticket to track it.

MatthieuSchaller (Author) commented

I confirm this works both with and without OpenMP (as expected, since that is unrelated). Thanks for tracking this down!

(I am never brave enough to fire up valgrind inside an mpirun call...)


rtobar commented Dec 16, 2020

Great! I merged this into the master branch now, thanks for testing!

rtobar closed this as completed Dec 16, 2020
rtobar added a commit that referenced this issue Jun 15, 2021
The different routines exchanging extra information (gas, star, BH and
dark matter) during particle MPI exchange pass around two buffers,
indices and property data, whose sizes are related: for N indices there
are N * M properties. However, most of the routines were flawed because
they allocated a property reception array of N elements, leading to
memory corruption.

Additionally, this problem affected all these routines (except one)
because the code performing these MPI operations was duplicated in all
of them.

This commit fixes both problems in one go. Firstly, it adds a new
function to perform the data exchange, which is then called by the four
original routines. Secondly, the new centralised version of the data
exchange code correctly sizes the reception buffers to avoid memory
corruption.

This commit addresses the issue reported in #87. A similar situation had
been reported in #54, but at the time I hadn't realised this was a
problem affecting several similar-looking functions, and therefore the
fix at the time (082ff68) was constrained to the routine where the
problem was originally reported, leaving the rest still to be fixed.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
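For illustration only, the shape of the refactor described in the commit message might look like the following sketch. The helper name, its parameters and the MPI datatypes are hypothetical, not taken from the actual VR source; the point is that both reception buffers are sized in one shared place.

#include <mpi.h>
#include <cassert>
#include <vector>

// Hypothetical centralised exchange helper called by the gas/star/BH/DM
// routines; for N indices there are N * numextrafields property values.
void ExchangeExtraInfo(const std::vector<long long> &indicessend,
                       const std::vector<float> &propsendbuff,
                       std::vector<long long> &indicesrecv,
                       std::vector<float> &proprecvbuff,
                       size_t numrecv, size_t numextrafields,
                       int otherrank, MPI_Comm comm)
{
    // Send-side consistency check, then size both reception buffers.
    assert(propsendbuff.size() == indicessend.size() * numextrafields);
    indicesrecv.resize(numrecv);
    proprecvbuff.resize(numrecv * numextrafields);
    MPI_Sendrecv(indicessend.data(), (int)indicessend.size(), MPI_LONG_LONG, otherrank, 0,
                 indicesrecv.data(), (int)indicesrecv.size(), MPI_LONG_LONG, otherrank, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(propsendbuff.data(), (int)propsendbuff.size(), MPI_FLOAT, otherrank, 1,
                 proprecvbuff.data(), (int)proprecvbuff.size(), MPI_FLOAT, otherrank, 1,
                 comm, MPI_STATUS_IGNORE);
}
// The four per-species routines then reduce to thin wrappers around this helper.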