
Deadlock in the pre-ckpt phase when the checkpoint interval is short #161

Closed
JainTwinkle opened this issue May 30, 2022 · 9 comments

@JainTwinkle
Collaborator

Quoting @marcpb94 from #133:

It seems to be working now! I tried with mpi_hello_world with 1 rank and a heat distribution application (that we usually use for testing) with 4 ranks.

However, there seems to be an issue we already noticed in the old version of MANA, and we were hoping it was fixed in the new version. It seems to happen with our heat distribution application, I attach the source code so that you are able to reproduce the problem. heatdis.zip

The issue is that when checkpointing with relatively short intervals (and letting it execute for a few minutes), the execution eventually encounters a deadlock in the pre-checkpoint phase. Specifically, at least one of the MPI processes gets stuck in the drainSendRecv() function in the DMTCP_EVENT_PRECHECKPOINT case of mpi_plugin_event_hook().
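For readers new to MANA's checkpoint flow, the call path named above corresponds roughly to the sketch below. This is an illustrative reconstruction based only on the function and event names in this report and the standard DMTCP plugin hook signature, not the actual mpi_plugin.cpp.

#include "dmtcp.h"   /* DMTCP plugin API: DmtcpEvent_t, DMTCP_EVENT_PRECHECKPOINT */

/* Declared here only for the sketch; the real routine lives in MANA's
 * p2p_drain_send_recv.cpp. */
void drainSendRecv(void);

/* Sketch of the dispatch implied by the backtrace: the pre-checkpoint event
 * is handed to the message-draining routine, so if drainSendRecv() blocks,
 * the entire checkpoint stalls. */
void mpi_plugin_event_hook_sketch(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  (void)data;                      /* unused in this sketch */
  switch (event) {
    case DMTCP_EVENT_PRECHECKPOINT:
      drainSendRecv();             /* the reported hang is inside this call */
      break;
    default:
      break;
  }
}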

@JainTwinkle
Collaborator Author

@marcpb94,

Could you provide the backtrace of the hanging process (all threads) and specific instructions to reproduce the issue? For example, the launch command, input to the executable, and short interval (2 sec?).

Thanks!

@gc00
Collaborator

gc00 commented May 30, 2022

@JainTwinkle @marcpb94 Does this deadlock occur if you choose a longer checkpoint interval, such as 5 minutes?
This might be a long-known issue with short checkpoint intervals. We did not concentrate on it earlier, while we were focused on making MANA more robust.
But now that MANA is becoming more robust, this might be a good time to analyze what is causing the deadlock.

@marcpb94

marcpb94 commented May 31, 2022

@JainTwinkle I tend to use 400 as the parameter for the application; I am not sure how much effect this has on the reproducibility of the deadlock. As for the interval, the lower it is, the sooner the deadlock usually appears, although it seems to be a matter of luck. I usually use 2 seconds to make it appear quickly, although sometimes it still takes a while. I have also noticed that the more processes it uses (relative to the available core/thread count on the machine), the easier it is for the issue to appear.

The command used is the following:

mpirun -np 4 bin/mana_launch mpi-proxy-split/test/heatdis.mana.exe 400

I added heatdis to the test folder in mpi-proxy-split to avoid any issues caused by compilation/linking.

The backtrace for the hanging process is the following:

Thread 2 (Thread 0x7f454b3e7700 (LWP 14260)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=158) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=158) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=158) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=158) at miscwrappers.cpp:611
#5  0x00007f454f9a4c6a in SwitchContext::SwitchContext (this=0x7f454b3e50f0, lowerHalfFs=241142976)
    at split_process.cpp:77
#6  0x00007f454f9b2a5b in MPI_Test_internal (request=0x7f454b3e5368, flag=0x7f454b3e516c, status=0x7f454b3e5150, 
    isRealRequest=false) at mpi_request_wrappers.cpp:52
#7  0x00007f454f9b3428 in MPI_Wait (request=0x7f454b3e5368, status=0x1) at mpi_request_wrappers.cpp:304
#8  0x00007f454f9a81b7 in MPI_Recv (buf=0x7f454fdfd008, count=81920, datatype=1275068685, source=1, tag=50, 
    comm=1140850688, status=0x1) at mpi_p2p_wrappers.cpp:153
#9  0x00007f454f99055b in recvMsgIntoInternalBuffer (status=..., comm=1140850688) at p2p_drain_send_recv.cpp:100
#10 0x00007f454f9908d5 in recvFromAllComms () at p2p_drain_send_recv.cpp:187
#11 0x00007f454f990a8a in drainSendRecv () at p2p_drain_send_recv.cpp:234
#12 0x00007f454f98c556 in mpi_plugin_event_hook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0) at mpi_plugin.cpp:318
#13 0x00007f454ecf14aa in dmtcp::PluginManager::eventHook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0)
    at pluginmanager.cpp:136
#14 0x00007f454ece61cf in dmtcp::DmtcpWorker::preCheckpoint () at dmtcpworker.cpp:472
#15 0x00007f454ecfa548 in checkpointhread (dummy=0x0) at threadlist.cpp:412
#16 0x00007f454ecff555 in thread_start (arg=0x7f454fe32008) at threadwrappers.cpp:108
#17 0x00007f454d836ea5 in start_thread (arg=0x7f454b3e7700) at pthread_create.c:307
#18 0x00007f454e26bb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f454fe3b780 (LWP 14239)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=202) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=202) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=202) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=202) at miscwrappers.cpp:611
#5  0x00007f454ed1055f in futex (uaddr=0x7f454ef750f8 <threadResumeLock+24>, futex_op=0, val=2, timeout=0x0, 
    uaddr2=0x0, val3=0) at ../include/futex.h:14
#6  0x00007f454ed10599 in futex_wait (uaddr=0x7f454ef750f8 <threadResumeLock+24>, old_val=2)
    at ../include/futex.h:21
#7  0x00007f454ed10674 in DmtcpMutexLock (mutex=0x7f454ef750f8 <threadResumeLock+24>) at mutex.cpp:59
#8  0x00007f454ed1cdf7 in DmtcpRWLockRdLock (rwlock=0x7f454ef750e0 <threadResumeLock>) at rwlock.cpp:49
#9  0x00007f454ecfb2e4 in stopthisthread (signum=12) at threadlist.cpp:605
#10 <signal handler called>
#11 0x00007f454e2329fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#12 0x00007f454ecfe31e in dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:362
#13 0x00007f454ecfe8e7 in dmtcp_plugin_disable_ckpt () at threadsync.cpp:533
#14 0x00007f454f9a7dc3 in MPI_Isend (buf=0x7f453e381010, count=10240, datatype=1275070475, dest=1, tag=50, 
    comm=1140850688, request=0x7ffd5395dac0) at mpi_p2p_wrappers.cpp:74
#15 0x0000000000400cfb in doWork (numprocs=4, rank=0, M=10240, nbLines=2563, g=0x7f4531b6d010, h=0x7f453e3aa010)
    at heatdis.c:59
#16 0x0000000000401215 in main (argc=2, argv=0x7ffd5395dc38) at heatdis.c:115

@gc00 I have seen the issue occur at times with intervals as long as 20-30 seconds (I believe I remember it happening a few times with 1 min), but longer intervals tend to be fine. However, the issue might just take much longer to appear, to the point where I have simply not run applications for that long.

@JainTwinkle
Collaborator Author

Thanks, @marcpb94! I was able to reproduce this issue. We are looking into it.

@karya0 @xuyao0127 @dahongli
Rank 3 is the one that is currently stuck. Please find the backtrace of all four ranks here: heat-distribution-4-ranks-backtrace.txt

Yao,
We suspect that this might be a design issue in how messages from MPI_Isend are drained, and we think you know the pre-ckpt-phase algorithm best. Could you please take a look at the backtraces?
The source is available here: heatdis.zip

@JainTwinkle
Collaborator Author

Update: @xuyao0127 says that he is able to reproduce the issue on Cori, and he is working on it.

@xuyao0127
Collaborator

As @JainTwinkle said, this is a general bug in point-to-point communication. When draining point-to-point messages at checkpoint time, MPI_Iprobe detects an available message in the network, but the following MPI_Recv cannot receive the message, which blocks checkpoint progress. There is a similar heat-equation program included in MANA that uses blocking point-to-point communication, and it does not have this issue. So I believe the bug is related to multiple non-blocking communications being in flight before the checkpoint.
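For context, the drain step described above follows a probe-then-receive pattern roughly like the sketch below. This is illustrative only, not MANA's actual drainSendRecv()/recvMsgIntoInternalBuffer() code, and the buffering is deliberately simplified.

#include <mpi.h>
#include <stdlib.h>

/* Sketch of a pre-checkpoint drain loop. MPI_Iprobe reports that a message
 * is available, so a blocking MPI_Recv is posted to pull it into a local
 * buffer. The deadlock described above occurs when the probe succeeds but
 * the matching receive never completes, stalling the pre-checkpoint phase. */
void drain_pending_messages_sketch(MPI_Comm comm)
{
  int flag = 1;
  MPI_Status status;

  while (1) {
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
      break;                       /* nothing left in flight on this comm */

    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    void *buf = malloc(count);

    /* This is the call that hangs in the reported backtrace (frames #8-#11). */
    MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             comm, MPI_STATUS_IGNORE);

    /* MANA would stash the payload in an internal buffer and replay it after
     * the checkpoint; the sketch just discards it. */
    free(buf);
  }
}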

I observed a pattern where, between two neighboring ranks, one rank has created an MPI_Isend and an MPI_Irecv request and is waiting on both, while the other rank is still at the beginning of that round of communication (no request created yet). So far I cannot reproduce the same bug with 2 ranks or with a simpler test program. I am still working on it.
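The exchange described above corresponds roughly to a halo-exchange step like the sketch below. This is illustrative only, not the actual heatdis.c source; the tag value 50 is taken from the backtrace, while the function and variable names are made up for the example.

#include <mpi.h>

/* Sketch of the neighbor exchange implied by the backtrace: each rank posts
 * a non-blocking send and receive toward its neighbors and then waits on
 * both requests. If the checkpoint signal arrives while one rank is blocked
 * in MPI_Waitall and its neighbor has not yet posted the matching calls,
 * the drain logic can probe a message that it then fails to receive. */
void exchange_halo_sketch(double *send_row, double *recv_row, int row_len,
                          int up, int down, MPI_Comm comm)
{
  MPI_Request reqs[2];

  MPI_Isend(send_row, row_len, MPI_DOUBLE, down, 50, comm, &reqs[0]);
  MPI_Irecv(recv_row, row_len, MPI_DOUBLE, up,   50, comm, &reqs[1]);

  /* One rank sits here while its neighbor has not yet entered the exchange. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}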

@gc00
Collaborator

gc00 commented Jun 7, 2022

@marcpb94, could you please try fetching @xuyao0127's branch from PR #165? (@xuyao0127 found this solution after discussions with @JainTwinkle.) I'm still reviewing it, but I suspect that it will fix the bug you discovered.

Thanks very much for reporting the bug! This was an important conceptual flaw in the previous software design that would randomly cause a failure in MANA.

@marcpb94

marcpb94 commented Jun 7, 2022

@gc00 @JainTwinkle @xuyao0127 I fetched the PR and have been running the heat application for an hour while checkpointing with a 1-second interval, with no deadlocks so far. Considering that the deadlock used to appear rather quickly at that checkpoint frequency, the bug might actually be fixed! I assume more thorough testing needs to be done on your end, so it is probably wise to leave this issue open until someone else confirms it.

Thanks a lot!

@gc00
Collaborator

gc00 commented Jun 8, 2022

@marcpb94,
Thank you again for reporting this important bug, and for confirming that it is fixed in your environment. We have now merged PR #165 into 'main', so I am closing this issue.

@gc00 gc00 closed this as completed Jun 8, 2022