
Deadlock in the pre-ckpt phase when the checkpoint interval is short #161

Closed
JainTwinkle opened this issue May 30, 2022 · 9 comments

@JainTwinkle
Collaborator

Quoting @marcpb94 from #133:

It seems to be working now! I tried with mpi_hello_world with 1 rank and a heat distribution application (that we usually use for testing) with 4 ranks.

However, there seems to be an issue we already noticed in the old version of MANA, and we were hoping it was fixed in the new version. It seems to happen with our heat distribution application, I attach the source code so that you are able to reproduce the problem. heatdis.zip

The issue is that when checkpointing with relatively short intervals (and letting it execute for a few minutes), the execution eventually encounters a deadlock in the pre-checkpoint phase. Specifically, at least one of the MPI processes gets stuck in the drainSendRecv() function in the DMTCP_EVENT_PRECHECKPOINT case of mpi_plugin_event_hook().
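For readers new to MANA's checkpoint flow, the call path named above corresponds roughly to the sketch below. This is an illustrative reconstruction based only on the function and event names in this report and the standard DMTCP plugin hook signature, not the actual mpi_plugin.cpp.

#include "dmtcp.h"   /* DMTCP plugin API: DmtcpEvent_t, DMTCP_EVENT_PRECHECKPOINT */

/* Declared here only for the sketch; the real routine lives in MANA's
 * p2p_drain_send_recv.cpp. */
void drainSendRecv(void);

/* Sketch of the dispatch implied by the backtrace: the pre-checkpoint event
 * is handed to the message-draining routine, so if drainSendRecv() blocks,
 * the entire checkpoint stalls. */
void mpi_plugin_event_hook_sketch(DmtcpEvent_t event, DmtcpEventData_t *data)
{
  (void)data;                      /* unused in this sketch */
  switch (event) {
    case DMTCP_EVENT_PRECHECKPOINT:
      drainSendRecv();             /* the reported hang is inside this call */
      break;
    default:
      break;
  }
}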

@JainTwinkle
Collaborator Author

@marcpb94,

Could you provide the backtrace of the hanging process (all threads) and specific instructions to reproduce the issue? For example, the launch command, input to the executable, and short interval (2 sec?).

Thanks!

@gc00
Collaborator

gc00 commented May 30, 2022

@JainTwinkle @marcpb94 Does this deadlock occur if you choose a longer checkpoint interval, such as 5 minutes?
This might be a long-known issue with short checkpoint intervals. We did not concentrate on it earlier, while we were focused on making MANA more robust.
But now that MANA is becoming more robust, this might be a good time to analyze what is causing the deadlock.

@marcpb94

marcpb94 commented May 31, 2022

@JainTwinkle I tend to use 400 as the parameter for the application; I am not sure how much effect this has on the reproducibility of the deadlock. As for the interval, the lower it is, the sooner the deadlock usually appears, although it seems to be a matter of luck. I usually use 2 seconds to make it appear quickly, although sometimes it still takes a while. I have also noticed that the more processes it uses (relative to the available core/thread count on the machine), the easier it is for the issue to appear.

The command used is the following:

mpirun -np 4 bin/mana_launch mpi-proxy-split/test/heatdis.mana.exe 400

I added heatdis to the test folder in mpi-proxy-split to avoid any issues caused by compilation/linking.

The backtrace for the hanging process is the following:

Thread 2 (Thread 0x7f454b3e7700 (LWP 14260)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=158) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=158) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=158) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=158) at miscwrappers.cpp:611
#5  0x00007f454f9a4c6a in SwitchContext::SwitchContext (this=0x7f454b3e50f0, lowerHalfFs=241142976)
    at split_process.cpp:77
#6  0x00007f454f9b2a5b in MPI_Test_internal (request=0x7f454b3e5368, flag=0x7f454b3e516c, status=0x7f454b3e5150, 
    isRealRequest=false) at mpi_request_wrappers.cpp:52
#7  0x00007f454f9b3428 in MPI_Wait (request=0x7f454b3e5368, status=0x1) at mpi_request_wrappers.cpp:304
#8  0x00007f454f9a81b7 in MPI_Recv (buf=0x7f454fdfd008, count=81920, datatype=1275068685, source=1, tag=50, 
    comm=1140850688, status=0x1) at mpi_p2p_wrappers.cpp:153
#9  0x00007f454f99055b in recvMsgIntoInternalBuffer (status=..., comm=1140850688) at p2p_drain_send_recv.cpp:100
#10 0x00007f454f9908d5 in recvFromAllComms () at p2p_drain_send_recv.cpp:187
#11 0x00007f454f990a8a in drainSendRecv () at p2p_drain_send_recv.cpp:234
#12 0x00007f454f98c556 in mpi_plugin_event_hook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0) at mpi_plugin.cpp:318
#13 0x00007f454ecf14aa in dmtcp::PluginManager::eventHook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0)
    at pluginmanager.cpp:136
#14 0x00007f454ece61cf in dmtcp::DmtcpWorker::preCheckpoint () at dmtcpworker.cpp:472
#15 0x00007f454ecfa548 in checkpointhread (dummy=0x0) at threadlist.cpp:412
#16 0x00007f454ecff555 in thread_start (arg=0x7f454fe32008) at threadwrappers.cpp:108
#17 0x00007f454d836ea5 in start_thread (arg=0x7f454b3e7700) at pthread_create.c:307
#18 0x00007f454e26bb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f454fe3b780 (LWP 14239)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=202) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=202) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=202) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=202) at miscwrappers.cpp:611
#5  0x00007f454ed1055f in futex (uaddr=0x7f454ef750f8 <threadResumeLock+24>, futex_op=0, val=2, timeout=0x0, 
    uaddr2=0x0, val3=0) at ../include/futex.h:14
#6  0x00007f454ed10599 in futex_wait (uaddr=0x7f454ef750f8 <threadResumeLock+24>, old_val=2)
    at ../include/futex.h:21
#7  0x00007f454ed10674 in DmtcpMutexLock (mutex=0x7f454ef750f8 <threadResumeLock+24>) at mutex.cpp:59
#8  0x00007f454ed1cdf7 in DmtcpRWLockRdLock (rwlock=0x7f454ef750e0 <threadResumeLock>) at rwlock.cpp:49
#9  0x00007f454ecfb2e4 in stopthisthread (signum=12) at threadlist.cpp:605
#10 <signal handler called>
#11 0x00007f454e2329fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#12 0x00007f454ecfe31e in dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:362
#13 0x00007f454ecfe8e7 in dmtcp_plugin_disable_ckpt () at threadsync.cpp:533
#14 0x00007f454f9a7dc3 in MPI_Isend (buf=0x7f453e381010, count=10240, datatype=1275070475, dest=1, tag=50, 
    comm=1140850688, request=0x7ffd5395dac0) at mpi_p2p_wrappers.cpp:74
#15 0x0000000000400cfb in doWork (numprocs=4, rank=0, M=10240, nbLines=2563, g=0x7f4531b6d010, h=0x7f453e3aa010)
    at heatdis.c:59
#16 0x0000000000401215 in main (argc=2, argv=0x7ffd5395dc38) at heatdis.c:115

@gc00 I have seen the issue occur at times with intervals as long as 20-30 seconds (I believe I remember it happening a few times with 1 min), but longer intervals tend to be fine. However, the issue might just take much longer to appear, to the point where I have simply not run applications for that long.

@JainTwinkle
Collaborator Author

Thanks, @marcpb94! I was able to reproduce this issue. We are looking into it.

@karya0 @xuyao0127 @dahongli
Rank 3 is the one that is currently stuck. Please find the backtrace of all four ranks here: heat-distribution-4-ranks-backtrace.txt

Yao,
We suspect that this might be a design issue in how messages from MPI_Isend are drained, and we think you know the pre-ckpt-phase algorithm best. Could you please take a look at the backtraces?
The source is available here: heatdis.zip

@JainTwinkle
Collaborator Author

Update: @xuyao0127 says that he is able to reproduce the issue on Cori, and he is working on it.

@xuyao0127
Collaborator

As @JainTwinkle said, this is a general bug in point-to-point communication. When draining point-to-point messages at checkpoint time, MPI_Iprobe detects an available message in the network, but the following MPI_Recv cannot receive the message, which blocks checkpoint progress. There is a similar heat-equation program included in MANA that uses blocking point-to-point communication, and it does not have this issue. So I believe the bug is related to multiple non-blocking communications being in flight before the checkpoint.
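For context, the drain step described above follows a probe-then-receive pattern roughly like the sketch below. This is illustrative only, not MANA's actual drainSendRecv()/recvMsgIntoInternalBuffer() code, and the buffering is deliberately simplified.

#include <mpi.h>
#include <stdlib.h>

/* Sketch of a pre-checkpoint drain loop. MPI_Iprobe reports that a message
 * is available, so a blocking MPI_Recv is posted to pull it into a local
 * buffer. The deadlock described above occurs when the probe succeeds but
 * the matching receive never completes, stalling the pre-checkpoint phase. */
void drain_pending_messages_sketch(MPI_Comm comm)
{
  int flag = 1;
  MPI_Status status;

  while (1) {
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
      break;                       /* nothing left in flight on this comm */

    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    void *buf = malloc(count);

    /* This is the call that hangs in the reported backtrace (frames #8-#11). */
    MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             comm, MPI_STATUS_IGNORE);

    /* MANA would stash the payload in an internal buffer and replay it after
     * the checkpoint; the sketch just discards it. */
    free(buf);
  }
}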

I observed a pattern where, between two neighboring ranks, one rank has created an MPI_Isend and an MPI_Irecv request and is waiting on both, while the other rank is still at the beginning of that round of communication (no request created yet). So far I cannot reproduce the same bug with 2 ranks or with a simpler test program. I am still working on it.
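The exchange described above corresponds roughly to a halo-exchange step like the sketch below. This is illustrative only, not the actual heatdis.c source; the tag value 50 is taken from the backtrace, while the function and variable names are made up for the example.

#include <mpi.h>

/* Sketch of the neighbor exchange implied by the backtrace: each rank posts
 * a non-blocking send and receive toward its neighbors and then waits on
 * both requests. If the checkpoint signal arrives while one rank is blocked
 * in MPI_Waitall and its neighbor has not yet posted the matching calls,
 * the drain logic can probe a message that it then fails to receive. */
void exchange_halo_sketch(double *send_row, double *recv_row, int row_len,
                          int up, int down, MPI_Comm comm)
{
  MPI_Request reqs[2];

  MPI_Isend(send_row, row_len, MPI_DOUBLE, down, 50, comm, &reqs[0]);
  MPI_Irecv(recv_row, row_len, MPI_DOUBLE, up,   50, comm, &reqs[1]);

  /* One rank sits here while its neighbor has not yet entered the exchange. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}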

@gc00
Collaborator

gc00 commented Jun 7, 2022

@marcpb94, could you please try fetching @xuyao0127's branch from PR #165? (@xuyao0127 found this solution after discussions with @JainTwinkle.) I'm still reviewing it, but I suspect that it will fix the bug you discovered.

Thanks very much for reporting the bug! This was an important conceptual flaw in the previous software design that would randomly cause a failure in MANA.

@marcpb94

marcpb94 commented Jun 7, 2022

@gc00 @JainTwinkle @xuyao0127 I fetched the PR and have been running the heat application for an hour while checkpointing with a 1-second interval, with no deadlocks so far. Considering that the deadlock used to appear rather quickly at that checkpoint frequency, the bug might actually be fixed! I assume more thorough testing needs to be done on your end, so it is probably wise to leave this issue open until someone else confirms it.

Thanks a lot!

@gc00
Collaborator

gc00 commented Jun 8, 2022

@marcpb94,
Thank you again for reporting this important bug, and for confirming that it is fixed in your environment. We have now merged PR #165 into 'main', so I am closing this issue.

@gc00 gc00 closed this as completed Jun 8, 2022