Deadlock in the pre-ckpt phase when the checkpoint interval is short #161
Could you provide the backtrace of the hanging process (all threads) and specific instructions to reproduce the issue? For example, the launch command, input to the executable, and short interval (2 sec?). Thanks!
@JainTwinkle @marcpb94 Does this deadlock occur if you choose a longer checkpoint interval, such as 5 minutes?
@JainTwinkle I tend to use 400 as the parameter for the application; I'm not sure how much effect this has on the reproducibility of the deadlock. As for the interval, the lower it is, the quicker the deadlock usually appears, although it seems to be a matter of luck. I usually use 2 seconds to make it appear quickly, although sometimes it takes a while. I have also noticed that the more processes it uses (relative to the available core/thread count of the machine), the easier it is for the issue to appear. The command used is the following:
I added heatdis to the test folder in mpi-proxy-split to avoid any issues caused by compilation/linking. The backtrace for the hanging process is the following:
@gc00 I have seen the issue occur at times with intervals as long as 20-30 seconds (I remember a few times when it happened with 1 min, I believe), but longer intervals tend to be fine. However, the issue might just take a much longer time to appear, to the point where I have simply not run applications for that long.
Thanks, @marcpb94! I was able to reproduce this issue. We are looking into it. @karya0 @xuyao0127 @dahongli Yao,
Update: @xuyao0127 says that he is able to reproduce the issue on Cori, and he is working on it.
As @JainTwinkle said, this is a general bug in point-to-point communication. When draining point-to-point messages at checkpoint time, MPI_Iprobe detected an available message in the network, but the following MPI_Recv could not receive that message, blocking checkpoint progress. There is a similar heat equation program included in MANA that uses blocking point-to-point communication, and it doesn't have this issue, so I believe the bug is related to multiple non-blocking communications being in flight before the checkpoint. I observed a pattern in which, between two neighboring ranks, one rank has created an MPI_Isend and an MPI_Irecv request and is waiting on both, while the other rank is still at the beginning of that round of communication (no request created). Currently, I cannot reproduce the same bug with 2 ranks or a simpler test program. I am still working on it.
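To make the failure mode easier to picture, here is a minimal C/MPI sketch of the two interacting patterns described above. The 2-rank setup, tag 0, and single-int payload are assumptions made only for illustration, and `drain_incoming()` is a hypothetical stand-in for MANA's checkpoint-time drain, not its real code. As noted in the comment above, a simple program like this does not reliably reproduce the hang; it only shows how an MPI_Iprobe followed by a blocking MPI_Recv can collide with an MPI_Irecv that the application has already posted.

```c
/* Illustrative sketch (NOT MANA's actual code).  Compile/run with, e.g.:
 *   mpicc sketch.c -o sketch && mpirun -np 2 ./sketch
 * Depending on timing and the MPI implementation, this program may hang,
 * which is the point being illustrated. */
#include <mpi.h>
#include <stdio.h>

/* Checkpoint-time drain pattern: probe for in-flight messages and pull
 * them with a blocking receive.  If a probed message ends up matched by
 * an MPI_Irecv that the application posted earlier, this MPI_Recv can
 * block forever -- the hang reported in this issue. */
static void drain_incoming(void)
{
    int flag = 0;
    MPI_Status status;
    while (1) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (!flag)
            break;
        int buf;
        MPI_Recv(&buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Application pattern observed at checkpoint time: a rank has an
     * MPI_Isend and an MPI_Irecv outstanding and waits on both, while
     * its neighbor may not yet have started this round. */
    int send_val = rank, recv_val = -1;
    int peer = (rank + 1) % size;
    MPI_Request reqs[2];
    MPI_Isend(&send_val, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_val, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* In MANA the drain runs on the checkpoint thread while requests
     * like reqs[1] are still pending; modeled here by calling it before
     * the waits complete. */
    drain_incoming();

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recv_val);

    MPI_Finalize();
    return 0;
}
```

The sketch is only meant to show why the drain's blocking MPI_Recv and the application's pending MPI_Irecv compete for the same probed message; the real application (heatdis) has many such requests per rank, which is presumably why the race is easier to hit there.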
@marcpb94, could you please try fetching the branch of @xuyao0127 at PR #165? (@xuyao0127 found this solution after discussions with @JainTwinkle.) I'm still reviewing it, but I suspect that this will fix the bug that you discovered. Thanks very much for reporting it! This was an important conceptual flaw in the previous software design that would randomly cause a failure in MANA.
@gc00 @JainTwinkle @xuyao0127 I fetched the PR and have been running the heat application for an hour while checkpointing with a frequency of 1 second, with no deadlocks so far. Considering that the deadlock appeared rather quickly at that checkpoint frequency before, the bug might actually be fixed! I assume more thorough testing needs to be done on your end, so it's probably wise to leave this issue open until someone else confirms it. Thanks a lot!
Quoting @marcpb94 from #133: