Error loading libmana.so #133
Hi @marcpb94 (and @fyshhh), However, this appears to be a bug in how we configure MANA on CentOS. Could you do:
@marcpb94, and whatever happens, please let us know, so that we can update our documentation. Thanks. @fyshhh, |
Hi,
Doing Thanks for the help! |
Hi, It turns out that yesterday I was executing the "non-mana" compiled versions of the examples (surprisingly, they worked and performed checkpoints, although they do not seem to recover successfully). When I execute the mana.exe versions of the examples, the applications hang and produce no output whatsoever. |
@marcpb94 , if you're comfortable using email, you can reach me at gene with the d domain: ccs neu edu @fyshhh , could you also try building MANA one more time? Two weeks ago, we changed to a new structure in the current default branch. I suspect that you'll need my latest PR to build MANA. Thanks for checking this. |
@JainTwinkle , This seems to be the same bug that you were looking at today. @fyshhh , Can you go through the git history to see which commit caused this to break? I wonder if it's related to us now using the feature/dmtcp-master branch. Thanks for looking into this. |
@gc00 It may be the same bug, but we need the backtrace to confirm it. @fyshhh and @marcpb94, could either of you check whether it is reaching the segfault handler and getting stuck in an infinite loop? @gc00, maybe we should print a message on the terminal whenever the segfault handler is called. Currently, I see no message stating that a segfault has happened and that I am supposed to attach gdb. |
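To illustrate the suggestion above of printing a message whenever the segfault handler is called, here is a minimal sketch in C. It is not MANA's or DMTCP's actual handler: the names `segv_notice_handler`, `install_segv_notice`, and `segv_seen` are hypothetical, and the message text is made up. It only uses `write()`, which is async-signal-safe, to report the signal on stderr.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag so that other code can tell the handler has run. */
static volatile sig_atomic_t segv_seen = 0;

/* Hypothetical handler: report the signal using only async-signal-safe calls. */
static void segv_notice_handler(int signo, siginfo_t *info, void *ctx) {
    static const char msg[] =
        "SIGSEGV caught; attach gdb to this process to inspect it\n";
    (void)signo;
    (void)info;
    (void)ctx;
    segv_seen = 1;
    ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n; /* nothing useful can be done if the write fails */
}

/* Install the handler for SIGSEGV; returns 0 on success, -1 on failure. */
int install_segv_notice(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_notice_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

After a real memory fault, returning from the handler re-executes the faulting instruction, so a real handler would typically block at the end (for example in a wait loop) until a debugger attaches. The sketch leaves that part out.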
@JainTwinkle yes, it was reaching the segfault handler. This is the backtrace when run on the
|
@gc00 @JainTwinkle the last working commit is [ |
@fyshhh Thanks! I'm also looking into it. For me, if I run mana_launch (actual command) under gdb, mpi_hello_world works fine (at least for a single rank). So, I'm guessing that it can be some subtle memory corruption. |
Backtrace with more info:
|
@gc00 and @fyshhh, I figured out the issue. Short story: Initially, one environment variable's location is in the heap, and later, the heap gets shrunk/unmapped. So, the env variable's address becomes invalid. On my machine, it's the PATH env variable. Long story:
PATH's address is in the heap:
Later, the heap gets unmapped and the env pointer still points to the heap. So, the program segfaults. The following is the relevant code from glibc: (see my comment @
I still need to figure out why the heap gets de-allocated and how we can make sure that the PATH stays valid. I can think of a few solutions but I'd like to gather more information at this point. |
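For anyone who wants to check the same condition on their own machine, here is a rough sketch in C (Linux only). It assumes that `/proc/self/maps` is readable and that the kernel labels the brk heap as `[heap]`; the helper name `addr_in_named_mapping` is hypothetical and not part of MANA or DMTCP.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if addr lies inside a mapping whose /proc/self/maps line
 * contains label (for example "[heap]"), 0 if it does not, and -1 if
 * /proc/self/maps cannot be opened.
 */
int addr_in_named_mapping(const void *addr, const char *label) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        return -1;
    }

    char line[512];
    int found = 0;
    uintptr_t target = (uintptr_t)addr;

    while (fgets(line, sizeof(line), maps) != NULL) {
        uintmax_t start, end;
        /* Each mapping line starts with "start-end" in hexadecimal. */
        if (sscanf(line, "%jx-%jx", &start, &end) != 2) {
            continue;
        }
        if (target >= start && target < end && strstr(line, label) != NULL) {
            found = 1;
            break;
        }
    }

    fclose(maps);
    return found;
}
```

Calling `addr_in_named_mapping(getenv("PATH"), "[heap]")` early in the run should report whether PATH currently points into the heap. In an ordinary process the environment strings normally live on the stack, so a result of 1 would match the situation described above.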
I looked further into this and also compared GDB (successful) vs. non-GDB (failed) runs. It's the Following is the backtrace to show when the heap set up by MANA gets unmapped:
Non-GDB: The procmaps below are self-explanatory:
GDB (running mana_launch/dmtcp_launch under gdb directly):
One thing to note is that in the failed case, the heap's location (where PATH env var exists) is not right after the data section of the executable, and it changes in each run. However, in GDB, it is right after the data section, and it doesn't overlap with the arena. So the heap and PATH's location stay valid after the I am not sure but it seems that we are messing up the |
@marcpb94 , The new architecture was required so that MANA would inherit the latest version of DMTCP. (DMTCP is the underlying platform for MANA.) Prior to that, the student who developed the original MANA in his thesis had frozen a version of DMTCP, in order to have a stable platform for his thesis. We had to significantly change the MANA architecture to make DMTCP a git submodule in the MANA github repo. We had tested the new architecture on the Cori supercomputer, but apparently not on CentOS. Sorry for the rough ride. |
No worries @gc00, I'm glad it is getting attention and fixed soon! I'm currently working with the older version (with the frozen version of DMTCP), and we wanted to move to the new version to see if some of the issues we experienced are fixed (I saw some promising bug fixes from one or two months ago). |
Hi, any updates on the fix? |
@marcpb94 , @JainTwinkle now has an internal version that may work on simple cases. But it probably won't work for all cases. At this time, she is passing a container with her solution to @karya0, who will then improve that to a robust solution. We truly are getting close now. |
@gc00 @marcpb94 |
I am able to Backtrace:
|
Hi @JainTwinkle It seems to be working now! I tried mpi_hello_world with 1 rank and a heat distribution application (that we usually use for testing) with 4 ranks. However, there seems to be an issue that we already noticed in the old version of MANA and were hoping had been fixed in the new version. It seems to happen with our heat distribution application; I attach the source code so that you can reproduce the problem: heatdis.zip The issue is that when checkpointing at relatively short intervals (and letting it execute for a few minutes), the execution eventually deadlocks in the pre-checkpoint phase. Specifically, at least one of the MPI processes gets stuck in the drainSendRecv() function in the DMTCP_EVENT_PRECHECKPOINT case of mpi_plugin_event_hook(). Should we continue in this thread, or should I open another issue? Thanks, Marc |
I configured and installed MANA from the dmtcp-master branch on a local CentOS 7 machine. When trying to execute the mpi_hello_world.exe example, I get the following error:
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
It seems that the mana_launch script tries to point dmtcp_launch to the wrong libmana.so location (bin/../lib/dmtcp instead of bin/../dmtcp/lib/dmtcp), but even if I try to point to where libmana.so is, the program gets stuck at the start.
Am I doing something wrong? Is the dmtcp-master branch currently working? Thanks!
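To check which of the two locations mentioned above actually contains a readable libmana.so, a small helper like the following could be used. This is only a sketch: `first_readable` is a hypothetical name, and it relies on nothing beyond the POSIX `access()` call. It does not reproduce what mana_launch itself does.

```c
#include <unistd.h>

/*
 * Returns the index of the first path in paths[0..n-1] that the calling
 * process can read, or -1 if none of them is readable.
 */
int first_readable(const char *const paths[], int n) {
    for (int i = 0; i < n; i++) {
        if (access(paths[i], R_OK) == 0) {
            return i;
        }
    }
    return -1;
}
```

Called from the MANA install directory with the candidates `bin/../lib/dmtcp/libmana.so` and `bin/../dmtcp/lib/dmtcp/libmana.so`, a result of 1 would confirm that only the second location exists, while -1 would mean that neither is readable from the current working directory.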