Debug LX app crashing peculiarity #754
Hey there! I'm the developer mentioned by @vstath. We're aware of a possible bug during the unload of this particular binary, but on "normal" Linux it has proven incredibly hard to reproduce (it triggers roughly once in every 5000 runs, at most). It's most likely some kind of pointer problem, as it occasionally causes a segmentation fault. (The resulting core dump has so far always been heavily corrupted, to the point of not being usable to find the problem.)
It seems to trigger reproducibly in SmartOS (every single run), and I would love to figure out what we're doing wrong (as it's likely some kind of mistake in our code). Unfortunately, as mentioned above, I'm only familiar with GDB/Valgrind (neither seems to catch the problem, whatever it is) and not with the debugging tools native to SmartOS. Assistance in this matter would be highly appreciated!
What I would normally do in GDB (if it had worked, that is) is attach it to the running process (it's started through a series of hard-to-navigate forks, so attaching after startup is easiest), wait for the unload sequence, and then use the backtrace and print commands to figure out what is going wrong. What would be the equivalent?
Probably the first thing to do is see if there is a core dump which can tell us anything. Core dumps should be saved in the global zone under the zone's cores directory (/zones/{uuid}/cores). If you see one or more dumps there, you could run:
I have a DTrace script which I can provide and which you can run in the global zone. It will collect a lot of information about the behavior of all of the processes running inside the zone in a relatively lightweight manner. This would be my next suggested step to help narrow down where the code is executing when it dumps.
(Since the developer is following this) If the binary could be built with frame pointers (if it's not already, i.e. without the -fomit-frame-pointer gcc option), at least for testing, that could also help provide more information in a subsequent core file.
@jjelinek thank you. I would appreciate receiving the DTrace script. Unfortunately, I couldn't find any core dumps of that particular application peculiarity.
Just for reference: the binary is built with frame pointers, but without debug symbols. If absolutely necessary I can provide @vstath with a version that does have debug symbols, but I'd prefer not to.
I've placed the DTrace script at:
You must run DTrace as root in the global zone. Edit the script and, in the "BEGIN" block, update the "watch_zoneid" value to be the zoneid of the zone which is running the application that fails. Then run the script like this:
dtrace -s lxz.d >/var/tmp/log.out
In a separate terminal, log in to the zone and do whatever you need to do to cause the application to fail. After the application dies, quit running dtrace and let me know. I'll make arrangements for you to provide the log file to me and I'll see if there is anything I can determine. We may need to iterate a few times to narrow in on what is happening, but this will be a good starting point.
I got the file. It reached about 26MB of data, and I stopped it a few seconds after the crash. Should I delete the irrelevant entries (redis/java etc.)? How can I send it to you?
Hey again! I took a look at @vstath 's log file, and noticed these entries:
This corresponds to the following code:
Which is followed by the following check:
Strangely, this check returns false despite process 27015 exiting cleanly.
@Thulinma Are you clearing
I'm very suspicious of this use of
That's because it never is -1. The lines above it already guarantee that pid is set to the actual pid of the child (in this case, 27015). To clarify, this is the sequence of events here:
This is also part of why I was surprised by the sequence of wait4 calls displayed: we never call waitpid with a value of -1 for the first argument. Then again, we never call wait4, just waitpid, so I automatically assumed that some kind of weird rewriting logic was happening... But maybe that's not what is happening at all?
That makes me think... Hm. It's possible some other code got loaded that shouldn't be active. I'll check in the morning when I am more awake. If my suspicions are correct, the other wait4 calls are from another thread in the same application. (A thread that should not be running at all in this situation, normally speaking.)
Edit: Oh, and no, we don't clear errno, because that side of the statement shouldn't be executed unless waitpid fails, so it should always have been set by waitpid if we get to that check.
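For illustration, the pattern described here (looking at errno only once waitpid has actually reported failure) might look roughly like the C sketch below. This is not MistServer's code, which is not shown in this thread. The helper name `child_is_running` and its return convention are assumptions made purely for the example.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from MistServer).
 * Returns 1 if child `pid` is still running,
 *         0 if it has terminated or is no longer a waitable child of ours,
 *        -1 on any other error.
 */
int child_is_running(pid_t pid) {
    int status;
    pid_t r;

    /* Non-blocking check; retry if a signal interrupts the call. */
    do {
        r = waitpid(pid, &status, WNOHANG);
    } while (r == -1 && errno == EINTR);

    if (r == 0) {
        return 1; /* child exists and has not changed state */
    }
    if (r == pid) {
        return 0; /* child terminated and we reaped it just now */
    }

    /*
     * r == -1: only on this path is errno meaningful, because waitpid
     * has just set it. No need to clear errno beforehand.
     */
    if (errno == ECHILD) {
        return 0; /* no such child: never ours, or already reaped elsewhere */
    }
    return -1;
}
```

Note that ECHILD is folded into "not running" here. That choice matters if some other part of the program might reap the same child first.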
Something is clearly calling wait with a pid of
Okay - found it! The specific way @vstath has his configuration set up causes a thread to spawn that, at regular intervals, calls waitpid(-1) to reap leftover child processes. This means there are potentially two waitpid calls going at the same time: one with -1 as parameter, one with a specific pid as parameter. Under Linux, the waitpid that returns seems to always be the specific one. In this case, it seems to be the catch-all one. Might be useful for you guys to know that this behavior is subtly different from Linux. I doubt it breaks many applications, but the count is at least one.
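One way to avoid depending on which waiter wins is to treat ECHILD from the pid-specific waitpid as "someone else already reaped this child" rather than as a failure. The C sketch below shows that idea under stated assumptions. It is not necessarily the fix MistServer adopted, and the names `wait_for_child`, `demo_catch_all_wins`, and the `wait_result` enum are hypothetical. To keep the outcome deterministic, the demo calls waitpid(-1) in the same thread as a stand-in for the periodic reaper thread.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum wait_result { REAPED_HERE, REAPED_ELSEWHERE, WAIT_FAILED };

/*
 * Wait for one specific child, tolerating the case where a catch-all
 * waitpid(-1) elsewhere in the process has already collected it.
 */
enum wait_result wait_for_child(pid_t pid, int *status) {
    pid_t r;

    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == pid) {
        return REAPED_HERE;
    }
    if (r == -1 && errno == ECHILD) {
        /* The child is gone, but its exit status went to another waiter. */
        return REAPED_ELSEWHERE;
    }
    return WAIT_FAILED;
}

/*
 * Demo: the catch-all wait collects the child first, then the
 * pid-specific wait runs. Assumes the process has no other children.
 */
enum wait_result demo_catch_all_wins(void) {
    int status;
    pid_t r;
    pid_t child = fork();

    if (child < 0) {
        return WAIT_FAILED;
    }
    if (child == 0) {
        _exit(0); /* child exits immediately */
    }

    /* Stand-in for the reaper thread's waitpid(-1). */
    do {
        r = waitpid(-1, &status, 0);
    } while (r == -1 && errno == EINTR);
    if (r != child) {
        return WAIT_FAILED;
    }

    /* The specific wait now finds nothing left to collect. */
    return wait_for_child(child, &status);
}
```

With this shape, the caller loses the exit status when the catch-all waiter wins, so code that needs the status would have to get it from the reaper instead.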
Jaron,
Thanks for digging in on this, and sorry that this caused problems for you
because things were a little different for lx vs Linux. I'll get a bug open
to track this difference.
Jerry
…On Mon, Jan 8, 2018 at 9:00 AM, Jaron Viëtor ***@***.***> wrote:
Okay - found it!
The specific way @vstath <https://github.com/vstath> has his
configuration set up, causes a thread to spawn that at regular intervals
calls waitpid(-1) to reap leftover child processes. This means there are
potentially two waitpid's going at the same time: one with -1 as parameter,
one with a specific pid as parameter.
Under Linux, the waitpid that returns seems to always be the specific one.
In this case, it seems to be the catch-all one.
I don't think POSIX defines what the correct behavior is in this case (I
can't find described anywhere, at least). So *we'll change our logic not
to rely on this Linux-specific behavior anymore.*
(In other words: I'm considering this a bug on our end, which oddly cannot
be reproduced under Linux.)
Might be useful for you guys to know that this behavior is subtly
different from Linux. I doubt it breaks many applications, but the count is
at least one.
I am running SmartOS "5.11 joyent_20161222T003450Z i86pc i386 i86pc" and I have an Ubuntu 16.04 LX zone (image id 23ee2dbc-c155-11e6-ab6d-bf5689f582fd from Joyent).
Inside the LX zone I'm testing MistServer (https://mistserver.org), which behaves differently than when run on bare-metal Ubuntu (it crashes when releasing memory during application shutdown).
I discussed this with the developer, and he tried unsuccessfully to debug within the LX zone, primarily because GDB does not function inside LX.
I was told by @rmustacc to mention it here so that we can debug the peculiarity using native tools (dtrace or similar?).
The application developer will provide all the necessary information, some being too detailed for me to know.
I appreciate in advance all the support and pointers provided.