Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug LX app crashing peculiarity #754

Open
vstath opened this issue Dec 19, 2017 · 13 comments
Open

Debug LX app crashing peculiarity #754

vstath opened this issue Dec 19, 2017 · 13 comments

Comments

@vstath
Copy link

@vstath vstath commented Dec 19, 2017

I am running SmartOS "5.11 joyent_20161222T003450Z i86pc i386 i86pc" and I have an Ubuntu 16.04 LX zone (image id 23ee2dbc-c155-11e6-ab6d-bf5689f582fd from Joyent).

Inside the LX I'm testing MistServer (https://mistserver.org) which behaves differently than when run under a baremetal Ubuntu (it crashes when releasing memory during application shut down).

I discussed with the developer and he tried to debug within the LX zone unsuccessfully, primarily because GDB does not function inside LX.

I am told by @rmustacc to mention it here so that we can debug the peculiarity using native tools (dtrace or similar?).

The application developer will provide all the necessary information, some being too detailed for me to know.

I appreciate in advance all the support and pointers provided.

@Thulinma

This comment has been minimized.

Copy link

@Thulinma Thulinma commented Dec 19, 2017

Hey there! I'm the developer mentioned by @vstath .

We're aware of there possibly being a bug during the unload of this particular binary, but on "normal" Linux it's proven incredibly hard to reproduce (triggers approx. once in every ~5000 runs or so, at most). It's most likely some kind of pointer problem, as it causes a segmentation fault occasionally. (The resulting core dump has so far always been heavily corrupted, to the point of not being usable to find the problem.)

It seems that it triggers reproducibly in SmartOS (every single run), and I would love to figure out what we're doing wrong (as it's likely some kind of mistake in our code). Unfortunately as mentioned above, I'm only familiar with GDB/Valgrind (both don't seem to catch the problem, whatever it is) and not with the debugging tools native in SmartOS. Assistance in this matter would be highly appreciated!

What I would normally do in GDB (if it had worked, that is), is hook it to the running process (it's started through a series of hard to navigate forks, so hooking after running is easiest), wait for the unload sequence, and then use the backtrace as well as the print command to figure out what is going wrong. What would be the equivalent?

@jjelinek

This comment has been minimized.

Copy link
Member

@jjelinek jjelinek commented Dec 19, 2017

Probably the first thing to do is see if there is a core dump which can tell us anything. Core dumps should be saved in the global zone under the zone's cores directory (/zones/{uuid}/cores. If you see one or more dumps there, you could run:
mdb /zones/{uuid}/cores/{coredump}
::status
$C
This will show us some information about the dump, but unfortunately we usually don't get a lot of info from a Linux core dump.

I have a DTrace script which I can provide and which you can run in the global zone. It will collect a lot of information about the behavior of all of the processes running inside the zone in a relatively lightweight manner. This would be by next suggested step to help narrow down where the code is executing when it dumps.

@jasonbking

This comment has been minimized.

Copy link
Contributor

@jasonbking jasonbking commented Dec 20, 2017

(Since the developer is following this) If the binary could be built with frame pointers (if it's not already -- i.e. without the -fomit-frame-pointer gcc option) at least for testing, that could also help provide more information in a subsequent core file.

@vstath

This comment has been minimized.

Copy link
Author

@vstath vstath commented Dec 20, 2017

@jjelinek thank you. I would appreciate receiving the dtrace script. Unfortunately, I couldn't find any core dumps of that particular application peculiarity.

@Thulinma

This comment has been minimized.

Copy link

@Thulinma Thulinma commented Dec 20, 2017

Just for reference: the binary is built with frame pointers, but without debug symbols. If absolutely necessary I can provide @vstath with a version that does have debug symbols, but I'd prefer not to. 🙂

@jjelinek

This comment has been minimized.

Copy link
Member

@jjelinek jjelinek commented Dec 20, 2017

I've placed the Dtrace script at:
https://us-east.manta.joyent.com/jjelinek/public/lxz.d

You must run DTrace as root in the global zone. Edit the script and in the "BEGIN" block, update the "watch_zoneid" value to be the zoneid of the zone which is running the application that fails. Then run the script like this:

dtrace -s lxz.d >/var/tmp/log.out

In a separate terminal, login to the zone and do whatever you need to do to cause the application to fail. After the application dies. quit running dtrace and let me know. I'll make arrangements for you to provide the log file to me and I'll see if there is anything I can determine. We may need to iterate a few times to narrow in on what is happening, but this will be a good starting point.

@vstath

This comment has been minimized.

Copy link
Author

@vstath vstath commented Jan 7, 2018

I got the file - reached about 26MB of data and I stopped it some seconds after the crash. Should I delete the irrelevant entries (redis/java etc)? How can I send to you?

@Thulinma

This comment has been minimized.

Copy link

@Thulinma Thulinma commented Jan 7, 2018

Hey again! I took a look at @vstath 's log file, and noticed these entries:

MistInBuffer 25922 1 -> wait4(-1, 0x7fffffefe2e8, 0x1, 0x0)
MistInBuffer 25922 1 <- wait4=27015 status 0x0
MistInBuffer 25922 1 -> wait4(-1, 0x7fffffefe2e8, 0x1, 0x0)
MistInBuffer 25922 1 <- wait4=-10
MistInBuffer 25922 1 -> wait4(27015, 0x7fffffefea1c, 0x0, 0x0)
MistInBuffer 25922 1 <- wait4=-10

This corresponds to the following code:

      while (waitpid(pid, &status, 0) != pid && errno == EINTR){}

Which is followed by the following check:

      if (WIFEXITED(status) && (WEXITSTATUS(status) == 0)){

Strangely, this check returns false despite process 27015 exiting cleanly.
Any ideas what may cause this?

@rzezeski

This comment has been minimized.

Copy link
Contributor

@rzezeski rzezeski commented Jan 8, 2018

@Thulinma Are you clearing errno before the while loop? If errno = EINTR before running this code (perhaps from a previous call OR even modified during the library call itself even though it was successful) then I think the following sequence of events is happening.

  • pid = -1
  • waitpit(-1, &status, 0) returns 27015
  • the first expression in the while condition is true and so is the second, repeat
  • waitpid(-1, &status, 0) returns -1 and sets errno = ECHILD, both conditions in the wile loop expression are false, break out
  • the WIFEXITED(... expression is false because the waitpid call failed and so status is bogus at this point
  • then, presumably, there's another call to wait after this logic (or maybe set pid and do a goto), this time with pid 27015 which was already waited on and thus the call fails again with ECHILD.

I'm very suspicious of this use of errno. You should only read it when waitpid() has failed. In fact, it feels like this code was written assuming pid is never -1.

@Thulinma

This comment has been minimized.

Copy link

@Thulinma Thulinma commented Jan 8, 2018

That's because it never is -1. The lines above it already guarantee that pid is set to the actual pid of the child (in this case, 27015).

To clarify, this is the sequence of events here:

  • Application calls fork(), stores result in pid (the value of pid is never modified after this).
  • If pid == -1, execute child behavior.
  • If pid != -1, wait for child to exit using the loop from my previous post.
  • After the loop, no further calls to waitpid() are made at all.

This is also part of why I was surprised by the sequence of wait4 calls displayed: we never call waitpid with a value of -1 for the first argument. Then again, we never call wait4 - just waitpid, so I automatically assumed that was some kind of weird rewriting logic happening.... But maybe that's not what is happening at all?

That makes me think... Hm. It's possible some other code got loaded that shouldn't be active. I'll check in the morning when I am more awake. If my suspicions are correct, the other wait4 calls are from another thread in the same application. (A thread that should not be running at all in this situation, normally speaking.)

Edit: Oh, and no, we don't clear errno - because that side of the statement shouldn't be executed unless waitpid fails, so it should always have been set by waitpid if we get to that check.

@rzezeski

This comment has been minimized.

Copy link
Contributor

@rzezeski rzezeski commented Jan 8, 2018

Something is clearly calling wait with a pid of -1 per the dtrace output. The reason it shows wait4() is most likely because libc is rewriting waidpid() to wait4() (musl uses wait4() and it looks like more than likely glibc does as well).

@Thulinma

This comment has been minimized.

Copy link

@Thulinma Thulinma commented Jan 8, 2018

Okay - found it!

The specific way @vstath has his configuration set up, causes a thread to spawn that at regular intervals calls waitpid(-1) to reap leftover child processes. This means there are potentially two waitpid's going at the same time: one with -1 as parameter, one with a specific pid as parameter.

Under Linux, the waitpid that returns seems to always be the specific one. In this case, it seems to be the catch-all one.
I don't think POSIX defines what the correct behavior is in this case (I can't find described anywhere, at least). So we'll change our logic not to rely on this Linux-specific behavior anymore.
(In other words: I'm considering this a bug on our end, which oddly cannot be reproduced under Linux.)

Might be useful for you guys to know that this behavior is subtly different from Linux. I doubt it breaks many applications, but the count is at least one. 😉

@jjelinek

This comment has been minimized.

Copy link
Member

@jjelinek jjelinek commented Jan 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.