lxd stop / lxd restart hangs and never returns (Command get_cgroup failed to receive response: Connection reset by peer) #3159
What's the kernel you're running? It doesn't match any of the kernels I'd usually expect on Ubuntu 16.04. Your report of containers having difficulty going offline and then not coming back up matches the symptoms of an apparmor bug which is yet to be fixed in the Ubuntu kernel. Our own CI environment hits this every few days, typically after several thousand containers have been started and stopped. There is usually kernel logging associated with that issue, so attaching your /var/log/kernel.log from shortly after the problem starts happening again would let us confirm.
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/1645037 is the Launchpad bug for this issue. It's rather confusing: they attempted a fix which backfired and had to be reverted, and they are now attempting to fix it again, so it's really hard to tell whether you have a kernel with the fix or not (especially since you're not using one of the stock kernels).
tchwpkgorg commented Apr 7, 2017
There doesn't seem to be any related kernel logging around the time it happens. For example:
And matching /var/log/kernel.log around that time:
This still smells like a kernel issue, as there is no lock in LXC or LXD that could block all containers from starting or stopping. I can't remember whether the apparmor issue caused a similar "Connection reset by peer" error to show up in the log. Running "ps aux | grep apparmor" on a stuck system would be another way to check: with that bug, all apparmor_parser processes get stuck indefinitely (causing the stop and start hangs). If it's not the usual apparmor issue, then your log suggests that the container's monitor process is still running but in a very bad state. So when you get another one of those hangs at "lxc stop" time, it'd be interesting to do:
I expect it'd also be useful to know:
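As a side note, the apparmor check suggested above can be scripted. The sketch below is my own illustration (not something from the thread); it reads /proc directly instead of parsing "ps aux" output, since a stuck apparmor_parser will be sitting in uninterruptible sleep ("D" state):

```shell
#!/bin/sh
# Look for apparmor_parser processes stuck in uninterruptible sleep ("D"),
# the symptom of the apparmor kernel bug discussed above. Linux-only.
stuck=""
for statfile in /proc/[0-9]*/stat; do
    [ -r "$statfile" ] || continue
    # /proc/<pid>/stat fields: pid, (comm), state, ...
    read -r pid comm state _ < "$statfile" 2>/dev/null || continue
    if [ "$comm" = "(apparmor_parser)" ] && [ "$state" = "D" ]; then
        stuck="$stuck $pid"
    fi
done
if [ -n "$stuck" ]; then
    echo "stuck apparmor_parser PIDs:$stuck"
else
    echo "no stuck apparmor_parser processes"
fi
```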
stgraber added the Incomplete label Apr 7, 2017
Ah, you say "lxd restart" doesn't fix it; I'm assuming you mean "systemctl restart lxd" by that? If so, then you're definitely looking at a kernel bug here, it's just not obvious which kernel bug if there's nothing suspicious in your kernel log...
tchwpkgorg commented Apr 7, 2017
Correct, "systemctl restart lxd" doesn't fix this issue.
Yes, the container will start. But "lxc stop" will still hang for new (and old) containers; it does stop the container, it just does not exit.
Okay, so then it really can't be any of the LXD locks, as none of those would survive a daemon restart. The fact that you can create and start new containers is interesting though; it really suggests that the monitor process of the container is somehow still running. When you say the container isn't running anymore, how are you checking that? "lxc info", "lxc list", ...?
tchwpkgorg commented Apr 7, 2017
I check it with "lxc list"; it shows the container in the STOPPED state. "lxc stop container" does not exit, though. I can then start it again ("lxc start container"), but "lxc stop container" will hang if used again. So if "lxc stop container" checked the container's state every few seconds and exited once it saw STOPPED, that would be a workaround for a potential kernel bug :)
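That client-side workaround could be approximated with a wrapper script. The sketch below is my own illustration, not an official fix: the function name and the 30-second polling budget are assumptions, and the "lxc list" output format may differ between LXD versions. It backgrounds the possibly-hanging "lxc stop" and watches the reported state itself:

```shell
#!/bin/sh
# Sketch of the polling workaround described above: run "lxc stop" in the
# background (it may never return) and poll the container state ourselves.
# Function name and timeout are illustrative assumptions.
safe_stop() {
    name="$1"
    lxc stop "$name" &          # may hang; we stop waiting on it ourselves
    stop_pid=$!
    for _ in $(seq 1 30); do    # poll for up to ~30 seconds
        if lxc list "$name" | grep -q STOPPED; then
            kill "$stop_pid" 2>/dev/null   # reap the hung client
            return 0
        fi
        sleep 1
    done
    return 1
}
```

Called as "safe_stop mycontainer", it returns 0 as soon as the container reports STOPPED, even if the underlying "lxc stop" never exits.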
Hmm, ok, so the fact that "lxc list" shows it as STOPPED suggests that the monitor process exited, as it would otherwise have taken a while and eventually shown the status as "BROKEN". So this suggests you're stuck in the Shutdown() function of the LXC C API and that the container actually going offline doesn't unstick it... Also, that "connection reset by peer" entry looks safe to ignore; I'm getting it on normally shut-down containers here. Looks like we can take advantage of the fact that you can create new containers and start them after the problem starts showing up. Next time this happens, can you:
And provide us with both daemon.debug.log and container.debug.log. That will show us the debug output of the daemon during this whole exercise as well as a complete liblxc debug log for the container.
I'm also assuming that running other "lxc" commands doesn't unstick your stuck "lxc stop", right?
tchwpkgorg commented Apr 7, 2017
I don't think I've seen a case where "lxc stop" would unstick at any point. I'll run your suggested debugging next time the issue shows up.
@tchwpkgorg ping
tchwpkgorg commented Apr 20, 2017
(Un)fortunately it hasn't happened again so far. I'll update the ticket when it does.
Closing for now; please comment if this recurs and we'll re-open.
stgraber closed this May 23, 2017
tchwpkgorg commented May 27, 2017
Here it is; unfortunately it doesn't give any extra info (I think):
stgraber reopened this May 29, 2017
Can you post an updated "lxc info"? We've got updated LXC, LXD and LXCFS in flight right now, so I'm wondering what was used there :) The debug log is pretty useful though. We clearly see the container being asked to shut down, it shutting down (calling the hook), and LXD triggering the rebalance as a result, but the shutdown operation stays in the running state...
stgraber added Bug and removed Incomplete labels May 29, 2017
stgraber added this to the lxd-2.15 milestone May 29, 2017
tchwpkgorg commented May 30, 2017
Here it is:
Oh, that's nice, so that's the latest of everything :)
If I may chime in as well: @tchwpkgorg, can you please provide the logs of the failing container itself? Especially:
That would be really helpful.
Since this was first reported, we've reproduced a number of similar hangs caused by the lxc-monitord process. A fix for this has been merged in liblxc which effectively drops the use of lxc-monitord for normal operations; it will be in the next LXC stable release. As the liblxc issue matches this report perfectly, I'm going to close this issue. If you do see this happen again, can you please run "pkill -9 lxc-monitord" and see if that unsticks things?
stgraber closed this Jun 27, 2017
tchwpkgorg commented Jul 13, 2017
It hung for me again with:
Running "pkill -9 lxc-monitord" made the hung "lxc stop" commands fail with exit code 1:
After that, "lxc stop" seemed to be working correctly.
Cool, so that does confirm it's the monitord issue. This will be fixed with liblxc1 version 2.0.9 once we release it.
tchwpkgorg commented Jul 28, 2017
Is there any ETA for the liblxc1 2.0.9 release? Or any reliable workaround? We're seeing these hangs quite often, and our systems malfunction as a result. Would "pkill -9 lxc-monitord" executed via cron every minute work?
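For reference, the cron-based stopgap asked about here would look something like the fragment below. This is only an illustration of the idea (the pkill path and the one-minute schedule are assumptions), not an endorsed fix:

```shell
# Stopgap sketch: kill any lxc-monitord every minute, per the workaround
# discussed in this thread. Add to root's crontab via "crontab -e".
* * * * * /usr/bin/pkill -9 lxc-monitord
```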
@tchwpkgorg there won't be a 2.0.9. There'll be a 2.1, and we're aiming for August!
tchwpkgorg commented Apr 7, 2017
Required information
I saw the issue with various 2.0.x versions; the last one where it happens is 2.0.9-0ubuntu1~16.04.2. I also tried many different kernels (stock Ubuntu, newer ppa).
This gets logged to the container log:
When it starts to happen, it affects all containers. The LXD server runs 100+ containers, starting/stopping/deleting dozens of containers daily for automation, sometimes many at the same time. Approximately once every 1-2 months, an "lxc stop" / "lxc restart" command will fail, which is a bit of a stability concern for us.
The only thing that fixes it is a server restart ("lxd restart" doesn't fix it).
There is also no clear way to reproduce it reliably (other than running the server for a long time and starting/stopping a large number of containers over that time...).
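The start/stop churn described above could be approximated with a simple loop for soak testing. This is my own sketch, not a guaranteed reproducer; the image alias, container names, and iteration count are all assumptions:

```shell
#!/bin/sh
# Illustrative churn loop approximating the workload described above:
# repeatedly create, start, stop, and delete containers. Image alias,
# names, and iteration count are assumptions.
churn_containers() {
    count="${1:-10}"
    for i in $(seq 1 "$count"); do
        lxc launch ubuntu:16.04 "churn-$i" || return 1
        lxc stop "churn-$i"
        lxc delete "churn-$i"
    done
}

# Example: churn_containers 100
```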