lxd stop / lxd restart hangs and never returns (Command get_cgroup failed to receive response: Connection reset by peer) #3159

Closed
tchwpkgorg opened this Issue Apr 7, 2017 · 24 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 16.04

I saw the issue with various 2.0.x versions; the latest one where it happens is 2.0.9-0ubuntu1~16.04.2. I also tried many different kernels (stock Ubuntu, newer ones from a PPA).

  • "lxc stop container" will stop the container but will never exit
  • "lxc restart container" will stop the container, will not start it again and will never exit

This gets logged to the container log:

        lxc 20170301115514.738 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:172 - Command get_cgroup failed to receive response: Connection reset by peer.

When it starts to happen, it affects all containers. The LXD server runs 100+ containers and is used for automation; it starts, stops and deletes dozens of containers daily, sometimes many at the same time. Approximately once every 1-2 months, an "lxc stop" / "lxc restart" command will start hanging like this, which is a bit of a stability concern for us.

The only thing that fixes it is restarting the server ("lxd restart" doesn't fix it).

There is also no clear way to reproduce it reliably (other than running the server for a long time and starting/stopping a large number of containers over that period...).
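
For reference, the churn the automation generates boils down to something like the loop below (a simplified sketch only, not the actual automation; the image alias and naming scheme are placeholders):

    # Simplified sketch of the automation churn; image alias and names are placeholders.
    for i in $(seq 1 50); do
        name="job-$i"
        lxc launch images:ubuntu/xenial/amd64 "$name"
        # ... run the automated job inside the container ...
        lxc stop "$name"      # this is the call that eventually starts hanging
        lxc delete "$name"
    done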

# lxc info
config:
  core.https_address: '[::]'
  core.trust_password: true
api_extensions:
- id_map
api_status: stable
api_version: "1.0"
auth: trusted
public: false
environment:
  addresses:
(...)
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
(...)
  driver: lxc
  driver_version: 2.0.7
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.10.8-041008-generic
  server: lxd
  server_pid: 2269
  server_version: 2.0.9
  storage: btrfs
  storage_version: "4.4"

Owner

stgraber commented Apr 7, 2017

What's that kernel you're running? It doesn't match any of the kernels I'd usually expect on Ubuntu 16.04.

Your report of containers having difficulty going offline and then not coming back up matches the symptoms of an apparmor bug which has yet to be fixed in the Ubuntu kernel.

Our own CI environment hits this every few days, typically after several thousand containers have been started and stopped. There is usually kernel logging associated with that issue; attaching your /var/log/kernel.log from shortly after the problem starts happening again would let us confirm.

Owner

stgraber commented Apr 7, 2017

https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/1645037 is the Launchpad bug for this issue.

It's rather confusing: they attempted a fix which backfired and had to be reverted, and they are now attempting to fix it again, so it's really hard to tell whether you have a kernel with the fix or not (especially since you're not using one of the stock kernels).

tchwpkgorg commented Apr 7, 2017

There doesn't seem to be any related kernel logging around the time it happens.

For example:


TD-Website-testing-a19ea62218-2017-04-06-03-45-34/lxc.log:            lxc 20170406035516.105 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:172 - Command get_cgroup failed to receive response: Connection reset by peer.
TD-Website-testing-a19ea62218-2017-04-06-03-50-18/lxc.log.old:            lxc 20170406035433.918 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:172 - Command get_cgroup failed to receive response: Connection reset by peer.

And matching /var/log/kernel.log around that time:

Apr  6 03:22:41 lxd01 kernel: [ 9566.675485] br-testing: port 7(vethC3CKJ2) entered forwarding state
Apr  6 03:45:37 lxd01 kernel: [10942.240358] br-testing: port 8(veth36BUID) entered blocking state
Apr  6 03:45:37 lxd01 kernel: [10942.240400] device veth36BUID entered promiscuous mode
Apr  6 03:45:37 lxd01 kernel: [10942.240445] IPv6: ADDRCONF(NETDEV_UP): veth36BUID: link is not ready
Apr  6 03:45:52 lxd01 kernel: [10957.502324] br-testing: port 8(veth36BUID) entered learning state
Apr  6 03:49:13 lxd01 kernel: [11158.779309] ffmpeg[8261]: segfault at 1 ip 00000000004bdd1a sp 00007ffe391e7900 error 4 in ffmpeg[400000+1235000]
Apr  6 03:50:21 lxd01 kernel: [11226.468047] br-testing: port 9(vethR1QWJX) entered blocking state
Apr  6 03:50:21 lxd01 kernel: [11226.468049] br-testing: port 9(vethR1QWJX) entered disabled state
Apr  6 03:50:21 lxd01 kernel: [11226.468087] device vethR1QWJX entered promiscuous mode
Apr  6 03:50:21 lxd01 kernel: [11226.468129] IPv6: ADDRCONF(NETDEV_UP): vethR1QWJX: link is not ready
Apr  6 03:50:21 lxd01 kernel: [11226.506997] eth0: renamed from veth1GSIRD
Apr  6 03:50:21 lxd01 kernel: [11226.543512] IPv6: ADDRCONF(NETDEV_CHANGE): vethR1QWJX: link becomes ready
Apr  6 03:50:21 lxd01 kernel: [11226.543559] br-testing: port 9(vethR1QWJX) entered blocking state
Apr  6 03:50:21 lxd01 kernel: [11226.543561] br-testing: port 9(vethR1QWJX) entered listening state
Apr  6 03:50:51 lxd01 kernel: [11256.765597] br-testing: port 9(vethR1QWJX) entered forwarding state
Apr  6 03:50:51 lxd01 kernel: [11256.765599] br-testing: topology change detected, propagating
Apr  6 03:54:33 lxd01 kernel: [11477.972883] br-testing: port 9(vethR1QWJX) entered disabled state

Owner

stgraber commented Apr 7, 2017

This still smells like a kernel issue, as there is no lock in LXC or LXD that could block all containers from starting or stopping.

I can't remember whether the apparmor issue caused a similar "Connection reset by peer" error to show up in the log. Running "ps aux | grep apparmor" on a stuck system would be another way to check, as with that bug all apparmor_parser processes get stuck indefinitely (causing the stop and start hangs).

If it's not the usual apparmor issue, then your log suggests that the container monitor process is still running but in a very bad state. So when you hit another one of those hangs at "lxc stop" time, it'd be interesting to do the following (a consolidated sketch follows the list):

  • ps aux | grep CONTAINER-NAME
  • Look for a "[lxc monitor]" entry
  • See if it has any process left under it (ps fauxww)
  • If you have the old LXC tools installed (lxc-info), try: "lxc-info -P /var/lib/lxd/containers/ -n CONTAINER-NAME"
  • If the monitor process appears to be running, running "strace -fF -p" against it and attempting another "lxc stop" may show just how stuck it is
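
Roughly, as one sequence (a sketch only; CONTAINER-NAME and MONITOR-PID are placeholders and the path assumes the standard /var/lib/lxd layout):

    # Sketch of the checks above; CONTAINER-NAME and MONITOR-PID are placeholders.
    ps aux | grep apparmor                 # any apparmor_parser processes stuck?
    ps aux | grep CONTAINER-NAME           # look for a "[lxc monitor]" entry
    ps fauxww | less                       # does the monitor still have children?

    # With the old LXC tools installed:
    lxc-info -P /var/lib/lxd/containers/ -n CONTAINER-NAME

    # If the monitor is still running, trace it while retrying the stop:
    strace -fF -p MONITOR-PID &
    lxc stop CONTAINER-NAME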

I expect it'd also be useful to know:

  • Can a completely new container start when the rest is stuck?
  • Does "systemctl restart lxd" fix things? (really doubt it but easy enough to try)

@stgraber stgraber added the Incomplete label Apr 7, 2017

Owner

stgraber commented Apr 7, 2017

Ah, you say "lxd restart" doesn't fix it; I'm assuming you mean "systemctl restart lxd" by that?

If so, then you're definitely looking at a kernel bug here; it's just not obvious which kernel bug if there's nothing suspicious in your kernel log...

Does "systemctl restart lxd" fix things? (really doubt it but easy enough to try)
Ah, you say "lxd restart" doesn't fix it, I'm assuming you mean "systemctl restart lxd" by that?

Correct. "systemctl restart lxd" doesn't fix this issue.

Can a completely new container start when the rest is stuck?

Yes, it will start. But "lxc stop" will still hang for new (and old) containers - it does stop the container, it just never exits.

Owner

stgraber commented Apr 7, 2017

Okay, so then it really can't be any of the LXD locks as none of those would survive a daemon restart.

The fact that you can create and start new containers is interesting though, it really suggests that the monitor process of the container is somehow still running.

When you say the container isn't running anymore, how are you checking that? "lxc info", "lxc list", ... ?

tchwpkgorg commented Apr 7, 2017

When you say the container isn't running anymore, how are you checking that? "lxc info", "lxc list", ... ?

I check it with "lxc list" - it shows the container in the STOPPED state. "lxc stop container" does not exit, though.

I can then start it again (lxc start container) - but "lxc stop container" will hang, if used again.
The container stops fine when we use "poweroff" inside the container.

So if "lxc stop container" was checking for a state of container every few seconds, and could exit if it saw it was in STOPPED state - it would be a workaround for a potential kernel bug :)

Owner

stgraber commented Apr 7, 2017

Hmm, ok, so the fact that "lxc list" shows it as STOPPED suggests that the monitor process exited as it would otherwise have taken a while and eventually shown the status as "BROKEN".

So this suggests you're stuck in the Shutdown() function of the LXC C API and that the container actually going offline doesn't unstick it... Also, that "Connection reset by peer" entry looks safe to ignore; I'm getting it on normally shutdown containers here.

Looks like we can take advantage of the fact that you can create new containers and start them after the problem starts showing up. Next time this happens, can you run through the following (a scripted version of the same steps appears after the list):

  • lxc monitor > daemon.debug.log &
  • lxc launch SOME-IMAGE SOME-NAME -c raw.lxc="lxc.loglevel=debug"
  • lxc info SOME-NAME --show-log
  • lxc stop SOME-NAME
  • [wait 30s or so]
  • [ctrl+c]
  • fg
  • [ctrl+c]
  • lxc info SOME-NAME --show-log > container.debug.log

And provide us with both daemon.debug.log and container.debug.log.

That will show us both the debug output of the daemon during this whole exercise as well as a complete liblxc debug log for the container.
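
Put together as one script, that would look roughly like this (a sketch only; SOME-IMAGE and SOME-NAME are placeholders and the 30-second cap mirrors the wait above):

    # Sketch of the collection steps above; SOME-IMAGE and SOME-NAME are placeholders.
    lxc monitor > daemon.debug.log &
    monitor_pid=$!

    lxc launch SOME-IMAGE SOME-NAME -c raw.lxc="lxc.loglevel=debug"
    lxc info SOME-NAME --show-log

    # This is the call expected to hang, so cap it at about 30 seconds.
    timeout 30 lxc stop SOME-NAME

    # Stop the background "lxc monitor" and collect the container log.
    kill "$monitor_pid"
    lxc info SOME-NAME --show-log > container.debug.log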

Owner

stgraber commented Apr 7, 2017

I'm also assuming that running other "lxc" commands doesn't unstick your stuck "lxc stop", right?
Just checking that your issue and another similarly weird issue reported at #3050 aren't the same thing somehow.

I'm also assuming that running other "lxc" commands doesn't unstick your stuck "lxc stop", right?

I don't think I've seen a case where a stuck "lxc stop" came unstuck at any point.

I'll run your suggested debugging next time the issue shows up.

Owner

stgraber commented Apr 20, 2017

(Un)fortunately it hasn't happened again so far.

I'll update the ticket when it does.

Owner

stgraber commented May 23, 2017

Closing for now, please comment if this re-occurs and we'll re-open.

@stgraber stgraber closed this May 23, 2017

tchwpkgorg commented May 27, 2017

Here it is - unfortunately it doesn't give any extra info (I think):

# lxc launch images:ubuntu/xenial/amd64 lxd-hang-test -c raw.lxc="lxc.loglevel=debug" -c raw.lxc="lxc.aa_allow_incomplete=1"
Creating lxd-hang-test
Starting lxd-hang-test

# lxc info lxd-hang-test --show-log
Name: lxd-hang-test
Remote: unix:/var/lib/lxd/unix.socket
Architecture: x86_64
Created: 2017/05/27 10:02 UTC
Status: Running
Type: persistent
Profiles: default
Pid: 10554
Ips:
  eth0: inet    10.205.161.11   vethSJIYQ7
  eth0: inet6   fd9d:b02f:336e:d0e1:216:3eff:feeb:a695  vethSJIYQ7
  eth0: inet6   fe80::216:3eff:feeb:a695        vethSJIYQ7
  lo:   inet    127.0.0.1
  lo:   inet6   ::1
Resources:
  Processes: 11
  CPU usage:
    CPU usage (in seconds): 0
  Memory usage:
    Memory (current): 47.84MB
    Memory (peak): 52.35MB
  Network usage:
    eth0:
      Bytes received: 2.78kB
      Bytes sent: 1.75kB
      Packets received: 26
      Packets sent: 16
    lo:
      Bytes received: 0B
      Bytes sent: 0B
      Packets received: 0
      Packets sent: 0

Log:

            lxc 20170527100202.503 WARN     lxc_apparmor - lsm/apparmor.c:apparmor_process_label_set:218 - Incomplete AppArmor support in your kernel



# lxc stop lxd-hang-test

(...never returns...)

container.debug.log.txt
daemon.debug.log.txt

@stgraber stgraber reopened this May 29, 2017

Owner

stgraber commented May 29, 2017

Can you post an updated "lxc info"? We've got updated LXC, LXD and LXCFS in flight right now, so I'm wondering what was used there :)

The debug log is pretty useful though. We clearly see the container being asked to shut down, it shutting down (calling the hook), and LXD triggering the rebalance as a result, but the shutdown operation stays in the running state...

@stgraber stgraber added Bug and removed Incomplete labels May 29, 2017

@stgraber stgraber added this to the lxd-2.15 milestone May 29, 2017

Here it is:

# lxc info
config:
  core.https_address: '[::]'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
api_status: stable
api_version: "1.0"
auth: trusted
public: false
environment:
  addresses:
(...)
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
(...)
    -----END CERTIFICATE-----
  certificate_fingerprint: 7381b19ae7563676857393701b4ec45313a3d84981bd665d12daf2de9a0a0dfa
  driver: lxc
  driver_version: 2.0.8
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.11.3-041103-generic
  server: lxd
  server_pid: 2319
  server_version: "2.13"
  storage: btrfs
  storage_version: 4.10.2
Owner

stgraber commented May 30, 2017

oh, that's nice, so that's latest everything :)

Member

brauner commented May 30, 2017

If I may chime in as well: @tchwpkgorg, can you please provide the logs of the failing container itself? Especially:

  • /var/log/lxd/<container-name>/lxc.log
  • /var/log/lxd/<container-name>/forkstart.log
  • /var/log/lxd/<container-name>/lxc.conf

That would be really helpful.

Owner

stgraber commented Jun 27, 2017

So since this was first reported, we've reproduced a number of similar hangs due to the lxc-monitord process. A fix for this has been merged in liblxc which effectively drops the use of lxc-monitord for normal operations. This fix will be in the next LXC stable release.

As the liblxc issue matches this report perfectly, I'm going to close this issue.

If you do see this happen again, can you please run "pkill -9 lxc-monitord" and see if that unsticks things?
If it doesn't, comment here and we'll re-open the issue, as that would point to something other than the monitord problem.
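
In other words, the next time a stop hangs, a check along these lines should tell whether lxc-monitord is the culprit (a sketch; CONTAINER-NAME is a placeholder):

    # Sketch of the check suggested above; CONTAINER-NAME is a placeholder.
    pgrep -a lxc-monitord      # is the monitor daemon (still) running?
    pkill -9 lxc-monitord      # kill it
    lxc stop CONTAINER-NAME    # retry the stop and see whether it now returns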

@stgraber stgraber closed this Jun 27, 2017

tchwpkgorg commented Jul 13, 2017

It hung for me again with:

ii  lxd                               2.15-0ubuntu6~ubuntu16.04.1~ppa1    amd64        Container hypervisor based on LXC - daemon
ii  lxd-client                        2.15-0ubuntu6~ubuntu16.04.1~ppa1    amd64        Container hypervisor based on LXC - client

ii  liblxc1                           2.0.8-0ubuntu3~ubuntu16.04.1~ppa1   amd64        Linux Containers userspace tools (library)
ii  lxc-common                        2.0.8-0ubuntu3~ubuntu16.04.1~ppa1   amd64        Linux Containers userspace tools (common tools)
ii  lxcfs                             2.0.7-0ubuntu4~ubuntu16.04.1~ppa1   amd64        FUSE based filesystem for LXC

Running "pkill -9 lxc-monitord" made any "lxc stop" commands fail with code 1:

# lxc stop website-ff25905474-2017-07-13-08-40-24 
error: shutting down the container failed
Try `lxc info --show-log website-ff25905474-2017-07-13-08-40-24` for more info

# echo $?
1

After that, "lxc stop" seems to be working correctly.

Owner

stgraber commented Jul 13, 2017

Cool, so that does confirm that it's the monitord issue. This will be fixed with liblxc1 at version 2.0.9 once we release it.

Is there any ETA for the liblxc1 2.0.9 release? Or any reliable workaround? We're seeing these hangs quite often, and our systems malfunction as a result.

Would "pkill -9 lxc-monitord" executed via cron every minute work?

Member

brauner commented Jul 28, 2017

@tchwpkgorg there won't be a 2.0.9. There'll be a 2.1 and we're aiming for August!
