Container won't start, monitor is hung. #4468

Closed
19wolf opened this Issue Apr 17, 2018 · 45 comments

Comments

6 participants
@19wolf
Contributor

19wolf commented Apr 17, 2018

Required information

  • Distribution: Ubuntu
  • Distribution version: 16.04.4
  • The output of "lxc info":
$ lxc info
config:
 core.https_address: '[::]:8443'
 core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
 addresses:
 - 192.168.1.1:8443
 - '[2604:6000:6342:::::]:8443'
 - '[2604:6000:6342:::::]:8443'
 - '[2604:6000:6342:::::]:8443'
 - 192.168.122.1:8443
 architectures:
 - x86_64
 - i686
 certificate: |
   -----BEGIN CERTIFICATE-----
   MIIFRDCCAyygAwIBAgIRAJOZkpRrAKHaoGlOrtgqFL4wDQYJKoZIhvcNAQELBQAw
   NTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBO
   ZXBoZWxlMB4XDTE3MTAwNDIwMzU0OVoXDTI3MTAwMjIwMzU0OVowNTEcMBoGA1UE
   ChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBOZXBoZWxlMIIC
   IjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAt75i6L7oq8xyuD3sdhc/NXAN
   d+aeRbbxThGWURJUWxRY9ZDz/gPZr5hCfSPUlnvdL6ieWIqs2otdN3di84yKik3o
   EGS5QUJAa3yofa6HkBCUpyUXEFNGG4k8aiuLBv1R46Xgj/Zv8sqxPuKcrU5Y/Ckc
   KhHTq0o7xosU/XStq4SpL0Wh0prxEzj6P9luqCLm2+80L5Ri/uRVvOBtyjJqXbPz
   N3qnYHBxxHtZo4ipH/qWesajglAKdyZnFRnbdhJbRmmAkHJ3W7LG5L80WcGPKY1i
   zq3BfqRbKpwZ8wB2VFs6QnSu5wQKiIrUHOTutvHORFnGvymWuNL9eg+XkEc59RBK
   QCu6X+h2CT1BDxzGbUSdaUwQ2NFg/wllNbJDomt6mVxCAoEmZU+vnINqDzCXzmpR
   HeqLnu1Nq4IscXn9cJ+Vlvm5Kqd69XTsYIkt0R1k83CtMN6FxCGCZg1Jm9hKWFdr
   2ib/g6xmUleT/uI4ElQzYpuss6s959/eamA1REFoj26v8MWIQp2xnWQ34imk/QNV
   /HdHvGXpTIS06AOfuT9Jc0k3zjtbzJ/psjeCAiETKozm5vxJ4VB396OJb2eOJhsw
   4ULKIunb10M2AJ2UOo7JVv3YZrRyLJPKRUUew0CY1uTHEjZhyw4eXtxH2K0+OEY0
   FgaX/wEs23MIGex5VMMCAwEAAaNPME0wDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQM
   MAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwGAYDVR0RBBEwD4IHTmVwaGVsZYcE
   wKgBAjANBgkqhkiG9w0BAQsFAAOCAgEAhQpGVhgiBVbs3bS4lpyThPrjMBoLvCh2
   6lCXO4YkHO9bcJ3G358LTgLALjs3j1bzk5qHk+Z63Tdsgr0oDCcDLX9HPHkkYQUf
   mihO6o1mINOcQ6jjz0y7KGfxD/GexcN7/WH2htbO1VF2p6oBdNsxi0G/lur5ASwt
   dJrDRQi7ifP00/DUWKNgZoLQOG5z/hfsjvocLZwmwGMqgY+xW5Gh7vCfljiCPL+x
   tKZOpKaZ6FxwY7OlsW802P27sCftRrwn0rvw6jsyY2KHo+LwQxlVHmJtxnT2diZn
   NQXFhBT921gzX1kkJdP26JHDlJhv0iL7A3JHt8QzAdxXNfBIQ/Dwu5fILjFbncyS
   8y1ulaLQd/fybi1dSaJgdHMnojMQV3c7mydUjB1j21+VlaW14rRmSwCnr3cZtg3m
   OndA7luHKvvthekvTyiI+ioDsDsO1DFc6YcljI5nVZkjeBTBJF1ornRnb+TP+XQl
   PlG3uSNG0UGoDm7y/wv8WZUwGU+BFiTIsSPyEeYWY92ePYgoLHy1TkaUp17GwBJI
   XdARg7L/kwy0pwgLPSiSkSoH1lXDQmSf7RFzrpaUkEcKG7n/tXPR0yCYRy5L1WPM
   ESsFJryCN1MW1ftb2Xu0+K15AHvYaAlQbXhTPKR4vPwiWFN9fZiIFXOhhhQoVMOc
   wUPmTgDyfNA=
   -----END CERTIFICATE-----
 certificate_fingerprint: 203defdb53eb4eb5d574fb581e619836f89525e6fed7b29689106fb0cac44010
 driver: lxc
 driver_version: 3.0.0
 kernel: Linux
 kernel_architecture: x86_64
 kernel_version: 4.16.0-041600-generic
 server: lxd
 server_pid: 2635
 server_version: 3.0.0
 storage: btrfs
 storage_version: "4.4"
 server_clustered: false
 server_name: Nephele

Issue description

A certain container fails to start.
A full system reboot fixes it, but restarting LXD alone does not seem to.

Steps to reproduce

$ lxc start container
# hangs; ctrl-c
$ lxc list container
# container | error | etc
$ lxc info container --show-log
# error: monitor is hung
$ sudo kill -9 [[[relevant lxc monitor]]]
$ lxc start container
# hangs; ctrl-c
$ lxc info container --show-log
# error: monitor is hung
$ sudo kill -9 [[[relevant lxc monitor]]]
$ snap restart lxd

# rinse and repeat, even after snap restart lxd

I don't need this container, so I won't reset my system until we can figure this out.
Let me know if you need any more logs, etc.
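
For reference, the [[[relevant lxc monitor]]] placeholder above is the per-container monitor process. A generic way to find its PID (illustrative, not part of the original report; the pattern matches the process title shown later in this thread):

$ ps aux | grep '\[lxc monitor\]'
$ sudo kill -9 <pid-from-the-output-above>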

@brauner

Member

brauner commented Apr 17, 2018

Can you strace the monitor process, please?

@stgraber

Member

stgraber commented Apr 17, 2018

Can you run lxc monitor --pretty --type logging in a separate terminal as you start the container? That may give us some pointers as to what LXD is doing.

@19wolf

Contributor

19wolf commented Apr 17, 2018

$ lxc monitor --pretty --type logging
DBUG[04-17|21:22:54] New event listener: ae5db3fc-cf15-4fd6-a326-bcd5c5a36138
DBUG[04-17|21:22:59] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/containers/Downloader
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/events
DBUG[04-17|21:23:00] New event listener: e2cbadaa-d449-4e31-b018-0e6f252fdca5
DBUG[04-17|21:23:00] handling                                 ip=@ method=PUT url=/1.0/containers/Downloader/state
DBUG[04-17|21:23:00] New task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] Started task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/operations/83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] Initializing a BTRFS driver.
DBUG[04-17|21:23:00] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:00] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
INFO[04-17|21:23:00] Starting container                       action=start created="2017-10-05 02:21:09 +0000 UTC" ephemeral=false name=Downloader stateful=false used="2018-04-17 20:04:06 +0000 UTC"
DBUG[04-17|21:23:01] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:23:01] handling                                 ip=@ method=GET url=/internal/containers/9/onstart
DBUG[04-17|21:23:01] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:01] Initializing a BTRFS driver.
DBUG[04-17|21:23:01] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:01] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:01] Scheduler: container Downloader started: re-balancing

After lxc info reports "monitor is hung", the debug log shows:

DBUG[04-17|21:25:24] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:25:24] Disconnected event listener: e2cbadaa-d449-4e31-b018-0e6f252fdca5
DBUG[04-17|21:25:24] handling                                 ip=@ method=GET url=/1.0/containers/Downloader

> Can you strace the monitor process, please?

Not really: the specific monitor process doesn't exist until I attempt to start the container, and then it gives no output.

@19wolf

Contributor

19wolf commented Apr 17, 2018

After I kill the [lxc monitor] process, the debug log shows:

EROR[04-17|21:32:28] balance: Unable to set cpuset            err="Failed to set cgroup cpuset.cpus=\"0,1\": setting cgroup item for the container failed" name=Downloader value=0,1
EROR[04-17|21:32:28] Failed starting container                action=start created="2017-10-05 02:21:09 +0000 UTC" ephemeral=false name=Downloader stateful=false used="2018-04-17 20:04:06 +0000 UTC"
DBUG[04-17|21:32:28] Failure for task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a: Failed to run: /snap/lxd/current/bin/lxd forkstart Downloader /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/Downloader/lxc.conf:
EROR[04-17|21:32:28] Failed to apply network priority         container=Downloader err="Failed to set network device priority: Can't set cgroups on a stopped container"
@brauner

Member

brauner commented Apr 17, 2018

Oh, I thought we were talking about [lxc monitor]. Never mind. Can you show the container's config located under:

/var/log/lxd/Downloader/lxc.conf

please?

@19wolf

Contributor

19wolf commented Apr 17, 2018

$ cat /var/log/lxd/Downloader/lxc.conf
cat: /var/log/lxd/Downloader/lxc.conf: No such file or directory

$ lxc config show Downloader
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu xenial amd64 (20171004_03:49)
  image.os: Ubuntu
  image.release: xenial
  image.serial: "20171004_03:49"
  limits.cpu.allowance: 10%
  limits.cpu.priority: "1"
  limits.network.priority: "1"
  linux.kernel_modules: ip_tables, ip6_tables
  security.privileged: "true"
  volatile.base_image: 369f84766c53e842cb0239d51c4b78743bd9d4b48dbbfde2020c1e6d9604760f
  volatile.eth0.hwaddr: 00:16:3e:8c:92:4c
  volatile.idmap.base: "0"
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
devices:
  hello:
    path: /data/hello
    source: /data/hello
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

@brauner

Member

brauner commented Apr 17, 2018

Hm, you're using the snap. What's the correct path to show the container's config for the snap, @stgraber?

Also, please show:

  • cat /proc/1/cgroup
  • findmnt | grep cgroup
@19wolf

Contributor

19wolf commented Apr 17, 2018

Not sure if timing matters for this (while the container is trying to start, or whenever); the container is not trying to start during this output.

$ cat /proc/1/cgroup
12:memory:/init.scope
11:freezer:/
10:blkio:/init.scope
9:hugetlb:/
8:cpuset:/
7:devices:/init.scope
6:rdma:/
5:cpu,cpuacct:/init.scope
4:pids:/init.scope
3:perf_event:/
2:net_cls,net_prio:/
1:name=systemd:/init.scope

$ findmnt | grep cgroup
│ ├─/sys/fs/cgroup                    tmpfs                  tmpfs       rw,mode=755
│ │ ├─/sys/fs/cgroup/systemd          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
│ │ ├─/sys/fs/cgroup/net_cls,net_prio cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ │ ├─/sys/fs/cgroup/perf_event       cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event
│ │ ├─/sys/fs/cgroup/pids             cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,pids,release_agent=/run/cgmanager/agents/cgm-release-agent.pids
│ │ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ │ ├─/sys/fs/cgroup/rdma             cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,rdma,release_agent=/run/cgmanager/agents/cgm-release-agent.rdma
│ │ ├─/sys/fs/cgroup/devices          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,devices
│ │ ├─/sys/fs/cgroup/cpuset           cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,cpuset,clone_children
│ │ ├─/sys/fs/cgroup/hugetlb          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb
│ │ ├─/sys/fs/cgroup/blkio            cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,blkio
│ │ ├─/sys/fs/cgroup/freezer          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,freezer
│ │ └─/sys/fs/cgroup/memory           cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,memory
@stgraber

Member

stgraber commented Apr 17, 2018

/var/snap/lxd/common/lxd/logs/Downloader/lxc.conf and /var/snap/lxd/common/lxd/logs/Downloader/lxc.log may be useful

@19wolf

Contributor

19wolf commented Apr 17, 2018

$ cat /var/snap/lxd/common/lxd/logs/Downloader/lxc.conf
lxc.log.file = /var/snap/lxd/common/lxd/logs/Downloader/lxc.log
lxc.log.level = warn
lxc.console.buffer.size = auto
lxc.console.size = auto
lxc.console.logfile = /var/snap/lxd/common/lxd/logs/Downloader/console.log
lxc.cap.drop = sys_time sys_module sys_rawio
lxc.mount.auto = proc:mixed sys:mixed
lxc.autodev = 1
lxc.pty.max = 1024
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir,optional
lxc.mount.entry = /dev/fuse dev/fuse none bind,create=file,optional
lxc.mount.entry = /dev/net/tun dev/net/tun none bind,create=file,optional
lxc.mount.entry = /proc/sys/fs/binfmt_misc proc/sys/fs/binfmt_misc none rbind,create=dir,optional
lxc.mount.entry = /sys/fs/fuse/connections sys/fs/fuse/connections none rbind,create=dir,optional
lxc.mount.entry = /sys/fs/pstore sys/fs/pstore none rbind,create=dir,optional
lxc.mount.entry = /sys/kernel/debug sys/kernel/debug none rbind,create=dir,optional
lxc.mount.entry = /sys/kernel/security sys/kernel/security none rbind,create=dir,optional
lxc.include = /snap/lxd/current/lxc/config//common.conf.d/
lxc.cgroup.devices.deny = a
lxc.cgroup.devices.allow = b *:* m
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = c 136:* rwm
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
lxc.cgroup.devices.allow = c 1:7 rwm
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 10:229 rwm
lxc.cgroup.devices.allow = c 10:200 rwm
lxc.arch = linux64
lxc.hook.pre-start = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd 9 start
lxc.hook.post-stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd 9 stop
lxc.tty.max = 0
lxc.uts.name = Downloader
lxc.mount.entry = /var/snap/lxd/common/lxd/devlxd dev/lxd none bind,create=dir 0 0
lxc.apparmor.profile = lxd-Downloader_</var/snap/lxd/common/lxd>//&:lxd-Downloader_<var-snap-lxd-common-lxd>:
lxc.seccomp.profile = /var/snap/lxd/common/lxd/security/seccomp/Downloader
lxc.environment = http_proxy=
lxc.cgroup.cpu.shares = 115
lxc.cgroup.cpu.cfs_period_us = 100000
lxc.rootfs.path = dir:/var/snap/lxd/common/lxd/containers/Downloader/rootfs
lxc.mount.entry = /var/snap/lxd/common/lxd/devices/Downloader/disk.hello.data-hello data/Downloads none bind,create=dir
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = br0
lxc.net.0.hwaddr = 00:16:3e:8c:92:4c
lxc.net.0.name = eth0
lxc.mount.entry = /var/snap/lxd/common/lxd/shmounts/Downloader dev/.lxd-mounts none bind,create=dir 0 0

$ cat /var/snap/lxd/common/lxd/logs/Downloader/lxc.log
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-1" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-1"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-2" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-2"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-3" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-3"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-4" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-4"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-5" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-5"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-6" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-6"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-7" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-7"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-8" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-8"
lxc 20180417213415.851 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.852 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
@stgraber

Member

stgraber commented Apr 17, 2018

snap info lxd?

@19wolf

Contributor

19wolf commented Apr 17, 2018

$ snap info lxd
name:      lxd
summary:   System container manager and API
publisher: canonical
contact:   https://github.com/lxc/lxd/issues
license:   unknown
description: |
  LXD is a container manager for system containers.
  
  It offers a REST API to remotely manage containers over the network, using an image based workflow
  and with support for live migration.
  
  Images are available for all Ubuntu releases and architectures as well as for a wide number of other
  Linux distributions.
  
  LXD containers are lightweight, secure by default and a great alternative to virtual machines.
commands:
  - lxd.benchmark
  - lxd.check-kernel
  - lxd.database
  - lxd.lxc
  - lxd
  - lxd.migrate
services:
  lxd.daemon: simple, enabled, inactive
snap-id:   J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:  stable
refreshed: 2018-04-09T19:13:08-04:00
installed:       3.0.0       (6643) 56MB -
channels:                           
  stable:        3.0.0       (6643) 56MB -
  candidate:     3.0.0       (6729) 56MB -
  beta:          ↑                       
  edge:          git-7c74528 (6747) 56MB -
  2.0/stable:    2.0.11      (6627) 27MB -
  2.0/candidate: 2.0.11      (6627) 27MB -
  2.0/beta:      ↑                       
  2.0/edge:      git-d71807e (6630) 25MB -
  3.0/stable:    3.0.0       (6649) 56MB -
  3.0/candidate: 3.0.0       (6730) 56MB -
  3.0/beta:      ↑                       
  3.0/edge:      git-bd47a49 (6716) 56MB -
@stgraber

Member

stgraber commented Apr 17, 2018

Anything suspicious looking in dmesg?

@brauner

Member

brauner commented Apr 17, 2018

So one unlikely thing could be that something inside the container creates subcgroups and sets cpuset values. Then LXD won't be able to change them anymore, because in cgroup v1 cpuset values don't apply recursively unless a specific flag is set, which was added to the kernel only recently.
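
A minimal sketch of that cgroup v1 behaviour (hypothetical paths; assumes a v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and a root shell):

# Parent is allowed CPUs 0-3; a child inside it pins itself to 2-3.
$ mkdir /sys/fs/cgroup/cpuset/demo
$ echo 0-3 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus
$ mkdir /sys/fs/cgroup/cpuset/demo/child
$ echo 2-3 > /sys/fs/cgroup/cpuset/demo/child/cpuset.cpus
# Shrinking the parent below what the child holds is rejected, which is
# roughly the situation a manager writing the parent's cpuset.cpus would hit:
$ echo 0-1 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus
# write error (EBUSY or EINVAL, depending on kernel version)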

@stgraber

Member

stgraber commented Apr 17, 2018

@brauner we're not seeing any cgroup-related failure though. The failure is:

lxc 20180417213415.851 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state
@stgraber

Member

stgraber commented Apr 17, 2018

@19wolf can you post ps fauxww while lxc start is hanging, too?

@brauner

Member

brauner commented Apr 17, 2018

@stgraber that's what I was going by

EROR[04-17|21:32:28] balance: Unable to set cpuset            err="Failed to set cgroup cpuset.cpus=\"0,1\": setting cgroup item for the container failed" name=Downloader value=0,1
@brauner

Member

brauner commented Apr 17, 2018

> lxc 20180417213415.851 ERROR lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state

Doesn't really tell you anything. It just means that the socket was closed because the container failed to start.

@19wolf

Contributor

19wolf commented Apr 17, 2018

dmesg output from just after attempting to launch:

[245357.391787] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180105/exfield-427)
[245357.394708] No Local Variables are initialized for Method [_PMM]
[245357.394709] No Arguments are initialized for method [_PMM]
[245357.394716] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180105/psparse-550)
[245357.396685] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180105/power_meter-338)
[245415.545209] audit: type=1400 audit(1524004327.881:284): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-Downloader_</var/snap/lxd/common/lxd>" pid=49828 comm="apparmor_parser"
[245415.590049] br0: port 20(vethX87JPV) entered blocking state
[245415.590052] br0: port 20(vethX87JPV) entered disabled state
[245415.590222] device vethX87JPV entered promiscuous mode
[245415.590411] IPv6: ADDRCONF(NETDEV_UP): vethX87JPV: link is not ready
@19wolf

Contributor

19wolf commented Apr 17, 2018

I think this is the relevant output? It's the only activity since 12pm today.

$ ps fauxww
nephele  50472  0.0  0.0  27332  2932 ?        Ss   18:35   0:00 SCREEN lxc start Downloader
nephele  50473  1.0  0.0  64620 11252 pts/1    Ssl+ 18:35   0:00  \_ lxc start Downloader
root     50502  0.0  0.0 122500  6808 ?        Ds   18:35   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers Downloader
@stgraber

Member

stgraber commented Apr 17, 2018

Stuck in I/O wait; that's no good.

Can you post cat /proc/50502/stack?
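
For anyone chasing the same symptom, a generic way to list D-state (uninterruptible) processes and dump a kernel stack (assumes root; /proc/<pid>/stack requires a kernel with stack tracing enabled):

$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
$ sudo cat /proc/<pid>/stack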

@stgraber

Member

stgraber commented Apr 17, 2018

@brauner The failed rebalance is expected: it was hanging on the container monitor process. Once the monitor died, it continued, but the container isn't running, so the cgroups can't be updated. The question is why the container isn't running :)

@brauner

Member

brauner commented Apr 17, 2018

That's why I was asking for the strace of the [lxc monitor] before. :)

@stgraber

Member

stgraber commented Apr 17, 2018

@brauner strace won't help you much on a process stuck in I/O wait, though; it'll just hang strace too :)

@brauner

Member

brauner commented Apr 17, 2018

I couldn't have known that it was stuck in Ds then, though. :)

@brauner

Member

brauner commented Apr 17, 2018

Exciting :)

@19wolf

Contributor

19wolf commented Apr 17, 2018

I killed the monitor, so I ran it again; hence the different PID.

$ sudo cat /proc/52048/stack
[<0>] copy_net_ns+0xa6/0x230
[<0>] create_new_namespaces+0x11f/0x1d0
[<0>] copy_namespaces+0x73/0xa0
[<0>] copy_process.part.34+0x949/0x1ab0
[<0>] _do_fork+0xcf/0x3b0
[<0>] SyS_clone+0x19/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
@brauner

Member

brauner commented Apr 17, 2018

Ok, this is a bug that @Blub and I have seen multiple times.

@brauner

Member

brauner commented Apr 17, 2018

I think @Blub has already sent something to LKML, but he's currently on vacation.

@brauner

Member

brauner commented Apr 17, 2018

See also a similar issue in earlier kernels (just for me and @Blub): lxc/lxc#2141

@19wolf

Contributor

19wolf commented Apr 17, 2018

So do we have the information we need? If so, I'd like to restart my system and get the container back (it's not necessary, but I would like it online).

It seems to fail a few times a month. More specifically, an app running in the container fails and it's easiest to just restart the container, but about every third time I do, it doesn't come back up.

@stgraber

Member

stgraber commented Apr 17, 2018

Oh yeah, it's that kernel bug... though what's weird is that @19wolf is reporting a 4.16 kernel here, so that should have the fix.

@brauner

Member

brauner commented Apr 17, 2018

The kernel networking code is currently being pretty heavily refactored. A bunch of operations that weren't async before have been marked as async, so the regression potential is there.

@brauner

Member

brauner commented Apr 17, 2018

  • Anything exciting about br0?
  • Does it only happen with privileged containers?
@brauner

Member

brauner commented Apr 17, 2018

I think the last person to touch that code significantly was Kirill from the Virtuozzo guys. I'll ping him tomorrow and see what he thinks.

@19wolf

Contributor

19wolf commented Apr 17, 2018

  • br0 has a lot of veths, but I think that's normal for me.
  • I just tested it; it seems anything I start/restart will hang, privileged or not.
@brauner

Member

brauner commented Apr 17, 2018

Hi @tkhai, I'm just gonna rope you in here for a second. Seems we're running into each other quite often recently. :) We're seeing an issue with copy_net_ns() that we've seen before, and I think you're one of the last people to touch that code. Any idea what's going on?

$ sudo cat /proc/52048/stack
[<0>] copy_net_ns+0xa6/0x230
[<0>] create_new_namespaces+0x11f/0x1d0
[<0>] copy_namespaces+0x73/0xa0
[<0>] copy_process.part.34+0x949/0x1ab0
[<0>] _do_fork+0xcf/0x3b0
[<0>] SyS_clone+0x19/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
@tkhai

tkhai commented Apr 17, 2018

The problem is that net_mutex is exclusive, and it's taken by both copy_net_ns() and cleanup_net(). The second function performs many heavy, sleepable actions under net_mutex, such as taking rtnl_lock() and doing RCU synchronization. So it's impossible to create a new net before cleanup_net() has finished.

I've submitted patches to change the locking of these functions, solving this problem. They will be in 4.17, while the kernel in the issue is 4.16.
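
A hedged way to check whether a given kernel tree already carries that locking rework, assuming a local checkout of the upstream source at ./linux (the pernet_ops_rwsem symbol name is taken from the 4.17-era series; verify against your own sources):

$ git -C linux log --oneline v4.16..v4.17 -- net/core/net_namespace.c
$ grep -n 'net_mutex\|pernet_ops_rwsem' linux/net/core/net_namespace.c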

@brauner

Member

brauner commented Apr 17, 2018

@tkhai, cool! Thanks for stopping by! :)

@stgraber

Member

stgraber commented Apr 17, 2018

Closing, as this is a kernel issue which has already been resolved upstream.
A temporary workaround may be to try another kernel version while waiting for the fix to make it into your current kernel. Or, if it's your own kernel, cherry-pick the fix.
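
A rough sketch of that workaround on Ubuntu, using a mainline test build (file names are illustrative; mainline kernels from https://kernel.ubuntu.com/~kernel-ppa/mainline/ are unsupported test packages):

$ uname -r                  # confirm the affected kernel
$ sudo dpkg -i linux-*.deb  # the image/headers .debs fetched from the mainline PPA
$ sudo reboot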

@stgraber stgraber closed this Apr 17, 2018

@19wolf

Contributor

19wolf commented Apr 17, 2018

I'm updating to 4.17-rc1; hopefully the issue doesn't recur.

@bodleytunes

bodleytunes commented Nov 28, 2018

Hiya, sorry to dig up an old thread, but this looks like the same issue I'm having with a nested container...

The difference is that I'm on the 4.15.0-42-generic kernel.

Is this still an issue on the latest kernels?

Cheers,
Jon

root 22999 0.0 0.0 297888 7628 ? Ds Nov27 0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers telegraf1

root@gns3vm:/home/jclayton# cat /proc/22999/stack
[<0>] copy_net_ns+0xab/0x230
[<0>] create_new_namespaces+0x11b/0x1e0
[<0>] copy_namespaces+0x6d/0xa0
[<0>] copy_process.part.35+0x941/0x1ad0
[<0>] _do_fork+0xdf/0x3f0
[<0>] SyS_clone+0x19/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
root@gns3vm:/home/jclayton# uname -r
4.15.0-42-generic

@C0rn3j

C0rn3j commented Dec 5, 2018

> Is this still an issue on the latest kernels?

Doesn't seem like it, but the latest kernel is 4.19.

> I've submitted patches to change the locking of these functions, solving this problem. They will be in 4.17, while the kernel in the issue is 4.16.

Unless your kernel has the patches backported, you could still hit the same issue on 4.15.
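
One way to check for such a backport on Ubuntu is the kernel package changelog (path assumes a stock linux-image package; the search terms are a guess at what the backport commit subjects would mention):

$ zgrep -i -e net_mutex -e copy_net_ns /usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz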

@bodleytunes

bodleytunes commented Dec 5, 2018

Ahh, right, fair enough. I thought I was on a super-new kernel.

I'll try and update.

Thanks,
Jon.
