Can't log in via SSH to Gentoo container on Gentoo host. Can't restart SSHD or reboot the OS from the container. #3569

Open
1 of 3 tasks
kolkov opened this issue Oct 29, 2020 · 4 comments

kolkov commented Oct 29, 2020

Required information

  • Distribution:
    Gentoo

  • Distribution version:
    Latest

  • The output of

  • lxc-start --version

4.0.5
  • lxc-checkconfig
LXC version 4.0.5
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroups: enabled

Cgroup v1 mount points:
/sys/fs/cgroup/openrc
/sys/fs/cgroup/cpuset
/sys/fs/cgroup/cpu
/sys/fs/cgroup/cpuacct
/sys/fs/cgroup/blkio
/sys/fs/cgroup/memory
/sys/fs/cgroup/devices
/sys/fs/cgroup/freezer
/sys/fs/cgroup/net_cls
/sys/fs/cgroup/perf_event
/sys/fs/cgroup/net_prio
/sys/fs/cgroup/hugetlb
/sys/fs/cgroup/pids

Cgroup v2 mount points:
/sys/fs/cgroup/unified

Cgroup v1 systemd controller: missing
Cgroup v1 clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled, loaded
Macvlan: enabled, not loaded
Vlan: enabled, not loaded
Bridges: enabled, loaded
Advanced netfilter: enabled, not loaded
CONFIG_NF_NAT_IPV4: missing
CONFIG_NF_NAT_IPV6: missing
CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded
FUSE (for use with lxcfs): enabled, not loaded

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities:

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
  • uname -a

Linux mainserver 5.4.72-gentoo-x86_64 #1 SMP Sun Oct 25 14:41:42 MSK 2020 x86_64 Intel(R) Xeon(R) CPU E5640 @ 2.67GHz GenuineIntel GNU/Linux

  • cat /proc/self/cgroup
13:pids:/
12:hugetlb:/
11:net_prio:/
10:perf_event:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:blkio:/
4:cpuacct:/
3:cpu:/
2:cpuset:/
1:name=openrc:/sshd
0::/sshd
  • cat /proc/1/mounts
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=10240k,nr_inodes=756688,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,noexec 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda2 / xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup_root /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755 0 0
openrc /sys/fs/cgroup/openrc cgroup rw,nosuid,nodev,noexec,relatime,release_agent=/lib/rc/sh/cgroup-release-agent.sh,name=openrc 0 0
none /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
cpuset /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cpu /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cpuacct /sys/fs/cgroup/cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct 0 0
blkio /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
memory /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
devices /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
freezer /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
net_cls /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
perf_event /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
net_prio /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
hugetlb /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
pids /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda3 /var xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/sdb3 /xdata xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,usrquota,prjquota 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda3 /var/lib/docker xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
nsfs /run/docker/netns/624faf15afb2 nsfs rw 0 0

Issue description

A few days after each reboot, I can no longer log in via SSH to a privileged container; lxc-attach still works correctly.

Steps to reproduce

I set up a new Gentoo container a year ago. One or two months ago I became unable to log in to the container via SSH. I connected to the container via lxc-attach and tried to restart sshd with rc-service sshd restart, but the command hung. I then tried to reboot the container with the reboot command from a new bash session opened via lxc-attach, with the same result.
I then stopped the container from the host with lxc-stop; after a very long time the container stopped.
I started the container again, and everything worked correctly for a few days.
This has repeated again and again.
I updated all packages in the container; it did not help.
I updated all packages on the host, including LXC to the latest version; that did not help either.
I updated the kernel to the latest stable (green) version from gentoo-sources, but sshd in the container still does not work as expected.
With ssh -vvv I can see that sshd is running, responds to packets and authenticates me, but the login never completes, and sshd cannot be restarted.
The sshd logs show nothing. :(
What's wrong?
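
One way to narrow down a hang like the one described above is to check whether the stuck processes (sshd, rc-service, syslog-ng) are in uninterruptible sleep, shown as state D in /proc/<pid>/stat. The following is a minimal sketch, assuming Python 3 is available on the host or in the container; `parse_stat` and `blocked_tasks` are illustrative helper names, not part of LXC or OpenRC.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

A process that stays in state D across several runs is waiting inside the kernel, which would point away from sshd itself; an empty result does not rule anything out.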

Information to attach

  • any relevant kernel output (dmesg)
  • container log (The file from running lxc-start -n <c> -l TRACE -o <logfile> )
  • the containers configuration file
# Template used to create this container: /usr/share/lxc/templates/lxc-download
# Parameters passed to the template:
# Template script checksum (SHA-1): 273c51343604eb85f7e294c8da0a5eb769d648f3
# For additional config options, please look at lxc.container.conf(5)

# Uncomment the following line to support nesting containers:
#lxc.include = /usr/share/lxc/config/nesting.conf
# (Be aware this has security implications)


# Distribution configuration
lxc.include = /usr/share/lxc/config/common.conf
lxc.arch = x86_64

# Container specific configuration
lxc.rootfs.path = dir:/xdata/lxc/frei/rootfs
lxc.uts.name = frei

# Network configuration
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.name = eth0
lxc.net.0.link = br0
lxc.net.0.veth.pair = veth-frei
lxc.net.0.hwaddr = 02:03:04:05:06:07
lxc.net.0.ipv4.address = 92.*.*.*/24
lxc.net.0.ipv4.gateway = 92.*.*.1
#lxc.network.hwaddr=98:22:e1:32:22:81


#filesystem
#lxc.mount.entry=proc proc proc nodev,noexec,nosuid 0 0
#lxc.mount.entry=shm dev/shm tmpfs rw,nosuid,nodev,noexec,relatime 0 0
#lxc.mount.entry=run run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
#lxc.mount.entry=run run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
lxc.mount.auto = proc sys cgroup

#lxc.mount.entry = /usr/portage /var/virt/eweb/usr/portage none ro,bind 0 0
lxc.mount.entry = /usr/portage/distfiles usr/portage/distfiles none,rw bind,optional,create=dir 0 0
#lxc.mount.entry = /var/run/mysqld /var/virt/eweb/run/mysqld none ro,bind 0 0
lxc.mount.entry = /var/run/mysqld var/mysqld none,ro bind,optional,create=dir 0 0

lxc.start.auto = 1
lxc.start.delay = 15
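
The two active lxc.mount.entry lines above follow the fstab field layout (source, target, type, options, dump, pass). The following is a small sketch, assuming plain whitespace-separated fields, that splits such a line to show which field each token lands in; `parse_mount_entry` is a hypothetical helper, not part of LXC.

```python
FIELDS = ("source", "target", "fstype", "options", "dump", "passno")


def parse_mount_entry(line):
    """Split an 'lxc.mount.entry = ...' line into its six fstab-style fields."""
    key, sep, value = line.partition("=")
    if not sep or key.strip() != "lxc.mount.entry":
        raise ValueError("not an lxc.mount.entry line")
    parts = value.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected 6 fields, got %d" % len(parts))
    entry = dict(zip(FIELDS, parts))
    # Mount options are comma-separated inside the fourth field.
    entry["options"] = entry["options"].split(",")
    return entry


if __name__ == "__main__":
    line = ("lxc.mount.entry = /var/run/mysqld var/mysqld "
            "none,ro bind,optional,create=dir 0 0")
    for name, value in parse_mount_entry(line).items():
        print(name, "=", value)
```

For the mysqld entry this reports the type as `none,ro` and the options as bind, optional and create=dir, so `ro` sits in the type field rather than the options field. Whether that has any bearing on the hang reported here is not established.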
@kolkov changed the title from "Can't login via SSH to Gentoo container on Gentoo host. Cant't reboot SSHD and container from container." to "Can't login via SSH to Gentoo container on Gentoo host. Cant't reboot SSHD and OS from container." on Oct 29, 2020

kolkov commented Oct 29, 2020

I tried to track down this bug and found that syslog-ng stops responding at startup if I debug sshd with log level DEBUG3.
By default messages are logged to tty12, but I changed this to tty4 in August because tty12 is not present in the container by default. I then commented out 2 lines in the syslog-ng config:
image
and now sshd at DEBUG3 level works correctly.
This looks like an LXC 4 runtime issue; on the host this works as expected.
Now I am waiting to see whether sshd stops working again.
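
The screenshot is not reproduced here. As a rough illustration only, assuming the container's syslog-ng.conf defines a destination that writes every message to a tty and a log path that feeds it, the change described would look something like the fragment below. The names `console_all` and `src` are assumptions and may differ from the actual file.

```
# Destination that writes every message to a tty
# (tty12 by default, tty4 after the change made in August)
#destination console_all { file("/dev/tty4"); };

# Log path that feeds that destination
#log { source(src); destination(console_all); };
```

With both lines commented out, syslog-ng no longer writes to the tty device at all, which fits the observation that sshd then works at DEBUG3.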


kolkov commented Oct 29, 2020

syslog-ng/syslog-ng#2595 looks like this issue.


kolkov commented Oct 29, 2020

Steps to reproduce:

  1. Create a new Gentoo container.
  2. Set up sshd.
  3. Change the sshd log level to DEBUG3.
  4. Install syslog-ng with emerge --ask syslog-ng.
  5. Change the syslog-ng config from tty12 to tty4.
  6. Reboot the container.
  7. Try to log in via SSH.
  8. SSH stops responding.
  9. syslog-ng hangs, which blocks both sshd and the container reboot.
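
The hang in the steps above would be consistent with a write to the tty device blocking. The following is a minimal sketch, assuming Python 3 is available in the container, that checks whether opening and writing to a path finishes within a timeout. `write_completes` is an illustrative name, and /dev/tty4 is simply the device from step 5. Note that it really does write a short line to the target.

```python
import subprocess
import sys

# Code run in a child process so that a blocked open() or write()
# cannot hang the caller.
_CHILD = (
    "import os, sys\n"
    "fd = os.open(sys.argv[1], os.O_WRONLY | os.O_NOCTTY)\n"
    "os.write(fd, b'probe\\n')\n"
    "os.close(fd)\n"
)


def write_completes(path, timeout=5.0):
    """Return True if a small write to `path` finishes within `timeout` seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort: a task stuck in uninterruptible sleep will not die,
        # so do not wait for it here.
        proc.kill()
        return False


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/dev/tty4"
    print(target, "ok" if write_completes(target) else "blocked or failed")
```

A False result means either a timeout or an error such as a missing device, so it is only a first hint; comparing the result on the host and inside the container may show whether the behaviour is specific to the container.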

@jkroonza

syslog-ng/syslog-ng#2595 looks like this issue.

Indeed it does.

In my case I'm also unable to log in on, for example, tty1, so it's particularly difficult to troubleshoot.

We are also considering a kernel bug. Systems on kernel 5.8.14, which we started deploying a while back, seem to behave better. We did find one kernel backtrace on a host where we were lucky enough to have a terminal open when the host went into this state; as far as I recall, it pointed at IO issues. Unfortunately I can't find the dmesg we gathered at the time, but there were IO requests that were stuck.

We make heavy use of LVM and snapshots (both thin and traditional/exception based), and the above kernel version included a number of device-mapper deadlock fixes which we suspect may be related. Scanning through the later kernel changelogs I also see tty: serial: fixes and some ext4 changes, any of which could be the actual underlying cause; it's always hard to say. But since in that one case we noticed that other IO was also blocked, not just syslog-ng, it hints at a larger underlying problem, though we could also be looking at multiple issues. I really find these problems hard to troubleshoot.

You are however able to simply reboot the container - which seems to hint at not a kernel issue ... especially since if you disable tty output it works correctly (assuming I understand you correctly?).

What further bugs me around the tty output theory ... yes, tty's are extremely slow when being viewed, but when they're not I get the impression they're really fast (eg, if you're watching tty12 you can sometimes see lag, switch away and back and the backlog is "instantly" cleared - we actually were able to measure the difference in time on a kernel compile on a tty when it was the active tty vs not).
