New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong 'ps' process start times shown inside containers #3517

Closed
candlerb opened this Issue Jul 10, 2017 · 2 comments

Comments

2 participants
@candlerb

candlerb commented Jul 10, 2017

Required information

  • Distribution: ubuntu
  • Distribution version: 16.04
  • The output of "lxc info"
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- id_map
api_status: stable
api_version: "1.0"
auth: trusted
public: false
environment:
  addresses:
  - 10.12.255.21:8443
  - '[2001:XXXX:XXXX:XXXX::21]:8443'
  - 10.12.21.1:8443
  - '[2001:YYYY:YYYY:YYYY::1]:8443'
  architectures:
  - x86_64
  - i686
  certificate: <snip>
  certificate_fingerprint: <snip>
  driver: lxc
  driver_version: 2.0.8
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.4.0-83-generic
  server: lxd
  server_pid: 1382
  server_version: 2.0.10
  storage: btrfs
  storage_version: "4.4"

Issue description

Inside a container, the process start times shown by ps appear to be wrong.

Steps to reproduce

Inside a container (which was launched from ubuntu:16.04 image)

root@apt-cacher:~# ps auxwww | grep apt
apt-cac+   309  0.0  0.6 653476  6884 ?        Ssl  Jul09   0:59 /usr/sbin/apt-cacher-ng SocketPath=/run/apt-cacher-ng/socket -c /etc/apt-cacher-ng ForeGround=1
root      1356  0.0  0.1  14620  1096 ?        S+   21:38   0:00 grep --color=auto apt
root@apt-cacher:~# date
Mon Jul 10 08:12:57 UTC 2017

Note the start time of the transient "grep" process (pid 1356) is shown as 21:38, although the system time is 08:12 ! The system time is correct in UTC.

I did wonder whether the timezone on the outer host made a difference. As it happens, this is set for a different zone (UTC+3)

root@lxd1:~# date; lxc exec apt-cacher date
Mon Jul 10 11:15:18 EEST 2017
Mon Jul 10 08:15:18 UTC 2017
root@lxd1:~# ps auxwww | grep ntp
ntp       1356  0.0  0.2 110032  2060 ?        Ss   Jul08   0:17 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 111:117
root     23428  0.0  0.0  14228   944 pts/1    S+   11:15   0:00 grep --color=auto ntp

Notice that (a) ntp is running on the outer host, and (b) the process start time for the grep process is correct (11:15). However this cannot explain the offset from 08:12 to 21:38.

I first discovered this on a different, production host. Inside a container there:

# ps auxwww | grep nagios
nagios   27954  0.0  0.1  39924  5724 ?        SNsl Jul17   0:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
root     31549  0.0  0.0  14620   948 ?        S+   Jul17   0:00 grep --color=auto nagios
# date
Mon Jul 10 08:16:46 UTC 2017

Notice how this time it gives the ps start time as Jul17. I think that means 7 days in the future! (The nagios process had itself been recently restarted too)

I don't think it means July 2017: there are other processes showing Jul06

# ps auxwww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  37576  5756 ?        Ss   Jul06   0:04 /sbin/init
root        53  0.0  0.0  41588  2880 ?        Ss   Jul06   0:00 /lib/systemd/systemd-udevd
...
www-data 18900  0.0  0.1 255888  7924 ?        S    Jul17   0:00 /usr/sbin/apache2 -k start
root     29034  0.0  0.0  21624  3788 ?        Ss   Jul17   0:00 bash

Since the nagios process is stable and showing this problem, I can investigate. I presume the start time information comes from the /proc filesystem.

# strace -f 2>&1 ps -uw -q 27954 | grep /proc
open("/proc/filesystems", O_RDONLY)     = 4
open("/proc/self/stat", O_RDONLY)       = 4
open("/proc/uptime", O_RDONLY)          = 4
open("/usr/share/locale/C.UTF-8/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/C.utf8/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/C/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/C.UTF-8/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/C.utf8/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/C/LC_MESSAGES/procps-ng.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/self/stat", O_RDONLY)       = 5
open("/proc/sys/kernel/pid_max", O_RDONLY) = 5
open("/proc/meminfo", O_RDONLY)         = 5
stat("/proc/self/task", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
stat("/proc/27954", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/27954/stat", O_RDONLY)      = 6
open("/proc/27954/status", O_RDONLY)    = 6
open("/proc/27954/cmdline", O_RDONLY)   = 6
open("/proc/stat", O_RDONLY)            = 6

# cat /proc/27954/stat
27954 (nagios3) S 1 27954 27954 0 -1 1077936192 107582 3722456 0 0 140 163 1608 550 25 5 2 0 94495291 40882176 1431 18446744073709551615 93841558839296 93841559513156 140728487896912 140728487896016 140435164347421 0 0 4224 16794629 0 0 0 17 0 0 0 0 0 0 93841561614312 93841561617368 93841566793728 140728487898946 140728487898991 140728487898991 140728487899110 0

According to kernel doc the process start time is column 22 of /proc/PID/stat, and this is start time relative to system boot (although it's unclear what the units are - jiffies?)

# cat /proc/27954/stat | awk '{print $22}'; cat /proc/uptime
94495291
354711.0 285797.0

Aha. I wonder if /proc/uptime is showing the time the container started? The time since system boot in the outer host is different:

root@wrn-mon2:~# cat /proc/uptime
945951.46 3555916.94

The difference between 945951.46 and 354711.0 (in seconds) is 6.84 days, hence the 7 days difference I'm seeing.

Going back to the test host I used at the start:

root@lxd1:~# cat /proc/uptime; lxc exec apt-cacher cat /proc/uptime
142834.76 281500.47
94499.0 94361.0

DIfference of 48335.76 seconds, or 13 hours 25.6 minutes, which gets you from 08:12 to 21:38 (in the future). Matches exactly what I saw.

So in summary: it appears that the kernel is giving process start times relative to host boot, but /proc/uptime is giving the time since container startup, and this is confusing ps.

The way to reproduce this reliably is to start a fresh lxd container on a host which has itself been running for a long period of time; the longer the host has been running, the longer the ps offset will be seen.

@brauner

This comment has been minimized.

Show comment
Hide comment
@brauner

brauner Jul 10, 2017

Member

This is not a bug in LXD but a bug in LXCFS which we already fixed. The new LXCFS package should be available soon!

Member

brauner commented Jul 10, 2017

This is not a bug in LXD but a bug in LXCFS which we already fixed. The new LXCFS package should be available soon!

@brauner brauner closed this Jul 10, 2017

@candlerb

This comment has been minimized.

Show comment
Hide comment
@candlerb

candlerb Jul 10, 2017

Thanks, for reference this appears to be lxc/lxcfs#189 and lxc/lxcfs#164

candlerb commented Jul 10, 2017

Thanks, for reference this appears to be lxc/lxcfs#189 and lxc/lxcfs#164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment