New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libcontainerd: fix reaper goroutine position #33419

Merged
merged 1 commit into from May 30, 2017

Conversation

Projects
None yet
6 participants
@runcom
Member

runcom commented May 27, 2017

It has observed defunct containerd processes accumulating over
time while dockerd was permanently failing to restart containerd.
Due to a bug in the runContainerdDaemon() function, dockerd does not clean up
its child process if containerd already exits very soon after the (re)start.

The reproducer and analysis below comes from docker 1.12.x but bug
still applies on latest master.

Analysis from Ulrich Obergfell.

  • from libcontainerd/remote_linux.go:
  329 func (r *remote) runContainerdDaemon() error {
   :
   :      // start the containerd child process
   :
  403     if err := cmd.Start(); err != nil {
  404             return err
  405     }
   :
   :      // If containerd exits very soon after (re)start, it is
possible
   :      // that containerd is already in defunct state at the time
when
   :      // dockerd gets here. The setOOMScore() function tries to
write
   :      // to /proc/PID_OF_CONTAINERD/oom_score_adj. However, this
fails
   :      // with errno EINVAL because containerd is defunct. Please see
   :      // snippets of kernel source code and further explanation
below.
   :
  407     if err := setOOMScore(cmd.Process.Pid, r.oomScore); err != nil
{
  408             utils.KillProcess(cmd.Process.Pid)
   :
   :              // Due to the error from write() we return here. As
the
   :              // goroutine that would clean up the child has not
been
   :              // started yet, containerd remains in the defunct
state
   :              // and never gets reaped.
   :
  409             return err
  410     }
   :
  417     go func() {
  418             cmd.Wait()
  419             close(r.daemonWaitCh)
  420     }() // Reap our child when needed
   :
  423 }

This is the kernel function that gets invoked when dockerd tries to
write
to /proc/PID_OF_CONTAINERD/oom_score_adj.

  • from fs/proc/base.c:
 1197 static ssize_t oom_score_adj_write(struct file *file, ...
 1198                                         size_t count, loff_t
*ppos)
 1199 {
   :
 1223         task = get_proc_task(file_inode(file));
   :
   :          // The defunct containerd process does not have a virtual
   :          // address space anymore, i.e. task->mm is NULL. Thus the
   :          // following code returns errno EINVAL to dockerd.
   :
 1230         if (!task->mm) {
 1231                 err = -EINVAL;
 1232                 goto err_task_lock;
 1233         }
   :
 1253 err_task_lock:
   :
 1257         return err < 0 ? err : count;
 1258 }

The purpose of the following program is to demonstrate the behavior of
the oom_score_adj_write() function in connection with a defunct process.

$ cat defunct_test.c

#include <unistd.h>

main()
{
    pid_t pid = fork();

    if (pid == 0)
        // child
        _exit(0);

    // parent
    pause();
}

$ make defunct_test
cc     defunct_test.c   -o defunct_test

$ ./defunct_test &
[1] 3142

$ ps -f | grep defunct_test | grep -v grep
root      3142  2956  0 13:04 pts/0    00:00:00 ./defunct_test
root      3143  3142  0 13:04 pts/0    00:00:00 [defunct_test] <defunct>

$ echo "ps 3143" | crash -s
  PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  3143   3142   2  ffff880035def300  ZO   0.0       0      0
defunct_test

$ echo "px ((struct task_struct *)0xffff880035def300)->mm" | crash -s
$1 = (struct mm_struct *) 0x0
                          ^^^ task->mm is NULL

$ cat /proc/3143/oom_score_adj
0

$ echo 0 > /proc/3143/oom_score_adj
-bash: echo: write error: Invalid argument"

This patch fixes the above issue by making sure we start the reaper
goroutine as soon as possible.

Signed-off-by: Antonio Murdaca runcom@redhat.com

libcontainerd: fix reaper goroutine position
It has observed defunct containerd processes accumulating over
time while dockerd was permanently failing to restart containerd.
Due to a bug in the runContainerdDaemon() function, dockerd does not clean up
its child process if containerd already exits very soon after the (re)start.

The reproducer and analysis below comes from docker 1.12.x but bug
still applies on latest master.

- from libcontainerd/remote_linux.go:

  329 func (r *remote) runContainerdDaemon() error {
   :
   :      // start the containerd child process
   :
  403     if err := cmd.Start(); err != nil {
  404             return err
  405     }
   :
   :      // If containerd exits very soon after (re)start, it is
possible
   :      // that containerd is already in defunct state at the time
when
   :      // dockerd gets here. The setOOMScore() function tries to
write
   :      // to /proc/PID_OF_CONTAINERD/oom_score_adj. However, this
fails
   :      // with errno EINVAL because containerd is defunct. Please see
   :      // snippets of kernel source code and further explanation
below.
   :
  407     if err := setOOMScore(cmd.Process.Pid, r.oomScore); err != nil
{
  408             utils.KillProcess(cmd.Process.Pid)
   :
   :              // Due to the error from write() we return here. As
the
   :              // goroutine that would clean up the child has not
been
   :              // started yet, containerd remains in the defunct
state
   :              // and never gets reaped.
   :
  409             return err
  410     }
   :
  417     go func() {
  418             cmd.Wait()
  419             close(r.daemonWaitCh)
  420     }() // Reap our child when needed
   :
  423 }

This is the kernel function that gets invoked when dockerd tries to
write
to /proc/PID_OF_CONTAINERD/oom_score_adj.

- from fs/proc/base.c:

 1197 static ssize_t oom_score_adj_write(struct file *file, ...
 1198                                         size_t count, loff_t
*ppos)
 1199 {
   :
 1223         task = get_proc_task(file_inode(file));
   :
   :          // The defunct containerd process does not have a virtual
   :          // address space anymore, i.e. task->mm is NULL. Thus the
   :          // following code returns errno EINVAL to dockerd.
   :
 1230         if (!task->mm) {
 1231                 err = -EINVAL;
 1232                 goto err_task_lock;
 1233         }
   :
 1253 err_task_lock:
   :
 1257         return err < 0 ? err : count;
 1258 }

The purpose of the following program is to demonstrate the behavior of
the oom_score_adj_write() function in connection with a defunct process.

$ cat defunct_test.c

\#include <unistd.h>

main()
{
    pid_t pid = fork();

    if (pid == 0)
        // child
        _exit(0);

    // parent
    pause();
}

$ make defunct_test
cc     defunct_test.c   -o defunct_test

$ ./defunct_test &
[1] 3142

$ ps -f | grep defunct_test | grep -v grep
root      3142  2956  0 13:04 pts/0    00:00:00 ./defunct_test
root      3143  3142  0 13:04 pts/0    00:00:00 [defunct_test] <defunct>

$ echo "ps 3143" | crash -s
  PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  3143   3142   2  ffff880035def300  ZO   0.0       0      0
defunct_test

$ echo "px ((struct task_struct *)0xffff880035def300)->mm" | crash -s
$1 = (struct mm_struct *) 0x0
                          ^^^ task->mm is NULL

$ cat /proc/3143/oom_score_adj
0

$ echo 0 > /proc/3143/oom_score_adj
-bash: echo: write error: Invalid argument"

---

This patch fixes the above issue by making sure we start the reaper
goroutine as soon as possible.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
@runcom

This comment has been minimized.

Show comment
Hide comment
@runcom

runcom May 28, 2017

Member

For completeness I'd like to point out that newer kernel versions such as
4.10.17-100.fc24.x86_64 behave differently if an attempt is made to write
to /proc/PID/oom_score_adj of a defunct process. These kernel versions do
no longer return EINVAL because the if (!task->mm) { err = -EINVAL ... }
code block has been removed from the oom_score_adj_write() function by the
following commit in upstream kernel 4.8:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d49fbf766d27bc721caa48b391103d71e90804fa

  commit d49fbf766d27bc721caa48b391103d71e90804fa
  Author: Michal Hocko <mhocko@suse.com>
  Date:   Thu Jul 28 15:44:34 2016 -0700

      proc, oom: drop bogus task_lock and mm check

  $ git tag --contains d49fbf766d27bc721caa48b391103d71e90804fa | \
  > grep -v "\-rc"
  v4.10
  v4.11
  v4.8
  v4.9

In other words, if dockerd is running on these kernel versions it is less
prone to the bug in the runContainerdDaemon() function. The following test
demonstrates the different behavior:

  # uname -r
  4.10.17-100.fc24.x86_64

  # ./defunct_test &
  [1] 8801

  # ps -f | grep defunct_test | grep -v grep
  root      8801  3518  0 13:19 pts/2    00:00:00 ./defunct_test
  root      8802  8801  0 13:19 pts/2    00:00:00 [defunct_test] <defunct>

  # cat /proc/8802/oom_score_adj
  0

  # echo 500 > /proc/8802/oom_score_adj

  # cat /proc/8802/oom_score_adj
  500
Member

runcom commented May 28, 2017

For completeness I'd like to point out that newer kernel versions such as
4.10.17-100.fc24.x86_64 behave differently if an attempt is made to write
to /proc/PID/oom_score_adj of a defunct process. These kernel versions do
no longer return EINVAL because the if (!task->mm) { err = -EINVAL ... }
code block has been removed from the oom_score_adj_write() function by the
following commit in upstream kernel 4.8:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d49fbf766d27bc721caa48b391103d71e90804fa

  commit d49fbf766d27bc721caa48b391103d71e90804fa
  Author: Michal Hocko <mhocko@suse.com>
  Date:   Thu Jul 28 15:44:34 2016 -0700

      proc, oom: drop bogus task_lock and mm check

  $ git tag --contains d49fbf766d27bc721caa48b391103d71e90804fa | \
  > grep -v "\-rc"
  v4.10
  v4.11
  v4.8
  v4.9

In other words, if dockerd is running on these kernel versions it is less
prone to the bug in the runContainerdDaemon() function. The following test
demonstrates the different behavior:

  # uname -r
  4.10.17-100.fc24.x86_64

  # ./defunct_test &
  [1] 8801

  # ps -f | grep defunct_test | grep -v grep
  root      8801  3518  0 13:19 pts/2    00:00:00 ./defunct_test
  root      8802  8801  0 13:19 pts/2    00:00:00 [defunct_test] <defunct>

  # cat /proc/8802/oom_score_adj
  0

  # echo 500 > /proc/8802/oom_score_adj

  # cat /proc/8802/oom_score_adj
  500
@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah
Member

thaJeztah commented May 30, 2017

ping @justincormack PTAL

@thaJeztah thaJeztah referenced this pull request May 30, 2017

Closed

17.06.0 RC2 tracker #2

23 of 23 tasks complete
@mlaventure

LGTM, thanks!

// process.
r.daemonWaitCh = make(chan struct{})
go func() {

This comment has been minimized.

@cpuguy83

cpuguy83 May 30, 2017

Contributor

Maybe we can make sure this goroutine is started before proceeding if the above note is so important (I'm not familiar with this code path).
Something like:

waitCh := make(chan struct{})
go func() {
    waitCh <- struct{}{}
    cmd.Wait()
    close(waitCh)
}()

<-waitCh
r.daemonWaitCh = waitCh
@cpuguy83

cpuguy83 May 30, 2017

Contributor

Maybe we can make sure this goroutine is started before proceeding if the above note is so important (I'm not familiar with this code path).
Something like:

waitCh := make(chan struct{})
go func() {
    waitCh <- struct{}{}
    cmd.Wait()
    close(waitCh)
}()

<-waitCh
r.daemonWaitCh = waitCh

This comment has been minimized.

@runcom

runcom May 30, 2017

Member

mmm that cmd.Wait() will lay around to reap processes afaict and we can't wait on it with a channel (we'll hang the whole process otherwise), @mlaventure correct me if I'm wrong. With this PR we're not making sure cmd.Wait starts at all, we're making sure it starts really before any error code path kicks in and reaper won't work.

@runcom

runcom May 30, 2017

Member

mmm that cmd.Wait() will lay around to reap processes afaict and we can't wait on it with a channel (we'll hang the whole process otherwise), @mlaventure correct me if I'm wrong. With this PR we're not making sure cmd.Wait starts at all, we're making sure it starts really before any error code path kicks in and reaper won't work.

This comment has been minimized.

@mlaventure

mlaventure May 30, 2017

Contributor

Yes, all that matters is that cmd.Wait() is invoked at one point for this process, no need to have it synced at all. If dockerd dies before that happens, init will reap it for us.

@mlaventure

mlaventure May 30, 2017

Contributor

Yes, all that matters is that cmd.Wait() is invoked at one point for this process, no need to have it synced at all. If dockerd dies before that happens, init will reap it for us.

This comment has been minimized.

@cpuguy83

cpuguy83 May 30, 2017

Contributor

Ok that makes sense. Thanks for clarifying.

@cpuguy83

cpuguy83 May 30, 2017

Contributor

Ok that makes sense. Thanks for clarifying.

@cpuguy83

LGTM

@mlaventure mlaventure merged commit 4f55e39 into moby:master May 30, 2017

6 checks passed

dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 34531 has succeeded
Details
janky Jenkins build Docker-PRs 43128 has succeeded
Details
powerpc Jenkins build Docker-PRs-powerpc 3517 has succeeded
Details
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 14433 has succeeded
Details
z Jenkins build Docker-PRs-s390x 3241 has succeeded
Details

@GordonTheTurtle GordonTheTurtle added this to the 17.06.0 milestone May 30, 2017

@runcom runcom deleted the runcom:fix-reaper-containerd branch May 30, 2017

@andrewhsu andrewhsu referenced this pull request Jun 6, 2017

Closed

17.06.0 RC3 tracker #8

40 of 40 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment