Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "cgfsng: try to delete parent cgroups" #1769

Merged
merged 2 commits into from
Aug 30, 2017

Conversation

brauner
Copy link
Member

@brauner brauner commented Aug 30, 2017

This reverts commit 92c590a.

Problem:

Commit 92c590a introduced the following
behavior:

> cgfsng: try to delete parent cgroups
>
> Say we have
>
>     lxc.uts.name = c1
>     lxc.cgroup.dir = lxd/a/b/c
>
> the path for the container's cgroup would be
>
>     lxd/a/b/c/c1
>
> When the container is shutdown we should not just try to delete "c1" we
> should also try to delete "c", "b", "a", and "lxd". This is to ensure
> that we don't leave empty cgroups around thereby increasing the chance
> that we run into trouble with cgroup limits. The algorithm for this isn't
> too costly since we can simply stop walking upwards at the first rmdir()
> failure.

The algorithm employs recursive_destroy() which opens each directory
specified in lxc.cgroup.dir and tries to delete each directory within that
directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
contains the cgroup "c1" for container "c1". Assume that "c1" calls
recursive_destroy() to cleanup it's cgroups. It will first delete "c1" and
anything underneath it. This is perfectly fine since anything underneath
that cgroup is under its control. The new algorithm will then tell it to
"recurse upwards". So recursive_destroy() will try to delete
"/sys/fs/cgroup/lxd/a/b/c" next. Now assume that a second container "c2"
has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
cgroup_create(). This will create the *empty* cgroup
"/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that after having created
"c2" container "c1"'s call to recursive_destroy() reaches
"/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter its
created cgroup. This will fail since c1 deleted the cgroup "c2". (As a
sidenote: This is in the set of the few race conditions that are actually
easy to describe.)

Possible Solution:

Instead of calling recursive_destroy() on all cgroups specified in
lxc.cgroup.dir we only call recursive_destroy() on the container's own
cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
avoid the race described above. My argument is as follows. Assume that the
container c1 has created the cgroup "/sys/fs/cgroup/lxd/a/b/c/c1" for
itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
called on the cgroup "c1" which will delete any emtpy cgroup directories
underneath "c1" and finally "c1" itself. This is fine since everything
under "c1" is the container's c1 sole property. Now container c1 will call
unlinkat() on "/sys/fs/cgroup/memory/lxd/a/b/c/c1":
- Assume that in the meantime container c2 has created the cgroup
  "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
  This will stop c1 from recursing upwards. So c2's cgroup_enter() call
  will find all its cgroups intact and well. unlinkat() will come with the
  appropriate in-kernel locking which will stop it from racing with
  mkdir().
- There's still a subtle race left. c2 might be calling an implementation
  of mkdir -p to try and create e.g. the cgroup
  "/sys/fs/cgroup/memory/lxd/a/b". Let's assume "b" exists then c2 will
  receive EEXIST on "b" and move on to create "c". Let's further assume c1
  has already deleted "c". c1 will now be able to delete
  "/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.

The latter subtle race makes me rethink this approach. For now we'll just leave
empty cgroups behind since I don't want to start locking stuff.

Signed-off-by: Christian Brauner christian.brauner@ubuntu.com

Christian Brauner added 2 commits August 30, 2017 12:26
This reverts commit 92c590a.

Problem:

    Commit 92c590a introduced the following
    behavior:

    > cgfsng: try to delete parent cgroups
    >
    > Say we have
    >
    >     lxc.uts.name = c1
    >     lxc.cgroup.dir = lxd/a/b/c
    >
    > the path for the container's cgroup would be
    >
    >     lxd/a/b/c/c1
    >
    > When the container is shutdown we should not just try to delete "c1" we
    > should also try to delete "c", "b", "a", and "lxd". This is to ensure
    > that we don't leave empty cgroups around thereby increasing the chance
    > that we run into trouble with cgroup limits. The algorithm for this isn't
    > too costly since we can simply stop walking upwards at the first rmdir()
    > failure.

    The algorithm employs recursive_destroy() which opens each directory
    specified in lxc.cgroup.dir and tries to delete each directory within that
    directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
    contains the cgroup "c1" for container "c1". Assume that "c1" calls
    recursive_destroy() to cleanup it's cgroups. It will first delete "c1" and
    anything underneath it. This is perfectly fine since anything underneath
    that cgroup is under its control. The new algorithm will then tell it to
    "recurse upwards". So recursive_destroy() will try to delete
    "/sys/fs/cgroup/lxd/a/b/c" next. Now assume that a second container "c2"
    has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
    cgroup_create(). This will create the *empty* cgroup
    "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that after having created
    "c2" container "c1"'s call to recursive_destroy() reaches
    "/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
    cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter its
    created cgroup. This will fail since c1 deleted the cgroup "c2". (As a
    sidenote: This is in the set of the few race conditions that are actually
    easy to describe.)

Possible Solution:

    Instead of calling recursive_destroy() on all cgroups specified in
    lxc.cgroup.dir we only call recursive_destroy() on the container's own
    cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
    upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
    avoid the race described above. My argument is as follows. Assume that the
    container c1 has created the cgroup "/sys/fs/cgroup/lxd/a/b/c/c1" for
    itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
    called on the cgroup "c1" which will delete any emtpy cgroup directories
    underneath "c1" and finally "c1" itself. This is fine since everything
    under "c1" is the container's c1 sole property. Now container c1 will call
    unlinkat() on "/sys/fs/cgroup/memory/lxd/a/b/c/c1":
    - Assume that in the meantime container c2 has created the cgroup
      "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
      This will stop c1 from recursing upwards. So c2's cgroup_enter() call
      will find all its cgroups intact and well. unlinkat() will come with the
      appropriate in-kernel locking which will stop it from racing with
      mkdir().
    - There's still a subtle race left. c2 might be calling an implementation
      of mkdir -p to try and create e.g. the cgroup
      "/sys/fs/cgroup/memory/lxd/a/b". Let's assume "b" exists then c2 will
      receive EEXIST on "b" and move on to create "c". Let's further assume c1
      has already deleted "c". c1 will now be able to delete
      "/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.

The latter subtle race makes me rethink this approach. For now we'll just leave
empty cgroups behind since I don't want to start locking stuff.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
set_config_string_item() already free()s before setting the new value.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
@stgraber stgraber merged commit 70a4981 into lxc:master Aug 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants