cgroupv2: ebpf: check for BPF_F_REPLACE support and degrade gracefully #2986

Merged 2 commits, Jun 8, 2021
Conversation

@cyphar (Member) commented Jun 2, 2021:

It turns out that the cilium eBPF library doesn't degrade gracefully if
BPF_F_REPLACE is not supported, so we need to work around it by treating
that case as we treat the more-than-one program case.

Fixes: d0f2c25 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
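
A minimal sketch of the degradation path being described. attachProg and detachProg are hypothetical helpers standing in for the real attach/detach plumbing in libcontainer/cgroups/ebpf; only the flag handling and the fallback order reflect the change:

```go
package ebpf

import (
	"errors"

	"golang.org/x/sys/unix"
)

// attachProg attaches newProg to the cgroup dirFD with the given flags; when
// unix.BPF_F_REPLACE is set, oldProg is the program to replace atomically.
// (hypothetical helper)
func attachProg(dirFD int, newProg, oldProg interface{}, flags uint32) error {
	panic("sketch only")
}

// detachProg detaches oldProg from the cgroup. (hypothetical helper)
func detachProg(dirFD int, oldProg interface{}) error {
	panic("sketch only")
}

// replaceFilter swaps oldProg for newProg, degrading gracefully on kernels
// that do not support BPF_F_REPLACE.
func replaceFilter(dirFD int, oldProg, newProg interface{}) error {
	// Fast path: atomic replacement. BPF_F_REPLACE has to be passed
	// explicitly -- the cilium library does not add it for us, and omitting
	// it makes the kernel reject the replacement with EINVAL.
	err := attachProg(dirFD, newProg, oldProg, unix.BPF_F_ALLOW_MULTI|unix.BPF_F_REPLACE)
	if err == nil || !errors.Is(err, unix.EINVAL) {
		return err
	}
	// Older kernels return EINVAL for BPF_F_REPLACE. Degrade gracefully by
	// treating this like the more-than-one-program case: attach the new
	// program first, then remove the old one.
	if err := attachProg(dirFD, newProg, nil, unix.BPF_F_ALLOW_MULTI); err != nil {
		return err
	}
	return detachProg(dirFD, oldProg)
}
```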

@cyphar cyphar added this to the 1.0.0 milestone Jun 2, 2021
@cyphar (Member, Author) commented Jun 2, 2021:

@AkihiroSuda Is it possible to do a test moby CI run with this PR to make sure I'm actually fixing the original issue? Thanks.

@AkihiroSuda (Member):

@cyphar Yes, being tested in moby/moby#42450

@AkihiroSuda (Member):

Still failing 😢
moby/moby#42450 (comment)

@cyphar (Member, Author) commented Jun 2, 2021:

Oh, my bad -- the cilium library doesn't automatically pass BPF_F_REPLACE. Oops. It's a bit frustrating that we don't have unit tests for this BPF_F_REPLACE behaviour, but it seems that runc update doesn't trigger it? (Forgetting BPF_F_REPLACE means this was always broken, AFAICS?)

@kolyshkin (Contributor):

I tried using the fix from this PR together with the test case in #3000 (on my Fedora 34 with the kernel 5.12.8-300.fc34.x86_64).

It fixes the "found more than one filter (2) attached to a cgroup" warning for the non-systemd case (i.e. when using fs2 cgroup driver), but for systemd it's still there (this is with some additional debug for the test case itself):

=== RUN   TestUpdateDevicesSystemd
    update_test.go:76: [0] allowed: true output: sh: write error: No space left on device
WARN[0004] found more than one filter (2) attached to a cgroup -- removing extra filters! 
INFO[0004] removing old filter 0 from cgroup             id=14155 name= run_count=0 runtime=0s tag=531db05b114e9af3 type=CGroupDevice
INFO[0004] removing old filter 1 from cgroup             id=14156 name= run_count=0 runtime=0s tag=a04f5eef06a7f555 type=CGroupDevice
    update_test.go:76: [1] allowed: false output: /bin/sh: can't create /dev/full: Operation not permitted
        cat: can't open '/dev/null': Operation not permitted
WARN[0004] found more than one filter (2) attached to a cgroup -- removing extra filters! 
INFO[0004] removing old filter 0 from cgroup             id=14157 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
INFO[0004] removing old filter 1 from cgroup             id=14158 name= run_count=0 runtime=0s tag=3b0b81b071f088cd type=CGroupDevice

I have not dug into it, but I think this happens because we let systemd apply the device configuration and then apply it again using the fs2 driver, since (*systemd.unifiedManager).Set() calls (*fs2.manager).Set() -- see the sketch below.

This fs[2].Set() call always was, and still is, a problem -- not just for devices but in general: we let systemd set everything, then re-apply it all using the fs[2] driver, which is against what the systemd docs tell us to do.
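
A rough sketch of the flow being described (the method names come from the comment above; the stub types and bodies are assumptions, not runc's actual code):

```go
package systemd

// Hypothetical stand-ins for runc's real types, just so the sketch compiles.
type resources struct{}

type fs2Manager struct{}

func (m *fs2Manager) Set(r *resources) error { return nil } // writes cgroupfs directly

type unifiedManager struct {
	fsMgr *fs2Manager
}

func (m *unifiedManager) setUnitProperties(r *resources) error { return nil } // via D-Bus

// Set applies the same configuration twice: once through systemd, then again
// through the fs2 driver -- which, for devices, attaches a second eBPF
// program on top of the one systemd has already installed.
func (m *unifiedManager) Set(r *resources) error {
	if err := m.setUnitProperties(r); err != nil {
		return err
	}
	return m.fsMgr.Set(r)
}
```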

OTOH, removing this entirely can be problematic, as not all resources can be set using systemd.

Ultimately we should solve this by using the fs drivers only for those parameters that systemd can't set -- something I was thinking about but never got around to implementing.

@cyphar (Member, Author) commented Jun 6, 2021:

Yeah, the reason we set it with both was:

  1. You need to tell systemd the settings it can set, because it will happily reset them behind your back if you don't tell it what the correct settings are.
  2. Systemd doesn't support setting all cgroup knobs.
  3. Older versions of systemd don't support all the knobs we might set, and systemd doesn't tell us whether a knob we set was supported or not.

So while fixing this might seem like a good idea given (1) and (2), (3) makes this quite a bit more difficult and a bit worrying -- if tomorrow systemd added support for ControllerFoo, and we didn't set ControllerFoo in the fs controller if running under systemd, neither runc nor systemd would set up ControllerFoo on older systemd versions. You could probably fix this with version or feature detection, though I'm not sure how expensive that would be.

However with regards to this particular issue -- I guess this means systemd doesn't remove existing device policies when it adds its own (if it did then we would just BPF_F_REPLACE the systemd policy)? I guess that technically isn't a bad way of doing it, but it goes against what I would expect systemd to do (set everything to be in the state it thinks it should be in).

@kolyshkin (Contributor) commented Jun 6, 2021:

> Yeah, the reason we set it with both was:
>
>   1. You need to tell systemd the settings it can set, because it will happily reset them behind your back if you don't tell it what the correct settings are.

Can you elaborate?

>   2. Systemd doesn't support setting all cgroup knobs.
>   3. Older versions of systemd don't support all the knobs we might set, and systemd doesn't tell us whether a knob we set was supported or not.

Since systemd errors out when we try to set a parameter it does not know about, we had to introduce and use a check for the systemd version (initially for CPUQuotaPeriod, which requires systemd v242+, commit e751a16; later for AllowedCPUs and AllowedMemoryNodes, which require v244+, commit a35cad3).

So we already know which parameters we do and do not set using systemd (at least in theory). Surely this could be coded much better; currently it looks like a band-aid (which I guess is OK since it is only applied in two places). A rough sketch of the idea is below.
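
A sketch of what that version gating looks like (the helper and property names are assumptions based on the commits cited above, not runc's exact code):

```go
package systemd

// systemdVersion returns the version of the running systemd daemon.
// (hypothetical helper; the real check reads systemd's version over D-Bus)
func systemdVersion() int { return 0 }

type property struct {
	Name  string
	Value interface{}
}

// versionedProperties appends only the properties that the running systemd
// understands, since systemd errors out on unknown transient-unit properties.
func versionedProperties(props []property, quotaPeriodUSec uint64, allowedCPUs string) []property {
	v := systemdVersion()
	if quotaPeriodUSec != 0 && v >= 242 { // CPUQuotaPeriod needs systemd v242+
		props = append(props, property{"CPUQuotaPeriodUSec", quotaPeriodUSec})
	}
	if allowedCPUs != "" && v >= 244 { // AllowedCPUs needs systemd v244+
		props = append(props, property{"AllowedCPUs", allowedCPUs})
	}
	return props
}
```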

> However with regards to this particular issue -- I guess this means systemd doesn't remove existing device policies when it adds its own (if it did then we would just BPF_F_REPLACE the systemd policy)? I guess that technically isn't a bad way of doing it, but it goes against what I would expect systemd to do (set everything to be in the state it thinks it should be in).

It seems that this warning is becoming a common case -- maybe demote it to debug?

@cyphar (Member, Author) commented Jun 7, 2021:

@kolyshkin

> Can you elaborate?

Sure. Systemd has (at least in the past; I don't know if this is still the case today) had a habit of writing to the cgroup files of running services, with the settings it thinks the service should have -- I remember there was a systemd service on SLES which would trigger this every once in a while, but we couldn't really pin down the original cause. It's possible this was a bug of some kind, but we never managed to figure out what was causing it. I also don't know how much this is or is not true with Delegate=yes, but given the recent experience with DevicesAllow=, I think it's fair to say this is still a realistic issue.

I appreciate this is kind of vague, but it happened a while ago and was a really frustrating bug to try to nail down so I'm just a little bit paranoid about it coming back. 😅

> Since systemd errors out when we try to set a parameter it does not know about, we had to introduce and use a check for the systemd version (initially for CPUQuotaPeriod, which requires systemd v242+, commit e751a16; later for AllowedCPUs and AllowedMemoryNodes, which require v244+, commit a35cad3).

Ah sorry, you're quite right. I was mixing up systemd's behaviour when dealing with actual .service files -- in a .service file systemd will ignore unknown fields, and I had assumed this was also the case for transient units created through the API. But yeah, I just tested it and you do get an error (and now that you mention it, I remember the PRs you linked).

Using a version check like you suggested is probably an okay solution.

> It seems that this warning is becoming a common case -- maybe demote it to debug?

Sure. It was mostly a warning because it should indicate that some other process is messing with our cgroup policies (and those policies will be deleted by us), but since it appears to be common under systemd I'll make it debug (sucks that INFO doesn't go to stderr, since INFO is probably a better log level to use...).

@kolyshkin (Contributor):

> sucks that INFO doesn't go to stderr

I think it does (AFAIR everything from logrus goes to os.Stderr by default), and we already use info in a few places for similar reasons:

libcontainer/cgroups/fscommon/fscommon.go:                      logrus.Infof("interrupted while writing %s to %s", data, fd.Name())
libcontainer/cgroups/systemd/v1.go:                     logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
libcontainer/cgroups/systemd/v2.go:                     logrus.Infof("freeze container before SetUnitProperties failed: %v", err)

@cyphar (Member, Author) commented Jun 8, 2021:

Ah, I was going off what you said earlier. I expected it to go to stderr, but let me double-check.

@kolyshkin (Contributor):

> Ah, I was going off what you said earlier. I expected it to go to stderr, but let me double-check.

Ah, past me strikes back 🤦🏻 I did check it this time:

[kir@kir-rhat x]$ cat a.go 
package main

import "github.com/sirupsen/logrus"

func main() {
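	// logrus writes every level to os.Stderr by default, so everything
	// below should disappear when stderr is redirected.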
	logrus.SetLevel(logrus.TraceLevel)
	logrus.Trace("trace")
	logrus.Debug("debug")
	logrus.Info("info")
	logrus.Warn("warn")
	logrus.Error("err")
	logrus.Fatal("fatal")
}
[kir@kir-rhat x]$ go run a.go
TRAC[0000] trace                                        
DEBU[0000] debug                                        
INFO[0000] info                                         
WARN[0000] warn                                         
ERRO[0000] err                                          
FATA[0000] fatal                                        
exit status 1
[kir@kir-rhat x]$ go run a.go 2>/dev/null

@kolyshkin (Contributor):

So, other than log level nuances, is this PR ready? Looks like moby/moby#42450 is still failing for some reason :(

[2021-06-07T08:25:16.409Z] === FAIL: amd64.integration.build TestBuildMultiStageLayerLeak (0.85s)

[2021-06-07T08:25:16.410Z] build_test.go:483: assertion failed: string "{"stream":"Step 1/8 : FROM busybox"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e 1c35c4412082\n"}\r\n{"stream":"Step 2/8 : WORKDIR /foo"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Running in c18eb63b01a2\n"}\r\n{"stream":"Removing intermediate container c18eb63b01a2\n"}\r\n{"stream":" ---\u003e fcf5f81addec\n"}\r\n{"stream":"Step 3/8 : COPY foo ."}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e a0224f332e5a\n"}\r\n{"aux":{"ID":"sha256:a0224f332e5a455053ab1a8f2f76ee467aa4fd88a48725b2c0873655e79322ce"}}\r\n{"stream":"Step 4/8 : FROM busybox"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e 1c35c4412082\n"}\r\n{"stream":"Step 5/8 : WORKDIR /foo"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Using cache\n"}\r\n{"stream":" ---\u003e fcf5f81addec\n"}\r\n{"stream":"Step 6/8 : COPY bar ."}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e d5137cb4aae5\n"}\r\n{"stream":"Step 7/8 : RUN [ -f bar ]"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Running in 99258d51bd4b\n"}\r\n{"stream":"Removing intermediate container 99258d51bd4b\n"}\r\n{"errorDetail":{"message":"failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): can't attach program: invalid argument: unknown"},"error":"failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): can't attach program: invalid argument: unknown"}\r\n" does not contain "Successfully built"

If not, maybe we need to do rc96 instead -- mostly due to #2997, thanks to yours truly 🤦🏻, but also as a way to do one more test before GA.

@cyphar (Member, Author) commented Jun 8, 2021:

Yeah, I'll push the log-level change. It doesn't appear to fix the Moby CI issue, but it does fix several real issues so we should probably merge it anyway and I can work on figuring out what's going wrong on Moby's CI (I expect it's a kernel version issue -- Docker works perfectly fine on my machine with the 1.0.0 GA release I prepared).

It turns out that the cilium eBPF library doesn't degrade gracefully if
BPF_F_REPLACE is not supported, so we need to work around it by treating
that case as we treat the more-than-one program case.

It also turns out that we weren't passing BPF_F_REPLACE explicitly, but
this is required by the cilium library (causing EINVALs).

Fixes: d0f2c25 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
It seems that we are triggering the mutli-attach fallback in the fedora
CI, but we don't have enough debugging information to really know what's
going on, so add some. Unfortunately the amount of information we have
available with eBPF programs in general is fairly limited (we can't get
their bytecode for instance).

We also demote the "more than one filter" warning to an info message,
because it happens very often under the systemd cgroup driver (likely
because when systemd configures the cgroup it doesn't delete our old
program, so when our apply code runs after systemd's there are two
attached programs).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
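
For reference, the demotion amounts to roughly the following (a sketch; the function wrapper is invented, but the message text is the one visible in the test logs above):

```go
package ebpf

import "github.com/sirupsen/logrus"

// Previously logrus.Warnf; demoted because this fires routinely under the
// systemd cgroup driver, where systemd's own device program is still
// attached when our apply code runs.
func logExtraFilters(count int) {
	logrus.Infof("found more than one filter (%d) attached to a cgroup -- removing extra filters!", count)
}
```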
@kolyshkin (Contributor) left a review:

LGTM

@AkihiroSuda (Member):

> Yeah, I'll push the log-level change. It doesn't appear to fix the Moby CI issue, but it does fix several real issues so we should probably merge it anyway and I can work on figuring out what's going wrong on Moby's CI (I expect it's a kernel version issue -- Docker works perfectly fine on my machine with the 1.0.0 GA release I prepared).

FYI, Moby CI uses Ubuntu 20.04.2 LTS with kernel 5.4.0-1048-aws, which might be older than you expect:

[2021-06-07T08:15:35.768Z] + docker version
[2021-06-07T08:15:45.717Z] Client: Docker Engine - Community
[2021-06-07T08:15:45.717Z]  Version:           20.10.6
[2021-06-07T08:15:45.717Z]  API version:       1.41
[2021-06-07T08:15:45.717Z]  Go version:        go1.13.15
[2021-06-07T08:15:45.717Z]  Git commit:        370c289
[2021-06-07T08:15:45.717Z]  Built:             Fri Apr  9 22:47:17 2021
[2021-06-07T08:15:45.717Z]  OS/Arch:           linux/amd64
[2021-06-07T08:15:45.717Z]  Context:           default
[2021-06-07T08:15:45.717Z]  Experimental:      true
[2021-06-07T08:15:45.717Z] 
[2021-06-07T08:15:45.717Z] Server: Docker Engine - Community
[2021-06-07T08:15:45.717Z]  Engine:
[2021-06-07T08:15:45.717Z]   Version:          20.10.6
[2021-06-07T08:15:45.717Z]   API version:      1.41 (minimum version 1.12)
[2021-06-07T08:15:45.717Z]   Go version:       go1.13.15
[2021-06-07T08:15:45.717Z]   Git commit:       8728dd2
[2021-06-07T08:15:45.717Z]   Built:            Fri Apr  9 22:45:28 2021
[2021-06-07T08:15:45.717Z]   OS/Arch:          linux/amd64
[2021-06-07T08:15:45.717Z]   Experimental:     true
[2021-06-07T08:15:45.717Z]  containerd:
[2021-06-07T08:15:45.717Z]   Version:          1.4.6
[2021-06-07T08:15:45.717Z]   GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
[2021-06-07T08:15:45.717Z]  runc:
[2021-06-07T08:15:45.717Z]   Version:          1.0.0-rc95
[2021-06-07T08:15:45.717Z]   GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
[2021-06-07T08:15:45.717Z]  docker-init:
[2021-06-07T08:15:45.717Z]   Version:          0.19.0
[2021-06-07T08:15:45.717Z]   GitCommit:        de40ad0
[2021-06-07T08:15:46.019Z] + docker info
[2021-06-07T08:16:00.861Z] Client:
[2021-06-07T08:16:00.861Z]  Context:    default
[2021-06-07T08:16:00.861Z]  Debug Mode: false
[2021-06-07T08:16:00.861Z]  Plugins:
[2021-06-07T08:16:00.861Z]   app: Docker App (Docker Inc., v0.9.1-beta3)
[2021-06-07T08:16:00.861Z]   buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
[2021-06-07T08:16:00.861Z]   scan: Docker Scan (Docker Inc., v0.7.0)
[2021-06-07T08:16:00.861Z] 
[2021-06-07T08:16:00.861Z] Server:
[2021-06-07T08:16:00.861Z]  Containers: 0
[2021-06-07T08:16:00.861Z]   Running: 0
[2021-06-07T08:16:00.861Z]   Paused: 0
[2021-06-07T08:16:00.861Z]   Stopped: 0
[2021-06-07T08:16:00.861Z]  Images: 0
[2021-06-07T08:16:00.861Z]  Server Version: 20.10.6
[2021-06-07T08:16:00.861Z]  Storage Driver: overlay2
[2021-06-07T08:16:00.861Z]   Backing Filesystem: extfs
[2021-06-07T08:16:00.861Z]   Supports d_type: true
[2021-06-07T08:16:00.861Z]   Native Overlay Diff: true
[2021-06-07T08:16:00.861Z]   userxattr: false
[2021-06-07T08:16:00.861Z]  Logging Driver: json-file
[2021-06-07T08:16:00.861Z]  Cgroup Driver: systemd
[2021-06-07T08:16:00.861Z]  Cgroup Version: 2
[2021-06-07T08:16:00.861Z]  Plugins:
[2021-06-07T08:16:00.861Z]   Volume: local
[2021-06-07T08:16:00.861Z]   Network: bridge host ipvlan macvlan null overlay
[2021-06-07T08:16:00.861Z]   Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
[2021-06-07T08:16:00.861Z]  Swarm: inactive
[2021-06-07T08:16:00.861Z]  Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
[2021-06-07T08:16:00.861Z]  Default Runtime: runc
[2021-06-07T08:16:00.861Z]  Init Binary: docker-init
[2021-06-07T08:16:00.861Z]  containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
[2021-06-07T08:16:00.861Z]  runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
[2021-06-07T08:16:00.861Z]  init version: de40ad0
[2021-06-07T08:16:00.861Z]  Security Options:
[2021-06-07T08:16:00.861Z]   apparmor
[2021-06-07T08:16:00.861Z]   seccomp
[2021-06-07T08:16:00.861Z]    Profile: default
[2021-06-07T08:16:00.861Z]   cgroupns
[2021-06-07T08:16:00.861Z]  Kernel Version: 5.4.0-1048-aws
[2021-06-07T08:16:00.861Z]  Operating System: Ubuntu 20.04.2 LTS
[2021-06-07T08:16:00.861Z]  OSType: linux
[2021-06-07T08:16:00.861Z]  Architecture: x86_64
[2021-06-07T08:16:00.861Z]  CPUs: 2
[2021-06-07T08:16:00.861Z]  Total Memory: 7.569GiB
[2021-06-07T08:16:00.861Z]  Name: ip-10-100-88-107
[2021-06-07T08:16:00.861Z]  ID: BPI2:DJ3J:2BX4:S6D2:XUGC:3EVF:YTII:HJPT:TJU6:HPWF:P5M7:CUDP
[2021-06-07T08:16:00.861Z]  Docker Root Dir: /var/lib/docker
[2021-06-07T08:16:00.861Z]  Debug Mode: false
[2021-06-07T08:16:00.861Z]  Registry: https://index.docker.io/v1/
[2021-06-07T08:16:00.861Z]  Labels:
[2021-06-07T08:16:00.861Z]  Experimental: true
[2021-06-07T08:16:00.861Z]  Insecure Registries:
[2021-06-07T08:16:00.861Z]   127.0.0.0/8
[2021-06-07T08:16:00.861Z]  Live Restore Enabled: true
[2021-06-07T08:16:00.861Z] 
[2021-06-07T08:16:00.861Z] WARNING: No kernel memory limit support
[2021-06-07T08:16:00.861Z] WARNING: No oom kill disable support

@AkihiroSuda (Member):

Opened a new issue for tracking: #3008

@cyphar (Member, Author) commented Jun 8, 2021:

@AkihiroSuda Yeah I looked at that a few days ago, and I've been trying to get LXD VMs to work again on openSUSE so I can try to reproduce it locally...
