cgroupv2: ebpf: check for BPF_F_REPLACE support and degrade gracefully #2986

Merged 2 commits, Jun 8, 2021
Conversation

@cyphar (Member) commented Jun 2, 2021:

It turns out that the cilium eBPF library doesn't degrade gracefully if
BPF_F_REPLACE is not supported, so we need to work around it by treating
that case as we treat the more-than-one program case.

Fixes: d0f2c25 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
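
A minimal sketch of the degradation path being described. attachProg and detachProg are hypothetical helpers standing in for the real attach/detach plumbing in libcontainer/cgroups/ebpf; only the flag handling and the fallback order reflect the change:

```go
package ebpf

import (
	"errors"

	"golang.org/x/sys/unix"
)

// attachProg attaches newProg to the cgroup dirFD with the given flags; when
// unix.BPF_F_REPLACE is set, oldProg is the program to replace atomically.
// (hypothetical helper)
func attachProg(dirFD int, newProg, oldProg interface{}, flags uint32) error {
	panic("sketch only")
}

// detachProg detaches oldProg from the cgroup. (hypothetical helper)
func detachProg(dirFD int, oldProg interface{}) error {
	panic("sketch only")
}

// replaceFilter swaps oldProg for newProg, degrading gracefully on kernels
// that do not support BPF_F_REPLACE.
func replaceFilter(dirFD int, oldProg, newProg interface{}) error {
	// Fast path: atomic replacement. BPF_F_REPLACE has to be passed
	// explicitly -- the cilium library does not add it for us, and omitting
	// it makes the kernel reject the replacement with EINVAL.
	err := attachProg(dirFD, newProg, oldProg, unix.BPF_F_ALLOW_MULTI|unix.BPF_F_REPLACE)
	if err == nil || !errors.Is(err, unix.EINVAL) {
		return err
	}
	// Older kernels return EINVAL for BPF_F_REPLACE. Degrade gracefully by
	// treating this like the more-than-one-program case: attach the new
	// program first, then remove the old one.
	if err := attachProg(dirFD, newProg, nil, unix.BPF_F_ALLOW_MULTI); err != nil {
		return err
	}
	return detachProg(dirFD, oldProg)
}
```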

@cyphar cyphar added this to the 1.0.0 milestone Jun 2, 2021
@cyphar (Member, Author) commented Jun 2, 2021:

@AkihiroSuda Is it possible to do a test moby CI run with this PR to make sure I'm actually fixing the original issue? Thanks.

@AkihiroSuda (Member):

@cyphar Yes, being tested in moby/moby#42450

@AkihiroSuda (Member):

Still failing 😢
moby/moby#42450 (comment)

@cyphar (Member, Author) commented Jun 2, 2021:

Oh, my bad -- the cilium library doesn't automatically pass BPF_F_REPLACE. Oops. It's a bit frustrating that we don't have unit tests for this BPF_F_REPLACE behaviour, but it seems that runc update doesn't trigger it? (Forgetting BPF_F_REPLACE means this was always broken, AFAICS?)

@kolyshkin (Contributor):

I tried using the fix from this PR together with the test case in #3000 (on my Fedora 34 with the kernel 5.12.8-300.fc34.x86_64).

It fixes the "found more than one filter (2) attached to a cgroup" warning for the non-systemd case (i.e. when using fs2 cgroup driver), but for systemd it's still there (this is with some additional debug for the test case itself):

=== RUN   TestUpdateDevicesSystemd
    update_test.go:76: [0] allowed: true output: sh: write error: No space left on device
WARN[0004] found more than one filter (2) attached to a cgroup -- removing extra filters! 
INFO[0004] removing old filter 0 from cgroup             id=14155 name= run_count=0 runtime=0s tag=531db05b114e9af3 type=CGroupDevice
INFO[0004] removing old filter 1 from cgroup             id=14156 name= run_count=0 runtime=0s tag=a04f5eef06a7f555 type=CGroupDevice
    update_test.go:76: [1] allowed: false output: /bin/sh: can't create /dev/full: Operation not permitted
        cat: can't open '/dev/null': Operation not permitted
WARN[0004] found more than one filter (2) attached to a cgroup -- removing extra filters! 
INFO[0004] removing old filter 0 from cgroup             id=14157 name= run_count=0 runtime=0s tag=fb6cb1c301453333 type=CGroupDevice
INFO[0004] removing old filter 1 from cgroup             id=14158 name= run_count=0 runtime=0s tag=3b0b81b071f088cd type=CGroupDevice

I have not dug into it, but I think this happens because we let systemd apply the device configuration and then apply it again using the fs2 driver, since (*systemd.unifiedManager).Set() calls (*fs2.manager).Set() -- see the sketch below.

This fs[2].Set() call always was, and still is, a problem -- not just for devices but in general: we let systemd set everything, then re-apply it all using the fs[2] driver, which is against what the systemd docs tell us to do.
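
A rough sketch of the flow being described (the method names come from the comment above; the stub types and bodies are assumptions, not runc's actual code):

```go
package systemd

// Hypothetical stand-ins for runc's real types, just so the sketch compiles.
type resources struct{}

type fs2Manager struct{}

func (m *fs2Manager) Set(r *resources) error { return nil } // writes cgroupfs directly

type unifiedManager struct {
	fsMgr *fs2Manager
}

func (m *unifiedManager) setUnitProperties(r *resources) error { return nil } // via D-Bus

// Set applies the same configuration twice: once through systemd, then again
// through the fs2 driver -- which, for devices, attaches a second eBPF
// program on top of the one systemd has already installed.
func (m *unifiedManager) Set(r *resources) error {
	if err := m.setUnitProperties(r); err != nil {
		return err
	}
	return m.fsMgr.Set(r)
}
```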

OTOH, removing this entirely can be problematic, as not all resources can be set using systemd.

Ultimately we should solve this by using the fs drivers only for those parameters that systemd can't set -- something I was thinking about but never got around to implementing.

@cyphar (Member, Author) commented Jun 6, 2021:

Yeah, the reason we set it with both was:

  1. You need to tell systemd the settings it can set, because it will happily reset them behind your back if you don't tell it what the correct settings are.
  2. Systemd doesn't support setting all cgroup knobs.
  3. Older versions of systemd don't support all the knobs we might set, and systemd doesn't tell us whether a knob we set was supported or not.

So while fixing this might seem like a good idea given (1) and (2), (3) makes this quite a bit more difficult and a bit worrying -- if tomorrow systemd added support for ControllerFoo, and we didn't set ControllerFoo in the fs controller if running under systemd, neither runc nor systemd would set up ControllerFoo on older systemd versions. You could probably fix this with version or feature detection, though I'm not sure how expensive that would be.

However with regards to this particular issue -- I guess this means systemd doesn't remove existing device policies when it adds its own (if it did then we would just BPF_F_REPLACE the systemd policy)? I guess that technically isn't a bad way of doing it, but it goes against what I would expect systemd to do (set everything to be in the state it thinks it should be in).

@kolyshkin (Contributor) commented Jun 6, 2021:

> Yeah, the reason we set it with both was:
>
>   1. You need to tell systemd the settings it can set, because it will happily reset them behind your back if you don't tell it what the correct settings are.

Can you elaborate?

>   2. Systemd doesn't support setting all cgroup knobs.
>   3. Older versions of systemd don't support all the knobs we might set, and systemd doesn't tell us whether a knob we set was supported or not.

Since systemd errors out when we try to set a parameter it does not know about, we had to introduce and use a check for the systemd version (initially for CPUQuotaPeriod, which requires systemd v242+, commit e751a16; later for AllowedCPUs and AllowedMemoryNodes, which require v244+, commit a35cad3).

So we already know which parameters we do and do not set using systemd (at least in theory). Surely this could be coded much better; currently it looks like a band-aid (which I guess is OK since it is only applied in two places). A rough sketch of the idea is below.
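
A sketch of what that version gating looks like (the helper and property names are assumptions based on the commits cited above, not runc's exact code):

```go
package systemd

// systemdVersion returns the version of the running systemd daemon.
// (hypothetical helper; the real check reads systemd's version over D-Bus)
func systemdVersion() int { return 0 }

type property struct {
	Name  string
	Value interface{}
}

// versionedProperties appends only the properties that the running systemd
// understands, since systemd errors out on unknown transient-unit properties.
func versionedProperties(props []property, quotaPeriodUSec uint64, allowedCPUs string) []property {
	v := systemdVersion()
	if quotaPeriodUSec != 0 && v >= 242 { // CPUQuotaPeriod needs systemd v242+
		props = append(props, property{"CPUQuotaPeriodUSec", quotaPeriodUSec})
	}
	if allowedCPUs != "" && v >= 244 { // AllowedCPUs needs systemd v244+
		props = append(props, property{"AllowedCPUs", allowedCPUs})
	}
	return props
}
```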

> However with regards to this particular issue -- I guess this means systemd doesn't remove existing device policies when it adds its own (if it did then we would just BPF_F_REPLACE the systemd policy)? I guess that technically isn't a bad way of doing it, but it goes against what I would expect systemd to do (set everything to be in the state it thinks it should be in).

It seems that this warning is becoming a common case -- maybe demote it to debug?

@cyphar (Member, Author) commented Jun 7, 2021:

@kolyshkin

> Can you elaborate?

Sure. Systemd has (at least in the past; I don't know if this is still the case today) had a habit of writing to the cgroup files of running services, with the settings it thinks the service should have -- I remember there was a systemd service on SLES which would trigger this every once in a while, but we couldn't really pin down the original cause. It's possible this was a bug of some kind, but we never managed to figure out what was causing it. I also don't know how much this is or is not true with Delegate=yes, but given the recent experience with DevicesAllow=, I think it's fair to say this is still a realistic issue.

I appreciate this is kind of vague, but it happened a while ago and was a really frustrating bug to try to nail down so I'm just a little bit paranoid about it coming back. 😅

> Since systemd errors out when we try to set a parameter it does not know about, we had to introduce and use a check for the systemd version (initially for CPUQuotaPeriod, which requires systemd v242+, commit e751a16; later for AllowedCPUs and AllowedMemoryNodes, which require v244+, commit a35cad3).

Ah sorry, you're quite right. I was mixing up systemd's behaviour when dealing with actual .service files -- in a .service file systemd will ignore unknown fields, and I had assumed this was also the case for transient units created through the API. But yeah, I just tested it and you do get an error (and now that you mention it, I remember the PRs you linked).

Using a version check like you suggested is probably an okay solution.

> It seems that this warning is becoming a common case -- maybe demote it to debug?

Sure. It was mostly a warning because it should indicate that some other process is messing with our cgroup policies (and those policies will be deleted by us), but since it appears to be common under systemd I'll make it debug (sucks that INFO doesn't go to stderr, since INFO is probably a better log level to use...).

@kolyshkin (Contributor):

> sucks that INFO doesn't go to stderr

I think it does (AFAIR everything from logrus goes to os.Stderr by default), and we already use info in a few places for similar reasons:

libcontainer/cgroups/fscommon/fscommon.go:                      logrus.Infof("interrupted while writing %s to %s", data, fd.Name())
libcontainer/cgroups/systemd/v1.go:                     logrus.Infof("freeze container before SetUnitProperties failed: %v", err)
libcontainer/cgroups/systemd/v2.go:                     logrus.Infof("freeze container before SetUnitProperties failed: %v", err)

@cyphar (Member, Author) commented Jun 8, 2021:

Ah, I was going off what you said earlier. I expected it to go to stderr, but let me double-check.

@kolyshkin (Contributor):

> Ah, I was going off what you said earlier. I expected it to go to stderr, but let me double-check.

Ah, past me strikes back 🤦🏻 I did check it this time:

[kir@kir-rhat x]$ cat a.go 
package main

import "github.com/sirupsen/logrus"

func main() {
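	// logrus writes every level to os.Stderr by default, so everything
	// below should disappear when stderr is redirected.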
	logrus.SetLevel(logrus.TraceLevel)
	logrus.Trace("trace")
	logrus.Debug("debug")
	logrus.Info("info")
	logrus.Warn("warn")
	logrus.Error("err")
	logrus.Fatal("fatal")
}
[kir@kir-rhat x]$ go run a.go
TRAC[0000] trace                                        
DEBU[0000] debug                                        
INFO[0000] info                                         
WARN[0000] warn                                         
ERRO[0000] err                                          
FATA[0000] fatal                                        
exit status 1
[kir@kir-rhat x]$ go run a.go 2>/dev/null

@kolyshkin (Contributor):

So, other than log level nuances, is this PR ready? Looks like moby/moby#42450 is still failing for some reason :(

[2021-06-07T08:25:16.409Z] === FAIL: amd64.integration.build TestBuildMultiStageLayerLeak (0.85s)

[2021-06-07T08:25:16.410Z] build_test.go:483: assertion failed: string "{"stream":"Step 1/8 : FROM busybox"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e 1c35c4412082\n"}\r\n{"stream":"Step 2/8 : WORKDIR /foo"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Running in c18eb63b01a2\n"}\r\n{"stream":"Removing intermediate container c18eb63b01a2\n"}\r\n{"stream":" ---\u003e fcf5f81addec\n"}\r\n{"stream":"Step 3/8 : COPY foo ."}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e a0224f332e5a\n"}\r\n{"aux":{"ID":"sha256:a0224f332e5a455053ab1a8f2f76ee467aa4fd88a48725b2c0873655e79322ce"}}\r\n{"stream":"Step 4/8 : FROM busybox"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e 1c35c4412082\n"}\r\n{"stream":"Step 5/8 : WORKDIR /foo"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Using cache\n"}\r\n{"stream":" ---\u003e fcf5f81addec\n"}\r\n{"stream":"Step 6/8 : COPY bar ."}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e d5137cb4aae5\n"}\r\n{"stream":"Step 7/8 : RUN [ -f bar ]"}\r\n{"stream":"\n"}\r\n{"stream":" ---\u003e Running in 99258d51bd4b\n"}\r\n{"stream":"Removing intermediate container 99258d51bd4b\n"}\r\n{"errorDetail":{"message":"failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): can't attach program: invalid argument: unknown"},"error":"failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): can't attach program: invalid argument: unknown"}\r\n" does not contain "Successfully built"

If not, maybe we need to do rc96 instead -- mostly due to #2997, thanks to yours truly 🤦🏻, but also as a way to do one more test before GA.

@cyphar (Member, Author) commented Jun 8, 2021:

Yeah, I'll push the log-level change. It doesn't appear to fix the Moby CI issue, but it does fix several real issues so we should probably merge it anyway and I can work on figuring out what's going wrong on Moby's CI (I expect it's a kernel version issue -- Docker works perfectly fine on my machine with the 1.0.0 GA release I prepared).

It turns out that the cilium eBPF library doesn't degrade gracefully if
BPF_F_REPLACE is not supported, so we need to work around it by treating
that case as we treat the more-than-one program case.

It also turns out that we weren't passing BPF_F_REPLACE explicitly, but
this is required by the cilium library (causing EINVALs).

Fixes: d0f2c25 ("cgroup2: devices: replace all existing filters when attaching")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
It seems that we are triggering the mutli-attach fallback in the fedora
CI, but we don't have enough debugging information to really know what's
going on, so add some. Unfortunately the amount of information we have
available with eBPF programs in general is fairly limited (we can't get
their bytecode for instance).

We also demote the "more than one filter" warning to an info message,
because it happens very often under the systemd cgroup driver (likely
because when systemd configures the cgroup it doesn't delete our old
program, so when our apply code runs after systemd's there are two
attached programs).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
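
For reference, the demotion amounts to roughly the following (a sketch; the function wrapper is invented, but the message text is the one visible in the test logs above):

```go
package ebpf

import "github.com/sirupsen/logrus"

// Previously logrus.Warnf; demoted because this fires routinely under the
// systemd cgroup driver, where systemd's own device program is still
// attached when our apply code runs.
func logExtraFilters(count int) {
	logrus.Infof("found more than one filter (%d) attached to a cgroup -- removing extra filters!", count)
}
```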
@kolyshkin (Contributor) left a review:

LGTM

@AkihiroSuda (Member):

> Yeah, I'll push the log-level change. It doesn't appear to fix the Moby CI issue, but it does fix several real issues so we should probably merge it anyway and I can work on figuring out what's going wrong on Moby's CI (I expect it's a kernel version issue -- Docker works perfectly fine on my machine with the 1.0.0 GA release I prepared).

FYI, Moby CI uses Ubuntu 20.04.2 LTS with kernel 5.4.0-1048-aws, which might be older than you expect:

[2021-06-07T08:15:35.768Z] + docker version
[2021-06-07T08:15:45.717Z] Client: Docker Engine - Community
[2021-06-07T08:15:45.717Z]  Version:           20.10.6
[2021-06-07T08:15:45.717Z]  API version:       1.41
[2021-06-07T08:15:45.717Z]  Go version:        go1.13.15
[2021-06-07T08:15:45.717Z]  Git commit:        370c289
[2021-06-07T08:15:45.717Z]  Built:             Fri Apr  9 22:47:17 2021
[2021-06-07T08:15:45.717Z]  OS/Arch:           linux/amd64
[2021-06-07T08:15:45.717Z]  Context:           default
[2021-06-07T08:15:45.717Z]  Experimental:      true
[2021-06-07T08:15:45.717Z] 
[2021-06-07T08:15:45.717Z] Server: Docker Engine - Community
[2021-06-07T08:15:45.717Z]  Engine:
[2021-06-07T08:15:45.717Z]   Version:          20.10.6
[2021-06-07T08:15:45.717Z]   API version:      1.41 (minimum version 1.12)
[2021-06-07T08:15:45.717Z]   Go version:       go1.13.15
[2021-06-07T08:15:45.717Z]   Git commit:       8728dd2
[2021-06-07T08:15:45.717Z]   Built:            Fri Apr  9 22:45:28 2021
[2021-06-07T08:15:45.717Z]   OS/Arch:          linux/amd64
[2021-06-07T08:15:45.717Z]   Experimental:     true
[2021-06-07T08:15:45.717Z]  containerd:
[2021-06-07T08:15:45.717Z]   Version:          1.4.6
[2021-06-07T08:15:45.717Z]   GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
[2021-06-07T08:15:45.717Z]  runc:
[2021-06-07T08:15:45.717Z]   Version:          1.0.0-rc95
[2021-06-07T08:15:45.717Z]   GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
[2021-06-07T08:15:45.717Z]  docker-init:
[2021-06-07T08:15:45.717Z]   Version:          0.19.0
[2021-06-07T08:15:45.717Z]   GitCommit:        de40ad0
[2021-06-07T08:15:46.019Z] + docker info
[2021-06-07T08:16:00.861Z] Client:
[2021-06-07T08:16:00.861Z]  Context:    default
[2021-06-07T08:16:00.861Z]  Debug Mode: false
[2021-06-07T08:16:00.861Z]  Plugins:
[2021-06-07T08:16:00.861Z]   app: Docker App (Docker Inc., v0.9.1-beta3)
[2021-06-07T08:16:00.861Z]   buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
[2021-06-07T08:16:00.861Z]   scan: Docker Scan (Docker Inc., v0.7.0)
[2021-06-07T08:16:00.861Z] 
[2021-06-07T08:16:00.861Z] Server:
[2021-06-07T08:16:00.861Z]  Containers: 0
[2021-06-07T08:16:00.861Z]   Running: 0
[2021-06-07T08:16:00.861Z]   Paused: 0
[2021-06-07T08:16:00.861Z]   Stopped: 0
[2021-06-07T08:16:00.861Z]  Images: 0
[2021-06-07T08:16:00.861Z]  Server Version: 20.10.6
[2021-06-07T08:16:00.861Z]  Storage Driver: overlay2
[2021-06-07T08:16:00.861Z]   Backing Filesystem: extfs
[2021-06-07T08:16:00.861Z]   Supports d_type: true
[2021-06-07T08:16:00.861Z]   Native Overlay Diff: true
[2021-06-07T08:16:00.861Z]   userxattr: false
[2021-06-07T08:16:00.861Z]  Logging Driver: json-file
[2021-06-07T08:16:00.861Z]  Cgroup Driver: systemd
[2021-06-07T08:16:00.861Z]  Cgroup Version: 2
[2021-06-07T08:16:00.861Z]  Plugins:
[2021-06-07T08:16:00.861Z]   Volume: local
[2021-06-07T08:16:00.861Z]   Network: bridge host ipvlan macvlan null overlay
[2021-06-07T08:16:00.861Z]   Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
[2021-06-07T08:16:00.861Z]  Swarm: inactive
[2021-06-07T08:16:00.861Z]  Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
[2021-06-07T08:16:00.861Z]  Default Runtime: runc
[2021-06-07T08:16:00.861Z]  Init Binary: docker-init
[2021-06-07T08:16:00.861Z]  containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
[2021-06-07T08:16:00.861Z]  runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
[2021-06-07T08:16:00.861Z]  init version: de40ad0
[2021-06-07T08:16:00.861Z]  Security Options:
[2021-06-07T08:16:00.861Z]   apparmor
[2021-06-07T08:16:00.861Z]   seccomp
[2021-06-07T08:16:00.861Z]    Profile: default
[2021-06-07T08:16:00.861Z]   cgroupns
[2021-06-07T08:16:00.861Z]  Kernel Version: 5.4.0-1048-aws
[2021-06-07T08:16:00.861Z]  Operating System: Ubuntu 20.04.2 LTS
[2021-06-07T08:16:00.861Z]  OSType: linux
[2021-06-07T08:16:00.861Z]  Architecture: x86_64
[2021-06-07T08:16:00.861Z]  CPUs: 2
[2021-06-07T08:16:00.861Z]  Total Memory: 7.569GiB
[2021-06-07T08:16:00.861Z]  Name: ip-10-100-88-107
[2021-06-07T08:16:00.861Z]  ID: BPI2:DJ3J:2BX4:S6D2:XUGC:3EVF:YTII:HJPT:TJU6:HPWF:P5M7:CUDP
[2021-06-07T08:16:00.861Z]  Docker Root Dir: /var/lib/docker
[2021-06-07T08:16:00.861Z]  Debug Mode: false
[2021-06-07T08:16:00.861Z]  Registry: https://index.docker.io/v1/
[2021-06-07T08:16:00.861Z]  Labels:
[2021-06-07T08:16:00.861Z]  Experimental: true
[2021-06-07T08:16:00.861Z]  Insecure Registries:
[2021-06-07T08:16:00.861Z]   127.0.0.0/8
[2021-06-07T08:16:00.861Z]  Live Restore Enabled: true
[2021-06-07T08:16:00.861Z] 
[2021-06-07T08:16:00.861Z] WARNING: No kernel memory limit support
[2021-06-07T08:16:00.861Z] WARNING: No oom kill disable support

@AkihiroSuda (Member):

Opened a new issue for tracking: #3008

@cyphar (Member, Author) commented Jun 8, 2021:

@AkihiroSuda Yeah I looked at that a few days ago, and I've been trying to get LXD VMs to work again on openSUSE so I can try to reproduce it locally...
