Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reducing CPU period fails for subsystems if existing parent has quota>0 with systemd driver #3084

Closed
hk-vmg opened this issue Jul 9, 2021 · 6 comments · Fixed by #3090
Closed

Comments

@hk-vmg
Copy link

hk-vmg commented Jul 9, 2021

When setting a lower CFS CPU period, creating a runc container in Kubernetes using systemd driver fails with:
write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<HASH>.slice/cri-containerd-<HASH>.scope/cpu.cfs_period_us: invalid argument: unknown

cgroupfs driver works fine.

It appears that order matters, when both period and quota files are changed. I believe the root cause is that the one-at-a-time updates to the child files cause the parent limits to be exceeded. This is seen in Kubernetes - if all containers in a pod have limits, then the parent "pod" slice will set a quota and the child "container" settings can't be updated.

https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/fs/cpu.go#L74

Tested with latest master branch runc.

systemd is version 219 but I believe I have all the required OS/systemd patches:

$ rpm -q systemd --changelog | grep -i cpu
core: add CPUQuotaPeriodSec= (#1770379)
core: downgrade CPUQuotaPeriodSec= clamping logs to debug (#1770379) 

@sureshvis

@kolyshkin
Copy link
Contributor

  1. The error actually comes from the kernel.
  2. systemd driver sets the property/limit for systemd unit (those that are possible to represent for systemd, see https://github.com/opencontainers/runc/blob/master/docs/systemd.md#resource-limits), and then uses fs driver as a backup.
  3. The translation of OCI runtime spec properties to systemd properties for CPU quota and period happens here:
    func addCpuQuota(cm *dbusConnManager, properties *[]systemdDbus.Property, quota int64, period uint64) {
    if period != 0 {
    // systemd only supports CPUQuotaPeriodUSec since v242
    sdVer := systemdVersion(cm)
    if sdVer >= 242 {
    *properties = append(*properties,
    newProp("CPUQuotaPeriodUSec", period))
    } else {
    logrus.Debugf("systemd v%d is too old to support CPUQuotaPeriodSec "+
    " (setting will still be applied to cgroupfs)", sdVer)
    }
    }
    if quota != 0 || period != 0 {
    // corresponds to USEC_INFINITY in systemd
    cpuQuotaPerSecUSec := uint64(math.MaxUint64)
    if quota > 0 {
    if period == 0 {
    // assume the default
    period = defCPUQuotaPeriod
    }
    // systemd converts CPUQuotaPerSecUSec (microseconds per CPU second) to CPUQuota
    // (integer percentage of CPU) internally. This means that if a fractional percent of
    // CPU is indicated by Resources.CpuQuota, we need to round up to the nearest
    // 10ms (1% of a second) such that child cgroups can set the cpu.cfs_quota_us they expect.
    cpuQuotaPerSecUSec = uint64(quota*1000000) / period
    if cpuQuotaPerSecUSec%10000 != 0 {
    cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000
    }
    }
    *properties = append(*properties,
    newProp("CPUQuotaPerSecUSec", cpuQuotaPerSecUSec))
    }
    }

Note that systemd >= 242 is required to set systemd CPUQuotaPeriod, meaning that in your case runc only sets the quota and do not set the period for systemd. This is probably why setting the period fails later.

I see a few ways to fix this:

  1. For systemd, if the period is not default, either set the quota and period together, or do not set at all (it will be set anyway by the fs driver). This should be an easy fix.
  2. Instead of relying on systemd version to figure out whether to set CPUQuotaPeriod, query the current value -- if parameter exists, we can set it. This will slow things down a bit.
  3. In fs driver, if the setting is failed with EINVAL (like in your case), retry one more time after setting the other parameter.

@kolyshkin
Copy link
Contributor

systemd is version 219

Do you mean 239?

kolyshkin added a commit to kolyshkin/runc that referenced this issue Jul 15, 2021
Testing that it fails without the fix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor

I think systemd is relevant but for the different reasons than those in items 1 and 2 in my earlier comment. It seems if systemd driver is used, systemd sets the period to 1000000 (i.e. 10x of the default period of 100000), and this may play a role in the failure.

In any case, items 1 and 2 are not entirely correct as with systemd we do not set "quota" and "period" separately, but a combined value CPUQuotaPerSec (which results in setting both quota/period), and so setting or not setting CPUQuotaPeriod does not really matter.

The item 3 gives a good description of what is happening. Fix is on the way (took me a long time to code the test case).

kolyshkin added a commit to kolyshkin/runc that referenced this issue Jul 15, 2021
Testing that it fails without the fix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@hk-vmg
Copy link
Author

hk-vmg commented Jul 15, 2021

Thanks so much! This is RedHat Enterprise Linux, which means systemd is a lower version but contains a lot of cherry-picked fixes backported into the lower version - I see same behavior in RHEL7 = systemd 219 and RHEL8 = systemd 239.

Maybe that is root cause: if sdVer >= 242 { from code snippet above is false, so it's falling back to the cgroup behavior. If I update the libcontainer/cgroups/fs/cpu.go#L74 logic to operate in a different order based on whether the period is going up or down, this WFM with systemd driver. It sounds like your retry fix will resolve that.

It doesn't look like there's an easy way to detect whether CPUQuotaPerSec support is available in some version-independent way, unfortunately. I suppose an "isRhel" check would work but maybe a bit hacky.

$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager Version
s "239 (239-XX.el8_X.X)"

kolyshkin added a commit to kolyshkin/runc that referenced this issue Jul 15, 2021
Testing that it fails without the fix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Jul 15, 2021
Testing that it fails without the fix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor

In fact it's not absolutely required to set a period via systemd, as we set if via fallback fs driver anyway, so I am dropping the idea of using more sophisticated methods of figuring out whether systemd is supporting properties that we want to set.

I have the fix almost ready, the only problem is the test case, in particular #3090 (comment). Perhaps you can help with that @hk-vmg ?

@hk-vmg
Copy link
Author

hk-vmg commented Jul 21, 2021

Sorry for delay - thank you so much for the progress. I'll see if I can add anything, but I see you've likely got most/all of the way there.

kolyshkin added a commit to kolyshkin/crun that referenced this issue Apr 6, 2023
Sometimes setting CPU quota period fails when a new period is lower,
and a parent cgroup has CPU quota limit set.

This happens as in cgroup v1 the quota and the period can not be set
together (this is fixed in v2), and since the period is being set first,
new_limit = old_quota/new_period may be higher than the parent cgroup
limit.

The fix is to retry setting the period after the quota, to cover all
possible scenarios.

Tested via runc integration tests. Before the commit, it fails:

	root@ubu2004:~/git/runc# RUNC=`pwd`/../crun/crun.before bats -f "pod cgroup" -t  tests/integration/update.bats
	1..1
	not ok 1 update cpu period in a pod cgroup with pod limit set
	# (in test file tests/integration/update.bats, line 424)
	#   `[ "$status" -eq 0 ]' failed
	# crun.before spec (status=0):
	#
	# crun.before run -d --console-socket /tmp/bats-run-30428-dYkMDC/runc.4FdCtn/tty/sock test_update (status=0):
	#
	# crun.before update --cpu-quota 600000 test_update (status=1):
	# writing file `cpu.cfs_quota_us`: Invalid argument
	# crun.before update --cpu-period 10000 --cpu-quota 3000 test_update (status=1):
	# writing file `cpu.cfs_period_us`: Invalid argument

With the fix, the test passes.

Originally reported for runc in
opencontainers/runc#3084

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/crun that referenced this issue Apr 6, 2023
Sometimes setting CPU quota period fails when a new period is lower,
and a parent cgroup has CPU quota limit set.

This happens as in cgroup v1 the quota and the period can not be set
together (this is fixed in v2), and since the period is being set first,
new_limit = old_quota/new_period may be higher than the parent cgroup
limit.

The fix is to retry setting the period after the quota, to cover all
possible scenarios.

Tested via runc integration tests. Before the commit, it fails:

	root@ubu2004:~/git/runc# RUNC=`pwd`/../crun/crun.before bats -f "pod cgroup" -t  tests/integration/update.bats
	1..1
	not ok 1 update cpu period in a pod cgroup with pod limit set
	# (in test file tests/integration/update.bats, line 424)
	#   `[ "$status" -eq 0 ]' failed
	# crun.before spec (status=0):
	#
	# crun.before run -d --console-socket /tmp/bats-run-30428-dYkMDC/runc.4FdCtn/tty/sock test_update (status=0):
	#
	# crun.before update --cpu-quota 600000 test_update (status=1):
	# writing file `cpu.cfs_quota_us`: Invalid argument
	# crun.before update --cpu-period 10000 --cpu-quota 3000 test_update (status=1):
	# writing file `cpu.cfs_period_us`: Invalid argument

With the fix, the test passes.

Originally reported for runc in
opencontainers/runc#3084

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants