[KEP-5246] Migrate to systemd's cgroup v1 CPU shares to v2 CPU weight formula #5247
Conversation
Signed-off-by: Itamar Holder <iholder@redhat.com>
force-pushed from 9285303 to 12f5084, then from 12f5084 to 8020ae0
FYI @vladikr

/cc @giuseppe
giuseppe
left a comment
LGTM
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: giuseppe, iholder101. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
> - A significant amount of the work would need to land in other layers, mainly OCI runtimes and the CRI.
> - We'll probably need a CRI configuration to ensure coordination between the CRI and the OCI runtime implementations, and to ensure it lands in the same version, as suggested
Should we expand on this in the design section, to spell out what needs to be changed and how the rollout will happen?
Hey @yujuhong!
What I have in mind is (see the sketch after this list):
- Change the implementation in the OCI runtimes.
- Expose a configuration that the CRI can use.
- Ensure the CRI supports the new OCI configuration.
- Test k8s with the new CRI configuration.
- In the same k8s release:
  - Change the k8s code related to this formula, mainly pod-resources resize, etc.
  - Ensure CRIs enable the OCI config.
Does that sound right? WDYT?
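For illustration only, here is a minimal Go sketch of how that coordination might look from the kubelet's side, assuming a hypothetical CRI status field; none of these identifiers exist today:

```go
// Hypothetical sketch of the coordination step above: the CRI implementation
// reports whether its OCI runtime applies the systemd shares->weight formula,
// and the kubelet only switches its own cpu.weight math (e.g. for pod resize)
// once the runtime confirms the matching behavior. All names are invented
// for illustration; no such CRI field exists today.
package main

import "fmt"

// runtimeStatus stands in for a hypothetical extension of the CRI's
// runtime status response.
type runtimeStatus struct {
	// systemdCPUWeightFormula would be set by the CRI once its OCI runtime
	// converts cpu.shares to cpu.weight using systemd's formula.
	systemdCPUWeightFormula bool
}

// kubeletUsesSystemdFormula gates the kubelet-side change on the runtime's
// report, so the two layers can never disagree about the conversion in use.
func kubeletUsesSystemdFormula(s runtimeStatus) bool {
	return s.systemdCPUWeightFormula
}

func main() {
	fmt.Println(kubeletUsesSystemdFormula(runtimeStatus{systemdCPUWeightFormula: true}))
}
```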
> That being said, the formula is entirely an implementation detail that most probably isn't being counted on to produce specific concrete values. In any case, we should ensure that the new formula is well documented and that the change is properly communicated to users.
Will there be a way for users to configure back to the previous behavior, in case of unexpected issues resulting from the change?
I don't think this mechanism will be implemented by the OCI runtimes; the entire mechanism is a workaround for using cgroup v1 settings in a cgroup v2 environment.
If someone wants full control of the cgroup v2 values, then they must use the native cgroup v2 unified map to pass the correct value down the stack.
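For context, the escape hatch mentioned here is the `unified` map in the OCI runtime spec, which passes raw cgroup v2 keys through unmodified. Below is a minimal sketch using the runtime-spec Go types; how a higher layer (CRI, kubelet) would populate this is runtime-dependent and out of scope:

```go
// Minimal sketch: pinning cpu.weight directly via the OCI runtime spec's
// `unified` map, which bypasses any shares->weight conversion entirely.
// LinuxResources and its Unified field are real runtime-spec Go types;
// the surrounding wiring is illustrative only.
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	res := &specs.LinuxResources{
		// Keys in Unified are written verbatim into the container's cgroup,
		// so the runtime applies cpu.weight=100 as-is.
		Unified: map[string]string{"cpu.weight": "100"},
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out))
}
```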
I mainly asked because if we simply switch the implementation, it may have an unexpected impact on user workloads. Having an option to preserve the original behavior could be important.
If the change were purely at the OCI/CRI level, I'd say we could delegate this decision to those layers.
However, since there's a little k8s code that's also affected (e.g. pod-level resources), if a user wanted to roll back to the previous behavior we would need such a configuration knob.
That said, I'm not entirely sure it would be valuable, or that it's worth introducing a config that will most likely be deprecated very quickly.
At the end of the day, the only users I can think of who would be negatively affected by this are users who expect concrete CPU weight values that would now change. I think this use case is very rare, if it exists at all. In any case, I'm open to this approach.
/cc
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules: …
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules: …
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
tallclair
left a comment
It doesn't change the discussion, but IMO we should stop converting between CPU shares and weights. Rather, we should use the k8s-native resource type (milli-CPU) and only convert to the cgroup types at the point where the cgroup value is actually needed. In other words, we would only ever convert directly between milli-CPU and shares, or between milli-CPU and weight, but never from shares to weight.
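As a rough sketch of that idea: the shares helper below mirrors the familiar kubelet rounding, while the direct milli-CPU-to-weight function is hypothetical (not existing kubelet code) and simply anchors one CPU (1000m) at the cgroup v2 default weight of 100:

```go
// Sketch of the suggestion above: convert milli-CPU directly to each cgroup
// value instead of chaining shares->weight. milliCPUToShares mirrors the
// well-known kubelet rounding; milliCPUToWeight is hypothetical and scales
// linearly so that 1000m lands on the cgroup v2 default weight of 100.
package main

import "fmt"

const (
	minShares    = 2
	sharesPerCPU = 1024
	milliPerCPU  = 1000
)

func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliPerCPU
	if shares < minShares {
		return minShares
	}
	return uint64(shares)
}

// milliCPUToWeight maps milli-CPU straight into the kernel's valid
// cpu.weight range [1, 10000], never touching shares.
func milliCPUToWeight(milliCPU int64) uint64 {
	w := milliCPU / 10 // 1000m -> 100, 10000m -> 1000, ...
	if w < 1 {
		w = 1
	}
	if w > 10000 {
		w = 10000
	}
	return uint64(w)
}

func main() {
	for _, m := range []int64{100, 500, 1000, 4000} {
		fmt.Printf("%5dm -> shares=%4d weight=%d\n",
			m, milliCPUToShares(m), milliCPUToWeight(m))
	}
}
```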
> [kubernetes/kubernetes]: https://git.k8s.io/kubernetes
> [kubernetes/website]: https://git.k8s.io/website
>
> ## Summary
This section is all background. I would add a ### Background heading over it, and add a summary of the proposed changes here.
> `cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142)`
>
> While systemd's formula is:
> `cpu.weight = 1 + ((cpu.shares - 2) * 99) / (1024 - 2)`
This differs from the new opencontainers formula - wouldn't it be better to match that?
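To make the gap concrete, here is a quick comparison of the two integer formulas quoted above (a clamp to the kernel's valid cpu.weight range of [1, 10000] is added for safety; systemd clamps similarly at the extremes):

```go
// Compare the opencontainers and systemd shares->weight formulas at a few
// representative cpu.shares values. Both use integer arithmetic; the clamp
// only matters at the extremes of the range.
package main

import "fmt"

func clamp(w uint64) uint64 {
	if w < 1 {
		return 1
	}
	if w > 10000 {
		return 10000
	}
	return w
}

// opencontainersWeight maps the full shares range [2, 262144] onto [1, 10000].
func opencontainersWeight(shares uint64) uint64 {
	return clamp(1 + ((shares-2)*9999)/262142)
}

// systemdWeight anchors the defaults: 1024 shares maps to weight 100.
func systemdWeight(shares uint64) uint64 {
	return clamp(1 + ((shares-2)*99)/(1024-2))
}

func main() {
	for _, s := range []uint64{2, 256, 1024, 4096, 262144} {
		fmt.Printf("shares=%6d  opencontainers=%5d  systemd=%5d\n",
			s, opencontainersWeight(s), systemdWeight(s))
	}
}
```

Note that the default request of one CPU (1024 shares) lands at weight 39 under the current formula but at the cgroup v2 default of 100 under systemd's, which is the low-priority symptom described in kubernetes#131216.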
The KEP was closed, so we do not need this PR.
/close
@SergeyKanzhelev: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
One-line PR description: Migrate to systemd's cgroup v1 CPU shares to v2 CPU weight formula.
Issue link: Migrate to systemd's cgroup v1 CPU shares to v2 CPU weight formula #5246
Other comments:
See the discussion on kubernetes#131216 (Conversion of cgroup v1 CPU shares to v2 CPU weight causes workloads to have low CPU priority).
KEP in a human-readable format: https://github.com/iholder101/kubernetes-enhancements/blob/kep/systemd_cpu_cgroup_conversion/keps/sig-node/5246-cgroup-cpu-share-to-weight-conversion/README.md