Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail Kubelet at startup if swap is configured with cgroup v1 #122241

Closed

Conversation

kannon92
Copy link
Contributor

@kannon92 kannon92 commented Dec 8, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Since Swap requires cgroup v2 to work, we want to fail starting kubelet if swap is enabled and the node is using cgroup v1.

Which issue(s) this PR fixes:

Fixes #119427

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fail kubelet at startup if swap is configured with cgroupv1.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/issues/2400

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 8, 2023
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.29 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.29.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Dec 8 16:13:50 UTC 2023.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 8, 2023
@kannon92
Copy link
Contributor Author

kannon92 commented Dec 8, 2023

This PR addresses the comments in #119327.

I decided to drop the swap_test changes to minimize noise.

@@ -224,6 +224,8 @@ func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.I
return nil, fmt.Errorf("running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: %v", swapLines)
}
}
} else if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) && !cgroups.IsCgroup2UnifiedMode() {
Copy link
Member

@aojea aojea Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok , there is some complexity here, there is an option that defines the behavior of the kubelet if swap is enabled or no

// failSwapOn tells the Kubelet to fail to start if swap is enabled on the node.
--
581 | // Default: true
582 | // +optional
583 | FailSwapOn *bool `json:"failSwapOn,omitempty"`

if that option is not enabled, the kubelet keeps working.

Now we are going to add a new failure mode, if the feature gate NodeSwap is enabled and cgroupsv1, it will fail too.

The question here is that featuregates are not flags, so they can not be used to define conditional behaviors, when the feature graduates the feature gate will no longer be available and will make the kubelet to always fail on hosts with cgroupsv1 and swap enabled but the "FailSwapOn" to false, that will be confusing

/cc @SergeyKanzhelev

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out.

Generally, we are only hitting this change if someone says failSwapOn=false and then we verify that it must be cgroupv2. We do hit a undefined case if NodeSwap is enabled but failSwapOn=true. It essentially makes the feature gate useless. So we will require both NodeSwap=true and failSwapOn=false for swap to be enabled. We also require that the node actually has swap enabled on the kernel level.

I don't really know what we want to do with FailSwapOn for GA for this feature tbh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I think this should be fine as is.

Even though swap is in beta (technically beta2), we are rolling this out with the feature flag off. Its a bit more involved feature as the node needs to have swap set up correctly to work. As with any feature we will eventually drop the feature gate when we GA but I think we need a lot more soak time for this feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those are the right questions Kevin, and that is the goal of the KEPs , you discuss this ambiguous problems and reach a conclusion ... and there is a place to go back and see "oh, this was done because of this", there is a section already for these kind of problems that PRR reviewers check , cc: @wojtek-t

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see @SergeyKanzhelev also did similar comments in the previous PR #119327 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to not use the feature toggle and I use the config option --fail-swap-on=false as that will be required to run kubelet on swap.

We are still debating about how swap rollout should go so I think this is fine now.

/cc @harche

@kannon92
Copy link
Contributor Author

/cc @SergeyKanzhelev @rphillips @harche

@kannon92
Copy link
Contributor Author

kannon92 commented Jan 9, 2024

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 9, 2024
@kannon92 kannon92 moved this from Triage to Needs Reviewer in SIG Node PR Triage Jan 9, 2024
@kannon92
Copy link
Contributor Author

kannon92 commented Jan 9, 2024

/test

@@ -839,7 +839,7 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
TopologyManagerScope: s.TopologyManagerScope,
TopologyManagerPolicyOptions: topologyManagerPolicyOptions,
},
s.FailSwapOn,
s.KubeletConfiguration.FailSwapOn,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is correct, the flag is true

Jan 10 16:09:36 kind-control-plane kubelet[324]: I0110 16:09:36.749688 324 flags.go:64] FLAG: --fail-swap-on="true"

the problem is that here you are setting the value of the configuration

This is a breaking change, you can not change the order of preference of flags vs configuration so lightly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I logged what s.FailSwapOn is and its false in this case. So this is probably not being set..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are parsing this bool from the cli and for some reason its set to false (or not being set at all).

iholder101 and others added 2 commits January 10, 2024 15:12
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92
Copy link
Contributor Author

kannon92 commented Jan 10, 2024

@aojea I tried to reproduce it locally and the config is parsed correctly.

I started to read the logs https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/122241/pull-kubernetes-e2e-kind/1745131801791172608/build-log.txt and I notice that kubeadm is actually setting the KubeletConfig with failSwapOn to false. We are actually ignoring the CLI and using the config. I think this is WAI as the cli options are actually deprecated.

It looks like kind has failSwapOn=false hardcoded in the kubeadm config.

https://github.com/kubernetes-sigs/kind/blob/40c81f187425254daf2bf84360a6257a278252df/pkg/cluster/internal/kubeadm/config.go#L264

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
metadata:
  name: config
cgroupDriver: {{ .CgroupDriver }}
cgroupRoot: /kubelet
failSwapOn: false

So there seems to be some legacy reasons for setting failSwapOn: false. https://cs.k8s.io/?q=failSwapOn%3A%20false&i=nope&files=&excludeFiles=&repos=

I don't really know the reason why this is set up as is but I'm not sure how to proceed with this PR then.

So what I see is there are a lot of templates for setting swap to false. We also set to swap to false for development environment (in hack/local-up-cluster). So I am not sure if this PR is the right path forward.

The Swap KEP is not intended to be used with cgroup v1 but I realize that it seems that it has/is possible to deploy kubelet on a swap enabled node. We just don't have workloads actually use swap but that doesn't mean that we can set swap and cgroupv1 as a failure path.

Comment on lines 227 to 228
} else if !cgroups.IsCgroup2UnifiedMode() {
return nil, fmt.Errorf("failSwapOn is %v but running swap with cgroups v1 is not supported", failSwapOn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that what you want to do is

if failSwapOn && !cgroups.IsCgroup2UnifiedMode()  {
  return nil, fmt.Errorf("failSwapOn is %v but running swap with cgroups v1 is not supported", failSwapOn)
}

and this before the if in line 207

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 10, 2024
@kannon92
Copy link
Contributor Author

kannon92 commented Jan 10, 2024

So thinking about this I think we have three cases that we need to support:

case 1:
Prod now: --fail-swap-on=true
case 2:
CI (kind, minikube, etc and Dev): --fail-swap-on=false for cgroupv1 and cgroupv2.
This is to allow for devs/CI to use machines with swap but kube does not utilize swap.
case 3 (Swap KEP): --fail-swap=on=false and feature gate NodeSwap=true
Nodes can have Swap and kubernetes workloads are allowed to use swap. Basically Kube can not utilize swap in cgroup v1.

case 3 is the main one we want to fail. Case 2 should not be a fail case for swap as kube isn't actually using swap.

@kannon92
Copy link
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6

@aojea
Copy link
Member

aojea commented Jan 11, 2024

So thinking about this I think we have three cases that we need to support:

everything that worked before has to keep working the same case 1 and case 2 are things that people use already and we can not break, is not only kind , CI , dev, there are people out there setting that value ... those has to have the same behavior, you can not change it

case 3 (Swap KEP): --fail-swap=on=false and feature gate NodeSwap=true
Nodes can have Swap and kubernetes workloads are allowed to use swap. Basically Kube can not utilize swap in cgroup v1.

These are the kind of things that have to be decided on the KEP and that is the reason we have PRR reviews, to evaluate the risks of rolling these changes in production

@kannon92
Copy link
Contributor Author

/retest

@kannon92
Copy link
Contributor Author

/cc @harche

@@ -803,6 +803,10 @@ func run(ctx context.Context, s *options.KubeletServer, kubeDeps *kubelet.Depend
return fmt.Errorf("topology manager policy options %v require feature gates %q enabled",
s.TopologyManagerPolicyOptions, features.TopologyManagerPolicyOptions)
}
workloadsUseSwap := false
if utilfeature.DefaultFeatureGate.Enabled(features.NodeSwap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking with @aojea if feature gate check is fine to use it here.

@aojea
Copy link
Member

aojea commented Jan 12, 2024

@harche @kannon92 I was rereading the kep https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#api-changes and we need to change the KEP as what we are implementing here is different that what is described there.

That is completely ok, no problem with that, but KEPs are kind of source of truth for docs, so the behavior implemented has to be reflected in the KEP

### Goals

- On Linux systems, when swap is provisioned and available, Kubelet can start
  up with swap on.
- Configuration is available for kubelet to set swap utilization available to
  Kubernetes workloads, defaulting to 0 swap.
- Cluster administrators can enable and configure kubelet swap utilization on a
  per-node basis.
- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.

... 


* If `SwapBehavior` is not set or set to `"LimitedSwap"`, containers do not have
  access to swap beyond their memory limit. This value prevents a container
  from using swap in excess of their memory limit, even if it is enabled on a
  system.
  * With cgroups v1, it is possible for a container to use _some_ swap if its
    combined memory and swap usage do not exceed the
    [`memory.memsw.limit_in_bytes`] limit.
  * With cgroups v2, swap is configured independently from memory. Thus, the
    container runtimes can set [`memory.swap.max`] to 0 in this case, and _no_ swap
    usage will be permitted.

I understand that we want to re-scope the feature to work only with cgroupsv2, so let me suggest another angle.
Reading the KEP it seems the swap behavior is governed by SwapBehavior , what if we do the feature with cgroupsv1 a noop and we log an error, that is very simple, completely backwards compatible without risk of breaking kubelets and that is easy to think about. Current proposal of different behavior depending on flags and featureflags is hard to reason about and complex to implement in a backward compatible way, you have to think that we need to graduate and remove the featureflag, so this bool will be a constant soon and we'' have the backwards compatible problem with some kubelet crashing on start

diff --git a/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go b/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
index f14b6d1a6fc..b0200953bf2 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go
@@ -167,6 +167,11 @@ func (m *kubeGenericRuntimeManager) configureContainerSwapResources(lcr *runtime
                klog.InfoS("No swap cgroup controller present", "swapBehavior", m.memorySwapBehavior, "pod", klog.KObj(pod), "containerName", container.Name)
                return
        }
+
+       if !libcontainercgroups.IsCgroup2UnifiedMode() {
+               klog.InfoS("Swap behavior only available with cgroupsv2", "swapBehavior", m.memorySwapBehavior, "pod", klog.KObj(pod), "containerName", container.Name)
+               return
+       }
        swapConfigurationHelper := newSwapConfigurationHelper(*m.machineInfo)
 
        if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NodeSwap) {

@kannon92
Copy link
Contributor Author

kannon92 commented Jan 12, 2024

@aojea I have been working on https://docs.google.com/document/d/1S75d_0N0i1taGTGVjQ8YLYymq8D5Yc0aVu-ROMqQkSs/edit?usp=sharing to summarize our findings for evalulating swap. We are planning to update the KEP to reflect a lot of these changes.

Currently this was called out in the Beta1 section but I guess the main text was not updated.

@harche
Copy link
Contributor

harche commented Jan 12, 2024

I understand that we want to re-scope the feature to work only with cgroupsv2, so let me suggest another angle.
Reading the KEP it seems the swap behavior is governed by SwapBehavior , what if we do the feature with cgroupsv1 a noop and we log an error, that is very simple, completely backwards compatible without risk of breaking kubelets and that is easy to think about. Current proposal of different behavior depending on flags and featureflags is hard to reason about and complex to implement in a backward compatible way, you have to think that we need to graduate and remove the featureflag, so this bool will be a constant soon and we'' have the backwards compatible problem with some kubelet crashing on start

Thanks @aojea for that feedback. As @kannon92 already pointed out, the Goals section should have been updated when we updated the graduation criteria for Beta1.

However, we need to reach a consensus on kubelet behavior in case of swap and cgroup v1. @iholder101 started working on the original PR to fail the kubelet after receiving the feedback during the review process. cc @SergeyKanzhelev @pacoxu

@kannon92
Copy link
Contributor Author

I opened up kubernetes/enhancements#4401 to reflect some of this @aojea.

We added a non goal to support cgroupv1 and this feature. I think we need a new configuration for swap enabled nodes with this feature off.

This is a tricky feature and we didn't actually have any way for a user to opt out of kube using swap. --fail-swap-on=false with NodeSwap=true is fine for alpha + beta but when we GA this feature, we will assume that NodeSwap=true is always enabled so we can't really toggle this feature. We only had UnlimitedSwap and LimitedSwap as options and that meant we could never disable the use of swap.

Its clear to me that this feature will be an advanced one and I think we should have fall back options even if this feature is GAed.

@aojea
Copy link
Member

aojea commented Jan 12, 2024

Its clear to me that this feature will be an advanced one and I think we should have fall back options even if this feature is GAed.

it is a technical problem, you know more than me in this area so I trust your judgment, you have several options on the table, flags, noop for cgroupsv1, ... so describe them, study the pros and cons and choose the ones you think is better for the project and the users

Thanks for the healthy discussion

@kannon92
Copy link
Contributor Author

I moved @aojea suggestion to #122745

going to close this one so we can keep new implementation discussion in one place.

/close

SIG Node PR Triage automation moved this from Needs Reviewer to Done Jan 13, 2024
@k8s-ci-robot
Copy link
Contributor

@kannon92: Closed this PR.

In response to this:

I moved @aojea suggestion to #122745

going to close this one so we can keep new implementation discussion in one place.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

[KEP-2400] Enable Swap Support for Cgroup v2 Only
5 participants