release-note: Describe issues around node admission in 1.22 #107348
Conversation
@smarterclayton: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
CHANGELOG/CHANGELOG-1.22.md (Outdated)

@@ -826,6 +826,10 @@ A regression bug was found where guaranteed Pods with multiple containers do not

If CSIMigrationvSphere feature gate is enabled, user should not upgrade to Kubernetes v1.22. vSphere CSI Driver does not support Kubernetes v1.22 yet because it uses v1beta1 CRD APIs. Support for v1.22 will be added at a later release. Check the following document for supported Kubernetes releases for a given [vSphere CSI Driver version](https://vsphere-csi-driver.sigs.k8s.io/compatiblity_matrix.html#compatibility-matrix-for-vsphere-csi-driver).

### Workloads that saturate nodes with pods may see pods that fail due to node admission

1.22 addressed a long-standing issue in the Kubelet where terminating pods were [vulnerable to race conditions](https://github.com/kubernetes/kubernetes/pull/102344) leading to early shutdown, resource leaks, or long delays in actually completing pod shutdown. As a consequence of this change the Kubelet now correctly takes into account the resources of running and terminating pods when deciding to accept new pods, since terminating pods are still holding on to those resources. This stricter handling may surface to end users as pod rejections when creating pods that are scheduled to mostly full nodes that have other terminating pods holding the resources the new pods need. The most likely error would be a pod set to `Failed` phase with reason set to `OutOfCpu` or `OutOfMemory`, but any resource on the node that has some fixed limit (including persistent volume counts on cloud nodes, exclusive CPU cores, or unique hardware devices) could trigger the failure. While this behavior is correct it reduces the throughput of pod execution and creates user-visible warnings - [future versions of Kubernetes will minimize the likelihood users see rejected pods](https://github.com/kubernetes/kubernetes/issues/106884).
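By way of illustration, a client that creates bare pods could treat these admission rejections as retryable rather than terminal. A minimal sketch using client-go, assuming a kubeconfig at the default location and the `default` namespace (both placeholders, not part of this PR):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// isNodeAdmissionFailure reports whether a pod was rejected by the kubelet's
// admission check rather than failing for an application-level reason. The
// reason string is "OutOf" plus the exhausted resource name, so a prefix
// check covers cpu, memory, and extended resources alike.
func isNodeAdmissionFailure(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed &&
		strings.HasPrefix(pod.Status.Reason, "OutOf")
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range pods.Items {
		if isNodeAdmissionFailure(&pods.Items[i]) {
			// A real controller would delete the rejected pod and create a
			// replacement, letting the scheduler pick a node with free capacity
			// once the terminating pods release their resources.
			fmt.Printf("pod %s rejected at admission: %s\n",
				pods.Items[i].Name, pods.Items[i].Status.Reason)
		}
	}
}
```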
it's a "warning" in the case of a workload pod (replicaset, statefulset, etc). But it's a complete failure in the case of plain pods or jobs with a small backoffLimit
.
Can we suggest users to not use plain pods or backoffLimit=0?
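For the Job case, a non-zero `backoffLimit` lets the Job controller replace a pod that the kubelet rejects at admission instead of failing the whole Job. A minimal sketch using the Kubernetes Go API types (the Job name, image, and the value 4 are illustrative, not a recommendation from this PR):

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// With backoffLimit > 0, a pod that fails with an OutOf* admission
	// rejection is retried rather than failing the Job outright.
	backoffLimit := int32(4)

	job := batchv1.Job{
		TypeMeta:   metav1.TypeMeta{APIVersion: "batch/v1", Kind: "Job"},
		ObjectMeta: metav1.ObjectMeta{Name: "example-job"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox", // placeholder image
						Args:  []string{"sh", "-c", "echo done"},
					}},
				},
			},
		},
	}

	// Print the manifest so it can be applied with kubectl.
	out, err := yaml.Marshal(job)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```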
It was already a complete failure for those users during eviction anyway. Do we have a doc that describes the recommended interactions here to reference directly, vs inlining it? I'd probably prefer that.

https://kubernetes.io/docs/concepts/scheduling-eviction/#pod-disruption is close; we could potentially describe this in more detail there.

https://kubernetes.io/docs/tasks/extend-kubernetes/ is probably a good place where we could describe "writing a controller that schedules pods onto a cluster" in detail, including this warning.

Added a stub sentence pending a decision on where to put a better section.
can we say "user-visible pod failures" instead?
Maybe a better link is https://kubernetes.io/docs/concepts/workloads/
In the context of the discussion, https://kubernetes.io/docs/concepts/workloads/ doesn't seem to address taking kubelet rejection into account directly - which part do you view as important to communicate to users?
/assign alculquicondor

@smarterclayton one suggestion inline from reviewer!
Force-pushed from 0147160 to 885d20e.
Force-pushed from 885d20e to 2819891.
/lgtm
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to its retest rules.

/retest
The 1.22 release fixed an issue where pods that were terminating were not always properly accounting for the resources they used. As a consequence, certain workloads that saturate a single node with pods may see increased pod creation failures until existing pods fully terminate. Inform users of that change and link to the issue where we will resolve it in the future.
What type of PR is this?
/kind documentation