k8s.io/dynamic-resource-allocation: fix potential scheduling deadlock #120871
Conversation
/cc @elazar
@pohly: GitHub didn't allow me to request PR reviews from the following users: elazar. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
selectedNode := schedulingCtx.Spec.SelectedNode
potentialNodes := schedulingCtx.Spec.PotentialNodes
if selectedNode != "" && !hasString(potentialNodes, selectedNode) {
	potentialNodes = append(potentialNodes, selectedNode)
Question: Some of the logic below seems to rely on the fact that selectedNode is the first element in a slice. Does this mean that we need to prepend it here instead of appending it?
No. After UnsuitableNodes, the selected node is either the first element, the last element, somewhere in the middle, or not present at all. By ensuring that the truncated element (first or last) is not the selected node, we can be sure that the remaining list still includes it, regardless of where exactly it is.
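To make that concrete, here is a minimal sketch of the truncation rule, assuming a hypothetical helper truncateUnsuitable (the real controller code inlines this logic, as the diff below shows):

// truncateUnsuitable drops one element when the list exceeds the limit,
// always from the end that does not hold the selected node.
func truncateUnsuitable(unsuitable []string, selectedNode string, maxSize int) []string {
	if len(unsuitable) <= maxSize {
		return unsuitable
	}
	if unsuitable[0] == selectedNode {
		// Selected node is the first element: truncate at the end.
		return unsuitable[:len(unsuitable)-1]
	}
	// Selected node is last, somewhere in the middle, or absent:
	// dropping the first element cannot remove it.
	return unsuitable[1:]
}

Whichever end gets dropped, a selected node that was in the list stays in the truncated list.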
if lenUnsuitable > resourcev1alpha2.PodSchedulingNodeListMaxSize {
	if delayed.UnsuitableNodes[0] == selectedNode {
		// Truncate at the end and keep selected node in the first element.
		delayed.UnsuitableNodes = delayed.UnsuitableNodes[0:lenUnsuitable-1]
This truncation assumes that the difference between lenUnsuitable and PodSchedulingNodeListMaxSize is 1. Is this guaranteed?
There was no truncation earlier. Drivers had to return at most PodSchedulingNodeListMaxSize entries, which they did by iterating over the potential nodes slice. Now that slice is potentially one element longer, so truncating by one element works.
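Written out, the bound being relied on is roughly this (a sketch using the names from the quoted code; the constant lives in resourcev1alpha2):

// len(spec.potentialNodes)     <= PodSchedulingNodeListMaxSize      // written by kube-scheduler
// len(potentialNodes)          <= PodSchedulingNodeListMaxSize + 1  // selected node may get appended
// len(delayed.UnsuitableNodes) <= len(potentialNodes)               // drivers iterate over potentialNodes
//
// So lenUnsuitable exceeds the limit by at most one, and dropping a single
// element is always enough.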
The documentation of the UnsuitableNodes method should be extended to cover this.
Done, and also unit tests added. That actually revealed that I had not handled the truncation when initially creating the status entry.
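The added tests themselves are not quoted in this thread; purely as an illustration, a table-driven test for the truncateUnsuitable sketch above (same package, standard testing and reflect imports) could look roughly like this:

func TestTruncateUnsuitable(t *testing.T) {
	const selected = "node-selected"
	testcases := map[string]struct {
		unsuitable []string
		want       []string
	}{
		// maxSize is 2 in every case, so exactly one element must be dropped.
		"selected-first":  {unsuitable: []string{selected, "node-a", "node-b"}, want: []string{selected, "node-a"}},
		"selected-last":   {unsuitable: []string{"node-a", "node-b", selected}, want: []string{"node-b", selected}},
		"selected-absent": {unsuitable: []string{"node-a", "node-b", "node-c"}, want: []string{"node-b", "node-c"}},
	}
	for name, tc := range testcases {
		t.Run(name, func(t *testing.T) {
			got := truncateUnsuitable(tc.unsuitable, selected, 2)
			if !reflect.DeepEqual(got, tc.want) {
				t.Errorf("got %v, want %v", got, tc.want)
			}
		})
	}
}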
When handling a PodSchedulingContext object, the code first checked for unsuitable nodes and then tried to allocate if (and only if) the selected node hadn't been found to be unsuitable.

If for whatever reason the selected node wasn't listed as a potential node, then scheduling got stuck because the allocation would fail and cause a return with an error instead of updating the list of unsuitable nodes. This would be retried with the same result.

To avoid this scenario, the selected node now also gets checked. This is better than assuming a certain kube-scheduler behavior.

This problem occurred when experimenting with cluster autoscaling:

spec:
  potentialNodes:
  - gke-cluster-pohly-pool-dra-69b88e1e-bz6c
  - gke-cluster-pohly-pool-dra-69b88e1e-fpvh
  selectedNode: gke-cluster-pohly-default-pool-c9f60a43-6kxh

Why the scheduler wrote a spec like this is unclear. This was with Kubernetes 1.27 and the code has been updated since then, so perhaps it's resolved.
Force-pushed from 21270c3 to 0ba37e7
/retest
/triage accepted
/lgtm
@elezar: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
LGTM label has been added. Git tree hash: 3785d362368f66174720a3a8b18fb9c5a5e98e3a
@pohly: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When handling a PodSchedulingContext object, the code first checked for unsuitable nodes and then tried to allocate if (and only if) the selected node hadn't been found to be unsuitable.
If for whatever reason the selected node wasn't listed as a potential node, then scheduling got stuck because the allocation would fail and cause a return with an error instead of updating the list of unsuitable nodes. This would be retried with the same result.
To avoid this scenario, the selected node now also gets checked. This is better than assuming a certain kube-scheduler behavior.
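A minimal sketch of that check, mirroring the diff quoted at the top of this page (the wrapper function name is made up here, and hasString is reproduced with a plausible implementation):

import resourcev1alpha2 "k8s.io/api/resource/v1alpha2"

// nodesToCheck returns the nodes whose suitability the driver should report,
// including the selected node even when the scheduler did not list it.
func nodesToCheck(schedulingCtx *resourcev1alpha2.PodSchedulingContext) []string {
	selectedNode := schedulingCtx.Spec.SelectedNode
	potentialNodes := schedulingCtx.Spec.PotentialNodes
	if selectedNode != "" && !hasString(potentialNodes, selectedNode) {
		// An unsuitable selected node then ends up in status.unsuitableNodes
		// instead of repeatedly failing allocation and blocking the pod forever.
		potentialNodes = append(potentialNodes, selectedNode)
	}
	return potentialNodes
}

func hasString(strings []string, str string) bool {
	for _, s := range strings {
		if s == str {
			return true
		}
	}
	return false
}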
Special notes for your reviewer:
This problem occurred when experimenting with cluster autoscaling:

spec:
  potentialNodes:
  - gke-cluster-pohly-pool-dra-69b88e1e-bz6c
  - gke-cluster-pohly-pool-dra-69b88e1e-fpvh
  selectedNode: gke-cluster-pohly-default-pool-c9f60a43-6kxh
Why the scheduler wrote a spec like this is unclear. This was with Kubernetes 1.27 and the code has been updated since then, so perhaps it's resolved.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: