k8s.io/dynamic-resource-allocation: fix potential scheduling deadlock #120871
Conversation
/cc @elazar
@pohly: GitHub didn't allow me to request PR reviews from the following users: elazar. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
selectedNode := schedulingCtx.Spec.SelectedNode
potentialNodes := schedulingCtx.Spec.PotentialNodes
if selectedNode != "" && !hasString(potentialNodes, selectedNode) {
	potentialNodes = append(potentialNodes, selectedNode)
Question: Some of the logic below seems to rely on the fact that selectedNode is the first element in a slice. Does this mean that we need to prepend it here instead of appending it?
No. After UnsuitableNodes, the selected node is either the first element, the last element, somewhere in the middle, or not present at all. By ensuring that the truncated element (first or last) is not the selected node, we can be sure that the remaining list still includes it, regardless of where exactly it is.
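To make that concrete, here is a minimal sketch of the truncation rule, assuming a hypothetical helper truncateUnsuitable (the real controller code inlines this logic, as the diff below shows):

// truncateUnsuitable drops one element when the list exceeds the limit,
// always from the end that does not hold the selected node.
func truncateUnsuitable(unsuitable []string, selectedNode string, maxSize int) []string {
	if len(unsuitable) <= maxSize {
		return unsuitable
	}
	if unsuitable[0] == selectedNode {
		// Selected node is the first element: truncate at the end.
		return unsuitable[:len(unsuitable)-1]
	}
	// Selected node is last, somewhere in the middle, or absent:
	// dropping the first element cannot remove it.
	return unsuitable[1:]
}

Whichever end gets dropped, a selected node that was in the list stays in the truncated list.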
if lenUnsuitable > resourcev1alpha2.PodSchedulingNodeListMaxSize {
	if delayed.UnsuitableNodes[0] == selectedNode {
		// Truncate at the end and keep selected node in the first element.
		delayed.UnsuitableNodes = delayed.UnsuitableNodes[0:lenUnsuitable-1]
This truncation assumes that the difference between lenUnsuitable and PodSchedulingNodeListMaxSize is 1. Is this guaranteed?
There was no truncation earlier. Drivers had to return at most PodSchedulingNodeListMaxSize entries, which they did by iterating over the potential nodes slice. Now that slice is potentially one element longer, so truncating by one element works.
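Written out, the bound being relied on is roughly this (a sketch using the names from the quoted code; the constant lives in resourcev1alpha2):

// len(spec.potentialNodes)     <= PodSchedulingNodeListMaxSize      // written by kube-scheduler
// len(potentialNodes)          <= PodSchedulingNodeListMaxSize + 1  // selected node may get appended
// len(delayed.UnsuitableNodes) <= len(potentialNodes)               // drivers iterate over potentialNodes
//
// So lenUnsuitable exceeds the limit by at most one, and dropping a single
// element is always enough.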
The documentation of the UnsuitableNodes method should be extended to cover this.
Done, and also unit tests added. That actually revealed that I had not handled the truncation when initially creating the status entry.
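The added tests themselves are not quoted in this thread; purely as an illustration, a table-driven test for the truncateUnsuitable sketch above (same package, standard testing and reflect imports) could look roughly like this:

func TestTruncateUnsuitable(t *testing.T) {
	const selected = "node-selected"
	testcases := map[string]struct {
		unsuitable []string
		want       []string
	}{
		// maxSize is 2 in every case, so exactly one element must be dropped.
		"selected-first":  {unsuitable: []string{selected, "node-a", "node-b"}, want: []string{selected, "node-a"}},
		"selected-last":   {unsuitable: []string{"node-a", "node-b", selected}, want: []string{"node-b", selected}},
		"selected-absent": {unsuitable: []string{"node-a", "node-b", "node-c"}, want: []string{"node-b", "node-c"}},
	}
	for name, tc := range testcases {
		t.Run(name, func(t *testing.T) {
			got := truncateUnsuitable(tc.unsuitable, selected, 2)
			if !reflect.DeepEqual(got, tc.want) {
				t.Errorf("got %v, want %v", got, tc.want)
			}
		})
	}
}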
When handling a PodSchedulingContext object, the code first checked for unsuitable nodes and then tried to allocate if (and only if) the selected node hadn't been found to be unsuitable.

If for whatever reason the selected node wasn't listed as a potential node, then scheduling got stuck because the allocation would fail and cause a return with an error instead of updating the list of unsuitable nodes. This would be retried with the same result.

To avoid this scenario, the selected node now also gets checked. This is better than assuming a certain kube-scheduler behavior.

This problem occurred when experimenting with cluster autoscaling:

spec:
  potentialNodes:
  - gke-cluster-pohly-pool-dra-69b88e1e-bz6c
  - gke-cluster-pohly-pool-dra-69b88e1e-fpvh
  selectedNode: gke-cluster-pohly-default-pool-c9f60a43-6kxh

Why the scheduler wrote a spec like this is unclear. This was with Kubernetes 1.27 and the code has been updated since then, so perhaps it's resolved.
Force-pushed from 21270c3 to 0ba37e7
/retest
/triage accepted
/lgtm
@elezar: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
LGTM label has been added. Git tree hash: 3785d362368f66174720a3a8b18fb9c5a5e98e3a
@pohly: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When handling a PodSchedulingContext object, the code first checked for unsuitable nodes and then tried to allocate if (and only if) the selected node hadn't been found to be unsuitable.
If for whatever reason the selected node wasn't listed as a potential node, then scheduling got stuck because the allocation would fail and cause a return with an error instead of updating the list of unsuitable nodes. This would be retried with the same result.
To avoid this scenario, the selected node now also gets checked. This is better than assuming a certain kube-scheduler behavior.
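A minimal sketch of that check, mirroring the diff quoted at the top of this page (the wrapper function name is made up here, and hasString is reproduced with a plausible implementation):

import resourcev1alpha2 "k8s.io/api/resource/v1alpha2"

// nodesToCheck returns the nodes whose suitability the driver should report,
// including the selected node even when the scheduler did not list it.
func nodesToCheck(schedulingCtx *resourcev1alpha2.PodSchedulingContext) []string {
	selectedNode := schedulingCtx.Spec.SelectedNode
	potentialNodes := schedulingCtx.Spec.PotentialNodes
	if selectedNode != "" && !hasString(potentialNodes, selectedNode) {
		// An unsuitable selected node then ends up in status.unsuitableNodes
		// instead of repeatedly failing allocation and blocking the pod forever.
		potentialNodes = append(potentialNodes, selectedNode)
	}
	return potentialNodes
}

func hasString(strings []string, str string) bool {
	for _, s := range strings {
		if s == str {
			return true
		}
	}
	return false
}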
Special notes for your reviewer:
This problem occurred when experimenting with cluster autoscaling:

spec:
  potentialNodes:
  - gke-cluster-pohly-pool-dra-69b88e1e-bz6c
  - gke-cluster-pohly-pool-dra-69b88e1e-fpvh
  selectedNode: gke-cluster-pohly-default-pool-c9f60a43-6kxh
Why the scheduler wrote a spec like this is unclear. This was with Kubernetes 1.27 and the code has been updated since then, so perhaps it's resolved.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: