
k8s.io/dynamic-resource-allocation: fix potential scheduling deadlock #120871

Merged

Conversation

@pohly (Contributor) commented Sep 25, 2023:

What type of PR is this?

/kind bug

What this PR does / why we need it:

When handling a PodSchedulingContext object, the code first checked for unsuitable nodes and then tried to allocate if (and only if) the selected node hadn't been found to be unsuitable.

If for whatever reason the selected node wasn't listed as a potential node, then scheduling got stuck: the allocation would fail and return an error instead of updating the list of unsuitable nodes, and each retry hit the same result.

To avoid this scenario, the selected node now also gets checked. This is better than assuming a certain kube-scheduler behavior.

Special notes for your reviewer:

This problem occurred when experimenting with cluster autoscaling:

spec:
  potentialNodes:
  - gke-cluster-pohly-pool-dra-69b88e1e-bz6c
  - gke-cluster-pohly-pool-dra-69b88e1e-fpvh
  selectedNode: gke-cluster-pohly-default-pool-c9f60a43-6kxh

Why the scheduler wrote a spec like this is unclear. This was with Kubernetes 1.27 and the code has been updated since then, so perhaps it's resolved.

Does this PR introduce a user-facing change?

k8s.io/dynamic-resource-allocation: handle a selected node which isn't listed as potential node

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/3063

@k8s-ci-robot added the labels release-note, kind/bug, size/S, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority on Sep 25, 2023.
@pohly (Contributor, Author) commented Sep 25, 2023:

/cc @elazar

@k8s-ci-robot commented:

@pohly: GitHub didn't allow me to request PR reviews from the following users: elazar.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @elazar

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the labels approved and sig/node, and removed do-not-merge/needs-sig, on Sep 25, 2023.
    selectedNode := schedulingCtx.Spec.SelectedNode
    potentialNodes := schedulingCtx.Spec.PotentialNodes
    if selectedNode != "" && !hasString(potentialNodes, selectedNode) {
        potentialNodes = append(potentialNodes, selectedNode)
    }
A reviewer (Contributor) commented on this diff:
Question: Some of the logic below seems to rely on the fact that selectedNode is the first element in a slice. Does this mean that we need to prepend it here instead of appending it?

pohly (Contributor, Author) replied:

No. After UnsuitableNodes runs, the selected node is either the first element, the last element, somewhere in the middle, or not present at all. By ensuring that the truncated element (first or last) is not the selected node, we can be sure that the rest includes it, regardless of where exactly it is.
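The case analysis above can be sketched as a small standalone helper. This is a hypothetical illustration (the function name `truncateByOne` is invented; the real code operates on `delayed.UnsuitableNodes` in place), not the actual PR code:

```go
package main

import "fmt"

// truncateByOne drops exactly one element from nodes while making sure the
// selected node, if present, survives the truncation.
func truncateByOne(nodes []string, selected string) []string {
	if nodes[0] == selected {
		// Selected node is first: drop the last element instead.
		return nodes[:len(nodes)-1]
	}
	// Selected node is last, in the middle, or absent: dropping the
	// first element can never remove it.
	return nodes[1:]
}

func main() {
	fmt.Println(truncateByOne([]string{"sel", "a", "b"}, "sel")) // [sel a]
	fmt.Println(truncateByOne([]string{"a", "sel", "b"}, "sel")) // [sel b]
	fmt.Println(truncateByOne([]string{"a", "b", "sel"}, "sel")) // [b sel]
}
```

Only one of the two endpoints is removed, and the removed endpoint is checked against the selected node first, which is why the position of the selected node inside the slice never matters.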

    if lenUnsuitable > resourcev1alpha2.PodSchedulingNodeListMaxSize {
        if delayed.UnsuitableNodes[0] == selectedNode {
            // Truncate at the end and keep selected node in the first element.
            delayed.UnsuitableNodes = delayed.UnsuitableNodes[0 : lenUnsuitable-1]
A reviewer (Contributor) commented:

This truncation assumes that the difference between lenUnsuitable and PodSchedulingNodeListMaxSize is 1. Is this guaranteed?

pohly (Contributor, Author) replied:

There was no truncation earlier: drivers had to return at most PodSchedulingNodeListMaxSize nodes, which they did by iterating over the potential nodes slice. Now that slice is potentially one element longer, so truncating by one element works.

The description of the UnsuitableNodes method should be extended to cover this.
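The arithmetic behind "truncating by one element works" can be checked with a short sketch. The helper `unsuitableFrom` is hypothetical, and `maxSize` merely stands in for `resourcev1alpha2.PodSchedulingNodeListMaxSize`; it assumes a driver that derives its unsuitable-node list by filtering the potential nodes, as described above:

```go
package main

import "fmt"

const maxSize = 128 // stands in for resourcev1alpha2.PodSchedulingNodeListMaxSize

// unsuitableFrom models a driver that builds its unsuitable-node list by
// filtering the potential nodes, so the result is never longer than the input.
func unsuitableFrom(potentialNodes []string, suitable func(string) bool) []string {
	var out []string
	for _, node := range potentialNodes {
		if !suitable(node) {
			out = append(out, node)
		}
	}
	return out
}

func main() {
	// kube-scheduler sends at most maxSize potential nodes; appending the
	// selected node makes the slice at most maxSize+1 elements long.
	potential := make([]string, maxSize)
	for i := range potential {
		potential[i] = fmt.Sprintf("node-%d", i)
	}
	potential = append(potential, "selected-node")

	// Worst case: every node is unsuitable. Even then the list exceeds
	// maxSize by at most one element, so dropping one entry is enough.
	unsuitable := unsuitableFrom(potential, func(string) bool { return false })
	fmt.Println(len(unsuitable) - maxSize)
}
```

Because the unsuitable list is always a subset of the (at most maxSize+1 long) potential-node slice, the overflow can never exceed one element.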

pohly (Contributor, Author) added:

Done, and also unit tests added. That actually revealed that I had not handled the truncation when initially creating the status entry.

@pohly force-pushed the dra-unsuitable-nodes-selected-node branch from 21270c3 to 0ba37e7 on September 25, 2023, 16:27.
@k8s-ci-robot added the label size/M and removed size/S on Sep 25, 2023.
@pohly (Contributor, Author) commented Sep 25, 2023:

/retest

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Sep 26, 2023
@bart0sh (Contributor) commented Sep 26, 2023:

/triage accepted
/priority important-soon
/cc

@k8s-ci-robot added the labels triage/accepted and priority/important-soon, and removed needs-triage and needs-priority, on Sep 26, 2023.
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Sep 26, 2023
@elezar (Contributor) left a review comment:

/lgtm

@k8s-ci-robot commented:

@elezar: changing LGTM is restricted to collaborators

In response to this:

/lgtm


@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bart0sh (Contributor) commented Oct 10, 2023:

/lgtm

@k8s-ci-robot added the lgtm label on Oct 10, 2023.
@k8s-ci-robot commented:

LGTM label has been added.

Git tree hash: 3785d362368f66174720a3a8b18fb9c5a5e98e3a

@bart0sh bart0sh moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Oct 10, 2023
@k8s-ci-robot commented:

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-e2e-gce | Commit: 0ba37e7 | Details: link | Required: unknown | Rerun command: /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@k8s-ci-robot merged commit 38c6bd8 into kubernetes:master on Oct 10, 2023.
16 of 17 checks passed
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 10, 2023
@k8s-ci-robot added this to the v1.29 milestone on Oct 10, 2023.