Fix scheduler issue with nodetree additions #93387

maelk · 2020-07-23T13:56:41Z

What type of PR is this?

/kind bug
/sig scheduling

What this PR does / why we need it:

This is a backport of #93355.
When multiple nodes are added to the scheduler nodetree, the function that gets the next node does not return all the nodes one after another, but skips some and duplicates others. This commit works around the problem by always starting with reset counters.
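
For illustration only, here is a simplified Go sketch of the behaviour described above. It is not the actual scheduler code: the `nodeTree` type, its fields, and the `list` and `demo` helpers are hypothetical stand-ins. The sketch assumes nodes are grouped by zone and handed out round-robin, with per-zone counters that persist between listings.

```go
package main

// nodeTree is a hypothetical, simplified stand-in for the scheduler's node tree.
// Nodes are grouped by zone and returned round-robin across zones.
type nodeTree struct {
	zones     []string            // zone names in insertion order
	nodes     map[string][]string // zone -> node names
	lastIndex map[string]int      // zone -> index of the next node to return
	zoneIndex int                 // index of the next zone to visit
	numNodes  int
}

func newNodeTree() *nodeTree {
	return &nodeTree{nodes: map[string][]string{}, lastIndex: map[string]int{}}
}

// addNode registers a node under its zone without touching the counters,
// so an addition can leave the iteration state mid-cycle.
func (t *nodeTree) addNode(zone, name string) {
	if _, ok := t.nodes[zone]; !ok {
		t.zones = append(t.zones, zone)
	}
	t.nodes[zone] = append(t.nodes[zone], name)
	t.numNodes++
}

// resetExhausted rewinds every counter to the start.
func (t *nodeTree) resetExhausted() {
	for _, z := range t.zones {
		t.lastIndex[z] = 0
	}
	t.zoneIndex = 0
}

// next returns the next node name, or "" if the tree is empty.
// When every zone is exhausted, the counters are reset and iteration restarts.
func (t *nodeTree) next() string {
	if len(t.zones) == 0 {
		return ""
	}
	exhausted := 0
	for {
		if t.zoneIndex >= len(t.zones) {
			t.zoneIndex = 0
		}
		zone := t.zones[t.zoneIndex]
		t.zoneIndex++
		if i := t.lastIndex[zone]; i < len(t.nodes[zone]) {
			t.lastIndex[zone] = i + 1
			return t.nodes[zone][i]
		}
		exhausted++
		if exhausted >= len(t.zones) {
			t.resetExhausted()
		}
	}
}

// list calls next() numNodes times. With reset=true it applies the workaround
// from the description: always start from reset counters.
func (t *nodeTree) list(reset bool) []string {
	if reset {
		t.resetExhausted()
	}
	out := make([]string, 0, t.numNodes)
	for i := 0; i < t.numNodes; i++ {
		out = append(out, t.next())
	}
	return out
}

// demo lists an unbalanced tree once, adds a node in a new zone, and then
// lists again with stale counters and with reset counters.
func demo() (stale, fresh []string) {
	t := newNodeTree()
	t.addNode("A", "a1")
	t.addNode("A", "a2")
	t.addNode("B", "b1")
	t.list(false) // leaves the counters mid-cycle
	t.addNode("C", "c1")
	stale = t.list(false)
	fresh = t.list(true)
	return stale, fresh
}
```

Under these assumptions, `demo()` returns `[c1 a1 b1 c1]` for the listing that reuses stale counters (c1 is duplicated and a2 is skipped) and `[a1 b1 c1 a2]` once the counters are reset first.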

Which issue(s) this PR fixes:

Fixes #91601

Does this PR introduce a user-facing change?:

NONE

k8s-ci-robot · 2020-07-23T13:56:50Z

Hi @maelk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

maelk · 2020-07-23T13:57:41Z

/assign @ahg-g

alculquicondor · 2020-07-23T14:06:26Z

/hold

let's discuss on the original PR

ahg-g · 2020-07-23T14:06:49Z

/ok-to-test

maelk · 2020-07-23T15:41:21Z

/hold cancel
We'll go with this as a hotfix: #93355 (review)

alculquicondor · 2020-07-23T16:14:44Z

/approve
/lgtm
/retest

k8s-ci-robot · 2020-07-23T16:15:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, maelk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

maelk · 2020-07-24T19:58:55Z

/retest

alculquicondor · 2020-07-24T20:30:19Z

@kubernetes/release-managers

maelk · 2020-07-25T07:01:05Z

/retest

ahg-g · 2020-07-25T14:05:17Z

/retest

maelk · 2020-07-26T08:11:55Z

/retest

maelk · 2020-07-26T15:24:37Z

/retest
@ahg-g @alculquicondor this has now failed many times on pull-kubernetes-e2e-gce-device-plugin-gpu. Should I keep retriggering? Also, lgty?

ahg-g · 2020-07-26T19:32:59Z

@hasheddan the test pull-kubernetes-e2e-gce-device-plugin-gpu is failing constantly on this 1.18 patch. Will your #93207 fix make it to 1.18, and is it going to fix the failing test?

hasheddan · 2020-07-26T19:38:44Z

@ahg-g interesting, I was monitoring the test on the 1.18 blocking dashboard https://testgrid.k8s.io/sig-release-1.18-blocking#gce-device-plugin-gpu-1.18 and it has been passing... somehow? Anyway, I opened a backport to 1.17 today and am happy to do so for 1.18 as well. Any idea why this is passing on 1.18 blocking?

ahg-g · 2020-07-26T19:43:42Z

I honestly have no idea what this test is doing and why it is under sig-scheduling in the first place!

hasheddan · 2020-07-26T20:00:32Z

@ahg-g I believe it is testing that pods requesting GPUs are scheduled to the correct nodes, so I am guessing that is why it lives in SIG-scheduling? Anyway, I have opened a backport to 1.18 :)

ahg-g · 2020-07-26T20:05:27Z

thanks, do you know if the failure in this PR will be fixed by your PR?

hasheddan · 2020-07-26T20:07:49Z

@ahg-g I cannot be 100% certain, but the failure is consistent with the one that was fixed when the PR was merged to master (i.e. GPUs not being properly set up on nodes)

ahg-g · 2020-07-26T20:11:58Z

/retest

hasheddan

/lgtm

ahg-g · 2020-07-26T22:50:13Z

/retest

maelk · 2020-07-28T10:01:20Z

/retest

alculquicondor · 2020-07-31T15:09:04Z

ping @kubernetes/release-managers

furkatgofurov7 · 2020-08-04T16:00:59Z

20/20 tests passed, but pull-kubernetes-verify job is still failing. Retriggering it again.

/retest

furkatgofurov7 · 2020-08-04T19:58:27Z

The pull-kubernetes-integration job failed due to a timeout.

/retest

fejta-bot · 2020-08-04T23:01:49Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

feiskyer · 2020-08-04T23:56:44Z

/retest

fejta-bot · 2020-08-05T03:13:47Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

k8s-ci-robot added the release-note-none label (denotes a PR that doesn't merit a release note) Jul 23, 2020

k8s-ci-robot added this to the v1.18 milestone Jul 23, 2020

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Jul 23, 2020

k8s-ci-robot requested review from alculquicondor and hex108 July 23, 2020 13:56

k8s-ci-robot assigned ahg-g Jul 23, 2020

maelk mentioned this pull request Jul 23, 2020

Fix scheduler issue with nodetree additions #93355

Merged

k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Jul 23, 2020

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label Jul 23, 2020

maelk force-pushed the sched-fix-1.18-mael branch from 567b12c to b88c10c July 23, 2020 14:24

k8s-ci-robot removed the do-not-merge/hold label Jul 23, 2020

k8s-ci-robot assigned alculquicondor Jul 23, 2020

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Jul 23, 2020

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jul 23, 2020

maelk force-pushed the sched-fix-1.18-mael branch from b88c10c to a87071a July 24, 2020 06:46

k8s-ci-robot removed the lgtm label Jul 24, 2020

hasheddan approved these changes Jul 26, 2020

k8s-ci-robot assigned hasheddan Jul 26, 2020

k8s-ci-robot added the lgtm label Jul 26, 2020

feiskyer added the cherry-pick-approved label (indicates a cherry-pick PR into a release branch has been approved by the release branch manager) and removed the do-not-merge/cherry-pick-not-approved label (indicates that a PR is not yet approved to merge into a release branch) Aug 4, 2020

hasheddan mentioned this pull request Aug 4, 2020

Make critical jobs Guaranteed Pod QOS: pull-kubernetes-verify kubernetes/test-infra#18597

Closed

k8s-ci-robot merged commit c6cb4f0 into kubernetes:release-1.18 Aug 5, 2020

dennis-benzinger-hybris mentioned this pull request Sep 8, 2020

Scheduler and cluster auto scaler don't agree on available resources #93186

Closed

Huang-Wei mentioned this pull request Dec 8, 2020

Scheduler's snapshot.nodeInfoList has data chaos #97120

Closed
