Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated cherry pick of #71551: activate unschedulable pods only if the node became more #71933

Conversation

mlmhl
Copy link
Contributor

@mlmhl mlmhl commented Dec 11, 2018

Cherry pick of #71551 on release-1.11.

#71551: activate unschedulable pods only if the node became more

@k8s-ci-robot k8s-ci-robot added do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 11, 2018
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 11, 2018
@mlmhl mlmhl force-pushed the automated-cherry-pick-of-#71551-upstream-release-1.11 branch from 8adbb14 to ac46298 Compare December 11, 2018 03:26
@mlmhl
Copy link
Contributor Author

mlmhl commented Dec 11, 2018

/retest

@mlmhl
Copy link
Contributor Author

mlmhl commented Dec 11, 2018

/assign @bsalamat

@mlmhl mlmhl force-pushed the automated-cherry-pick-of-#71551-upstream-release-1.11 branch from ac46298 to e674e79 Compare December 11, 2018 04:46
@mlmhl
Copy link
Contributor Author

mlmhl commented Dec 11, 2018

/test pull-kubernetes-bazel-test

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 11, 2018
Copy link
Member

@bsalamat bsalamat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thanks, @mlmhl!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, mlmhl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2018
@foxish foxish added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 12, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@bsalamat bsalamat added this to the v1.11 milestone Dec 12, 2018
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit c4240ec into kubernetes:release-1.11 Dec 12, 2018
@foxish foxish removed the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Dec 12, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Dec 12, 2018
@foxish foxish added cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. and removed do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. labels Dec 12, 2018
@foxish
Copy link
Contributor

foxish commented Dec 14, 2018

It looks to me like this broke a storage test in https://k8s-testgrid.appspot.com/sig-release-1.11-blocking#gce-cos-1.11-serial

cc @mlmhl @bsalamat

@foxish
Copy link
Contributor

foxish commented Dec 14, 2018

Can someone from @kubernetes/sig-scheduling-misc confirm if this is indeed the cause? If not - I'll try and revert it in a couple hours.

@bsalamat
Copy link
Member

@foxish What are your reasons to believe this PR broke the storage test? The storage test seems to be flaky even before the merge of this PR. And, this PR is an important scheduler fix. We need it in 1.11.

@foxish
Copy link
Contributor

foxish commented Dec 14, 2018

Based on testgrid it seems correlated - this test,

[sig-storage] PersistentVolumes-local Stress with local volume provisioner [Serial] should use be able to process many pods and reuse local volumes was not flaky.

It started to fail when c4240ec was committed and became green again when b1d75de (revert PR) was committted.

@bsalamat
Copy link
Member

@foxish This fix has been back-ported to 1.12 and 1.13 as well (and of course exists in HEAD too). It is odd that the storage test depends on the internal algorithms of the scheduler only in 1.11. We should ask SIG Storage to fix their test.

@mlmhl
Copy link
Contributor Author

mlmhl commented Dec 17, 2018

I run this failed test in my local environment and found the reason is that:

When pods are created, the local volume PVs haven't been created yet, so all pods are marked as Unschedulable due to didn't find available persistent volumes to bind. This won't be a problem in previous version as scheduler will retry these Unschedulable pods(by invoking MoveAllToActiveQueue) when receiving node heartbeat updates.

However, after this PR cherry-picked, scheduler won't retry unschedulable pods anymore if nodes only updated for heartbeat, without any meaningful changes. So all pods are staying in pending state and finally timeout.

BTW, I found that this problem doesn't exist in 1.12 and 1.13, because from 1.12 scheduler will retry unschedulable pods also if any pvs added/updated. So I think we should cherry-pick #65616 to 1.11 first before we cherry-pick this PR. cc @bsalamat @msau42

@bsalamat
Copy link
Member

Thanks a lot, @mlmhl for your investigation.

@msau42 @cofyc
Please cherry pick #65616 to 1.11 so that we can cherry pick this PR into 1.11 as well.

@msau42
Copy link
Member

msau42 commented Dec 17, 2018

preparing cherry pick here: #72127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants