Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix inter-pod anti-affinity issue #53647

Merged
merged 1 commit into from Dec 3, 2017
Merged

fix inter-pod anti-affinity issue #53647

merged 1 commit into from Dec 3, 2017

Conversation

wenlxie
Copy link
Contributor

@wenlxie wenlxie commented Oct 10, 2017

This is used to fix:
#50813

Release note:

[scheduler] Fix issue new pod with affinity stuck at `creating` because node had been deleted but its pod still exists.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 10, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @wenlxie. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@timothysc timothysc added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/bug Categorizes issue or PR as related to a bug. labels Oct 16, 2017
Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case for this condition, I would recommend an integration test b/c it uses fake-nodes which should be easy to simulate.

Also, please add a release-note on this one.

@timothysc timothysc added release-note Denotes a PR that will be considered when it comes time to generate release notes. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. status/approved-for-milestone and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 16, 2017
@timothysc timothysc added this to the v1.9 milestone Oct 16, 2017
@anfernee
Copy link
Member

anfernee commented Oct 26, 2017

I am not sure this is the right fix, otherwise masking errors will cause more harm in future. The biggest questions to be answered here are:

  • why node still in cache when it's actually gone? from the current code, we assume it's possible removePod comes after RemoveNode. When all pods are deleted from a node, the node will be removed from the cache. But it's not happening for some reason, which causes this bug. we should figure out why. Node is removed from cache before pod.Spec.NodeName is removed.
  • the CacheNodeInfo that predicate uses isn't cached at all. NodeList itself is cached.
    type CachedNodeInfo struct {
    corelisters.NodeLister
    }
    Is it intentional? Whatever it is, the name is confusing. If this uses cache, this bug will not happen as well. It only happens when cache is out of sync. But it's a good thing that this help exposing the cache problem.

@wenlxie if you still on this issue, we can work together addressing this issue.

@wenlxie
Copy link
Contributor Author

wenlxie commented Nov 5, 2017

@anfernee Thanks for the reply.
It is easy to reproduce this issue, it happens in following scenario:

  1. The node had been removed from the cluster,
  2. There are pod that used to running on there node still exists
  3. A new pod with podAntiAffinity will stuck at pending

You can add a finalizer to the running pod, and then delete its node from k8s, the pod will not get deleted.

@timothysc timothysc removed this from the v1.9 milestone Nov 6, 2017
@anfernee
Copy link
Member

anfernee commented Nov 8, 2017

@wenlxie just curious how do you remove a node? I tried drain, cordon and just shut the node down, they all seem to work fine.

@wenlxie
Copy link
Contributor Author

wenlxie commented Nov 14, 2017

@anfernee Use kubectl delete node command

@anfernee
Copy link
Member

I still cannot reproduce, but I believe @wenlxie 's fix will fix the issue.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2017
Copy link
Member

@timothysc timothysc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems safe but ideally I would like to see a test case for it.

/lgtm
/approve

@timothysc timothysc added this to the v1.9 milestone Nov 29, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 29, 2017
@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: you can't request testing unless you are a kubernetes member.

In response to this:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Copy link
Member

dims commented Dec 3, 2017

/retest

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 2b98a97 into kubernetes:master Dec 3, 2017
@wenlxie wenlxie mentioned this pull request Dec 5, 2017
jpbetz added a commit that referenced this pull request Dec 6, 2017
@wenlxie wenlxie deleted the githubupstream.master.fixinterpodantiaffinity branch October 24, 2019 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants