delete a node from its cache if it gets node not found error #56622

Merged
merged 1 commit into from Dec 12, 2017

Conversation

wackxu
Contributor

@wackxu wackxu commented Nov 30, 2017

What this PR does / why we need it:

Delete a node from the scheduler's cache if the scheduler gets a "node not found" error.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56261

Special notes for your reviewer:

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 30, 2017
@wackxu
Contributor Author

wackxu commented Nov 30, 2017

/assign @bsalamat

@dims
Member

dims commented Nov 30, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 30, 2017
@@ -1109,6 +1109,11 @@ func (factory *configFactory) MakeDefaultErrorFunc(backoff *util.PodBackoff, pod
	} else {
		if _, ok := err.(*core.FitError); ok {
			glog.V(4).Infof("Unable to schedule %v %v: no fit: %v; waiting", pod.Namespace, pod.Name, err)
		} else if errors.IsNotFound(err) {
			if errStatus, ok := err.(errors.APIStatus); ok && errStatus.Status().Details.Kind == "node" {
				node := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: errStatus.Status().Details.Name}}
Member

@bsalamat bsalamat Nov 30, 2017

Can we try again to get the node and if we still see the "not found" error, then remove the node?
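
For illustration, the re-check suggested here can be sketched outside the scheduler code base. This is a minimal stand-alone sketch using only the standard library, not the code in this PR: `ErrNotFound`, `NodeGetter`, `NodeCache`, `fakeAPI`, and `evictIfGone` are all hypothetical names standing in for the API client and the scheduler cache.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the API server's "not found" status.
var ErrNotFound = errors.New("node not found")

// NodeGetter is a hypothetical stand-in for the API client's node lookup.
type NodeGetter interface {
	GetNode(name string) error
}

// NodeCache is a hypothetical stand-in for the scheduler cache.
type NodeCache struct {
	nodes map[string]struct{}
}

// RemoveNode drops the node from the cache; removing an absent node is a no-op.
func (c *NodeCache) RemoveNode(name string) { delete(c.nodes, name) }

// Has reports whether the node is still cached.
func (c *NodeCache) Has(name string) bool {
	_, ok := c.nodes[name]
	return ok
}

// fakeAPI simulates the source of truth: only nodes in live exist.
type fakeAPI struct{ live map[string]bool }

func (f fakeAPI) GetNode(name string) error {
	if f.live[name] {
		return nil
	}
	return fmt.Errorf("get %q: %w", name, ErrNotFound)
}

// evictIfGone asks the API for the node again and removes it from the cache
// only if that second lookup also reports "not found". It returns true when
// the node was evicted.
func evictIfGone(api NodeGetter, cache *NodeCache, name string) bool {
	err := api.GetNode(name)
	if err != nil && errors.Is(err, ErrNotFound) {
		cache.RemoveNode(name)
		return true
	}
	return false
}

func main() {
	api := fakeAPI{live: map[string]bool{"node-a": true}}
	cache := &NodeCache{nodes: map[string]struct{}{"node-a": {}, "node-b": {}}}

	fmt.Println(evictIfGone(api, cache, "node-a"), cache.Has("node-a")) // false true
	fmt.Println(evictIfGone(api, cache, "node-b"), cache.Has("node-b")) // true false
}
```

The point of the second lookup is that a stale or transient "not found" does not evict a node that still exists; any other error from the lookup leaves the cache untouched.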

Contributor Author

@bsalamat Yeah, that is good, I will fix it.

Contributor Author

@bsalamat Done, PTAL

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 1, 2017
				_, err := factory.client.CoreV1().Nodes().Get(errStatus.Status().Details.Name, metav1.GetOptions{})
				if err != nil && errors.IsNotFound(err) {
					node := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: errStatus.Status().Details.Name}}
					factory.schedulerCache.RemoveNode(&node)
Member

Thanks, looks better. After removing the node, we also need to invalidate eCache predicates for the node. In order to do so, please add the following lines:

if factory.enableEquivalenceClassCache {
	factory.equivalencePodCache.InvalidateAllCachedPredicateItemOfNode(node.GetName())
}

Later on, we should refactor our code base so that the function that deletes a node from the scheduler cache also invalidates the eCache.
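
One possible shape for that refactor, sketched with standard-library types only. This is an assumption about how it could look, not existing scheduler code: `nodeCache`, `predicateCache`, `caches`, and `removeNode` are hypothetical names, and a nil `eCache` stands in for the equivalence class cache being disabled.

```go
package main

import "fmt"

// nodeCache is a hypothetical stand-in for the scheduler cache.
type nodeCache struct{ nodes map[string]bool }

// predicateCache is a hypothetical stand-in for the equivalence class cache:
// it maps a node name to the predicate results cached for that node.
type predicateCache struct{ byNode map[string]map[string]bool }

// caches bundles both caches so that one call keeps them consistent.
type caches struct {
	nodes  *nodeCache
	eCache *predicateCache // nil when the equivalence class cache is disabled
}

// removeNode deletes the node from the node cache and, when the equivalence
// class cache is enabled, drops every predicate result cached for that node.
func (c *caches) removeNode(name string) {
	delete(c.nodes.nodes, name)
	if c.eCache != nil {
		delete(c.eCache.byNode, name)
	}
}

func main() {
	c := &caches{
		nodes: &nodeCache{nodes: map[string]bool{"node-a": true, "node-b": true}},
		eCache: &predicateCache{byNode: map[string]map[string]bool{
			"node-b": {"PodFitsResources": true},
		}},
	}
	c.removeNode("node-b")
	fmt.Println(len(c.nodes.nodes), len(c.eCache.byNode)) // 1 0
}
```

With a single entry point, callers cannot remove a node and forget the invalidation step.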

Member

nit: it would be easier to read if you set
nodeName := errStatus.Status().Details.Name
and use nodeName.

@bsalamat
Member

bsalamat commented Dec 1, 2017

@davidopp Could you please add 1.9 milestone to this PR?

@wackxu
Contributor Author

wackxu commented Dec 2, 2017

@bsalamat Done, PTAL

@bsalamat
Member

bsalamat commented Dec 2, 2017

I wish we could write a test for this with a reasonable amount of effort, but it would take a lot of work. So, this is fine.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 2, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, wackxu

Associated issue: 56261

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 2, 2017
@davidopp davidopp added this to the v1.9 milestone Dec 12, 2017
@timothysc timothysc added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. status/approved-for-milestone and removed milestone/incomplete-labels labels Dec 12, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@enisoc enisoc added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 12, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Needs Attention

@bsalamat @davidopp @timothysc @wackxu @kubernetes/sig-scheduling-misc

Action required: During code freeze, pull requests in the milestone should be in progress.
If this pull request is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight pull requests.

Action Required: This pull request has not been updated since Dec 2. Please provide an update.

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Pull Request Labels
  • sig/scheduling: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@ravisantoshgudimetla
Contributor

Can this be backported to 1.8?

@resouer
Contributor

resouer commented Jan 10, 2018

I think so. @wackxu Would you like to try a cherry-pick PR? Just run this from the Kubernetes repo:

hack/cherry_pick_pull.sh upstream/release-1.8 56622

@wackxu
Contributor Author

wackxu commented Jan 10, 2018

Thanks for your help, @resouer. See #58038 @ravisantoshgudimetla

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. milestone/needs-attention priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Scheduler should delete a node from its cache if it gets "node not found" error
10 participants