delete a node from its cache if it gets node not found error #56622
/assign @bsalamat
/ok-to-test
@@ -1109,6 +1109,11 @@ func (factory *configFactory) MakeDefaultErrorFunc(backoff *util.PodBackoff, pod
	} else {
		if _, ok := err.(*core.FitError); ok {
			glog.V(4).Infof("Unable to schedule %v %v: no fit: %v; waiting", pod.Namespace, pod.Name, err)
		} else if errors.IsNotFound(err) {
			if errStatus, ok := err.(errors.APIStatus); ok && errStatus.Status().Details.Kind == "node" {
				node := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: errStatus.Status().Details.Name}}
Can we try again to get the node and if we still see the "not found" error, then remove the node?
@bsalamat Yeah, that is good, I will fix it.
@bsalamat Done, PTAL
				_, err := factory.client.CoreV1().Nodes().Get(errStatus.Status().Details.Name, metav1.GetOptions{})
				if err != nil && errors.IsNotFound(err) {
					node := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: errStatus.Status().Details.Name}}
					factory.schedulerCache.RemoveNode(&node)
Thanks, looks better. After removing the node, we also need to invalidate eCache predicates for the node. In order to do so, please add the following lines:
	if factory.enableEquivalenceClassCache {
		factory.equivalencePodCache.InvalidateAllCachedPredicateItemOfNode(node.GetName())
	}
Later on, we should refactor our code base so that the function that deletes a node from the scheduler cache also invalidates the eCache.
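One way to picture that refactor is a single removal entry point that performs both steps. The following is only a minimal sketch: `nodeCache`, `predicateCache`, `schedulerState`, and `InvalidateNode` are hypothetical stand-ins invented for illustration, not the real scheduler types or methods.

```go
package main

import "fmt"

// nodeCache is a hypothetical stand-in for the scheduler cache.
type nodeCache struct{ nodes map[string]bool }

// RemoveNode deletes the node, or reports an error if it was never cached.
func (c *nodeCache) RemoveNode(name string) error {
	if !c.nodes[name] {
		return fmt.Errorf("node %q is not in the cache", name)
	}
	delete(c.nodes, name)
	return nil
}

// predicateCache is a hypothetical stand-in for the equivalence-class cache.
type predicateCache struct{ byNode map[string][]string }

// InvalidateNode drops every cached predicate result for the node.
func (p *predicateCache) InvalidateNode(name string) { delete(p.byNode, name) }

// schedulerState bundles both caches so callers cannot forget the second step.
type schedulerState struct {
	cache  *nodeCache
	eCache *predicateCache // nil when the equivalence cache is disabled
}

// RemoveNode drops the node and, when enabled, its cached predicate results.
func (s *schedulerState) RemoveNode(name string) error {
	if err := s.cache.RemoveNode(name); err != nil {
		return err
	}
	if s.eCache != nil {
		s.eCache.InvalidateNode(name)
	}
	return nil
}

func main() {
	s := &schedulerState{
		cache:  &nodeCache{nodes: map[string]bool{"node-1": true}},
		eCache: &predicateCache{byNode: map[string][]string{"node-1": {"PodFitsResources"}}},
	}
	if err := s.RemoveNode("node-1"); err != nil {
		fmt.Println("remove failed:", err)
		return
	}
	fmt.Println(len(s.cache.nodes), len(s.eCache.byNode))
}
```

With this shape, the error handler would call one method instead of remembering to pair the cache removal with the eCache invalidation.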
nit: it would be easier to read if you set nodeName := errStatus.Status().Details.Name and use nodeName.
@davidopp Could you please add the 1.9 milestone to this PR?
@bsalamat Done, PTAL
I wish we could write a test for this with a reasonable amount of effort, but it would take a lot. So, this is fine.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bsalamat, wackxu
Associated issue: 56261
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
/test all [submit-queue is verifying that this PR is safe to merge]
[MILESTONENOTIFIER] Milestone Pull Request Needs Attention
@bsalamat @davidopp @timothysc @wackxu @kubernetes/sig-scheduling-misc
Action required: During code freeze, pull requests in the milestone should be in progress.
Action Required: This pull request has not been updated since Dec 2. Please provide an update.
Note: This pull request is marked as
Example update:
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
Can this be backported to 1.8?
I think so. @wackxu Would you like to try a cherry pick PR? Just do this under Kubernetes repo:
Thanks for your help @resouer. See #58038 @ravisantoshgudimetla
What this PR does / why we need it:
Delete a node from the scheduler cache if scheduling fails with a node "not found" error.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56261
Special notes for your reviewer:
Release note: