Keep track of remaining pods when a node is deleted #93938
/retest
/assign @Huang-Wei
/retest
/milestone v1.19
One thing we could do to reduce the chance of accessing nodeinfo.Node() without checking for nil is to return an error when the node object is nil (
@@ -57,8 +57,9 @@ import (
// - Both "Expired" and "Deleted" are valid end states. In case of some problems, e.g. network issue,
// a pod might have changed its state (e.g. added and deleted) without delivering notification to the cache.
type Cache interface {
	// ListPods lists all pods in the cache.
	ListPods(selector labels.Selector) ([]*v1.Pod, error)
	// ListPods returns the number of pods in the cache (including those from deleted nodes).
	// ListPods returns the number of pods in the cache (including those from deleted nodes).
	// PodCount returns the number of pods in the cache (including those from deleted nodes).
Done
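For readers following the rename, here is a minimal sketch of what a count-only accessor could look like. It is not the scheduler's actual code: `Pod`, `nodeEntry`, `podCache`, and `newExampleCache` are hypothetical stand-ins, and the real method's signature may differ (for example, it may also return an error).

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for v1.Pod.
type Pod struct {
	Name     string
	NodeName string
}

// nodeEntry is a hypothetical per-node record kept by the cache.
type nodeEntry struct {
	deleted bool   // true once the Node object itself has been removed
	pods    []*Pod // pods still assigned to this node name
}

// podCache is a hypothetical cache keyed by node name.
type podCache struct {
	nodes map[string]*nodeEntry
}

// PodCount returns the number of pods in the cache, including pods that
// still belong to entries whose node was deleted.
func (c *podCache) PodCount() int {
	count := 0
	for _, n := range c.nodes {
		count += len(n.pods)
	}
	return count
}

// newExampleCache builds a cache with one live node (two pods) and one
// deleted node that still has a pod assigned.
func newExampleCache() *podCache {
	return &podCache{nodes: map[string]*nodeEntry{
		"node-a": {pods: []*Pod{
			{Name: "p1", NodeName: "node-a"},
			{Name: "p2", NodeName: "node-a"},
		}},
		"node-b": {deleted: true, pods: []*Pod{
			{Name: "p3", NodeName: "node-b"},
		}},
	}}
}

func main() {
	c := newExampleCache()
	fmt.Println(c.PodCount())
}
```

Returning only a count keeps callers from depending on the contents of the node map, which matches the intent of marking the method as test-only.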
It would be nice to keep the actual fix in its own commit, separate from all the other updates to the tests.
/test pull-kubernetes-verify
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: Iebb22fc816926aaa1ddd1e4b2e52f335a275ffaa
Signed-off-by: Aldo Culquicondor <acondor@google.com>

The apiserver is expected to send pod deletion events that might arrive at a different time. However, sometimes a node could be recreated without its pods being deleted.
Partial revert of kubernetes#86964
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: I51f683e5f05689b711c81ebff34e7118b5337571
Done
/retest
func (cache *schedulerCache) removePod(pod *v1.Pod) error {
	n, ok := cache.nodes[pod.Spec.NodeName]
	if !ok {
		klog.Errorf("node %v not found when trying to remove pod %v", pod.Spec.NodeName, pod.Name)
As I recall, the original logic returned an error here?
It did. But returning nil is actually safer in the case of extraneous update events that might arrive before a node is created, and after the original node was completely removed.
+1 to returning nil and just logging an error.
I checked the usage of removePod(): a number of callers still rely on the returned value, so I'd suggest reverting to the original behavior.
Detail for each caller:
- ForgetPod: We actually want to proceed and clear the assumedPods and podStates.
- expirePod: Same as above.
- AddPod: it just logs the error returned, so same effect.
- RemovePod: We want to clear podStates.
- updatePod: This is the case where we want to prevent losing information.
That said, for expirePod and ForgetPod, the node shouldn't have been removed because it still had pods assigned.
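The trade-off discussed above can be sketched as follows. This is only an illustration under simplified assumptions, not the scheduler's implementation: `Pod`, `nodeEntry`, `schedCache`, and `newSchedCache` are hypothetical types, and the standard `log` package stands in for klog.

```go
package main

import (
	"fmt"
	"log"
)

// Pod is a simplified, hypothetical stand-in for v1.Pod.
type Pod struct {
	Name     string
	NodeName string
}

// nodeEntry is a hypothetical per-node record.
type nodeEntry struct {
	pods map[string]*Pod
}

// schedCache is a hypothetical, much reduced scheduler cache.
type schedCache struct {
	nodes     map[string]*nodeEntry
	podStates map[string]*Pod
}

func newSchedCache() *schedCache {
	return &schedCache{
		nodes:     map[string]*nodeEntry{},
		podStates: map[string]*Pod{},
	}
}

// removePod detaches the pod from its node entry. When the node entry is
// missing it only logs and returns nil, so callers can continue with their
// own cleanup.
func (c *schedCache) removePod(pod *Pod) error {
	n, ok := c.nodes[pod.NodeName]
	if !ok {
		log.Printf("node %v not found when trying to remove pod %v", pod.NodeName, pod.Name)
		return nil
	}
	delete(n.pods, pod.Name)
	return nil
}

// RemovePod clears the pod's state even if its node entry is already gone.
func (c *schedCache) RemovePod(pod *Pod) error {
	if _, ok := c.podStates[pod.Name]; !ok {
		return fmt.Errorf("pod %v is not in the cache", pod.Name)
	}
	if err := c.removePod(pod); err != nil {
		return err
	}
	delete(c.podStates, pod.Name)
	return nil
}

func main() {
	c := newSchedCache()
	p := &Pod{Name: "p1", NodeName: "gone"}
	c.podStates[p.Name] = p // pod is known, but its node entry is not
	err := c.RemovePod(p)
	fmt.Println(err, len(c.podStates))
}
```

With this shape, a stale event for a pod whose node entry has disappeared still clears `podStates` instead of aborting halfway, which is the behavior the list above asks for in ForgetPod, expirePod, and RemovePod.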
Thanks. That's fair.
Just a nit ^^. LGTM otherwise.
Thanks. LGTM, will leave it to Wei to officially lgtm.
/retest
/lgtm
/hold cancel
This is actually already merged, but the GitHub UI is outdated. /shrug
@alculquicondor: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unshrug
@alculquicondor: ¯\_(ツ)_/¯
What type of PR is this?
/kind bug
What this PR does / why we need it:
The apiserver is expected to send pod deletion events that might arrive at a different time. However, sometimes a node could be recreated without its pods being deleted.
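A minimal sketch of the idea, assuming a cache keyed by node name: when a node is deleted while pods are still assigned to it, the entry is kept (with a nil node) until the pod deletion events arrive, so a recreated node sees its remaining pods. All names here (`Node`, `nodeEntry`, `nodeCache`, and its methods) are hypothetical and simplified; this is not the code in this PR.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for v1.Node.
type Node struct{ Name string }

// nodeEntry keeps the node object (nil once deleted) and the names of the
// pods still assigned to that node name.
type nodeEntry struct {
	node *Node
	pods map[string]bool
}

// nodeCache is a hypothetical cache keyed by node name.
type nodeCache struct {
	nodes map[string]*nodeEntry
}

func newNodeCache() *nodeCache {
	return &nodeCache{nodes: map[string]*nodeEntry{}}
}

// entry returns the record for a node name, creating it if needed.
func (c *nodeCache) entry(name string) *nodeEntry {
	e, ok := c.nodes[name]
	if !ok {
		e = &nodeEntry{pods: map[string]bool{}}
		c.nodes[name] = e
	}
	return e
}

// AddNode attaches the node object, reusing any entry that still holds pods.
func (c *nodeCache) AddNode(n *Node) { c.entry(n.Name).node = n }

// AddPod records the pod under its node name.
func (c *nodeCache) AddPod(podName, nodeName string) {
	c.entry(nodeName).pods[podName] = true
}

// RemoveNode clears the node object but keeps the entry while pods remain.
func (c *nodeCache) RemoveNode(name string) error {
	e, ok := c.nodes[name]
	if !ok {
		return fmt.Errorf("node %v is not found", name)
	}
	e.node = nil
	if len(e.pods) == 0 {
		delete(c.nodes, name)
	}
	return nil
}

// RemovePod drops the pod and garbage-collects an entry whose node is gone
// and which has no pods left.
func (c *nodeCache) RemovePod(podName, nodeName string) {
	e, ok := c.nodes[nodeName]
	if !ok {
		return
	}
	delete(e.pods, podName)
	if e.node == nil && len(e.pods) == 0 {
		delete(c.nodes, nodeName)
	}
}

func main() {
	c := newNodeCache()
	c.AddNode(&Node{Name: "n1"})
	c.AddPod("p1", "n1")
	_ = c.RemoveNode("n1")        // node deleted, pod deletion event not yet seen
	c.AddNode(&Node{Name: "n1"}) // node recreated under the same name
	fmt.Println(len(c.nodes["n1"].pods))
}
```

Because entries can exist with a nil node, any code that reads the node object from such a map has to check for nil first, which is why direct use of the node map needs care.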
Special notes for your reviewer:
This is a partial revert of #86964 and #89908
Since then, we have been more careful about direct usage of the map of nodes. In particular:
- Fixed Dump, which still uses the node map (phantom nodes are still useful to know about, so we don't want to skip them).
- Switched the only remaining list method that uses the map to return a count, and marked it as test-only.
The PR consists of 2 commits:
Does this PR introduce a user-facing change?: