scheduler cache: API and implementation #21016
/cc @wojtek-t
Labelling this PR as size/XL
GCE e2e build/test failed for commit 67f9bc61a565ec89a3a1585072cbf740ecc9e10f.
return nil
}

key := mustGetPodKey(pod)
Moving discussion from parent PR here:
@WojtekT
"I'm against having a "mustGetPodKey" function. We shouldn't panic anywhere the outcome depends on data (data corruption can happen, and it shouldn't blow up our components).
It should be:
key, err := getPodKey(pod)
if err != nil {
return err
}
And you can move it outside the lock, BTW."
@hongchaodeng
"Not returning an error is intentional, because this is not a transient error case (e.g. data corruption). The pod must have a unique key, e.g. a name.
Currently, Pod.Name is unique within Pod.Namespace, and we get "Namespace/Name" as the key by using MetaNamespaceKeyFunc. If some change breaks this compatibility, that is very dangerous and we shouldn't swallow the error.
We could extract the namespace and name from api.Pod directly, but depending on MetaNamespaceKeyFunc seems more maintainable.
I'm not quite sure about this part. I just want to point out that we shouldn't leave such an error unhandled."
I don't agree with you here.
Basically, assume that you have corrupted data in etcd. Then:
- if we leave it as you are suggesting (with a panic), this will cause the scheduler to crash-loop
- if we change it to what I'm suggesting, we may end up with the scheduler trying to schedule that pod multiple times, but it will be scheduling other pods in the meantime too.
The second option is much better, since it doesn't break the system completely.
So I would like this to be changed to return an error.
Just to point out one thing, because I didn't see why corrupted data could cause the panic.
api.Pod is guaranteed at compile time to have GetObjectMeta() and to be an instance of ObjectMetaAccessor. Thus, MetaNamespaceKeyFunc should always return the key. See here and here.
That's orthogonal to returning an error. I also agree with you that it's fine to return an error.
Yes, I agree that we should always be able to access all the necessary fields. But we could, e.g., change "MetaNamespaceKeyFunc" to do some other kind of validation, and then it could return an error.
So I would suggest changing this to return an error.
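For reference, the shape being asked for might look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: `Pod`, `getPodKey`, `schedulerCache`, `assumedPods`, and `AssumePod` are simplified, hypothetical versions defined here so the example compiles on its own. The key is computed before taking the lock, and a key error is returned to the caller.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical, minimal stand-in for api.Pod.
type Pod struct {
	Namespace, Name string
}

// getPodKey returns "Namespace/Name", or an error if the key cannot
// be derived. It needs no shared state, so no lock is required.
func getPodKey(pod *Pod) (string, error) {
	if pod == nil || pod.Name == "" {
		return "", fmt.Errorf("cannot derive key for pod %+v", pod)
	}
	return pod.Namespace + "/" + pod.Name, nil
}

// schedulerCache is a simplified cache guarded by a mutex.
type schedulerCache struct {
	mu          sync.Mutex
	assumedPods map[string]*Pod
}

func newSchedulerCache() *schedulerCache {
	return &schedulerCache{assumedPods: make(map[string]*Pod)}
}

// AssumePod computes the key outside the lock and propagates any
// key error instead of panicking.
func (cache *schedulerCache) AssumePod(pod *Pod) error {
	key, err := getPodKey(pod)
	if err != nil {
		return err
	}

	cache.mu.Lock()
	defer cache.mu.Unlock()
	if _, ok := cache.assumedPods[key]; ok {
		return fmt.Errorf("pod %q is already assumed", key)
	}
	cache.assumedPods[key] = pod
	return nil
}

func main() {
	cache := newSchedulerCache()
	fmt.Println(cache.AssumePod(&Pod{Namespace: "default", Name: "a"}))
	fmt.Println(cache.AssumePod(&Pod{Namespace: "default"}))
}
```

Keeping the key computation outside the critical section means a bad pod never holds the mutex, and the caller decides how to react to the error.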
@hongchaodeng force-pushed from 38335ef to a8ee96a
Addressed all comments. The test failure was caused by the dependency on #20977. I have cherry-picked it here to show the test result.
GCE e2e test build/test passed for commit 38335efbbf259efab82bf7be71c5c851f4d61664.
GCE e2e test build/test passed for commit a8ee96af7c9cbdd532d43727d2905e14e388b58d.
The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?
cache.mu.Lock()
defer cache.mu.Unlock()

nit: remove empty line
Addressed comments
LGTM - thanks!
GCE e2e test build/test passed for commit 5c3d303.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e test build/test passed for commit 5c3d303.
@k8s-bot test this Tests are more than 48 hours old. Re-running tests.
GCE e2e test build/test passed for commit 5c3d303.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test failed for commit 5c3d303.
GCE e2e test build/test passed for commit 5c3d303.
Automatic merge from submit-queue
Auto commit by PR queue bot
This is the cache interface, impl. and tests separated out from #20669.
It depends on #20977.