PodResources API reports resources of pods in terminal phase #119423
Comments
/sig node
/triage accepted
/assign
I was thinking about this issue for a bit, and the root cause is probably that there is no (good enough) formal spec for the API itself, so the implementation is pretty much the reference. That said, my thoughts about the ideal semantics are:
I wonder if the pod resources API may report the same set of resources used by two pods (one succeeded and one newly scheduled) in a single response? @ffromani should we try to summarize the shortcomings at the page you listed as a starting point? We can start with a list of edge cases like:
"I like the idea that the podresources API is driven by whether resources are actually in use, not by the Pod status." Thus the resource manager behavior ideally needs to be documented.
@SergeyKanzhelev I totally agree that the behavior should be documented in a more precise way. I'm just not sure that page is the best place for reference docs. I'll check with SIG Docs and get advice.
Off the top of my head, I'm not sure we have a code path which prevents this. It should be extremely unlikely (at the very least you would need a very limited amount of resources, bypass the scheduler entirely, and hit a race within the kubelet) but possible. It should also be transient: the next query should return the expected, consistent result. Still, I'm not sure there's any prevention for this flow; it also depends on the specifics of the resource managers. The gist is that the podresources API is a thin layer which reformats and exposes the internal data of the resource managers. This makes the code very simple, but it leaks the internal implementation of the resource managers' allocation process pretty badly. This in turn makes the API not really formally specified, with perhaps poor stability guarantees.
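For context, here is a minimal sketch of a client talking to that thin layer, assuming the default kubelet socket path (/var/lib/kubelet/pod-resources/kubelet.sock) and the v1 podresources API; error handling is trimmed for brevity:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Connect to the kubelet's podresources gRPC socket.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List() returns whatever the resource managers currently track --
	// which, per this issue, can include pods already in Succeeded/Failed phase.
	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.GetPodResources() {
		fmt.Printf("%s/%s\n", pod.GetNamespace(), pod.GetName())
	}
}
```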
I think the best approach would be extending https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/podresources/v1/api.proto#L36 to let the client side choose whether to filter out pods in terminal phase. This way we fix the issue AND we preserve backward compatibility. This of course assumes we can extend the message in a backward-compatible way; it should be the case, but we need to double-check.
+1. I like the ability to let the client filter the result based on pod phase, or in general exposing the pod phase and letting the clients filter out the pods themselves.
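Until such a field exists, a client could approximate the second option by cross-referencing pod phase from the apiserver. A hedged sketch follows; the helper name and the use of a client-go clientset are assumptions for illustration, not part of the podresources API:

```go
package podresfilter

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// filterTerminal drops entries from a List() response whose pod is already
// in Succeeded or Failed phase, looking the phase up via the apiserver.
func filterTerminal(ctx context.Context, cs kubernetes.Interface,
	in []*podresourcesv1.PodResources) ([]*podresourcesv1.PodResources, error) {
	out := make([]*podresourcesv1.PodResources, 0, len(in))
	for _, pr := range in {
		pod, err := cs.CoreV1().Pods(pr.GetNamespace()).Get(ctx, pr.GetName(), metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		// Resources of terminal pods may already be reassigned by the
		// kubelet's resource managers, so skip them.
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue
		}
		out = append(out, pr)
	}
	return out, nil
}
```

Note that this workaround inherits the mismatched-view problem between the kubelet and the apiserver that the commit below works around.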
The next question would be whether such a change deserves a KEP. I quite strongly feel that this would still fall under the "bugfix" category: we have a rich history of "invasive" and sometimes behavior-changing fixes, and this change hardly looks like an RFE.
Add an option to create and use a separate informer to get an unfiltered pod listing from the apiserver. Due to the mismatched view with respect to the kubelet, the plugin needs access to pods in terminal phase which are not deleted, to make sure the reconciliation is done correctly. xref: kubernetes-sigs#598 xref: kubernetes/kubernetes#119423 Signed-off-by: Francesco Romani <fromani@redhat.com>
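A minimal sketch of such a separate, unfiltered informer, assuming a client-go clientset and the node name are in scope (the function name is invented for illustration):

```go
package plugin

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newUnfilteredPodInformer lists pods scoped to this node but deliberately
// does NOT exclude terminal phases, so Succeeded/Failed pods stay visible
// for reconciliation against the kubelet's view.
func newUnfilteredPodInformer(cs kubernetes.Interface, nodeName string) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactoryWithOptions(
		cs,
		0, // no periodic resync
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("spec.nodeName", nodeName).String()
		}),
	)
	return factory.Core().V1().Pods().Informer()
}
```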
What happened?
The PodResources API provides a `List()` endpoint which reports all the resources consumed by pods and containers on the node. The problem is that pods in a terminal phase (i.e. in `Failed` or `Succeeded` status) are reported as well. The internal managers reassign resources assigned to pods in terminal phase, so PodResources should ignore such pods, because those resources may already be in use by other pods.
What did you expect to happen?
PodResources should ignore, and not report, resources used by pods which are in a terminal phase.
How can we reproduce it (as minimally and precisely as possible)?
A new test case that demonstrates the exact problem and can be used as a reproducer is provided in #119402.
Anything else we need to know?
The docs that describe the PodResources API do not state explicitly whether `List()` should ignore terminal pods' resources, meaning it might not be a bug at all. OTOH, the internal managers reassign resources assigned to pods in terminal phase, so it makes sense not to account for them.
Kubernetes version
Kubernetes v1.26.2
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)