[DevicePlugin] Should we retry when ListAndWatch gRPC call failed? #58372
Comments
/assign |
Since the client that endpoint is using to wait on I think device plugin restart is already handled, and we create a new endpoint on re-registration. Given all this, I am unable to see the benefit of waiting there. |
Hi,
SGTM if we can make sure about this. By the way, is there any case where the device plugin is still running but
// RecvMsg blocks until it receives a message or the stream is
// done. On client side, it returns io.EOF when the stream is done. On
// any other error, it aborts the stream and returns an RPC status. |
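The distinction quoted in that gRPC comment (io.EOF for a finished stream, any other error for an aborted one) can be illustrated with a small sketch. This is not the kubelet endpoint code: `receiver`, `fakeStream`, `drain`, `newCleanStream`, `newBrokenStream`, and `errBroken` are hypothetical names, and the fake stream only stands in for a gRPC client stream so the example runs without a gRPC dependency.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// receiver is a hypothetical stand-in for the Recv side of a client stream.
type receiver interface {
	Recv() (string, error)
}

// errBroken simulates a non-EOF failure such as a dropped connection.
var errBroken = errors.New("simulated stream failure")

// fakeStream replays canned messages and then returns endErr forever.
type fakeStream struct {
	msgs   []string
	endErr error
}

func (f *fakeStream) Recv() (string, error) {
	if len(f.msgs) == 0 {
		return "", f.endErr
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return m, nil
}

// newCleanStream ends with io.EOF, i.e. the sender finished the stream.
func newCleanStream(msgs ...string) *fakeStream {
	return &fakeStream{msgs: msgs, endErr: io.EOF}
}

// newBrokenStream ends with errBroken, i.e. the stream was aborted.
func newBrokenStream(msgs ...string) *fakeStream {
	return &fakeStream{msgs: msgs, endErr: errBroken}
}

// drain reads until the stream ends. io.EOF counts as a clean end and
// yields a nil error; any other error is returned to the caller.
func drain(r receiver) ([]string, error) {
	var got []string
	for {
		m, err := r.Recv()
		if errors.Is(err, io.EOF) {
			return got, nil
		}
		if err != nil {
			return got, err
		}
		got = append(got, m)
	}
}

func main() {
	got, err := drain(newCleanStream("dev-a", "dev-b"))
	fmt.Println(got, err)

	got, err = drain(newBrokenStream("dev-a"))
	fmt.Println(got, err)
}
```

In this sketch the caller, not the receive loop, decides what to do with a non-EOF error, which is the decision this issue is asking about.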
I think the stream should be kept around during the life cycle of the device plugin, so I agree with @vikaschoudhary16 that the current behavior seems WAI. |
This is already backed by gRPC's retry mechanism.
These messages are not random. They happen because the client disconnects (i.e., the device plugin closes the connection). |
Thanks for the detailed explanation. It seems we are OK with this, so I am closing it. |
Is this a BUG REPORT or FEATURE REQUEST?:
/area hw-accelerators
/sig node
What happened:
We return immediately when the endpoint's ListAndWatch gRPC call fails, and the endpoint then stops.
Should we give it another chance (e.g. wait 3 seconds) for the device plugin client to come up?
Another thing: maybe this is why we randomly get so many gRPC error messages in unit tests.
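For illustration only, the "wait and try again" idea could look roughly like the sketch below. It is not the kubelet implementation: `retry`, `flakyCall`, and `shortWait` are hypothetical names, and the fake call merely simulates a plugin that fails a few times before it becomes reachable. A real caller would pass something like 3 seconds as the wait.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// shortWait keeps the demo fast; the proposal mentions waiting about 3 seconds.
const shortWait = 5 * time.Millisecond

// retry calls fn up to attempts times (at least once), sleeping wait
// between failed attempts. It returns nil on the first success,
// otherwise the error from the last attempt.
func retry(attempts int, wait time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		// Do not sleep after the final failed attempt.
		if i < attempts-1 {
			time.Sleep(wait)
		}
	}
	return err
}

// flakyCall returns a function that fails the first `failures` times
// and then succeeds, plus a pointer to its call counter.
func flakyCall(failures int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("device plugin not ready")
		}
		return nil
	}, &calls
}

func main() {
	call, calls := flakyCall(2)
	err := retry(3, shortWait, call)
	fmt.Println("error:", err, "calls:", *calls)
}
```

A bounded attempt count keeps the endpoint from waiting indefinitely on a plugin that never comes back.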
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I will send a PR for this proposal.
@vikaschoudhary16 @jiayingz @RenaudWasTaken WDYT?
Environment:
kubectl version:
uname -a: