
[DevicePlugin] Should we retry when ListAndWatch gRPC call failed? #58372

Closed
ScorpioCPH opened this issue Jan 17, 2018 · 6 comments
Assignees
Labels
area/hw-accelerators sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@ScorpioCPH
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:

/area hw-accelerators
/sig node

What happened:

We return immediately when the endpoint's ListAndWatch gRPC call fails, and then the endpoint stops:

go func() {
	e.run()
	e.stop()
}()

Should we give it another chance (e.g. wait 3 seconds) for the device plugin client to come back up?

Another thing: maybe this is why we randomly get so many gRPC error messages in the unit tests.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I will send a PR for this proposal.

@vikaschoudhary16 @jiayingz @RenaudWasTaken WDYT?

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/hw-accelerators sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2018
@ScorpioCPH
Contributor Author

/assign

@vikaschoudhary16
Contributor

Since the client that the endpoint uses to wait on ListAndWatch has already ensured that the connection was READY with the server, if it still fails, that could happen only if the device plugin has died.
So your question becomes: should we wait for the device plugin to restart?

I think device plugin restart is already handled, and we create a new endpoint on re-registration. With all this in mind, I can't see the benefit of waiting there.

@ScorpioCPH
Contributor Author

Hi,

that could happen only if device plugin has died.

SGTM, if we can be sure about this.

By the way, is there any case where the device plugin is still running but the ClientStream is down?
Just FYI here:

// RecvMsg blocks until it receives a message or the stream is
// done. On client side, it returns io.EOF when the stream is done. On
// any other error, it aborts the stream and returns an RPC status.

@jiayingz
Contributor

I think the stream should be kept around for the lifecycle of the device plugin, so I agree with @vikaschoudhary16 that the current behavior seems WAI (working as intended).

@RenaudWasTaken
Contributor

Should we give it another chance (e.g. wait 3 seconds) for the device plugin client to come back up?

This is already covered by gRPC's retry mechanism.

Another thing: maybe this is why we randomly get so many gRPC error messages in the unit tests.

These messages are not random. They happen because the client disconnects (i.e. the device plugin closes the connection).
The ordering is fairly random, though, probably because it doesn't use the same logging mechanism.

@ScorpioCPH
Contributor Author

Thanks for the detailed explanation. It seems we are all OK with this, so I'm closing the issue.
