
[DevicePlugin] Should we retry when ListAndWatch gRPC call failed? #58372

Closed
ScorpioCPH opened this issue Jan 17, 2018 · 6 comments
Assignees
Labels
area/hw-accelerators sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@ScorpioCPH
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:

/area hw-accelerators
/sig node

What happened:

We return immediately when the endpoint's ListAndWatch gRPC call fails, and then the endpoint stops:

go func() {
	e.run()
	e.stop()
}()

Should we give it another chance (e.g. wait 3 seconds) for the device plugin client to come back up?

Another thing: maybe this is why we randomly get so many gRPC error messages in the unit tests.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I will send a PR for this proposal.

@vikaschoudhary16 @jiayingz @RenaudWasTaken WDYT?

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/hw-accelerators sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2018
@ScorpioCPH
Contributor Author

/assign

@vikaschoudhary16
Contributor

Since the client that the endpoint uses to wait on ListAndWatch has already ensured that the connection was READY with the server, if it still fails, that could happen only if the device plugin has died.
So your question becomes: should we wait for the device plugin to restart?

I think device plugin restart is already handled, and we create a new endpoint on re-registration. With all this in mind, I can't see the benefit of waiting there.

@ScorpioCPH
Contributor Author

Hi,

that could happen only if device plugin has died.

SGTM, if we can be sure about this.

By the way, is there any case where the device plugin is still running but the ClientStream is down?
Just FYI here:

// RecvMsg blocks until it receives a message or the stream is
// done. On client side, it returns io.EOF when the stream is done. On
// any other error, it aborts the stream and returns an RPC status.

@jiayingz
Contributor

I think the stream should be kept around for the lifecycle of the device plugin, so I agree with @vikaschoudhary16 that the current behavior seems WAI (working as intended).

@RenaudWasTaken
Contributor

Should we give it another chance (e.g. wait 3 seconds) for the device plugin client to come back up?

This is already covered by gRPC's retry mechanism.

Another thing: maybe this is why we randomly get so many gRPC error messages in the unit tests.

These messages are not random. They happen because the client disconnects (i.e. the device plugin closes the connection).
The ordering is fairly random, though, probably because it doesn't use the same logging mechanism.

@ScorpioCPH
Contributor Author

Thanks for the detailed explanation. It seems we are all OK with this, so I'm closing the issue.
