Enable retry option by default #121

Closed
soletan opened this issue Feb 28, 2020 · 2 comments

soletan commented Feb 28, 2020

First of all, thanks for this adapter, which we have recently started using with our server-side framework's ODM.

While testing an application, we shut down one node of the etcd cluster to simulate a failed-state scenario. The cluster was still able to handle requests, but whenever a request was directed at the node that had been shut down, the consuming application failed. After some investigation we realized that the retry option must be set explicitly in addition to the list of endpoint hosts. Shouldn't it be sufficient to rely on your connection-pool.ts dividing its pool into available and failed hosts?
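
Roughly, this is the setup we ended up with. It is only a sketch: the endpoint URLs and key are placeholders, and the `hosts`/`retry` option names are as we found them in the adapter's options.

```ts
import { Etcd3 } from 'etcd3';

// All cluster endpoints are listed, yet failover only works once the
// easy-to-miss retry flag is enabled in addition.
const client = new Etcd3({
  hosts: [
    'https://etcd-1.example.com:2379',
    'https://etcd-2.example.com:2379',
    'https://etcd-3.example.com:2379',
  ],
  retry: true, // has to be set explicitly today; this issue asks to make it the default
});

async function readSetting(): Promise<string | null> {
  // With retry disabled, this call fails as soon as the pool happens to pick
  // the node that has been shut down, even though the cluster is still healthy.
  return client.get('config/some-setting').string();
}
```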

Thus I'd ask for retry to be enabled by default.

What's the point of using an adapter like etcd3 to access a multi-node cluster, one capable of serving requests while some nodes fail, if by default it requires the consuming application to handle cluster issues itself? Our application wants to store its data in a cluster for its higher availability. If it had to detect and handle failing endpoints and pick strategies for every probable scenario, why should it try to use a cluster in the first place instead of implementing its own? IMHO it's up to the adapter - etcd3 in this case - to cover failing nodes and implicitly switch to another node by default, so it can serve the application through its own API as well as possible.

The current requirement to set the retry option in order to benefit from the adapter coping with failed endpoints isn't obvious, mostly because the documentation is provided as code-level documentation only. And when multiple endpoints are listed for exactly this reason, having to set it as well isn't intuitive either.

In addition, IMHO the current behaviour is bad by design. Passing back gRPC errors goes beyond the semantics defined for the API between your adapter and a consuming application. It's like reporting SCSI command failures to an application that is merely trying to open a regular file. When trying to read a certain key, getting GRPCGenericError: 14 UNAVAILABLE: DNS resolution failed won't help that application decide how to proceed. It has nothing to do with its intention to read some key from an opaque store exposed through your adapter's API. In the worst case the application would have to repeat the exact same read command, because it doesn't know how to adjust that command to affect the way it is executed. It can't even control which endpoint is picked, yet it still has to handle the problem when the endpoint picked by your code is down. There is nothing the application can do about this apart from bailing out or trying again based on information that isn't sustainable and might change in future releases, e.g. when your adapter decides to switch to a different dependency. Thus, I think etcd3 isn't covering its background interaction with the cluster well enough.

connor4312 (Member) commented

Thank you for your input. Apologies for the delay in replying; there were permission issues in this repo for a while.

I believe we left retries disabled by default because we ended up in retry storms a couple of times at my old team where we used this (plus we had a couple thousand processes connected to etcd3). However, in the years since originally writing this, I've been enlightened to the joys of circuit breakers, and actually recently wrote a library for them, Cockatiel.

In the upcoming release, within the next few days, I want to provide integration with Cockatiel to allow consumers to define retry and circuit-breaker policies at both the host and cluster level. This will include fairly liberal default policies, probably three retries with a 2-second circuit breaker timeout, just to avoid the default options shooting any feet too badly.
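
To sketch what those defaults might look like in Cockatiel terms: the snippet below is illustrative only, uses Cockatiel's functional API, and the way etcd3 will accept such policies isn't shown here.

```ts
import {
  circuitBreaker,
  ConsecutiveBreaker,
  ExponentialBackoff,
  handleAll,
  retry,
  wrap,
} from 'cockatiel';

// Roughly the liberal defaults described above: up to three retries, and a
// per-host circuit breaker that stays open for two seconds once a host has
// failed a few times in a row.
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff(),
});

const hostBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 2_000,
  breaker: new ConsecutiveBreaker(3),
});

// Retry wraps the breaker, so a call against a tripped host fails fast and
// the retry policy can attempt the operation again (ideally on another host).
const policy = wrap(retryPolicy, hostBreaker);

// Usage against any async operation, e.g. an etcd read:
// await policy.execute(() => client.get('config/some-setting').string());
```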

Regarding errors: I have some subclasses of errors in here, and GRPCGenericError is just the fallback error type. I do not want to obscure the gRPC errors, since as a developer, if your DNS is down, you want to know that rather than just getting some generic "could not connect to server" error. However, it would be a good idea to type the categories of errors better, such as connection errors versus 4xx-style request errors. This is also necessary for reliable retries.
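
As a rough illustration of the direction, here is an application-side sketch built on the existing fallback type only (it assumes GRPCGenericError is exported from the package; a finer-grained split into connection versus request errors would make the checks below more specific):

```ts
import { Etcd3, GRPCGenericError } from 'etcd3';

const client = new Etcd3({ hosts: ['https://etcd-1.example.com:2379'] });

async function readSetting(): Promise<string | null> {
  try {
    return await client.get('config/some-setting').string();
  } catch (err) {
    if (err instanceof GRPCGenericError) {
      // Fallback category discussed above, e.g. "14 UNAVAILABLE: DNS
      // resolution failed". Today the application can only treat this as
      // "etcd unreachable" and retry or bail out.
      return null;
    }
    // More specific subclasses would map to request-level problems the
    // application can actually act on.
    throw err;
  }
}
```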

Ultimately there will always be the possibility of getting generic errors, simply because the grpc module is upstream and I only have a manual map of error types.

connor4312 (Member) commented

Added in 204eb2d
