Enable retry option by default #121

Closed
soletan opened this issue Feb 28, 2020 · 2 comments

soletan commented Feb 28, 2020

First of all, thanks for this adapter, which we have recently started using with our server-side framework's ODM.

While testing an application, we shut down one node of the etcd cluster to simulate a failed-state scenario. The cluster was still able to handle requests, but whenever a request was directed at the node that had been shut down, the consuming application failed. After some investigation we realized that the retry option must be set explicitly in addition to the list of endpoint hosts. Shouldn't it be sufficient to rely on your connection-pool.ts dividing its pool into available and failed hosts?
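
Roughly, this is the setup we ended up with. It is only a sketch: the endpoint URLs and key are placeholders, and the `hosts`/`retry` option names are as we found them in the adapter's options.

```ts
import { Etcd3 } from 'etcd3';

// All cluster endpoints are listed, yet failover only works once the
// easy-to-miss retry flag is enabled in addition.
const client = new Etcd3({
  hosts: [
    'https://etcd-1.example.com:2379',
    'https://etcd-2.example.com:2379',
    'https://etcd-3.example.com:2379',
  ],
  retry: true, // has to be set explicitly today; this issue asks to make it the default
});

async function readSetting(): Promise<string | null> {
  // With retry disabled, this call fails as soon as the pool happens to pick
  // the node that has been shut down, even though the cluster is still healthy.
  return client.get('config/some-setting').string();
}
```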

Thus I'd ask for retry to be enabled by default.

What's the point of using an adapter like etcd3 to access a multi-node cluster, one capable of serving requests while some nodes fail, if by default it requires the consuming application to handle cluster issues itself? Our application wants to store its data in a cluster for its higher availability. If it had to detect and handle failing endpoints and pick strategies for every probable scenario, why should it try to use a cluster in the first place instead of implementing its own? IMHO it's up to the adapter - etcd3 in this case - to cover failing nodes and implicitly switch to another node by default, so it can serve the application through its own API as well as possible.

The current requirement to set the retry option in order to benefit from the adapter coping with failed endpoints isn't obvious, mostly because the documentation is provided as code-level documentation only. And when multiple endpoints are listed for exactly this reason, having to set it as well isn't intuitive either.

In addition, IMHO the current behaviour is bad by design. Passing back gRPC errors goes beyond the semantics defined for the API between your adapter and a consuming application. It's like reporting SCSI command failures to an application that is merely trying to open a regular file. When trying to read a certain key, getting GRPCGenericError: 14 UNAVAILABLE: DNS resolution failed won't help that application decide how to proceed. It has nothing to do with its intention to read some key from an opaque store exposed through your adapter's API. In the worst case the application would have to repeat the exact same read command, because it doesn't know how to adjust that command to affect the way it is executed. It can't even control which endpoint is picked, yet it still has to handle the problem when the endpoint picked by your code is down. There is nothing the application can do about this apart from bailing out or trying again based on information that isn't sustainable and might change in future releases, e.g. when your adapter decides to switch to a different dependency. Thus, I think etcd3 isn't covering its background interaction with the cluster well enough.

connor4312 (Member) commented

Thank you for your input. Apologies for the delay in replying; there were permission issues in this repo for a while.

I believe we left retries disabled by default because we ended up in retry storms a couple of times at my old team where we used this (plus we had a couple thousand processes connected to etcd3). However, in the years since originally writing this, I've been enlightened to the joys of circuit breakers, and actually recently wrote a library for them, Cockatiel.

In the upcoming release, within the next few days, I want to provide integration with Cockatiel to allow consumers to define retry and circuit-breaker policies at both the host and cluster level. This will include fairly liberal default policies, probably three retries with a 2-second circuit breaker timeout, just to avoid the default options shooting any feet too badly.
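
To sketch what those defaults might look like in Cockatiel terms: the snippet below is illustrative only, uses Cockatiel's functional API, and the way etcd3 will accept such policies isn't shown here.

```ts
import {
  circuitBreaker,
  ConsecutiveBreaker,
  ExponentialBackoff,
  handleAll,
  retry,
  wrap,
} from 'cockatiel';

// Roughly the liberal defaults described above: up to three retries, and a
// per-host circuit breaker that stays open for two seconds once a host has
// failed a few times in a row.
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff(),
});

const hostBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 2_000,
  breaker: new ConsecutiveBreaker(3),
});

// Retry wraps the breaker, so a call against a tripped host fails fast and
// the retry policy can attempt the operation again (ideally on another host).
const policy = wrap(retryPolicy, hostBreaker);

// Usage against any async operation, e.g. an etcd read:
// await policy.execute(() => client.get('config/some-setting').string());
```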

Regarding errors: I have some subclasses of errors in here, and GRPCGenericError is just the fallback error type. I do not want to obscure the gRPC errors, since as a developer, if your DNS is down, you want to know that rather than just getting some generic "could not connect to server" error. However, it would be a good idea to type the categories of errors better, such as connection errors versus 4xx-style request errors. This is also necessary for reliable retries.
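
As a rough illustration of the direction, here is an application-side sketch built on the existing fallback type only (it assumes GRPCGenericError is exported from the package; a finer-grained split into connection versus request errors would make the checks below more specific):

```ts
import { Etcd3, GRPCGenericError } from 'etcd3';

const client = new Etcd3({ hosts: ['https://etcd-1.example.com:2379'] });

async function readSetting(): Promise<string | null> {
  try {
    return await client.get('config/some-setting').string();
  } catch (err) {
    if (err instanceof GRPCGenericError) {
      // Fallback category discussed above, e.g. "14 UNAVAILABLE: DNS
      // resolution failed". Today the application can only treat this as
      // "etcd unreachable" and retry or bail out.
      return null;
    }
    // More specific subclasses would map to request-level problems the
    // application can actually act on.
    throw err;
  }
}
```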

Ultimately there will always be the possibility of getting generic errors, simply because the grpc module is upstream and I only have a manual map of error types.

connor4312 (Member) commented

Added in 204eb2d
