First of all, thanks for this adapter, which we are currently starting to use with our server-side framework's ODM.
While testing an application we ran into issues when one node of the etcd cluster was shut down to simulate failure scenarios. The cluster itself was still able to handle requests, but whenever a request was routed to the node that had been shut down, the consuming application failed. After some investigation we realized that the retry option must be set explicitly in addition to the list of endpoint hosts. Isn't it sufficient to rely on your connection-pool.ts dividing its pool into available and failed hosts?
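For context, this is roughly what we ended up with (a sketch only; the exact shape of the options object may differ between releases, and the endpoint URLs are placeholders):

```typescript
import { Etcd3 } from 'etcd3';

// Listing several endpoints alone is not enough today: without the
// explicit retry flag, a request routed to a downed node fails the call
// instead of falling back to one of the other listed hosts.
const client = new Etcd3({
  hosts: [
    'https://etcd-1:2379',
    'https://etcd-2:2379',
    'https://etcd-3:2379',
  ],
  retry: true, // must be opted into explicitly; this issue asks for it to default to true
});
```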
Thus I'd ask for enabling retry by default.
What's the point of using an adapter like etcd3 to access a multi-node cluster, one capable of serving requests while some nodes fail, when by default it requires the consuming application to handle cluster issues itself? Our application wants to store its data in a cluster precisely for the higher availability. If it had to detect and handle failing endpoints and pick a strategy for every probable scenario itself, why would it use a cluster in the first place instead of implementing its own? IMHO it's up to the adapter (etcd3 in this case) to cover failing nodes and implicitly switch to another node by default, serving the application through its own API as well as possible.
The current requirement to set the retry option in order to benefit from the adapter's handling of failed endpoints isn't obvious, mostly because the documentation lives in code only. Nor is it intuitive: listing multiple endpoints is usually done for exactly this reason.
In addition, IMHO the current behaviour is bad by design. Passing gRPC errors through to the caller goes beyond the semantics of the API between your adapter and a consuming application. It's like reporting SCSI command failures to an application that is trying to open a regular file. When reading a certain key, getting GRPCGenericError: 14 UNAVAILABLE: DNS resolution failed doesn't help the application decide how to proceed. It is unrelated to its intention of reading a key from an opaque store exposed through your adapter's API. In the worst case the application has to repeat the exact same read, because it has no way of adjusting the request to influence how it is executed. It can't even control which endpoint is picked, yet it still has to handle the case where the endpoint your code picked is down. There is nothing the application can do about this apart from bailing out, or retrying based on information that isn't sustainable and might change in future releases, e.g. when your adapter switches to a different dependency. Thus, I think etcd3 doesn't cover its background interaction with the cluster well enough.
Thank you for your input. Apologies for the delayed reply; there were permissioning issues in this repo for a while.
I believe we left retries disabled by default because we ended up in retry storms a couple of times at my old team, where we used this (and we had a couple thousand processes connected to etcd3). However, in the years since writing this originally, I've been enlightened to the joys of circuit breakers, and actually recently wrote a library for them, Cockatiel.
In the upcoming release (within the next few days) I want to provide integration with Cockatiel to allow consumers to define retry and circuit-breaker policies at both the host and cluster level. This will include fairly liberal default policies, probably three retries with a 2-second circuit-breaker timeout, just to avoid the default options shooting any feet too badly.
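As a self-contained sketch of what those defaults would mean in practice (names and structure here are illustrative, not the actual etcd3 or Cockatiel API): each host gets a breaker that opens for 2 seconds after a failure, and a call is attempted up to three times, skipping hosts whose breaker is open.

```typescript
type Clock = () => number;

// Per-host circuit breaker: after a failure, the host is considered
// "open" (unusable) for `openMs` milliseconds before being tried again.
class HostBreaker {
  private openedAt = -Infinity;
  constructor(private readonly openMs = 2000, private readonly now: Clock = Date.now) {}

  isOpen(): boolean {
    return this.now() - this.openedAt < this.openMs;
  }
  recordFailure(): void {
    this.openedAt = this.now();
  }
}

// Try a call up to `maxAttempts` times, rotating through hosts and
// skipping any host whose breaker is currently open. A failure opens
// that host's breaker so the next attempt lands elsewhere.
async function withRetries<T>(
  hosts: string[],
  breakers: Map<string, HostBreaker>,
  call: (host: string) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown = new Error('no host available');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const host = hosts[attempt % hosts.length];
    const breaker = breakers.get(host);
    if (breaker?.isOpen()) continue; // skip hosts known to be failing
    try {
      return await call(host);
    } catch (err) {
      lastError = err;
      breaker?.recordFailure();
    }
  }
  throw lastError;
}
```

In the real implementation these policies would come from Cockatiel rather than being hand-rolled, and would be configurable per host and per cluster.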
Regarding errors: there are some error subclasses in here, and GRPCGenericError is just the fallback type. I don't want to obscure the gRPC errors, since as a developer, if your DNS is down, you want to know that rather than just getting some generic "could not connect to server" error. However, it would be a good idea to type the categories of errors better, such as connection errors versus 4xx-style request errors. This is also necessary for reliable retries.
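For illustration, such a split could look something like this (class names are hypothetical, not etcd3's actual exports): a transport-level failure is safe to retry against another host, while a request the cluster actively rejected is not.

```typescript
class EtcdError extends Error {}

// The host could not be reached at all: DNS failure, connection refused,
// UNAVAILABLE, etc. Another endpoint may well succeed.
class ConnectionFailedError extends EtcdError {
  readonly retriable = true;
}

// The cluster answered but rejected the request (e.g. a bad lease or a
// compacted revision). Retrying the same request elsewhere won't help.
class RequestRejectedError extends EtcdError {
  readonly retriable = false;
}

// The retry machinery only needs this one question answered; unknown
// error types conservatively default to non-retriable.
function isRetriable(err: unknown): boolean {
  return err instanceof ConnectionFailedError;
}
```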
Ultimately there will always be some possibility of getting generic errors, simply because the grpc module is upstream and this library only maintains a manual map of error types.