Retry logic doesn't seem to retry enough nor handle timeouts properly #594
I am fairly new to Hector's codebase, but following some surprising behaviour observed in production and some subsequent reading of Hector's code, I thought I would raise this ticket to confirm or invalidate my suspicions:
Could it be possible to:
Thank you very much in advance!
I totally agree with you. SocketTimeoutException should be handled like HTimedOutException. We are facing the same problem as you. After some research, I found that the code was actually there before issue 434, and I vote to revert that change. I also found a related issue and am preparing a patch for it. I am testing with the issue 434 change reverted and have not been able to reproduce issue 434.
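To make the suggestion concrete, here is a minimal sketch of what "handled like HTimedOutException" could look like: a helper that walks the cause chain and reports a timeout whether it surfaces as a raw `java.net.SocketTimeoutException` or as a client-level timeout. This is not Hector's code. `TimeoutClassifier`, `ClientTimedOutException` (a stand-in for HTimedOutException) and `isTimeout` are hypothetical names used only for illustration.

```java
import java.net.SocketTimeoutException;

/** Hypothetical sketch; none of these names are part of Hector's API. */
class TimeoutClassifier {

    /** Stand-in for a client-level timeout such as HTimedOutException (assumption). */
    static class ClientTimedOutException extends RuntimeException {
        ClientTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Upper bound on how far the cause chain is followed, to guard against cycles. */
    private static final int MAX_CAUSE_DEPTH = 32;

    /**
     * Returns true if the throwable, or any of its causes, is a socket-level
     * or client-level timeout, so that both can share the same retry path.
     */
    static boolean isTimeout(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof SocketTimeoutException
                    || current instanceof ClientTimedOutException) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```

With a check like this, the failover code would need only one branch for timeouts, whichever layer reported them.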
@weizhu-us, glad to hear I am not the only one facing this issue.
About my 1st point, unless I am mistaken, I still believe it will retry an incorrect number of times, given the code I have shown and the fact that:
Hector will still rethrow the exception after, say in my case, 5 retries instead of 50:
More concretely, I guess this means Hector should be changed in the following way:
...the idea being:
I think there is a valid reason to retry up to the number of nodes: the idea might be that it is time to give up once all the nodes have been tried. But that is not what happens, since a host that throws HTimedOutException is not added to the excludeHosts list. We had a case where, with the RR LB policy, some requests happened to hit the same node on every retry after a timeout. We then switched to dynamic LB to lower the chance of that happening. But it would be nice to add the node to the excludedHost after an HTimedOutException, unless there is a good reason not to do so.
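As a rough illustration of that last suggestion, the sketch below shows a failover loop that excludes a host once it has timed out, so that a round-robin policy cannot send a retry back to the same node. It is a simplified model under stated assumptions, not Hector's implementation. `FailoverSketch`, `TimedOutException`, `pickHost` and `execute` are hypothetical names, and hosts are plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of failover with host exclusion; not Hector's API. */
class FailoverSketch {

    /** Stand-in for a timeout reported by a node (assumption). */
    static class TimedOutException extends RuntimeException {
        TimedOutException(String host) {
            super("timed out on " + host);
        }
    }

    private final List<String> hosts;
    private int next = 0; // round-robin cursor

    FailoverSketch(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
    }

    /** Round-robin pick that skips excluded hosts; returns null when none remain. */
    private String pickHost(Set<String> excluded) {
        for (int i = 0; i < hosts.size(); i++) {
            String candidate = hosts.get(next);
            next = (next + 1) % hosts.size();
            if (!excluded.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /**
     * Runs the operation against successive hosts, making at most maxRetries
     * attempts. A host that times out is excluded for the rest of this call.
     */
    <T> T execute(Function<String, T> operation, int maxRetries) {
        Set<String> excluded = new HashSet<>();
        TimedOutException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            String host = pickHost(excluded);
            if (host == null) {
                break; // every host has already timed out
            }
            try {
                return operation.apply(host);
            } catch (TimedOutException e) {
                excluded.add(host); // never retry on the node that just timed out
                last = e;
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalStateException("no attempt was made");
    }
}
```

In this model the loop ends when either the retry budget or the set of untried hosts runs out, so each retry is guaranteed to reach a different node. Whether excluded hosts should be retried again once all nodes have timed out is a separate design choice that the sketch does not address.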