Retry logic doesn't seem to retry enough nor handle timeouts properly #594

Open
marccarre opened this Issue Mar 6, 2013 · 3 comments

@marccarre

Hi,

I am fairly new to Hector's codebase, but following some surprising behaviour observed in production and some subsequent reading of Hector's code, I thought I would raise this ticket to confirm or invalidate my suspicions:

  1. In HConnectionManager, it seems that regardless of the number of retries configured in FailoverPolicy (I use 50 retries with 2 seconds between retries, on a 5-node cluster), the maximum number of times it will actually retry is the number of nodes in the cluster, because:

    • the minimum of these two values is taken:

      int retries = Math.min(op.failoverPolicy.numRetries, hostPools.size());
      
    • and, regardless of the error, a "retry token" is consumed every time it fails:

      } finally {
          --retries;
          [...]
      }
      
  2. Based on the comments in the code, it seems that when a timeout occurs, the number of retries should NOT be decremented, but the code does NOT increment retries to compensate for the decrement in the finally block:

        } else if (he instanceof HTimedOutException ) {
            // DO NOT drecrement retries, we will be keep retrying on timeouts until it comes back
            [...]
        }
    [...]
    } finally {
        --retries;
        [...]
    }
    
  3. When Cassandra is so busy that it drops requests, or when requests time out, a SocketTimeoutException gets thrown, which is wrapped in a TTransportException, which is itself wrapped in a HectorTransportException.
    However, it seems this should be handled like HTimedOutException.
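Points 1 and 2 above can be condensed into a runnable sketch (a simplification, not Hector's actual code; `attemptsMade` and the hard-coded failure are mine for illustration): because `retries` is capped by the pool size and decremented in the `finally` block regardless of the error type, the loop gives up after roughly `hostPools.size()` attempts even when `numRetries` is 50.

```java
// Minimal sketch (NOT Hector's real HConnectionManager) showing why the
// effective retry count is capped by the pool size.
public class RetryCapSketch {
    // Returns how many attempts are actually made when every attempt fails.
    static int attemptsMade(int numRetries, int hostPoolCount) {
        int retries = Math.min(numRetries, hostPoolCount); // the capping line
        int attempts = 0;
        while (true) {
            try {
                attempts++;
                throw new RuntimeException("simulated failure"); // every attempt fails
            } catch (RuntimeException he) {
                boolean retryable = true;
                if (retries <= 0 || !retryable) {
                    return attempts; // real code rethrows the exception here
                }
            } finally {
                --retries; // a "retry token" is consumed regardless of the error
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(attemptsMade(50, 5)); // prints 6: one more than the pool size, far fewer than 50
    }
}
```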

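To illustrate point 3, here is a minimal sketch (my own helper, not an existing Hector method) of how the wrapped timeout could be recognised: walk the cause chain of the rethrown exception, and treat anything rooted in a SocketTimeoutException the same way HTimedOutException is treated.

```java
import java.net.SocketTimeoutException;

// Sketch: decide whether a wrapped exception is really a socket timeout,
// so it can be handled like HTimedOutException rather than as a generic
// transport error.
public class TimeoutDetectionSketch {
    static boolean isSocketTimeout(Throwable t) {
        // Walk the cause chain: SocketTimeoutException may be buried under
        // several layers of wrapping.
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate the SocketTimeoutException -> TTransportException ->
        // HectorTransportException wrapping with plain RuntimeExceptions.
        Throwable wrapped = new RuntimeException("HectorTransportException",
                new RuntimeException("TTransportException",
                        new SocketTimeoutException("read timed out")));
        System.out.println(isSocketTimeout(wrapped)); // prints "true"
    }
}
```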

Could it be possible to:

  1. if I am mistaken, shed some light on the above logic?
  2. if these are indeed bugs, fix them in the next release or give some instructions on how I could:
  • write tests for these;
  • fix the code so that the tests pass.

Thank you very much in advance!

M.


weizhu-us Mar 7, 2013

I completely agree with you: SocketTimeoutException should be handled like HTimedOutException, and we are facing the same problem. After some research, it turns out that handling was there before issue #434 and was removed by that change, so I vote to revert it. I have found some related issues and am preparing a patch. I have been testing with the change from issue #434 reverted and cannot reproduce issue #434.
Regarding your first two points, the code comment about the retry count is not accurate: the retry count does decrease on timeouts. Nate has already confirmed that.



marccarre Mar 7, 2013

@weizhu-us, glad to hear I am not the only one facing this issue.
Also, thanks for confirming what I observed and my 2nd and 3rd points.

About my 1st point, unless I am mistaken, I still believe it will retry an incorrect number of times, given the code I have shown and the fact that:

  • regardless of the type of error and the associated error handling, and
  • regardless of whether it is a retryable error or not,

Hector will still rethrow the exception after, say in my case, 5 retries instead of 50:

if ( retries <= 0 || retryable == false)
      throw he;

More concretely, I guess this means Hector should be changed in the following way:

  1. retries should just be equal to op.failoverPolicy.numRetries,
  2. detection of unresponsive hosts (i.e. marking them as down) and the management of hosts/pools should be decoupled from the retry logic,

...the idea being:

  • if all your hosts are down, you still continue to retry, but
  • to avoid spamming them at the worst possible time, you have sleepBetweenHostsMilli in FailoverPolicy, or Hector could even avoid sending the query to the hosts if none has been detected as up again (this could be a choice in the configuration), and
  • if one host comes back up after 49 retries, the query would then be sent to it; provided it succeeds this time (e.g. assuming CL.ONE), the query "just" succeeded from the client's perspective (i.e. the retry logic is transparent), and you do not have to handle a "retry burden" earlier than necessary (i.e. after only 5 retries).
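The proposed behaviour could be sketched roughly as below (all names and the structure here are hypothetical, not Hector's implementation): use the full numRetries budget, sleep sleepBetweenHostsMilli between failed attempts, and leave host health management to a separate component.

```java
import java.util.function.IntPredicate;

// Sketch of the PROPOSED retry behaviour (hypothetical, not Hector's code):
// the retry budget comes solely from the failover policy, not the pool size.
public class ProposedRetrySketch {
    // attempt.test(i) returns true when attempt i succeeds (it stands in for
    // sending the query to whichever host the pool manager picks).
    // Returns the 1-based attempt number that succeeded, or -1 if the
    // budget is exhausted.
    static int executeWithFailover(int numRetries, long sleepBetweenHostsMilli,
                                   IntPredicate attempt) throws InterruptedException {
        for (int i = 1; i <= numRetries + 1; i++) { // initial try + numRetries retries
            if (attempt.test(i)) {
                return i; // success: the retries are transparent to the caller
            }
            if (i <= numRetries && sleepBetweenHostsMilli > 0) {
                Thread.sleep(sleepBetweenHostsMilli); // avoid spamming busy hosts
            }
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        // A host "comes back" on the 50th attempt: with numRetries = 50 the
        // query still succeeds, instead of failing after 5 attempts.
        System.out.println(executeWithFailover(50, 0, i -> i == 50)); // prints 50
    }
}
```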


weizhu-us Mar 7, 2013

I think there is a valid reason to retry up to the number of nodes: the idea might be that it is time to give up once all the nodes have been tried. But that is not what actually happens, since a host that throws HTimedOutException is not added to the excludeHosts list. We had a case where some requests happened to hit the same node on every retry after a timeout while using the round-robin LB policy. We then switched to the dynamic LB to lower the chance of that happening. But it would be nice to add the node to excludeHosts after an HTimedOutException, unless there is a good reason not to do so.
We set the socket timeout when creating the keyspace. For our use case, it is fine to give up after that many retries: with a 100 ms timeout and 5 nodes, if our client has not received a result within 500 ms, it probably does not need it any more.
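The excludeHosts suggestion could look something like this (assumed names and structure, not Hector's actual host pool management): record hosts that timed out in an excluded set, so a round-robin pick cannot keep sending retries to the same unresponsive node.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch with assumed names (NOT Hector's API): hosts that time out are
// excluded, so a round-robin pick skips them on subsequent retries.
public class ExcludeHostSketch {
    final Set<String> excludedHosts = new HashSet<>();

    void onTimeout(String host) {
        // Proposed change: exclude on HTimedOutException as well.
        excludedHosts.add(host);
    }

    // Round-robin over the candidates, skipping excluded hosts; returns
    // null when every host has been excluded.
    String nextHost(List<String> hosts, int attempt) {
        for (int i = 0; i < hosts.size(); i++) {
            String candidate = hosts.get((attempt + i) % hosts.size());
            if (!excludedHosts.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ExcludeHostSketch lb = new ExcludeHostSketch();
        lb.onTimeout("node1");
        // Attempt 0 would normally pick node1; the exclusion skips to node2.
        System.out.println(lb.nextHost(List.of("node1", "node2", "node3"), 0)); // prints node2
    }
}
```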


weizhu-us pushed a commit to weizhu-us/hector that referenced this issue Mar 27, 2013

zznate added a commit that referenced this issue Mar 28, 2013

Merge pull request #608 from weizhu-us/master
 issue #594, handle SocketTimeoutException differently