
Enhance Transient Error handling in Async.start() method #1

Merged (6 commits), Jan 22, 2015

Conversation

@markshaule-wf (Owner)

@beaulyddon-wf @robertkluin-wf @johnlockwood-wf @tannermiller-wf @rosshendrickson-wf @jasonaguilon-wf @tylertreat-wf

FYI: @erikpetersen-wf @macleodbroad-wf

Async Changes:
Async.start() now sleeps before attempting to re-add the task on a TransientError.
Added an option, retry_transient_errors, to override the retry behaviour in Async.start(). Pass False to simply re-raise the TransientError and not attempt a retry.

Context Changes:
Context _insert_tasks now re-raises TransientError if the retry option has been set to False.
Renamed the parameter 'retry_errors' to 'retry_transient_errors' in the _insert_tasks function.
_insert_tasks now passes the retry_transient_errors parameter on to recursive calls correctly.

Notes on compatibility:
Since I've renamed the parameter for _insert_tasks, this could potentially break someone's custom implementation of that function.
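
For reviewers skimming, here's a rough usage sketch of the behaviour described above. The retry_transient_errors name comes from this PR; the task target and the exact way the option is passed to start() are my assumptions:

```python
from furious.async import Async


def my_task():
    pass  # hypothetical task target


# Default: on TransientError, start() sleeps and then re-adds the task once.
Async(target=my_task).start()

# Opt out: re-raise the TransientError immediately, no retry attempt.
Async(target=my_task).start(retry_transient_errors=False)
```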

…o retry_transient_errors False.

Renamed parameter in _insert_tasks 'retry_errors' to 'retry_transient_errors' for clarity.
Pass retry_transient_errors parameter onto recursive calls in _insert_tasks when we are splitting and retrying errors.
… to re-add a task in Async.start() method.

Add option to override retry of TransientErrors in the Async.start() method
@beaulyddon-wf

+1

if not retry_transient:
    raise

import time


Personally I hate inline imports

@markshaule-wf (Owner, Author)


Yes - me too, but that is the pattern in this file unfortunately.


I'm not sure why with a stdlib one. I would have no issue with this moving to the top of the file. And I agree in general with inline imports; I understand in libs like this, where often things in the same file will not execute in the same context. But meh.

@markshaule-wf (Owner, Author)


OK - I'll move the inline time imports in this file up top. We've still got some other inline stdlib imports in this file, but I'll restrict my changes to the imports I've added.

@macleodbroad-wf

+1

@tannermiller-wf

+1

@tylertreat-wf

+1

@johnlockwood-wf

Adding a 4-second sleep to something called Async seems wrong.

@johnlockwood-wf

What would happen with the existing method when a TransientError popped?

@markshaule-wf (Owner, Author)

@johnlockwood-wf - With the existing method, Async.start() would immediately attempt a retry on a TransientError. (Which, from my understanding, is not what we want; in the past it was recommended to me to have a delay before the retry.)

For the context _insert_tasks() method raising a TransientError, I suspect one would be able to detect the failure of the insert using the insert_failed @property. I also suspect that most users don't actually check that, and the insert would just silently fail.
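
For illustration, a sketch of what checking that property could look like. context.new() and ctx.add() follow furious's Context API; the assumption here is that insert_failed is truthy when any insert failed:

```python
import logging

from furious import context


def my_task(n):
    pass  # hypothetical task target


with context.new() as ctx:
    for i in range(10):
        ctx.add(target=my_task, args=[i])

# Inserts happen when the context exits; without this check, a
# TransientError during insert would go unnoticed.
if ctx.insert_failed:
    logging.warning("Some task inserts failed for this context.")
```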

@markshaule-wf (Owner, Author)

Since we've had a few suggestions from @macleodbroad-wf, @tylertreat-wf and @johnlockwood-wf about the delay, let me add a new option, retry_delay, that will allow the caller to override the default 4-second delay on the retry.
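
Continuing the sketch from the description above, the caller-facing knobs would then look something like this; the retry_delay keyword and the call shape are assumptions based on this thread:

```python
from furious.async import Async


def my_task():
    pass  # hypothetical task target


# Default: sleep ~4 seconds before the single retry on a TransientError.
Async(target=my_task).start()

# Caller override: a shorter 1-second delay before the retry.
Async(target=my_task).start(retry_delay=1)
```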

@tylertreat-wf

+1

@johnlockwood-wf

@markshaule-wf do you have any empirical evidence that 4 seconds is a good magical number?

@johnlockwood-wf

I see recommendations to use a backoff, where it retries after a very short time, then if that fails, retry after a little longer, etc.
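
For reference, the backoff pattern described here might look roughly like the sketch below. This is not what the PR implements (the PR does a single delayed retry); insert_with_backoff is a hypothetical helper around App Engine's taskqueue.TransientError:

```python
import time

from google.appengine.api.taskqueue import TransientError


def insert_with_backoff(insert, attempts=4, initial_delay=0.5, factor=2):
    """Call insert(); on TransientError wait 0.5s, 1s, 2s between tries."""
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return insert()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(delay)
            delay *= factor
```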

@markshaule-wf (Owner, Author)

No empirical evidence that 4 seconds is a good magical number. At the time of our discussion, 'around 5 seconds' was recommended by @beaulyddon-wf, which was the Google recommendation. We've got the ability to override the value now, so callers who want a different value can use one.

We only retry once; if a TransientError (or other error) occurs on the retry, the exception is raised.

@beaulyddon-wf

From Google: "Ideally you would not immediately retry, as it can cause more issues" is the general statement. However, that's mostly to keep people from killing them. They recognize that's not ideal for critical-path operations, so for those they say a single immediate retry is OK. After that, however, they recommend a backoff of at least 5 seconds, as these types of failures/errors/contention tend to clean themselves up after a couple of seconds.

@beaulyddon-wf

+1

@tylertreat-wf

+1

@tannermiller-wf

+1

@macleodbroad-wf

+1

@johnlockwood-wf

+1

…ansactional=True has been specified in context, or Async.start()
@markshaule-wf (Owner, Author)

@beaulyddon-wf @tannermiller-wf @macleodbroad-wf @tylertreat-wf - The latest commit will now always re-raise a TransientError when transactional=True, regardless of the new option.
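
As I read that change, the guard behaves something like this sketch; the insert_task helper below is hypothetical and only mirrors the described behaviour, not the PR's actual _insert_tasks code:

```python
import time

from google.appengine.api import taskqueue


def insert_task(task, queue_name, transactional=False,
                retry_transient_errors=True, retry_delay=4):
    queue = taskqueue.Queue(queue_name)
    try:
        queue.add(task, transactional=transactional)
    except taskqueue.TransientError:
        # Transactional inserts always re-raise: retrying outside the
        # original transaction could violate its guarantees.
        if transactional or not retry_transient_errors:
            raise
        time.sleep(retry_delay)
        queue.add(task, transactional=transactional)
```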

@beaulyddon-wf

+1

@tannermiller-wf

+1

@tylertreat-wf

+1

@macleodbroad-wf

+1

@rosshendrickson-wf

+1
