
Improve communication error tolerance for long running queries #6784

Merged · 3 commits merged into prestodb:master on Jan 9, 2017

Conversation

cberner (Contributor) commented Dec 6, 2016

This dynamically adjusts the tolerance for communication errors based on how long the query has run without hitting an error.

Increase error tolerance for communication between coordinator and
worker up to the length of time the query ran for before it hit failures
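As a rough illustration of the idea in this PR (hypothetical names, not Presto's actual API), the error tolerance can be modeled as the query's clean runtime, clamped between a minimum and a maximum:

```java
// Sketch only: the longer a query has run without errors, the longer we
// tolerate communication failures before declaring the task dead.
// All names and the clamping rule are illustrative assumptions.
public class FailureTolerance
{
    private final long minToleranceNanos;
    private final long maxToleranceNanos;

    public FailureTolerance(long minToleranceNanos, long maxToleranceNanos)
    {
        this.minToleranceNanos = minToleranceNanos;
        this.maxToleranceNanos = maxToleranceNanos;
    }

    // Tolerance grows with the time the query ran cleanly, clamped to [min, max]
    public long toleranceNanos(long cleanRuntimeNanos)
    {
        return Math.max(minToleranceNanos, Math.min(maxToleranceNanos, cleanRuntimeNanos));
    }

    // True once the current failure streak has outlasted the earned tolerance
    public boolean failureExpired(long failingForNanos, long cleanRuntimeNanos)
    {
        return failingForNanos > toleranceNanos(cleanRuntimeNanos);
    }
}
```

A short-lived query keeps the minimum tolerance, while a query that has run cleanly for hours earns up to the maximum before a failure streak kills it.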
private synchronized Duration elapsedErrorDuration()
{
if (errorStopwatch.isRunning()) {
errorStopwatch.stop();
cberner (Contributor, Author) commented:

It looks like this only counts time while the request is outstanding. My PR changes that to count total wall time since it started failing, which makes more sense to me. However, I'm not sure whether this was intentionally written to count only the time while the request was running.
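The two measurements being compared can be sketched like this (plain nanosecond timestamps instead of Guava's Stopwatch; names are illustrative): the "outstanding-only" clock accumulates time only while a request is in flight, while the "wall time" clock simply measures from the first failure.

```java
// Hypothetical contrast of the two ways to answer "how long have we been
// failing?" discussed in this thread. Timestamps are passed in explicitly
// so the behavior is easy to test.
public class ErrorClocks
{
    // Outstanding-only: accumulate only while a request is in flight
    private long accumulatedNanos;
    private long requestStartNanos = -1;

    public void requestStarted(long nowNanos)
    {
        requestStartNanos = nowNanos;
    }

    public void requestFailed(long nowNanos)
    {
        accumulatedNanos += nowNanos - requestStartNanos;
        requestStartNanos = -1;
        noteFailure(nowNanos);
    }

    public long outstandingOnlyNanos()
    {
        return accumulatedNanos;
    }

    // Wall time: remember when the failure streak began
    private long firstFailureNanos = -1;

    public void noteFailure(long nowNanos)
    {
        if (firstFailureNanos < 0) {
            firstFailureNanos = nowNanos;
        }
    }

    public long wallTimeNanos(long nowNanos)
    {
        return firstFailureNanos < 0 ? 0 : nowNanos - firstFailureNanos;
    }
}
```

The difference shows up when there are gaps between retries: the outstanding-only clock stands still during the gaps, so it reaches the tolerance threshold much later than the wall clock does.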

A contributor replied:

This code is ancient, and the new Backoff class is much better, so let's just use that

cberner (Contributor, Author) commented Dec 6, 2016

@dain, there's one other behavior change due to a refactoring I made, and the new behavior seems more appropriate to me. I noted it with a comment, so please let me know if that's the right thing to do.

@@ -64,6 +66,7 @@ public Backoff(Duration maxFailureInterval, Ticker ticker, Duration... backoffDe
}

this.lastSuccessTime = this.ticker.read();
this.createTime = this.ticker.read();
A contributor commented:

I'd pass this into the constructor so this can be the start time of the query instead of the start time of this task. This will help deal with new tasks added late in the query due to the phased scheduler.
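The suggestion can be sketched as follows (hypothetical names; Presto's real Backoff takes a Ticker rather than raw timestamps): the caller supplies the query's start instant, so a task scheduled late still earns tolerance based on how long the whole query has run cleanly.

```java
// Sketch of passing the query start time into the constructor, so a task
// created late by the phased scheduler doesn't start its tolerance from
// zero. Names and the clamping rule are illustrative assumptions.
public class BackoffSketch
{
    private final long queryStartNanos;
    private final long maxFailureIntervalNanos;
    private long lastSuccessNanos;

    public BackoffSketch(long queryStartNanos, long nowNanos, long maxFailureIntervalNanos)
    {
        this.queryStartNanos = queryStartNanos;  // query start, not task creation
        this.lastSuccessNanos = nowNanos;
        this.maxFailureIntervalNanos = maxFailureIntervalNanos;
    }

    public void success(long nowNanos)
    {
        lastSuccessNanos = nowNanos;
    }

    // Tolerance grows with clean runtime measured from the *query* start
    public boolean failureTimeExceeded(long nowNanos)
    {
        long cleanRuntime = Math.max(lastSuccessNanos - queryStartNanos, 0);
        long tolerance = Math.min(cleanRuntime, maxFailureIntervalNanos);
        return nowNanos - lastSuccessNanos > tolerance;
    }
}
```

With the task's own creation time as the baseline, a task added an hour into the query would tolerate almost no failures; anchoring to the query start avoids that.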

minErrorDuration,
maxErrorDuration,
ticker,
new Duration(0, MILLISECONDS),
A contributor commented:

I'd use the default backoff schedule. This one seems way too aggressive.

cberner (Contributor, Author) replied:

This is how the code works today, which is why I used it. I agree it seems aggressive, though; I'll change it to the default one.
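The schedules being compared can be sketched like this (assuming a Backoff-style list of per-failure delays where the last entry repeats; the specific values are illustrative, not Presto's actual defaults):

```java
// Sketch of a backoff schedule: one delay per consecutive failure,
// with the last entry repeated once the list is exhausted.
public class BackoffSchedule
{
    private final long[] delaysNanos;

    public BackoffSchedule(long... delaysNanos)
    {
        this.delaysNanos = delaysNanos;
    }

    // Delay before retrying after the n-th consecutive failure (1-based)
    public long delayNanos(int failureCount)
    {
        int index = Math.min(failureCount, delaysNanos.length) - 1;
        return delaysNanos[Math.max(index, 0)];
    }
}
```

A schedule of just `new Duration(0, MILLISECONDS)` retries immediately on every failure, hammering an already-struggling worker; a graduated schedule such as `0, 50, 100, 200` quickly levels off at a gentler retry rate.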


Also refactor HttpPageBufferClient to use Backoff
@cberner cberner assigned cberner and unassigned dain Dec 7, 2016
@cberner cberner merged commit 49257f2 into prestodb:master Jan 9, 2017
@cberner cberner deleted the backoff2 branch February 14, 2017 23:47