
Improve communication error tolerance for long running queries #6784

Merged · 3 commits merged into prestodb:master on Jan 9, 2017

Conversation

cberner (Contributor) commented Dec 6, 2016

This dynamically adjusts the tolerance for communication errors based on how long the query has run without hitting an error.

Increase error tolerance for communication between coordinator and
worker up to the length of time the query ran for before it hit failures
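As a rough illustration of the idea in this PR (hypothetical names, not Presto's actual API), the error tolerance can be modeled as the query's clean runtime, clamped between a minimum and a maximum:

```java
// Sketch only: the longer a query has run without errors, the longer we
// tolerate communication failures before declaring the task dead.
// All names and the clamping rule are illustrative assumptions.
public class FailureTolerance
{
    private final long minToleranceNanos;
    private final long maxToleranceNanos;

    public FailureTolerance(long minToleranceNanos, long maxToleranceNanos)
    {
        this.minToleranceNanos = minToleranceNanos;
        this.maxToleranceNanos = maxToleranceNanos;
    }

    // Tolerance grows with the time the query ran cleanly, clamped to [min, max]
    public long toleranceNanos(long cleanRuntimeNanos)
    {
        return Math.max(minToleranceNanos, Math.min(maxToleranceNanos, cleanRuntimeNanos));
    }

    // True once the current failure streak has outlasted the earned tolerance
    public boolean failureExpired(long failingForNanos, long cleanRuntimeNanos)
    {
        return failingForNanos > toleranceNanos(cleanRuntimeNanos);
    }
}
```

A short-lived query keeps the minimum tolerance, while a query that has run cleanly for hours earns up to the maximum before a failure streak kills it.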
private synchronized Duration elapsedErrorDuration()
{
if (errorStopwatch.isRunning()) {
errorStopwatch.stop();
cberner (Contributor, Author) commented:

It looks like this only counts time while the request is outstanding. My PR changes that to count total wall time since it started failing, which makes more sense to me. However, I'm not sure whether this was intentionally written to count only the time while the request was running.
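The two measurements being compared can be sketched like this (plain nanosecond timestamps instead of Guava's Stopwatch; names are illustrative): the "outstanding-only" clock accumulates time only while a request is in flight, while the "wall time" clock simply measures from the first failure.

```java
// Hypothetical contrast of the two ways to answer "how long have we been
// failing?" discussed in this thread. Timestamps are passed in explicitly
// so the behavior is easy to test.
public class ErrorClocks
{
    // Outstanding-only: accumulate only while a request is in flight
    private long accumulatedNanos;
    private long requestStartNanos = -1;

    public void requestStarted(long nowNanos)
    {
        requestStartNanos = nowNanos;
    }

    public void requestFailed(long nowNanos)
    {
        accumulatedNanos += nowNanos - requestStartNanos;
        requestStartNanos = -1;
        noteFailure(nowNanos);
    }

    public long outstandingOnlyNanos()
    {
        return accumulatedNanos;
    }

    // Wall time: remember when the failure streak began
    private long firstFailureNanos = -1;

    public void noteFailure(long nowNanos)
    {
        if (firstFailureNanos < 0) {
            firstFailureNanos = nowNanos;
        }
    }

    public long wallTimeNanos(long nowNanos)
    {
        return firstFailureNanos < 0 ? 0 : nowNanos - firstFailureNanos;
    }
}
```

The difference shows up when there are gaps between retries: the outstanding-only clock stands still during the gaps, so it reaches the tolerance threshold much later than the wall clock does.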

A contributor replied:

This code is ancient, and the new Backoff class is much better, so let's just use that

cberner (Contributor, Author) commented Dec 6, 2016

@dain, there's one other behavior change due to a refactoring I made, and the new behavior seems more appropriate to me. I noted it with a comment, so please let me know if that's the right thing to do.

@@ -64,6 +66,7 @@ public Backoff(Duration maxFailureInterval, Ticker ticker, Duration... backoffDe
}

this.lastSuccessTime = this.ticker.read();
this.createTime = this.ticker.read();
A contributor commented:

I'd pass this into the constructor so this can be the start time of the query instead of the start time of this task. This will help deal with new tasks added late in the query due to the phased scheduler.
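The suggestion can be sketched as follows (hypothetical names; Presto's real Backoff takes a Ticker rather than raw timestamps): the caller supplies the query's start instant, so a task scheduled late still earns tolerance based on how long the whole query has run cleanly.

```java
// Sketch of passing the query start time into the constructor, so a task
// created late by the phased scheduler doesn't start its tolerance from
// zero. Names and the clamping rule are illustrative assumptions.
public class BackoffSketch
{
    private final long queryStartNanos;
    private final long maxFailureIntervalNanos;
    private long lastSuccessNanos;

    public BackoffSketch(long queryStartNanos, long nowNanos, long maxFailureIntervalNanos)
    {
        this.queryStartNanos = queryStartNanos;  // query start, not task creation
        this.lastSuccessNanos = nowNanos;
        this.maxFailureIntervalNanos = maxFailureIntervalNanos;
    }

    public void success(long nowNanos)
    {
        lastSuccessNanos = nowNanos;
    }

    // Tolerance grows with clean runtime measured from the *query* start
    public boolean failureTimeExceeded(long nowNanos)
    {
        long cleanRuntime = Math.max(lastSuccessNanos - queryStartNanos, 0);
        long tolerance = Math.min(cleanRuntime, maxFailureIntervalNanos);
        return nowNanos - lastSuccessNanos > tolerance;
    }
}
```

With the task's own creation time as the baseline, a task added an hour into the query would tolerate almost no failures; anchoring to the query start avoids that.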

minErrorDuration,
maxErrorDuration,
ticker,
new Duration(0, MILLISECONDS),
A contributor commented:

I'd use the default backoff schedule. This one seems way too aggressive.

cberner (Contributor, Author) replied:

This is how the code works today, which is why I used it. I agree it seems aggressive, though; I'll change it to the default one.
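The schedules being compared can be sketched like this (assuming a Backoff-style list of per-failure delays where the last entry repeats; the specific values are illustrative, not Presto's actual defaults):

```java
// Sketch of a backoff schedule: one delay per consecutive failure,
// with the last entry repeated once the list is exhausted.
public class BackoffSchedule
{
    private final long[] delaysNanos;

    public BackoffSchedule(long... delaysNanos)
    {
        this.delaysNanos = delaysNanos;
    }

    // Delay before retrying after the n-th consecutive failure (1-based)
    public long delayNanos(int failureCount)
    {
        int index = Math.min(failureCount, delaysNanos.length) - 1;
        return delaysNanos[Math.max(index, 0)];
    }
}
```

A schedule of just `new Duration(0, MILLISECONDS)` retries immediately on every failure, hammering an already-struggling worker; a graduated schedule such as `0, 50, 100, 200` quickly levels off at a gentler retry rate.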


Also refactor HttpPageBufferClient to use Backoff
@cberner cberner assigned cberner and unassigned dain Dec 7, 2016
@cberner cberner merged commit 49257f2 into prestodb:master Jan 9, 2017
@cberner cberner deleted the backoff2 branch February 14, 2017 23:47