
Initial Support of Adaptive Optimization with Presto Unlimited #14675

Merged: 2 commits, merged into prestodb:master on Jul 15, 2020

Conversation

@pguofb (Contributor) commented Jun 18, 2020

Background: Presto Unlimited materializes exchange outputs into temporary tables (see #12387), which creates an opportunity to invoke the CBO at runtime on later stages, based on the temporary table statistics generated by earlier stages. This yields more reliable optimizations for complex queries, whose later stages often have less accurate estimated statistics.

Specifically, this PR achieves the following goals.

  1. Enable table and column statistics to be collected for temporary tables, and enable the statistics to be correctly fetched and processed by the stats calculators.
  2. Create an iterative optimization rule that leverages temporary table statistics to compute statistics for the probe and build sides of a Join node, and swaps them when the build side is larger (a minimal sketch follows this list).
  3. Make the CBO invokable at LegacySqlQueryScheduler, optimizing the plan right before scheduling and actual execution.
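
To make goal 2 concrete, here is a minimal sketch of a statistics-driven join-swapping rule. This is not the PR's code: flipJoinSides is a hypothetical helper, and the exact package locations and StatsProvider signatures may differ slightly from the codebase at the time.

import com.facebook.presto.cost.StatsProvider;
import com.facebook.presto.matching.Captures;
import com.facebook.presto.matching.Pattern;
import com.facebook.presto.sql.planner.iterative.Rule;
import com.facebook.presto.sql.planner.plan.JoinNode;

import static com.facebook.presto.sql.planner.plan.Patterns.join;

public class JoinSwapSketch
        implements Rule<JoinNode>
{
    @Override
    public Pattern<JoinNode> getPattern()
    {
        return join();
    }

    @Override
    public Result apply(JoinNode node, Captures captures, Context context)
    {
        StatsProvider stats = context.getStatsProvider();
        // With materialized exchanges, the join inputs are temporary-table scans, so these
        // estimates come from statistics collected at runtime rather than planning-time guesses.
        double probeSizeInBytes = stats.getStats(node.getLeft())
                .getOutputSizeInBytes(node.getLeft().getOutputVariables());
        double buildSizeInBytes = stats.getStats(node.getRight())
                .getOutputSizeInBytes(node.getRight().getOutputVariables());

        // Fire only when both sides have usable statistics and the build side is larger
        if (Double.isNaN(probeSizeInBytes) || Double.isNaN(buildSizeInBytes) || buildSizeInBytes <= probeSizeInBytes) {
            return Result.empty();
        }
        // flipJoinSides must mirror the join criteria, output order, and hash variables;
        // the PR's rule additionally adjusts local exchanges on both sides
        return Result.ofPlanNode(flipJoinSides(node));
    }

    private static JoinNode flipJoinSides(JoinNode node)
    {
        throw new UnsupportedOperationException("illustrative placeholder");
    }
}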

Basic testing:

  • In the TestHiveIntegrationSmokeTest::testMaterializedPartitioning test suite, the rule correctly captures two queries (1 and 2) whose build side is larger than the probe side, and invoking the CBO did not affect query result correctness.
  • More unit tests will be added.

== NO RELEASE NOTE ==

@rschlussel (Contributor) left a comment:

Still reviewing the last three commits, but if the first two commits are ready sooner, we can merge them first.


Map<PlanFragment, PlanFragment> oldToNewFragment = stream(forTree(StreamingSubPlan::getChildren).depthFirstPreOrder(section.getPlan()))
        // filter leaf stages
        .filter(plan -> plan.getChildren().isEmpty())
Contributor: Ideally, all this logic to filter relevant plans should go into the optimizer rule.

@pguofb (author): Yeah, I thought about moving all the filtering logic inside the rule, but my concern is that it does not seem possible to easily tell from the iterative optimizer's output whether the returned plan actually changed. If we cannot tell, then for each ready section we would waste time rebuilding the whole section, stage executions, schedulers, etc., even when the optimizer rule did not fire at all.
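
For illustration, here is the kind of check that concern implies, assuming (as one of the final commits describes) that IterativeOptimizer returns the very same plan object when no rule fires; the surrounding names (runtimeOptimizer, fragment, section) are hypothetical:

// Because the optimizer returns the original PlanNode instance when nothing changed,
// reference equality is enough to detect that no rule fired
PlanNode optimizedRoot = runtimeOptimizer.optimize(
        fragment.getRoot(), session, types, variableAllocator, idAllocator, warningCollector);
if (optimizedRoot == fragment.getRoot()) {
    // plan unchanged: keep the existing section, stage executions, and schedulers
    return section;
}
// plan changed: rebuild the fragment and the section's stage executions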

@@ -91,6 +91,12 @@ public static HiveBasicStatistics reduce(HiveBasicStatistics first, HiveBasicSta

public static Map<String, HiveColumnStatistics> merge(Map<String, HiveColumnStatistics> first, Map<String, HiveColumnStatistics> second)
{
    // Merge statistics correctly at temporary table finish insertion: when "first" has exactly the
    // same columns as "second" but all-empty statistics, "first" is the placeholder left at temporary
    // table creation, and it is safe to directly return "second".
    if (first.values().stream().allMatch(statistics -> statistics.equals(HiveColumnStatistics.empty()))
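
The condition above is truncated in the diff view; a plausible completion of the described check looks like the following (an assumed reconstruction, not the exact diff; mergeColumnStatistics is a hypothetical name for the pre-existing column-by-column merge):

public static Map<String, HiveColumnStatistics> merge(Map<String, HiveColumnStatistics> first, Map<String, HiveColumnStatistics> second)
{
    // "first" is the all-empty placeholder written at temporary table creation:
    // same column set as "second", every entry equal to HiveColumnStatistics.empty()
    if (first.keySet().equals(second.keySet())
            && first.values().stream().allMatch(statistics -> statistics.equals(HiveColumnStatistics.empty()))) {
        return second;
    }
    return mergeColumnStatistics(first, second); // hypothetical: the existing merge logic
}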
Contributor: @arhimondr can you check this logic?

Member: How about instead of adding this logic we set all statistics to 0 when creating a temporary table?

@arhimondr (Member) left a comment:

First two commits:

  Fix StatsUtil function ...
  Enable statistics aggregation in temp

Some comments.

Please make sure the commit messages comply with our commit message style guidelines:
https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#commit-formatting-and-pull-requests
https://chris.beams.io/posts/git-commit/

More specifically, the commit summary should be ~50 characters; you can put a more descriptive message in the body. For example:

  Enable statistics collection for temporary tables

  Enable column level statistics collection when writing intermediate data
  to temporary table for materialized exchanges. Collected statistics will
  be used to automatically change join order in runtime.

@arhimondr (Member), to @rschlussel and @pguofb:
At a high level, it feels like collecting the stats for each and every column is a little overkill. Given that the planner prunes all unnecessary columns, it is certain that all remaining columns will be used. Ideally, we should simply collect the overall size of all the inputs and use that as the input to the CBO algorithm. It will not give the best results when there are more nodes before the JoinNode, but historically we've seen that the estimates are not very reliable anyway. Thus I'm not confident it makes sense to pay the cost of collecting the detailed stats, which are much more expensive: usually there will be no additional nodes before the join, and if there are, we are not very confident in the estimates anyway.

Anyway, changing this goes beyond the scope of this project, but it feels like something we should consider doing in the future.

@pguofb pguofb force-pushed the milestone1 branch 4 times, most recently from 71d7387 to 369b6dd Compare June 25, 2020 19:45
@arhimondr (Member) left a comment:
Enable statistics aggregation for temporary table.

LGTM

@aweisberg (Contributor):
Would this benefit from integration tests for SqlQueryScheduler and LegacySqlQueryScheduler?

@pguofb (author) commented Jul 1, 2020:

> Would this benefit from integration tests for SqlQueryScheduler and LegacySqlQueryScheduler?

I suppose integration tests could further verify the correctness of invoking the runtime CBO. Ideally, we might also want to try some production queries that are currently ill-optimized to see whether the CBO can fix the join sides at runtime.

@arhimondr (Member) left a comment:

Some comments

@@ -591,7 +706,7 @@ public BasicStageExecutionStats getBasicStageStats()
 public StageInfo getStageInfo()
 {
     ListMultimap<StageId, SqlStageExecution> stageExecutions = getStageExecutions();
-    return buildStageInfo(plan, stageExecutions);
+    return buildStageInfo(plan.get(), stageExecutions);
@arhimondr (Member):
This is likely to break statistics

CC: @rschlussel

Contributor: @pguofb is looking at that next. Right now the info in the live plan will be incorrect, but the final plan in the QueryCompletedEvent and the stage info will be correct.
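
For context, the plan.get() in the diff, together with the compareAndSet visible later in the constructor, indicates the scheduler now holds the plan in an AtomicReference so the adapted plan can be published at runtime; a minimal sketch (updatePlan is the function mentioned later in this thread):

import java.util.concurrent.atomic.AtomicReference;

// the scheduler holds the current SubPlan in an AtomicReference so runtime optimization
// can atomically publish the adapted plan; readers such as getStageInfo() call plan.get()
private final AtomicReference<SubPlan> plan = new AtomicReference<>();

private void updatePlan(SubPlan adaptedPlan)
{
    plan.set(adaptedPlan);
}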

@pguofb pguofb force-pushed the milestone1 branch 2 times, most recently from e3b6cd8 to d03265a Compare July 6, 2020 15:07
@wenleix (Contributor) commented Jul 6, 2020:

Nit about commit header: No period at the end :). See the seven rules: https://chris.beams.io/posts/git-commit/#seven-rules 😃

@wenleix wenleix changed the title Invoke CBO in LegacySqlQueryScheduler based on temporary table statistics Initial Support of Adaptive Optimization with Presto Unlimited Jul 6, 2020
@wenleix (Contributor) left a comment:
"Add a session property for runtime optimizer". LGTM. Note "adaptive optimizer" is probably a more sexy name than "runtime optimizer". 😃

@wenleix (Contributor) left a comment:

"Invoke CBO at SqlQueryScheduler for Join Swapping". Generally looks good to me. I will take a separate look into RuntimeReorderJoinSides later. But don't be blocked on my pass if you think it's ready to merge.

Just to confirm my understanding: when adaptive optimization kicked in legacy scheduler, the logic of creating new sections actually leverage what implements in the new scheduler -- this makes sense since section execution abstraction is only introduced in new scheduler, and it will be much more difficult to do adaptive execution in legacy scheduler.

Now here is an interesting note, when legacy scheduler works together with adaptive execution, the behavior will be a mix of "legacy" and "new" scheduler right? i.e. the code path follows legacy scheduler when adaptive execution hasn't kicked in. But once adaptive execution kicked, the code logic is more like the section retry in the new scheduler. 😃

cc @rschlussel

    outputBuffers = createDiscardingOutputBuffers();
    locationsConsumer = (fragmentId, tasks, noMoreExchangeLocations) -> {};
}
SectionExecution sectionExecution = sectionExecutionFactory.createSectionExecutions(
@wenleix (Contributor):
@rschlussel : Is SectionExecutionFactory introduced for the new sql query scheduler? So essentially we are calling something in the new scheduler from the legacy scheduler? ;)

Contributor: SectionExecutionFactory is already used in the LegacySqlQueryScheduler (the logic was originally extracted from there).

        remoteTaskFactory,
        splitSourceFactory,
        0);
addStateChangeListeners(sectionExecution);
Contributor: Do we need to remove the state change listeners for old sections?

@pguofb (author): Yeah, I thought about this. When we create new stage executions and replace the old ones in this.stageExecutions, the old ones will have nothing referencing them and will eventually be garbage collected without ever being triggered. Therefore, I'm not too worried about explicitly removing listeners. Besides, the state machine class does not provide methods to explicitly remove listeners, so I think it is OK to trust the GC to do its job :)

Map<StageId, StageExecutionAndScheduler> updatedStageExecutions = sectionExecution.getSectionStages().stream()
        .collect(toImmutableMap(execution -> execution.getStageExecution().getStageExecutionId().getStageId(), identity()));
synchronized (this) {
    stageExecutions.putAll(updatedStageExecutions);
Contributor: Ditto: do we need to remove old stage executions from stageExecutions?

@pguofb (author): Likewise: stageExecutions is a map from StageId to StageExecutionAndScheduler. The rewritten stages have the same IDs as the old ones (a bit different from retry, because we optimize them right before we schedule and execute them), so we actually replace the old entries rather than simply insert a batch of new ones.

@wenleix (Contributor) commented Jul 6, 2020:

Two high-level questions:

  1. In the query completion event, will the adapted plan be reported, or the original plan? (It looks like the original plan will be reported?)
  2. Do we plan to have a field in the query completion event to indicate whether the query got adapted? This would help understand the impact in production, verifier runs, etc.

@pguofb (author) commented Jul 6, 2020, replying to the two questions above:

Actually, according to my project plan, the next step (after enabling runtime join swapping) is to fix how the statistics show up in the UI, and I'm currently starting to look at that part.

  1. So far, I checked that the Stage Performance UI (zooming into per-stage details) and the JSON page show the correct (swapped) statistics. But the Live Plan page (showing the overall SubPlan) is not updated to the swapped one, even though we update this.plan every time we adapt a stage via the updatePlan function.

  2. We have not decided on a concrete plan for modifying the query completion event, but ideally we definitely want to reflect whether runtime adaptation happened.

@pguofb (author) commented Jul 10, 2020:

The latest commit fixes the broken statistics in the Live Plan. Now both the Query Completion Event and the Live Plan shown during runtime (rendered by periodically issuing an HTTP GET to QueryResource.getQueryInfo to fetch the latest queryInfo) will reflect the correct plan information and statistics.

@@ -190,6 +196,7 @@ private LegacySqlQueryScheduler(
     this.queryStateMachine = requireNonNull(queryStateMachine, "queryStateMachine is null");
     this.plan.compareAndSet(null, requireNonNull(plan, "plan is null"));
     this.session = requireNonNull(session, "session is null");
+    this.metadata = requireNonNull(metadata, "metadata is null");
Contributor: nit: just get the FunctionManager from the metadata, since that's all you need (same below).

@rschlussel (Contributor):
@wenleix @arhimondr did you have any other comments or concerns?

@wenleix (Contributor) commented Jul 10, 2020:

@pguofb:

> So far, I checked that the Stage Performance UI (zooming into per-stage details) and the JSON page show the correct (swapped) statistics. But the Live Plan page (showing the overall SubPlan) is not updated to the swapped one, even though we update this.plan every time we adapt a stage via the updatePlan function.

Sounds good. What about the plan JSON in the query completion event? :)

Update: sorry, just saw your latest comment (#14675 (comment)). Nice work!

@wenleix (Contributor) commented Jul 10, 2020:

Thanks @rschlussel and @pguofb. I have no other comments :)

@arhimondr (Member) left a comment:

LGTM % nits

private final SectionExecutionFactory sectionExecutionFactory;
private final RemoteTaskFactory remoteTaskFactory;
private final SplitSourceFactory splitSourceFactory;
private static final Logger log = Logger.get(LegacySqlQueryScheduler.class);
@arhimondr (Member): nit: usually we define log as the very first field in the class and separate it from the other fields.

public class RuntimeReorderJoinSides
        implements Rule<JoinNode>
{
    private final Logger log = Logger.get(RuntimeReorderJoinSides.class);
Member: Make this private static final.
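
Applied to the snippet above, the suggested declaration would read:

public class RuntimeReorderJoinSides
        implements Rule<JoinNode>
{
    // one logger per class suffices; no per-instance state is needed
    private static final Logger log = Logger.get(RuntimeReorderJoinSides.class);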

        swapped.getRightHashVariable(),
        swapped.getDistributionType());

log.debug("Probe size: " + leftOutputSizeInBytes + " is smaller than Build size: " + rightOutputSizeInBytes + " => invoke runtime join swapping on JoinNode ID: " + newJoinNode.getId());
Member: Use the pattern Probe size: %s is smaller than build size: %s => ...
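
With Airlift's Logger, which applies String.format-style arguments only when debug logging is enabled, the suggestion would look like:

log.debug("Probe size: %s is smaller than build size: %s => invoke runtime join swapping on JoinNode ID: %s",
        leftOutputSizeInBytes, rightOutputSizeInBytes, newJoinNode.getId());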

@pguofb pguofb force-pushed the milestone1 branch 3 times, most recently from 45f9a57 to 8752111 Compare July 15, 2020 14:43
@rschlussel (Contributor):

Looks great! Can you squash the last 4 commits together?

-SubPlan fragmentedPlan = planFragmenter.createSubPlans(stateMachine.getSession(), plan, false, idAllocator, stateMachine.getWarningCollector());
+// the variableAllocator is finally passed to SqlQueryScheduler for runtime cost-based optimizations
+variableAllocator.set(new PlanVariableAllocator(plan.getTypes().allVariables()));
+SubPlan fragmentedPlan = planFragmenter.createSubPlans(stateMachine.getSession(), plan, false, idAllocator, variableAllocator.get(), stateMachine.getWarningCollector());
Contributor: No need to pass the variableAllocator here - createSubPlans creates a variableAllocator in the exact same way.

@pguofb (author): Yes, createSubPlans internally creates a variableAllocator in this way, but we want this variableAllocator to be exposed after creating the subplans so it can then be used by the runtime optimizers. Therefore, we moved the creation out here, pass it in as an argument, and can later feed it into SqlQueryScheduler.

@pguofb pguofb force-pushed the milestone1 branch 2 times, most recently from cad2c36 to e60d5a3 Compare July 15, 2020 16:36
- Pass in the CBO and make it invokable at [Legacy/]SqlQueryScheduler
  during runtime.
- Rename the get() method of PlanOptimizers to
  getPlanningTimeOptimizers() to distinguish it from getRuntimeOptimizers().
- Adjust the IterativeOptimizer optimize() function to return the original
  plan when the plan is not changed, instead of always calling
  memo.extract() to rebuild a new one.
- Create a join swapping rule (RuntimeReorderJoinSides) based on the
  probe- and build-side statistics, and adjust the local exchange when
  necessary (add at the build side, remove at the probe side).
- Rebuild the section, re-generate the stageExecutionAndSchedulers of the
  section, and adjust the overall subplan when the join is swapped, to
  reflect the correct statistics in the web UI and QueryCompletionEvent.
@rschlussel rschlussel merged commit 0890b3d into prestodb:master Jul 15, 2020