Support evaluating min/max only metadata query #14845

shixuan-fan · 2020-07-16T01:27:23Z

== RELEASE NOTES ==

General Changes
* Add support for optimizing queries that contain only min/max aggregations over metadata. This is controlled by the existing config ``optimizer.optimize-metadata-queries`` and session property ``optimize_metadata_queries``. Note that enabling this config/session property might change query results if some metadata refers to empty data, e.g. an empty Hive partition.

Note that enabling the existing config optimizer.optimize-metadata-queries and session property optimize_metadata_queries might change query results if some metadata refers to empty data, e.g. an empty Hive partition. For example, suppose we have two Hive ds partitions, 2020-07-01 and 2020-08-01, and 2020-08-01 is empty. Without the metadata optimizer, the ds values come from the data, and since 2020-08-01 has no data, it does not appear in the result (e.g. DISTINCT ds would only return 2020-07-01). With the metadata optimizer enabled, the ds values come from the metastore, and DISTINCT ds would return both values.
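The difference described above can be illustrated with a small, self-contained sketch. This is not Presto code: the class `EmptyPartitionSketch`, its methods, and the map-based model of a partitioned table are hypothetical and exist only to show why an empty partition changes the result of DISTINCT ds.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a partitioned table: ds partition value -> rows stored in it.
public class EmptyPartitionSketch
{
    // Optimization disabled: ds values are derived from data rows,
    // so a partition without rows contributes nothing.
    public static Set<String> distinctDsFromData(Map<String, List<String>> partitions)
    {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : partitions.entrySet()) {
            if (!entry.getValue().isEmpty()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Optimization enabled: ds values are derived from partition metadata,
    // so every registered partition is returned, empty or not.
    public static Set<String> distinctDsFromMetadata(Map<String, List<String>> partitions)
    {
        return new TreeSet<>(partitions.keySet());
    }

    public static void main(String[] args)
    {
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        partitions.put("2020-07-01", List.of("row1", "row2"));
        partitions.put("2020-08-01", List.of()); // empty partition

        System.out.println(distinctDsFromData(partitions));     // [2020-07-01]
        System.out.println(distinctDsFromMetadata(partitions)); // [2020-07-01, 2020-08-01]
    }
}
```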

Assume we have a daily ingested table that is partitioned on ds. A filter like `ds = (SELECT '2020-07-01')` is converted into an INNER JOIN, but the value is not passed to the other side of the join, which leads to a full table scan. This commit allows the value to be treated as a predicate, so only that one partition needs to be read.

highker

"Push expression translation above MetadataQueryOptimizer" LGTM

highker

"Remove unused field": Change the title to "Remove unused field for MetadataQueryOptimizer"

presto-hive/src/test/java/com/facebook/presto/hive/TestHiveLogicalPlanner.java

...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java

highker · 2020-07-27T17:39:26Z

...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java

+            if (arguments.isEmpty()) {
+                return constant(null, returnType);
+            }


When will the result be null?

The result would be null if all values are null.

That being said, I should probably move this to be the first step of this function.

Assume we have a daily ingested table that is partitioned on ds. One common use case is to fetch data from the latest ds partition, and one way to write such a query is with a filter like `ds = (SELECT max(ds) FROM table)`. However, this filter is converted into an INNER JOIN and leads to a full table scan on the other side of the join. With this commit, a query like `SELECT max(ds) FROM table` is evaluated at optimization time when OPTIMIZE_METADATA_QUERIES is set to true and converted into a ValuesNode, which can then be pushed to the other side of the join to avoid the expensive full table scan.
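As a rough illustration of evaluating min/max over partition values at planning time, the following sketch folds the aggregate into a single constant. It is not the MetadataQueryOptimizer implementation: the class `MinMaxFoldSketch`, the `fold` method, and the use of plain strings for partition values are assumptions made for the example. It returns null when no non-null values remain, mirroring the `arguments.isEmpty()` check discussed in the review.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical sketch: fold min/max over known partition values into one constant.
public class MinMaxFoldSketch
{
    public static String fold(String function, List<String> partitionValues)
    {
        // Null partition values are ignored, as SQL aggregates ignore NULL inputs.
        List<String> arguments = partitionValues.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toList());

        // Nothing left to aggregate: the folded constant is null.
        if (arguments.isEmpty()) {
            return null;
        }

        switch (function) {
            case "max":
                return Collections.max(arguments);
            case "min":
                return Collections.min(arguments);
            default:
                throw new IllegalArgumentException("unsupported function: " + function);
        }
    }

    public static void main(String[] args)
    {
        // ISO-8601 dates sort lexicographically, so string comparison is enough here.
        List<String> ds = Arrays.asList("2020-07-01", "2020-08-01", null);
        System.out.println(fold("max", ds)); // 2020-08-01
        System.out.println(fold("min", ds)); // 2020-07-01
        System.out.println(fold("max", Collections.singletonList(null))); // null
    }
}
```

In the real optimizer the folded value would replace the aggregation subquery in the plan; here it is simply returned so the behaviour can be checked in isolation.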

shixuan-fan force-pushed the optimize branch from a971fd7 to 2493b37 Compare July 16, 2020 05:32

shixuan-fan changed the title ~~[WIP] Support evaluating min/max only metadata query~~ [TEST|WIP] Support evaluating min/max only metadata query Jul 16, 2020

shixuan-fan force-pushed the optimize branch from 2493b37 to 0656169 Compare July 16, 2020 18:08

shixuan-fan marked this pull request as draft July 16, 2020 23:19

shixuan-fan force-pushed the optimize branch 5 times, most recently from 6d81914 to bfca28a Compare July 23, 2020 01:12

shixuan-fan added 2 commits July 23, 2020 10:40

Push expression translation above MetadataQueryOptimizer

a7030b1

shixuan-fan force-pushed the optimize branch from bfca28a to be7331a Compare July 23, 2020 18:22

shixuan-fan marked this pull request as ready for review July 27, 2020 16:19

shixuan-fan changed the title ~~[TEST|WIP] Support evaluating min/max only metadata query~~ Support evaluating min/max only metadata query Jul 27, 2020

shixuan-fan requested a review from highker July 27, 2020 16:19

shixuan-fan assigned highker Jul 27, 2020

shixuan-fan requested a review from a team July 27, 2020 16:20

highker reviewed Jul 27, 2020

highker removed their assignment Jul 27, 2020

jainxrohit self-requested a review July 27, 2020 17:37

highker reviewed Jul 27, 2020

Remove unused field in MetadataQueryOptimizer

3435c2b

shixuan-fan force-pushed the optimize branch from be7331a to fe7b487 Compare July 27, 2020 20:45

shixuan-fan force-pushed the optimize branch from fe7b487 to d6df5fd Compare July 27, 2020 20:49

highker approved these changes Jul 27, 2020

shixuan-fan merged commit ef4b537 into prestodb:master Jul 27, 2020

shixuan-fan deleted the optimize branch July 27, 2020 22:32

caithagoras mentioned this pull request Jul 28, 2020

Add release notes for 0.239 #14908

Merged

13 tasks

highker mentioned this pull request Jul 26, 2021

Skip empty partitions with optimize_metadata_queries #16497

Closed
