Support evaluating min/max only metadata query #14845
6d81914 to bfca28a
Assuming we have a daily ingested table that is partitioned on `ds`, a filter like `ds = (SELECT '2020-07-01')` is converted into an INNER JOIN, but the constant is not passed to the other side of the join, which leads to a full table scan. This commit allows the value to be treated as a predicate, so only the one matching partition needs to be read.
"Push expression translation above MetadataQueryOptimizer" LGTM
"Remove unused field": Change the title to "Remove unused field for MetadataQueryOptimizer"
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveLogicalPlanner.java
...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java
if (arguments.isEmpty()) {
    return constant(null, returnType);
}
When will the result be null?
The result would be null if all values are null.
That being said, I should probably move this to be the first step of this function.
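For readers following this thread, here is a minimal sketch of folding `min`/`max` over partition key values with the null case handled up front. It is not the code from this PR: `MinMaxFoldSketch` and `fold` are hypothetical names, partition values are assumed to be strings, and `Optional.empty()` stands in for the SQL NULL constant.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch, not Presto code: fold min/max over partition key values.
public class MinMaxFoldSketch {
    // Returns Optional.empty() (standing in for SQL NULL) when no non-null value exists.
    public static Optional<String> fold(String function, List<String> partitionValues) {
        // Null partition values do not take part in min/max.
        List<String> arguments = partitionValues.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
        // Handle the all-null (or empty) case before doing any comparison.
        if (arguments.isEmpty()) {
            return Optional.empty();
        }
        switch (function) {
            case "max":
                return arguments.stream().max(Comparator.naturalOrder());
            case "min":
                return arguments.stream().min(Comparator.naturalOrder());
            default:
                throw new IllegalArgumentException("unsupported function: " + function);
        }
    }

    public static void main(String[] args) {
        System.out.println(fold("max", Arrays.asList("2020-07-01", "2020-08-01")));
        System.out.println(fold("min", Arrays.<String>asList(null, null)));
    }
}
```

Putting the empty check before the `switch` keeps both branches free of special cases.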
Assuming we have a daily ingested table that is partitioned on `ds`, one common use case is to fetch data from the latest `ds` partition. One way to compose such a query is to use a filter like `ds = (SELECT max(ds) FROM table)`. However, this filter is converted into an INNER JOIN and leads to a full table scan on the other side of the join. Instead, this commit enables a query like `SELECT max(ds) FROM table` to be evaluated at optimization time when OPTIMIZE_METADATA_QUERIES is set to true and converted into a ValuesNode, which can then be pushed to the other side of the join to avoid an expensive full table scan.
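To illustrate the idea, the sketch below models the two steps in plain Java: first answer `max(ds)` from partition names alone, then use that constant to select the partitions to read. It is a toy model under stated assumptions, not Presto's planner: `PartitionPruningSketch`, `maxFromMetadata`, and `partitionsToScan` are hypothetical names, and partitions are represented only by their `ds` value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Presto code: why a constant predicate avoids a full scan.
public class PartitionPruningSketch {
    // Step 1: answer max(ds) from partition metadata alone, without reading rows.
    // Collections.max throws NoSuchElementException if there are no partitions.
    public static String maxFromMetadata(List<String> partitionNames) {
        return Collections.max(partitionNames);
    }

    // Step 2: with ds known as a constant, keep only the matching partitions.
    public static List<String> partitionsToScan(List<String> partitionNames, String ds) {
        List<String> selected = new ArrayList<>();
        for (String name : partitionNames) {
            if (name.equals(ds)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("2020-06-30", "2020-07-01", "2020-07-02");
        String latest = maxFromMetadata(partitions);
        System.out.println(partitionsToScan(partitions, latest));
    }
}
```

Without the constant, every partition in the list would have to be read and joined against the subquery result.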
Note that enabling the existing config `optimizer.optimize-metadata-queries` and session property `optimize_metadata_queries` might change query results if there is metadata that refers to empty data, e.g. an empty Hive partition. For example, suppose we have two Hive `ds` partitions, `2020-07-01` and `2020-08-01`, and `2020-08-01` is an empty partition. When computing without the metadata optimizer, the `ds` rows come from data, and since `2020-08-01` does not have any data, it won't appear in the result (e.g. `DISTINCT ds` would only return `2020-07-01`). However, if the metadata optimizer is enabled, the `ds` rows come from the metastore, and `DISTINCT ds` would return both rows.
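The difference can be reproduced with a toy model of the two partitions. This is only an illustration: `EmptyPartitionSketch` and its methods are hypothetical names, and a map from `ds` value to a list of rows stands in for the table and its metastore entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch, not Presto code: DISTINCT ds from data vs. from metadata.
public class EmptyPartitionSketch {
    // Scanning rows: a partition with no rows contributes no ds value.
    public static SortedSet<String> distinctFromData(Map<String, List<String>> rowsByPartition) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : rowsByPartition.entrySet()) {
            if (!entry.getValue().isEmpty()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Reading metadata only: every registered partition is reported, even if empty.
    public static SortedSet<String> distinctFromMetadata(Map<String, List<String>> rowsByPartition) {
        return new TreeSet<>(rowsByPartition.keySet());
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("2020-07-01", Arrays.asList("row1", "row2"));
        table.put("2020-08-01", new ArrayList<>()); // empty partition
        System.out.println(distinctFromData(table));
        System.out.println(distinctFromMetadata(table));
    }
}
```

The same gap applies to `max(ds)`: in this model the data-based answer is `2020-07-01`, while the metadata-based answer is `2020-08-01`.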