Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support evaluating min/max only metadata query #14845

Merged
merged 4 commits into from Jul 27, 2020

Conversation

shixuan-fan
Copy link
Contributor

@shixuan-fan shixuan-fan commented Jul 16, 2020

== RELEASE NOTES ==

General Changes
* Add support to optimize min/max only metadata query. This is controlled by existing config ``optimizer.optimize-metadata-queries`` and session property ``optimize_metadata_queries``. Note that enabling this config/session property might change query result if there are metadata that refers to empty data, e.g. empty hive partition.

Note that enabling existing config optimizer.optimize-metadata-queries and session property optimize_metadata_queries might change query result if there are metadata that refers to empty data, e.g. empty hive partition. For example, if we have two Hive ds partitions, one is 2020-07-01 and the other is 2020-08-01. Let's assume 2020-08-01 is an empty partition. Then when computing without metadata optimizer, the ds rows come from data, and since 2020-08-01 does not have any data, it won't be appearing in the result (e.g. DISTINCT ds would only return 2020-07-01). However, if metadata optimizer is enabled, then ds rows come from metastore, and DISTINCT ds would return both rows.

@shixuan-fan shixuan-fan changed the title [WIP] Support evaluating min/max only metadata query [TEST|WIP] Support evaluating min/max only metadata query Jul 16, 2020
@shixuan-fan shixuan-fan marked this pull request as draft July 16, 2020 23:19
@shixuan-fan shixuan-fan force-pushed the optimize branch 5 times, most recently from 6d81914 to bfca28a Compare July 23, 2020 01:12
Assuming we have a daily ingested table that is partitioned on ds,
a filter like `ds = (SELECT '2020-07-01')` is converted into an
INNER JOIN, but this value is not passed to the other side of Join,
which leads to full table scan.

This commit will enable this value being treated as predicate, and
thus we only need to read this one partition.
@shixuan-fan shixuan-fan marked this pull request as ready for review July 27, 2020 16:19
@shixuan-fan shixuan-fan changed the title [TEST|WIP] Support evaluating min/max only metadata query Support evaluating min/max only metadata query Jul 27, 2020
@shixuan-fan shixuan-fan requested a review from highker July 27, 2020 16:19
@shixuan-fan shixuan-fan requested a review from a team July 27, 2020 16:20
Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Push expression translation above MetadataQueryOptimizer" LGTM

Copy link
Contributor

@highker highker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Remove unused field": Change the title to "Remove unused field for MetadataQueryOptimizer"

@highker highker removed their assignment Jul 27, 2020
@jainxrohit jainxrohit self-requested a review July 27, 2020 17:37
Comment on lines 270 to 280
if (arguments.isEmpty()) {
return constant(null, returnType);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will the result be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result would be null if all values are null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I should probably move this to be the first step of this function.

Assuming we have a daily ingested table that is partitioned on ds, one
common use case is to fetch data from latest ds partition. One way to
compose such a query is using a filter like
`ds = (SELECT max(ds) FROM table)`. However, this filter is converted
into an INNER JOIN, and will lead to a full table scan on the other
side of join.

Instead, this commit enables a query like `SELECT max(ds) FROM table`
being evaluated at optimization time when OPTIMIZE_METADATA_QUERIES
is set to true, and convert it into a ValuesNode, which could then
be pushed to the other side of Join to avoid expensive full table
scan.
@shixuan-fan shixuan-fan merged commit ef4b537 into prestodb:master Jul 27, 2020
@shixuan-fan shixuan-fan deleted the optimize branch July 27, 2020 22:32
@caithagoras caithagoras mentioned this pull request Jul 28, 2020
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants