Enable filter pushdown for Parquet #17161
Conversation
This commit introduces parquet_pushdown_filter_enabled session property and parquet.pushdown-filter-enabled for the hive connector.
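The commit message above describes a config/session property pair defaulting to off. As a rough sketch of what such a toggle looks like, the class below is a simplified stand-in, not the actual HiveClientConfig or HiveSessionProperties code; only the property names come from the PR:

```java
// Hedged sketch of the toggle described in the commit message. This is a
// simplified stand-in for illustration, not the actual Presto config class.
public class PushdownToggleSketch {
    public static class HiveClientConfigSketch {
        // Backs the parquet.pushdown-filter-enabled config property;
        // defaults to false so existing deployments are unaffected.
        private boolean parquetPushdownFilterEnabled;

        public boolean isParquetPushdownFilterEnabled() {
            return parquetPushdownFilterEnabled;
        }

        public HiveClientConfigSketch setParquetPushdownFilterEnabled(boolean enabled) {
            this.parquetPushdownFilterEnabled = enabled;
            return this; // fluent setter, matching the usual config-class style
        }
    }

    public static void main(String[] args) {
        HiveClientConfigSketch config = new HiveClientConfigSketch();
        System.out.println(config.isParquetPushdownFilterEnabled()); // false by default
        config.setParquetPushdownFilterEnabled(true);
        System.out.println(config.isParquetPushdownFilterEnabled()); // true
    }
}
```

The matching session property would simply read this config value as its default, so the feature stays off unless explicitly enabled per deployment or per session.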
@yingsu00 Thank you for enabling filter pushdown for Velox.
It is only a planner change and doesn't have any effect on query execution.
Is this the case? The planner would push down filter into table scan and remove the filter node above it. If the reader is not going to process the filter, the query results will be wrong. Shouldn't we add a check in the reader to fail fast if there is a pushed down filter that it cannot yet process?
It would be nice to add a test.
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
@@ -2636,7 +2638,7 @@ private boolean isPushdownFilterEnabled(ConnectorSession session, ConnectorTable
     boolean pushdownFilterEnabled = HiveSessionProperties.isPushdownFilterEnabled(session);
     if (pushdownFilterEnabled) {
         HiveStorageFormat hiveStorageFormat = getHiveStorageFormat(getTableMetadata(session, tableHandle).getProperties());
-        if (hiveStorageFormat == ORC || hiveStorageFormat == DWRF) {
+        if (hiveStorageFormat == ORC || hiveStorageFormat == DWRF || hiveStorageFormat == PARQUET && isParquetPushdownFilterEnabled(session)) {
are you missing parentheses?
&& has higher precedence than ||, so this line is equivalent to:
if (hiveStorageFormat == ORC || hiveStorageFormat == DWRF || (hiveStorageFormat == PARQUET && isParquetPushdownFilterEnabled(session)))
I think this is correct. Since PARQUET selective readers are not implemented yet, if hive.pushdown_filter_enabled is set to true in order to run ORC queries fast, Parquet queries would fail. Therefore we need a second session property for Parquet and should enable the pushdown only when both conditions are met.
I see. This is very subtle and hard to read. I think adding parentheses would be clearer.
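The precedence point discussed above can be verified with a small, self-contained example; check here is a hypothetical stand-in for the storage-format condition in the PR, taking booleans instead of real HiveStorageFormat values:

```java
public class PrecedenceDemo {
    // Stand-in for the condition in the PR: because && binds tighter than ||,
    // the unparenthesized form is equivalent to
    //     isOrc || isDwrf || (isParquet && parquetEnabled)
    public static boolean check(boolean isOrc, boolean isDwrf, boolean isParquet, boolean parquetEnabled) {
        return isOrc || isDwrf || isParquet && parquetEnabled;
    }

    public static void main(String[] args) {
        // ORC passes regardless of the Parquet flag
        System.out.println(check(true, false, false, false));  // true
        // Parquet passes only when the Parquet property is also enabled
        System.out.println(check(false, false, true, false));  // false
        System.out.println(check(false, false, true, true));   // true
    }
}
```

So the committed code behaves as intended; the suggested parentheses change only readability, not semantics.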
@@ -522,7 +523,7 @@ private boolean isPushdownFilterSupported(ConnectorSession session, TableHandle
     boolean pushdownFilterEnabled = HiveSessionProperties.isPushdownFilterEnabled(session);
     if (pushdownFilterEnabled) {
         HiveStorageFormat hiveStorageFormat = getHiveStorageFormat(getMetadata(tableHandle).getTableMetadata(session, tableHandle.getConnectorHandle()).getProperties());
-        if (hiveStorageFormat == HiveStorageFormat.ORC || hiveStorageFormat == HiveStorageFormat.DWRF) {
+        if (hiveStorageFormat == HiveStorageFormat.ORC || hiveStorageFormat == HiveStorageFormat.DWRF || hiveStorageFormat == HiveStorageFormat.PARQUET && isParquetPushdownFilterEnabled(session)) {
missing parentheses?
See above
CC: @majetideepak
If we are adding a check at the Java ParquetReader, we can probably avoid adding …
Unfortunately I don't think so. Users may run ORC queries and Parquet queries at the same time, and they may turn on …
Thanks @mbasmanova for reviewing the PR.
My intention was that the default value of hive.parquet.pushdown_filter_enabled shall remain false, but you were right: when it is turned on, we should fail the query early. I updated the PR to fail the query with a NOT_SUPPORTED exception in the newly introduced empty ParquetSelectivePageSourceFactory.
testParquetSelectivePageSourceFails was added to TestHiveIntegrationSmokeTest.java
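The fail-fast pattern described above (an intentionally empty page source factory that rejects every request) can be sketched as follows. All names here are illustrative stand-ins, not the actual Presto API; the real factory would throw PrestoException with the NOT_SUPPORTED error code:

```java
// Minimal sketch of the fail-fast behavior added in this PR. The classes
// below are simplified stand-ins for ParquetSelectivePageSourceFactory and
// PrestoException(NOT_SUPPORTED), used only to illustrate the pattern.
public class ParquetSelectiveSketch {
    public static class NotSupportedException extends RuntimeException {
        public NotSupportedException(String message) {
            super(message);
        }
    }

    // Stand-in for the factory's createPageSource: any attempt to build a
    // selective Parquet page source fails immediately, before execution,
    // so a query cannot silently return wrong results.
    public static Object createPageSource() {
        throw new NotSupportedException("Parquet selective page source is not supported yet");
    }

    public static void main(String[] args) {
        try {
            createPageSource();
        }
        catch (NotSupportedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This addresses the earlier review concern: the planner removes the filter node above the scan, so if the reader cannot apply the pushed-down filter, failing fast is the only safe behavior.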
Updated: d67bd1f to 6c66c0e
@yingsu00 Overall looks good. Minor comments below.
...o-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetSelectivePageSourceFactory.java
@@ -5424,6 +5424,37 @@ public void testPartialAggregatePushdownParquet()
     assertFalse(getQueryRunner().tableExists(session, "test_parquet_table"));
 }

+@Test
+public void testParquetSelectivePageSourceFails()
Is there no Parquet-specific test? Do you plan to add one?
@mbasmanova I'm a bit confused. This test is a Parquet-specific test, as the table was created in Parquet format:
assertUpdate("CREATE TABLE test_parquet_filter_pushdoown (a BIGINT, b BOOLEAN) WITH (format = 'parquet')");
Could you please explain what kind of test you refer to? Did you mean adding tests on ORC or DWRF as well?
TestHiveIntegrationSmokeTest.java is a collection of basic tests for the Hive connector. It cannot possibly hold all the tests for the connector. A good practice is to create separate test suites for different sets of functionality. Hence, I'd expect Parquet-specific tests to be placed in a test suite that's specific to Parquet. Hope that makes sense.
@mbasmanova thanks for explaining. The existing Parquet-specific tests like TestParquetDistributedQueries and TestParquetReader are all integration tests with many queries. It doesn't seem necessary to duplicate those tests with the new config property and expect every query to fail. Other tests target very specific functionality like definition-level decoding, compression, etc., and I couldn't find any existing suite this test would fit into. We will add more Parquet tests later if Aria Parquet gets into place.
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java
The release notes sound like filter pushdown is now available for Parquet, which is not the case. I wonder if we can skip the RN for this PR altogether.
This commit enables parquet filter pushdown if the hive.parquet_pushdown_filter_enabled session property is enabled. Queries on Parquet tables would fail early with NOT_SUPPORTED exception for now, since the selective Parquet page source is not supported yet.
@mbasmanova I was wondering the same, and I removed the release note for now.
@yingsu00 Thank you for adding a session property to pushdown filter into Parquet reader on the planner side and unblocking corresponding work on Velox.
We are seeing the test com.facebook.presto.prism.plugin.TestPrismIntegrationSmokeTest.testParquetSelectivePageSourceFails failing in our Jenkins build with the following error:
@swapsmagic Is this error persistent? It ran fine on my laptop, and all tests in CI passed before I merged the PR. I wonder why other tests don't fail; e.g., testPartialAggregatePushdownORC and testPartialAggregatePushdownParquet both create unpartitioned tables and the insertions succeed.
This PR enables the hive.pushdown_filter_enabled property for Parquet and guards it with a new config/session property.
This PR is preparation for the Aria Parquet and Velox Parquet reader work.