[Native] Add missing plumbing for Cte support #22780
Conversation
"legacy", | ||
hiveProperties, | ||
workerCount, | ||
Optional.of(Paths.get(dataDirectory + "/" + storageFormat)), |
storageFormat should NOT be added to the path by default. If you specify storageFormat, you won't be able to see tables with different file formats in the same catalog or schema (database) when you're running any QueryRunner. Remember, the QueryRunners should allow users to have tables with different file formats in the same database, and even allow joining them together in the same query. Also, the DATA_DIR won't be visible to other query runners at all. We should gradually retire this storageFormat from the data path.
Suppose you set DATA_DIR='/Users/aditi'. Adding the storage format would put all your metadata and data in /Users/aditi/PARQUET/hive_data/. You would only be able to create and query Parquet tables when running the QueryRunner, but not DWRF tables. We introduced a boolean addStorageFormatToPath parameter in createNativeQueryRunner() and set it to false by default just for backward compatibility. Can you do the same? Thanks.
My bad Ying. I do recall this discussion.
Updated the code.
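The discussion above can be sketched in a small, self-contained example. This is a hedged illustration only — `dataDirectory()` and its parameters are hypothetical names, not the actual Presto QueryRunner API — showing how an `addStorageFormatToPath` flag (false by default) keeps the storage format out of the data directory so that tables of different file formats can share one catalog/schema:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical sketch; names are illustrative, not the real Presto API.
public class DataDirectoryExample
{
    static Optional<Path> dataDirectory(String baseDir, String storageFormat, boolean addStorageFormatToPath)
    {
        // Only append the storage format when explicitly requested,
        // retained for backward compatibility with older test setups.
        return addStorageFormatToPath
                ? Optional.of(Paths.get(baseDir, storageFormat))
                : Optional.of(Paths.get(baseDir));
    }

    public static void main(String[] args)
    {
        // Default: PARQUET and DWRF tables share the same data directory.
        System.out.println(dataDirectory("/Users/aditi", "PARQUET", false).get());
        // Legacy behavior: the format is baked into the path.
        System.out.println(dataDirectory("/Users/aditi", "PARQUET", true).get());
    }
}
```

With the flag off, both Parquet and DWRF tables live under the same directory and can be joined in one query.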
public void testPersistentCteWithChar() {}

@Override
// Unsupported nested encoding in Velox Parquet
Add writer to be more clear?
// Unsupported nested encoding in Velox Parquet writer
Done.
{
if (!queryRunner.tableExists(queryRunner.getDefaultSession(), "lineitem")) {
String shipDate = castDateToVarchar ? "cast(shipdate as varchar) as shipdate" : "shipdate";
Is this needed by the CTE? If not, will you be able to move this change into a separate commit? Thanks.
Yes, this is needed by CTE.
I reused the CTE Java engine tests for the Native engine as well. The CTE tests use SQL with EXTRACT calls that depend on date columns being retained as DATE, whereas all the native engine tests type-cast DATE columns to VARCHAR for both DWRF and Parquet in the TPC-H schema.
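The conditional from the diff above can be isolated for clarity. This is a minimal sketch — the class and method names are hypothetical — of how the lineitem column expression switches between retaining DATE (needed by the reused CTE tests' EXTRACT calls) and the pre-existing native-engine cast to VARCHAR:

```java
// Hypothetical sketch of the castDateToVarchar conditional from the diff.
public class ShipDateColumnExample
{
    static String shipDateColumn(boolean castDateToVarchar)
    {
        // The CTE tests need shipdate kept as DATE; the older native-engine
        // tests cast it to VARCHAR for both DWRF and Parquet.
        return castDateToVarchar
                ? "cast(shipdate as varchar) as shipdate"
                : "shipdate";
    }

    public static void main(String[] args)
    {
        System.out.println(shipDateColumn(true));
        System.out.println(shipDateColumn(false));
    }
}
```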
@yingsu00 : Have addressed your review comments. PTAL.
We should add a new row for Presto C++ support in the documentation later here.
Good idea! Maybe mention it in the Presto C++ Features documentation.
@aditi-pandit there are some test failures, have you checked them?
@yingsu00 @jaystarshot @steveburnett : So while this feature enables use of CTE, we support only DWRF and Parquet as file formats for the intermediate temporary tables created by it. I feel there is the most potential in using the pagefile format for these temp tables. We are working on the pagefile formats right now, so I preferred to add documentation once we have numbers for those. wydt? I'll add a RELEASE NOTE though.
@yingsu00 : Those failures were on account of the Iceberg issue that is reverted now. I've rebased the build and the tests pass again.
Add CTE materialization support for Presto C++ clusters. It supports only the following storage and compression-codec options (ref https://prestodb.io/docs/0.286/admin/properties.html#cte-materialization-properties)
I see your point as valid. Are you planning to add pagefile support in this PR, or open a new PR?
@steveburnett : It will be in a new PR.
@yingsu00 : Done. PTAL.
Ideally I would suggest documenting the feature in the same PR, specifying the DWRF and Parquet file formats in the doc, then in the new PR revising the doc to add pagefile support. But the separate PR soon should be okay. Thanks!
May I suggest a draft release note for consideration?
@steveburnett : Sounds good. Have updated the release notes.
Looks good, just change Prestissimo to Presto C++.
Done. Good idea to keep this consistent in all places.
Description
CTE support was added to Presto in #20887.
This feature lives largely in the Presto optimizer logic, but it relies on the temporary table SPI to create TableWriterNodes on the workers.
The temporary table SPI was disabled in the PrestoToVelox conversion. At the worker, temporary tables are used like regular new Hive tables: the SPI creates the table nodes and writes to them in the same pipeline. Table commit handling differs from regular tables and is processed with the TableFinish operator at the coordinator.
Motivation and Context
Use CTE with Prestissimo
Impact
https://prestodb.io/docs/0.286/admin/properties.html#cte-materialization-properties can be used with Prestissimo workers as well.
Though Prestissimo supports only the following storage and compression-codec options:
hive.temporary-table-storage-format = DWRF, PARQUET
hive.temporary-table-compression-codec = ZSTD, NONE
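For example, a Prestissimo-compatible configuration (values chosen from the supported set above) might look like the following. The file placement is an assumption; these are the standard Hive connector property names from the CTE materialization properties referenced above:

```
# Hedged example; property values must come from the supported set.
hive.temporary-table-storage-format=PARQUET
hive.temporary-table-compression-codec=ZSTD
```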
Test Plan
e2e tests are added in this PR. They are derived from the e2e Java CTE tests, so the CTE tests now run with both engines.