[Native] Add missing plumbing for Cte support #22780

Merged: 1 commit merged from cte into master on Jul 12, 2024

Conversation

@aditi-pandit (Contributor) commented May 17, 2024

Description

CTE support was added to Presto in #20887.
This feature lives largely in the Presto optimizer logic, but it relies on the temporary table SPI to create TableWriterNodes on the workers.

The temporary table SPI was disabled in the PrestoToVelox conversion. At the worker, temporary tables are used like regular new Hive tables: the SPI creates the table nodes and writes to them in the same pipeline. Commit handling, however, differs from regular tables and is processed by the TableFinish operator at the coordinator.

Motivation and Context

Use CTE with Prestissimo

Impact

The CTE materialization properties (https://prestodb.io/docs/0.286/admin/properties.html#cte-materialization-properties) can be used with Prestissimo workers as well.

However, Prestissimo supports only the following storage and compression-codec options:

hive.temporary-table-storage-format = DWRF, PARQUET
hive.temporary-table-compression-codec = ZSTD, NONE
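
For example (a sketch, not code from this PR), a Hive catalog for Presto C++ workers could be configured with one supported value for each property, such as in a test query runner's hiveProperties map:

// Hypothetical snippet: requires java.util.Map and com.google.common.collect.ImmutableMap.
Map<String, String> hiveProperties = ImmutableMap.of(
        "hive.temporary-table-storage-format", "PARQUET",       // DWRF or PARQUET only
        "hive.temporary-table-compression-codec", "ZSTD");      // ZSTD or NONE only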

Test Plan

End-to-end tests are added in this PR. They are derived from the e2e Java CTE tests, so the CTE tests are now run with both engines.

== RELEASE NOTE ==

Add CTE materialization for Presto C++ workers with the configuration properties 
`hive.temporary-table-storage-format` (`DWRF` or `PARQUET` only) and 
`hive.temporary-table-compression-codec` (`ZSTD` or `NONE` only). 
:pr:`22780`

@aditi-pandit requested a review from a team as a code owner on May 17, 2024 23:32
@aditi-pandit marked this pull request as draft on May 17, 2024 23:33
@aditi-pandit changed the title from [Native] Add Cte support to [Do not review][Native] Add Cte support on May 17, 2024
@aditi-pandit changed the title from [Do not review][Native] Add Cte support to [Do not review][Native] Add missing plumbing for Cte support on Jun 28, 2024
@aditi-pandit changed the title from [Do not review][Native] Add missing plumbing for Cte support to [Native] Add missing plumbing for Cte support on Jun 28, 2024
@aditi-pandit marked this pull request as ready for review on June 28, 2024 22:25
"legacy",
hiveProperties,
workerCount,
Optional.of(Paths.get(dataDirectory + "/" + storageFormat)),
@yingsu00 (Contributor), Jul 9, 2024:

storageFormat should NOT be added to the path by default. If you specify storageFormat, you won't be able to see tables with different file formats in the same catalog or schema (database) when you're running any QueryRunner. Remember, the QueryRunners should allow users to have tables with different file formats in the same database, and even allow joining them together in the same query. Also, the DATA_DIR won't be visible to other query runners at all. We should gradually retire this storageFormat from the data path.

@yingsu00 (Contributor):

Suppose you set DATA_DIR='/Users/aditi'; adding the storage format would put all your metadata and data in /Users/aditi/PARQUET/hive_data/. You would only be able to create and query Parquet tables when running the QueryRunner, but not DWRF tables. We introduced a boolean addStorageFormatToPath parameter in createNativeQueryRunner() and set it to false by default, just for backward compatibility. Can you do the same? Thanks.
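
A minimal sketch of the pattern described here (the addStorageFormatToPath flag name is taken from the comment above; this is not the exact createNativeQueryRunner code):

// Hypothetical helper, using java.nio.file.Path/Paths and java.util.Optional:
// append the storage format to the data directory only when explicitly requested,
// so existing query runners keep their old data layout by default.
static Optional<Path> dataDirectoryPath(String dataDirectory, String storageFormat, boolean addStorageFormatToPath)
{
    return Optional.of(addStorageFormatToPath
            ? Paths.get(dataDirectory + "/" + storageFormat)
            : Paths.get(dataDirectory));
}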

@aditi-pandit (Contributor, Author):

My bad Ying. I do recall this discussion.

Updated the code.

public void testPersistentCteWithChar() {}

@Override
// Unsupported nested encoding in Velox Parquet
Contributor:

Add writer to be more clear?
// Unsupported nested encoding in Velox Parquet writer

@aditi-pandit (Contributor, Author):

Done.

{
if (!queryRunner.tableExists(queryRunner.getDefaultSession(), "lineitem")) {
String shipDate = castDateToVarchar ? "cast(shipdate as varchar) as shipdate" : "shipdate";
Contributor:

Is this needed by the CTE? If not, will you be able to move this change into a separate commit? Thanks.

@aditi-pandit (Contributor, Author):

Yes, this is needed by CTE.

I reused the CTE Java engine tests for the native engine as well. The CTE tests use SQL with EXTRACT calls that depend on the date columns being retained as DATE, whereas the native engine tests have been type-casting DATE columns to VARCHAR for both DWRF and Parquet in the TPC-H schema.
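
For illustration (a hypothetical query, not one of the tests in this PR), a CTE query along these lines only works when shipdate is kept as a DATE column, since EXTRACT does not accept a VARCHAR argument:

// Hypothetical example: relies on lineitem.shipdate being retained as DATE.
String cteQuery = "WITH recent AS (" +
        "  SELECT * FROM lineitem WHERE EXTRACT(YEAR FROM shipdate) = 1995" +
        ") " +
        "SELECT COUNT(*) FROM recent";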

@aditi-pandit (Contributor, Author):

@yingsu00 : Have addressed your review comments. PTAL.

@jaystarshot (Member):

We should add a new row for Presto C++ support in the documentation later here

@steveburnett (Contributor):

> We should add a new row for presto c++ support in the documentation later here

Good idea! Maybe mention it in the Presto C++ Features documentation.

@yingsu00 (Contributor):

@aditi-pandit there're some test failures, have you checked them?
btw. I think it deserves a release note. Also, it'll be nice to add the documentation in this PR as well.

@aditi-pandit (Contributor, Author) commented Jul 10, 2024

@yingsu00 @jaystarshot @steveburnett : So while this feature enables use of CTE, we support only DWRF and Parquet as file formats for the intermediate temporary tables created by it. I feel there is most potential with using pagefile format for these temp tables. We are working on the pagefile formats right now. So I preferred to add documentation once we have numbers for those.

wydt ? I'll add a RELEASE NOTE though.

@aditi-pandit (Contributor, Author):

> @aditi-pandit there're some test failures, have you checked them? btw. I think it deserves a release note. Also, it'll be nice to add the documentation in this PR as well.

@yingsu00 : Those failures were on account of the Iceberg issue that has now been reverted. I've rebased the branch and the tests pass again.

@yingsu00 (Contributor):

> CTE materialization can be used with Prestissimo workers as well.
@aditi-pandit Release notes need to follow https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines . I think this one can start with "Add". How about the following modification?

Add CTE materialization support for Presto C++ clusters. It supports only the following storage and compression-codec options (ref https://prestodb.io/docs/0.286/admin/properties.html#cte-materialization-properties)
hive.temporary-table-storage-format = DWRF, PARQUET
hive.temporary-table-compression-codec = ZSTD, NONE

@steveburnett (Contributor):

> @yingsu00 @jaystarshot @steveburnett : So while this feature enables use of CTE, we support only DWRF and Parquet as file formats for the intermediate temporary tables created by it. I feel there is most potential with using pagefile format for these temp tables. We are working on the pagefile formats right now. So I preferred to add documentation once we have numbers for those.
>
> wydt ? I'll add a RELEASE NOTE though.

I see your point as valid. Are you planning to add pagefile support in this PR, or open a new PR?

@aditi-pandit (Contributor, Author):

> @yingsu00 @jaystarshot @steveburnett : So while this feature enables use of CTE, we support only DWRF and Parquet as file formats for the intermediate temporary tables created by it. I feel there is most potential with using pagefile format for these temp tables. We are working on the pagefile formats right now. So I preferred to add documentation once we have numbers for those.
> wydt ? I'll add a RELEASE NOTE though.
>
> I see your point as valid. Are you planning to add pagefile support in this PR, or open a new PR?

@steveburnett : It will be in a new PR.

@aditi-pandit (Contributor, Author):

> CTE materialization can be used with Prestissimo workers as well.
> @aditi-pandit Release notes need to follow https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines . I think this one can start with "Add". How about the following modification?
>
> Add CTE materialization support for Presto C++ clusters. It supports only the following storage and compression-codec options (ref https://prestodb.io/docs/0.286/admin/properties.html#cte-materialization-properties) hive.temporary-table-storage-format = DWRF, PARQUET hive.temporary-table-compression-codec = ZSTD, NONE

@yingsu00 : Done. PTAL.

@steveburnett (Contributor):

> @steveburnett : It will be in a new PR.

Ideally I would suggest documenting the feature in the same PR, specifying the DWRF and Parquet file formats in the doc, then revising the doc in the new PR to add pagefile support. But a separate PR soon should be okay. Thanks!

@steveburnett (Contributor):

May I suggest a draft release note for consideration?

== RELEASE NOTE ==
Add CTE materialization for Prestissimo workers with the configuration properties `hive.temporary-table-storage-format` (`DWRF` or `PARQUET` only) and `hive.temporary-table-compression-codec` (`ZSTD` or `NONE` only). :pr:`22780`

@aditi-pandit (Contributor, Author):

> May I suggest a draft release note for consideration?
>
> == RELEASE NOTE ==
> Add CTE materialization for Prestissimo workers with the configuration properties `hive.temporary-table-storage-format` (`DWRF` or `PARQUET` only) and `hive.temporary-table-compression-codec` (`ZSTD` or `NONE` only). :pr:`22780`

@steveburnett : Sounds good. Have updated the release notes.

@yingsu00 (Contributor):

> Add CTE materialization for Prestissimo workers with the configuration properties hive.temporary-table-storage-format (DWRF or PARQUET only) and hive.temporary-table-compression-codec (ZSTD or NONE only). :pr:22780

Looks good, just change Prestissimo to Presto C++....

@aditi-pandit (Contributor, Author):

> Add CTE materialization for Prestissimo workers with the configuration properties hive.temporary-table-storage-format (DWRF or PARQUET only) and hive.temporary-table-compression-codec (ZSTD or NONE only). :pr:22780
>
> Looks good, just change Prestissimo to Presto C++....

Done. Good idea to keep this consistent across all places.

@aditi-pandit merged commit caa35d8 into master on Jul 12, 2024
59 checks passed
@aditi-pandit deleted the cte branch on July 12, 2024 16:43