
Don't truncate min/max for iceberg delete files #12671

Conversation

@homar (Member) commented Jun 2, 2022

Description

Previously, min/max statistics were truncated to very short values, which could have a very bad impact on performance.
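To illustrate why truncation hurts here, this is a standalone sketch (not Trino code). Position-delete files carry a file_path column whose values share a long common prefix, so truncated lower/upper bounds collapse to that prefix and can no longer distinguish files. The paths and the 16-character limit are illustrative assumptions (16 matches Iceberg's default truncate length):

```java
// Sketch: truncated min/max bounds on file paths become indistinguishable,
// while full (untruncated) bounds stay distinct and remain useful for pruning.
public class TruncationSketch
{
    // Hypothetical truncation helper, mimicking a truncate(16) metrics mode.
    static String truncate(String value, int length)
    {
        return value.length() <= length ? value : value.substring(0, length);
    }

    public static void main(String[] args)
    {
        String pathA = "s3://bucket/warehouse/db/table/data/part-00000.parquet";
        String pathB = "s3://bucket/warehouse/db/table/data/part-99999.parquet";

        // Both bounds collapse to the shared prefix under truncation.
        System.out.println(truncate(pathA, 16).equals(truncate(pathB, 16))); // true

        // Without truncation the bounds stay distinct.
        System.out.println(pathA.equals(pathB)); // false
    }
}
```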

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 2, 2022
@homar (Member, Author) commented Jun 2, 2022

I used FULL_METRICS_CONFIG for Parquet to allow storing full paths in min/max. It was impossible to do the same for ORC, so I decided to put a hard limit there instead.
Does it make sense to do it this way, or should I also put a hard length limit for Parquet, to keep things consistent?
@alexjo2144 @findepi

@homar homar requested review from findepi and alexjo2144 and removed request for findepi June 2, 2022 21:04
@homar homar changed the title Homar/cleanup unused delete files during iceberg optimize Cleanup unused delete files during iceberg optimize Jun 2, 2022
@homar homar changed the title Cleanup unused delete files during iceberg optimize Don't truncate min/max for iceberg delete files Jun 3, 2022
 case ORC:
-    return createOrcWriter(FULL_METRICS_CONFIG, outputPath, POSITION_DELETE_SCHEMA, jobConf, session, storageProperties);
+    return createOrcWriter(FULL_METRICS_CONFIG, outputPath, POSITION_DELETE_SCHEMA, jobConf, session, storageProperties, stringStatisticsLimit);
A reviewer (Member) commented:

We never want to truncate the stats path, right? So could we use DataSize.ofBytes(Long.MAX_VALUE) here?

@homar (Member, Author) replied:

Hmm, that makes sense. I will change it.

@homar (Member, Author) replied:

Actually, it has to be Integer.MAX_VALUE.
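A minimal sketch of why Long.MAX_VALUE does not work here, assuming (as the exchange above suggests) that the writer ultimately represents the byte limit as an int: a long-sized value either wraps on a narrowing cast or fails an exact conversion, so Integer.MAX_VALUE is the largest usable "no truncation" limit:

```java
// Standalone sketch (not Trino code): Long.MAX_VALUE cannot survive as an int.
public class StatsLimitSketch
{
    public static void main(String[] args)
    {
        // A narrowing cast keeps only the low 32 bits: 0xFFFFFFFF == -1.
        int narrowed = (int) Long.MAX_VALUE;
        System.out.println(narrowed); // -1: a nonsensical negative "limit"

        // An exact conversion rejects the value outright.
        try {
            Math.toIntExact(Long.MAX_VALUE);
        }
        catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // integer overflow
        }

        // Integer.MAX_VALUE (~2 GB) fits and is effectively "no truncation".
        System.out.println(Integer.MAX_VALUE); // 2147483647
    }
}
```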

 case ORC:
-    return createOrcWriter(metricsConfig, outputPath, icebergSchema, jobConf, session, storageProperties);
+    return createOrcWriter(metricsConfig, outputPath, icebergSchema, jobConf, session, storageProperties, Optional.empty());
A reviewer (Member) commented:

It might be cleaner not to make the stats limit Optional in createOrcWriter, and instead get it from the session here, so we always have a value.

@homar (Member, Author) replied:

OK!

@@ -132,9 +133,9 @@ public IcebergFileWriter createDataFileWriter(
 switch (fileFormat) {
     case PARQUET:
         // TODO use metricsConfig
-        return createParquetWriter(outputPath, icebergSchema, jobConf, session, hdfsContext);
+        return createParquetWriter(MetricsConfig.getDefault(), outputPath, icebergSchema, jobConf, session, hdfsContext);
A reviewer (Member) commented:

Is there a GitHub issue to support metrics mode for Parquet? If not, can you create one? Either way, link it here.

@homar homar force-pushed the homar/cleanup_unused_delete_files_during_iceberg_optimize branch 2 times, most recently from 6168f68 to c242315 on June 4, 2022 07:23
@homar homar force-pushed the homar/cleanup_unused_delete_files_during_iceberg_optimize branch from c242315 to 582a1b1 on June 4, 2022 09:09
@findepi findepi requested a review from alexjo2144 June 4, 2022 19:11
@findepi findepi merged commit b385185 into trinodb:master Jun 8, 2022
@findepi findepi mentioned this pull request Jun 8, 2022
@github-actions github-actions bot added this to the 385 milestone Jun 8, 2022