Don't truncate min/max for iceberg delete files #12671
Conversation
I used the FULL_METRICS_CONFIG metrics mode for Parquet to allow storing full values in min/max. It was not possible to do the same for ORC, so I decided to put a hard limit there.
 case ORC:
-    return createOrcWriter(FULL_METRICS_CONFIG, outputPath, POSITION_DELETE_SCHEMA, jobConf, session, storageProperties);
+    return createOrcWriter(FULL_METRICS_CONFIG, outputPath, POSITION_DELETE_SCHEMA, jobConf, session, storageProperties, stringStatisticsLimit);
We never want to truncate the stats path, right? So could we do
DataSize.ofBytes(Long.MAX_VALUE)
here?
hmm that makes sense, I will change it
Actually it has to be Integer.MAX_VALUE
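A minimal sketch of why `Long.MAX_VALUE` does not work here, assuming (as is common) that the writer eventually narrows the byte count to an `int`: casting `Long.MAX_VALUE` to `int` silently overflows, so the limit must be expressed as at most `Integer.MAX_VALUE` (the hypothetical `toIntLimit` helper below is for illustration only, not the actual writer code):

```java
public class StatsLimitOverflow {
    // Hypothetical helper: clamp a long byte count into an int-sized limit
    // instead of letting the narrowing cast overflow.
    static int toIntLimit(long bytes) {
        return (int) Math.min(bytes, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Raw narrowing cast keeps only the low 32 bits and silently overflows.
        System.out.println((int) Long.MAX_VALUE);       // -1
        // Clamping preserves the intended "effectively unlimited" semantics.
        System.out.println(toIntLimit(Long.MAX_VALUE)); // 2147483647
    }
}
```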
 case ORC:
-    return createOrcWriter(metricsConfig, outputPath, icebergSchema, jobConf, session, storageProperties);
+    return createOrcWriter(metricsConfig, outputPath, icebergSchema, jobConf, session, storageProperties, Optional.empty());
Maybe cleaner to not have the stats limit be optional in createOrcWriter
and instead get it from the session here, so we always have a value
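The suggested refactor can be sketched as follows, with hypothetical stand-in types (`Session`, the two `limitFrom*` helpers) since the real `createOrcWriter` and `ConnectorSession` are not shown here: instead of threading an `Optional` through and defaulting inside the writer, the caller resolves a concrete value from the session up front.

```java
import java.util.Optional;

public class WriterConfigSketch {
    // Hypothetical stand-in for the real session type.
    static class Session {
        final long stringStatisticsLimitBytes;
        Session(long bytes) { this.stringStatisticsLimitBytes = bytes; }
    }

    // Before (sketch): callers pass Optional.empty() and the writer
    // falls back to a default internally.
    static long limitFromOptional(Optional<Long> limit, long defaultLimit) {
        return limit.orElse(defaultLimit);
    }

    // After (sketch): the caller resolves the limit from the session,
    // so the writer always receives a concrete value.
    static long limitFromSession(Session session) {
        return session.stringStatisticsLimitBytes;
    }

    public static void main(String[] args) {
        Session session = new Session(64);
        System.out.println(limitFromOptional(Optional.empty(), 64));
        System.out.println(limitFromSession(session));
    }
}
```

The design point is that an always-present parameter removes one hidden defaulting path from the writer itself.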
ok!
@@ -132,9 +133,9 @@ public IcebergFileWriter createDataFileWriter(
     switch (fileFormat) {
         case PARQUET:
             // TODO use metricsConfig
-            return createParquetWriter(outputPath, icebergSchema, jobConf, session, hdfsContext);
+            return createParquetWriter(MetricsConfig.getDefault(), outputPath, icebergSchema, jobConf, session, hdfsContext);
Is there a GH issue to support metrics mode for Parquet? If not, can you make one?
Either way, link it here.
Description
Previously, min/max statistics were truncated to very short values, which could have a very bad impact on performance.
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: