Implement Iceberg OPTIMIZE #10497

Merged 4 commits on Jan 13, 2022

@findepi
Member

findepi commented Jan 7, 2022

No description provided.

@alexjo2144
Member

Just clarifying before I start reading this. This is specifically compaction of V1 tables which cannot contain positional or equality based delete markers?

@alexjo2144
Member

The SparkSQL procedure is called `rewrite_data_files`. Should we name this procedure to match? https://github.com/apache/iceberg/blob/master/site/docs/spark-procedures.md?plain=1#L247

@findepi
Member Author

findepi commented Jan 10, 2022

> This is specifically compaction of V1 tables which cannot contain positional or equality based delete markers?

Yes, but only because the reader doesn't support positional or equality-based delete markers today.

Once the reader has support for them, this should work with v2 tables.
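
To make the restriction concrete, here is a minimal sketch of a guard that rejects the operation for tables whose format version may carry delete markers the reader cannot apply. The class, constant, and method names are hypothetical and are not taken from this PR; the only facts assumed are the ones stated above (v1 tables cannot contain delete markers, and the reader does not yet support them).

```java
// Hypothetical sketch: names below are illustrative assumptions, not Trino's actual API.
final class OptimizeFormatGuardSketch {
    // Highest Iceberg format version whose data the reader is assumed to read correctly.
    // v1 tables cannot contain positional or equality-based delete markers.
    static final int MAX_SUPPORTED_FORMAT_VERSION = 1;

    private OptimizeFormatGuardSketch() {}

    // Throws if optimizing the table could silently ignore delete markers.
    static void checkCanOptimize(int tableFormatVersion) {
        if (tableFormatVersion > MAX_SUPPORTED_FORMAT_VERSION) {
            throw new UnsupportedOperationException(
                    "OPTIMIZE is not supported for format version " + tableFormatVersion
                            + "; the reader cannot apply delete markers yet");
        }
    }
}
```

Under this sketch, raising the constant would be the only change needed once the reader learns to apply delete markers.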

> The SparkSQL procedure is called `rewrite_data_files`. Should we name this procedure to match?

Thanks for the pointer. "Rewrite files" feels like a low-level description of what the operation does (today), whereas "optimize" describes (or hints at) the intent.

Integration tests rarely interact with Hadoop FS directly, so
`org.apache.hadoop.fs.Path` is uncommon. This allows importing
`java.nio.file.Path`.
newFiles.add(builder.build());
}

if (scannedFiles.isEmpty() && newFiles.isEmpty()) {
Member

Should we assert that we never get one list empty and the other not? That feels like a bug situation.

Member Author

The scanned file list may be non-empty while the resulting data is empty, if the input files were empty.
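
The cases discussed in this thread can be sketched as a small decision function. This is an illustration only: the class, enum, and method names are hypothetical, and treating "new files without scanned input" as an error is an assumption that goes beyond what the reviewed code shows.

```java
import java.util.List;

// Hypothetical sketch: names are illustrative and not taken from the PR.
final class OptimizeCommitSketch {
    enum Action { SKIP_COMMIT, REWRITE_FILES }

    private OptimizeCommitSketch() {}

    // Decide what to do once the optimize job has finished writing.
    static Action decide(List<String> scannedFiles, List<String> newFiles) {
        if (scannedFiles.isEmpty() && newFiles.isEmpty()) {
            // Nothing was read and nothing was written, so there is nothing to commit.
            return Action.SKIP_COMMIT;
        }
        if (scannedFiles.isEmpty()) {
            // Assumption: output without any scanned input cannot happen legitimately.
            throw new IllegalStateException("new files produced without scanned input");
        }
        // newFiles may legitimately be empty here, when every scanned file had no rows.
        return Action.REWRITE_FILES;
    }
}
```

Only one of the two "one empty, the other not" combinations is asserted against, because scanned-but-no-output is a valid result for empty input files.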

@findepi
Member Author

findepi commented Jan 13, 2022

CI #10583

@findepi findepi merged commit f0c67f0 into trinodb:master Jan 13, 2022
@findepi findepi deleted the findepi/iceberg-optimize branch January 13, 2022 08:09
@findepi findepi mentioned this pull request Jan 13, 2022
@github-actions github-actions bot added this to the 369 milestone Jan 13, 2022
3 participants