
Resolve Iceberg write path using LocationProvider #16609

Merged
merged 1 commit into prestodb:master on Aug 18, 2021

Conversation

jackye1995
Contributor

@jackye1995 jackye1995 commented Aug 13, 2021

This PR changes the Iceberg connector to write data files to paths defined by Iceberg's LocationProvider. This is especially important for cloud object storage users who leverage Iceberg's ObjectStorageLocationProvider, which writes data to hashed file paths to avoid request throttling.
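
For readers less familiar with these providers, here is a minimal sketch (not part of this PR) of what the default and object-storage location providers return, using Iceberg's public LocationProviders factory. The table location, property values, and the hashed paths in the comments are illustrative only.

import java.util.Map;

import org.apache.iceberg.LocationProviders;
import org.apache.iceberg.io.LocationProvider;

public final class LocationProviderDemo
{
    private LocationProviderDemo() {}

    public static void main(String[] args)
    {
        String tableLocation = "s3://bucket/warehouse/db/table";

        // Default provider: data files stay under the table root.
        LocationProvider defaults = LocationProviders.locationsFor(tableLocation, Map.of());
        System.out.println(defaults.newDataLocation("part-00000.parquet"));
        // e.g. s3://bucket/warehouse/db/table/data/part-00000.parquet

        // Object-storage provider: a hash component is injected into the path so writes
        // spread across object store prefixes and avoid request throttling.
        LocationProvider hashed = LocationProviders.locationsFor(tableLocation, Map.of(
                "write.object-storage.enabled", "true",
                "write.object-storage.path", "s3://bucket/warehouse/db/table/data"));
        System.out.println(hashed.newDataLocation("part-00000.parquet"));
        // e.g. s3://bucket/warehouse/db/table/data/<hash>/db/table/part-00000.parquet
    }
}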

Because this change allows Iceberg to write data to a location different from the table's root location, DROP TABLE would no longer clean up all files. So for now, if path-override Iceberg properties are present on a table, we block the drop table operation.
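
A rough sketch of the guard described above, assuming the connector checks the table's Iceberg properties before allowing a drop. The property list, method name, and exception type here are illustrative, not the merged code.

import java.util.List;
import java.util.Map;

public final class DropTableGuard
{
    // Iceberg properties that can redirect data or metadata outside the table root.
    private static final List<String> PATH_OVERRIDE_PROPERTIES = List.of(
            "write.object-storage.path",
            "write.folder-storage.path",
            "write.metadata.path");

    private DropTableGuard() {}

    public static void checkCanDropTable(String tableName, Map<String, String> icebergProperties)
    {
        for (String property : PATH_OVERRIDE_PROPERTIES) {
            if (icebergProperties.containsKey(property)) {
                // Data may live outside the table location, so deleting the table
                // directory recursively would leave files behind; block the drop instead.
                throw new UnsupportedOperationException(
                        "Cannot drop table " + tableName + ": path override property " + property + " is set");
            }
        }
    }
}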

I will open a follow-up PR to support creating and dropping such tables natively in Presto.

Test plan: basic unit tests pass. No new tests are added because we only need to verify that there is no change in behavior for existing write operations, and there is currently no way to create tables with path overrides directly through Presto. This change is mostly intended for users who create tables through Spark or Hive and want to read and write them through Presto.

== RELEASE NOTES ==

Iceberg Changes
* Iceberg data files are now written to paths defined by Iceberg's LocationProvider instead of the hard-coded table root directory

@zhenxiao @ChunxuTang @beinan @pettyjamesm

Member

@beinan beinan left a comment

Very helpful contribution, thanks! Looks good to me except for one clarification question. @ChunxuTang do you want to take a look as well? Thanks!

  this.fileWriterFactory = requireNonNull(fileWriterFactory, "fileWriterFactory is null");
  this.hdfsEnvironment = requireNonNull(hdfsEnvironment, "hdfsEnvironment is null");
  this.hdfsContext = requireNonNull(hdfsContext, "hdfsContext is null");
- this.jobConf = toJobConf(hdfsEnvironment.getConfiguration(hdfsContext, new Path(outputPath)));
+ this.jobConf = toJobConf(hdfsEnvironment.getConfiguration(hdfsContext, new Path(locationProvider.newDataLocation("data-file"))));
Member

@beinan beinan Aug 16, 2021

What does "data-file" mean for locationProvider.newDataLocation? Should we define a constant for this one? Are there any other options for this argument?

Contributor Author

@jackye1995 jackye1995 Aug 16, 2021

This is really just a dummy value used to generate an example data file path, which in turn is used to initialize a Hadoop configuration. We can move it to a static variable if you prefer, but it is only used here, which is why I did not do that.
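
As a small sketch of the alternative raised above, the placeholder could be pulled into a named constant; the constant name is made up here, and the surrounding fields come from the diff shown earlier.

// Placeholder file name; only used to derive a representative data file path below.
private static final String DUMMY_DATA_FILE_NAME = "data-file";

// The resulting path is not written to; it is only used to look up the Hadoop
// configuration for the filesystem this writer will target.
this.jobConf = toJobConf(hdfsEnvironment.getConfiguration(
        hdfsContext,
        new Path(locationProvider.newDataLocation(DUMMY_DATA_FILE_NAME))));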

Member

@beinan beinan Aug 17, 2021

got it, thank you for the explanation!

beinan approved these changes Aug 17, 2021
Member

@beinan beinan left a comment

lgtm

Contributor

@ChunxuTang ChunxuTang left a comment

LGTM.
@jackye1995 Thanks for your nice work!

@beinan beinan merged commit a14add2 into prestodb:master Aug 18, 2021
41 checks passed
@aweisberg aweisberg mentioned this pull request Aug 31, 2021