Add option to make Hive plugin respect inputFormat.getSplits() #6969
…ain files in a partition - Contributing back a feature added at Uber to support queries on fresher data
cc @zhenxiao FYI
if (stopped) {
    return;
}
if (addSplitsToSource(targetSplits, partitionName, schema, partitionKeys, effectivePredicate, partition.getColumnCoercions())) {
Is this simply moving the above code into a method? It's hard to tell if anything changed.
Yes, I just moved the code block into a method to reduce repeated lines of code.
I've been thinking about this, and it isn't actually respecting the returned splits. It's doing something special:
I'd rather not add configuration for this, since extending the Hive plugin by dropping in additional jars is not something we encourage or support. To make your case work, we could either hard-code support for this input format, or add a magic annotation like
Thanks for the feedback, it helps! Some background context: I have some questions around the two approaches you proposed.
We avoided this at Uber, since we did not want to add a Hoodie dependency to Presto. I'd still prefer to keep it that way.
I like this better, since it is self-descriptive and provides general support for custom input formats in Presto. Two follow-ups:
I'm probably misunderstanding, but we already do that in
The cast to
Fair point. I'll open a new PR using the annotation approach as discussed and abandon this one.
This is not true for, say, Parquet tables: they use ParquetRecordCursorProvider, which does its own thing. But from the code, I think that if we register the table with a different serde than the ones below, it should call the recordReader as you mentioned.
We can solve it that way, so I will just focus on the @UseFileSplitsFromInputFormat annotation for now. Thanks for your assistance, @electrum.
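For readers following the annotation idea, here is a minimal sketch of how such a check might look. It is not the code from this PR or from Presto. The annotation definition, the `AnnotationSketch` class, the two stand-in format classes, and the `shouldUseInputFormatSplits` helper are all hypothetical. The sketch assumes the annotation is retained at runtime and matches it by simple name, so the checking side would not need a compile-time dependency on the jar that defines it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

public class AnnotationSketch
{
    // Hypothetical marker annotation. The name comes from the thread;
    // the definition (runtime retention, type target) is an assumption.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface UseFileSplitsFromInputFormat {}

    // Stand-ins for Hadoop InputFormat implementations, so the sketch
    // compiles without Hadoop on the classpath.
    @UseFileSplitsFromInputFormat
    public static class CustomInputFormat {}

    public static class PlainInputFormat {}

    // Returns true when the class carries an annotation with the expected
    // simple name. Matching by name (an assumption, not confirmed by the
    // thread) avoids depending on the library that declares the annotation.
    public static boolean shouldUseInputFormatSplits(Class<?> inputFormatClass)
    {
        return Arrays.stream(inputFormatClass.getAnnotations())
                .map(annotation -> annotation.annotationType().getSimpleName())
                .anyMatch("UseFileSplitsFromInputFormat"::equals);
    }

    public static void main(String[] args)
    {
        System.out.println(shouldUseInputFormatSplits(CustomInputFormat.class));
        System.out.println(shouldUseInputFormatSplits(PlainInputFormat.class));
    }
}
```

In this sketch the split loader would call the input format's own getSplits() only when the check returns true, and fall back to listing the partition directory otherwise.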
Add option to make Hive plugin respect inputFormat.getSplits() to obtain files in a partition
Happy to add any edits as suggested.