Add option to make Hive plugin respect inputFormat.getSplits() #6969

Closed · @vinothchandar wants to merge 1 commit

Conversation

@vinothchandar (Collaborator)

Add option to make Hive plugin respect inputFormat.getSplits() to obtain files in a partition

Happy to add any edits as suggested.

- Contributing back a feature added at Uber, to support queries on fresher data

@vinothchandar (Collaborator, Author)

cc @zhenxiao FYI

```java
if (stopped) {
    return;
}
if (addSplitsToSource(targetSplits, partitionName, schema, partitionKeys, effectivePredicate, partition.getColumnCoercions())) {
```

Contributor

Is this simply moving the above code into a method? It's hard to tell if anything changed.

@vinothchandar (Collaborator, Author)

Yes, I just moved the code block into a method to reduce repeated LOC.

@electrum (Contributor)

I've been thinking about this, and it isn't actually respecting the returned splits. It's doing something special:

  • It assumes the input split is a FileSplit because it downcasts, extracts the path, then re-creates a FileSplit in HiveUtil.createRecordReader().
  • It assumes the split path is a file (not a directory), because it calls FileSystem.getFileBlockLocations() on that path. (This call is safe to use on a directory, but is useless.)

I'd rather not add configuration for this, since extending the Hive plugin by dropping in additional jars is not something we encourage or support.

To make your case work, we could either hard-code support for this input format, or add a magic annotation like @UseFileSplitsFromInputFormat to the input format that we'd look for by simple name (so the annotation package wouldn't matter).
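
For illustration, here is a minimal sketch of what the annotation approach could look like. The annotation and the helper class below are assumptions for the sake of the example, not Presto's actual implementation:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

import org.apache.hadoop.mapred.InputFormat;

// Marker annotation a custom InputFormat would carry to tell the Hive
// plugin to trust the file splits returned by getSplits().
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UseFileSplitsFromInputFormat {}

final class InputFormatAnnotations
{
    private InputFormatAnnotations() {}

    // Match by simple name, so the annotation's package doesn't matter and
    // the input format jar needs no dependency on Presto.
    static boolean shouldUseFileSplitsFromInputFormat(InputFormat<?, ?> inputFormat)
    {
        return Arrays.stream(inputFormat.getClass().getAnnotations())
                .map(annotation -> annotation.annotationType().getSimpleName())
                .anyMatch("UseFileSplitsFromInputFormat"::equals);
    }
}
```

Matching on the simple name means an input format can declare its own copy of the annotation without depending on any Presto artifact.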

@vinothchandar (Collaborator, Author) commented Dec 28, 2016

Thanks for the feedback, it helps!

Some background context: Hoodie implements incremental upserts to Hive tables by versioning files internally. At Uber, we register the Hive table with HoodieInputFormat (a subclass of FileInputFormat, hence the FileSplit downcasting), and in its getSplits() we filter out old versions and pick the latest version of each file for the query to use.
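
To make the shape of this concrete, here is a hypothetical sketch (not Hoodie's actual code) of an input format that filters splits down to the latest file versions; latestVersionPaths() is an assumed hook standing in for Hoodie's internal version bookkeeping:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public abstract class LatestVersionInputFormat<K, V>
        extends FileInputFormat<K, V>
{
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits)
            throws IOException
    {
        // Let FileInputFormat enumerate every file split, then keep only
        // the splits whose path is the latest version of its file.
        InputSplit[] splits = super.getSplits(job, numSplits);
        Set<Path> latestVersions = latestVersionPaths(job);
        return Arrays.stream(splits)
                .filter(split -> split instanceof FileSplit
                        && latestVersions.contains(((FileSplit) split).getPath()))
                .toArray(InputSplit[]::new);
    }

    // Assumed hook: resolve which file paths are the newest versions.
    protected abstract Set<Path> latestVersionPaths(JobConf job)
            throws IOException;
}
```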

I have some questions around the two approaches you proposed.

we could either hard-code support for this input format

We avoided this at Uber, since we did not want to add a Hoodie dependency to Presto. I'd still prefer to keep it that way.

add a magic annotation like @UseFileSplitsFromInputFormat to the input format

I like this better, since it is self-descriptive and provides general support for custom InputFormats in Presto.

Two follow-ups:

  1. While we are at it, shall we also add a new annotation, say @UseRecordReaderFromInputFormat, that will make Presto fall back to calling ipf.recordReader() later on?
   • Right now, it does not do that by default. This is not needed by Hoodie right now; it's just for completeness.
   • Spark, for example, has a config that lets us fall back entirely to ipf.getSplits() and ipf.createRecordReader().
  2. We still need to work only with subclasses of FileInputFormat if we are expecting FileSplit (which is what createHiveSplits seems to assume, since it needs block locations etc.). So can we just add another check/assert that the InputFormat is an instance of FileInputFormat?

@electrum (Contributor)

While we are at it, shall we also add a new annotation say @UseRecordReaderFromInputFormat that will make Presto fall back to calling ipf.recordReader() later on?

I'm probably misunderstanding, but we already do that in HiveUtil.createRecordReader(), which is called by GenericHiveRecordCursorProvider.

We still need to work on only subclasses of FileInputFormat, if we are expecting FileSplit

The cast to FileSplit will fail, which is sufficient, since that's the actual requirement. Any InputFormat is fine as long as it returns FileSplit. Also, being a subclass of FileInputFormat doesn't mean it will work. For example, MultiFileInputFormat is a subclass, but it doesn't return FileSplit.
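
To illustrate the point (a sketch only, not Presto's actual code), the consumer can validate each split directly rather than inspecting the input format's class hierarchy:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

final class SplitValidation
{
    private SplitValidation() {}

    // Validate the actual requirement (every split is a FileSplit) instead
    // of requiring the InputFormat to subclass FileInputFormat.
    static FileSplit[] getFileSplits(InputFormat<?, ?> inputFormat, JobConf jobConf)
            throws IOException
    {
        InputSplit[] splits = inputFormat.getSplits(jobConf, 0);
        FileSplit[] fileSplits = new FileSplit[splits.length];
        for (int i = 0; i < splits.length; i++) {
            if (!(splits[i] instanceof FileSplit)) {
                throw new IllegalArgumentException(String.format(
                        "%s returned an unsupported split type: %s",
                        inputFormat.getClass().getName(),
                        splits[i].getClass().getName()));
            }
            fileSplits[i] = (FileSplit) splits[i];
        }
        return fileSplits;
    }
}
```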

@vinothchandar (Collaborator, Author)

Fair point. I will open a new PR using the annotation approach as discussed, and abandon this one.

but we already do that in HiveUtil.createRecordReader(), which is called by GenericHiveRecordCursorProvider

This is not true for, say, Parquet tables; they use ParquetRecordCursorProvider, which does its own thing. But from the code, I think if we register the table with a serde other than the ones below, it should call the record reader as you mentioned:

```java
private static final Set<String> PARQUET_SERDE_CLASS_NAMES = ImmutableSet.<String>builder()
        .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
        .add("parquet.hive.serde.ParquetHiveSerDe")
        .build();
```

We can solve that case this way, so I will just focus on the @UseFileSplitsFromInputFormat annotation for now.

Thanks for your assistance, @electrum.
