Add option to make Hive plugin respect inputFormat.getSplits() #6969

Closed · @vinothchandar wants to merge 1 commit

Conversation

@vinothchandar (Collaborator)

Add option to make Hive plugin respect inputFormat.getSplits() to obtain files in a partition

Happy to add any edits as suggested.

- Contributing back a feature added at Uber, to support queries on fresher data

@vinothchandar (Collaborator, Author)

cc @zhenxiao FYI

```java
if (stopped) {
    return;
}
if (addSplitsToSource(targetSplits, partitionName, schema, partitionKeys, effectivePredicate, partition.getColumnCoercions())) {
```

Contributor

Is this simply moving the above code into a method? It's hard to tell if anything changed.

@vinothchandar (Collaborator, Author)

Yes, I just moved the code block into a method to reduce repeated LOC.

@electrum (Contributor)

I've been thinking about this, and it isn't actually respecting the returned splits. It's doing something special:

  • It assumes the input split is a FileSplit because it downcasts, extracts the path, then re-creates a FileSplit in HiveUtil.createRecordReader().
  • It assumes the split path is a file (not a directory), because it calls FileSystem.getFileBlockLocations() on that path. (This call is safe to use on a directory, but is useless.)

I'd rather not add configuration for this, since extending the Hive plugin by dropping in additional jars is not something we encourage or support.

To make your case work, we could either hard-code support for this input format, or add a magic annotation like @UseFileSplitsFromInputFormat to the input format that we'd look for by simple name (so the annotation package wouldn't matter).
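
For illustration, here is a minimal sketch of what the annotation approach could look like. The annotation and the helper class below are assumptions for the sake of the example, not Presto's actual implementation:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

import org.apache.hadoop.mapred.InputFormat;

// Marker annotation a custom InputFormat would carry to tell the Hive
// plugin to trust the file splits returned by getSplits().
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UseFileSplitsFromInputFormat {}

final class InputFormatAnnotations
{
    private InputFormatAnnotations() {}

    // Match by simple name, so the annotation's package doesn't matter and
    // the input format jar needs no dependency on Presto.
    static boolean shouldUseFileSplitsFromInputFormat(InputFormat<?, ?> inputFormat)
    {
        return Arrays.stream(inputFormat.getClass().getAnnotations())
                .map(annotation -> annotation.annotationType().getSimpleName())
                .anyMatch("UseFileSplitsFromInputFormat"::equals);
    }
}
```

Matching on the simple name means an input format can declare its own copy of the annotation without depending on any Presto artifact.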

@vinothchandar (Collaborator, Author) commented Dec 28, 2016

Thanks for the feedback, it helps!

Some background context: Hoodie implements incremental upserts to Hive tables by versioning files internally. At Uber, we register the Hive table with HoodieInputFormat (a subclass of FileInputFormat, hence the FileSplit downcasting), and in its getSplits() we filter out old versions and pick the latest version of each file for the query to use.
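
To make the shape of this concrete, here is a hypothetical sketch (not Hoodie's actual code) of an input format that filters splits down to the latest file versions; latestVersionPaths() is an assumed hook standing in for Hoodie's internal version bookkeeping:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public abstract class LatestVersionInputFormat<K, V>
        extends FileInputFormat<K, V>
{
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits)
            throws IOException
    {
        // Let FileInputFormat enumerate every file split, then keep only
        // the splits whose path is the latest version of its file.
        InputSplit[] splits = super.getSplits(job, numSplits);
        Set<Path> latestVersions = latestVersionPaths(job);
        return Arrays.stream(splits)
                .filter(split -> split instanceof FileSplit
                        && latestVersions.contains(((FileSplit) split).getPath()))
                .toArray(InputSplit[]::new);
    }

    // Assumed hook: resolve which file paths are the newest versions.
    protected abstract Set<Path> latestVersionPaths(JobConf job)
            throws IOException;
}
```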

I have some questions around the two approaches you proposed.

we could either hard-code support for this input format

We avoided this at Uber, since we did not want to add a Hoodie dependency to Presto. I'd still prefer to keep it that way.

add a magic annotation like @UseFileSplitsFromInputFormat to the input format

I like this better, since it is self-descriptive and provides general support for custom InputFormats in Presto.

Two follow-ups:

  1. While we are at it, shall we also add a new annotation, say @UseRecordReaderFromInputFormat, that will make Presto fall back to calling ipf.recordReader() later on?
   • Right now, it does not do that by default. This is not needed by Hoodie right now; it's just for completeness.
   • Spark, for example, has a config that lets us fall back entirely to ipf.getSplits() and ipf.createRecordReader().
  2. We still need to work only with subclasses of FileInputFormat if we are expecting FileSplit (which is what createHiveSplits seems to assume, since it needs block locations etc.). So can we just add another check/assert that the InputFormat is an instance of FileInputFormat?

@electrum (Contributor)

While we are at it, shall we also add a new annotation say @UseRecordReaderFromInputFormat that will make Presto fall back to calling ipf.recordReader() later on?

I'm probably misunderstanding, but we already do that in HiveUtil.createRecordReader(), which is called by GenericHiveRecordCursorProvider.

We still need to work on only subclasses of FileInputFormat, if we are expecting FileSplit

The cast to FileSplit will fail, which is sufficient, since that's the actual requirement. Any InputFormat is fine as long as it returns FileSplit. Also, being a subclass of FileInputFormat doesn't mean it will work. For example, MultiFileInputFormat is a subclass, but it doesn't return FileSplit.
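
To illustrate the point (a sketch only, not Presto's actual code), the consumer can validate each split directly rather than inspecting the input format's class hierarchy:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

final class SplitValidation
{
    private SplitValidation() {}

    // Validate the actual requirement (every split is a FileSplit) instead
    // of requiring the InputFormat to subclass FileInputFormat.
    static FileSplit[] getFileSplits(InputFormat<?, ?> inputFormat, JobConf jobConf)
            throws IOException
    {
        InputSplit[] splits = inputFormat.getSplits(jobConf, 0);
        FileSplit[] fileSplits = new FileSplit[splits.length];
        for (int i = 0; i < splits.length; i++) {
            if (!(splits[i] instanceof FileSplit)) {
                throw new IllegalArgumentException(String.format(
                        "%s returned an unsupported split type: %s",
                        inputFormat.getClass().getName(),
                        splits[i].getClass().getName()));
            }
            fileSplits[i] = (FileSplit) splits[i];
        }
        return fileSplits;
    }
}
```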

@vinothchandar (Collaborator, Author)

Fair point. I will open a new PR using the annotation approach as discussed, and abandon this one.

but we already do that in HiveUtil.createRecordReader(), which is called by GenericHiveRecordCursorProvider

This is not true for, say, Parquet tables; they use ParquetRecordCursorProvider, which does its own thing. But from the code, I think if we register the table with a serde other than the ones below, it should call the record reader as you mentioned:

```java
private static final Set<String> PARQUET_SERDE_CLASS_NAMES = ImmutableSet.<String>builder()
        .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
        .add("parquet.hive.serde.ParquetHiveSerDe")
        .build();
```

We can solve that case this way, so I will just focus on the @UseFileSplitsFromInputFormat annotation for now.

Thanks for your assistance, @electrum.
