
Support for PathFilter in DirectoryLister #13511

Closed
bhasudha opened this issue Oct 8, 2019 · 8 comments

Comments

@bhasudha

bhasudha commented Oct 8, 2019

Hello!

Here is some context:
This PR allows getting FileSplits directly from HoodieInputFormat so that Hudi datasets can be queried. Currently this is integrated in loadPartition(partition) in BackgroundHiveSplitLoader, which is called for every Hive partition and internally returns the InputSplit[] for that partition by calling HoodieInputFormat.getSplits().

Some extra overheads observed with this implementation:

Current Uber-specific solution:
To address these, we took a compile-time dependency on Hudi and instantiated the HoodieTableMetadata once in the BackgroundHiveSplitLoader constructor, then leveraged Hudi library APIs to filter the partition files instead of calling HoodieInputFormat.getSplits(). This gave us a significant reduction in the number of NameNode calls on this path.

We are looking at generalizing this solution and wanted to get your thoughts on leveraging PathFilter (https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/PathFilter.html) to make this more generic and InputFormat-agnostic. Here is the proposed generic solution, which does NOT introduce a compile-time dependency on the Hudi library:

If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter) - we can load a PathFilter implementation (such as HoodieROTablePathFilter), configurable via HiveClientConfig. This can be instantiated once per BackgroundHiveSplitLoader object and passed along to directoryLister.list(). The new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths mapping, which is then readily available via HiveFileIterator.
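To make the shape of the proposed overload concrete, here is a minimal, self-contained sketch. The real Presto DirectoryLister works with Hadoop FileSystem/Path/file-status types; the local PathFilter stand-in and plain String paths below are simplifications for illustration, not the actual Presto or Hadoop signatures.

```java
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the proposed DirectoryLister overload. The local
// PathFilter stand-in and String paths are assumptions made to keep the
// example self-contained; they are not the real Presto/Hadoop types.
public class DirectoryListerSketch {

    // Mirrors org.apache.hadoop.fs.PathFilter#accept, but over String paths.
    interface PathFilter {
        boolean accept(String path);
    }

    // Existing behavior: every child path in the directory is returned.
    static List<String> list(List<String> childPaths) {
        return childPaths;
    }

    // Proposed overload: each candidate is vetted by the supplied filter, so a
    // Hudi-aware filter (e.g. HoodieROTablePathFilter) can drop non-latest
    // file versions before splits are generated.
    static List<String> list(List<String> childPaths, PathFilter filter) {
        return childPaths.stream()
                .filter(filter::accept)
                .collect(Collectors.toList());
    }
}
```

Because the filter is just an extra parameter, callers that pass no filter keep today's behavior unchanged.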

For additional reference,
Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via the Hadoop Configuration, like this: sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).

Please let me know your thoughts/concerns. If this seems okay, I can send an implementation.

Thanks!

@shixuan-fan
Contributor

Sounds reasonable to me. What do you think? @wenleix @arhimondr

@arhimondr
Member

@bhasudha

> further, the overhead of listing Hudi metadata everytime for each partition can be removed if HoodieTableMetadata can be loaded just once per table query and reused across loadPartition() call (Since the table metadata is not expected to change within the lifetime of a query)

Could you please elaborate more on what the HoodieTableMetadata is? Can it be partition specific?

> If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter)

That sounds reasonable

> we can load a PathFilter implementation (such as HoodieROTablePathFilter) configurable via HiveClientConfig

The HiveClientConfig is a static configuration applicable for all tables in all formats for all queries. I wonder if the filter can somehow be applied only for Hoodie tables?

> For additional reference,
> Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via HadoopConfiguration like this - sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).

For spark the sparkContext is a per query entity. That's why it is possible to configure it that way.

> This new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths

What about consistency? What if the partition / table has changed?

@bhasudha
Author

Thanks for the questions @arhimondr. I understand your concerns; I have replied inline.

@bhasudha

further, the overhead of listing Hudi metadata everytime for each partition can be removed if HoodieTableMetadata can be loaded just once per table query and reused across loadPartition() call (Since the table metadata is not expected to change within the lifetime of a query)

Could you please elaborate more on what the HoodieTableMetadata is? Can it be partition specific?

Hoodie table metadata records metadata about the Hudi table, including commits, savepoints, compactions, cleanups, rollbacks, etc. It is not partition specific. The HoodieTableMetaClient provides access to this metadata; more details here. In the current context, since BackgroundHiveSplitLoader's loadPartition() calls HoodieInputFormat.getSplits(), this HoodieTableMetaClient is created multiple times for the same table (once for every partition loaded via loadPartition()) instead of just once per query per table.

> > If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter)
>
> That sounds reasonable

> > we can load a PathFilter implementation (such as HoodieROTablePathFilter) configurable via HiveClientConfig
>
> The HiveClientConfig is a static configuration applicable for all tables in all formats for all queries. I wonder if the filter can somehow be applied only for Hoodie tables?

The HoodieROTablePathFilter is implemented in such a way that if a partition does not belong to a Hoodie table, public boolean accept(Path path) returns true. This takes care of all other table format types. It is also performant: it caches the parent folder of each path, so future lookups of files under the same path do not have to re-check whether the file belongs to a Hoodie table. Since we are instantiating the PathFilter instance per BackgroundHiveSplitLoader object, we don't have to worry about cache lifetime, invalidation, etc.; after the query completes, the PathFilter instance is garbage collected. Does this address your question? Please suggest if you have any other recommendation here.
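A rough illustration of that caching behavior (the real HoodieROTablePathFilter lives in the Hudi codebase; the isHoodieDir predicate and latestFiles set below are hypothetical stand-ins for Hudi's table detection and its latest-data-files lookup):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Sketch of a caching PathFilter in the spirit described above. isHoodieDir
// and latestFiles are hypothetical stand-ins; this is NOT the real
// HoodieROTablePathFilter implementation.
public class CachingPathFilterSketch {
    // Cache: parent directory -> does it belong to a Hoodie table?
    private final Map<String, Boolean> parentIsHoodie = new ConcurrentHashMap<>();
    private final Predicate<String> isHoodieDir; // stand-in for table detection
    private final Set<String> latestFiles;       // stand-in for latest-data-files lookup

    public CachingPathFilterSketch(Predicate<String> isHoodieDir, Set<String> latestFiles) {
        this.isHoodieDir = isHoodieDir;
        this.latestFiles = latestFiles;
    }

    public boolean accept(String path) {
        String parent = path.substring(0, Math.max(path.lastIndexOf('/'), 0));
        // The Hoodie-or-not decision is computed once per directory and cached,
        // so later files under the same partition skip the check entirely.
        boolean hoodie = parentIsHoodie.computeIfAbsent(parent, isHoodieDir::test);
        // Non-Hoodie paths always pass, leaving other table formats untouched;
        // Hoodie paths pass only if they are among the latest data files.
        return !hoodie || latestFiles.contains(path);
    }
}
```

Since one instance lives per BackgroundHiveSplitLoader, the cache naturally has query scope and needs no explicit invalidation.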

> > For additional reference,
> > Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via HadoopConfiguration like this - sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).
>
> For spark the sparkContext is a per query entity. That's why it is possible to configure it that way.

Yeah, understood. Instantiating the PathFilter object every time a BackgroundHiveSplitLoader is instantiated (which happens in HiveSplitManager.getSplits) takes care of the per-query behavior. Please correct me if my assumption is wrong.

> > This new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths
>
> What about consistency? What if the partition / table has changed?

The Presto query's view of a Hoodie table is limited to commits that have already completed and is therefore immutable. The HoodieROTablePathFilter cache will also point to these immutable commits.
This is ensured by the getLatestDataFiles() API from Hudi (invocation of getLatestDataFiles() and implementation). This API filters out inflight files and only picks those already marked as complete, so consistency is not an issue. Any changes to the partitions/table will be visible to a new query issued for the same table in the future.
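The snapshot rule can be illustrated with a toy model. The DataFile class and latestDataFile() method below are hypothetical stand-ins for Hudi's timeline and getLatestDataFiles(), not the actual Hudi API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model of the snapshot rule described above. DataFile and
// latestDataFile() are hypothetical stand-ins, not the real Hudi API.
public class CommitSnapshotSketch {
    static final class DataFile {
        final String path;
        final String commitTime; // monotonically increasing, e.g. "20191008093000"
        final boolean completed; // false => the commit is still inflight
        DataFile(String path, String commitTime, boolean completed) {
            this.path = path;
            this.commitTime = commitTime;
            this.completed = completed;
        }
    }

    // Within one file group, pick the newest version whose commit completed.
    // Inflight writes are invisible, so the query's view is an immutable
    // snapshot even if the table changes while the query runs.
    static Optional<String> latestDataFile(List<DataFile> fileGroup) {
        return fileGroup.stream()
                .filter(f -> f.completed)
                .max(Comparator.comparing(f -> f.commitTime))
                .map(f -> f.path);
    }
}
```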

@bhasudha
Author

bhasudha commented Oct 23, 2019

@arhimondr, @shishunzhong, @wenleix I sent a PR towards this. Please take a look when you can.

@wenleix
Contributor

wenleix commented Oct 30, 2019

@bhasudha: A quick tip: when you paste lines from a GitHub file, press "y" first to include the commit id; this creates a permanent link: https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files

Otherwise, the file might change and the same line might refer to a different thing ;)

@wenleix
Contributor

wenleix commented Oct 30, 2019

@bhasudha , @arhimondr

This is probably off-topic, but in the long term do we want to have a HudiConnector? I imagine Hudi's long-term vision is to evolve into something similar to Iceberg to support better metadata management. Thanks!

@bhasudha
Author

> @bhasudha : A quick tip, when paste lines to Github file, try to click "y" to include commit id, this allows creates permanent link: https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files
>
> Otherwise, the file might change and the same line might refer to different thing ;)

Thanks for the tip! Didn't know about it.

@bhasudha
Author

> @bhasudha , @arhimondr
>
> This is probably off-topic, but in a long term do we want to have HudiConnector since I imagine its long term vision is to evolve as something similar to IceBerg to support better metadata management? Thanks!

@wenleix. Hudi already has a TimelineService with features similar to Iceberg's for better metadata management. We might have to support a separate Presto Hudi connector down the line, leveraging that and also supporting real-time views. I am happy to create a separate thread on this.
