
Support for PathFilter in DirectoryLister #13511

Closed
bhasudha opened this issue Oct 8, 2019 · 8 comments

Comments

@bhasudha

bhasudha commented Oct 8, 2019

Hello!

Here is some context:
This PR allows getting FileSplits directly from HoodieInputFormat so that Hudi datasets can be queried. Currently this is integrated in loadPartition(partition) in BackgroundHiveSplitLoader, which is called for every Hive partition and internally returns the InputSplit[] for that partition by calling HoodieInputFormat.getSplits().

Some extra overheads observed with this implementation:

Current Uber-specific solution:
To address these, we took a compile-time dependency on Hudi and instantiated the HoodieTableMetadata once in the BackgroundHiveSplitLoader constructor, then leveraged Hudi library APIs to filter the partition files instead of calling HoodieInputFormat.getSplits(). This gave us a significant reduction in the number of NameNode calls on this path.

We are looking at generalizing this solution and wanted to get your thoughts on leveraging PathFilter (https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/PathFilter.html) to make this more generic and InputFormat-agnostic. Here is the proposed generic solution, which does NOT introduce a compile-time dependency on the Hudi library:

If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter) - we can load a PathFilter implementation (such as HoodieROTablePathFilter), configurable via HiveClientConfig. This can be instantiated once per BackgroundHiveSplitLoader object and passed along to directoryLister.list(). The new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths mapping, which is then readily available via HiveFileIterator.
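To make the shape of the proposed overload concrete, here is a minimal, self-contained sketch. The real Presto DirectoryLister works with Hadoop FileSystem/Path/file-status types; the local PathFilter stand-in and plain String paths below are simplifications for illustration, not the actual Presto or Hadoop signatures.

```java
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the proposed DirectoryLister overload. The local
// PathFilter stand-in and String paths are assumptions made to keep the
// example self-contained; they are not the real Presto/Hadoop types.
public class DirectoryListerSketch {

    // Mirrors org.apache.hadoop.fs.PathFilter#accept, but over String paths.
    interface PathFilter {
        boolean accept(String path);
    }

    // Existing behavior: every child path in the directory is returned.
    static List<String> list(List<String> childPaths) {
        return childPaths;
    }

    // Proposed overload: each candidate is vetted by the supplied filter, so a
    // Hudi-aware filter (e.g. HoodieROTablePathFilter) can drop non-latest
    // file versions before splits are generated.
    static List<String> list(List<String> childPaths, PathFilter filter) {
        return childPaths.stream()
                .filter(filter::accept)
                .collect(Collectors.toList());
    }
}
```

Because the filter is just an extra parameter, callers that pass no filter keep today's behavior unchanged.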

For additional reference,
Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via the Hadoop Configuration, like this: sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).

Please let me know your thoughts/concerns. If this seems okay, I can send an implementation.

Thanks!

@shixuan-fan
Contributor

Sounds reasonable to me. What do you think? @wenleix @arhimondr

@arhimondr
Member

@bhasudha

> further, the overhead of listing Hudi metadata everytime for each partition can be removed if HoodieTableMetadata can be loaded just once per table query and reused across loadPartition() call (Since the table metadata is not expected to change within the lifetime of a query)

Could you please elaborate more on what the HoodieTableMetadata is? Can it be partition specific?

> If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter)

That sounds reasonable

> we can load a PathFilter implementation (such as HoodieROTablePathFilter) configurable via HiveClientConfig

The HiveClientConfig is a static configuration applicable for all tables in all formats for all queries. I wonder if the filter can somehow be applied only for Hoodie tables?

> For additional reference,
> Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via HadoopConfiguration like this - sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).

For spark the sparkContext is a per query entity. That's why it is possible to configure it that way.

> This new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths

What about consistency? What if the partition / table has changed?

@bhasudha
Author

Thanks for the questions @arhimondr. I understand your concerns; I have replied inline.

@bhasudha

further, the overhead of listing Hudi metadata everytime for each partition can be removed if HoodieTableMetadata can be loaded just once per table query and reused across loadPartition() call (Since the table metadata is not expected to change within the lifetime of a query)

Could you please elaborate more on what the HoodieTableMetadata is? Can it be partition specific?

Hoodie table metadata records metadata about the Hudi table, including commits, savepoints, compactions, cleanups, rollbacks, etc. It is not partition specific. The HoodieTableMetaClient provides access to this metadata; more details here. In the current context, since BackgroundHiveSplitLoader's loadPartition() calls HoodieInputFormat.getSplits(), this HoodieTableMetaClient is created multiple times for the same table (once for every partition loaded via loadPartition()) instead of just once per query per table.

> > If DirectoryLister can expose another API - list(FileSystem fs, Table table, Path path, PathFilter pathFilter)
>
> That sounds reasonable

> > we can load a PathFilter implementation (such as HoodieROTablePathFilter) configurable via HiveClientConfig
>
> The HiveClientConfig is a static configuration applicable for all tables in all formats for all queries. I wonder if the filter can somehow be applied only for Hoodie tables?

The HoodieROTablePathFilter is implemented in such a way that if a partition does not belong to a Hoodie table, public boolean accept(Path path) returns true. This takes care of all other table format types. It is also performant: it caches the parent folder of each path, so future lookups of files under the same path do not have to re-check whether the file belongs to a Hoodie table. Since we are instantiating the PathFilter instance per BackgroundHiveSplitLoader object, we don't have to worry about cache lifetime, invalidation, etc.; after the query completes, the PathFilter instance is garbage collected. Does this address your question? Please suggest if you have any other recommendation here.
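A rough illustration of that caching behavior (the real HoodieROTablePathFilter lives in the Hudi codebase; the isHoodieDir predicate and latestFiles set below are hypothetical stand-ins for Hudi's table detection and its latest-data-files lookup):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Sketch of a caching PathFilter in the spirit described above. isHoodieDir
// and latestFiles are hypothetical stand-ins; this is NOT the real
// HoodieROTablePathFilter implementation.
public class CachingPathFilterSketch {
    // Cache: parent directory -> does it belong to a Hoodie table?
    private final Map<String, Boolean> parentIsHoodie = new ConcurrentHashMap<>();
    private final Predicate<String> isHoodieDir; // stand-in for table detection
    private final Set<String> latestFiles;       // stand-in for latest-data-files lookup

    public CachingPathFilterSketch(Predicate<String> isHoodieDir, Set<String> latestFiles) {
        this.isHoodieDir = isHoodieDir;
        this.latestFiles = latestFiles;
    }

    public boolean accept(String path) {
        String parent = path.substring(0, Math.max(path.lastIndexOf('/'), 0));
        // The Hoodie-or-not decision is computed once per directory and cached,
        // so later files under the same partition skip the check entirely.
        boolean hoodie = parentIsHoodie.computeIfAbsent(parent, isHoodieDir::test);
        // Non-Hoodie paths always pass, leaving other table formats untouched;
        // Hoodie paths pass only if they are among the latest data files.
        return !hoodie || latestFiles.contains(path);
    }
}
```

Since one instance lives per BackgroundHiveSplitLoader, the cache naturally has query scope and needs no explicit invalidation.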

> > For additional reference,
> > Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via HadoopConfiguration like this - sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).
>
> For spark the sparkContext is a per query entity. That's why it is possible to configure it that way.

Yeah, understood. Instantiating the PathFilter object every time a BackgroundHiveSplitLoader is instantiated (which happens in HiveSplitManager.getSplits) takes care of the per-query behavior. Please correct me if my assumption is wrong.

> > This new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths
>
> What about consistency? What if the partition / table has changed?

The Presto query's view of a Hoodie table is limited to commits that have already completed and is therefore immutable. The HoodieROTablePathFilter cache will also point to these immutable commits.
This is ensured by the getLatestDataFiles() API from Hudi (invocation of getLatestDataFiles() and implementation). This API filters out inflight files and only picks those already marked as complete, so consistency is not an issue. Any changes to the partitions/table will be visible to a new query issued for the same table in the future.
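The snapshot rule can be illustrated with a toy model. The DataFile class and latestDataFile() method below are hypothetical stand-ins for Hudi's timeline and getLatestDataFiles(), not the actual Hudi API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model of the snapshot rule described above. DataFile and
// latestDataFile() are hypothetical stand-ins, not the real Hudi API.
public class CommitSnapshotSketch {
    static final class DataFile {
        final String path;
        final String commitTime; // monotonically increasing, e.g. "20191008093000"
        final boolean completed; // false => the commit is still inflight
        DataFile(String path, String commitTime, boolean completed) {
            this.path = path;
            this.commitTime = commitTime;
            this.completed = completed;
        }
    }

    // Within one file group, pick the newest version whose commit completed.
    // Inflight writes are invisible, so the query's view is an immutable
    // snapshot even if the table changes while the query runs.
    static Optional<String> latestDataFile(List<DataFile> fileGroup) {
        return fileGroup.stream()
                .filter(f -> f.completed)
                .max(Comparator.comparing(f -> f.commitTime))
                .map(f -> f.path);
    }
}
```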

@bhasudha
Author

bhasudha commented Oct 23, 2019

@arhimondr, @shishunzhong, @wenleix I sent a PR towards this. Please take a look when you can.

@wenleix
Contributor

wenleix commented Oct 30, 2019

@bhasudha: A quick tip: when you paste lines from a GitHub file, press "y" first to include the commit id; this creates a permanent link: https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files

Otherwise, the file might change and the same line might refer to a different thing ;)

@wenleix
Contributor

wenleix commented Oct 30, 2019

@bhasudha , @arhimondr

This is probably off-topic, but in the long term do we want to have a HudiConnector? I imagine Hudi's long-term vision is to evolve into something similar to Iceberg to support better metadata management. Thanks!

@bhasudha
Author

> @bhasudha : A quick tip, when paste lines to Github file, try to click "y" to include commit id, this allows creates permanent link: https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files
>
> Otherwise, the file might change and the same line might refer to different thing ;)

Thanks for the tip! Didn't know about it.

@bhasudha
Author

> @bhasudha , @arhimondr
>
> This is probably off-topic, but in a long term do we want to have HudiConnector since I imagine its long term vision is to evolve as something similar to IceBerg to support better metadata management? Thanks!

@wenleix. Hudi already has a TimelineService with features similar to Iceberg's for better metadata management. We might have to support a separate Presto Hudi connector down the line, leveraging that and also supporting real-time views. I am happy to create a separate thread on this.
