Support for PathFilter in DirectoryLister #13511
Comments
Sounds reasonable to me. What do you think? @wenleix @arhimondr
Could you please elaborate more on what the
That sounds reasonable
The
For spark the
What about consistency? What if the partition / table has changed?
Thanks for the questions @arhimondr. I understand your concerns. I have replied inline.
Hoodie table metadata records information about the Hudi table, including commits, savepoints, compactions, cleanups, rollbacks, etc. None of this is partition-specific. The HoodieTableMetaClient provides access to this metadata. More details here. In the current context, since loadPartition() in BackgroundHiveSplitLoader calls HoodieInputFormat.getSplits(), the HoodieTableMetaClient is created multiple times for the same table (once for every partition loaded via loadPartition()) instead of just once per query per table.
The HoodieROTablePathFilter is implemented in such a way that, if the partition does not belong to a Hoodie table, public boolean accept(Path path) returns true. This should take care of all other table formats. It is also performant, since it caches the parent folder of this
Yeah, understood. Instantiating the PathFilter object every time a BackgroundHiveSplitLoader is instantiated (which happens in HiveSplitManager.getSplits) takes care of the per-query behavior. Please correct me if my assumption is wrong.
A Presto query's view of a Hoodie table is based on commits that have already completed, so it is immutable. The HoodieROTablePathFilter cache will also point to these immutable commits.
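To make the accept() behavior described above concrete, here is a minimal sketch of a pass-through filter with a per-directory cache. It is not the real HoodieROTablePathFilter: PathFilterSketch is a hypothetical stand-in for org.apache.hadoop.fs.PathFilter, paths are plain strings, and the set of Hudi table roots and the set of files from completed commits are assumed to be known up front.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for org.apache.hadoop.fs.PathFilter; paths are plain strings.
interface PathFilterSketch {
    boolean accept(String path);
}

// Illustrative only: NOT the real HoodieROTablePathFilter.
class CachingTableFilterSketch implements PathFilterSketch {
    private final Set<String> hoodieTableRoots; // assumed: directories known to be Hudi tables
    private final Set<String> committedFiles;   // assumed: files from completed commits
    private final Map<String, Boolean> isHoodieDir = new HashMap<>();
    int lookups = 0; // number of uncached directory checks

    CachingTableFilterSketch(Set<String> hoodieTableRoots, Set<String> committedFiles) {
        this.hoodieTableRoots = hoodieTableRoots;
        this.committedFiles = committedFiles;
    }

    @Override
    public boolean accept(String path) {
        String parent = parentOf(path);
        // Decide once per parent directory whether it belongs to a Hudi table.
        boolean hoodie = isHoodieDir.computeIfAbsent(parent, dir -> {
            lookups++;
            return belongsToHoodieTable(dir);
        });
        // Non-Hudi paths pass through untouched; Hudi paths must be from a completed commit.
        return !hoodie || committedFiles.contains(path);
    }

    private boolean belongsToHoodieTable(String dir) {
        for (String root : hoodieTableRoots) {
            if (dir.equals(root) || dir.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }

    private static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? "" : path.substring(0, slash);
    }
}
```

Because the cache is keyed by parent directory, a second file from the same partition does not trigger another table check, which is the property the per-query instantiation relies on.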
@arhimondr, @shishunzhong, @wenleix I sent a PR for this. Please take a look when you can.
@bhasudha: A quick tip: when linking to lines in a GitHub file, press "y" first to include the commit ID in the URL. This creates a permanent link: https://help.github.com/en/github/managing-files-in-a-repository/getting-permanent-links-to-files Otherwise, the file might change and the same line number might refer to something different ;)
This is probably off-topic, but in the long term do we want to have
Thanks for the tip! Didn't know about it.
@wenleix, Hudi already has a TimelineService with features similar to Iceberg's for better metadata management. We might have to support a separate Presto Hudi connector down the line that leverages it and also supports real-time views. I am happy to create a separate thread on this.
Hello!
Here is some context:
This PR makes it possible to get FileSplits directly from HoodieInputFormat in order to query Hudi datasets. Currently this is integrated into loadPartition(partition) in BackgroundHiveSplitLoader, which is called for every Hive partition. Internally, it returns the InputSplit[] for that Hive partition by calling HoodieInputFormat.getSplits().
Some extra overheads that are observed with this implementation:
Current Uber-specific solution:
To address these, we took a compile-time dependency on Hudi and instantiated the HoodieTableMetadata once in the BackgroundHiveSplitLoader constructor. We then used Hudi library APIs to filter the partition files instead of calling HoodieInputFormat.getSplits(). This significantly reduced the number of NameNode calls in this path.
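The "instantiate once in the constructor" idea can be sketched roughly as follows. Every name here (MetadataSourceSketch, TableMetadataSketch, SplitLoaderSketch) is hypothetical and stands in for the Hudi and Presto classes mentioned above; the counter only simulates an expensive metadata load and says nothing about real NameNode call counts.

```java
// Hypothetical stand-in for the per-table metadata object.
class TableMetadataSketch {
    final String tablePath;

    TableMetadataSketch(String tablePath) {
        this.tablePath = tablePath;
    }
}

// Hypothetical source that counts how often metadata is loaded (simulated cost).
class MetadataSourceSketch {
    int loads = 0;

    TableMetadataSketch load(String tablePath) {
        loads++;
        return new TableMetadataSketch(tablePath);
    }
}

// Illustrative loader: metadata is loaded once in the constructor and
// reused for every partition, instead of being reloaded per partition.
class SplitLoaderSketch {
    private final TableMetadataSketch metadata;

    SplitLoaderSketch(MetadataSourceSketch source, String tablePath) {
        this.metadata = source.load(tablePath);
    }

    String loadPartition(String partition) {
        // A real loader would filter the partition's files using the metadata;
        // here we only resolve the partition path to show the reuse.
        return metadata.tablePath + "/" + partition;
    }
}
```

With this shape, the number of metadata loads depends on the number of loaders (one per query per table) rather than on the number of partitions.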
We were looking at generalizing this solution and wanted to get your thoughts on leveraging PathFilter (https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/PathFilter.html) to make it more generic and InputFormat-agnostic. Here is the proposed generic solution, which does NOT bring in a compile-time dependency on the Hudi library:
If DirectoryLister can expose another API, list(FileSystem fs, Table table, Path path, PathFilter pathFilter), we can load a PathFilter implementation (such as HoodieROTablePathFilter) that is configurable via HiveClientConfig. It can be instantiated once per BackgroundHiveSplitLoader object and passed along to directoryLister.list(). This new PathFilter implementation can cache the Directory -> Filtered Hoodie Paths mapping, which is then readily available via HiveFileIterator.
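A rough sketch of what such an overload could look like is below. It is deliberately simplified and is not the actual Presto interface: the FileSystem and Table parameters are omitted, paths are plain strings, Predicate<String> stands in for PathFilter, and InMemoryListerSketch is a hypothetical implementation backed by a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified lister; the real proposal also takes FileSystem and Table.
interface DirectoryListerSketch {
    // Existing-style listing without a filter.
    List<String> list(String directory);

    // Proposed-style overload: same listing, restricted by a caller-supplied filter.
    default List<String> list(String directory, Predicate<String> pathFilter) {
        List<String> kept = new ArrayList<>();
        for (String file : list(directory)) {
            if (pathFilter.test(file)) {
                kept.add(file);
            }
        }
        return kept;
    }
}

// Illustrative in-memory implementation so the sketch can run without a file system.
class InMemoryListerSketch implements DirectoryListerSketch {
    private final Map<String, List<String>> filesByDirectory;

    InMemoryListerSketch(Map<String, List<String>> filesByDirectory) {
        this.filesByDirectory = filesByDirectory;
    }

    @Override
    public List<String> list(String directory) {
        return filesByDirectory.getOrDefault(directory, List.of());
    }
}
```

The split loader would create the filter once and pass the same instance to every list() call, so any cache held inside the filter is shared across all partitions of the query.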
For additional reference,
Spark queries to Hudi datasets on ReadOnlyViews leverage this similarly via HadoopConfiguration like this - sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[TmpFileFilter], classOf[PathFilter]).
Please let me know your thoughts/concerns. If it seems okay, I can send in an implementation.
Thanks!