[VL] Unified design for data lake read support in Gluten + Velox #3378
@yma11 In fact, both of these PRs were proposed by our team. @YannByron and I have communicated offline about a better architecture design for them. We agree with what you said, and we can take responsibility for optimizing this area and provide a better design as soon as possible. |
We also encountered some problems with BatchScanExecTransformer, and the current implementation may also need optimization. |
Hi @yma11, thank you for the invitation. Here, I will briefly explain that our upcoming design will mainly cover three parts of the integration between lake formats and Gluten/Velox. IMO all of these problems are related to the lake formats, but they can be solved independently:
1. Schema related: such as column mapping. I think this part can be solved at the gluten-core layer, as in #3376, but perhaps a better solution is to rewrite the plan. As @yma11 mentioned, Velox plans to implement column mapping in the native layer (I am worried that will make those native readers/datasources more complicated). Even so, I think we can still accept the current solution (rewriting).
2. Spark Datasource V1 and V2: currently gluten/velox already supports
3. COW and MOR tables: to support COW tables, one native implementation for the COW tables of the different lake formats (Delta Lake non-DV tables, Hudi COW tables, Iceberg V1 tables, Paimon append-only tables) is to use the underlying file format readers (Parquet/ORC), which is the approach adopted by #3043 and #3376. I personally think this method is acceptable under the current circumstances. In addition, on the code framework (the transformer hierarchy), our thoughts are consistent with what @yma11 mentioned above. Next, we will propose the design (maybe as a draft) soon, cc @liujiayi771, so that we can discuss it in depth. |
It's acceptable to continue with the current solution, like column mapping, and then switch to Velox once it's supported there, for Gluten's quick adoption. But even so, a clear, detailed design needs to be finalized first so that the community can work together based on it. Let's discuss more when the draft is ready. |
Are the two PRs submitted? |
Goals
Make gluten/velox support lake format (DeltaLake, Iceberg) queries.
Some design thoughts:
Design Sketch
Project Framework
gluten-core needs to provide some interfaces which are used to extend lake-format-specific logic. Some of these interfaces are shown as follows:
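The interface code referenced above did not survive in this copy of the thread. As a rough reconstruction, the extension points could look like the following sketch; all names here are illustrative assumptions, not Gluten's actual API.

```java
// Hypothetical sketch of the extension points gluten-core could expose
// for lake-format plugins; names are illustrative, not Gluten's real API.
interface DataLakeExtension {
    // Identifier of the lake format, e.g. "delta" or "iceberg".
    String formatName();

    // Whether this extension can handle the given Spark scan class.
    boolean supportsScan(String scanClassName);
}

// Example implementation a gluten-iceberg module might ship.
final class IcebergExtension implements DataLakeExtension {
    public String formatName() { return "iceberg"; }

    public boolean supportsScan(String scanClassName) {
        return scanClassName.endsWith("SparkBatchQueryScan");
    }
}

public class ExtensionDemo {
    public static void main(String[] args) {
        DataLakeExtension ext = new IcebergExtension();
        System.out.println(ext.formatName());
        System.out.println(ext.supportsScan("org.apache.iceberg.spark.source.SparkBatchQueryScan"));
    }
}
```

The key property is that gluten-core only sees the interface; the format-specific implementation lives in its own module.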
RewritePlanRules and Column Mapping
Column mapping allows table columns and the underlying file (Parquet/ORC) columns to use different names. This enables schema evolution operations such as column renames. Once we scan a table that enables column mapping (for example, a Delta table), we need to deal with the relationship between table columns and file columns, and this can be achieved by rewriting the plan. To solve this case, RewritePlanRules is abstracted out, which provides the ability to rewrite/transform a SparkPlan before transforming it into a GlutenPlan. This interface looks like this:
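The interface body was lost in this copy of the thread. As a hedged reconstruction under assumed names (`Plan` stands in for Spark's `SparkPlan`; the real Gluten signature is not shown here), a rule is essentially a plan-to-plan function applied before Substrait conversion:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class RewriteRulesDemo {
    // Stand-in for Spark's SparkPlan; illustrative only.
    record Plan(String description) {}

    // A rule rewrites a plan into an equivalent plan before the
    // SparkPlan -> GlutenPlan (Substrait) conversion happens.
    interface RewritePlanRule extends UnaryOperator<Plan> {}

    // Apply all registered rules in order.
    static Plan applyAll(Plan plan, List<RewritePlanRule> rules) {
        for (RewritePlanRule rule : rules) {
            plan = rule.apply(plan);
        }
        return plan;
    }

    public static void main(String[] args) {
        // Example rule: map a logical column name to its physical file name,
        // which is what column mapping needs ("user_id" stored as "col-1").
        RewritePlanRule columnMapping =
            p -> new Plan(p.description().replace("user_id", "col-1"));
        Plan rewritten = applyAll(new Plan("Scan[user_id]"), List.of(columnMapping));
        System.out.println(rewritten.description());
    }
}
```

Ordering matters if several rules are plugged in, which echoes @zhztheplayer's later question about the application order of plugged-in rules.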
And then, A
And we can define

Datasource Scan
Currently, Spark supports four Scan types:
And gluten supports the first three types. For For
The two functions are functionally aligned.

Scan Transformer Structure
Here this design refines the Scan Transformer structure.
BaseDataSource: encapsulates the necessary datasource information.
BaseScanTransformer: the core abstraction, with these key abilities:
Notice: for MOR tables, LocalFilesNode doesn't keep enough information. Thus, we allow creating a child class to extend it, and the child class is required to override the

Datasource v1 scan transformer
DatasourceScanTransformer: the base class for datasource v1.
FileSourceScanTransformer: the scan implementation of Datasource v1, based on
DeltaLakeScanTransformer: the scan implementation for Delta tables, located in gluten-delta, extends

Datasource v2 scan transformer
BatchScanTransformer: the base class for datasource v2.
IcebergScanTransformer: the scan implementation for Iceberg tables, located in gluten-iceberg, extends
Here is a detailed explanation of this.
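The hierarchy just listed can be sketched as plain classes. This is only an illustration of the layering the design describes (base abstraction, per-API base class, per-format leaf); the real transformers carry far more state:

```java
// Illustrative sketch of the transformer layering described above;
// class and method names mirror the design text, not Gluten's code.
abstract class BaseScanTransformer {
    // Which reader the native side should use for this scan.
    abstract String getReadFileFormat();
}

// Datasource v2 base: plain file scans use the raw file format reader.
class BatchScanTransformer extends BaseScanTransformer {
    String getReadFileFormat() { return "parquet"; }
}

// Lake-format leaf: MOR tables need a format-aware native reader,
// so the leaf overrides the format reported to the native side.
class IcebergScanTransformer extends BatchScanTransformer {
    @Override
    String getReadFileFormat() { return "iceberg"; }
}

public class HierarchyDemo {
    public static void main(String[] args) {
        BaseScanTransformer t = new IcebergScanTransformer();
        System.out.println(t.getReadFileFormat());
    }
}
```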
To support Iceberg V2 (MOR) tables, the data split needs to contain these deleteFiles. We also upgrade the

Scan Transformer Factory
There are many places where a scan transformer object is created based on the file format or scan, so a factory class for these needs to be provided. Among them, the scan transformers of the lake formats are initialized by reflection.
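Reflection-based initialization is what lets gluten-core create, say, an Iceberg transformer without a compile-time dependency on gluten-iceberg. A minimal sketch (the class names looked up below are placeholders, not Gluten's actual ones):

```java
// Sketch of a reflection-based factory: gluten-core looks the lake-format
// transformer class up by name, so the gluten-iceberg/gluten-delta jars are
// only needed at runtime, and only if the user puts them on the classpath.
public class ScanTransformerFactory {
    static Object createByReflection(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Lake-format jar not on the classpath: caller can fall back
            // to vanilla Spark execution or raise a validation failure.
            return null;
        }
    }

    public static void main(String[] args) {
        // java.util.ArrayList stands in for a real transformer class name.
        Object present = createByReflection("java.util.ArrayList");
        Object missing = createByReflection("io.glutenproject.NoSuchTransformer");
        System.out.println(present != null);
        System.out.println(missing == null);
    }
}
```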
ReadFileFormat
Let me explain the
To support the COW tables of the lake formats, we allow using the underlying file format until a native reader implementation is provided. So this method can return
To support MOR tables, a native reader is necessary, and it should support both COW and MOR tables. With this, the method should return the specific ReadFileFormat, e.g. DeltaReadFileFormat, IcebergReadFileFormat.

Native Reader
Protobuf
Corresponding to the
Due to the different parameters required by different lake formats and table types, we need to abstract a different ReadOptions for each lake format to pass format-specific information. For example, in the FileOrFiles message, add an IcebergReadOptions message to convey Iceberg-specific information. Due to the preference of the Velox community, the DeleteFile information is likely to be serialized as a string for transmission; the DeleteFile section in the protobuf may be adjusted to a string.
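A rough sketch of how such a message could hang off FileOrFiles follows. Field names and numbers here are illustrative assumptions, not the final Substrait extension, and per the note above the delete-file details may end up as an opaque serialized string instead of a structured message:

```protobuf
// Hypothetical sketch only; not the actual Gluten/Substrait definition.
message IcebergReadOptions {
  // File format of the base data file in this split (e.g. parquet, orc).
  string base_file_format = 1;

  // Delete files attached to this split for MOR reads. Following the
  // Velox community's preference, each entry may be a serialized string.
  repeated string delete_files = 2;
}
```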
Native SplitInfo
On the native side, after receiving the protobuf message, we need to construct different SplitInfos based on the file_format information recorded in the protobuf. Similarly, on the native side, the different lake formats also need to construct an xxSplitInfo that inherits from SplitInfo to pass additional information. This information will be used to construct the corresponding HiveConnectorSplit in WholeStageResultIteratorFirstStage. Similar to the handling in Java, we need to abstract the conversion from SplitInfo to HiveConnectorSplit into a method of SplitInfo itself: a SplitInfo should be converted to a HiveConnectorSplit, while an IcebergSplitInfo should be converted to a HiveIcebergSplit, etc.
|
@yma11 @weiting-chen please help to review this; we look forward to your feedback. |
Thanks a lot @YannByron for the detailed design!
This looks like a very common feature. Did you already have some thoughts on the way to implement this? By a service loader or some kind of reflection tool? Also, should we care about the application order of the plugged-in rules?
The hierarchy of datasource transformers in Gluten has an issue: they extend vanilla Spark's case classes. For example, BatchScanExecTransformer inherits from
Two questions here:
And some other questions for the whole design:
Again, thank you everyone for bringing this up. Can't wait to see it landing. |
I have encountered this issue before, and I also suggest that the community remove this inheritance relationship. We can try to do it.
Not yet. But they have already merged many prerequisite PRs, and the PR supporting iceberg has also been proposed. It should be relatively quick to merge.
I have successfully used the relevant PR from the Velox community to establish the process of reading Iceberg MOR tables. In |
yep, maybe use a service loader like
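A minimal ServiceLoader sketch of the mechanism hinted at above: each lake-format jar would ship a `META-INF/services/<interface FQCN>` file naming its implementation, and gluten-core would discover implementations at runtime with no compile-time dependency. The interface name here is a hypothetical stand-in:

```java
import java.util.ServiceLoader;

public class LoaderDemo {
    // Hypothetical plug-in interface a gluten-delta/gluten-iceberg
    // jar would implement and register via META-INF/services.
    public interface RewritePlanRuleProvider {
        String formatName();
    }

    public static void main(String[] args) {
        ServiceLoader<RewritePlanRuleProvider> loader =
            ServiceLoader.load(RewritePlanRuleProvider.class);
        int count = 0;
        for (RewritePlanRuleProvider p : loader) {
            count++;
            System.out.println("found provider: " + p.formatName());
        }
        // No provider jar is on the classpath in this standalone demo,
        // so discovery yields nothing.
        System.out.println("providers: " + count);
    }
}
```

Note that ServiceLoader gives no ordering guarantee across jars, which is why the rule-ordering question above matters.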
In fact, I'm aware of the problem. But this does not affect the design, so it can be discussed separately. cc @liujiayi771
Based on my design, I plan to move the logic of
Yep, the former way. If users want to use a lake format, they need to put the separate jar on the classpath, not a fat jar.
There are no configs in this design. I hope users can enable native queries against lake formats just by putting the gluten-lakeformat jar on the classpath. |
Hi @YannByron and @liujiayi771, thank you guys for such a detailed design!
Does it mean that in the MOR table case, we don't need to care whether it's a Parquet or ORC file?
My understanding is that we don't have a direct dependency on these lake-related jars but do our job based on Spark interfaces. The data lake read/write offloading to Gluten/velox is transparent for users who already have Spark + data lake working properly, right? |
I don't know whether there are other cases that need
In the Java layer, yes. But enough information (the base file's file format, the delete file's file format) needs to be passed as options to the native layer.
These modules (like gluten-delta, gluten-iceberg) depend on lake-related jars, but gluten-core doesn't. Users just need to put the gluten-datalake jar on the classpath, then it's transparent for them. |
@zhztheplayer Update the answer to this question.
I tried to remove this case-class-extending today. I found that this problematic case-class-extending is used to retrieve the |
Update: after discussing with @yma11 offline, we chose the solution of rewriting |
@YannByron @liujiayi771 Thanks for your active involvement. Looking forward to your PRs! |
Greetings, We would like to join the discussion and ask for your guidance - we are developing a new Velox connector for integrating our storage engine with the Gluten project. IIUC, the current Gluten design allows pushing down file-based scans of supported formats (e.g. Parquet, ORC, ...) into Velox. However, our storage engine uses RPC-based communication for metadata and data retrieval, similar to the Arrow Flight protocol. Currently we have a Spark Datasource V2 connector and we would like to integrate it with the Gluten project. We have a few thoughts and would be happy to hear your opinions:
Please let us know what you think would be the best way forward. |
Hi @rz-vastdata. We are currently working on implementing the design mentioned above. Iceberg may need a new file format as well. You can participate in the review of our PRs and provide some suggestions. @yma11 We need to expedite the progress of PR review as there are many additional tasks to do. |
Sounds great, thanks for the update!
We'd be happy to participate in the review. |
@rz-vastdata I have modified the title of the PR, and it will reference this issue. |
Following from #3650, IIUC the current approach is to accelerate file-based data sources in Gluten, right? Asking since in our case, the Vast data source is based on a generic
Would it be possible to support converting custom
I understand that it also requires adding support in the Substrait protobuf definition and the Velox source code, but it seems to better match the use case where the connector does not use files to store the data. For example, in our case, there is a custom protocol to retrieve the data from an external RPC server, so there is no need to use file-based I/O and abstractions.
BTW, how is this handled when Gluten is used to interface with external hardware devices (e.g. FPGA/GPU/ASIC accelerators, as shown in |
Since #3650 merged, I suggest defining a SplitInfo proto on the Gluten side, which means not using Substrait LocalFiles; then we can decouple SplitInfo from the Substrait plan. Since the plan is immutable within a specific stage but SplitInfo is mutable, this brings a clearer design and interface, and helps decrease the serialization cost. |
@Yohahaha What's the purpose of making |
A plan or plan fragment being immutable means it does not contain task info; SplitInfo being mutable means each task of a stage has its own SplitInfo to process. Combining these brings much complexity: we need to pass each task's SplitInfo into the plan and then serialize the whole plan. We encountered a memory issue before, and a serialization cost issue now.
So you'd like to use LocalFileNode for all formats? Does it satisfy all cases? |
The serialized Substrait plan has a complete definition for the corresponding Velox task, so it needs to contain the SplitInfo. The serialization happens for each task even if we don't pass the SplitInfo, as it may still have other task-specific info. Even if you do the serialization only once on the driver side, you need to combine this
I am not sure. But it's quite open to add more necessary fields in this proto if we need. |
@rz-vastdata #3650 is just a preparatory work. The next PR will be submitted next week and will include |
@yma11 It seems that some of the information required for the Iceberg format can be added into |
Sounds good, many thanks @liujiayi771! |
By the way, is there an open-source example for Gluten being used to interface with external devices (e.g. FPGA/GPU/ASIC accelerators, as shown at https://github.com/oap-project/gluten#2-architecture), e.g. when the data sources are not file-based? |
I'm working on the FPGA version |
Why is there a DeleteFile option in the Iceberg ReadOptions? Presumably you aren't actually deleting files as part of a read operation. Are you ignoring the files instead? For that matter, why is the option that says how to delete named FileContent? |
Iceberg uses DeleteFile to exclude some records in data files. Nothing is removed during a read; a delete file just marks which rows of a data file should be skipped when merging on read. |
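To make the semantics concrete, here is a toy illustration of how a positional delete file behaves during a merge-on-read scan: the delete file lists row positions within a data file, and the reader simply skips those rows. This is an illustration of the concept, not Iceberg's or Velox's actual reader code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PositionDeleteDemo {
    // Apply a positional delete file to one data file's rows: keep every
    // row whose position is not listed; the data file itself is untouched.
    static List<String> applyPositionDeletes(List<String> rows, Set<Long> deletedPositions) {
        List<String> result = new ArrayList<>();
        for (long pos = 0; pos < rows.size(); pos++) {
            if (!deletedPositions.contains(pos)) {
                result.add(rows.get((int) pos));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dataFileRows = List.of("a", "b", "c", "d");
        // Positions 1 and 3 are marked deleted for this data file.
        Set<Long> deletes = Set.of(1L, 3L);
        System.out.println(applyPositionDeletes(dataFileRows, deletes));
    }
}
```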
Description
Currently there are 2 PRs opened in Gluten to support Iceberg COW table read and Delta Lake read. There is also one hot discussion in Velox about Iceberg read support. By consolidating the ideas and based on Gluten's position, we would like to share a draft unified design for data lake read support in Gluten.
As addressed on this project's home page, one of Gluten's key functions is to transform Spark's whole-stage physical plan to a Substrait plan and send it to the native side. This applies to data lake read support as well, thus:
We'd better avoid hacking the original Spark physical plan nodes. Gluten core has plan transformers to generate the correct plan info in Substrait format and then pass it to Velox for reading and computation. So no matter what kind of hack is needed, it should be done in the transformer layer, such as column mapping. IMO, we should try our best to pass the original info in the Spark plan through to Velox as a bridge and do the correct consumption on the Velox side, unless it's not doable or Velox can't support it. By the way, one point about a feature like column mapping is that it's a common feature across kinds of file format reading, so Velox can handle it at its datasource level, and the community has plans to do so.
A clear transformer hierarchy is needed for the different data lake backends. In the Iceberg COW table read PR, a new branch is added to do Iceberg-specific processing and leverage a utility class put in a dedicated folder, and in the future, I believe more branches will be needed to support other cases, like MOR. So introducing a new transformer inherited from BatchScanExecTransformer would be a better way. The possible hierarchy should be like the following:
@YannByron @felipepessoto @liujiayi771 @ulysses-you, please give comments on the above suggestions. Thanks.