[GLUTEN-2877][VL]Feat: Support read iceberg cow table #3043
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/oap-project/gluten/issues Then could you also rename the commit message and pull request title in the following format?
See also:
Run Gluten Clickhouse CI
Run Gluten Clickhouse CI
Run Gluten Clickhouse CI
@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
   override def filterExprs(): Seq[Expression] = scan match {
     case fileScan: FileScan =>
       fileScan.dataFilters ++ pushdownFilters
+    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>
Why not use isInstanceOf?
SparkBatchQueryScan is a protected class in Iceberg; we cannot import this class here.
(Resolved review thread on gluten-core/src/main/scala/org/apache/iceberg/spark/source/GlutenSparkBatchQueryScan.scala)
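For context on the visibility issue above: the GlutenSparkBatchQueryScan file referenced here sits inside Iceberg's own org.apache.iceberg.spark.source package, which is the usual workaround for a class that cannot be imported from outside. A minimal sketch of that same-package trick, with a purely hypothetical helper name, could look like this:

// Hypothetical helper, not this PR's code: only sources compiled into
// org.apache.iceberg.spark.source can name the non-public SparkBatchQueryScan,
// so an isInstanceOf check becomes possible from inside that package.
package org.apache.iceberg.spark.source

import org.apache.spark.sql.connector.read.Scan

object GlutenIcebergScanUtil {
  def isIcebergBatchQueryScan(scan: Scan): Boolean =
    scan.isInstanceOf[SparkBatchQueryScan]
}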
tasks.map(_.asCombinedScanTask()).foreach {
  task =>
    val file = task.files().asScala.head.file()
    file.format() match {
Iceberg files may be a mix of Parquet and ORC; how can we handle this?
Maybe this is a constraint we can document. Checking all the files is too expensive, and this situation is rare.
Yes, we should clarify the constraints.
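To make that constraint concrete, here is a minimal sketch (not the PR's code, assuming the Iceberg CombinedScanTask API) that documents the assumption by checking that every file in a combined scan task uses the same format and failing fast otherwise:

import scala.collection.JavaConverters._

import org.apache.iceberg.{CombinedScanTask, FileFormat}

object IcebergFormatCheck {
  // Sketch of the documented constraint: every file in one combined scan task
  // is assumed to share a single format; a mixed-format task is rejected early.
  def taskFileFormat(task: CombinedScanTask): FileFormat = {
    val formats = task.files().asScala.map(_.file().format()).toSet
    require(
      formats.size == 1,
      s"Mixed file formats within one Iceberg scan task are not supported: $formats")
    formats.head
  }
}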
@@ -47,7 +47,7 @@ class SoftAffinitySuite extends QueryTest with SharedSparkSession with Predicate
       ).toArray
     )

-    val locations = SoftAffinityUtil.getFilePartitionLocations(partition)
+    val locations = SoftAffinityUtil.getFilePartitionLocations(partition, partition.files)
We can get partition.files inside getFilePartitionLocations; there's no need to add it as an argument.
The modification here is because the partition files of an Iceberg partition cannot be obtained directly through icebergPartition.files.
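As a rough sketch of the design choice discussed here (signatures are illustrative, not the actual SoftAffinityUtil code): since an Iceberg scan partition does not expose .files the way a FilePartition does, the caller supplies the files explicitly and the utility stays agnostic about where they came from.

import org.apache.spark.Partition
import org.apache.spark.sql.execution.datasources.PartitionedFile

object SoftAffinityUtilSketch {
  // Illustrative only: files are an explicit argument so callers whose
  // partition type is not a FilePartition (e.g. an Iceberg scan) can pass them.
  // The real utility also uses the partition itself for its affinity decision;
  // this sketch only collects the preferred hosts reported by the files.
  def getFilePartitionLocations(
      partition: Partition,
      files: Array[PartitionedFile]): Array[String] =
    files.flatMap(f => Option(f.locations).getOrElse(Array.empty[String])).distinct
}

// A FilePartition-based caller simply forwards its own files, as in the test above:
//   SoftAffinityUtilSketch.getFilePartitionLocations(partition, partition.files)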
Can you add a test for Iceberg tables?
I'm afraid it would be hard to maintain if we mix Iceberg code everywhere in Gluten. What this PR does is something like:
I wonder if we can make it like:
The main difference is that we can create new modules for Iceberg or other Spark downstream projects. We can expose the power of Gluten as an extension as much as possible, and then these modules can be developed based on the Gluten extension.
Run Gluten Clickhouse CI
Add a new test case VeloxTPCHIcebergSuite.
I agree with you. We need to add the ability to inject new rules into the scan processing part of gluten-core. We will optimize this later. Our team is also supporting Delta, Hudi, and Paimon; support for different lake formats requires a better architecture.
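Purely as an illustration of that injection idea (every name below is hypothetical), a gluten-iceberg, delta, hudi, or paimon module could register a handler that gluten-core consults when it meets a scan it does not recognize:

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.connector.read.Scan

// Hypothetical extension point: each lake-format module implements and
// registers a handler, keeping format-specific code out of gluten-core.
trait DataSourceScanHandler {
  def canHandle(scan: Scan): Boolean
  def filterExprs(scan: Scan): Seq[Expression]
}

object ScanHandlerRegistry {
  private var handlers: Seq[DataSourceScanHandler] = Seq.empty

  def register(handler: DataSourceScanHandler): Unit = synchronized {
    handlers = handlers :+ handler
  }

  def find(scan: Scan): Option[DataSourceScanHandler] =
    handlers.find(_.canHandle(scan))
}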
Run Gluten Clickhouse CI
@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
   override def filterExprs(): Seq[Expression] = scan match {
     case fileScan: FileScan =>
       fileScan.dataFilters ++ pushdownFilters
+    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>
Comparing the simple name may be dangerous. What if Delta Lake (or another format) has the same class name?
> Comparing the simple name may be dangerous. What if Delta Lake (or another format) has the same class name?

You are right; it would be more accurate to use a name containing the package name, but in practice the scan class names of different formats are different. Gluten uses the simple name when judging the scan type.
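A minimal sketch (not this PR's code) of that trade-off: matching on the fully qualified class name rules out a collision with another format that happens to reuse the simple name.

import org.apache.spark.sql.connector.read.Scan

object ScanTypeCheckSketch {
  def isIcebergBatchQueryScan(scan: Scan): Boolean =
    // Simple-name check, as used in this PR:
    //   scan.getClass.getSimpleName == "SparkBatchQueryScan"
    // Stricter, package-qualified alternative:
    scan.getClass.getName == "org.apache.iceberg.spark.source.SparkBatchQueryScan"
}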
@liujiayi771 Are you working on refactoring this PR, or do you plan to do so based on the discussion?
We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request.
@yma11 Apologies for the delay. We will completely decouple Iceberg, Delta, and gluten-core. Many areas will require modifications.
I see. Looking forward to your design. Thanks.
Run Gluten Clickhouse CI
Closing this PR as the related implementation was all done in later PRs.
What changes were proposed in this pull request?
Support reading Iceberg COW (copy-on-write) tables in Gluten. This resolves the first step in #2877.
How was this patch tested?
Ran the TPC-DS benchmark on 1 TB TPC-DS Iceberg tables.