[GLUTEN-2877][VL]Feat: Support read iceberg cow table #3043

liujiayi771 · 2023-09-06T07:33:56Z

What changes were proposed in this pull request?

Support reading Iceberg copy-on-write (COW) tables in Gluten. This resolves the first step of #2877.

How was this patch tested?

Ran the TPC-DS benchmark on 1 TB TPC-DS Iceberg tables.

github-actions · 2023-09-06T07:34:13Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2023-09-06T07:34:27Z

Run Gluten Clickhouse CI

github-actions · 2023-09-06T08:12:01Z

Run Gluten Clickhouse CI

github-actions · 2023-09-06T10:48:53Z

Run Gluten Clickhouse CI

jinchengchenghh · 2023-09-06T14:06:13Z

gluten-core/src/main/scala/io/glutenproject/execution/BatchScanExecTransformer.scala

@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
  override def filterExprs(): Seq[Expression] = scan match {
    case fileScan: FileScan =>
      fileScan.dataFilters ++ pushdownFilters
+    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>


Why not use isInstanceOf?

SparkBatchQueryScan is not a public class in Iceberg, so we cannot import it here.
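For illustration, one alternative to comparing names is to look the class up reflectively and call `isInstance`, which also accepts subclasses. This is a sketch, not code from this PR: `ScanTypeCheck` and `isInstanceOfNamed` are hypothetical names, and the demo uses JDK classes as stand-ins so that it runs without Iceberg on the classpath.

```scala
import scala.util.Try

object ScanTypeCheck {
  // Returns true when `obj` is an instance of the class named `className`
  // (subclasses included). Returns false when the class cannot be found,
  // for example because the optional dependency is not on the classpath.
  def isInstanceOfNamed(obj: AnyRef, className: String): Boolean =
    Try(Class.forName(className)).toOption.exists(_.isInstance(obj))

  def main(args: Array[String]): Unit = {
    // JDK classes stand in for a scan class that cannot be imported.
    val scan: AnyRef = new java.util.ArrayList[String]()
    println(isInstanceOfNamed(scan, "java.util.List"))    // class found, instance matches
    println(isInstanceOfNamed(scan, "no.such.ScanClass")) // class missing, treated as no match
  }
}
```

`Class.forName(String)` resolves against the caller's class loader, so a real integration might need to pass an explicit loader instead.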

jinchengchenghh · 2023-09-06T14:07:37Z

gluten-core/src/main/scala/org/apache/iceberg/spark/source/GlutenSparkBatchQueryScan.scala

+        tasks.map(_.asCombinedScanTask()).foreach {
+          task =>
+            val file = task.files().asScala.head.file()
+            file.format() match {


Iceberg data files may be a mix of Parquet and ORC. How do we handle this?

Maybe this is a constraint that we can document. Checking every file is too expensive, and this situation is rare.

Yes, we should clarify the constraints.
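To make the trade-off concrete, here is a minimal sketch of the two options: trusting the first file's format (what the diff above does) versus scanning every file. All types and names here (`FileFormat`, `DataFile`, `formatOfFirst`, `uniformFormat`) are hypothetical stand-ins, not Iceberg's API.

```scala
object FormatCheck {
  sealed trait FileFormat
  case object Parquet extends FileFormat
  case object Orc extends FileFormat

  // Hypothetical stand-in for a data file entry in a scan task.
  final case class DataFile(path: String, format: FileFormat)

  // Cheap option: trust the first file only. O(1), but silently wrong
  // when the task mixes formats.
  def formatOfFirst(files: Seq[DataFile]): Option[FileFormat] =
    files.headOption.map(_.format)

  // Strict option: visit every file. O(n), but detects mixed formats.
  def uniformFormat(files: Seq[DataFile]): Either[String, FileFormat] =
    files.map(_.format).distinct match {
      case Seq(single) => Right(single)
      case Seq()       => Left("no files")
      case many        => Left(s"mixed formats: ${many.mkString(", ")}")
    }

  def main(args: Array[String]): Unit = {
    val mixed = Seq(DataFile("a.parquet", Parquet), DataFile("b.orc", Orc))
    println(formatOfFirst(mixed)) // reports only the first file's format
    println(uniformFormat(mixed)) // reports the mix as an error
  }
}
```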

jinchengchenghh · 2023-09-06T14:13:23Z

gluten-core/src/test/scala/org/apache/spark/softaffinity/SoftAffinitySuite.scala

@@ -47,7 +47,7 @@ class SoftAffinitySuite extends QueryTest with SharedSparkSession with Predicate
      ).toArray
    )

-    val locations = SoftAffinityUtil.getFilePartitionLocations(partition)
+    val locations = SoftAffinityUtil.getFilePartitionLocations(partition, partition.files)


We can get partition.files inside getFilePartitionLocations, so there is no need to add it as an argument.

This change is needed because the files of an Iceberg partition cannot be obtained directly through icebergPartition.files.

https://github.com/liujiayi771/gluten/blob/df0d79568da5561ba5b45a425903f21922558cc3/backends-velox/src/main/scala/io/glutenproject/backendsapi/velox/IteratorHandler.scala#L127
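The idea of passing the file list explicitly can be sketched as below. This is only an illustration of the signature choice: `ScanFile`, `preferredHosts`, and the ranking by hit count are assumptions made for the example, not the actual logic of `SoftAffinityUtil`.

```scala
object PreferredLocationsSketch {
  // Hypothetical description of one file in a partition and the hosts holding it.
  final case class ScanFile(path: String, hosts: Seq[String])

  // The caller supplies the files, so any partition type (a Spark FilePartition,
  // an Iceberg partition, ...) can be used as long as it can list its files.
  // Hosts are ranked by how many files they hold; ties are broken by name.
  def preferredHosts(files: Seq[ScanFile], maxHosts: Int = 3): Seq[String] =
    files
      .flatMap(_.hosts)
      .groupBy(identity)
      .map { case (host, hits) => (host, hits.size) }
      .toSeq
      .sortBy { case (host, count) => (-count, host) }
      .take(maxHosts)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val files = Seq(
      ScanFile("data/a.parquet", Seq("host1", "host2")),
      ScanFile("data/b.parquet", Seq("host2", "host3")))
    println(preferredHosts(files, maxHosts = 2))
  }
}
```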

jinchengchenghh · 2023-09-06T14:14:11Z

Can you add a test for Iceberg tables?

ulysses-you · 2023-09-07T02:59:56Z

I'm afraid it would be hard to maintain if we mix Iceberg code everywhere in Gluten. What this PR does is something like:

Gluten    Iceberg
   \       /
     \   /
       |
   Spark SQL

I wonder if we can make it like:

    Iceberg
       |
     Gluten 
       |
   Spark SQL

The main difference is that we can create new modules for Iceberg or other Spark downstream projects. We can expose the power of Gluten as extensions as much as possible, and then these modules can be developed based on the Gluten extensions.

github-actions · 2023-09-07T08:51:37Z

Run Gluten Clickhouse CI

liujiayi771 · 2023-09-07T08:53:15Z

> Can you add a test for Iceberg tables?

Added a new test suite, VeloxTPCHIcebergSuite.

liujiayi771 · 2023-09-07T08:59:42Z

> I'm afraid it would be hard to maintain if we mix Iceberg code everywhere in Gluten. What this PR does is something like:
>
> Gluten    Iceberg
>    \       /
>      \   /
>        |
>    Spark SQL
>
> I wonder if we can make it like:
>
>     Iceberg
>        |
>      Gluten
>        |
>    Spark SQL
>
> The main difference is that we can create new modules for Iceberg or other Spark downstream projects. We can expose the power of Gluten as extensions as much as possible, and then these modules can be developed based on the Gluten extensions.

I agree with you. We need to add the ability to inject new rules into the scan processing part of gluten-core. We will optimize this later. Our team is also working on support for Delta, Hudi, and Paimon, and supporting different lake formats requires a better architecture.
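One possible shape for such an injection point is sketched below, assuming a registry in the core module that format-specific modules register rules with. Everything here (`ScanTransformerRule`, `ScanTransformerRegistry`, `FakeIcebergScan`) is hypothetical and only illustrates the layering; it is not Gluten's actual extension API.

```scala
import scala.collection.mutable.ArrayBuffer

object ScanExtensionSketch {
  // A rule inspects a scan node and, if it recognizes it, returns a
  // description of the offloaded scan (a String keeps the sketch small).
  trait ScanTransformerRule {
    def tryTransform(scan: AnyRef): Option[String]
  }

  // Lives in the core module; knows nothing about any lake format.
  final class ScanTransformerRegistry {
    private val rules = ArrayBuffer.empty[ScanTransformerRule]

    def inject(rule: ScanTransformerRule): Unit = { rules += rule }

    // The first rule that recognizes the scan wins.
    def transform(scan: AnyRef): Option[String] =
      rules.iterator.map(_.tryTransform(scan)).collectFirst { case Some(t) => t }
  }

  // Lives in a separate, format-specific module (stand-in types).
  final case class FakeIcebergScan(table: String)

  object IcebergRule extends ScanTransformerRule {
    def tryTransform(scan: AnyRef): Option[String] = scan match {
      case FakeIcebergScan(table) => Some(s"iceberg:$table")
      case _                      => None
    }
  }

  def main(args: Array[String]): Unit = {
    val registry = new ScanTransformerRegistry
    registry.inject(IcebergRule)
    println(registry.transform(FakeIcebergScan("t1")))
    println(registry.transform("some other scan"))
  }
}
```

With this layering the core module depends on no lake format, and each format module depends only on the core.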

github-actions · 2023-09-11T03:08:37Z

#2877

github-actions · 2023-09-13T03:07:12Z

Run Gluten Clickhouse CI

felipepessoto · 2023-09-29T20:48:24Z

gluten-core/src/main/scala/io/glutenproject/execution/BatchScanExecTransformer.scala

@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
  override def filterExprs(): Seq[Expression] = scan match {
    case fileScan: FileScan =>
      fileScan.dataFilters ++ pushdownFilters
+    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>


Comparing the simple name may be dangerous. What if Delta Lake (or another format) has a class with the same name?

> Comparing the simple name may be dangerous. What if Delta Lake (or another format) has a class with the same name?

https://github.com/oap-project/gluten/blob/ff79df0084aec978161e99359c3d4dd58376ec45/gluten-core/src/main/scala/io/glutenproject/execution/BatchScanExecTransformer.scala#L127

You are right, it would be more accurate to use a name that includes the package name, but in practice the scan class names of different formats differ. Gluten uses the simple name when determining the scan type.
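The collision risk can be shown with plain strings. In this sketch the Iceberg name is the real fully qualified class name, while `com.example.lake.SparkBatchQueryScan` is a made-up class used only to demonstrate a clash; in real code the input would come from `scan.getClass.getName`.

```scala
object ScanNames {
  val IcebergScan = "org.apache.iceberg.spark.source.SparkBatchQueryScan"

  // Strips the package prefix, e.g. "a.b.C" becomes "C".
  def simpleName(fqcn: String): String =
    fqcn.substring(fqcn.lastIndexOf('.') + 1)

  // Check as written in the diff: the package is ignored.
  def matchesBySimpleName(fqcn: String): Boolean =
    simpleName(fqcn) == "SparkBatchQueryScan"

  // Stricter check: the package must match as well.
  def matchesByFullName(fqcn: String): Boolean =
    fqcn == IcebergScan

  def main(args: Array[String]): Unit = {
    val other = "com.example.lake.SparkBatchQueryScan" // hypothetical class
    println(matchesBySimpleName(other)) // true: false positive
    println(matchesByFullName(other))   // false
  }
}
```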

yma11 · 2023-10-30T13:55:37Z

@liujiayi771 Are you working on refactoring this PR, or do you plan to do so based on the discussion?

liujiayi771 · 2023-10-30T14:42:02Z

> @liujiayi771 Are you working on refactoring this PR, or do you plan to do so based on the discussion?

We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request.

liujiayi771 · 2023-10-30T14:43:56Z

@yma11 Apologies for the delay. We will completely decouple iceberg, delta, and gluten-core. Many areas will require modifications.

yma11 · 2023-10-30T14:44:34Z

> We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request.

I see. Looking forward to your design. Thanks.

github-actions · 2023-11-07T07:23:41Z

Run Gluten Clickhouse CI

yma11 · 2023-12-12T02:13:30Z

Closing this PR, as the related implementation has all been done in later PRs.

[VL] support read iceberg cow table

3a3d0eb

liujiayi771 force-pushed the iceberg branch from 3a75c98 to 3a3d0eb Compare September 6, 2023 08:11

exclude orc dependency

df0d795

jinchengchenghh reviewed Sep 6, 2023

gluten-core/src/main/scala/org/apache/iceberg/spark/source/GlutenSparkBatchQueryScan.scala

jinchengchenghh reviewed Sep 6, 2023

add iceberg unit test

e88f99e

zhouyuan changed the title ~~[VL] Support read iceberg cow table~~ [GLUTEN-2877][VL]Feat: Support read iceberg cow table Sep 11, 2023

fix ck api

f9508cb

felipepessoto reviewed Sep 29, 2023

felipepessoto mentioned this pull request Sep 29, 2023

[Gluten-core][VL] Supports Delta Lake Read #2902

Closed

yma11 mentioned this pull request Oct 11, 2023

[VL] Unified design for data lake read support in Gluten + Velox #3378

Open

liujiayi771 closed this Nov 6, 2023

liujiayi771 deleted the iceberg branch November 6, 2023 04:54

liujiayi771 restored the iceberg branch November 6, 2023 04:54

liujiayi771 reopened this Nov 7, 2023

yma11 closed this Dec 12, 2023

liujiayi771 deleted the iceberg branch December 12, 2023 03:41
