[GLUTEN-2877][VL]Feat: Support read iceberg cow table #3043

Closed
wants to merge 4 commits into from

Conversation

liujiayi771
Contributor

What changes were proposed in this pull request?

Support reading Iceberg copy-on-write (COW) tables in Gluten. This resolves the first step of #2877.

How was this patch tested?

Ran the TPC-DS benchmark on 1 TB TPC-DS Iceberg tables.

@github-actions

github-actions bot commented Sep 6, 2023

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

@github-actions

github-actions bot commented Sep 6, 2023

Run Gluten Clickhouse CI

@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
  override def filterExprs(): Seq[Expression] = scan match {
    case fileScan: FileScan =>
      fileScan.dataFilters ++ pushdownFilters
    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>
Contributor

Why not use isInstanceOf?

Contributor Author

SparkBatchQueryScan is not a public class in Iceberg, so we cannot import it here.
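
For readers following the thread, the constraint can be sketched as below. A class that is not visible from another package cannot be named in a type pattern or an isInstanceOf check there, so the match falls back to a guard on the runtime class name. The `Scan` trait and both classes are hypothetical stand-ins for illustration, not the real Spark or Iceberg types.

```scala
// Hypothetical stand-ins for illustration only; not the real Spark/Iceberg types.
trait Scan
final class LocalFileScan extends Scan        // plays the role of Spark's FileScan
final class SparkBatchQueryScan extends Scan  // plays the role of Iceberg's non-public scan

object ScanKind {
  // Classify a scan. The first case uses a type pattern, which requires the class
  // to be importable. The second case uses a guard on the simple class name, which
  // works even when the class cannot be referenced at compile time.
  def describe(scan: Scan): String = scan match {
    case _: LocalFileScan => "file-scan"
    case s if s.getClass.getSimpleName == "SparkBatchQueryScan" => "iceberg-scan"
    case _ => "unsupported"
  }
}
```

The trade-off is that a name guard is checked only at runtime, so a renamed class silently stops matching.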

tasks.map(_.asCombinedScanTask()).foreach {
  task =>
    val file = task.files().asScala.head.file()
    file.format() match {
Contributor

Iceberg files may be a mix of Parquet and ORC. How can we handle this?

Contributor

Maybe this is a constraint that we can document. Checking all the files is too expensive, and this situation is rare.

Contributor Author

Yes, we should clarify the constraints.
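
To make the trade-off concrete: the hunk above reads the format from the first file only, whereas an exhaustive check would have to visit every file. A minimal sketch of that exhaustive check follows, using a hypothetical `FileFormat` type rather than Iceberg's own enum. Its cost grows linearly with the number of files, which is the expense the reviewers mention.

```scala
object FormatCheck {
  // Hypothetical format type for illustration; Iceberg has its own FileFormat enum.
  sealed trait FileFormat
  case object Parquet extends FileFormat
  case object Orc extends FileFormat

  // Returns Some(format) only when every file shares a single format.
  // Returns None for an empty list or for a mix of formats, in which case
  // a caller could fall back to the non-native scan path.
  def uniformFormat(formats: Seq[FileFormat]): Option[FileFormat] =
    formats.distinct match {
      case Seq(single) => Some(single)
      case _           => None
    }
}
```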

@@ -47,7 +47,7 @@ class SoftAffinitySuite extends QueryTest with SharedSparkSession with Predicate
).toArray
)

- val locations = SoftAffinityUtil.getFilePartitionLocations(partition)
+ val locations = SoftAffinityUtil.getFilePartitionLocations(partition, partition.files)
Contributor

You can get partition.files inside getFilePartitionLocations, so there is no need to add it as an argument.

Contributor Author

This change is needed because the files of an Iceberg partition cannot be obtained directly through icebergPartition.files.

https://github.com/liujiayi771/gluten/blob/df0d79568da5561ba5b45a425903f21922558cc3/backends-velox/src/main/scala/io/glutenproject/backendsapi/velox/IteratorHandler.scala#L127
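
One way to read this exchange: when the helper takes the file list as an explicit argument, callers whose partition type does not expose `files` can still supply the files however they obtain them. A minimal sketch of that shape, with a hypothetical `FileInfo` type and helper name (this is not the actual SoftAffinityUtil code):

```scala
object LocationSketch {
  // Hypothetical minimal file descriptor: a path plus the hosts that hold it.
  final case class FileInfo(path: String, hosts: Seq[String])

  // Collect the distinct hosts from an explicit file list, preserving first-seen
  // order. Because the files are passed in, the helper does not depend on any
  // particular partition class.
  def preferredLocations(files: Seq[FileInfo]): Seq[String] =
    files.flatMap(_.hosts).distinct
}
```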

@jinchengchenghh
Contributor

Can you add a test for Iceberg tables?

@ulysses-you
Contributor

I'm afraid it would be hard to maintain if we mix Iceberg code in everywhere across Gluten. What this PR does is something like:

Gluten    Iceberg
   \       /
     \   /
       |
   Spark SQL

I wonder if we can make it like:

    Iceberg
       |
     Gluten 
       |
   Spark SQL

The main difference is that we can create new modules for Iceberg or other Spark downstream projects. We can expose the power of Gluten as extensions as much as possible, and these modules can then be developed on top of the Gluten extensions.

@github-actions

github-actions bot commented Sep 7, 2023

Run Gluten Clickhouse CI

@liujiayi771
Contributor Author

> Can you add a test for Iceberg tables?

Added a new test suite, VeloxTPCHIcebergSuite.

@liujiayi771
Contributor Author

> I'm afraid it would be hard to maintain if we mix Iceberg code in everywhere across Gluten. What this PR does is something like:
>
> Gluten    Iceberg
>    \       /
>      \   /
>        |
>    Spark SQL
>
> I wonder if we can make it like:
>
>     Iceberg
>        |
>      Gluten
>        |
>    Spark SQL
>
> The main difference is that we can create new modules for Iceberg or other Spark downstream projects. We can expose the power of Gluten as extensions as much as possible, and these modules can then be developed on top of the Gluten extensions.

I agree with you. We need to add the ability to inject new rules into the scan-processing part of gluten-core. We will optimize this later. Our team is also working on support for Delta, Hudi, and Paimon. Supporting different lake formats requires a better architecture.
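
The "inject new rules" idea can be sketched roughly as a small registry that the core owns and that format-specific modules plug into. Everything below (`Scan`, `ScanRule`, `Registry`, and the returned transformer names) is hypothetical and only illustrates the layering being proposed; it is not Gluten's actual extension API.

```scala
object ScanExtensionSketch {
  // Hypothetical minimal scan descriptor.
  final case class Scan(className: String)

  // A rule handles only the scans it recognizes and names the transformer to use.
  type ScanRule = PartialFunction[Scan, String]

  final class Registry {
    private var rules: Vector[ScanRule] = Vector.empty

    // Called by a format-specific module (Iceberg, Delta, ...) to plug itself in.
    def inject(rule: ScanRule): Unit = {
      rules = rules :+ rule
    }

    // Apply the first injected rule that recognizes the scan, if any.
    def transform(scan: Scan): Option[String] =
      rules.collectFirst { case r if r.isDefinedAt(scan) => r(scan) }
  }
}
```

With this shape the core never mentions Iceberg; an Iceberg module would call `inject` at startup, and scans that no rule recognizes fall through as `None`.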

@zhouyuan changed the title from "[VL] Support read iceberg cow table" to "[GLUTEN-2877][VL]Feat: Support read iceberg cow table" on Sep 11, 2023
@github-actions

#2877

@github-actions

Run Gluten Clickhouse CI

@@ -48,6 +50,8 @@ class BatchScanExecTransformer(
  override def filterExprs(): Seq[Expression] = scan match {
    case fileScan: FileScan =>
      fileScan.dataFilters ++ pushdownFilters
    case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" =>
Contributor

Comparing simple names may be dangerous. What if Delta Lake (or another format) has a class with the same name?

Contributor Author

> Comparing simple names may be dangerous. What if Delta Lake (or another format) has a class with the same name?

https://github.com/oap-project/gluten/blob/ff79df0084aec978161e99359c3d4dd58376ec45/gluten-core/src/main/scala/io/glutenproject/execution/BatchScanExecTransformer.scala#L127

You are right, it would be more accurate to use a name that includes the package name, but in practice the scan class names of different formats differ. Gluten already uses the simple name when determining the scan type.
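
The difference between the two comparisons can be sketched as follows. The fully qualified name used here is an assumption about where Iceberg places the class and should be verified against the Iceberg version in use; the helper names are hypothetical.

```scala
object ScanNames {
  // Assumed fully qualified name of Iceberg's scan class; verify before relying on it.
  val IcebergBatchScan = "org.apache.iceberg.spark.source.SparkBatchQueryScan"

  // Stricter check: compare the full runtime class name, package included.
  def isIcebergScan(scan: AnyRef): Boolean =
    scan.getClass.getName == IcebergBatchScan

  // Looser check, shown for contrast: strip the package and compare only the
  // simple name, so a same-named class from any other package also matches.
  def simpleNameMatches(className: String): Boolean =
    className.substring(className.lastIndexOf('.') + 1) == "SparkBatchQueryScan"
}
```

A class such as `com.example.other.SparkBatchQueryScan` would pass the simple-name check but not the fully qualified one, which is the collision the reviewer is worried about.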

@yma11
Contributor

yma11 commented Oct 30, 2023

@liujiayi771 Are you working on refactoring this PR, or do you plan to do so based on the discussion?

@liujiayi771
Contributor Author

> @liujiayi771 Are you working on refactoring this PR, or do you plan to do so based on the discussion?

We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request.

@liujiayi771
Contributor Author

@yma11 Apologies for the delay. We will completely decouple iceberg, delta, and gluten-core. Many areas will require modifications.

@yma11
Contributor

yma11 commented Oct 30, 2023

> We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request.

I see. Looking forward to your design. Thanks.

@liujiayi771 liujiayi771 closed this Nov 6, 2023
@liujiayi771 liujiayi771 deleted the iceberg branch November 6, 2023 04:54
@liujiayi771 liujiayi771 restored the iceberg branch November 6, 2023 04:54
@liujiayi771 liujiayi771 reopened this Nov 7, 2023

github-actions bot commented Nov 7, 2023

Run Gluten Clickhouse CI

@yma11
Contributor

yma11 commented Dec 12, 2023

Closing this PR, as the related implementation has all been done in later PRs.

@yma11 yma11 closed this Dec 12, 2023
@liujiayi771 liujiayi771 deleted the iceberg branch December 12, 2023 03:41
5 participants