[Gluten-core][VL] Supports Delta Lake Read #2902

Closed
Shirosakirukia wants to merge 5 commits

@Shirosakirukia commented Aug 25, 2023

What changes were proposed in this pull request?

  1. Supports Delta scan in Velox.
  2. Delta 2.x supports Column Mapping, which this PR also supports.
  3. Does not support Deletion Vectors, a newer feature available since Delta 2.3.

(Fixes: #ISSUE-2891)

How was this patch tested?

TPC-DS test

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions

Run Gluten Clickhouse CI

@felipepessoto
Contributor

@Shirosakirukia, do you know why Delta doesn't just work, given that it is implemented as an extension of ParquetFileFormat?

It is not clear to me why some things work (for example, it scans the correct set of Parquet files instead of all the files in the folder) while others don't.

You are re-implementing column mapping here. Ideally, we shouldn't duplicate the Delta implementation: it would be impossible to maintain, and it would also miss many other features, such as the optimize command, DV, the reorg command, optimize write, auto compact, invariants, check constraints, etc.

@YannByron
Contributor

@felipepessoto We need to distinguish between these features (whether in OSS Delta or Databricks Delta) and identify which ones need Gluten/Velox support. For example, some optimize-related features (auto-compaction, optimize write) only redistribute data across files, and constraints only affect whether incoming data meets them and how to deal with unqualified data. So IMO, these features don't need to be taken into account when making Gluten/Velox support Delta Lake.

Features like column mapping and DV, on the other hand, do need support. But the two are still different. Essentially, column mapping is just a mapping between the table schema and the file schema, so we can append a ProjectExec before the FileScanExec (as this PR does) to make the native ParquetScan work for Delta scans.
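
To make the column-mapping point concrete, here is a minimal Python sketch of the idea (it is not Gluten, Spark, or Delta code). The scan reads the physical column names stored in the Parquet file, and a projection over the scan output renames them back to the logical table schema. The mapping dictionary, the sample physical names such as `col-a1b2`, and the helpers `physical_columns` and `project_to_logical` are all hypothetical.

```python
# Hypothetical mapping from logical (table) column names to the physical
# names stored in the Parquet files. The values are made up for illustration.
LOGICAL_TO_PHYSICAL = {
    "id": "col-a1b2",
    "name": "col-c3d4",
}


def physical_columns(logical_columns, mapping):
    """Translate requested logical columns into the names the file scan must read."""
    return [mapping[c] for c in logical_columns]


def project_to_logical(rows, mapping):
    """Rename physical column names in scanned rows back to logical names."""
    physical_to_logical = {phys: logical for logical, phys in mapping.items()}
    return [
        {physical_to_logical[key]: value for key, value in row.items()}
        for row in rows
    ]


# Rows as a plain Parquet reader would return them: keyed by physical names.
scanned = [
    {"col-a1b2": 1, "col-c3d4": "alice"},
    {"col-a1b2": 2, "col-c3d4": "bob"},
]

# The "project" step restores the names the query was written against.
result = project_to_logical(scanned, LOGICAL_TO_PHYSICAL)
```

Because the rename is a pure function of the schema metadata, it can live in the plan as a projection without the native Parquet reader knowing anything about Delta.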
For DV, it's more complicated. Theoretically, we could also transform the Deletion Vector into a FilterExec, perhaps backed by a bitmap, and put it before the FileScanExec, but this is not a good approach and it also hurts read efficiency. So I prefer a solution that makes Velox support Delta scans with DV.
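
As a rough picture of the Deletion Vector case (again a hypothetical Python sketch, not Delta's on-disk format or Velox code): a DV marks row positions within a file as deleted, so applying it as a filter means checking every scanned row's position against that set. The plain `set` standing in for the bitmap and the function `apply_deletion_vector` are illustrative assumptions.

```python
def apply_deletion_vector(rows, deleted_positions):
    """Yield only rows whose position in the file is not marked as deleted."""
    for position, row in enumerate(rows):
        if position not in deleted_positions:
            yield row


# Five rows from one data file; positions 1 and 3 are marked deleted.
file_rows = ["r0", "r1", "r2", "r3", "r4"]
deleted = {1, 3}

live_rows = list(apply_deletion_vector(file_rows, deleted))
```

Unlike the column rename, this filter depends on each row's position inside a specific file, which is why handling it in the reader itself is attractive compared with a separate filter over the scan output.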

@YannByron
Contributor

@felipepessoto Based on this, we prefer to support Delta Column Mapping by rewriting the plan, and to support DV later by having Velox support DeltaFileFormat.

@github-actions

github-actions bot commented Sep 1, 2023

Run Gluten Clickhouse CI

@github-actions

github-actions bot commented Sep 1, 2023

Run Gluten Clickhouse CI

@github-actions

github-actions bot commented Sep 7, 2023

Run Gluten Clickhouse CI

@YannByron
Contributor

@zhouyuan could you take a look, please?

@github-actions

Run Gluten Clickhouse CI

@felipepessoto
Contributor

felipepessoto commented Sep 29, 2023

The Iceberg PRs can also provide some ideas:

Gluten
#3043

They implemented it differently: instead of changing FileSourceScanExecTransformer.scala to return ParquetReadFormat, they changed BatchScanExecTransformer.fileFormat to return ParquetReadFormat (or ORC in their case).

I wonder if we could use a simple approach for both cases. I don't know which one is better, though.

Velox
facebookincubator/velox#5977 - facebookincubator/velox#5897

@YannByron
Contributor

YannByron commented Oct 7, 2023

@felipepessoto
I know about #3043. The key reason the two PR implementations differ is not the lake format (one for Delta Lake, one for Iceberg) but the Spark datasource interface used: Delta Lake uses Spark DS V1, while Iceberg uses Spark DS V2.

For Spark datasource V2, we are working on a better design: a generic interface in Gluten to support datasources that use Spark DS V2 (like Iceberg and Paimon), and a project framework that makes it easier to support more formats, as @liujiayi771 said in #3043 (comment).

Looking forward to your reply.

@felipepessoto
Contributor

Got it. I don’t have much to add here. I just started with Gluten and am still learning it.

Hope to see this merged soon, as I mostly use Delta tables.

@felipepessoto
Contributor

@Shirosakirukia the build is failing. Any idea why it can't find the Delta classes? Maybe you need to specify Delta 2.2.0 as the version to use, which is compatible with Spark 3.3.

java.lang.NoClassDefFoundError: org/apache/spark/sql/delta/DeltaParquetFileFormat
    at io.glutenproject.extension.RewritePlanIfNeeded.io$glutenproject$extension$RewritePlanIfNeeded$$isDeltaColumnMappingFileFormat(ColumnarOverrides.scala:70)
    at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:63)

@Shirosakirukia
Author

@felipepessoto Sure. The Gluten-core build was successful, with no functional issues. The error message indicates that the velox-backend test cannot find the DeltaParquetFileFormat class. @YannByron could you take a look, please?

@YannByron
Contributor

I will take over this PR soon, and may open another one to address the CI failure and add some UTs.


This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Nov 26, 2023

github-actions bot commented Dec 7, 2023

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

@github-actions github-actions bot closed this Dec 7, 2023
Labels: stale
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Velox doesn't work with Spark Delta Lake
3 participants