This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Support DataSourceV2 sources #321

Closed
wants to merge 6 commits into from

Conversation

Contributor

@andrei-ionescu andrei-ionescu commented Jan 11, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

This PR adds support for DataSourceV2.

The following changes are in this PR and each of them are separate commits:

  • Use LogicalPlan instead of LogicalRelation. This gives us the possibility to add support for other kinds of relations, such as Spark Data Source V2.
  • Add support for DataSourceV2.

Does this PR introduce any user-facing change?

Yes. The source interface has changed: instead of taking a LogicalRelation parameter, it now takes a LogicalPlan. Support for DataSourceV2 sources has been added.

Detailed information can be found in the #318 proposal.
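To illustrate the shape of the interface change, here is a minimal self-contained sketch with hypothetical stand-in types (not Hyperspace's or Spark's actual classes):

```scala
// Stand-in types mimicking Spark's plan hierarchy (illustration only).
trait LogicalPlan
case class LogicalRelation(name: String) extends LogicalPlan
case class DataSourceV2Relation(name: String) extends LogicalPlan

// Before: source APIs accepted only LogicalRelation.
def describeBefore(relation: LogicalRelation): String = relation.name

// After: widening the parameter to LogicalPlan lets both kinds of
// relations flow through the same source interface.
def describeAfter(plan: LogicalPlan): String = plan match {
  case LogicalRelation(n)      => s"v1:$n"
  case DataSourceV2Relation(n) => s"v2:$n"
}
```

The widened signature is what lets a DataSourceV2Relation reach the source provider code paths at all.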

How was this patch tested?

  1. All existing tests are passing.
  2. Tested locally and on Databricks Runtime, based on the Hyperspace usage API in Apache Spark.

@andrei-ionescu andrei-ionescu changed the title Datasourcev2 Support DataSourceV2 sources Jan 11, 2021
Collaborator

@sezruby sezruby left a comment


BTW, as there's no test, could you open the next PR including IcebergIntegrationTest based on this PR, so that we can check the code works as expected?

Collaborator

@sezruby sezruby left a comment


Generally looks good to me!

Could you add a simple test case for "delete" hybrid scan in IcebergIntegrationTest?
Since Hybrid Scan test refactoring is on the way (#274), it's difficult to add Iceberg Hybrid Scan tests until the refactoring change is merged.

@imback82 Could you have a look at this? Thanks!

@andrei-ionescu
Contributor Author

@sezruby The integration test for Iceberg is in the #320. I cannot add a test for Iceberg in this PR because this one is for DataSourceV2 support only.

@sezruby
Collaborator

sezruby commented Jan 14, 2021

@andrei-ionescu Yep, I checked it; the request is adding a test case for Hybrid Scan with deleted files. You can test it by:

  • create index with lineage column & verify index application
    • use partitioned data to remove files easily & test partition spec
  • remove a few files in the source data
  • verify index application with hybrid scan disabled => shouldn't be applied
  • verify index application with hybrid scan enabled

Please use TestConfig.HybridScanEnabled to enable Hybrid Scan. The Hybrid Scan configs were updated today: https://github.com/microsoft/hyperspace/pull/300/files#diff-0f1f2bb40109f283e9c668b4755f3a68aa2ca68653d4a80c8f39de345f86cc6eR111
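The decision the steps above exercise could be sketched as a rough, self-contained simulation (all names and logic here are illustrative stand-ins, not Hyperspace's actual implementation):

```scala
// An index knows which source files it covered at creation time and whether
// it recorded a lineage column; hybrid scan decides whether the index is
// still usable after some of those files were deleted.
case class IndexState(coveredFiles: Set[String], hasLineage: Boolean)

def indexApplicable(
    index: IndexState,
    currentFiles: Set[String],
    hybridScanEnabled: Boolean): Boolean = {
  val deleted = index.coveredFiles.diff(currentFiles)
  if (deleted.isEmpty) true // source unchanged: index applies directly
  else hybridScanEnabled && index.hasLineage // deletes need hybrid scan + lineage
}
```

This mirrors the test expectations above: with hybrid scan disabled the index shouldn't be applied after deletes, and with it enabled (plus lineage) it should.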

@andrei-ionescu
Contributor Author

andrei-ionescu commented Jan 14, 2021

@sezruby I integrated this second round of feedback and rebased the PR on top of today's master.

In regards to the "Hybrid Scan with deleted files" I have some questions:

  1. Is there a test for it for non-DataSourceV2 sources that I can use as inspiration?
  2. What do you see as specific to Hybrid Scan + DataSourceV2 that is not covered by the already present tests?

There is added complexity here: DataSourceV2 needs a concrete implementation (like IcebergSource), and I don't know what that means in terms of implementation effort.

I'll add the test to IcebergIntegrationTest in #320.

@sezruby
Collaborator

sezruby commented Jan 14, 2021

@andrei-ionescu
1 => You can refer to
"Verify JoinIndexRule utilizes indexes correctly after quick refresh when some file gets deleted and some appended to source data." in DeltaLakeIntegrationTest, except for the refreshIndex part of the test. If the hybrid scan config is enabled, we don't need to refresh the index. In that test there are both deleted and appended files, but we could test with deleted files only (and with both).

2 => Yes, we need to check the exact plan transformation of Hybrid Scan + DataSource V2, but let's just check whether the plan is transformed or not for now. A few things I'd like to check:

  • deleted files after index creation => Hybrid Scan should be applied
  • partitioned source data & appended files after index creation => Hybrid Scan should handle the appended data properly

And could you share the result of query.queryExecution.optimizedPlan/sparkPlan after applying the index + Hybrid Scan on IcebergSource, for the following cases:

  • appended files only (the current test)
  • deleted files only
  • appended files + deleted files

Yep I'll check the IcebergIntegrationTest in #320.

Thanks for the work 👍 👍

@andrei-ionescu
Contributor Author

andrei-ionescu commented Jan 14, 2021

@sezruby DataSourceV2 and DataSourceV2Relation are just APIs. They don't necessarily have anything to do with files on disk. There can be implementations over files on disk, but at the same time they can be over other means (like a Kafka reader). This is very similar to LogicalRelation, where you can have relations over files or not.

Asking for a test for DataSourceV2 (DataSourceV2Relation) is like asking for tests over LogicalRelation.

If I try to add a test for it, I first need to implement a source based on the DataSourceV2 API with some file-backed logic to be used for tests. This would mean adding both the new source and the new *FilesBasedSource in index/sources.

IcebergSource is an implementation of the DataSourceV2 API that works with files. It's the perfect place for such a test.

@sezruby
Collaborator

sezruby commented Jan 14, 2021

Sorry for the confusion, I meant Hybrid Scan + Iceberg. Please add the tests in #320 :)

We need to make sure that this change works as expected before merging it.

@andrei-ionescu
Contributor Author

andrei-ionescu commented Jan 14, 2021

@sezruby I added the requested tests in https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3 similar to the DeltaLakeIntegrationTests ones.

BTW, it did surface a place where I had missed adding the pattern matching on DataSourceV2Relation, so it was a good call to have the tests added in #320. Thanks @sezruby!

@andrei-ionescu andrei-ionescu force-pushed the datasourcev2 branch 4 times, most recently from e65c924 to c74b9f5 Compare January 15, 2021 20:59
@sezruby sezruby requested a review from imback82 January 18, 2021 09:31
sezruby previously approved these changes Jan 19, 2021

Contributor

@imback82 imback82 left a comment


I did a quick review. I will do a more thorough review this week. Since this is touching the core parts, I want to make sure this is reviewed thoroughly.

@apoorvedave1 / @pirz Can you please take a look?

@@ -211,6 +213,7 @@ private[actions] abstract class CreateActionBase(dataManager: IndexDataManager)
// Extract partition keys, if original data is partitioned.
val partitionSchemas = df.queryExecution.optimizedPlan.collect {
case LogicalRelation(HadoopFsRelation(_, pSchema, _, _, _, _), _, _, _) => pSchema
case DataSourceV2Relation(_, _, _, _, uSchema) => uSchema.getOrElse(StructType(Nil))
Contributor


Hmm, I don't think we should have DataSourceV2Relation specific code here. Can we move this to the source provider API?

Contributor Author


Both LogicalRelation and DataSourceV2Relation are on the same level: both directly extend LeafNode. If LogicalRelation is handled here, I would say that DataSourceV2Relation should also be here, as this PR opens up to the DataSourceV2 Spark API.

Contributor


Sorry, I didn't get your argument about "the same level". Why can't we introduce partitionSchema to the source provider? I think we missed moving this into the source provider since default/delta have the same implementation; we can have a different implementation (matching FileIndex) for them.

Contributor Author


  1. Delta is not built on top of the DataSourceV2 Spark API, thus it's not the same implementation.
  2. Both LogicalRelation and DataSourceV2Relation are the first "children" of LeafNode; both directly extend it:
             LeafNode
              //  \\
             //    \\
            //      \\
LogicalRelation    DataSourceV2Relation
  3. This PR addresses support for DataSourceV2, which is Spark, not Iceberg.

Contributor

@imback82 imback82 Feb 8, 2021


My point is that rules shouldn't directly work with LogicalRelation or DataSourceV2Relation. I think we can abstract that out. Source provider can choose which relation it supports.
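A rough sketch of that abstraction (hypothetical names, self-contained stand-ins rather than Spark's real classes): rules consult registered providers, and each provider decides which relation type it supports:

```scala
// Stand-in plan nodes (illustration only).
trait LogicalPlan
case class LogicalRelation(partitionCols: Seq[String]) extends LogicalPlan
case class DataSourceV2Relation(partitionCols: Seq[String]) extends LogicalPlan

// Each source provider knows how to read metadata from "its" relation type;
// rules never pattern-match on concrete relation classes themselves.
trait SourceProvider {
  def partitionSchema(plan: LogicalPlan): Option[Seq[String]]
}

object V1FileBasedProvider extends SourceProvider {
  def partitionSchema(plan: LogicalPlan): Option[Seq[String]] = plan match {
    case LogicalRelation(cols) => Some(cols)
    case _                     => None
  }
}

object V2Provider extends SourceProvider {
  def partitionSchema(plan: LogicalPlan): Option[Seq[String]] = plan match {
    case DataSourceV2Relation(cols) => Some(cols)
    case _                          => None
  }
}

// A rule stays relation-agnostic: it asks the registered providers in turn.
def resolvePartitionSchema(
    providers: Seq[SourceProvider],
    plan: LogicalPlan): Seq[String] =
  providers.flatMap(_.partitionSchema(plan)).headOption.getOrElse(Nil)
```

Adding a new relation type then means registering one more provider, with no changes to the rules.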

Contributor


Btw, I will create a PR to your branch this week.

case baseRelation @ LogicalRelation(
_ @HadoopFsRelation(location: FileIndex, _, _, _, _, _),
baseOutput,
_,
_) =>
val (filesDeleted, filesAppended) =
Contributor


@andrei-ionescu Could you also add some "guide" comments in this file? I see that many lines are being moved/refactored without the logic being changed. It would help reviewers if you add something like "this part has moved to line blah", "refactored to foo", etc. Thanks.

@andrei-ionescu andrei-ionescu force-pushed the datasourcev2 branch 3 times, most recently from 3616107 to 37100fa Compare January 29, 2021 20:55
@imback82
Contributor

imback82 commented Feb 2, 2021

@andrei-ionescu we are planning to do the next release at the end of February. I marked the milestone for this PR accordingly.

@imback82 imback82 added this to the February 2021 (v0.5.0) milestone Feb 2, 2021
@imback82 imback82 added the enhancement New feature or request label Feb 2, 2021
@andrei-ionescu
Contributor Author

@imback82 Thanks! Is there anything else that I have to do on my side?

@imback82
Contributor

imback82 commented Feb 2, 2021

Not at the moment. I will finish reviewing this PR this week. Thanks!

Contributor

@imback82 imback82 left a comment


@andrei-ionescu I went through this PR once more. I see lots of code that depends on LogicalRelation and DataSourceV2Relation in rules. I think we can solve this by introducing one level of abstraction (so that rules don't need to worry about handling specific relations). I will create a PR to your branch in the coming days.


val relation = makeHadoopFsRelation(index, v2Relation)
val updatedOutput =
output.filter(attr => relation.schema.fieldNames.contains(attr.name))
new LogicalRelation(relation, updatedOutput, None, false)
Contributor


Do we lose anything by going from DataSourceV2Relation to LogicalRelation?

Contributor Author

@andrei-ionescu andrei-ionescu Feb 8, 2021


The intention is to overwrite the plan with the relation that the index dataset uses. The index uses Parquet files which are not related to DataSourceV2 API in any way.

@imback82
Contributor

@andrei-ionescu Sorry for being late to create the PR. I just created a draft PR to your branch: andrei-ionescu#1.

I haven't finished it, but you can see that the duplicate logic handling LogicalRelation and DataSourceV2Relation can be removed with a SourceRelation abstraction, and some of the SourceProvider APIs like allFiles can be moved to SourceRelation, which makes more sense.

I tried to finish the PR on your branch, but I am introducing some changes to the existing code and need to revert some of your code in RuleUtils.scala, so I think it may be better to create a fresh PR to master repo.

Are you OK with it? I will make you a co-author since you inspired the refactoring. And once that PR is done, you can convert this PR to implement SourceRelation for DataSourceV2Relation, which should be simple. WDYT?

@andrei-ionescu
Contributor Author

@imback82 I'm OK with it, and thanks for your involvement. I'd ask you to do it ASAP because I will also need to update the implementation to support Iceberg.

@imback82
Contributor

Thanks, I'll have it by tomorrow (oof today)

case _ => false
}
case v2: DataSourceV2Relation =>
v2.options.exists(_.equals(IndexConstants.INDEX_RELATION_IDENTIFIER))
Contributor

@imback82 imback82 Feb 11, 2021


Question: where are you injecting this option to v2.options? Looks like this will always be false?
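To illustrate the reviewer's point with a hypothetical, self-contained sketch (not Hyperspace's actual code): the check in the snippet above can only ever succeed if the plan-rewriting side injects the marker into the relation's options first, e.g.:

```scala
// Hypothetical marker, standing in for IndexConstants.INDEX_RELATION_IDENTIFIER.
val IndexRelationIdentifier: (String, String) = "hyperspace.index" -> "true"

// Stand-in for DataSourceV2Relation's options map.
case class V2Relation(options: Map[String, String])

// The rewrite that swaps source data for index data would inject the tag...
def tagAsIndexRelation(r: V2Relation): V2Relation =
  r.copy(options = r.options + IndexRelationIdentifier)

// ...so that this check (mirroring the `options.exists` snippet above)
// can ever return true; without the injection it is always false.
def isIndexRelation(r: V2Relation): Boolean =
  r.options.exists(_ == IndexRelationIdentifier)
```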

@andrei-ionescu
Contributor Author

Closing this PR because of the new work from @imback82 in PR #355. I created a new PR covering only the Iceberg table format: #358.

@andrei-ionescu andrei-ionescu deleted the datasourcev2 branch February 22, 2021 20:42
@imback82
Contributor


@andrei-ionescu would it be OK for you if we do the release at the end of March? It seems more realistic to do bi-monthly releases to pack in more features. But if this is a blocker for you, we can do a minor release this week. Please let me know. Thanks!

Labels
  • advanced issue — This is the tag for advanced issues which involve major design changes or introduction
  • enhancement — New feature or request
Development

Successfully merging this pull request may close these issues.

  • [PROPOSAL]: Support Iceberg table format
  • [FEATURE REQUEST]: Add support for Iceberg table format
4 participants