-
Notifications
You must be signed in to change notification settings - Fork 115
Conversation
581d02c
to
2f5269b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, as there's no test, could you open the next PR including IceBergIntegrationTest
based on this PR? so that we could check the code works as expected.
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
cb9295d
to
5ce3bf8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me!
Could you add a simple test case for "delete" hybrid scan in IcebergIntegrationTest
?
Since Hybrid Scan test refactoring is on the way (#274), it's difficult to add Iceberg Hybrid Scan tests until the refactoring change is merged.
@imback82 Could you have a look at this? Thanks!
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
@andrei-ionescu Yep, I checked it; the request is adding a test case for Hybrid Scan with deleted files. You can test it by:
Please use |
5ce3bf8
to
09cf7e0
Compare
@sezruby I integrated this second round of feedback and rebased the PR on top of today's master. In regards to the "Hybrid Scan with deleted files" I have some questions:
There is a complexity added to it: I'll add the test on the |
@andrei-ionescu 2 => Yes, we need to check the exact plan transformation of Hybrid Scan + Datasource v2, but let's just check if the plan is transformed or not for now. Few things I'd like to check:
And could you share the result of
Yep I'll check the Thanks for the work 👍 👍 |
@sezruby The Asking for a test for If I try to add a test for it first I need to implement the source based on the
|
Sorry for the confusion, I meant Hybrid Scan + Iceberg. Please add the tests in #320 :) We need to make sure that this change works as expected before merging it. |
09cf7e0
to
2638de3
Compare
@sezruby I added the requested tests in https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3 similar to the BTW, it did show up a place where I missed adding the pattern matching on |
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
e65c924
to
c74b9f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 Could you review this change?
Tests are in #320 https://github.com/microsoft/hyperspace/pull/320/files#diff-ce1f32f296e1683385beb0fe1954b154710c0ba0120f028167afbe5953347dd3R61
Thanks!
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick review. I will do a more thorough review this week. Since this is touching the core parts, I want to make sure this is reviewed thoroughly.
@apoorvedave1 / @pirz Can you please take a look?
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
@@ -211,6 +213,7 @@ private[actions] abstract class CreateActionBase(dataManager: IndexDataManager) | |||
// Extract partition keys, if original data is partitioned. | |||
val partitionSchemas = df.queryExecution.optimizedPlan.collect { | |||
case LogicalRelation(HadoopFsRelation(_, pSchema, _, _, _, _), _, _, _) => pSchema | |||
case DataSourceV2Relation(_, _, _, _, uSchema) => uSchema.getOrElse(StructType(Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think we should have DataSourceV2Relation
specific code here. Can we move this to the source provider API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both LogicalRelation
and DataSourceV2Relation
are on the same level. Both directly extend LeafNode
. If LogicalRelation
is present here I would say that DataSourceV2Relation
should also be here, as in this PR we open up to DataSourceV2
Spark API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't get your argument on the same level. Why can't we introduce partitionSchema
to the source provider? I think we missed moving this into source provider since default/delta have the same implementation; we can have the different implementation (matching FileIndex) for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Delta is not built on top of DataSourceV2 Spark API thus it's not the same implementation.
- Both
LogicalRelation
andDataSourceV2Relation
are first "child" fromLeafNode
, both directly extendLeafNode
.
LeafNode
// \\
// \\
// \\
LogicalRelation DataSourceV2Relation
- This is the PR addressing support for DataSourceV2 which is Spark not Iceberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that rules shouldn't directly work with LogicalRelation
or DataSourceV2Relation
. I think we can abstract that out. Source provider can choose which relation it supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I will create a PR to your branch this week.
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/FilterIndexRule.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/delta/DeltaLakeFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/default/DefaultFileBasedSource.scala
Outdated
Show resolved
Hide resolved
case baseRelation @ LogicalRelation( | ||
_ @HadoopFsRelation(location: FileIndex, _, _, _, _, _), | ||
baseOutput, | ||
_, | ||
_) => | ||
val (filesDeleted, filesAppended) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrei-ionescu Could you also add some comments for "guides" in this file. I see that many lines are being moved / refactored without the logic being changed. It would help reviewers if you add something like "this part has moved to line blah, refactored to "foo", etc. Thanks.
3616107
to
37100fa
Compare
@andrei-ionescu we are planning to do the next release at the end of February. I marked the milestone for this PR accordingly. |
@imback82 Thanks! Is there anything else that I have to do on my side? |
Not at the moment. I will finish reviewing this PR this week. Thanks! |
37100fa
to
469d40a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrei-ionescu I went thru this PR once more. I see lots of code that depends on LogicalRelation
and DataSourceV2Relation
in rules. I think we can solve this by introducing one level of abstraction (so that rules don't need to worry about handling specific relations). I will create a PR to your branch in coming days.
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/util/LogicalPlanUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/sources/interfaces.scala
Outdated
Show resolved
Hide resolved
@@ -211,6 +213,7 @@ private[actions] abstract class CreateActionBase(dataManager: IndexDataManager) | |||
// Extract partition keys, if original data is partitioned. | |||
val partitionSchemas = df.queryExecution.optimizedPlan.collect { | |||
case LogicalRelation(HadoopFsRelation(_, pSchema, _, _, _, _), _, _, _) => pSchema | |||
case DataSourceV2Relation(_, _, _, _, uSchema) => uSchema.getOrElse(StructType(Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't get your argument on the same level. Why can't we introduce partitionSchema
to the source provider? I think we missed moving this into source provider since default/delta have the same implementation; we can have the different implementation (matching FileIndex) for them.
val relation = makeHadoopFsRelation(index, v2Relation) | ||
val updatedOutput = | ||
output.filter(attr => relation.schema.fieldNames.contains(attr.name)) | ||
new LogicalRelation(relation, updatedOutput, None, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we lose anything by going from DataSourceV2Relation
to LogicalRelation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to overwrite the plan with the relation that the index dataset uses. The index uses Parquet files which are not related to DataSourceV2 API in any way.
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
469d40a
to
fb070fc
Compare
@andrei-ionescu Sorry for being late to create the PR. I just created a draft PR to your branch: andrei-ionescu#1. I haven't finished it, but you can see that duplicate logics to handle I tried to finish the PR on your branch, but I am introducing some changes to the existing code and need to revert some of your code in Are you OK with it? I will make you a co-author since you inspired the refactoring. And once the PR is done, you can convert this PR to to implement |
@imback82 I'm ok with it and thanks for your involvement. I would like to ask you to do it ASAP because I will need to update the implementation for supporting Iceberg too. |
Thanks, I'll have it by tomorrow (oof today) |
…m multiple sources
The new `IndexHadoopFsRelation` wrapper was not used after rebasing Added wrapper functions to maintain the closure context
fb070fc
to
4dc8a94
Compare
case _ => false | ||
} | ||
case v2: DataSourceV2Relation => | ||
v2.options.exists(_.equals(IndexConstants.INDEX_RELATION_IDENTIFIER)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: where are you injecting this option to v2.options
? Looks like this will always be false?
@andrei-ionescu would it be OK for you if we do the release at the end of March. It seems more realistic to do bi-monthly releases to pack more features. But if this is a blocker for you, we can do a minor release this week. Please let me know. Thanks! |
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for DataSourceV2.
The following changes are in this PR and each of them are separate commits:
LogicaPlan
instead ofLogicalRelation
. This gives us the possibility to add support for other kinds of relations like Spark datas source version 2.DataSourceV2
.Does this PR introduce any user-facing change?
The source interfaces has changed. Now instead of using
LogicalRelation
as parameter type it now usesLogicalPlan
. There has been added support forDataSourceV2
sources.Detailed information can be found in the #318 proposal.
How was this patch tested?