-
Notifications
You must be signed in to change notification settings - Fork 115
Add signature computation support for arbitrary logical plans #77
Conversation
var signature = "" | ||
logicalPlan.foreachUp(o => signature += HashingUtils.md5Hex(o.nodeName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add a final HashingUtils.md5Hex(signature) on the overall signature? (to keep it of same length as FileBasedSignatureProvider and IndexSignatureProvider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also p
instead of o
as in "plan"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 o
is actually added for "Operator" because for each operator we add hash of its node name to the signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think "operator" when I saw o
since technically it's logicalPlan.foreachUp(p: LogicalPlan => ...)
. But it's nit anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I will change it in the next push.
logicalPlan match { | ||
case r: LogicalRelation => Some(r) | ||
case p: LogicalPlan => | ||
if (p.children.size == 1) { | ||
getLogicalRelation(p.children.head) | ||
} else { | ||
None // plan is non-linear or it does not have any LogicalRelation | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can simplify to
val lrs = logicalPlan.collect {
case r: LogicalRelation => r
}
if (lrs.count == 1) Some(lrs.head) else None
or something like
logicalPlan.collectLeaves match {
case r : LogicalRelation :: Nil => r
case _ => None
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thnx, It is changed and this code is now in RuleUtils.scala
.
private def createSimplePlan( | ||
length: Long, | ||
modificationTime: Long, | ||
path: Path): LogicalRelation = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thnx.
RulesHelper.getLogicalRelation(project) match { | ||
case Some(r) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain the reason for this change? To me, this looks like hard-coding signature calculation logic to a signature provider which works only on LogicalRelation. I think here we should keep it relaxed to any "plan" node and not just logical relation. I also don't think it saves on calculation so some explanation could help. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not done for saving calculations, but it is needed for correctness.
Assume we have a query plan as:
QueryPlan (P): Scan(R: a, b, c) -> Filter(a>5) -> Project(b)
and we have two indexes defined as:
Covering Index 1 (IX1): R(a, b)
Index 2 (IX2): Scan(a, b, c) -> Filter(a>100) -> Project(b, c)
As the LogicalRelation node is the same for all 3 plans, FileBasedSignatureProvider
generates the same signature for all three of them.
IndexSignatureProvider
(introduced by this change) includes node names in signature calculation and as a result it generate the same signature for P and IX2 and a different signature for IX1.
In FilterIndex and JoinIndex rules, a candidate index has to meet below two conditions:
- Index has to be defined on the same LogicalRelation as the one query uses,
- Index has to cover all records (i.e. index plan should not have a Filter, Join, Agg or any other node that removes or modifies original relation's records).
As a result, when we look for indexes in FilterIndex and JoinIndex rules we need to: "Extract a single logical relation node from original query plan and compare the signature of that subplan with the signature of covering indexes".
This way the signature comparison will satisfy both conditions mentioned above for these rules and only returns covering indexes that include all records of original relation.
In the above example, this procedure of extracting logical relation from query plan and comparing its signature with index signatures (which are calculated and stored using IndexSignatureProvider) results in only marking IX1 as a candidate index.
Please let me know if above explanation makes sense; we can discuss it further if you prefer.
RulesHelper.getLogicalRelation(plan) match { | ||
case Some(r) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as in FilterIndexRule, copy pasting here.
Could you please explain the reason for this change? To me, this looks like hard-coding signature calculation logic to a signature provider which works only on LogicalRelation. I think here we should keep it relaxed to any "plan" node and not just logical relation. I also don't think it saves on calculation so some explanation could help. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an explanation to the previous comment in FilterIndex rule. Please check that and let me know if further discussion is required. Thnx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pirz, left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few minor comments, but generally looking good to me. @pirz please let us know after you test against 0.1.0. thanks.
src/main/scala/com/microsoft/hyperspace/index/IndexSignatureProvider.scala
Show resolved
Hide resolved
fsRelation.fileFormat) | ||
} | ||
|
||
case None => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// There is zero or more than one LogicalRelation nodes in Filter's subplan.
case None => Nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
.signature(plan) match { | ||
case Some(s) => | ||
signatureMap.put(sourcePlanSignature.provider, s) | ||
case None => return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still put the value if the signature generation is already tried. You probably need to change the map to val signatureMap = mutable.Map[String, Option[String]]()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, changed it. Thnx
@imback82 I verified backward compatibility and updated PR description to reflect that with details. |
Thanks! |
@@ -47,7 +48,10 @@ object RuleUtils { | |||
.signature(plan) | |||
signatureMap.put(sourcePlanSignature.provider, signature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use getOrElseUpdate
to remove double look up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, added it.
LogicalPlanSignatureProvider.create().name, | ||
s))))) | ||
val sourceFiles = df.queryExecution.optimizedPlan.collect { | ||
case LogicalRelation( | ||
HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation looks off? is this from auto-format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Map()) | ||
entry.state = state | ||
entry | ||
case None => throw HyperspaceException("Invalid plan for index dataFrame.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just do assert? I mean fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
private def createIndexSignature(logicalPlan: LogicalPlan): String = | ||
new IndexSignatureProvider().signature(logicalPlan) match { | ||
case Some(s) => s | ||
case None => throw HyperspaceException("Invalid plan for signature generation.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just fail
here? unless you are testing the scenario where exception is thrown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -130,7 +130,7 @@ class IndexSignatureProviderTest extends SparkFunSuite with SparkInvolvedSuite { | |||
private def createIndexSignature(logicalPlan: LogicalPlan): String = | |||
new IndexSignatureProvider().signature(logicalPlan) match { | |||
case Some(s) => s | |||
case None => throw HyperspaceException("Invalid plan for signature generation.") | |||
case None => fail // Invalid plan for signature generation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do fail("Invalid plan for signature generation")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @pirz!
@apoorvedave1 Any other comments? |
*/ | ||
def signature(logicalPlan: LogicalPlan): Option[String] = { | ||
var signature = "" | ||
logicalPlan.foreachUp(p => signature += HashingUtils.md5Hex(p.nodeName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can avoid this calculation by just doing md5Hex in the last step in match-case statement. It would be faster and still correct. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @imback82
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an approach that we take in FileBasedSignatureProvider as well and it makes the final value non-sensitive to the order of strings (node names in this case or files in case of FileBasedSignatureProvider). This is important as for example for a plan with a join node the signature should be independent of order of visiting join children. Otherwise the signature computation would not be reproducible if the order of visiting left and right children swap.
Visiting plan as Join,LeftChild,RightChild should produce the same signature as visiting the plan as Join,RightChild,LeftChild. Calculating Hash value per node name and summing them up makes this achievable but concatenating node names and calling hash() on final String breaks this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 @apoorvedave1 Can you please check above comment and merge this PR if you are fine with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visiting plan as Join,LeftChild,RightChild should produce the same signature as visiting the plan as Join,RightChild,LeftChild. Calculating Hash value per node name and summing them up makes this achievable but concatenating node names and calling hash() on final String breaks this.
@pirz are you sure about this? hash(a) + hash(b)
is the same as hash(b) + hash(a)
when hash
returns a string?
@apoorvedave1 I am not too sure about the performance improvement here (<100 nodes) and would depend on how we append the strings. I would say it would be negligible. Do you have some numbers to back up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, can we do this instead?
var signature = ""
logicalPlan.foreachUp(p => signature = HashingUtils.md5Hex(signature + p.nodeName))
signature match {
case "" => None
case _ => signature
}
We will always bound the string size and also consistent with file based signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is getting long, I will merge this now. @pirz Can you do a follow up PR to address @apoorvedave1's concern? (if we are going with one md5hex, let's use StringBuffer
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I updated it. @imback82 and @apoorvedave1 Can you plz check/approve again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 Sure, Thanks. I guess we first need some numbers to see if calling Hash multiple times on the plans (which are normally <100 nodes as you said) is a perf bottleneck or not. If that turned out to be an issue, we will switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor comment (which could slightly improve performance) but otherwise LGTM. (Also it would be breaking change if we do it in future.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks @pirz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @pirz!
What changes were proposed in this pull request?
Currently, Hyperspace has one signature provider:
FileBasedSignatureProvider
, which only covers LogicalRelation nodes in a plan to compute the plan signature.This limits the DataFrames that could potentially be indexed to only those whose plans consist of only logical relation nodes.
Moving forward, Hyperspace needs to support generating signature for DataFrames with arbitrary logical plans.
Below changes address issue #76
1- Adds new signature provider which computes a signature for arbitrary logical plans.
2- Changes existing Hyperspace rules to comply with changes in the signature provider.
3- Changes the API for signature() in Signature provider to return an Option (instead of a default value) to avoid potential issues due to using a default signature.
4- It adds test cases for the new signature provider and refactors existing test code.
Below change addresses Issue #31
5- It refactors the test code in JoinIndex and FilterIndex rules tests.
Why are the changes needed?
Supporting signature computation for arbitrary logical plans.
Refactoring and cleaning up test code.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Test cases are added under new files: IndexSignatureProviderTest and RulesHelperTest.
Backward combability tested manually. A sample application was used to generate indexes with current master (0.1.0 code without this change) and new application code using "current master + this change" used to read and leverage those indexes. The output was verified manually to confirm the indexes created with an older version of bits are still usable with new bits including this change.