Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Add signature computation support for arbitrary logical plans #77

Merged
merged 36 commits into from
Jul 23, 2020
Merged

Add signature computation support for arbitrary logical plans #77

merged 36 commits into from
Jul 23, 2020

Conversation

pirz
Copy link
Contributor

@pirz pirz commented Jun 30, 2020

What changes were proposed in this pull request?

Currently, Hyperspace has one signature provider: FileBasedSignatureProvider, which only covers LogicalRelation nodes in a plan to compute the plan signature.
This limits the DataFrames that could potentially be indexed to only those whose plans consist of only logical relation nodes.
Moving forward, Hyperspace needs to support generating signature for DataFrames with arbitrary logical plans.

Below changes address issue #76
1- Adds new signature provider which computes a signature for arbitrary logical plans.
2- Changes existing Hyperspace rules to comply with changes in the signature provider.
3- Changes the API for signature() in Signature provider to return an Option (instead of a default value) to avoid potential issues due to using a default signature.
4- It adds test cases for the new signature provider and refactors existing test code.

Below change addresses Issue #31
5- It refactors the test code in JoinIndex and FilterIndex rules tests.

Why are the changes needed?

Supporting signature computation for arbitrary logical plans.

Refactoring and cleaning up test code.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Test cases are added under new files: IndexSignatureProviderTest and RulesHelperTest.
Backward combability tested manually. A sample application was used to generate indexes with current master (0.1.0 code without this change) and new application code using "current master + this change" used to read and leverage those indexes. The output was verified manually to confirm the indexes created with an older version of bits are still usable with new bits including this change.

@pirz pirz self-assigned this Jun 30, 2020
@pirz pirz added the enhancement New feature or request label Jun 30, 2020
Comment on lines 37 to 38
var signature = ""
logicalPlan.foreachUp(o => signature += HashingUtils.md5Hex(o.nodeName))
Copy link
Contributor

@apoorvedave1 apoorvedave1 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add a final HashingUtils.md5Hex(signature) on the overall signature? (to keep it of same length as FileBasedSignatureProvider and IndexSignatureProvider

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also p instead of o as in "plan"?

Copy link
Contributor Author

@pirz pirz Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 o is actually added for "Operator" because for each operator we add hash of its node name to the signature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think "operator" when I saw o since technically it's logicalPlan.foreachUp(p: LogicalPlan => ...). But it's nit anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I will change it in the next push.

Comment on lines 32 to 40
logicalPlan match {
case r: LogicalRelation => Some(r)
case p: LogicalPlan =>
if (p.children.size == 1) {
getLogicalRelation(p.children.head)
} else {
None // plan is non-linear or it does not have any LogicalRelation
}
}
Copy link
Contributor

@apoorvedave1 apoorvedave1 Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can simplify to

val lrs = logicalPlan.collect {
    case r: LogicalRelation => r
}
if (lrs.count == 1) Some(lrs.head) else None

or something like

logicalPlan.collectLeaves match {
   case r : LogicalRelation :: Nil => r
   case _ => None
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thnx, It is changed and this code is now in RuleUtils.scala.

Comment on lines 64 to 67
private def createSimplePlan(
length: Long,
modificationTime: Long,
path: Path): LogicalRelation = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thnx.

Comment on lines 152 to 153
RulesHelper.getLogicalRelation(project) match {
case Some(r) =>
Copy link
Contributor

@apoorvedave1 apoorvedave1 Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain the reason for this change? To me, this looks like hard-coding signature calculation logic to a signature provider which works only on LogicalRelation. I think here we should keep it relaxed to any "plan" node and not just logical relation. I also don't think it saves on calculation so some explanation could help. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not done for saving calculations, but it is needed for correctness.
Assume we have a query plan as:
QueryPlan (P): Scan(R: a, b, c) -> Filter(a>5) -> Project(b)
and we have two indexes defined as:
Covering Index 1 (IX1): R(a, b)
Index 2 (IX2): Scan(a, b, c) -> Filter(a>100) -> Project(b, c)
As the LogicalRelation node is the same for all 3 plans, FileBasedSignatureProvider generates the same signature for all three of them.
IndexSignatureProvider (introduced by this change) includes node names in signature calculation and as a result it generate the same signature for P and IX2 and a different signature for IX1.

In FilterIndex and JoinIndex rules, a candidate index has to meet below two conditions:

  • Index has to be defined on the same LogicalRelation as the one query uses,
  • Index has to cover all records (i.e. index plan should not have a Filter, Join, Agg or any other node that removes or modifies original relation's records).

As a result, when we look for indexes in FilterIndex and JoinIndex rules we need to: "Extract a single logical relation node from original query plan and compare the signature of that subplan with the signature of covering indexes".
This way the signature comparison will satisfy both conditions mentioned above for these rules and only returns covering indexes that include all records of original relation.
In the above example, this procedure of extracting logical relation from query plan and comparing its signature with index signatures (which are calculated and stored using IndexSignatureProvider) results in only marking IX1 as a candidate index.

Please let me know if above explanation makes sense; we can discuss it further if you prefer.

Comment on lines 332 to 333
RulesHelper.getLogicalRelation(plan) match {
case Some(r) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as in FilterIndexRule, copy pasting here.

Could you please explain the reason for this change? To me, this looks like hard-coding signature calculation logic to a signature provider which works only on LogicalRelation. I think here we should keep it relaxed to any "plan" node and not just logical relation. I also don't think it saves on calculation so some explanation could help. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an explanation to the previous comment in FilterIndex rule. Please check that and let me know if further discussion is required. Thnx

Copy link
Contributor

@apoorvedave1 apoorvedave1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pirz, left some comments.

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few minor comments, but generally looking good to me. @pirz please let us know after you test against 0.1.0. thanks.

fsRelation.fileFormat)
}

case None =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

// There is zero or more than one LogicalRelation nodes in Filter's subplan.
case None => Nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.signature(plan) match {
case Some(s) =>
signatureMap.put(sourcePlanSignature.provider, s)
case None => return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still put the value if the signature generation is already tried. You probably need to change the map to val signatureMap = mutable.Map[String, Option[String]]().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, changed it. Thnx

@pirz
Copy link
Contributor Author

pirz commented Jul 23, 2020

@imback82 I verified backward compatibility and updated PR description to reflect that with details.

@imback82
Copy link
Contributor

@imback82 I verified backward compatibility and updated PR description to reflect that with details.

Thanks!

@@ -47,7 +48,10 @@ object RuleUtils {
.signature(plan)
signatureMap.put(sourcePlanSignature.provider, signature)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use getOrElseUpdate to remove double look up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, added it.

LogicalPlanSignatureProvider.create().name,
s)))))
val sourceFiles = df.queryExecution.optimizedPlan.collect {
case LogicalRelation(
HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation looks off? is this from auto-format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Map())
entry.state = state
entry
case None => throw HyperspaceException("Invalid plan for index dataFrame.")
Copy link
Contributor

@imback82 imback82 Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just do assert? I mean fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

private def createIndexSignature(logicalPlan: LogicalPlan): String =
new IndexSignatureProvider().signature(logicalPlan) match {
case Some(s) => s
case None => throw HyperspaceException("Invalid plan for signature generation.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fail here? unless you are testing the scenario where exception is thrown

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -130,7 +130,7 @@ class IndexSignatureProviderTest extends SparkFunSuite with SparkInvolvedSuite {
private def createIndexSignature(logicalPlan: LogicalPlan): String =
new IndexSignatureProvider().signature(logicalPlan) match {
case Some(s) => s
case None => throw HyperspaceException("Invalid plan for signature generation.")
case None => fail // Invalid plan for signature generation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do fail("Invalid plan for signature generation")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

imback82
imback82 previously approved these changes Jul 23, 2020
Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @pirz!

@imback82
Copy link
Contributor

@apoorvedave1 Any other comments?

*/
def signature(logicalPlan: LogicalPlan): Option[String] = {
var signature = ""
logicalPlan.foreachUp(p => signature += HashingUtils.md5Hex(p.nodeName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can avoid this calculation by just doing md5Hex in the last step in match-case statement. It would be faster and still correct. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@pirz pirz Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an approach that we take in FileBasedSignatureProvider as well and it makes the final value non-sensitive to the order of strings (node names in this case or files in case of FileBasedSignatureProvider). This is important as for example for a plan with a join node the signature should be independent of order of visiting join children. Otherwise the signature computation would not be reproducible if the order of visiting left and right children swap.
Visiting plan as Join,LeftChild,RightChild should produce the same signature as visiting the plan as Join,RightChild,LeftChild. Calculating Hash value per node name and summing them up makes this achievable but concatenating node names and calling hash() on final String breaks this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 @apoorvedave1 Can you please check above comment and merge this PR if you are fine with it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visiting plan as Join,LeftChild,RightChild should produce the same signature as visiting the plan as Join,RightChild,LeftChild. Calculating Hash value per node name and summing them up makes this achievable but concatenating node names and calling hash() on final String breaks this.

@pirz are you sure about this? hash(a) + hash(b) is the same as hash(b) + hash(a) when hash returns a string?

@apoorvedave1 I am not too sure about the performance improvement here (<100 nodes) and would depend on how we append the strings. I would say it would be negligible. Do you have some numbers to back up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we do this instead?

var signature = ""
logicalPlan.foreachUp(p => signature = HashingUtils.md5Hex(signature + p.nodeName))
signature match {
  case "" => None
  case _ => signature
}

We will always bound the string size and also consistent with file based signature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is getting long, I will merge this now. @pirz Can you do a follow up PR to address @apoorvedave1's concern? (if we are going with one md5hex, let's use StringBuffer).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I updated it. @imback82 and @apoorvedave1 Can you plz check/approve again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 Sure, Thanks. I guess we first need some numbers to see if calling Hash multiple times on the plans (which are normally <100 nodes as you said) is a perf bottleneck or not. If that turned out to be an issue, we will switch.

Copy link
Contributor

@apoorvedave1 apoorvedave1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor comment (which could slightly improve performance) but otherwise LGTM. (Also it would be breaking change if we do it in future.)

apoorvedave1
apoorvedave1 previously approved these changes Jul 23, 2020
@pirz pirz dismissed stale reviews from apoorvedave1 and imback82 via a188b52 July 23, 2020 18:54
Copy link
Contributor

@apoorvedave1 apoorvedave1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @pirz

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @pirz!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add signature computation support for arbitrary logical plans
4 participants