-
Notifications
You must be signed in to change notification settings - Fork 115
Conversation
assert(FilterIndexRanker.rank(indexes, false).get.equals(ind1)) | ||
} | ||
|
||
test("testRank: return the index with the largest source file list if HybridScan is enabled") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why testRank
instead of FilterIndexRankerTest
? Sorry for the naive question - I'm trying to understand how we should name these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just used the previous one in JoinRankerTest. Yes we need some guideline for the naming 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 Any recommendations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, testRank
seems weird. I would do something like test("rank() should return the index with the largest number of source files if HybridScan is enabled)
. (Note that FilterIndexRanker
is omitted since it's already under FilterIndexRankerTest
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... good you wrote a great name 🙂
I was also hoping: can we agree on a naming convention for tests now that the code base complexity is increasing? For instance:
- Tests do not need to contain the test method name
- Name of the test should clearly state the hypothesis and what call out what is being tested
- ...
src/test/scala/com/microsoft/hyperspace/index/rankers/RankerTestHelper.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
Changed as draft since further investigation is required for this issue; we cannot guarantee that not using hybrid scan is always better than full indexes, especially for join index. For example, it's might be more efficient to avoid full-shuffling of index data. So an index pair with bucketing and hybrid scan might be more efficient than a full index pair without bucketing (no hybrid scan). |
9317a45
to
0d6f521
Compare
6c81e48
to
36ccfda
Compare
* A mutable map for holding auxiliary information of this index log entry while applying rules. | ||
*/ | ||
@JsonIgnore | ||
private val tags: mutable.Map[IndexLogEntryTag[_], Any] = mutable.Map.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If the index contains the source update info, it means the index was validated | ||
// with the latest signature including appended files and deleted files, but | ||
// index data is not updated with those files. Therefore, we need to handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised comment for quick refresh PR
5fe1139
to
4d6ad6a
Compare
@imback82 Could you review this change when you have the time? Thanks! |
Sorry for the delay. I will try to get to this this week. Btw, does this change any of the index selections for our TPCH/TPCDS benchmarks? |
This change only affects Hybrid Scan case and currently all of the common bytes will be the same as we don't refresh any of indexes after modifying the dataset. I didn't check it but I guess any of index selection won't be changed. For explain time, 100k source files, 100 deleted file case (TPC-H queries): Oct binary
Dec binary (current master + rank pr)
For hybrid scan result,
|
This reverts commit 54f6171
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntryTags.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala
Outdated
Show resolved
Hide resolved
val actualOrder = JoinIndexRanker.rank(spark, dummy, dummy, indexPairs) | ||
assert(actualOrder.equals(expectedOrder)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we testing all the if
conditions in the ranker? How about number of default partitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition is removed.
src/test/scala/com/microsoft/hyperspace/index/rules/HyperspaceRuleTestSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (few minor comments), thanks @sezruby!
What changes were proposed in this pull request?
This PR introduces new rank algorithm for each FilterIndex & JoinIndex to support Hybrid Scan properly.
You can find more details about Hybrid scan in #150.
In order to support rank algorithm for Hybrid Scan efficiently, this PR introduces a new IndexLogEntryTag using #223
If hybrid scan is enabled, the outdated index might be chosen even if there are refresher ones.
Since Hybrid Scan can incur additional overhead such as on-the-fly shuffle for appended files or filtering for deleted files, it would be good to pick the indexes with less "diff" data.
For FilterIndexRule, the index with the largest common bytes will be chosen if Hybrid Scan is enabled.
For JoinIndexRule, the bucket number would be prior to common bytes as removing shuffles will outweigh the overhead from Hybrid Scan.
Does this PR introduce any user-facing change?
Yes.
This PR changes the rank order of candidate indexes when hybrid scan is enabled.
How was this patch tested?
Unit test