Fix rank algorithm for Hybrid Scan #164

sezruby · 2020-09-15T05:28:59Z

What changes were proposed in this pull request?

This PR introduces a new rank algorithm for FilterIndex and JoinIndex so that Hybrid Scan is supported properly.
You can find more details about Hybrid Scan in #150.

To support the rank algorithm for Hybrid Scan efficiently, this PR introduces a new IndexLogEntryTag using #223:

  // COMMON_BYTES stores overlapping bytes of index source files and given relation.
  // This is set in getCandidateIndexes and utilized in rank functions.
  val COMMON_BYTES: IndexLogEntryTag[Long] = IndexLogEntryTag[Long]("commonBytes")

If Hybrid Scan is enabled, an outdated index might be chosen even if fresher ones exist.
Since Hybrid Scan can incur additional overhead, such as an on-the-fly shuffle for appended files or filtering for deleted files, it is better to pick the index with less "diff" data.

For FilterIndexRule, the index with the largest common bytes will be chosen if Hybrid Scan is enabled.
For JoinIndexRule, the number of buckets takes priority over common bytes, as removing shuffles is expected to outweigh the overhead from Hybrid Scan.
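
The FilterIndexRule behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FilterCandidate` and `FilterRankSketch` are hypothetical names, `commonBytes` stands in for the value stored under the `COMMON_BYTES` tag, and falling back to the first candidate when Hybrid Scan is disabled is an assumption.

```scala
// Hypothetical stand-in for an index log entry; not the Hyperspace class.
// commonBytes mirrors the value that would be stored under the COMMON_BYTES tag.
final case class FilterCandidate(name: String, commonBytes: Long)

object FilterRankSketch {
  // With Hybrid Scan enabled, prefer the candidate that shares the most bytes
  // with the relation, i.e. the one with the least "diff" data to handle.
  // With Hybrid Scan disabled, this sketch just takes the first candidate
  // (an assumption made for illustration).
  def rank(
      candidates: Seq[FilterCandidate],
      hybridScanEnabled: Boolean): Option[FilterCandidate] = {
    if (candidates.isEmpty) None
    else if (hybridScanEnabled) Some(candidates.maxBy(_.commonBytes))
    else Some(candidates.head)
  }
}
```

Returning an `Option` matches the shape used in the test snippet quoted later in this thread (`FilterIndexRanker.rank(indexes, false).get`).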

Does this PR introduce any user-facing change?

Yes.
This PR changes the rank order of candidate indexes when Hybrid Scan is enabled.

How was this patch tested?

Unit test

rapoth · 2020-09-16T01:37:18Z

src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala

+    assert(FilterIndexRanker.rank(indexes, false).get.equals(ind1))
+  }
+
+  test("testRank: return the index with the largest source file list if HybridScan is enabled") {


nit: Why testRank instead of FilterIndexRankerTest? Sorry for the naive question - I'm trying to understand how we should name these tests.

I just reused the previous one from JoinRankerTest. Yes, we need some guidelines for naming 🤔

@imback82 Any recommendations?

Yea, testRank seems weird. I would do something like test("rank() should return the index with the largest number of source files if HybridScan is enabled"). (Note that FilterIndexRanker is omitted since it's already under FilterIndexRankerTest.)

Oh... good you wrote a great name 🙂

I was also hoping: can we agree on a naming convention for tests now that the code base complexity is increasing? For instance:

Tests do not need to contain the test method name

The name of the test should clearly state the hypothesis and call out what is being tested

...

src/test/scala/com/microsoft/hyperspace/index/rankers/RankerTestHelper.scala

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala

src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRanker.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala

sezruby · 2020-10-06T08:35:30Z

Changed to draft since this issue needs further investigation; we cannot guarantee that a full index (no Hybrid Scan) is always better than an index used with Hybrid Scan, especially for join indexes.

For example, it might be more efficient to avoid a full shuffle of the index data. So an index pair with bucketing and Hybrid Scan might be more efficient than a full index pair without bucketing (no Hybrid Scan).
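
One way to express that trade-off is a sort key in which matching bucket counts come first and common bytes only break ties. The sketch below is an illustration under assumptions, not the ranker in this PR: `JoinCandidate` and `JoinRankSketch` are hypothetical names, and summing the common bytes of both sides of a pair is an assumption.

```scala
// Hypothetical stand-in for an index log entry; not the Hyperspace class.
final case class JoinCandidate(name: String, numBuckets: Int, commonBytes: Long)

object JoinRankSketch {
  type IndexPair = (JoinCandidate, JoinCandidate)

  // Sort key: pairs whose bucket counts match (no shuffle needed) come first;
  // among those, pairs with more common bytes come first.
  // Negating the byte count turns the ascending sort into a descending one.
  private def key(pair: IndexPair): (Int, Long) = {
    val (left, right) = pair
    val bucketMismatch = if (left.numBuckets == right.numBuckets) 0 else 1
    // Assumption: the common bytes of a pair is the sum over both sides.
    (bucketMismatch, -(left.commonBytes + right.commonBytes))
  }

  def rank(pairs: Seq[IndexPair]): Seq[IndexPair] = pairs.sortBy(p => key(p))
}
```

With this key, a bucketed pair that needs Hybrid Scan still ranks ahead of a fully up-to-date pair whose bucket counts differ, which is the case described in the comment above.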

sezruby · 2020-10-20T10:46:48Z

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

+   * A mutable map for holding auxiliary information of this index log entry while applying rules.
+   */
+  @JsonIgnore
+  private val tags: mutable.Map[IndexLogEntryTag[_], Any] = mutable.Map.empty


tag implementation from Spark 3.0 - https://github.com/apache/spark/blob/eb9966b70055a67dd02451c78ec205d913a38a42/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L93
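
For readers unfamiliar with that pattern, a typed tag map can be sketched as below. This is a simplified illustration only: `EntryTag`, `TaggedEntry`, and the method names are hypothetical and are not claimed to match the Hyperspace or Spark signatures.

```scala
import scala.collection.mutable

// Hypothetical tag type for illustration; T is the type of the stored value.
final case class EntryTag[T](name: String)

// Hypothetical holder showing how a typed tag map can be used.
class TaggedEntry {
  // Values are stored as Any; the tag's type parameter recovers the type on read.
  private val tags: mutable.Map[EntryTag[_], Any] = mutable.Map.empty

  def setTag[T](tag: EntryTag[T], value: T): Unit = {
    tags.update(tag, value)
  }

  // The cast is safe as long as values are only written through setTag.
  def getTag[T](tag: EntryTag[T]): Option[T] = {
    tags.get(tag).map(_.asInstanceOf[T])
  }

  def unsetTag[T](tag: EntryTag[T]): Unit = {
    tags.remove(tag)
    ()
  }
}
```

A rule could then write a value once while collecting candidates and read it back during ranking, for example `entry.setTag(EntryTag[Long]("commonBytes"), 1024L)` followed by `entry.getTag(...)`.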

sezruby · 2020-12-04T06:22:05Z

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

+            // If the index contains the source update info, it means the index was validated
+            // with the latest signature including appended files and deleted files, but
+            // index data is not updated with those files. Therefore, we need to handle


Revised comment for quick refresh PR

sezruby · 2020-12-10T02:35:59Z

@imback82 Could you review this change when you have the time? Thanks!

imback82 · 2020-12-10T04:45:08Z

Sorry for the delay. I will try to get to this this week. Btw, does this change any of the index selections for our TPCH/TPCDS benchmarks?

sezruby · 2020-12-11T07:18:08Z

> Sorry for the delay. I will try to get to this this week. Btw, does this change any of the index selections for our TPCH/TPCDS benchmarks?

This change only affects the Hybrid Scan case, and currently the common bytes will be the same for all candidates because we don't refresh any indexes after modifying the dataset. I haven't verified it, but I expect none of the index selections to change.

Explain time for the case with 100k source files and 100 deleted files (TPC-H queries):

Oct binary

hybrid scan - sum of max explain time (of 3 runs): 310625
full index - sum of max explain time (of 3 runs): 235606

Dec binary (current master + rank PR)

hybrid scan - sum of max explain time (of 3 runs): 233333
full index - sum of max explain time (of 3 runs): 333228 (=> I think this is due to the signature calculation fix (sorting))

For the Hybrid Scan result, the improvement comes from other optimizations (checking the column schema first, etc.), and it shows that this PR doesn't add much overhead.
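
To put the numbers above in relative terms, the change from the Oct binary to the Dec binary can be computed as below. The figures are copied from this comment; the helper name is made up for illustration, and the comment does not state the unit of the timings.

```scala
object ExplainTimeDelta {
  // Relative change from `before` to `after`; a negative value means faster.
  def relativeChange(before: Long, after: Long): Double =
    (after - before).toDouble / before

  // Oct binary -> Dec binary, Hybrid Scan: roughly a 25% decrease.
  val hybridScan: Double = relativeChange(310625L, 233333L)

  // Oct binary -> Dec binary, full index: roughly a 41% increase.
  val fullIndex: Double = relativeChange(235606L, 333228L)
}
```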

This reverts commit 54f6171

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntryTags.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala

src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala

imback82 · 2021-01-04T21:58:56Z

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala

+      val actualOrder = JoinIndexRanker.rank(spark, dummy, dummy, indexPairs)
+      assert(actualOrder.equals(expectedOrder))
+    }
+  }


Are we testing all the if conditions in the ranker? How about the number of default partitions?

The condition is removed.

src/test/scala/com/microsoft/hyperspace/index/rules/HyperspaceRuleTestSuite.scala

imback82

LGTM (a few minor comments), thanks @sezruby!

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntryTags.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala

sezruby mentioned this pull request Sep 15, 2020

Hybrid Scan for File/Partition Mutable Datasets #150

Closed

7 tasks

rapoth requested a review from apoorvedave1 September 16, 2020 01:20

rapoth reviewed Sep 16, 2020

src/test/scala/com/microsoft/hyperspace/index/rankers/RankerTestHelper.scala (outdated, resolved)

rapoth reviewed Sep 16, 2020

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala (outdated, resolved)

rapoth reviewed Sep 16, 2020

src/test/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRankerTest.scala (resolved)

rapoth reviewed Sep 16, 2020

src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala (outdated, resolved)

rapoth reviewed Sep 16, 2020

src/test/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRankerTest.scala (outdated, resolved)

imback82 reviewed Sep 17, 2020

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala (outdated, resolved)

src/main/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRanker.scala (outdated, resolved)

rapoth added this to the 0.4.0 milestone Sep 17, 2020

rapoth assigned sezruby Sep 18, 2020

rapoth added the enhancement and intermediate issue labels Sep 18, 2020

pirz reviewed Sep 25, 2020

src/main/scala/com/microsoft/hyperspace/index/rankers/FilterIndexRanker.scala (outdated, resolved)

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala (outdated, resolved)

sezruby mentioned this pull request Sep 30, 2020

Update index log entry for enforce delete during read time #170

Closed

sezruby marked this pull request as draft September 30, 2020 03:09

rapoth modified the milestones: 0.4.0, 0.5.0 Oct 8, 2020

imback82 modified the milestones: October 2020, November 2020 Oct 13, 2020

sezruby force-pushed the hybridscan_rank branch from 9317a45 to 0d6f521 Compare October 20, 2020 10:23

sezruby marked this pull request as ready for review October 20, 2020 10:31

sezruby force-pushed the hybridscan_rank branch 3 times, most recently from 6c81e48 to 36ccfda Compare October 20, 2020 10:46

sezruby commented Oct 20, 2020

sezruby requested review from pirz, rapoth and imback82 October 20, 2020 12:18

Merge remote-tracking branch 'upstream/master' into hybridscan_rank

ecab790

imback82 mentioned this pull request Dec 3, 2020

Support refresh quick mode by using Hybrid Scan #238

Merged

sezruby added 2 commits December 4, 2020 14:26

Merge remote-tracking branch 'upstream/master' into hybridscan_rank

b0003ab

Minor fix

e456ffe

sezruby commented Dec 4, 2020

sezruby added 2 commits December 9, 2020 11:45

Merge remote-tracking branch 'upstream/master' into hybridscan_rank

d375d97

Merge remote-tracking branch 'upstream/master' into hybridscan_rank

4d6ad6a

sezruby force-pushed the hybridscan_rank branch from 5fe1139 to 4d6ad6a Compare December 9, 2020 02:51

sezruby added 2 commits December 11, 2020 16:33

Check tagged values

54f6171

Revert "Check tagged values"

3ff54f0

This reverts commit 54f6171

imback82 reviewed Jan 4, 2021

sezruby added 2 commits January 5, 2021 14:48

Review commit

12979f5

Merge remote-tracking branch 'upstream/master' into hybridscan_rank

4c2eb91

sezruby mentioned this pull request Jan 5, 2021

Add a condition using the default shuffle partition number in JoinIndexRanker #308

Open

minor fix

ac4de88

imback82 reviewed Jan 5, 2021

src/main/scala/com/microsoft/hyperspace/index/rankers/JoinIndexRanker.scala (outdated, resolved)

sezruby added 2 commits January 6, 2021 09:53

Review commit

12f4f9e

Fix test

dc96d03

imback82 approved these changes Jan 6, 2021

imback82 merged commit 3472bd4 into microsoft:master Jan 6, 2021

sezruby deleted the hybridscan_rank branch January 6, 2021 07:41

sezruby mentioned this pull request Jan 6, 2021

Fix for setting COMMON_SOURCE_SIZE_IN_BYTES tag #310

Merged

sezruby mentioned this pull request Jan 25, 2021

Enable Hybrid Scan by default #333

Open

imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
