This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Add new IndexLogEntryTags to cache InMemoryFileIndex #324

Merged (15 commits) on Jan 29, 2021

Conversation

@sezruby (Collaborator) commented Jan 18, 2021

What changes were proposed in this pull request?

This PR introduces 3 new IndexLogEntryTags for caching InMemoryFileIndex.
Currently, a new InMemoryFileIndex is created for every query execution. Building it triggers an extra Spark job to list files, which can take noticeably long in cluster mode. To avoid that overhead, we can cache the InMemoryFileIndex object for each index entry.

  // INMEMORYFILEINDEX_INDEX_ONLY stores InMemoryFileIndex for index only scan.
  val INMEMORYFILEINDEX_INDEX_ONLY: IndexLogEntryTag[InMemoryFileIndex] =
    IndexLogEntryTag[InMemoryFileIndex]("inMemoryFileIndexIndexOnly")

  // INMEMORYFILEINDEX_HYBRID_SCAN stores InMemoryFileIndex including index data files and also
  // appended files for Hybrid Scan.
  val INMEMORYFILEINDEX_HYBRID_SCAN: IndexLogEntryTag[InMemoryFileIndex] =
    IndexLogEntryTag[InMemoryFileIndex]("inMemoryFileIndexHybridScan")

  // INMEMORYFILEINDEX_HYBRID_SCAN_APPENDED stores InMemoryFileIndex including only appended files
  // for Hybrid Scan.
  val INMEMORYFILEINDEX_HYBRID_SCAN_APPENDED: IndexLogEntryTag[InMemoryFileIndex] =
    IndexLogEntryTag[InMemoryFileIndex]("inMemoryFileIndexHybridScanAppended")
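Conceptually, each tag acts as a per-index-entry cache slot, optionally keyed by the query plan. As a rough illustration, here is a hypothetical simplified model (not the actual Hyperspace implementation; only the method name getTagValueOrUpdate is taken from the snippets in this PR):

    import scala.collection.mutable

    // Hypothetical simplified model of per-entry tag caching, keyed by
    // (plan key, tag name). The real IndexLogEntry may differ.
    class TagCache[K] {
      private val tags = mutable.Map.empty[(K, String), Any]

      // Return the cached value for (key, tagName) if present; otherwise
      // evaluate `value` once, cache it, and return it.
      def getTagValueOrUpdate[T](key: K, tagName: String, value: => T): T =
        tags.getOrElseUpdate((key, tagName), value).asInstanceOf[T]

      // Clearing all tags (e.g. on refreshIndex) invalidates the cache.
      def unsetAll(): Unit = tags.clear()
    }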

Test result with cache

Test scripts:

hs.refreshIndex(indexName) // use refreshIndex to clear the tags
val linetable = spark.read.parquet(tableName)
val filter = linetable.filter(linetable("l_partkey") isin (1234, 12341234, 123456))
  .select("l_suppkey", "l_quantity", "l_shipdate", "l_extendedprice", "l_discount", "l_orderkey")
measure(filter.count)
val filter = linetable.filter(linetable("l_partkey") isin (1234, 12341234, 123456))
  .select("l_suppkey", "l_quantity", "l_shipdate", "l_extendedprice", "l_discount", "l_orderkey")
measure(filter.count) // second query, with cache

hs.refreshIndex(indexName)
val filter = linetable.filter(linetable("l_partkey") isin (1234, 12341234, 123456))
  .select("l_suppkey", "l_quantity", "l_shipdate", "l_extendedprice", "l_discount", "l_orderkey")
measure(filter.count)
hs.refreshIndex(indexName) // clear cache
val filter = linetable.filter(linetable("l_partkey") isin (1234, 12341234, 123456))
  .select("l_suppkey", "l_quantity", "l_shipdate", "l_extendedprice", "l_discount", "l_orderkey")
measure(filter.count) // second query, without cache
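The measure helper is not shown in the PR; a minimal timing wrapper along these lines (an assumption, not from the PR) would produce the duration output below:

    // Hypothetical timing helper: runs the action once and prints the
    // elapsed wall-clock time in milliseconds.
    def measure[T](action: => T): T = {
      val start = System.nanoTime()
      val result = action
      println(s"duration: ${(System.nanoTime() - start) / 1000000}")
      result
    }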

Result (durations as printed by measure; with the cache the second query drops from 2212 to 868, while without it the second query stays around 2121):

linetable: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint ... 14 more fields]
filter: org.apache.spark.sql.DataFrame = [l_suppkey: bigint, l_quantity: double ... 4 more fields]
duration: 2212
filter: org.apache.spark.sql.DataFrame = [l_suppkey: bigint, l_quantity: double ... 4 more fields]
duration: 868
filter: org.apache.spark.sql.DataFrame = [l_suppkey: bigint, l_quantity: double ... 4 more fields]
duration: 2235
filter: org.apache.spark.sql.DataFrame = [l_suppkey: bigint, l_quantity: double ... 4 more fields]
duration: 2121

Spark UI:
[screenshot of Spark jobs]

Does this PR introduce any user-facing change?

Yes. When the cached InMemoryFileIndex object is used, we avoid an unnecessary file-listing job on every query execution.

How was this patch tested?

Unit test

    // Before:
    val newLocation = new InMemoryFileIndex(spark, filesAppended, options, None)
    // After:
    val newLocation = index.getTagValueOrUpdate(originalPlan,
      IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN_APPENDED,
      new InMemoryFileIndex(spark, filesAppended, options, None))
Contributor:
Question: is it guaranteed that the cached file index will always have the same files (filesAppended in this case)?

Collaborator Author:

Yes, because the Hybrid Scan tags are keyed by the plan.

Contributor:
Cool. Can you add a test? It will be the opposite of what you already added: check that the cached file index is not used if the plan changes.
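In terms of the simplified TagCache model sketched in the description, the requested test boils down to checking that a value cached under one plan key is not returned for a different plan key (a hypothetical sketch, not the actual test code):

    val cache = new TagCache[String]
    val forPlan1 = cache.getTagValueOrUpdate("plan1", "hybridScan", Seq("file1"))
    val forPlan2 = cache.getTagValueOrUpdate("plan2", "hybridScan", Seq("file2"))
    // A different plan key must miss the cache and produce a new value.
    assert(!(forPlan1 eq forPlan2))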

@imback82 (Contributor), quoting the description:
Currently, InMemoryFileIndex is created for every query execution. However, it incurs a new spark job & sometimes takes longer in cluster mode.

Can you attach the screenshots of jobs from Spark UI if available (before and after)? It will be very useful to check the improvements.

@sezruby (Collaborator, Author) commented Jan 20, 2021:

(Posted the test scripts, results, and Spark UI screenshot that now appear in the PR description above.)

@imback82 (Contributor):

Great! Can you update the description with this info (you can just copy/paste)? Thanks!

@imback82 (Contributor):

@sezruby Can you fix the conflicts please?

@imback82 (Contributor) left a review:
LGTM (few minor comments), thanks @sezruby!

  def query(df: DataFrame): DataFrame = {
    df.filter("c3 == 'facebook'").select("c3", "c4")
  }

  def getQueryPlanKey(df: DataFrame): LogicalPlan = {
Contributor:
What do you mean "Key" here?

Collaborator Author:

Renamed it to getOriginalQueryPlan. I used "Key" because the plan before transformation is one of the hash keys for IndexLogEntryTags.

Comment on lines +387 to +393

    def fileIndex: InMemoryFileIndex =
      new InMemoryFileIndex(spark, filesToRead, Map(), None)
    val newLocation = if (filesToRead.length == index.content.files.size) {
      index.withCachedTag(IndexLogEntryTags.INMEMORYFILEINDEX_INDEX_ONLY)(fileIndex)
    } else {
      index.withCachedTag(plan, IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN)(fileIndex)
    }
Contributor:
I was thinking the following to make the intention clear, but the current approach is also fine:

        val tagForFileIndex = if (filesToRead.length == index.content.files.size) {
          IndexLogEntryTags.INMEMORYFILEINDEX_INDEX_ONLY
        } else {
          IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN
        }
        val newLocation = index.withCachedTag(plan, tagForFileIndex) {
          new InMemoryFileIndex(spark, filesToRead, Map(), None)
        }

Collaborator Author:

It's because INDEX_ONLY doesn't take a plan: the index-only file set is the same for every query, so that tag is cached per entry, while the Hybrid Scan file set depends on the plan and must be keyed by it.

Contributor:

missed that. thanks

  }

  private def equalsRef(a: Set[FileIndex], b: Set[FileIndex]): Boolean = {
    a.zip(b).forall(f => f._1 eq f._2)
Contributor:
Don't you need a.size == b.size &&?
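With that size check folded in, the method might read as follows (a sketch of the suggested fix, not necessarily the final committed code):

    import org.apache.spark.sql.execution.datasources.FileIndex

    // Without the size check, zip truncates to the shorter collection,
    // so sets of different sizes could compare equal vacuously.
    private def equalsRef(a: Set[FileIndex], b: Set[FileIndex]): Boolean = {
      a.size == b.size && a.zip(b).forall(f => f._1 eq f._2)
    }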

@imback82 imback82 merged commit cd9a632 into microsoft:master Jan 29, 2021
@imback82 imback82 added the enhancement New feature or request label Jan 29, 2021
@imback82 imback82 added this to the January 2021 milestone Jan 29, 2021
Linked issue: Cache InMemoryFileIndex for Index