-
Notifications
You must be signed in to change notification settings - Fork 115
Add new IndexLogEntryTags to cache InMemoryFileIndex #324
Conversation
9d4cc68
to
82f4706
Compare
82f4706
to
64dd13c
Compare
val newLocation = new InMemoryFileIndex(spark, filesAppended, options, None) | ||
val newLocation = index.getTagValueOrUpdate(originalPlan, | ||
IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN_APPENDED, | ||
new InMemoryFileIndex(spark, filesAppended, options, None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is it always guaranteed that the cached file index will always have the same files (filesAppended in this case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea because hybrid scan tags are tagged with the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Can you add a test? It will be the opposite of what you already added; check the cached file index is not used if plan changes.
Can you attach the screenshots of jobs from Spark UI if available (before and after)? It will be very useful to check the improvements. |
Test scripts:
Result:
|
Great! Can you update the description with this info (you can just copy/paste)? Thanks! |
@sezruby Can you fix the conflicts please? |
51cc506
to
ae469e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (few minor comments), thanks @sezruby!
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTest.scala
Outdated
Show resolved
Hide resolved
def query(df: DataFrame): DataFrame = { | ||
df.filter("c3 == 'facebook'").select("c3", "c4") | ||
} | ||
def getQueryPlanKey(df: DataFrame): LogicalPlan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean "Key" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed getOriginalQueryPlan; used "Key" because the plan before transformation is one of hash keys for IndexLogEntryTags
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTest.scala
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTest.scala
Outdated
Show resolved
Hide resolved
def fileIndex: InMemoryFileIndex = | ||
new InMemoryFileIndex(spark, filesToRead, Map(), None) | ||
val newLocation = if (filesToRead.length == index.content.files.size) { | ||
index.withCachedTag(IndexLogEntryTags.INMEMORYFILEINDEX_INDEX_ONLY)(fileIndex) | ||
} else { | ||
index.withCachedTag(plan, IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN)(fileIndex) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the following to make the intention clear, but the current approach is also fine:
val tagForFileIndex = if (filesToRead.length == index.content.files.size) {
IndexLogEntryTags.INMEMORYFILEINDEX_INDEX_ONLY
} else {
IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN
}
val newLocation = index.withCachedTag(plan, tagForFileIndex) {
new InMemoryFileIndex(spark, filesToRead, Map(), None)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's because INDEX_ONLY doesn't require plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed that. thanks
} | ||
|
||
private def equalsRef(a: Set[FileIndex], b: Set[FileIndex]): Boolean = { | ||
a.zip(b).forall(f => f._1 eq f._2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need a.size == b.size &&
?
c3fa396
to
a45d55b
Compare
What is the context for this pull request?
Fixes Cache InMemoryFileIndex for Index #317
What changes were proposed in this pull request?
This PR introduces 3 new IndexLogEntryTags for caching InMemoryFileIndex.
Currently, InMemoryFileIndex is created for every query execution. However, it incurs a new spark job & sometimes takes longer in cluster mode. To avoid the overhead, we could cache InMemoryFileIndex object for each index entry.
Test result with cache
Test scripts:
Result:
Spark UI:
Does this PR introduce any user-facing change?
Yes, if the cached InMemoryFileIndex object is used, we could avoid unnecessary listing files jobs for every query execution.
How was this patch tested?
Unit test