
Add support for delete to index refresh #142

Merged
merged 31 commits into from
Sep 18, 2020

Conversation

pirz
Contributor

@pirz pirz commented Sep 2, 2020

What is the context for this pull request?

This PR handles updating an index during refresh with respect to deleted source data files. Updating an index for newly appended source data files will be done via a separate PR, #163.

What changes were proposed in this pull request?

This change adds the capability to refresh an index by removing index entries that came from deleted source data files.

Note that this refresh action only fixes an index with respect to deleted source data files and does not consider new source data files (if any). If any original source data files were removed after the previous index version was built, this refresh action updates the index as follows (see the sketch below):

  1. Deleted source data files are identified.
  2. The lineage recorded in index records is used to remove every index entry that came from those deleted files.
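As a rough illustration, the cleanup amounts to an anti-filter on the lineage column. The sketch below assumes an active SparkSession, that originalFiles and currentFiles (Seq[String]) and indexDF (the current index as a DataFrame) are already in scope, and a hypothetical lineage column name "_data_file_name" (the real name is resolved via IndexConstants.DATA_FILE_NAME_COLUMN); it is not the merged implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Step 1: identify deleted source data files by diffing the file list
// captured in the previous index version against the current listing.
val deletedFiles: Seq[String] = originalFiles.diff(currentFiles)

// Step 2: use the lineage column to drop index entries that came from
// the deleted files; the result becomes the next index version.
val refreshedIndex: DataFrame =
  indexDF.filter(!col("_data_file_name").isin(deletedFiles: _*))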

Currently, this feature is guarded by a Spark configuration flag, spark.hyperspace.index.refresh.delete.enabled, and is disabled by default.
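For illustration, enabling the flag looks like the following (a minimal sketch assuming an active SparkSession named spark; the config key is taken from this PR):

// Enable delete-aware incremental index refresh (disabled by default).
spark.conf.set("spark.hyperspace.index.refresh.delete.enabled", "true")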

Why are the changes needed?

Currently, when a user removes data files, a full index rebuild is the only way to refresh an affected index and purge the deleted records from it. This change makes incremental index refresh possible for such cases by fixing the index files without any scan of the source data.

Does this PR introduce any user-facing change?

Yes, it changes the behavior of index refresh: it enables incremental refresh that removes deleted records from the index.

Old experience:

  1. User creates an index on some data e.g., /path/to/dataset/
  2. User issues a query and Hyperspace is able to use the index
  3. User deletes some files from the original data /path/to/dataset/
  4. User issues a query but Hyperspace detects data change and decides to disable index usage
  5. User invokes refresh to update the index
  6. Hyperspace does a full index rebuild

New experience:
Steps 1 - 4 remain the same.

  • If the user leaves spark.hyperspace.index.refresh.delete.enabled disabled, then the Hyperspace experience remains the same as steps 5 and 6 above.
  • If the user enables spark.hyperspace.index.refresh.delete.enabled, then:
    1. Hyperspace detects the portions of the index that need a rewrite and updates them
    2. The user can now issue queries and Hyperspace will use the index (see the sketch below)
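For illustration, the end-to-end flow with the flag enabled might look like this (a sketch using the public Hyperspace Scala API; the index name, columns, and path are made up for the example, and spark is an active SparkSession):

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

val hs = new Hyperspace(spark)

// 1. Create an index over the dataset (done once, before any deletes).
val df = spark.read.parquet("/path/to/dataset/")
hs.createIndex(df, IndexConfig("myIndex", indexedColumns = Seq("c3"), includedColumns = Seq("c1")))

// ... the user deletes some files under /path/to/dataset/ ...

// 2. With the flag enabled, refresh rewrites only the affected portions
//    of the index instead of performing a full rebuild.
spark.conf.set("spark.hyperspace.index.refresh.delete.enabled", "true")
hs.refreshIndex("myIndex")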

How was this patch tested?

New test cases were added in the new test suite RefreshIndexDeleteTests.scala.

@pirz pirz self-assigned this Sep 3, 2020
@pirz pirz added the enhancement New feature or request label Sep 3, 2020
Collaborator

@sezruby sezruby left a comment


The approach and WIP code generally LGTM. Thanks!

Contributor

@imback82 imback82 left a comment


Is this WIP?

@pirz pirz changed the title from [WIP] Add support for delete to index refresh to Add support for delete to index refresh Sep 10, 2020
Comment on lines 60 to 66
var currentFiles = Seq[String]()
rels.head.rootPaths.foreach { p =>
currentFiles ++= Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}
Contributor

@apoorvedave1 apoorvedave1 Sep 11, 2020


Suggested change
var currentFiles = Seq[String]()
rels.head.rootPaths.foreach { p =>
currentFiles ++= Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}
val currentFiles = rels.head.rootPaths.flatMap { p =>
Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}

Contributor


nit: it might be better to explore the IndexLogEntry.listLeafFiles() API for file listing here. We could move that function to the PathUtils class for more generic use.

We can do that as a separate PR to keep this one simple.
(1. move listLeafFiles from IndexLogEntry to PathUtils. 2. Use listLeafFiles here instead of Content)
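For reference, a generic leaf-file listing in PathUtils might look roughly like this (a hypothetical sketch using the Hadoop FileSystem API; the actual refactoring was deferred to a separate PR as noted above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object PathUtils {
  // Recursively list all leaf (non-directory) files under the given path.
  def listLeafFiles(path: Path, conf: Configuration = new Configuration()): Seq[FileStatus] = {
    val fs = path.getFileSystem(conf)
    val (dirs, files) = fs.listStatus(path).toSeq.partition(_.isDirectory)
    files ++ dirs.flatMap(d => listLeafFiles(d.getPath, conf))
  }
}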

Contributor Author


Thanks, I went with your suggestion, and I agree on making IndexLogEntry.listLeafFiles() available to other classes, as it is more of a utility function that could come up in different scenarios (like the one above).

Contributor

@apoorvedave1 apoorvedave1 left a comment


Left some minor comments. Thanks @pirz!

@rapoth
Contributor

rapoth commented Sep 11, 2020

@pirz I went from this issue to #133 -> #104. Can you also add a link to the uber issue that explains the entire e2e strategy of finishing this work? Thank you!

@rapoth
Contributor

rapoth commented Sep 11, 2020

@pirz For the following:

Does this PR introduce any user-facing change?
Yes, it changes the behavior of index refresh and helps with incremental index refresh to remove deleted index records.

Can you please add it as follows:

Old experience:

  1. User creates an index on some data e.g., /path/to/dataset/
  2. User issues a query and Hyperspace is able to use the index
  3. User deletes some files from the original data /path/to/dataset/
  4. User issues a query but Hyperspace detects data change and decides to disable index usage
  5. User invokes refresh to update the index
  6. Hyperspace does a full index rebuild

New experience:
Steps 1 - 4 remain the same. If the user leaves spark.hyperspace.index.refresh.delete.enabled disabled, then the Hyperspace experience remains the same as steps 5 and 6 above.

If user enables spark.hyperspace.index.refresh.delete.enabled, then:

  1. Hyperspace detects the portions of the index that need a rewrite and updates them
  2. User can now issue queries and Hyperspace will use the index

val indexDF = spark.read.parquet(previousIndexLogEntry.content.files.map(_.toString): _*)

ResolverUtils
.resolve(spark, IndexConstants.DATA_FILE_NAME_COLUMN, indexDF.schema.fieldNames) match {
Collaborator

@sezruby sezruby Sep 16, 2020


It would be good to move this to IndexLogEntry as previousIndexLogEntry.hasLineageColumn, as I also need this utility function :) And I think it's better to check this first using just previousIndexLogEntry.schema, and skip if false, before the spark.read.parquet(... above.

Contributor Author


Sure, I added def hasLineageColumn(spark: SparkSession): Boolean = {...} to IndexLogEntry.
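For context, such a check could be as simple as resolving the lineage column name against the index schema. The sketch below is inferred from the snippet above, assuming ResolverUtils.resolve returns an Option (as the match in the snippet suggests) and that schema is the index schema held by IndexLogEntry; it is not necessarily the exact merged code:

import org.apache.spark.sql.SparkSession

// True if the index carries the lineage column, resolved against the
// session's name-resolution (case sensitivity) settings.
def hasLineageColumn(spark: SparkSession): Boolean =
  ResolverUtils
    .resolve(spark, IndexConstants.DATA_FILE_NAME_COLUMN, schema.fieldNames)
    .isDefined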

*/
private def getDeletedFiles: Seq[String] = {
val rels = previousIndexLogEntry.relations
val originalFiles = rels.head.data.properties.content.files.map(_.toString)
Collaborator


Could you check the file metadata (size / modification time) as well?

Contributor Author

@pirz pirz Sep 17, 2020


@sezruby Here we are looking for "deleted" files, and the file name suffices for that check. If a file is renamed, we still mark it as deleted; however, I am not sure whether modifying the content of an existing file is a valid scenario. Can you explain a bit what exactly you are suggesting? Thanks.

Contributor


Good point. I think the user can "overwrite" the files.

I think if we detect a metadata mismatch, we shouldn't perform the action but instead suggest performing a full refresh.

Contributor Author


@imback82 Does that mean that for existing files (those present in both originalFiles and currentFiles) we should do a full metadata comparison here, and if there is a mismatch, abort the ongoing refresh action?

Contributor


+1 It might be better to just do a full metadata comparison and abort (with a suggestion). It is becoming increasingly clear that we should be on the safe side. :)

Collaborator

@sezruby sezruby Sep 17, 2020


I think we can handle a modified file as both a deleted and an appended file: abort DeleteRefresh in this PR for now, but later, with the append implementation, we could properly refresh the file as a delete plus an append.

If the metadata isn't checked here, signatureValid would otherwise fail for a modified file; after the refresh it won't fail, and then the query result might be different.

Contributor Author


A FileInfo check was added, along with a new test case under RefreshIndexTests.scala to validate it.
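For illustration, the metadata comparison discussed in this thread could look roughly like the following (a hypothetical sketch with a made-up FileInfo shape; the actual FileInfo class and check in this PR may differ):

// Hypothetical file metadata record; the real FileInfo may differ.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

// Files present in both listings whose size or modification time changed
// were overwritten in place; the delete-only refresh should then abort
// and suggest a full refresh instead.
def findModifiedFiles(original: Seq[FileInfo], current: Seq[FileInfo]): Seq[String] = {
  val currentByName = current.map(f => f.name -> f).toMap
  original.collect {
    case o if currentByName.get(o.name).exists(c =>
        c.size != o.size || c.modifiedTime != o.modifiedTime) =>
      o.name
  }
}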

Contributor


@apoorvedave1 as FYI

@rapoth rapoth added this to the 0.4.0 milestone Sep 17, 2020
@rapoth rapoth linked an issue Sep 17, 2020 that may be closed by this pull request
sezruby
sezruby previously approved these changes Sep 18, 2020
Collaborator

@sezruby sezruby left a comment


LGTM, thanks @pirz!

@rapoth
Contributor

rapoth commented Sep 18, 2020

LGTM, thanks @pirz!

apoorvedave1
apoorvedave1 previously approved these changes Sep 18, 2020
Contributor

@apoorvedave1 apoorvedave1 left a comment


LGTM, thanks @pirz!

def query(): DataFrame =
  spark.read.parquet(location).filter("c3 == 'facebook'").select("c3", "c1")

// Verify index usage on latest version of index (v=1) after refresh.
Contributor


If you are verifying this way, I would do the following:

  • verifyIndexUsage with version 0.
  • Delete the file -> verify index is not utilized
  • verifyIndexUsage with version 1.

Contributor Author

@pirz pirz Sep 18, 2020


I modified the test case accordingly.

Contributor


Well, if there is no verification before applying the refresh, we are not really validating anything (it's possible that the test setup was wrong, the refresh didn't work, etc.). To me, it's crucial to test this end-to-end scenario in the E2E tests.

@pirz pirz dismissed stale reviews from apoorvedave1 and sezruby via 7137b89 September 18, 2020 20:29
Contributor

@imback82 imback82 left a comment


LGTM, thanks @pirz!

@imback82 imback82 merged commit 1d3ac49 into microsoft:master Sep 18, 2020
@imback82
Contributor

Merged to master. Nice work!

@rapoth
Contributor

rapoth commented Sep 18, 2020

Looks awesome, thank you @pirz! 👍

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Incremental Index Maintenance for File/Partition Mutable Datasets
5 participants