
Add support for delete to index refresh #142

Merged
merged 31 commits into from
Sep 18, 2020

Conversation

pirz
Contributor

@pirz pirz commented Sep 2, 2020

What is the context for this pull request?

This PR handles updating an index during refresh with respect to deleted source data files. Updating an index for newly appended source data files will be done via a separate PR, #163.

What changes were proposed in this pull request?

This change adds the capability to refresh an index by removing index entries that came from deleted source data files.

Note that this refresh action only fixes an index with respect to deleted source data files and does not consider new source data files (if any). If any original source data files were removed after the previous index version was built, this refresh action updates the index as follows (see the sketch below):

  1. Deleted source data files are identified.
  2. The lineage recorded in index records is used to remove every index entry that came from those deleted files.
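As a rough illustration, the cleanup amounts to an anti-filter on the lineage column. The sketch below assumes an active SparkSession, that originalFiles and currentFiles (Seq[String]) and indexDF (the current index as a DataFrame) are already in scope, and a hypothetical lineage column name "_data_file_name" (the real name is resolved via IndexConstants.DATA_FILE_NAME_COLUMN); it is not the merged implementation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Step 1: identify deleted source data files by diffing the file list
// captured in the previous index version against the current listing.
val deletedFiles: Seq[String] = originalFiles.diff(currentFiles)

// Step 2: use the lineage column to drop index entries that came from
// the deleted files; the result becomes the next index version.
val refreshedIndex: DataFrame =
  indexDF.filter(!col("_data_file_name").isin(deletedFiles: _*))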

Currently, this feature is guarded by a Spark configuration flag, spark.hyperspace.index.refresh.delete.enabled, and is disabled by default.
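For illustration, enabling the flag looks like the following (a minimal sketch assuming an active SparkSession named spark; the config key is taken from this PR):

// Enable delete-aware incremental index refresh (disabled by default).
spark.conf.set("spark.hyperspace.index.refresh.delete.enabled", "true")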

Why are the changes needed?

Currently, when a user removes data files, a full index rebuild is the only way to refresh an affected index and purge the deleted records from it. This change makes incremental index refresh possible for such cases by fixing the index files without any scan of the source data.

Does this PR introduce any user-facing change?

Yes, it changes the behavior of index refresh: it enables incremental refresh that removes deleted records from the index.

Old experience:

  1. User creates an index on some data e.g., /path/to/dataset/
  2. User issues a query and Hyperspace is able to use the index
  3. User deletes some files from the original data /path/to/dataset/
  4. User issues a query but Hyperspace detects data change and decides to disable index usage
  5. User invokes refresh to update the index
  6. Hyperspace does a full index rebuild

New experience:
Steps 1 - 4 remain the same.

  • If the user leaves spark.hyperspace.index.refresh.delete.enabled disabled, then the Hyperspace experience remains the same as steps 5 and 6 above.
  • If the user enables spark.hyperspace.index.refresh.delete.enabled, then:
    1. Hyperspace detects the portions of the index that need a rewrite and updates them
    2. The user can now issue queries and Hyperspace will use the index (see the sketch below)
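For illustration, the end-to-end flow with the flag enabled might look like this (a sketch using the public Hyperspace Scala API; the index name, columns, and path are made up for the example, and spark is an active SparkSession):

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

val hs = new Hyperspace(spark)

// 1. Create an index over the dataset (done once, before any deletes).
val df = spark.read.parquet("/path/to/dataset/")
hs.createIndex(df, IndexConfig("myIndex", indexedColumns = Seq("c3"), includedColumns = Seq("c1")))

// ... the user deletes some files under /path/to/dataset/ ...

// 2. With the flag enabled, refresh rewrites only the affected portions
//    of the index instead of performing a full rebuild.
spark.conf.set("spark.hyperspace.index.refresh.delete.enabled", "true")
hs.refreshIndex("myIndex")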

How was this patch tested?

New test cases were added in the new test suite RefreshIndexDeleteTests.scala.

@pirz pirz self-assigned this Sep 3, 2020
@pirz pirz added the enhancement New feature or request label Sep 3, 2020
Collaborator

@sezruby sezruby left a comment


The approach and WIP code generally LGTM. Thanks!

Contributor

@imback82 imback82 left a comment


Is this WIP?

@pirz pirz changed the title from [WIP] Add support for delete to index refresh to Add support for delete to index refresh Sep 10, 2020
Comment on lines 60 to 66
var currentFiles = Seq[String]()
rels.head.rootPaths.foreach { p =>
currentFiles ++= Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}
Contributor

@apoorvedave1 apoorvedave1 Sep 11, 2020


Suggested change
var currentFiles = Seq[String]()
rels.head.rootPaths.foreach { p =>
currentFiles ++= Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}
val currentFiles = rels.head.rootPaths.flatMap { p =>
Content
.fromDirectory(path = new Path(p))
.files
.map(_.toString)
}

Contributor


nit: it might be better to explore the IndexLogEntry.listLeafFiles() API for file listing here. We could move that function to the PathUtils class for more generic use.

We can do that as a separate PR to keep this one simple.
(1. move listLeafFiles from IndexLogEntry to PathUtils. 2. Use listLeafFiles here instead of Content)
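For reference, a generic leaf-file listing in PathUtils might look roughly like this (a hypothetical sketch using the Hadoop FileSystem API; the actual refactoring was deferred to a separate PR as noted above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object PathUtils {
  // Recursively list all leaf (non-directory) files under the given path.
  def listLeafFiles(path: Path, conf: Configuration = new Configuration()): Seq[FileStatus] = {
    val fs = path.getFileSystem(conf)
    val (dirs, files) = fs.listStatus(path).toSeq.partition(_.isDirectory)
    files ++ dirs.flatMap(d => listLeafFiles(d.getPath, conf))
  }
}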

Contributor Author


Thanks, I went with your suggestion, and I agree on making IndexLogEntry.listLeafFiles() available to other classes, as it is more of a utility function that could come up in different scenarios (like the one above).

Contributor

@apoorvedave1 apoorvedave1 left a comment


Left some minor comments. Thanks @pirz!

@rapoth
Contributor

rapoth commented Sep 11, 2020

@pirz I went from this issue to #133 -> #104. Can you also add a link to the uber issue that explains the entire e2e strategy of finishing this work? Thank you!

@rapoth
Contributor

rapoth commented Sep 11, 2020

@pirz For the following:

Does this PR introduce any user-facing change?
Yes, it changes the behavior of index refresh and helps with incremental index refresh to remove deleted index records.

Can you please add it as follows:

Old experience:

  1. User creates an index on some data e.g., /path/to/dataset/
  2. User issues a query and Hyperspace is able to use the index
  3. User deletes some files from the original data /path/to/dataset/
  4. User issues a query but Hyperspace detects data change and decides to disable index usage
  5. User invokes refresh to update the index
  6. Hyperspace does a full index rebuild

New experience:
Steps 1 - 4 remain the same. If the user leaves spark.hyperspace.index.refresh.delete.enabled disabled, then the Hyperspace experience remains the same as steps 5 and 6 above.

If user enables spark.hyperspace.index.refresh.delete.enabled, then:

  1. Hyperspace detects the portions of the index that need a rewrite and updates them
  2. User can now issue queries and Hyperspace will use the index

val indexDF = spark.read.parquet(previousIndexLogEntry.content.files.map(_.toString): _*)

ResolverUtils
.resolve(spark, IndexConstants.DATA_FILE_NAME_COLUMN, indexDF.schema.fieldNames) match {
Collaborator

@sezruby sezruby Sep 16, 2020


It would be good to move this to IndexLogEntry as previousIndexLogEntry.hasLineageColumn, as I also need this utility function :) And I think it's better to check this first using just previousIndexLogEntry.schema, and skip if false, before the spark.read.parquet(... above.

Contributor Author


Sure, I added def hasLineageColumn(spark: SparkSession): Boolean = {...} to IndexLogEntry.
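For context, such a check could be as simple as resolving the lineage column name against the index schema. The sketch below is inferred from the snippet above, assuming ResolverUtils.resolve returns an Option (as the match in the snippet suggests) and that schema is the index schema held by IndexLogEntry; it is not necessarily the exact merged code:

import org.apache.spark.sql.SparkSession

// True if the index carries the lineage column, resolved against the
// session's name-resolution (case sensitivity) settings.
def hasLineageColumn(spark: SparkSession): Boolean =
  ResolverUtils
    .resolve(spark, IndexConstants.DATA_FILE_NAME_COLUMN, schema.fieldNames)
    .isDefined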

*/
private def getDeletedFiles: Seq[String] = {
val rels = previousIndexLogEntry.relations
val originalFiles = rels.head.data.properties.content.files.map(_.toString)
Collaborator


Could you check the file metadata (size / modification time) as well?

Contributor Author

@pirz pirz Sep 17, 2020


@sezruby Here we are looking for "deleted" files, and the file name suffices for that check. If a file is renamed, we still mark it as deleted; however, I am not sure whether modifying the content of an existing file is a valid scenario. Can you explain a bit what exactly you are suggesting? Thanks.

Contributor


Good point. I think the user can "overwrite" the files.

I think if we detect a metadata mismatch, we shouldn't perform the action but instead suggest performing a full refresh.

Contributor Author


@imback82 Does that mean that for existing files (those present in both originalFiles and currentFiles) we should do a full metadata comparison here, and if there is a mismatch, abort the ongoing refresh action?

Contributor


+1 It might be better to just do a full metadata comparison and abort (with a suggestion). It is becoming increasingly clear that we should be on the safe side. :)

Collaborator

@sezruby sezruby Sep 17, 2020


I think we can handle a modified file as both a deleted and an appended file: abort DeleteRefresh in this PR for now, but later, with the append implementation, we could properly refresh the file as a delete plus an append.

If the metadata isn't checked here, signatureValid would otherwise fail for a modified file; after the refresh it won't fail, and then the query result might be different.

Contributor Author


A FileInfo check was added, along with a new test case under RefreshIndexTests.scala to validate it.
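For illustration, the metadata comparison discussed in this thread could look roughly like the following (a hypothetical sketch with a made-up FileInfo shape; the actual FileInfo class and check in this PR may differ):

// Hypothetical file metadata record; the real FileInfo may differ.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

// Files present in both listings whose size or modification time changed
// were overwritten in place; the delete-only refresh should then abort
// and suggest a full refresh instead.
def findModifiedFiles(original: Seq[FileInfo], current: Seq[FileInfo]): Seq[String] = {
  val currentByName = current.map(f => f.name -> f).toMap
  original.collect {
    case o if currentByName.get(o.name).exists(c =>
        c.size != o.size || c.modifiedTime != o.modifiedTime) =>
      o.name
  }
}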

Contributor


@apoorvedave1 as FYI

@rapoth rapoth added this to the 0.4.0 milestone Sep 17, 2020
@rapoth rapoth linked an issue Sep 17, 2020 that may be closed by this pull request
sezruby
sezruby previously approved these changes Sep 18, 2020
Collaborator

@sezruby sezruby left a comment


LGTM, thanks @pirz!

@rapoth
Contributor

rapoth commented Sep 18, 2020

LGTM, thanks @pirz!

apoorvedave1
apoorvedave1 previously approved these changes Sep 18, 2020
Contributor

@apoorvedave1 apoorvedave1 left a comment


LGTM, thanks @pirz!

def query(): DataFrame =
  spark.read.parquet(location).filter("c3 == 'facebook'").select("c3", "c1")

// Verify index usage on latest version of index (v=1) after refresh.
Contributor


If you are verifying this way, I would do the following:

  • verifyIndexUsage with version 0.
  • Delete the file -> verify index is not utilized
  • verifyIndexUsage with version 1.

Contributor Author

@pirz pirz Sep 18, 2020


I modified the test case accordingly.

Contributor


Well, if there is no verification before applying the refresh, we are not really validating anything (it's possible that the test setup was wrong, the refresh didn't work, etc.). To me, it's crucial to test this end-to-end scenario in the E2E tests.

@pirz pirz dismissed stale reviews from apoorvedave1 and sezruby via 7137b89 September 18, 2020 20:29
Contributor

@imback82 imback82 left a comment


LGTM, thanks @pirz!

@imback82 imback82 merged commit 1d3ac49 into microsoft:master Sep 18, 2020
@imback82
Contributor

Merged to master. Nice work!

@rapoth
Contributor

rapoth commented Sep 18, 2020

Looks awesome, thank you @pirz! 👍

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Incremental Index Maintenance for File/Partition Mutable Datasets
5 participants