DM-23671: Change datastore remove to a trash + emptyTrash #259

Merged: 8 commits merged into master from tickets/DM-23671 on Apr 18, 2020

Conversation

timj (Member) commented on Apr 16, 2020

Splits the remove step into two phases. The first moves a row from the dataset_location table to the dataset_location_trash table. The second step deletes all the datasets that have been trashed. This will hopefully further reduce the chances that deleting datasets leaves the datastore in an inconsistent state.

I need to clean up some of the commits, but all tests pass, so it's ready for a look.
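
A minimal sketch of how the two-phase flow looks from a caller's side. Only the trash()/emptyTrash() split itself comes from this PR; the repo path, dataset type, and collection names are hypothetical:

```python
from lsst.daf.butler import Butler

# Hypothetical repo and collection, for illustration only.
butler = Butler("some/repo", run="demo/run")
refs = list(butler.registry.queryDatasets("raw", collections="demo/run"))

# Phase 1: mark the datasets as trashed.  This only moves rows from
# dataset_location into dataset_location_trash; no artifacts are touched.
for ref in refs:
    butler.datastore.trash(ref)

# Phase 2: delete the trashed artifacts and their trash-table rows.
# If this dies part way through, the remaining rows stay in the trash
# table and a later emptyTrash() can finish the job.
butler.datastore.emptyTrash()
```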

@timj requested a review from TallJimbo on April 16, 2020 17:37
@timj force-pushed the tickets/DM-23671 branch 7 times, most recently from c27be19 to aeec09e, on April 16, 2020 18:08
Commit messages:
- This matches the API that uses that table.
- The trash table is identical to dataset_location but without the foreign key.
TallJimbo (Member) left a comment

This is a huge improvement. I've left a number of comments about possible further improvements, mostly concurrency edge cases. Most of those are about trying to avoid query-then-operate logic, because I've learned that it leaves one open to something unexpected happening in between.

But I'm also not sure all of those separate comments form a coherent big picture of how to make this rock-solid, at least not yet, especially when different varieties of composites are in play. And I'm not sure we need rock-solid yet - this is already enough to get us from "easy to accidentally corrupt your repo" to "very difficult to accidentally corrupt your repo", so feel free to defer any of my comments to a future ticket.

python/lsst/daf/butler/core/datasets/ref.py (outdated review thread, resolved)
[table.columns.datastore_name, table.columns.dataset_id]
).where(
sqlalchemy.sql.and_(table.columns.dataset_id.in_([ref.id for ref in refs]),
table.columns.datastore_name == datastoreName)
TallJimbo (Member):

I don't know if this query scales when the number of refs is very large. I don't know that it doesn't, but it might be good to get a database expert to weigh in (@andy-slac, maybe?). My first thought would be that we might want to chunk these up, but I have no idea what the chunk size ought to be.
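
A sketch of the chunking idea, using the SQLAlchemy 1.3-style select() seen in the snippet above; the chunk size of 1000 and the helper names are illustrative, not measured or existing APIs:

```python
import sqlalchemy


def _chunked(seq, size):
    """Yield successive slices of ``seq`` of at most ``size`` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def select_locations(connection, table, datastoreName, refs, chunk_size=1000):
    """Run the dataset_location query in bounded IN-clause chunks."""
    rows = []
    ids = [ref.id for ref in refs]
    for chunk in _chunked(ids, chunk_size):
        query = sqlalchemy.sql.select(
            [table.columns.datastore_name, table.columns.dataset_id]
        ).where(
            sqlalchemy.sql.and_(
                table.columns.dataset_id.in_(chunk),
                table.columns.datastore_name == datastoreName,
            )
        )
        rows.extend(connection.execute(query).fetchall())
    return rows
```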

timj (Member, Author):

At the moment this is only being called with a single ref and associated components, so I don't think it's going to be unbounded. Of course, in the future you might want to do bulk trashing rather than a per-ref trash loop.

"""
# We only want to move rows that already exist in the main table
filtered = self.checkDatasetLocations(datastoreName, refs)
self.canDeleteDatasetLocations(datastoreName, filtered)
TallJimbo (Member):

Doing the query that filters down to the datasets that exist in the database separately from the insert into the trash table might be unsafe in the presence of concurrent access - I think the SELECT query could be out of date by the time the INSERT query runs, even if they're in a transaction. Unfortunately, fixing that would involve adding INSERT...SELECT support to Database or punching a bit of a hole in its abstraction layer to let this code do it directly. I'm going to have to move this code around a bit anyway on DM-21764, so I could have a try at fixing it there when I do. That would need to involve merging checkDatasetLocations and canDeleteDatasetLocations.
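
For reference, a single-statement INSERT ... SELECT can be expressed in SQLAlchemy roughly as below, so the filter and the insert see the same snapshot of the data. The column names follow the snippets above; the standalone function and where such a helper would live in Database are assumptions:

```python
import sqlalchemy


def move_to_trash(connection, location_table, trash_table, datastoreName, refs):
    """Copy matching dataset_location rows into the trash table in one
    INSERT ... SELECT statement, avoiding a separate query-then-insert step."""
    select_existing = sqlalchemy.sql.select(
        [location_table.columns.datastore_name, location_table.columns.dataset_id]
    ).where(
        sqlalchemy.sql.and_(
            location_table.columns.dataset_id.in_([ref.id for ref in refs]),
            location_table.columns.datastore_name == datastoreName,
        )
    )
    insert = trash_table.insert().from_select(
        ["datastore_name", "dataset_id"], select_existing
    )
    connection.execute(insert)
```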

timj (Member, Author):

I'm fine with you taking another stab at this. The reason I had to do it was that in some cases a DatasetRef can refer to components that have already been trashed (since we can't turn the component off within the DatasetRef), so only moving rows that are still present prevented me from adding rows to the trash that were already listed there.

python/lsst/daf/butler/registry/_registry.py (outdated review thread, resolved)
@@ -1078,12 +1074,15 @@ def prune(self, refs: Iterable[DatasetRef], *,
     # to ignore already-removed-from-datastore datasets
     # anyway.
     if self.datastore.exists(ref):
-        self.datastore.remove(ref)
+        self.datastore.trash(ref)
TallJimbo (Member):

I think we could now simplify a lot of this, removing the giant TODO comment block and all the checks under the if purge blocks (which will be handled more rigorously by transaction rollback now). Roughly (sketched below):

  • Start a registry transaction context (I don't think the Datastore ones are useful here).
  • Expand datasets to components recursively.
  • Do the if unstore loop that calls self.datastore.trash.
  • Do the disassociate and possibly the purge Registry operations.
  • End the transaction context, committing those changes (or raising if something went wrong).
  • Call emptyTrash.

I'd love to have someone else check my logic on that, of course, especially on whether it's safe if there are concurrent deletion operations and how it interacts with various kinds of composites in datastore (for virtual composites, are there ever records in dataset_location?).
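
A rough sketch of that simplified flow; the component-expansion helper and the exact Registry calls for disassociate/purge are assumptions about Butler/Registry internals, not confirmed APIs:

```python
def prune(self, refs, *, disassociate=True, unstore=False, tags=(), purge=False):
    """Sketch of the simplified pruning flow discussed above."""
    # Expand refs to include their components so composites are handled
    # together (helper name is illustrative).
    expanded = list(self._expand_with_components(refs))

    # Single Registry transaction; Datastore-level transactions are skipped.
    with self.registry.transaction():
        if unstore:
            for ref in expanded:
                # Only marks rows as trashed; artifacts are untouched here.
                self.datastore.trash(ref)
        if disassociate:
            for tag in tags:
                self.registry.disassociate(tag, expanded)
        if purge:
            # Exact Registry call for purging is an assumption here.
            self.registry.removeDatasets(expanded)

    # Only after the Registry commit do we delete actual artifacts.
    if unstore:
        self.datastore.emptyTrash()
```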

raise FileNotFoundError(f"Requested dataset ({ref}) does not exist")

if not self._artifact_exists(location):
    raise FileNotFoundError(f"No such file: {location.uri}")
TallJimbo (Member):

I suspect making this a silent no-op will make concurrent calls to emptyTrash safer, and maybe totally safe.

timj (Member, Author):

OK. Concurrent access would suggest we should also skip the earlier test, since a missing file implies the trashed dataset has already been removed from the internal registry as well as having its file removed.
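
A sketch of what the tolerant version could look like; `_artifact_exists` comes from the snippet above, while `_delete_artifact` and the method name are made up for illustration:

```python
def _empty_one_trash_entry(self, ref, location):
    """Delete a single trashed artifact, tolerating concurrent emptyTrash calls."""
    # If the artifact is already gone, another process probably emptied this
    # entry first; skip quietly instead of raising FileNotFoundError.
    if not self._artifact_exists(location):
        log.debug("Artifact %s already removed; skipping", location.uri)
        return
    self._delete_artifact(location)  # made-up name for the actual deletion step
```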

python/lsst/daf/butler/datastores/fileLikeDatastore.py (outdated review thread, resolved)
python/lsst/daf/butler/datastores/genericDatastore.py (outdated review thread, resolved)
log.debug("Trash %s in datastore %s", ref, self.name)

# Check that this dataset is known to datastore
self._get_dataset_info(ref)
TallJimbo (Member):

Is this check necessary? It'd probably be safer to make the deletion do nothing if the ref isn't actually in the datastore, to guard against the case where a concurrent delete happens between the check and the move to trash.

timj (Member, Author):

In this particular case a concurrent delete should be impossible because the in-memory datastore is constrained to a particular Python process. For the other datastores, I don't really know how to respond to multiple processes all trying to delete the same dataset. It sort of means that you have to say that trash() is never allowed to fail, because you can never really tell whether the file should be there or not.
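
For the file-backed datastores, the "never allowed to fail" behaviour might look something like this sketch; the exception type caught and the registry-update helper are assumptions:

```python
def trash(self, ref):
    """Mark a dataset as trashed, ignoring refs this datastore does not know."""
    log.debug("Trash %s in datastore %s", ref, self.name)
    try:
        self._get_dataset_info(ref)
    except FileNotFoundError:
        # Nothing to do: the dataset is unknown here, or a concurrent
        # trash/emptyTrash already removed it.
        log.debug("%s not known to datastore %s; nothing to trash", ref, self.name)
        return
    # Move the dataset_location row to the trash table (helper name assumed).
    self._move_to_trash_in_registry(ref)
```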

self._remove_from_registry(ref)
for refId in artifactsToRemove:
    log.debug("Removing artifact %s from datastore %s", realID, ref)
    del self.datasets[refId]
TallJimbo (Member):

I think it might be safer to do this operation (unlike the others) one dataset at a time, by first deleting the artifact, then deleting the internal records, and only then removing from the trash table. That means a Datastore would have to be willing to silently ignore artifact deletion if it fails because the artifact is already gone (i.e. because a previous deletion operation only got that far before it died). But I think that's reasonable, and the failure mode of the current logic is more dangerous: a file could be left around even though it'd already been completely removed from all records (even the trash).
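
A sketch of the one-dataset-at-a-time ordering being proposed; `_remove_from_registry` and `_artifact_exists` appear in the snippets above, while the other helper names are invented for illustration:

```python
def emptyTrash(self):
    """Empty the trash one dataset at a time: artifact first, internal records
    second, trash-table row last, so a crash can leave stale records behind
    but never an artifact with no record of it anywhere."""
    for ref, location in self._iter_trashed(self.name):       # helper name assumed
        # 1. Delete the artifact, ignoring "already gone" so a re-run after a
        #    crash (or a concurrent emptyTrash) does not fail at this step.
        if self._artifact_exists(location):
            self._delete_artifact(location)                    # helper name assumed
        # 2. Remove the datastore's internal records for this dataset.
        self._remove_from_registry(ref)
        # 3. Only now drop the row from the trash table.
        self._remove_from_trash_table(ref)                     # helper name assumed
```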

timj (Member, Author):

One option is to do a rename in the main trash-emptying loop and then a final delete afterwards. You are right, though, that one reason the file may no longer be there is that some other process is emptying the trash at the same time and got to that file first (and the rename would also fail in that case). Multiple processes emptying the trash simultaneously means we mostly have to swallow any errors and cross our fingers, which is a problem if a file is left hanging around.

The trash emptying currently tries to delete everything. Another option is for emptyTrash to keep the current "delete everything" mode but also offer a "delete only these refs and their children" option (i.e. the refs that you told it to trash only shortly before).

Concurrent access does raise the possibility that one process is emptying trash whilst another is putting the dataset that has been deleted. This will fail at the moment because the internal registry will still mention the file and the file will be on disk; since emptyTrash is in a transaction, I think that will be fine.
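
And a sketch of the rename-first variant mentioned above; the ".trash" suffix and all helper names here are illustrative:

```python
def emptyTrash(self):
    """Two-pass variant: rename artifacts while the records are cleaned up,
    then delete the renamed files once everything else has succeeded."""
    renamed = []
    for ref, location in self._iter_trashed(self.name):      # helper name assumed
        trash_uri = location.uri + ".trash"                    # illustrative suffix
        try:
            self._rename_artifact(location.uri, trash_uri)     # helper name assumed
        except FileNotFoundError:
            # Another process probably got to this entry first; skip it.
            continue
        self._remove_from_registry(ref)
        self._remove_from_trash_table(ref)                     # helper name assumed
        renamed.append(trash_uri)
    # Final pass: the records are now consistent, so deleting the renamed
    # files is safe even if this loop is interrupted.
    for uri in renamed:
        self._delete_artifact_uri(uri)                         # helper name assumed
```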

TallJimbo (Member):

As we discussed a bit on slack, I think we're pretty much forced into the "ignore errors" behavior to deal with other aspects of concurrency.

For the last point on put collisions with deletions, I think that's a sufficiently rare case that it's fine if the put just fails, because it'll do so without doing any damage to the repository (as we've already done the work to make sure put is atomic).

Commit messages:
- This lets us reuse some internal APIs during dataset deletion where we will not be able to ask registry for a full DatasetRef.
- We now have a trash() method that marks the dataset for deletion and an emptyTrash() that does the removal itself. This was added after I wrote some of this code in datastore, so I had not used it initially.
- This makes it consistent with current style.
- Removal allows errors but trash by default will not.
@timj merged commit 673156c into master on Apr 18, 2020
@timj deleted the tickets/DM-23671 branch on April 18, 2020 04:12