DM-23671: Change datastore remove to a trash + emptyTrash #259
Conversation
This matches the API that uses that table
The trash table is identical to dataset_location but without the foreign key.
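A sketch of that schema relationship (the column names are assumptions inferred from the queries in this PR, and the real tables carry more columns; sqlite3 is used here only to keep the example self-contained):

```python
import sqlite3

# The trash table mirrors dataset_location but drops the foreign key,
# so a trashed row can outlive its entry in the dataset table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY
);
CREATE TABLE dataset_location (
    datastore_name TEXT NOT NULL,
    dataset_id INTEGER NOT NULL REFERENCES dataset (dataset_id),
    PRIMARY KEY (datastore_name, dataset_id)
);
-- Identical columns, but no REFERENCES clause.
CREATE TABLE dataset_location_trash (
    datastore_name TEXT NOT NULL,
    dataset_id INTEGER NOT NULL,
    PRIMARY KEY (datastore_name, dataset_id)
);
""")
```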
This is a huge improvement. I've left a number of comments about possible further improvements, mostly concurrency edge cases. And most of those are about trying to avoid query-then-operate logic, because I've learned that leaves one open to something unexpected happening in between.
But I'm also not sure all of those separate comments form a coherent big picture of how to make this rock-solid, at least not yet, especially when different varieties of composites are in play. And I'm not sure we need rock-solid yet - this is already enough to get us from "easy to accidentally corrupt your repo" to "very difficult to accidentally corrupt your repo", so feel free to defer any of my comments to a future ticket.
```python
    [table.columns.datastore_name, table.columns.dataset_id]
).where(
    sqlalchemy.sql.and_(table.columns.dataset_id.in_([ref.id for ref in refs]),
                        table.columns.datastore_name == datastoreName)
```
I don't know if this query scales when the number of refs is very large. I don't know that it doesn't, but it might be good to get a database expert to weigh in (@andy-slac, maybe?). My first thought would be that we might want to chunk these up, but I have no idea what the chunk size ought to be.
At the moment this is only being called with a single ref and associated components, so I don't think it's going to be unbounded. Of course, in the future you might want to be doing bulk trashing rather than calling `trash` for each ref.
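If bulk trashing ever does make the `IN` clause unbounded, the chunking idea floated above could look something like this (the helper names and the 1000-row chunk size are illustrative guesses, not measured values):

```python
def chunked(seq, n):
    """Yield successive slices of at most n items."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

def trash_in_chunks(refs, trash_one_chunk, chunk_size=1000):
    # Issue one statement per chunk so the IN list never grows past a
    # bounded size; some databases cap IN-list length near 1000.
    for chunk in chunked(list(refs), chunk_size):
        trash_one_chunk(chunk)
```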
""" | ||
# We only want to move rows that already exist in the main table | ||
filtered = self.checkDatasetLocations(datastoreName, refs) | ||
self.canDeleteDatasetLocations(datastoreName, filtered) |
Doing the query that filters down to the datasets that exist in the database separately from the insert into the trash table might be unsafe in the presence of concurrent access - I think the SELECT query could be out of date by the time the INSERT query runs, even if they're in a transaction. Unfortunately, fixing that would involve adding INSERT...SELECT support to `Database`, or punching a bit of a hole in its abstraction layer to let this code do it directly. I'm going to have to move this code around a bit anyway on DM-21764, so I could have a try at fixing it there when I do. That would need to involve merging `checkDatasetLocations` and `canDeleteDatasetLocations`.
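For reference, a minimal sketch of what INSERT...SELECT could look like in SQLAlchemy, so the filter and the copy happen in one statement with no window for a concurrent delete in between (the table shapes and the `move_to_trash` helper are assumptions, not the real `Database` API):

```python
import sqlalchemy

metadata = sqlalchemy.MetaData()
location = sqlalchemy.Table(
    "dataset_location", metadata,
    sqlalchemy.Column("datastore_name", sqlalchemy.String),
    sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
)
trash = sqlalchemy.Table(
    "dataset_location_trash", metadata,
    sqlalchemy.Column("datastore_name", sqlalchemy.String),
    sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
)

def move_to_trash(conn, datastore_name, ids):
    # Rows are selected and inserted by the database in a single
    # INSERT ... SELECT, so no separate query-then-operate race exists.
    selected = sqlalchemy.select(
        location.c.datastore_name, location.c.dataset_id
    ).where(
        sqlalchemy.and_(location.c.dataset_id.in_(ids),
                        location.c.datastore_name == datastore_name)
    )
    conn.execute(trash.insert().from_select(
        ["datastore_name", "dataset_id"], selected))
```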
I'm fine with you taking another stab at this. The reason I had to do it this way was that in some cases a `DatasetRef` can refer to components that have already been trashed (since we can't turn the component off within the `DatasetRef`), so restricting the move to rows that are still present prevented me from adding rows to the trash that were already listed there.
```diff
@@ -1078,12 +1074,15 @@ def prune(self, refs: Iterable[DatasetRef], *,
             # to ignore already-removed-from-datastore datasets
             # anyway.
             if self.datastore.exists(ref):
-                self.datastore.remove(ref)
+                self.datastore.trash(ref)
```
I think we could now simplify a lot of this, removing the giant TODO comment block and all the checks under the `if purge` blocks (which will be handled more rigorously by transaction rollback now):

- Start a registry transaction context (I don't think the Datastore ones are useful here).
- Expand datasets to components recursively.
- Do the `if unstore` loop that calls `self.datastore.trash`.
- Do the disassociate and possibly the purge Registry operations.
- End the transaction context, committing those changes (or raising if something went wrong).
- Call `emptyTrash`.

I'd love to have someone else check my logic on that, of course, especially on whether it's safe if there are concurrent deletion operations and how it interacts with various kinds of composites in datastore (for virtual composites, are there ever records in `dataset_location`)?
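The proposed flow can be sketched with stand-in classes (`FakeRegistry` and `FakeDatastore` are toy stubs for illustration, not the real lsst.daf.butler API; the key point is that `emptyTrash` runs only after the registry transaction commits):

```python
from contextlib import contextmanager

class FakeRegistry:
    def __init__(self):
        self.removed = []
    @contextmanager
    def transaction(self):
        # The real registry would roll back on error; this stub re-raises.
        yield
    def removeDataset(self, ref):
        self.removed.append(ref)

class FakeDatastore:
    def __init__(self):
        self.trashed = []
        self.emptied = False
    def trash(self, ref):
        self.trashed.append(ref)
    def emptyTrash(self):
        self.emptied = True

def prune(registry, datastore, refs, *, unstore=True, purge=True):
    """Sketch of the simplified prune: trash and purge inside one
    registry transaction, then empty the trash after the commit."""
    with registry.transaction():
        for ref in refs:  # the real code would expand composites here
            if unstore:
                datastore.trash(ref)
            if purge:
                registry.removeDataset(ref)
    # Only reached if the transaction committed cleanly.
    datastore.emptyTrash()
```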
```python
    raise FileNotFoundError(f"Requested dataset ({ref}) does not exist")
```

```python
    if not self._artifact_exists(location):
        raise FileNotFoundError(f"No such file: {location.uri}")
```
I suspect making this a silent no-op will make concurrent calls to `emptyTrash` safer, and maybe totally safe.
Ok. Concurrent access would suggest that we should also skip the earlier test, since it implies that the trashed dataset may already have been removed from the internal registry as well as having had its file removed.
```python
    log.debug("Trash %s in datastore %s", ref, self.name)
```

```python
    # Check that this dataset is known to datastore
    self._get_dataset_info(ref)
```
Is this check necessary? It'd probably be safer to make the deletion do nothing if the `ref` isn't actually in the datastore, to guard against the case where a concurrent delete happens between the check and the move to trash.
In this particular case a concurrent delete should be impossible, because the in-memory datastore is constrained to a particular Python process. For the other datastores, I don't really know how to respond to multiple processes all trying to delete the same dataset. It sort of means that you have to say that `trash()` is never allowed to fail, because you can never really tell whether the file should be there or not.
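A toy illustration of a `trash()` that never fails when the ref is already gone, which is one way to make the check-then-trash race harmless (the class and attribute names are illustrative, not the real in-memory datastore):

```python
import logging

log = logging.getLogger("toy.datastore")

class ToyDatastore:
    def __init__(self):
        self.datasets = {}   # ref -> artifact
        self.trashed = set()

    def trash(self, ref):
        if ref not in self.datasets:
            # Tolerate a concurrent (or repeated) delete: nothing to do,
            # rather than raising and failing the whole prune.
            log.debug("Dataset %s not known to datastore; ignoring", ref)
            return
        self.trashed.add(ref)
```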
```python
    self._remove_from_registry(ref)
    for refId in artifactsToRemove:
        log.debug("Removing artifact %s from datastore %s", realID, ref)
        del self.datasets[refId]
```
I think it might be safer to do this operation (unlike the others) one dataset at a time, by first deleting the artifact, then deleting the internal records, and only then removing from the trash table. That means a Datastore would have to be willing to silently ignore artifact deletion if it fails because the artifact is already gone (i.e. because a previous deletion operation only got that far before it died). But I think that's reasonable, and the failure mode of the current logic is more dangerous: a file could be left around even though it'd already been completely removed from all records (even the trash).
One option is to do a rename in the main trash-emptying loop and then a final delete afterwards. You are right, though, that one reason the file may no longer be there is that some other process is emptying the trash at the same time and got to that file first (the rename would also fail in that case). Multiple processes emptying the trash simultaneously implies that we mostly have to suck up any errors and cross our fingers. This is a problem if the file is left hanging around.

The trash emptying currently tries to delete everything. Another option is for `emptyTrash` to have a "delete everything" mode as now, but also a "delete only these refs and children" option (i.e. the refs that you only shortly before told it to trash).

Concurrent access does raise the possibility that one process is emptying trash whilst another is putting the dataset that has been deleted. This will fail at the moment because the internal registry will still mention the file and the file will be on disk, and since `emptyTrash` is in a transaction I think that will be fine.
As we discussed a bit on Slack, I think we're pretty much forced into the "ignore errors" behavior to deal with other aspects of concurrency.

For the last point on `put` collisions with deletions, I think that's a sufficiently rare case that it's fine if the `put` just fails, because it'll do so without doing any damage to the repository (as we've already done the work to make sure `put` is atomic).
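The per-dataset ordering discussed in this thread (artifact first, then internal records, then the trash-table row, with a missing artifact silently ignored) could be sketched as follows; all the callables here are hypothetical stand-ins for the real datastore operations:

```python
def empty_trash(trash_rows, delete_artifact, delete_record, delete_trash_row):
    """Empty the trash one dataset at a time. If the process dies
    part-way through a dataset, the remaining records are still
    reachable via the trash table on the next pass."""
    for ref in list(trash_rows):
        try:
            delete_artifact(ref)
        except FileNotFoundError:
            # A previous (or concurrent) emptyTrash already removed the
            # file; tolerating this is the "ignore errors" behavior.
            pass
        delete_record(ref)
        delete_trash_row(ref)
```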
This lets us reuse some internal APIs during dataset deletion, where we will not be able to ask registry for a full `DatasetRef`.
We now have a trash() method that marks the dataset for deletion and an emptyTrash() that does the remove itself.
This was added after I wrote some of this code in datastore so I had not used it initially.
This makes it consistent with current style.
Removal allows errors, but trash by default will not.
Splits the remove step into two phases. The first moves a row from the dataset_location to the dataset_location_trash table. The second step deletes all the datasets that have been trashed. This will hopefully reduce further the chances of datasets being deleted and leaving the datastore in an inconsistent state.
I need to clean up some of the commits but tests all pass so it's ready for a look.