DM-23671: Change datastore remove to a trash + emptyTrash #259
Conversation
This matches the API that uses that table
The trash table is identical to dataset_location but without the foreign key.
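A sketch of that schema relationship (the column names are assumptions inferred from the queries in this PR, and the real tables carry more columns; sqlite3 is used here only to keep the example self-contained):

```python
import sqlite3

# The trash table mirrors dataset_location but drops the foreign key,
# so a trashed row can outlive its entry in the dataset table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY
);
CREATE TABLE dataset_location (
    datastore_name TEXT NOT NULL,
    dataset_id INTEGER NOT NULL REFERENCES dataset (dataset_id),
    PRIMARY KEY (datastore_name, dataset_id)
);
-- Identical columns, but no REFERENCES clause.
CREATE TABLE dataset_location_trash (
    datastore_name TEXT NOT NULL,
    dataset_id INTEGER NOT NULL,
    PRIMARY KEY (datastore_name, dataset_id)
);
""")
```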
This is a huge improvement. I've left a number of comments about possible further improvements, mostly concurrency edge cases. And most of those are about trying to avoid query-then-operate logic, because I've learned that leaves one open to something unexpected happening in between.
But I'm also not sure all of those separate comments form a coherent big picture of how to make this rock-solid, at least not yet, especially when different varieties of composites are in play. And I'm not sure we need rock-solid yet - this is already enough to get us from "easy to accidentally corrupt your repo" to "very difficult to accidentally corrupt your repo", so feel free to defer any of my comments to a future ticket.
```python
    [table.columns.datastore_name, table.columns.dataset_id]
).where(
    sqlalchemy.sql.and_(table.columns.dataset_id.in_([ref.id for ref in refs]),
                        table.columns.datastore_name == datastoreName)
```
I don't know if this query scales when the number of refs is very large. I don't know that it doesn't, but it might be good to get a database expert to weigh in (@andy-slac, maybe?). My first thought would be that we might want to chunk these up, but I have no idea what the chunk size ought to be.
At the moment this is only being called with a single ref and associated components, so I don't think it's going to be unbounded. Of course, in the future you might want to be doing bulk trashing rather than calling `trash` for each ref.
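If bulk trashing ever does make the `IN` clause unbounded, the chunking idea floated above could look something like this (the helper names and the 1000-row chunk size are illustrative guesses, not measured values):

```python
def chunked(seq, n):
    """Yield successive slices of at most n items."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

def trash_in_chunks(refs, trash_one_chunk, chunk_size=1000):
    # Issue one statement per chunk so the IN list never grows past a
    # bounded size; some databases cap IN-list length near 1000.
    for chunk in chunked(list(refs), chunk_size):
        trash_one_chunk(chunk)
```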
""" | ||
# We only want to move rows that already exist in the main table | ||
filtered = self.checkDatasetLocations(datastoreName, refs) | ||
self.canDeleteDatasetLocations(datastoreName, filtered) |
Doing the query that filters down to the datasets that exist in the database separately from the insert into the trash table might be unsafe in the presence of concurrent access - I think the SELECT query could be out of date by the time the INSERT query runs, even if they're in a transaction. Unfortunately, fixing that would involve adding INSERT...SELECT support to `Database`, or punching a bit of a hole in its abstraction layer to let this code do it directly. I'm going to have to move this code around a bit anyway on DM-21764, so I could have a try at fixing it there when I do. That would need to involve merging `checkDatasetLocations` and `canDeleteDatasetLocations`.
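For reference, a minimal sketch of what INSERT...SELECT could look like in SQLAlchemy, so the filter and the copy happen in one statement with no window for a concurrent delete in between (the table shapes and the `move_to_trash` helper are assumptions, not the real `Database` API):

```python
import sqlalchemy

metadata = sqlalchemy.MetaData()
location = sqlalchemy.Table(
    "dataset_location", metadata,
    sqlalchemy.Column("datastore_name", sqlalchemy.String),
    sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
)
trash = sqlalchemy.Table(
    "dataset_location_trash", metadata,
    sqlalchemy.Column("datastore_name", sqlalchemy.String),
    sqlalchemy.Column("dataset_id", sqlalchemy.Integer),
)

def move_to_trash(conn, datastore_name, ids):
    # Rows are selected and inserted by the database in a single
    # INSERT ... SELECT, so no separate query-then-operate race exists.
    selected = sqlalchemy.select(
        location.c.datastore_name, location.c.dataset_id
    ).where(
        sqlalchemy.and_(location.c.dataset_id.in_(ids),
                        location.c.datastore_name == datastore_name)
    )
    conn.execute(trash.insert().from_select(
        ["datastore_name", "dataset_id"], selected))
```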
I'm fine with you taking another stab at this. The reason I had to do it this way was that in some cases a `DatasetRef` can refer to components that have already been trashed (since we can't turn the component off within the `DatasetRef`), so restricting the move to rows that are still present prevented me from adding rows to the trash that were already listed there.
```diff
@@ -1078,12 +1074,15 @@ def prune(self, refs: Iterable[DatasetRef], *,
             # to ignore already-removed-from-datastore datasets
             # anyway.
             if self.datastore.exists(ref):
-                self.datastore.remove(ref)
+                self.datastore.trash(ref)
```
I think we could now simplify a lot of this, removing the giant TODO comment block and all the checks under the `if purge` blocks (which will be handled more rigorously by transaction rollback now):

- Start a registry transaction context (I don't think the Datastore ones are useful here).
- Expand datasets to components recursively.
- Do the `if unstore` loop that calls `self.datastore.trash`.
- Do the disassociate and possibly the purge Registry operations.
- End the transaction context, committing those changes (or raising if something went wrong).
- Call `emptyTrash`.

I'd love to have someone else check my logic on that, of course, especially on whether it's safe if there are concurrent deletion operations and how it interacts with various kinds of composites in datastore (for virtual composites, are there ever records in `dataset_location`)?
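The proposed flow can be sketched with stand-in classes (`FakeRegistry` and `FakeDatastore` are toy stubs for illustration, not the real lsst.daf.butler API; the key point is that `emptyTrash` runs only after the registry transaction commits):

```python
from contextlib import contextmanager

class FakeRegistry:
    def __init__(self):
        self.removed = []
    @contextmanager
    def transaction(self):
        # The real registry would roll back on error; this stub re-raises.
        yield
    def removeDataset(self, ref):
        self.removed.append(ref)

class FakeDatastore:
    def __init__(self):
        self.trashed = []
        self.emptied = False
    def trash(self, ref):
        self.trashed.append(ref)
    def emptyTrash(self):
        self.emptied = True

def prune(registry, datastore, refs, *, unstore=True, purge=True):
    """Sketch of the simplified prune: trash and purge inside one
    registry transaction, then empty the trash after the commit."""
    with registry.transaction():
        for ref in refs:  # the real code would expand composites here
            if unstore:
                datastore.trash(ref)
            if purge:
                registry.removeDataset(ref)
    # Only reached if the transaction committed cleanly.
    datastore.emptyTrash()
```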
```python
    raise FileNotFoundError(f"Requested dataset ({ref}) does not exist")
```

```python
    if not self._artifact_exists(location):
        raise FileNotFoundError(f"No such file: {location.uri}")
```
I suspect making this a silent no-op will make concurrent calls to `emptyTrash` safer, and maybe totally safe.
Ok. Concurrent access would suggest that we should also skip the earlier test, since it implies that the trashed dataset may already have been removed from the internal registry as well as having had its file removed.
```python
    log.debug("Trash %s in datastore %s", ref, self.name)
```

```python
    # Check that this dataset is known to datastore
    self._get_dataset_info(ref)
```
Is this check necessary? It'd probably be safer to make the deletion do nothing if the `ref` isn't actually in the datastore, to guard against the case where a concurrent delete happens between the check and the move to trash.
In this particular case a concurrent delete should be impossible, because the in-memory datastore is constrained to a particular Python process. For the other datastores, I don't really know how to respond to multiple processes all trying to delete the same dataset. It sort of means that you have to say that `trash()` is never allowed to fail, because you can never really tell whether the file should be there or not.
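A toy illustration of a `trash()` that never fails when the ref is already gone, which is one way to make the check-then-trash race harmless (the class and attribute names are illustrative, not the real in-memory datastore):

```python
import logging

log = logging.getLogger("toy.datastore")

class ToyDatastore:
    def __init__(self):
        self.datasets = {}   # ref -> artifact
        self.trashed = set()

    def trash(self, ref):
        if ref not in self.datasets:
            # Tolerate a concurrent (or repeated) delete: nothing to do,
            # rather than raising and failing the whole prune.
            log.debug("Dataset %s not known to datastore; ignoring", ref)
            return
        self.trashed.add(ref)
```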
```python
    self._remove_from_registry(ref)
    for refId in artifactsToRemove:
        log.debug("Removing artifact %s from datastore %s", realID, ref)
        del self.datasets[refId]
```
I think it might be safer to do this operation (unlike the others) one dataset at a time, by first deleting the artifact, then deleting the internal records, and only then removing from the trash table. That means a Datastore would have to be willing to silently ignore artifact deletion if it fails because the artifact is already gone (i.e. because a previous deletion operation only got that far before it died). But I think that's reasonable, and the failure mode of the current logic is more dangerous: a file could be left around even though it'd already been completely removed from all records (even the trash).
One option is to do a rename in the main trash-emptying loop and then a final delete afterwards. You are right, though, that one reason the file may no longer be there is that some other process is emptying the trash at the same time and got to that file first (the rename would also fail in that case). Multiple processes emptying the trash simultaneously implies that we mostly have to suck up any errors and cross our fingers. This is a problem if the file is left hanging around.

The trash emptying currently tries to delete everything. Another option is for `emptyTrash` to have a "delete everything" mode as now, but also a "delete only these refs and children" option (i.e. the refs that you only shortly before told it to trash).

Concurrent access does raise the possibility that one process is emptying trash whilst another is putting the dataset that has been deleted. This will fail at the moment because the internal registry will still mention the file and the file will be on disk, and since `emptyTrash` is in a transaction I think that will be fine.
As we discussed a bit on Slack, I think we're pretty much forced into the "ignore errors" behavior to deal with other aspects of concurrency.

For the last point on `put` collisions with deletions, I think that's a sufficiently rare case that it's fine if the `put` just fails, because it'll do so without doing any damage to the repository (as we've already done the work to make sure `put` is atomic).
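The per-dataset ordering discussed in this thread (artifact first, then internal records, then the trash-table row, with a missing artifact silently ignored) could be sketched as follows; all the callables here are hypothetical stand-ins for the real datastore operations:

```python
def empty_trash(trash_rows, delete_artifact, delete_record, delete_trash_row):
    """Empty the trash one dataset at a time. If the process dies
    part-way through a dataset, the remaining records are still
    reachable via the trash table on the next pass."""
    for ref in list(trash_rows):
        try:
            delete_artifact(ref)
        except FileNotFoundError:
            # A previous (or concurrent) emptyTrash already removed the
            # file; tolerating this is the "ignore errors" behavior.
            pass
        delete_record(ref)
        delete_trash_row(ref)
```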
This lets us reuse some internal APIs during dataset deletion, where we will not be able to ask registry for a full `DatasetRef`.
We now have a trash() method that marks the dataset for deletion and an emptyTrash() that does the remove itself.
This was added after I wrote some of this code in datastore so I had not used it initially.
This makes it consistent with current style.
Removal allows errors, but trash by default will not.
Splits the remove step into two phases. The first moves a row from the dataset_location to the dataset_location_trash table. The second step deletes all the datasets that have been trashed. This will hopefully reduce further the chances of datasets being deleted and leaving the datastore in an inconsistent state.
I need to clean up some of the commits but tests all pass so it's ready for a look.