DM-24515: improvements to deletion (and fixes for bugs found along the way) #261
Conversation
Just one comment (twice) about the chances of emptyTrash raising an exception.
python/lsst/daf/butler/_butler.py
Outdated
try:
    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:
emptyTrash is only going to fail if ctrl-C is hit since by default ignore_errors=True.
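To illustrate the point, here is a hedged sketch (invented names, not the real datastore API) of why an `emptyTrash`-style method with `ignore_errors=True` almost never raises: per-artifact failures are swallowed, so only `BaseException`s like `KeyboardInterrupt` escape.

```python
def empty_trash(trashed_paths, remove, ignore_errors=True):
    """Delete each trashed artifact, optionally ignoring failures.

    Stand-in for a datastore emptyTrash(); ``remove`` is the deleter.
    """
    for path in trashed_paths:
        try:
            remove(path)
        except Exception as err:
            if not ignore_errors:
                raise
            # KeyboardInterrupt subclasses BaseException, not Exception,
            # so ctrl-C still propagates past this handler.
            print(f"ignoring failure to remove {path}: {err}")


def flaky_remove(path):
    """Stand-in deleter that fails for one artifact."""
    if path == "b.fits":
        raise OSError("permission denied")


empty_trash(["a.fits", "b.fits"], flaky_remove)  # does not raise
```

With the default, the caller only sees an exception for things outside the `Exception` hierarchy, which is why the `except BaseException` wrapper above fires essentially only on ctrl-C.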
python/lsst/daf/butler/_butler.py
Outdated
    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:
    raise IOError(
Again this will only happen if something like ctrl-C occurs.
Looks OK, a few comments on docstrings/comments/typos.
python/lsst/daf/butler/_butler.py
Outdated
    are fully removed from the data repository.
purge : `bool`, optional
    If `True`, permit `~CollectionType.RUN` collections to be removed,
    full removing datasets within them. Requires ``unstore=True`` as
full -> fully?
if unstore:
    for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True):
        if self.datastore.exists(ref):
            self.datastore.trash(ref)
Just a general note for the future: for large collections it may be more efficient to implement a "bulk trash" operation in the datastore.
I completely agree - and unless we change the transaction nesting approach instead, this will be a lot of SAVEPOINTs.
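A hypothetical sketch of what such a vectorized "bulk trash" could look like: move every location row to the trash table in one transaction instead of one `trash(ref)` call (and one SAVEPOINT) per dataset. The table names echo the `dataset_location`/`dataset_location_trash` naming mentioned elsewhere in this PR, but the schema here is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset_location (dataset_id INTEGER PRIMARY KEY);
    CREATE TABLE dataset_location_trash (dataset_id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO dataset_location VALUES (?)", [(1,), (2,), (3,)])


def bulk_trash(conn, dataset_ids):
    """Move many datasets to trash inside a single transaction."""
    params = [(i,) for i in dataset_ids]
    with conn:  # one transaction for the whole batch, not one per ref
        conn.executemany(
            "INSERT INTO dataset_location_trash "
            "SELECT dataset_id FROM dataset_location WHERE dataset_id = ?",
            params)
        conn.executemany(
            "DELETE FROM dataset_location WHERE dataset_id = ?", params)


bulk_trash(conn, [1, 2, 3])
```

The actual datastore interface would of course operate on `DatasetRef`s rather than bare ids; the point is only that the per-dataset savepoint cost disappears when the whole batch shares one transaction.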
python/lsst/daf/butler/_butler.py
Outdated
- with in `put` and `ingest`, and disassociates from in `prune` (`tuple` of
- `str`).
+ with in `put` and `ingest`, and disassociates from in `pruneDatasets`
+ (`tuple` of `str`).
I think this should be `tuple` [`str`], as described in https://developer.lsst.io/python/numpydoc.html#py-docstring-parameter-types-sequences. Probably not worth changing now if it's consistent within this file.
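For reference, the sequence style from the linked dev guide looks like this in a docstring (the function name and parameter here are hypothetical, not butler's actual signature):

```python
def prune_datasets(tags=()):
    """Illustrate numpydoc typing for a sequence parameter.

    Parameters
    ----------
    tags : `tuple` [ `str` ], optional
        Tagged collections to disassociate the datasets from.
    """
    # Return the tags unchanged; only the docstring style matters here.
    return tuple(tags)
```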
-----
If this is a `~CollectionType.RUN` collection, all datasets and quanta
in it are also fully removed.  This requires that those datasets be
removed from any datastores that hold them first.
"removed" as in sent to trash? I think datastore has both `remove` and `trash` methods; butler does `trash` and then `removeCollection`.
Yes, sent to trash is sufficient. Will clarify.
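A minimal sketch of that order of operations, using stand-in classes (not the real butler/registry/datastore API): datasets are sent to the datastore's trash first, then the collection itself is removed from the registry; actual artifact deletion happens later when the trash is emptied.

```python
class FakeDatastore:
    """Stand-in datastore that only records what was trashed."""

    def __init__(self):
        self.trashed = []

    def trash(self, ref):
        self.trashed.append(ref)  # marked for a later emptyTrash()


class FakeRegistry:
    """Stand-in registry mapping collection name -> list of refs."""

    def __init__(self, collections):
        self.collections = dict(collections)

    def queryDatasets(self, collections):
        return list(self.collections.get(collections, []))

    def removeCollection(self, name):
        del self.collections[name]


def prune_collection(registry, datastore, name, unstore=True):
    """Trash the collection's datasets, then drop the collection."""
    if unstore:
        for ref in registry.queryDatasets(collections=name):
            datastore.trash(ref)  # sent to trash, not yet deleted
    registry.removeCollection(name)


store = FakeDatastore()
reg = FakeRegistry({"run1": ["ref1", "ref2"]})
prune_collection(reg, store, "run1")
```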
- nullable=False,
- doc="A link to the associated Dataset.",
+ nullable=True,
+ doc="A link to the associated dataset; null if the dataset has been completed.",
completed?
Should be "deleted"; don't know how that happened.
tests/test_butler.py
Outdated
butler.registry.associate(tag1, [ref3])
# Add a CHAINED collection that searches run1 and then run2.  It
# logically contains only ref1, because ref2 is shadowed due to them
# having the same data ID and dataset tpe.
typo: tpe
tests/test_butler.py
Outdated
    butler.pruneCollection(tag1, purge=True, unstore=True)
with self.assertRaises(TypeError):
    butler.pruneCollection(chain1, purge=True, unstore=True)
# Remove the tagged collection with unstore=False. This should should
typo: should should
A parent dataset can only have one child dataset for a particular component name. It might be a little weird for it to have the same child dataset satisfy multiple components, but that's not actually a problem.
Now that dataset_location_trash can hold dataset IDs that have been deleted from the main table, we need to make sure we don't reuse those deleted ID values. Most database autoincrement behavior never does that, but SQLite's does recycle them unless you tell it not to.
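The SQLite behavior is easy to demonstrate in isolation (this is generic SQLite, not butler's schema code): a plain `INTEGER PRIMARY KEY` may reuse the largest deleted rowid, while the `AUTOINCREMENT` keyword guarantees ids never repeat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Plain INTEGER PRIMARY KEY: the next rowid is max(rowid) + 1, so a
# deleted maximum id can be handed out again.
conn.execute("CREATE TABLE plain (id INTEGER PRIMARY KEY, v)")
# AUTOINCREMENT: SQLite tracks the high-water mark in sqlite_sequence,
# so ids are monotonically increasing and never reused.
conn.execute("CREATE TABLE monotonic (id INTEGER PRIMARY KEY AUTOINCREMENT, v)")

for table in ("plain", "monotonic"):
    conn.execute(f"INSERT INTO {table} (v) VALUES ('a')")  # id 1
    conn.execute(f"INSERT INTO {table} (v) VALUES ('b')")  # id 2
    conn.execute(f"DELETE FROM {table} WHERE id = 2")
    conn.execute(f"INSERT INTO {table} (v) VALUES ('c')")

plain_id = conn.execute("SELECT max(id) FROM plain").fetchone()[0]
mono_id = conn.execute("SELECT max(id) FROM monotonic").fetchone()[0]
print(plain_id, mono_id)  # 2 3: plain reuses id 2; monotonic advances to 3
```

A trash table keyed on dataset id is exactly the situation where that reuse would be dangerous: a recycled id would make a brand-new dataset look like it was already in the trash.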
We believe the implementation is now safe, even in the presence of concurrent deletes.
The len(collections) call that was removed here was a bug: it could be testing the length of a string collection name rather than the actual number of collections. The addition of code to standardize the collections expression at the top of queryDatasets doesn't quite fix that, because what we actually care about is the number of recursively-expanded child collections when CHAINED collections are in play, so the safe thing to do is just to remove that check. But I've also kept the code to standardize the collections anyway - it means we only do that once when queryDatasets self-recurses to handle different dataset types.
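The first half of that bug is easy to show in miniature: `len()` applied to a collections argument counts characters when a bare string slips through instead of a sequence of names.

```python
def count_collections_buggy(collections):
    """Mirror the removed check: len() assumes a sequence of names."""
    return len(collections)


print(count_collections_buggy(["run1", "run2"]))  # 2, as intended
print(count_collections_buggy("run1"))  # 4: characters, not collections
```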
pruneCollection is coming.
We now just set dataset_consumer column values to NULL instead of attempting to delete full quanta when a dataset they use is deleted. This still clearly marks those quanta as incomplete while allowing us to rely on (safer, more declarative, probably more performant) in-database ON DELETE clauses rather than Python logic. There were no tests for how we handle quanta on deletion before, and I'm not adding any now, because my goal is just to simplify as much as possible ahead of the dataset-table refactor of DM-21764. We can improve our test coverage (and functionality) later when we have real code that exercises the database quantum tables.
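A minimal SQLite sketch of the ON DELETE SET NULL approach (the column and table names follow the description above, but the schema is heavily simplified for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset_consumer (
        quantum_id INTEGER,
        dataset_id INTEGER
            REFERENCES dataset (id) ON DELETE SET NULL
    );
""")
conn.execute("INSERT INTO dataset VALUES (1)")
conn.execute("INSERT INTO dataset_consumer VALUES (10, 1)")

# Deleting the dataset nulls the consumer link in-database: the
# quantum row survives, marked incomplete, with no Python logic.
conn.execute("DELETE FROM dataset WHERE id = 1")
row = conn.execute(
    "SELECT quantum_id, dataset_id FROM dataset_consumer").fetchone()
print(row)  # (10, None)
```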
This should save us a lot of SAVEPOINT calls, though it will still leave us with at least one per deleted dataset because the datastore deletion interfaces are not yet vectorized.
2160596 to 8ba5bd1