
DM-24515: improvements to deletion (and fixes for bugs found along the way) #261

Merged
TallJimbo merged 12 commits into master from tickets/DM-24515 on Apr 24, 2020

Conversation

TallJimbo (Member)

No description provided.

@timj (Member) left a comment:

Just one comment (twice) about the chances of emptyTrash raising an exception.

try:
    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:

@timj (Member): emptyTrash is only going to fail if ctrl-C is hit since by default ignore_errors=True.

    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:
    raise IOError(

@timj (Member): Again this will only happen if something like ctrl-C occurs.
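
For readers without the surrounding diff, here is a simplified sketch (not the actual Butler code) of the pattern these two comments refer to; the ignore_errors=True default is taken from the comment above, so the except branch is effectively an interrupt handler:

    # Sketch only: registry changes have already been committed by this point,
    # so emptying the trash is the point of no return.
    def _empty_trash_with_context(datastore):
        try:
            # With ignore_errors=True (the stated default), ordinary artifact
            # problems are swallowed; only something like KeyboardInterrupt
            # should propagate out of here.
            datastore.emptyTrash()
        except BaseException as err:
            raise IOError(
                "Artifact removal was interrupted after registry changes were "
                "committed; the datastore may now be inconsistent."
            ) from err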

@andy-slac (Contributor) left a comment:

Looks OK, a few comments on docstrings/comments/typos.

    are fully removed from the data repository.
purge : `bool`, optional
    If `True`, permit `~CollectionType.RUN` collections to be removed,
    full removing datasets within them. Requires ``unstore=True`` as

@andy-slac (Contributor): full -> fully?

if unstore:
    for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True):
        if self.datastore.exists(ref):
            self.datastore.trash(ref)

@andy-slac (Contributor): Just a general note for the future - for large collections it may be more efficient to implement a "bulk trash" operation in the data store.

@TallJimbo (Member Author): I completely agree - and unless we change the transaction nesting approach instead, this will be a lot of SAVEPOINTs.
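
To make the suggestion concrete, here is a purely hypothetical sketch of what a vectorized trash call could look like; neither trash_many nor the transaction() helper used this way is part of the datastore API under review:

    # Hypothetical sketch: trash a whole batch of refs inside one transaction,
    # instead of one SAVEPOINT-wrapped trash() call per dataset.
    def trash_many(datastore, refs):
        present = [ref for ref in refs if datastore.exists(ref)]
        with datastore.transaction():  # assumed context-manager helper
            for ref in present:
                datastore.trash(ref)
        return len(present)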

with in `put` and `ingest`, and disassociates from in `prune` (`tuple` of
`str`).
with in `put` and `ingest`, and disassociates from in `pruneDatasets`
(`tuple` of `str`).

@andy-slac (Contributor): I think this should be:

`tuple` [`str`]

as described in https://developer.lsst.io/python/numpydoc.html#py-docstring-parameter-types-sequences. Probably not worth changing now if it's consistent within this file.
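
For reference, the convention from the linked guide would make the entry read roughly like this (the attribute name `tags` is illustrative only, since the quoted diff does not show it):

    tags : `tuple` [`str`]
        The collections the dataset is associated with in `put` and `ingest`,
        and disassociated from in `pruneDatasets`.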

-----
If this is a `~CollectionType.RUN` collection, all datasets and quanta
in it are also fully removed. This requires that those datasets be
removed from any datastores that hold them first.

@andy-slac (Contributor): "removed" as in sent to trash? I think datastore has both remove and trash methods; butler does trash and then removeCollection.

@TallJimbo (Member Author): Yes, sent to trash is sufficient. Will clarify.

nullable=False,
doc="A link to the associated Dataset.",
nullable=True,
doc="A link to the associated dataset; null if the dataset has been completed.",

@andy-slac (Contributor): completed?

@TallJimbo (Member Author): Should be "deleted"; don't know how that happened.

butler.registry.associate(tag1, [ref3])
# Add a CHAINED collection that searches run1 and then run2. It
# logically contains only ref1, because ref2 is shadowed due to them
# having the same data ID and dataset tpe.

@andy-slac (Contributor): typo: tpe

butler.pruneCollection(tag1, purge=True, unstore=True)
with self.assertRaises(TypeError):
    butler.pruneCollection(chain1, purge=True, unstore=True)
# Remove the tagged collection with unstore=False. This should should

@andy-slac (Contributor): typo: should should

Commit messages

A parent dataset can only have one child dataset for a particular
component name.  It might be a little weird for it to have the same
child dataset satisfy multiple components, but that's not actually a
problem.

Now that dataset_location_trash can hold dataset IDs that have been
deleted from the main table, we need to make sure we don't reuse those
deleted ID values.  Most database autoincrement behavior never does
that, but SQLite's does recycle them unless you tell it not to.
We believe the implementation is now safe, even in the presence of
concurrent deletes.
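
The SQLite behavior referred to here can be avoided with the AUTOINCREMENT keyword; with SQLAlchemy that is the dialect-specific table flag shown in this illustrative snippet (not a quote of the actual butler schema code):

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    dataset = sqlalchemy.Table(
        "dataset",
        metadata,
        sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True, autoincrement=True),
        # Emits AUTOINCREMENT for SQLite so ids of deleted rows are never reused.
        sqlite_autoincrement=True,
    )
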
The len(collections) call that was removed here was a bug: it could
be testing the length of a string collection name rather than the
actual number of collections.  The addition of code to standardize
the collections expression at the top of queryDatasets doesn't quite
fix that, because what we actually care about is the number of
recursively-expanded child collections when CHAINED collections are in
play, so the safe thing to do is just to remove that check.

But I've also kept the code to standardize the collections anyway - it
means we only do that once when queryDatasets self-recurses to handle
different dataset types.
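
A minimal illustration of the failure mode the removed check had:

    # If callers pass a single collection name as a plain string, len() counts
    # characters rather than collections, so the guard tested the wrong thing.
    collections = "calibration/chain"      # one CHAINED collection name
    assert len(collections) == 17          # 17 characters, not 1 collection
    collections = ["run1", "run2"]
    assert len(collections) == 2           # what the check actually meant
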
We now just set dataset_consumer column values to NULL instead of
attempting to delete full quanta when a dataset they use is deleted.
This still clearly marks those quanta as incomplete while allowing us
to rely on (safer, more declarative, probably more performant)
in-database ON DELETE clauses rather than Python logic.

There were no tests for how we handle quanta on deletion before, and
I'm not adding any now, because my goal is just to simplify as much as
possible ahead of the dataset-table refactor of DM-21764.  We can
improve our test coverage (and functionality) later when we have real
code that exercises the database quantum tables.
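
A hedged sketch of the ON DELETE approach described above, using illustrative table and column names rather than the actual butler schema:

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    dataset = sqlalchemy.Table(
        "dataset",
        metadata,
        sqlalchemy.Column("id", sqlalchemy.BigInteger, primary_key=True),
    )
    quantum_inputs = sqlalchemy.Table(
        "quantum_inputs",
        metadata,
        sqlalchemy.Column("quantum_id", sqlalchemy.BigInteger, nullable=False),
        sqlalchemy.Column(
            "dataset_id",
            sqlalchemy.BigInteger,
            # The database clears the link when the dataset row is deleted,
            # so no Python-side cascade logic is needed.
            sqlalchemy.ForeignKey("dataset.id", ondelete="SET NULL"),
            nullable=True,
            doc="A link to the associated dataset; null if the dataset has been deleted.",
        ),
    )
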
This should save us a lot of SAVEPOINT calls, though it will still
leave us with at least one per deleted dataset because the datastore
deletion interfaces are not yet vectorized.
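
For context on the SAVEPOINT cost mentioned here (a generic SQLAlchemy illustration, not butler code): each nested transaction maps to a SAVEPOINT statement, so per-dataset nesting means one SAVEPOINT per deleted dataset.

    import sqlalchemy

    engine = sqlalchemy.create_engine("sqlite://", echo=True)
    with engine.connect() as connection:
        with connection.begin():
            for _ in range(3):
                # Each begin_nested() emits SAVEPOINT ... / RELEASE SAVEPOINT ...
                with connection.begin_nested():
                    pass
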
@TallJimbo TallJimbo merged commit 57de52e into master Apr 24, 2020
@TallJimbo TallJimbo deleted the tickets/DM-24515 branch April 24, 2020 15:02