
DM-24515: improvements to deletion (and fixes for bugs found along the way) #261

Merged
TallJimbo merged 12 commits into master from tickets/DM-24515 on Apr 24, 2020

Conversation

TallJimbo (Member)

No description provided.

@timj (Member) left a comment:

Just one comment (twice) about the chances of emptyTrash raising an exception.

try:
    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:

@timj (Member): emptyTrash is only going to fail if ctrl-C is hit since by default ignore_errors=True.

    # Point of no return for removing artifacts
    self.datastore.emptyTrash()
except BaseException as err:
    raise IOError(

@timj (Member): Again this will only happen if something like ctrl-C occurs.
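
For readers without the surrounding diff, here is a simplified sketch (not the actual Butler code) of the pattern these two comments refer to; the ignore_errors=True default is taken from the comment above, so the except branch is effectively an interrupt handler:

    # Sketch only: registry changes have already been committed by this point,
    # so emptying the trash is the point of no return.
    def _empty_trash_with_context(datastore):
        try:
            # With ignore_errors=True (the stated default), ordinary artifact
            # problems are swallowed; only something like KeyboardInterrupt
            # should propagate out of here.
            datastore.emptyTrash()
        except BaseException as err:
            raise IOError(
                "Artifact removal was interrupted after registry changes were "
                "committed; the datastore may now be inconsistent."
            ) from err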

@andy-slac (Contributor) left a comment:

Looks OK, a few comments on docstrings/comments/typos.

    are fully removed from the data repository.
purge : `bool`, optional
    If `True`, permit `~CollectionType.RUN` collections to be removed,
    full removing datasets within them. Requires ``unstore=True`` as

@andy-slac (Contributor): full -> fully?

if unstore:
    for ref in self.registry.queryDatasets(..., collections=name, deduplicate=True):
        if self.datastore.exists(ref):
            self.datastore.trash(ref)

@andy-slac (Contributor): Just a general note for the future - for large collections it may be more efficient to implement a "bulk trash" operation in the data store.

@TallJimbo (Member Author): I completely agree - and unless we change the transaction nesting approach instead, this will be a lot of SAVEPOINTs.
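
To make the suggestion concrete, here is a purely hypothetical sketch of what a vectorized trash call could look like; neither trash_many nor the transaction() helper used this way is part of the datastore API under review:

    # Hypothetical sketch: trash a whole batch of refs inside one transaction,
    # instead of one SAVEPOINT-wrapped trash() call per dataset.
    def trash_many(datastore, refs):
        present = [ref for ref in refs if datastore.exists(ref)]
        with datastore.transaction():  # assumed context-manager helper
            for ref in present:
                datastore.trash(ref)
        return len(present)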

with in `put` and `ingest`, and disassociates from in `prune` (`tuple` of
`str`).
with in `put` and `ingest`, and disassociates from in `pruneDatasets`
(`tuple` of `str`).

@andy-slac (Contributor): I think this should be:

`tuple` [`str`]

as described in https://developer.lsst.io/python/numpydoc.html#py-docstring-parameter-types-sequences. Probably not worth changing now if it's consistent within this file.
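
For reference, the convention from the linked guide would make the entry read roughly like this (the attribute name `tags` is illustrative only, since the quoted diff does not show it):

    tags : `tuple` [`str`]
        The collections the dataset is associated with in `put` and `ingest`,
        and disassociated from in `pruneDatasets`.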

-----
If this is a `~CollectionType.RUN` collection, all datasets and quanta
in it are also fully removed. This requires that those datasets be
removed from any datastores that hold them first.

@andy-slac (Contributor): "removed" as in sent to trash? I think datastore has both remove and trash methods; butler does trash and then removeCollection.

@TallJimbo (Member Author): Yes, sent to trash is sufficient. Will clarify.

nullable=False,
doc="A link to the associated Dataset.",
nullable=True,
doc="A link to the associated dataset; null if the dataset has been completed.",

@andy-slac (Contributor): completed?

@TallJimbo (Member Author): Should be "deleted"; don't know how that happened.

butler.registry.associate(tag1, [ref3])
# Add a CHAINED collection that searches run1 and then run2. It
# logically contains only ref1, because ref2 is shadowed due to them
# having the same data ID and dataset tpe.

@andy-slac (Contributor): typo: tpe

butler.pruneCollection(tag1, purge=True, unstore=True)
with self.assertRaises(TypeError):
    butler.pruneCollection(chain1, purge=True, unstore=True)
# Remove the tagged collection with unstore=False. This should should

@andy-slac (Contributor): typo: should should

Commit messages

A parent dataset can only have one child dataset for a particular
component name.  It might be a little weird for it to have the same
child dataset satisfy multiple components, but that's not actually a
problem.

Now that dataset_location_trash can hold dataset IDs that have been
deleted from the main table, we need to make sure we don't reuse those
deleted ID values.  Most database autoincrement behavior never does
that, but SQLite's does recycle them unless you tell it not to.
We believe the implementation is now safe, even in the presence of
concurrent deletes.
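
The SQLite behavior referred to here can be avoided with the AUTOINCREMENT keyword; with SQLAlchemy that is the dialect-specific table flag shown in this illustrative snippet (not a quote of the actual butler schema code):

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    dataset = sqlalchemy.Table(
        "dataset",
        metadata,
        sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True, autoincrement=True),
        # Emits AUTOINCREMENT for SQLite so ids of deleted rows are never reused.
        sqlite_autoincrement=True,
    )
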
The len(collections) call that was removed here was a bug: it could
be testing the length of a string collection name rather than the
actual number of collections.  The addition of code to standardize
the collections expression at the top of queryDatasets doesn't quite
fix that, because what we actually care about is the number of
recursively-expanded child collections when CHAINED collections are in
play, so the safe thing to do is just to remove that check.

But I've also kept the code to standardize the collections anyway - it
means we only do that once when queryDatasets self-recurses to handle
different dataset types.
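
A minimal illustration of the failure mode the removed check had:

    # If callers pass a single collection name as a plain string, len() counts
    # characters rather than collections, so the guard tested the wrong thing.
    collections = "calibration/chain"      # one CHAINED collection name
    assert len(collections) == 17          # 17 characters, not 1 collection
    collections = ["run1", "run2"]
    assert len(collections) == 2           # what the check actually meant
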
We now just set dataset_consumer column values to NULL instead of
attempting to delete full quanta when a dataset they use is deleted.
This still clearly marks those quanta as incomplete while allowing us
to rely on (safer, more declarative, probably more performant)
in-database ON DELETE clauses rather than Python logic.

There were no tests for how we handle quanta on deletion before, and
I'm not adding any now, because my goal is just to simplify as much as
possible ahead of the dataset-table refactor of DM-21764.  We can
improve our test coverage (and functionality) later when we have real
code that exercises the database quantum tables.
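
A hedged sketch of the ON DELETE approach described above, using illustrative table and column names rather than the actual butler schema:

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    dataset = sqlalchemy.Table(
        "dataset",
        metadata,
        sqlalchemy.Column("id", sqlalchemy.BigInteger, primary_key=True),
    )
    quantum_inputs = sqlalchemy.Table(
        "quantum_inputs",
        metadata,
        sqlalchemy.Column("quantum_id", sqlalchemy.BigInteger, nullable=False),
        sqlalchemy.Column(
            "dataset_id",
            sqlalchemy.BigInteger,
            # The database clears the link when the dataset row is deleted,
            # so no Python-side cascade logic is needed.
            sqlalchemy.ForeignKey("dataset.id", ondelete="SET NULL"),
            nullable=True,
            doc="A link to the associated dataset; null if the dataset has been deleted.",
        ),
    )
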
This should save us a lot of SAVEPOINT calls, though it will still
leave us with at least one per deleted dataset because the datastore
deletion interfaces are not yet vectorized.
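
For context on the SAVEPOINT cost mentioned here (a generic SQLAlchemy illustration, not butler code): each nested transaction maps to a SAVEPOINT statement, so per-dataset nesting means one SAVEPOINT per deleted dataset.

    import sqlalchemy

    engine = sqlalchemy.create_engine("sqlite://", echo=True)
    with engine.connect() as connection:
        with connection.begin():
            for _ in range(3):
                # Each begin_nested() emits SAVEPOINT ... / RELEASE SAVEPOINT ...
                with connection.begin_nested():
                    pass
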
@TallJimbo TallJimbo merged commit 57de52e into master Apr 24, 2020
@TallJimbo TallJimbo deleted the tickets/DM-24515 branch April 24, 2020 15:02