DM-21448: Clean up DatasetRef comparisons and immutability #210

TallJimbo · 2019-12-05T16:31:19Z

No description provided.

TallJimbo · 2019-12-05T19:31:20Z

Just added another commit because Jenkins failed, uncovering (in ci_hsc_gen3) some more loophole-based mutating.

timj

This looks good and useful. My main comment is that the role of components seems somewhat under-defined in terms of whether they are meant to be unresolved if the parent is unresolved and whether it's allowed to have components listed in the DatasetRef that are not present in the StorageClass.

timj · 2019-12-05T17:49:24Z

python/lsst/daf/butler/core/datasets/ref.py

+        return hash((self.datasetType, self.dataId, self.id))
+
+    def __repr__(self):
+        return f"DatasetRef({self.datasetType}, {self.dataId}, id={self.id}, run={self.run})"


What about components?

I'm just maintaining the old behavior here, which was to only include them in __str__ for brevity. Happy to go either way if you have a preference; I don't.

In some sense repr is more relevant for reporting them. I did try to compare the new version with the original to see what was previously there. If we've survived without components turning up in repr this long it's probably fine to leave them out.

I've rewritten them to make repr the long one and str even more concise. I think eval(repr(datasetRef)) == datasetRef would now be true if StorageClass similarly round-tripped, but I think that's out of scope for this ticket.
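To make the round-trip idea above concrete, here is a minimal sketch. `SketchRef` is a hypothetical stand-in, not the real `DatasetRef`: it uses a plain string for the dataset type and a `dict` for the data ID, which sidesteps the StorageClass round-trip question raised in the comment.

```python
class SketchRef:
    """Hypothetical stand-in for DatasetRef; not the daf_butler class."""

    def __init__(self, datasetType, dataId, id=None, run=None):
        self.datasetType = datasetType
        self.dataId = dataId
        self.id = id
        self.run = run

    def __eq__(self, other):
        if not isinstance(other, SketchRef):
            return NotImplemented
        return ((self.datasetType, self.dataId, self.id)
                == (other.datasetType, other.dataId, other.id))

    def __repr__(self):
        # Long form: valid constructor syntax; unresolved refs omit id and run.
        if self.id is None:
            return f"SketchRef({self.datasetType!r}, {self.dataId!r})"
        return (f"SketchRef({self.datasetType!r}, {self.dataId!r}, "
                f"id={self.id!r}, run={self.run!r})")

    def __str__(self):
        # Short form intended for log messages.
        return f"{self.datasetType}@{self.dataId}"


ref = SketchRef("calexp", {"visit": 42}, id=7, run="ingest")
# Evaluate the repr in a namespace that only knows about SketchRef.
round_tripped = eval(repr(ref), {"SketchRef": SketchRef})
```

With this layout `round_tripped == ref` holds because every field's own repr is itself evaluable; the real class would need the same property from its dataset type and data ID objects.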

timj · 2019-12-06T20:19:08Z

python/lsst/daf/butler/core/datasets/ref.py

+        run : `Run`
+            The run this dataset was associated with when it was created.
+        components : `dict`, optional
+            A dictionary mapping component name to a `DatasetRef` for that


I think this should also mention that the components supplied here overwrite any existing components of the same name and are otherwise merged with the existing ones. If this is a resolved DatasetRef, should we check that all the components are resolved DatasetRefs? Should we check they have the same run? What if there are some unresolved components and this call only overwrites some of them with resolved versions?

So, I'm pretty sure we have concrete uses for:

resolved DatasetRefs with components

resolved DatasetRefs without components

unresolved DatasetRefs without components

where by "without components" I mean that the DatasetRef doesn't know about any components it might have, even if they exist, because it's just wasteful to do anything with them when we know they won't be used.

The dataset-insertion interplay between Registry and Datastore is also a lot simpler if we allow the components to be updated in-place in a DatasetRef (and I think that's okay w.r.t. immutability, since components aren't used in __eq__/__hash__).

Anyhow, with all that in mind, how about this:

We set components to None on an unresolved DatasetRef. We raise in __init__ if components are passed without id and run, and merge in resolved() only if re-resolving an already resolved DatasetRef.

We check that all component DatasetRefs are resolved, and that names are valid according to the storage class.

Sounds good to me.
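The rules agreed in this thread could be sketched roughly as follows. This is an illustration only: `SketchRef`, `ALLOWED_COMPONENTS`, and the exact exception types are assumptions, and the real class validates names against its StorageClass rather than a fixed set.

```python
from typing import Dict, Optional

# Hypothetical component names; the real check would consult the StorageClass.
ALLOWED_COMPONENTS = frozenset({"image", "mask", "variance"})


class SketchRef:
    """Hypothetical stand-in for DatasetRef; not the daf_butler class."""

    def __init__(self, datasetType: str, dataId: dict, *,
                 id: Optional[int] = None, run: Optional[str] = None,
                 components: Optional[Dict[str, "SketchRef"]] = None):
        self.datasetType = datasetType
        self.dataId = dataId
        self.id = id
        self.run = run
        if id is None:
            # Unresolved: run and components are not allowed at all.
            if run is not None or components is not None:
                raise ValueError("run and components require an id")
            self.components = None
            return
        if run is None:
            # Assumption: a resolved ref always carries both id and run.
            raise ValueError("a resolved ref needs a run as well as an id")
        checked = {}
        for name, child in (components or {}).items():
            if name not in ALLOWED_COMPONENTS:
                raise ValueError(f"unknown component {name!r}")
            if child.id is None:
                raise ValueError(f"component {name!r} is unresolved")
            checked[name] = child
        self.components = checked

    def resolved(self, id: int, run: str,
                 components: Optional[Dict[str, "SketchRef"]] = None) -> "SketchRef":
        # self.components is None when unresolved, so merging only has an
        # effect when re-resolving an already resolved ref.
        merged = dict(self.components or {})
        merged.update(components or {})
        return SketchRef(self.datasetType, self.dataId, id=id, run=run,
                         components=merged)

    def unresolved(self) -> "SketchRef":
        # Drop id, run, and components in one step.
        return SketchRef(self.datasetType, self.dataId)
```

Returning new objects from `resolved()` and `unresolved()` keeps the two allowed states explicit; the in-place component update mentioned earlier in the thread is deliberately left out of this sketch.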

timj · 2019-12-06T20:26:45Z

tests/test_datasets.py

@@ -196,38 +203,37 @@ class DatasetRefTestCase(unittest.TestCase):

    def setUp(self):


Are there any tests of components in DatasetRef?

There are some indirect ones over in test_sqlRegistry.py.

Can we have an explicit test here? Especially given the changes to components handling discussed above.

Done, and thanks for pushing for this; at first I'd thought the tests even caught a bug in the new code, but instead they caught a bug in the new test code (trying to pass a DatasetType instead of a DatasetRef as a component) that was worth adding a special check for to yield a better error message.
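A rough sketch of the kind of special check and explicit test described here. `SketchType`, `SketchRef`, and `check_component` are hypothetical names used only for illustration; they are not the code added in this PR.

```python
import unittest


class SketchType:
    """Hypothetical stand-in for DatasetType."""


class SketchRef:
    """Hypothetical stand-in for DatasetRef."""

    def __init__(self, id=None):
        self.id = id


def check_component(name, child):
    # Catch the wrong-type mistake first so the message names the real problem
    # instead of failing later with an unrelated AttributeError.
    if not isinstance(child, SketchRef):
        raise TypeError(
            f"component {name!r} must be a SketchRef, got {type(child).__name__}"
        )
    if child.id is None:
        raise ValueError(f"component {name!r} is unresolved")


class ComponentCheckTestCase(unittest.TestCase):

    def testWrongType(self):
        with self.assertRaises(TypeError):
            check_component("image", SketchType())

    def testUnresolved(self):
        with self.assertRaises(ValueError):
            check_component("image", SketchRef())

    def testResolved(self):
        # A resolved component passes without raising.
        check_component("image", SketchRef(id=1))
```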

TallJimbo · 2019-12-09T22:04:03Z

@timj, I've acted on all of your comments, but I'm not sure I've actually resolved all of your concerns. Any further thoughts?

timj · 2019-12-09T23:15:54Z

@TallJimbo sorry for the delay. I've made some minor follow-up comments but I mostly agree with your plan.

Using run and hash in comparisons was redundant, and using components was unwise, as they weren't always populated. The new equality definition makes DatasetRefs equal if they either both have no .id or if they have the same one (and always the same dataset type and data ID).
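The equality rule described in this commit message might look like the sketch below. `SketchRef` is a hypothetical stand-in with a string dataset type and a `dict` data ID; only the choice of compared fields is taken from the text above.

```python
class SketchRef:
    """Hypothetical stand-in for DatasetRef; not the daf_butler class."""

    def __init__(self, datasetType, dataId, id=None, run=None):
        self.datasetType = datasetType
        self.dataId = dataId
        self.id = id
        self.run = run

    def __eq__(self, other):
        if not isinstance(other, SketchRef):
            return NotImplemented
        # run and components are deliberately ignored. Comparing id directly
        # covers both cases: two unresolved refs (None == None) and two
        # resolved refs with the same id.
        return ((self.datasetType, self.dataId, self.id)
                == (other.datasetType, other.dataId, other.id))


same_id_a = SketchRef("calexp", {"visit": 1}, id=5, run="a")
same_id_b = SketchRef("calexp", {"visit": 1}, id=5, run="b")
unresolved = SketchRef("calexp", {"visit": 1})
```

Under this rule `same_id_a == same_id_b` even though the runs differ, while `same_id_a != unresolved` because only one of them has an id.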

This makes the comparison-key attributes of DatasetRef into regular attributes instead of properties, while blocking the "ref._id = x" loophole we previously exploited to modify DatasetRefs in place. That makes it safe to define __hash__, and so now we do.
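One way to get plain attributes that still cannot be reassigned is to override `__setattr__`, as in the sketch below. This is an assumption about the general technique, not a copy of the PR's implementation; `SketchRef` and its tuple-of-pairs data ID are hypothetical and chosen so the data ID is hashable.

```python
class SketchRef:
    """Hypothetical stand-in for DatasetRef; not the daf_butler class."""

    __slots__ = ("datasetType", "dataId", "id")

    def __init__(self, datasetType, dataId, id=None):
        # Bypass the guard below during construction only.
        object.__setattr__(self, "datasetType", datasetType)
        object.__setattr__(self, "dataId", dataId)
        object.__setattr__(self, "id", id)

    def __setattr__(self, name, value):
        # Blocks both "ref.id = x" and the old "ref._id = x" loophole.
        raise AttributeError(f"SketchRef is immutable; cannot set {name!r}")

    def __delattr__(self, name):
        raise AttributeError(f"SketchRef is immutable; cannot delete {name!r}")

    def __eq__(self, other):
        if not isinstance(other, SketchRef):
            return NotImplemented
        return ((self.datasetType, self.dataId, self.id)
                == (other.datasetType, other.dataId, other.id))

    def __hash__(self):
        # Safe because the fields used by __eq__ can no longer change.
        # Note the single tuple argument: hash() accepts exactly one object.
        return hash((self.datasetType, self.dataId, self.id))


ref = SketchRef("calexp", (("visit", 1),), id=5)
twin = SketchRef("calexp", (("visit", 1),), id=5)
```

Because equal refs now hash equally and cannot drift after insertion, they can be used as `dict` keys or `set` members; for example `{ref, twin}` contains a single element.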

TallJimbo force-pushed the tickets/DM-21448 branch from bb6ff3a to 4619ccd Compare December 5, 2019 16:32

timj approved these changes Dec 6, 2019

TallJimbo force-pushed the tickets/DM-21448 branch from c48e0ea to 0815400 Compare December 9, 2019 22:02

TallJimbo force-pushed the tickets/DM-21448 branch from 0815400 to cdb187e Compare December 10, 2019 15:14

TallJimbo added 12 commits December 10, 2019 17:04

Move DatasetType and DatasetRef into subpackage, separate modules.

59b3083

Remove unused quantum-typed attributes from DatasetRef.

9647e5b

Add missing parameter docs to DatasetRef.

be39541

Add type annotations to DatasetRef.

0409f17

Replace DatasetRef.detach() with resolved() and unresolved().

82b2ec9

This is the first step toward moving to only two allowable states for DatasetRef: either it knows what the Registry knows, and is tied to a particular run and ID, or it only has a dataset type and data ID.

Add method to replace DatasetRef data ID with expanded one.

4ed4e0b

We need this to fix previous loophole usage in queryDatasets.

Add consistency to DatasetRef attribute presence or absence.

d815bee

We now require that 'id' be present in order for 'run' or 'components' to be.

Rewrite __str__ and __repr__ for DatasetRef.

d020192

__repr__ is now more complete and handles the difference in content between resolved and unresolved refs. __str__ is much more concise.

Add method to create DatasetType for component.

e203fa2

Improve DatasetRef tests.

e792d73

TallJimbo force-pushed the tickets/DM-21448 branch from cdb187e to e792d73 Compare December 10, 2019 22:10

TallJimbo merged commit 74a011d into master Dec 10, 2019

TallJimbo deleted the tickets/DM-21448 branch December 10, 2019 22:15
