Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-21448: Clean up DatasetRef comparisons and immutability #210

Merged
merged 12 commits into from Dec 10, 2019

Conversation

TallJimbo
Copy link
Member

No description provided.

@TallJimbo
Copy link
Member Author

Just added another commit because Jenkins failed, uncovering (in ci_hsc_gen3) some more loophole-based mutating.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and useful. My main comment is that the role of components seems somewhat under-defined in terms of whether they are meant to be unresolved if the parent is unresolved and whether it's allowed to have components listed in the DatasetRef that are not present in the StorageClass.

python/lsst/daf/butler/core/datasets/ref.py Show resolved Hide resolved
python/lsst/daf/butler/core/datasets/ref.py Outdated Show resolved Hide resolved
return hash(self.datasetType, self.dataId, self.id)

def __repr__(self):
return f"DatasetRef({self.datasetType}, {self.dataId}, id={self.id}, run={self.run})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about components?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just maintaining the old behavior here, which was to only include them in __str__ for for brevity. Happy to go either way if you have a preference; I don't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some sense repr is more relevant for reporting them. I did try to compare th new version with the original to see what was previously there. If we've survived without components turning up in repr this long it's probably fine to leave them out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten them to make repr the long one and str even more concise. I think exec(repr(datasetRef) == datasetRef would now be true if StorageClass similarly round-tripped, but I think that's out of scope for this ticket.

run : `Run`
The run this dataset was associated with when it was created.
components : `dict`, optional
A dictionary mapping component name to a `DatasetRef` for that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also mention that the components supplied here overwrite any existing components but also merged with the existing ones. If this is a resolved datasetRef should we check that all the components have resolved dataset refs? Should we check they have the same run? What if there are some unresolved components and this call only overwrites some of them with resolved versions?

Copy link
Member Author

@TallJimbo TallJimbo Dec 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm pretty sure we have concrete uses for:

  • resolved DatasetRefs with components
  • resolved DatasetRefs without components
  • unresolved DatasetRefs without components

where by "without components" I mean that the DatasetRef doesn't know about any components it might have, even if they exist, because it's just a wasteful to do anything with them because we know they won't be used.

The dataset-insertion interplay between Registry and Datastore is also a lot simpler if we allow the components to be updated in-place in a DatasetRef (and I think that's okay w.r.t. immutability, since components aren't used in __eq__/__hash__).

Anyhow, with all that in mind, how about this:

  • We set components to None on an unresolved DatasetRef. We raise in __init__ if components are passed without id and run, and merge in resolved() only if re-resolving an already resolved DatasetRef.
  • We check that all component DatasetRefs are resolved, and that names are valid according to the storage class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@@ -196,38 +203,37 @@ class DatasetRefTestCase(unittest.TestCase):

def setUp(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any tests of components in DatasetRef?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some indirect ones over in test_sqlRegistry.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an explicit test here? Especially given the changes to components handling discussed above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and thanks for pushing for this; at first I'd thought the tests even caught a bug in the new code, but instead they caught a bug in the new test code (trying to pass a DatasetType instead of a DatasetRef as a component) that was worth adding a special check for to yield a better error message.

python/lsst/daf/butler/core/datasets/ref.py Show resolved Hide resolved
python/lsst/daf/butler/core/datasets/ref.py Show resolved Hide resolved
@TallJimbo
Copy link
Member Author

@timj , I've acted on all of your comments, but I'm not sure I've actually resolved all of your concerns. Any further thoughts?

@timj
Copy link
Member

timj commented Dec 9, 2019

@TallJimbo sorry for the delay. I've made some minor follow up comments but I mostly agree with your plan.

Using run and hash in comparisons was redundant, and using components
was unwise, as they weren't always populated.

The new equality definition makes DatasetRefs equal if they either both
have no .id or if they have the same one (and always the same
dataset type and data ID).
This is the first step toward moving to only two allowable states
for DatasetRef: either it knows what the Registry knows, and is tied
to a particular run and ID, or it only has a dataset type and data ID.
This makes the comparison-key attributes of DatasetRef into regular
attributes instead of properties, while blocking the "ref._id = x"
loophole we previously exploited to modify DatasetRefs in place.

That makes it safe to define __hash__, and so now we do.
We need this to fix previous loophole usage in queryDatasets.
We now require that 'id' be present in order for 'run' or 'components'
to be.
__repr__ is now more complete and handles the difference in content
between resolved and unresolved refs.  __str__ is much more concise.
@TallJimbo TallJimbo merged commit 74a011d into master Dec 10, 2019
@TallJimbo TallJimbo deleted the tickets/DM-21448 branch December 10, 2019 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants