New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-21448: Clean up DatasetRef comparisons and immutability #210
Conversation
bb6ff3a
to
4619ccd
Compare
Just added another commit because Jenkins failed, uncovering (in ci_hsc_gen3) some more loophole-based mutating. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good and useful. My main comment is that the role of components seems somewhat under-defined in terms of whether they are meant to be unresolved if the parent is unresolved and whether it's allowed to have components listed in the DatasetRef that are not present in the StorageClass.
return hash(self.datasetType, self.dataId, self.id) | ||
|
||
def __repr__(self): | ||
return f"DatasetRef({self.datasetType}, {self.dataId}, id={self.id}, run={self.run})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just maintaining the old behavior here, which was to only include them in __str__
for for brevity. Happy to go either way if you have a preference; I don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some sense repr
is more relevant for reporting them. I did try to compare th new version with the original to see what was previously there. If we've survived without components turning up in repr
this long it's probably fine to leave them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rewritten them to make repr
the long one and str
even more concise. I think exec(repr(datasetRef) == datasetRef
would now be true if StorageClass
similarly round-tripped, but I think that's out of scope for this ticket.
run : `Run` | ||
The run this dataset was associated with when it was created. | ||
components : `dict`, optional | ||
A dictionary mapping component name to a `DatasetRef` for that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also mention that the components supplied here overwrite any existing components but also merged with the existing ones. If this is a resolved datasetRef should we check that all the components have resolved dataset refs? Should we check they have the same run? What if there are some unresolved components and this call only overwrites some of them with resolved versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm pretty sure we have concrete uses for:
- resolved
DatasetRefs
with components - resolved
DatasetRefs
without components - unresolved
DatasetRefs
without components
where by "without components" I mean that the DatasetRef
doesn't know about any components it might have, even if they exist, because it's just a wasteful to do anything with them because we know they won't be used.
The dataset-insertion interplay between Registry
and Datastore
is also a lot simpler if we allow the components to be updated in-place in a DatasetRef
(and I think that's okay w.r.t. immutability, since components aren't used in __eq__
/__hash__
).
Anyhow, with all that in mind, how about this:
- We set
components
toNone
on an unresolvedDatasetRef
. We raise in__init__
if components are passed withoutid
andrun
, and merge inresolved()
only if re-resolving an already resolvedDatasetRef
. - We check that all component
DatasetRefs
are resolved, and that names are valid according to the storage class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
@@ -196,38 +203,37 @@ class DatasetRefTestCase(unittest.TestCase): | |||
|
|||
def setUp(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any tests of components in DatasetRef?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some indirect ones over in test_sqlRegistry.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an explicit test here? Especially given the changes to components handling discussed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and thanks for pushing for this; at first I'd thought the tests even caught a bug in the new code, but instead they caught a bug in the new test code (trying to pass a DatasetType
instead of a DatasetRef
as a component) that was worth adding a special check for to yield a better error message.
c48e0ea
to
0815400
Compare
@timj , I've acted on all of your comments, but I'm not sure I've actually resolved all of your concerns. Any further thoughts? |
@TallJimbo sorry for the delay. I've made some minor follow up comments but I mostly agree with your plan. |
0815400
to
cdb187e
Compare
Using run and hash in comparisons was redundant, and using components was unwise, as they weren't always populated. The new equality definition makes DatasetRefs equal if they either both have no .id or if they have the same one (and always the same dataset type and data ID).
This is the first step toward moving to only two allowable states for DatasetRef: either it knows what the Registry knows, and is tied to a particular run and ID, or it only has a dataset type and data ID.
This makes the comparison-key attributes of DatasetRef into regular attributes instead of properties, while blocking the "ref._id = x" loophole we previously exploited to modify DatasetRefs in place. That makes it safe to define __hash__, and so now we do.
We need this to fix previous loophole usage in queryDatasets.
We now require that 'id' be present in order for 'run' or 'components' to be.
__repr__ is now more complete and handles the difference in content between resolved and unresolved refs. __str__ is much more concise.
cdb187e
to
e792d73
Compare
No description provided.