DM-33569: Use explicit reason for dataset type subsetting failing #234
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   71.53%   71.31%   -0.22%
==========================================
  Files          48       48
  Lines        6168     6191      +23
  Branches     1195     1205      +10
==========================================
+ Hits         4412     4415       +3
- Misses       1538     1558      +20
  Partials      218      218
Continue to review full report at Codecov.
# The dataset type is not found. It may not be listed
# or it may be that it is there with the same name
# but different definition.
for d in combined:
Likely combined is a much longer list than the number of datasets that need special lookup. Consider saving all those in a list called something like missing, and then, after the loop over datasetTypes, if that list is not empty, put an outer loop over combined and see if those match any of the DatasetTypes in missing, such that you only hit combined once. (I'm not sure what sizes we are talking about in general though.)
Yes. I was imagining that, since this logic only triggers when something doesn't match, and something will only fail to match if we are in a transition period, a little slowdown won't really matter (we don't have thousands of dataset types here). There is a possibility that there will be a dataset type that has matched combined, and so a deferred match over only the things that didn't match won't work -- we can probably avoid that by storing the candidate dataset types in a dict keyed by name.
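For concreteness, a minimal sketch of the dict-keyed-by-name idea (this assumes the entries in combined are DatasetType objects with a name attribute; the variable name is illustrative):

# Index every candidate by name once; a deferred pass can then find a
# same-named entry even if that entry already matched a different type.
candidates_by_name = {d.name: d for d in combined}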
I was thinking of something like
missing = []
for datasetType in datasetTypes:
    if datasetType in combined:
        _dict[datasetType] = combined[datasetType]
    else:
        missing.append(datasetType)
if missing:
    for existingType in combined:
        for datasetType in missing:
            ...  # check if they do match; if so, add to _dict
but as you say, I don't know how many DatasetTypes in combined we are talking about here.
possibly even something like
_dict = {datasetType: combined[datasetType] for datasetType in (combined.keys() & datasetTypes)}
# don't do anything if all is well -- less branching logic
if len(_dict) < len(datasetTypes):
    missing = set(datasetTypes) - combined.keys()
    for existingType in (combined.keys() - _dict.keys()):
        for datasetType in missing:
            ...  # check if they do match
I've been pondering this and I'm not convinced that this approach really saves us anything, because you still have to check everything in combined: it's possible that one of the mismatches is also defined the other way. I think what might help is to use a dict that maps name to key in combined, because then we wouldn't need to loop over combined each time -- just get the name match and compare it directly.
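A rough sketch of that lookup (a sketch only: missing and _dict come from the earlier snippets, and is_compatible_with stands in for whatever comparison the real code performs):

by_name = {d.name: d for d in combined}
for datasetType in missing:
    existing = by_name.get(datasetType.name)
    if existing is None:
        raise KeyError(f"Dataset type {datasetType.name} is not known.")
    # Compare the single name match directly instead of scanning combined.
    if datasetType.is_compatible_with(existing):
        _dict[datasetType] = combined[existing]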
I added the "pre-fill the dict with known matches" logic, and a dict lookup rather than a second loop, in the branch that does the compatibility check.
If there is a storage class incompatibility, the input should win because that is what the Task python code is expecting.
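Putting those pieces together, the shape of the final logic might look roughly like this (a sketch under assumptions: combined maps each known DatasetType to its stored form, and is_compatible_with is a stand-in for the actual storage-class compatibility test):

_dict = {}
missing = []
# Fast path: assign everything that matches exactly.
for datasetType in datasetTypes:
    if datasetType in combined:
        _dict[datasetType] = combined[datasetType]
    else:
        missing.append(datasetType)
if missing:
    # Slow path: resolve by name instead of scanning combined per type.
    by_name = {d.name: d for d in combined}
    for datasetType in missing:
        existing = by_name.get(datasetType.name)
        if existing is None:
            # Report the missing dataset type explicitly rather than
            # letting an opaque KeyError escape later.
            raise KeyError(f"Dataset type {datasetType.name} is not known.")
        if datasetType.is_compatible_with(existing):
            # Storage classes differ but are convertible; the input
            # (connection) definition wins, since that is what the
            # Task python code expects.
            _dict[datasetType] = combined[existing]
        else:
            raise ValueError(f"Dataset type {datasetType.name} has an incompatible definition.")

The by-name dict is built once, so combined is only traversed a single time even when several dataset types need the compatibility check.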
Report the missing dataset type rather than a more opaque default KeyError. Also allow the incompatibility to be accepted if the storage classes are compatible. This can happen if a dataset type defined in a connections class, but not yet in the registry, does not match the definition added elsewhere. The example that triggered this was TaskMetadata, which was used in a connection and added by the pipeline infrastructure.
force-pushed from a21551a to f419cb9
Do the fast assignment of what we know is present and only do the slower checks if we are missing some dataset types.
Report the missing dataset type rather than a more opaque default KeyError.
Also allow the incompatibility to be accepted if the storage classes are compatible.
This can happen if a dataset type defined in a connections class, but not yet in the registry, does not match the definition added elsewhere. The example that triggered this was TaskMetadata, which was used in a connection and added by the pipeline infrastructure.
Checklist
doc/changes