
DM-33569: Use explicit reason for dataset type subsetting failing #234

Merged
merged 3 commits into main on Feb 10, 2022

Conversation

timj (Member) commented Feb 4, 2022

Report the missing dataset type rather than a more
opaque default KeyError.

Also allow the incompatibility to be accepted if the storage
classes are compatible.

This can happen when a dataset type defined in a
connections class, but not yet in the registry, does not match
the definition added elsewhere. The example that triggered this
was TaskMetadata, which was used in a connection and added
by the pipeline infrastructure.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
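The behaviour described above can be sketched in a minimal, self-contained form. This is NOT the actual graphBuilder code: `FakeDatasetType`, `find_dataset_type`, and the compatibility rule are stand-ins invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDatasetType:
    """Hypothetical stand-in for a dataset type: a name plus a storage class."""

    name: str
    storageClass: str

    def is_compatible_with(self, other: "FakeDatasetType") -> bool:
        # Stand-in rule: same name, and either identical storage classes
        # or a pair we pretend is convertible (e.g. TaskMetadata written
        # where a PropertySet definition already exists).
        convertible = {"TaskMetadata", "PropertySet"}
        return self.name == other.name and (
            self.storageClass == other.storageClass
            or {self.storageClass, other.storageClass} <= convertible
        )


def find_dataset_type(wanted: FakeDatasetType, registry_types: dict):
    """Look up ``wanted``, reporting an explicit reason on failure
    instead of letting a bare KeyError escape."""
    if wanted in registry_types:
        return registry_types[wanted]
    for existing in registry_types:
        if existing.name == wanted.name:
            if existing.is_compatible_with(wanted):
                # Same name, compatible storage class: accept it.
                return registry_types[existing]
            raise LookupError(
                f"Dataset type {wanted.name!r} exists but with an "
                f"incompatible storage class ({existing.storageClass!r} "
                f"vs {wanted.storageClass!r})."
            )
    raise LookupError(f"Dataset type {wanted.name!r} is not known.")
```

The key point is the two distinct failure messages: "not known" versus "incompatible storage class", rather than one opaque KeyError for both.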

codecov bot commented Feb 4, 2022

Codecov Report

Merging #234 (ee3fb58) into main (2ab1b5b) will decrease coverage by 0.21%.
The diff coverage is 12.50%.


@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   71.53%   71.31%   -0.22%     
==========================================
  Files          48       48              
  Lines        6168     6191      +23     
  Branches     1195     1205      +10     
==========================================
+ Hits         4412     4415       +3     
- Misses       1538     1558      +20     
  Partials      218      218              
Impacted Files Coverage Δ
python/lsst/pipe/base/pipeline.py 63.70% <ø> (ø)
python/lsst/pipe/base/graphBuilder.py 64.73% <12.50%> (-4.00%) ⬇️
...thon/lsst/pipe/base/graph/_versionDeserializers.py 61.47% <0.00%> (+0.43%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Review thread on python/lsst/pipe/base/graphBuilder.py (resolved):
# The dataset type is not found. It may not be listed,
# or it may be present with the same name but a
# different definition.
for d in combined:
Contributor commented:
likely combined is a much longer list than the number of datasets that need the special lookup. Consider saving those in a list called something like missing, and then, after the loop over datasetTypes, if that list is not empty, run an outer loop over combined and see if any of its entries match the DatasetTypes in missing, so that you only iterate over combined once. (I'm not sure what sizes we are talking about in general, though.)

timj (Member, Author) commented Feb 7, 2022:
Yes. I was imagining that since this logic only triggers when something doesn't match, and something will only fail to match while we are in a transition period, a little slowdown won't really matter (we don't have thousands of dataset types here). There is a possibility that a dataset type has already matched in combined, so a deferred match over only the things that didn't match won't work -- we can probably avoid that by storing the candidate dataset types in a dict keyed by name.

Contributor commented:
I was thinking of something like

missing = []
for datasetType in datasetTypes:
    if datasetType in combined:
        _dict[datasetType] = combined[datasetType]
    else:
        missing.append(datasetType)
if missing:
    for existingType in combined:
        for datasetType in missing:
            # check if they do match, if so add to _dict

but as you say, I don't know how many DatasetTypes in combined we are talking about here.

Contributor commented:
possibly even something like

_dict = {datasetType: combined[datasetType] for datasetType in (combined.keys() & datasetTypes)}

# don't do anything if all is well. Less branching logic
if len(_dict) < len(datasetTypes):
    missing = set(datasetTypes) - combined.keys()
    for existingType in (combined.keys() - _dict.keys()):
        for datasetType in missing:
            # check if they do match

timj (Member, Author) commented:
I've been pondering this, and I'm not convinced that this approach really saves us anything, because you still have to check everything in combined: it's possible that one of the mismatches is also defined the other way. I think what might help is a dict that maps name to key in combined, because then we wouldn't need to loop over combined each time -- we could just get the name match and compare it directly.

timj (Member, Author) commented:
I added the "pre-fill the dict with known matches" logic, and a dict lookup rather than a second loop in the branch that does the compatibility check.
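The final approach -- pre-fill the dict with exact matches, then resolve the stragglers through a single name-keyed index instead of re-scanning combined -- can be sketched as follows. The names (`subset_types`, `combined`, the `compatible` callback, the namedtuple `DatasetType`) are illustrative stand-ins, not the actual graphBuilder code.

```python
from collections import namedtuple

# Hypothetical stand-in for a real dataset type: a name plus a storage class.
DatasetType = namedtuple("DatasetType", ["name", "storageClass"])


def subset_types(datasetTypes, combined, compatible):
    """Subset ``combined`` to ``datasetTypes``.

    ``compatible(existing, wanted)`` is a caller-supplied predicate standing
    in for the real storage-class compatibility check.
    """
    _dict = {}
    missing = []
    # Fast path: assign everything that already matches exactly.
    for datasetType in datasetTypes:
        if datasetType in combined:
            _dict[datasetType] = combined[datasetType]
        else:
            missing.append(datasetType)
    if missing:
        # One pass to index by name; each miss then costs a dict lookup
        # rather than another loop over combined.
        by_name = {d.name: d for d in combined}
        for datasetType in missing:
            existing = by_name.get(datasetType.name)
            if existing is not None and compatible(existing, datasetType):
                # Same name, compatible storage class: accept the
                # existing definition.
                _dict[existing] = combined[existing]
            else:
                raise LookupError(
                    f"Dataset type {datasetType.name!r} could not be "
                    f"found or made compatible."
                )
    return _dict
```

When every requested type matches exactly, the slow branch never runs, which is the "less branching in the common case" property discussed above.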

Commit messages:

  • If there is a storage class incompatibility, the input should win, because that is what the Task python code is expecting.
  • Report the missing dataset type rather than a more opaque default KeyError. Also allow the incompatibility to be accepted if the storage classes are compatible.
  • Do the fast assignment of what we know is present, and only do the slower checks if we are missing some dataset types.
@timj timj merged commit 5e4d6b4 into main Feb 10, 2022
@timj timj deleted the tickets/DM-33569 branch February 10, 2022 02:25