
DM-33569: Use explicit reason for dataset type subsetting failing #234

Merged
merged 3 commits into main on Feb 10, 2022

Conversation

timj (Member) commented Feb 4, 2022

Report the missing dataset type rather than a more
opaque default KeyError.

Also allow the incompatibility to be accepted if the storage
classes are compatible.

This can happen when a dataset type defined in a
connections class, but not yet in the registry, does not match
the definition added elsewhere. The example that triggered this
was TaskMetadata, which was used in a connection and added
by the pipeline infrastructure.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
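The behaviour described above can be sketched in a minimal, self-contained form. This is NOT the actual graphBuilder code: `FakeDatasetType`, `find_dataset_type`, and the compatibility rule are stand-ins invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDatasetType:
    """Hypothetical stand-in for a dataset type: a name plus a storage class."""

    name: str
    storageClass: str

    def is_compatible_with(self, other: "FakeDatasetType") -> bool:
        # Stand-in rule: same name, and either identical storage classes
        # or a pair we pretend is convertible (e.g. TaskMetadata written
        # where a PropertySet definition already exists).
        convertible = {"TaskMetadata", "PropertySet"}
        return self.name == other.name and (
            self.storageClass == other.storageClass
            or {self.storageClass, other.storageClass} <= convertible
        )


def find_dataset_type(wanted: FakeDatasetType, registry_types: dict):
    """Look up ``wanted``, reporting an explicit reason on failure
    instead of letting a bare KeyError escape."""
    if wanted in registry_types:
        return registry_types[wanted]
    for existing in registry_types:
        if existing.name == wanted.name:
            if existing.is_compatible_with(wanted):
                # Same name, compatible storage class: accept it.
                return registry_types[existing]
            raise LookupError(
                f"Dataset type {wanted.name!r} exists but with an "
                f"incompatible storage class ({existing.storageClass!r} "
                f"vs {wanted.storageClass!r})."
            )
    raise LookupError(f"Dataset type {wanted.name!r} is not known.")
```

The key point is the two distinct failure messages: "not known" versus "incompatible storage class", rather than one opaque KeyError for both.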

codecov bot commented Feb 4, 2022

Codecov Report

Merging #234 (ee3fb58) into main (2ab1b5b) will decrease coverage by 0.21%.
The diff coverage is 12.50%.


@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   71.53%   71.31%   -0.22%     
==========================================
  Files          48       48              
  Lines        6168     6191      +23     
  Branches     1195     1205      +10     
==========================================
+ Hits         4412     4415       +3     
- Misses       1538     1558      +20     
  Partials      218      218              
Impacted Files Coverage Δ
python/lsst/pipe/base/pipeline.py 63.70% <ø> (ø)
python/lsst/pipe/base/graphBuilder.py 64.73% <12.50%> (-4.00%) ⬇️
...thon/lsst/pipe/base/graph/_versionDeserializers.py 61.47% <0.00%> (+0.43%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Review thread on python/lsst/pipe/base/graphBuilder.py (resolved):
# The dataset type is not found. It may not be listed,
# or it may be present with the same name but a
# different definition.
for d in combined:
Contributor commented:
likely combined is a much longer list than the number of datasets that need the special lookup. Consider saving those in a list called something like missing, and then, after the loop over datasetTypes, if that list is not empty, run an outer loop over combined and see if any of its entries match the DatasetTypes in missing, so that you only iterate over combined once. (I'm not sure what sizes we are talking about in general, though.)

timj (Member, Author) commented Feb 7, 2022:
Yes. I was imagining that since this logic only triggers when something doesn't match, and something will only fail to match while we are in a transition period, a little slowdown won't really matter (we don't have thousands of dataset types here). There is a possibility that a dataset type has already matched in combined, so a deferred match over only the things that didn't match won't work -- we can probably avoid that by storing the candidate dataset types in a dict keyed by name.

Contributor commented:
I was thinking of something like

missing = []
for datasetType in datasetTypes:
    if datasetType in combined:
        _dict[datasetType] = combined[datasetType]
    else:
        missing.append(datasetType)
if missing:
    for existingType in combined:
        for datasetType in missing:
            # check if they do match, if so add to _dict

but as you say, I don't know how many DatasetTypes in combined we are talking about here.

Contributor commented:
possibly even something like

_dict = {datasetType: combined[datasetType] for datasetType in (combined.keys() & datasetTypes)}

# don't do anything if all is well. Less branching logic
if len(_dict) < len(datasetTypes):
    missing = set(datasetTypes) - combined.keys()
    for existingType in (combined.keys() - _dict.keys()):
        for datasetType in missing:
            # check if they do match

timj (Member, Author) commented:
I've been pondering this, and I'm not convinced that this approach really saves us anything, because you still have to check everything in combined: it's possible that one of the mismatches is also defined the other way. I think what might help is a dict that maps name to key in combined, because then we wouldn't need to loop over combined each time -- we could just get the name match and compare it directly.

timj (Member, Author) commented:
I added the "pre-fill the dict with known matches" logic, and a dict lookup rather than a second loop in the branch that does the compatibility check.
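The final approach -- pre-fill the dict with exact matches, then resolve the stragglers through a single name-keyed index instead of re-scanning combined -- can be sketched as follows. The names (`subset_types`, `combined`, the `compatible` callback, the namedtuple `DatasetType`) are illustrative stand-ins, not the actual graphBuilder code.

```python
from collections import namedtuple

# Hypothetical stand-in for a real dataset type: a name plus a storage class.
DatasetType = namedtuple("DatasetType", ["name", "storageClass"])


def subset_types(datasetTypes, combined, compatible):
    """Subset ``combined`` to ``datasetTypes``.

    ``compatible(existing, wanted)`` is a caller-supplied predicate standing
    in for the real storage-class compatibility check.
    """
    _dict = {}
    missing = []
    # Fast path: assign everything that already matches exactly.
    for datasetType in datasetTypes:
        if datasetType in combined:
            _dict[datasetType] = combined[datasetType]
        else:
            missing.append(datasetType)
    if missing:
        # One pass to index by name; each miss then costs a dict lookup
        # rather than another loop over combined.
        by_name = {d.name: d for d in combined}
        for datasetType in missing:
            existing = by_name.get(datasetType.name)
            if existing is not None and compatible(existing, datasetType):
                # Same name, compatible storage class: accept the
                # existing definition.
                _dict[existing] = combined[existing]
            else:
                raise LookupError(
                    f"Dataset type {datasetType.name!r} could not be "
                    f"found or made compatible."
                )
    return _dict
```

When every requested type matches exactly, the slow branch never runs, which is the "less branching in the common case" property discussed above.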

Commit messages:

  • If there is a storage class incompatibility, the input should win, because that is what the Task python code is expecting.
  • Report the missing dataset type rather than a more opaque default KeyError. Also allow the incompatibility to be accepted if the storage classes are compatible.
  • Do the fast assignment of what we know is present, and only do the slower checks if we are missing some dataset types.
@timj timj merged commit 5e4d6b4 into main Feb 10, 2022
@timj timj deleted the tickets/DM-33569 branch February 10, 2022 02:25