DM-39514: Add validity-range workaround to prep_butler #77
Conversation
This is a crude function for vetting calibs by their validity range after an initial list has been returned by queryDatasets.
The call is made from _filter_datasets so that, in principle, it is possible to detect when there are no valid calibs. In practice, there are always unbounded calibs, making it hard to tell that the limited-duration calibs are missing.
Since the default is for all logs from the PP framework itself to be at DEBUG level, the "trace" logs are attached to a separate logger so that they stay off unless specifically requested in the service config.
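A minimal sketch of that separate trace logger, assuming only the standard Python logging module; the logger names here are hypothetical, not the service's actual ones:

```python
import logging

# Ordinary framework logs; these run at DEBUG level by default.
_log = logging.getLogger(__name__)

# Hypothetical name: verbose per-dataset "trace" logs go to a distinct
# logger so the service config can control them independently.
_log_trace = logging.getLogger("TRACE." + __name__)

# Suppressed by default; the service config can lower this threshold
# (e.g. to DEBUG) to turn the trace output on.
_log_trace.setLevel(logging.CRITICAL)
```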
t = astropy.time.Time(date, scale='utc')
# DatasetAssociation.timespan guaranteed not None
return [ref for ref in unfiltered_calibs
        if ref in associations and associations[ref].timespan.contains(t)]
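For context, here is a self-contained sketch of how the function around this snippet might be assembled, assuming the lsst.daf.butler Registry API (queryDatasetAssociations, CollectionType); the signature and the construction of `associations` are guesses for illustration, not the PR's exact code:

```python
import astropy.time
from lsst.daf.butler import CollectionType

def _filter_calibs_by_date(registry, collections, unfiltered_calibs, date):
    # Look up each candidate's association in the CALIBRATION
    # collection(s); the association carries the validity-range timespan.
    associations = {}
    for dataset_type in {ref.datasetType for ref in unfiltered_calibs}:
        for assoc in registry.queryDatasetAssociations(
                dataset_type,
                collections=collections,
                collectionTypes={CollectionType.CALIBRATION},
                flattenChains=True):
            # Keep the first association seen for each ref.
            associations.setdefault(assoc.ref, assoc)

    t = astropy.time.Time(date, scale='utc')
    # DatasetAssociation.timespan guaranteed not None
    return [ref for ref in unfiltered_calibs
            if ref in associations and associations[ref].timespan.contains(t)]
```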
If more than one CALIBRATION-type collection is passed in (either directly or as part of a CHAINED collection), this can return more than one DatasetRef of the same type and data ID (there's no logic like the findFirst=True argument to queryDatasets happening here). It looks like that's okay, since you export all of the collections you'll want to search, too, and that will export the search order so later lookups do the right thing, but I wanted to check.
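To illustrate the point about findFirst, a hypothetical lookup (the "bias" dataset type is an arbitrary example):

```python
# Without findFirst, every matching collection can contribute a ref for
# the same dataset type + data ID.
refs = list(registry.queryDatasets("bias", collections=collections,
                                    findFirst=False))
# findFirst=True would keep only the first match in collection search
# order, but (per the TODO mentioned below) it can't be used with
# CALIBRATION collections here.
```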
Yes, this has actually already happened (test runs picked up multiple sky datasets with overlapping validity ranges). Since (according to a TODO) we can't use findFirst with calibs, is there some other way to get _export_calibs to return unique calibs?
queryDatasetAssociations will return results in the right order (I don't think this is documented, but I'd be happy to make it a guarantee), so after you restrict the results to one instant in time, you could deduplicate manually on dataset type + data ID by using the first one returned.
The only other option at present is to use Registry.findDataset on one dataset type + data ID combination at a time.
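A hedged sketch of the manual deduplication suggested here; it leans on the (undocumented) ordering guarantee above, and the helper name is invented:

```python
def _deduplicate_calibs(filtered_calibs):
    # Assumes filtered_calibs is already restricted to one instant in
    # time and ordered by collection search order, so the first ref per
    # dataset type + data ID is the one a findFirst search would return.
    first_seen = {}
    for ref in filtered_calibs:
        key = (ref.datasetType.name, ref.dataId)
        first_seen.setdefault(key, ref)
    return list(first_seen.values())
```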
I don't think deduplication should be in scope for _filter_calibs_by_date, since the only way it can produce duplicate output is if it has duplicate input; it's not an artifact of using queryDatasetAssociations internally.
Actually, on second thought, wouldn't using findDataset make queryDatasetAssociations unnecessary? 🤔
It might be a net gain. My understanding is that with this PR's changes all calibration lookups are still correct, but you export more than you have to (especially in the limit where there are many unnecessary CALIBRATION collections in the chain). With findDataset you'd be doing many small database queries (to the point where the per-query latency overhead can become a performance problem; this is what makes step1 QG generation slow). I don't know if that's a good trade either overall or on the basis of where the time is spent.
I also think I saw a comment saying that you don't know which dataset types will be needed up front, and that probably rules out going with findDataset right now anyway.
I thought the proposal was to use findDataset to filter the results of queryDatasets, in which case I have (a superset of) the dataset types already.
Yes, I was thinking of findDataset as an alternative to queryDatasetAssociations.
Yes, doing a full queryDatasets with findFirst=False to find the candidate dataset refs and then following up with findDataset would indeed work. If time spent here in prep_butler is "cheap" compared to time spent doing the export or import, that's a pretty good plan.
This PR uses queryDatasetAssociations to filter out expired calibs during local repo setup. This is not a foolproof solution; in particular, error detection is quite poor, because the current code must remain agnostic about exactly which calibs the pipeline(s) need. Error detection will improve once DM-40245 rewrites calib loading to be more dynamic.