Skip to content

Conversation

kfindeisen
Copy link
Member

@kfindeisen kfindeisen commented Jul 31, 2023

This PR uses queryDatasetAssociations to filter out expired calibs during local repo setup. This is not a foolproof solution; in particular, error detection is quite poor because the current code needs to be agnostic to exactly which calibs the pipeline(s) need(s). Error detection will be improved once DM-40245 rewrites calib loading to be more dynamic.

@kfindeisen kfindeisen force-pushed the tickets/DM-39514 branch 2 times, most recently from d58f702 to 7d30afa Compare August 17, 2023 19:08
This is a crude function for vetting calibs by their validity range
after an initial list has been returned by queryDatasets.
The call is done from _filter_datasets to make it possible in principle
to detect when there are no valid calibs. In practice, there are always
unbounded calibs, making it hard to tell that the limited-duration
calibs are missing.
Since the default is for all logs from the PP framework itself to be
at DEBUG level, the "trace" logs are attacked to a separate logger to
turn them off unless specifically requested in the service config.
@kfindeisen kfindeisen marked this pull request as ready for review August 17, 2023 22:40
@kfindeisen kfindeisen requested a review from TallJimbo August 17, 2023 23:21
t = astropy.time.Time(date, scale='utc')
# DatasetAssociation.timespan guaranteed not None
return [ref for ref in unfiltered_calibs
if ref in associations and associations[ref].timespan.contains(t)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If more than one CALIBRATION-type collection is passed in (either directly as part of a CHAINED collection), this can return more than one DatasetRef of the same type and data ID (there's no logic like findFirst=True argument to queryDatasets happening here). It looks like that's okay, since you export all of the collections you'll want to search, too, and that will export the search order so later lookups do the right thing, but I wanted to check.

Copy link
Member Author

@kfindeisen kfindeisen Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this has actually already happened (test runs picked up multiple sky datasets with overlapping validity ranges).

Since (according to a TODO) we can't use findFirst with calibs, is there some other way to get _export_calibs to return unique calibs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queryDatasetAssociations will return results in the right order (I don't think this is documented but I'd be happy to make it a guarantee), so after you restrict the results to one instant in time, you could deduplicate manually on dataset type + data ID by using the first one returned.

The only other option at present is to use Registry.findDataset on one dataset type + data ID combination at a time.

Copy link
Member Author

@kfindeisen kfindeisen Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think deduplication should be in scope for _filter_calibs_by_date, since the only way it can produce duplicate output is if it has duplicate input; it's not an artifact of using queryDatasetAssociations internally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on second thought, wouldn't using findDataset make queryDatasetAssociations unnecessary? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a net gain. My understanding is that with this PR's changes all calibration lookups are still correct, but you export more than you have to (especially in the limit that there are many unnecessary CALIBRATION collections in the chain). With findDataset you'd be doing many small database queries (to the point where the per-query latency overhead can become a performance problem; this is what makes step1 QG generation slow). I don't know if that's a good trade either overall or on the basis of where the time is spent.

I also think I saw a comment saying that you don't know which dataset types will be needed up-front, and that probably rules out going with findDataset right now anyway.

Copy link
Member Author

@kfindeisen kfindeisen Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the proposal was to use findDataset to filter the results of queryDatasets, in which case I have (a superset of) the dataset types already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking of findDataset as an alternative to queryDatasetAssociations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, doing a full queryDatasets with findFirst=False to find the candidate dataset refs and then following up with findDatasets would indeed work. If time spent here in prep_butler is "cheap" compared to time spent doing the export or import, that's a pretty good plan.

@kfindeisen kfindeisen merged commit 798174d into main Aug 18, 2023
@kfindeisen kfindeisen deleted the tickets/DM-39514 branch August 18, 2023 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants