Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

Merged
merged 5 commits into from
Apr 15, 2024

Conversation

kfindeisen
Copy link
Member

@kfindeisen kfindeisen commented Apr 15, 2024

This PR does some optimization of Registry.queryDatasets calls against the central repo, moving data IDs out of where clauses and avoiding the use of ... as a dataset type identifier. ... is still used to implement the "cache nothing" hack (which is against the relatively small local repo, and needs to pick up absolutely everything to work) and as an explicit argument to queryDatasetTypes (where unfiltered queries are the norm).

This PR also uses experience gained with queryDatasetTypes to remove the explicit list of refcat dataset types (which was itself a workaround for the impracticality of ...); since Middleware still doesn't have an explicit "this is a refcat" concept, the new implementation uses a heuristic filter on dimensions and storage class.

Dataset queries done using fixed IDs are 2-3 times faster than queries
done using the where keyword, a significant savings when querying the
central repo. However, queries against a set of IDs cannot be converted
to this form and must continue using where.
The comment justifies why we can't filter by validity range, but we
have been doing so for a while.
Butler registry queries against all dataset types are an order of
magnitude less efficient than an equivalent set of queries that
enumerates all possible types.
@@ -638,11 +638,15 @@ def _export_calibs(self, detector_id, filter):
"""
# TAI observation start time should be used for calib validity range.
calib_date = astropy.time.Time(self.visit.private_sndStamp, format="unix_tai")
# Querying by specific types is much faster than querying by ...
# Some calibs have an exposure ID (of the source dataset?), but these can't be used in AP.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, calibs with an exposure dimension don't make sense to me either. I wonder if they're accidents.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be a mix of things, some more reasonable than others:

  • Produced by CPP:
    • cpBiasProc, biasIsr_metadata, biasIsr_log
    • cpDarkIsr, darkIsr_metadata, darkIsr_log
    • cpDarkProc, cpDark_metadata, cpDark_log
    • cpDarkProcForDefects, darkIsrForDefects_metadata, darkIsrForDefects_log
    • cpFlatProc, flatIsr_metadata, flatIsr_log
    • flatStats, cpFlatMeasure_metadata, cpFlatMeasure_log
    • cpDefectsProc
    • cpPartialDefects, measureDefects_metadata, measureDefects_log
    • cpPtcProc, ptcIsr_metadata, ptcIsr_log
    • cpPtcExtract
    • cpCrosstalkProc
    • crosstalkRatios, crosstalkFluxes, cpCrosstalkRatio, crosstalkExtract_metadata, crosstalkExtract_log
  • Produced in the verification counterparts of the above:
    • verifyBiasProc, verifyBiasApply_metadata, verifyBiasApply_log
    • verifyBiasDetStats, verifyMeasureStats_metadata, verifyMeasureStats_log
    • verifyBiasExpStats, verifyExposureStats_log, verifyExposureStats_metadata
    • verifyDarkProc, verifyDarkApply_metadata, verifyDarkApply_log
    • verifyDarkDetStats, verifyDarkChip_metadata, verifyDarkChip_log
    • verifyDarkExpStats, verifyDarkExp_metadata, verifyDarkExp_log
    • verifyFlatProc, verifyFlatApply_metadata, verifyFlatApply_log
    • verifyFlatDetStats, verifyFlatChip_metadata, verifyFlatChip_log
    • verifyFlatExpStats, verifyFlatExp_metadata, verifyFlatExp_log
    • verifyDefectProc, verifyDefectApply_metadata, verifyDefectApply_log
    • verifyDefectDetStats
    • verifyDefectExpStats
    • verifyDefectIsr, verifyDefectIsr_metadata, verifyDefectIsr_log
    • verifyUncorrDefectIsr, verifyUncorrectedDefectIsr_metadata, verifyUncorrectedDefectIsr_log
    • verifyCrosstalkProc, verifyCrosstalkApply_metadata, verifyCrosstalkApply_log
    • verifyCrosstalkRatio, verifyCrosstalkExtract_metadata, verifyCrosstalkExtract_log
    • verifyLinearityProc, verifyLinearityApply_metadata, verifyLinearityApply_log
    • verifyLinearityPtcExtract
  • Miscellaneous:
    • raw
    • rawSpectrum
    • postISRCCD, isrStatistics, isr_metadata, isr_log
    • timing_isr_metadata, timing_isr_log, cputiming_isr_metadata, cputiming_isr_log [Did somebody run ApVerify.yaml on /repo/embargo???]
  • Unidentified (possibly obsolete):
    • quickLookExp, quickLookTask_metadata, quickLookTask_log
    • crosstalkSignals, crosstalkResults, crosstalkBackgroundXTilts, crosstalkBackgroundYTilts, crosstalkBackgroundZOffsets, crosstalkCoefficients, crosstalkCoefficientErrors, crosstalkNonLinearCoefficients, crosstalkTask_metadata, crosstalkTask_log
    • backgroundResults
    • plotPtcTask_metadata, plotPtcTask_log
    • cpCRisr, cpCRproc
    • spotSrc, characterizeSpots_metadata, characterizeSpots_log
    • eoFlatIsr, eoFlatStats, eoFlatMeasure_metadata, eoFlatMeasure_log
    • eoDarkIsr, eoDarkProc, eoDarkTask_log, eoDarkTask_metadata
    • myIsr_metadata, myIsr_log
    • verifyBiasExpMatrix, verifyBiasExpResults, verifyBiasExp_metadata, verifyBiasExp_log
    • verifyBiasDetMatrix, verifyBiasDetResults, verifyBiasChip_metadata, verifyBiasChip_log
    • verifyDarkExpResults, verifyDarkDetResults
    • verifyFlatDetResults, verifyFlatExpResults

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect bias/dark/flat calibrations to all be exposure calibrations and not visit.

@@ -610,6 +610,8 @@ def _export_skymap_and_templates(self, center, detector, wcs, filter):
collections=self._collection_template,
instrument=self.instrument.getName(),
skymap=self.skymap_name,
tract=tract.tract_id,
physical_filter=filter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own curiousity, do you see this going meaningfully faster in a single query (or a small number of queries), or is it only when doing many queries that it becomes noticeable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not "meaningfully", no -- templates went from 0.0801, 0.0834 to 0.0697, 0.0678, 0.0840, 0.0846, 0.0721. But if nothing else, it will keep the code from getting cargo-culted to cases where it does matter.

Creating a function that returns the types makes it possible to replace
the hardcoded list with something more flexible later.
This still forces us to load refcats the pipelines don't use, but at
least prevents us from having to manually curate a list of known types.
@kfindeisen kfindeisen merged commit 924a223 into main Apr 15, 2024
5 checks passed
@kfindeisen kfindeisen deleted the tickets/DM-42825 branch April 15, 2024 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants