DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

kfindeisen · 2024-04-15T20:09:03Z

This PR does some optimization of Registry.queryDatasets calls against the central repo, moving data IDs out of where clauses and avoiding the use of ... as a dataset type identifier. ... is still used to implement the "cache nothing" hack (which is against the relatively small local repo, and needs to pick up absolutely everything to work) and as an explicit argument to queryDatasetTypes (where unfiltered queries are the norm).

This PR also uses experience gained with queryDatasetTypes to remove the explicit list of refcat dataset types (which was itself a workaround for the impracticality of ...); since Middleware still doesn't have an explicit "this is a refcat" concept, the new implementation uses a heuristic filter on dimensions and storage class.

Dataset queries done using fixed IDs are 2-3 times faster than queries done using the where keyword, a significant savings when querying the central repo. However, queries against a set of IDs cannot be converted to this form and must continue using where.

The comment justifies why we can't filter by validity range, but we have been doing so for a while.

Butler registry queries against all dataset types are an order of magnitude less efficient than an equivalent set of queries that enumerates all possible types.

TallJimbo · 2024-04-15T21:06:09Z

python/activator/middleware_interface.py

@@ -638,11 +638,15 @@ def _export_calibs(self, detector_id, filter):
        """
        # TAI observation start time should be used for calib validity range.
        calib_date = astropy.time.Time(self.visit.private_sndStamp, format="unix_tai")
+        # Querying by specific types is much faster than querying by ...
+        # Some calibs have an exposure ID (of the source dataset?), but these can't be used in AP.


FWIW, calibs with an exposure dimension don't make sense to me either. I wonder if they're accidents.

It seems to be a mix of things, some more reasonable than others:

Produced by CPP:

cpBiasProc, biasIsr_metadata, biasIsr_log

cpDarkIsr, darkIsr_metadata, darkIsr_log

cpDarkProc, cpDark_metadata, cpDark_log

cpDarkProcForDefects, darkIsrForDefects_metadata, darkIsrForDefects_log

cpFlatProc, flatIsr_metadata, flatIsr_log

flatStats, cpFlatMeasure_metadata, cpFlatMeasure_log

cpDefectsProc

cpPartialDefects, measureDefects_metadata, measureDefects_log

cpPtcProc, ptcIsr_metadata, ptcIsr_log

cpPtcExtract

cpCrosstalkProc

crosstalkRatios, crosstalkFluxes, cpCrosstalkRatio, crosstalkExtract_metadata, crosstalkExtract_log

Produced in the verification counterparts of the above:

verifyBiasProc, verifyBiasApply_metadata, verifyBiasApply_log

verifyBiasDetStats, verifyMeasureStats_metadata, verifyMeasureStats_log

verifyBiasExpStats, verifyExposureStats_log, verifyExposureStats_metadata

verifyDarkProc, verifyDarkApply_metadata, verifyDarkApply_log

verifyDarkDetStats, verifyDarkChip_metadata, verifyDarkChip_log

verifyDarkExpStats, verifyDarkExp_metadata, verifyDarkExp_log

verifyFlatProc, verifyFlatApply_metadata, verifyFlatApply_log

verifyFlatDetStats, verifyFlatChip_metadata, verifyFlatChip_log

verifyFlatExpStats, verifyFlatExp_metadata, verifyFlatExp_log

verifyDefectProc, verifyDefectApply_metadata, verifyDefectApply_log

verifyDefectDetStats

verifyDefectExpStats

verifyDefectIsr, verifyDefectIsr_metadata, verifyDefectIsr_log

verifyUncorrDefectIsr, verifyUncorrectedDefectIsr_metadata, verifyUncorrectedDefectIsr_log

verifyCrosstalkProc, verifyCrosstalkApply_metadata, verifyCrosstalkApply_log

verifyCrosstalkRatio, verifyCrosstalkExtract_metadata, verifyCrosstalkExtract_log

verifyLinearityProc, verifyLinearityApply_metadata, verifyLinearityApply_log

verifyLinearityPtcExtract

Miscellaneous:

raw

rawSpectrum

postISRCCD, isrStatistics, isr_metadata, isr_log

timing_isr_metadata, timing_isr_log, cputiming_isr_metadata, cputiming_isr_log [Did somebody run ApVerify.yaml on /repo/embargo???]

Unidentified (possibly obsolete):

quickLookExp, quickLookTask_metadata, quickLookTask_log

crosstalkSignals, crosstalkResults, crosstalkBackgroundXTilts, crosstalkBackgroundYTilts, crosstalkBackgroundZOffsets, crosstalkCoefficients, crosstalkCoefficientErrors, crosstalkNonLinearCoefficients, crosstalkTask_metadata, crosstalkTask_log

backgroundResults

plotPtcTask_metadata, plotPtcTask_log

cpCRisr, cpCRproc

spotSrc, characterizeSpots_metadata, characterizeSpots_log

eoFlatIsr, eoFlatStats, eoFlatMeasure_metadata, eoFlatMeasure_log

eoDarkIsr, eoDarkProc, eoDarkTask_log, eoDarkTask_metadata

myIsr_metadata, myIsr_log

verifyBiasExpMatrix, verifyBiasExpResults, verifyBiasExp_metadata, verifyBiasExp_log

verifyBiasDetMatrix, verifyBiasDetResults, verifyBiasChip_metadata, verifyBiasChip_log

verifyDarkExpResults, verifyDarkDetResults

verifyFlatDetResults, verifyFlatExpResults

I would expect bias/dark/flat calibrations to all be exposure calibrations and not visit.

TallJimbo · 2024-04-15T21:11:09Z

python/activator/middleware_interface.py

@@ -610,6 +610,8 @@ def _export_skymap_and_templates(self, center, detector, wcs, filter):
                collections=self._collection_template,
                instrument=self.instrument.getName(),
                skymap=self.skymap_name,
+                tract=tract.tract_id,
+                physical_filter=filter,


For my own curiousity, do you see this going meaningfully faster in a single query (or a small number of queries), or is it only when doing many queries that it becomes noticeable?

Not "meaningfully", no -- templates went from 0.0801, 0.0834 to 0.0697, 0.0678, 0.0840, 0.0846, 0.0721. But if nothing else, it will keep the code from getting cargo-culted to cases where it does matter.

Creating a function that returns the types makes it possible to replace the hardcoded list with something more flexible later.

This still forces us to load refcats the pipelines don't use, but at least prevents us from having to manually curate a list of known types.

kfindeisen requested a review from TallJimbo April 15, 2024 21:00

kfindeisen added 2 commits April 15, 2024 14:02

Remove outdated comment.

b645f18

The comment justifies why we can't filter by validity range, but we have been doing so for a while.

Avoid queries against ... dataset types.

749167b

Butler registry queries against all dataset types are an order of magnitude less efficient than an equivalent set of queries that enumerates all possible types.

TallJimbo approved these changes Apr 15, 2024

View reviewed changes

kfindeisen added 2 commits April 15, 2024 14:13

Factor out enumeration of refcat dataset types.

3efcb2d

Creating a function that returns the types makes it possible to replace the hardcoded list with something more flexible later.

Replace hardcoded refcat names with a query.

e886784

This still forces us to load refcats the pipelines don't use, but at least prevents us from having to manually curate a list of known types.

kfindeisen force-pushed the tickets/DM-42825 branch from e936c99 to e886784 Compare April 15, 2024 22:24

kfindeisen merged commit 924a223 into main Apr 15, 2024
5 checks passed

kfindeisen deleted the tickets/DM-42825 branch April 15, 2024 22:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

kfindeisen commented Apr 15, 2024 •

edited

TallJimbo Apr 15, 2024

kfindeisen Apr 15, 2024 •

edited

timj Apr 15, 2024

TallJimbo Apr 15, 2024

kfindeisen Apr 15, 2024

DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

DM-42825: Prompt Processing does slow calib queries against /repo/embargo #153

Conversation

kfindeisen commented Apr 15, 2024 • edited

TallJimbo Apr 15, 2024

Choose a reason for hiding this comment

kfindeisen Apr 15, 2024 • edited

Choose a reason for hiding this comment

timj Apr 15, 2024

Choose a reason for hiding this comment

TallJimbo Apr 15, 2024

Choose a reason for hiding this comment

kfindeisen Apr 15, 2024

Choose a reason for hiding this comment

kfindeisen commented Apr 15, 2024 •

edited

kfindeisen Apr 15, 2024 •

edited