-
Notifications
You must be signed in to change notification settings - Fork 0
DM-37072: Expand HSC calibs and templates in central repo #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2e37073
to
cf5f45e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks reasonable, but MiddlewareInterface
makes two assumptions that will break with the new repo. One is a hack that should be easy to patch, but the other is a fundamental assumption in how _export_skymap_and_templates
handles tracts/patches. I'm not sure what's the best way to deal with that; can you let me know what you think?
skymap=self.skymap_name, | ||
where=template_where)) | ||
where=template_where, | ||
findFirst=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add an (unticketed) TODO comment for adding findFirst
to the calibs query? I think we would want to add it as soon as it's available.
bin.src/make_hsc_rc2_export.py
Outdated
# Need all detectors, even those without data, for visit definition | ||
contents.saveDataIds( | ||
butler.registry.queryDataIds( | ||
{"detector"}, | ||
collections="HSC/RC2/defaults", | ||
datasets="raw", | ||
).expanded() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this block is necessary here -- the other file had it because ap_verify_ci_cosmos_pdr2
(and similar ap_verify
datasets) avoid using full focal planes to keep the download small.
logging.debug("Selecting refcats datasets") | ||
records = butler.registry.queryDatasets( | ||
datasetType=..., collections="refcats/DM-*" | ||
) | ||
contents.saveDatasets(records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the collections filter on this query. Why not ask for specific collections, like for coadds, or just everything in the refcats
chain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want just everything in the refcats
chain. So it's now modified and just uses the chain instead of picking the runs.
bin.src/make_hsc_rc2_export.py
Outdated
|
||
# Save calibration collection | ||
for collection in butler.registry.queryCollections( | ||
expression=re.compile("^(HSC).*"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest just searching for "HSC/calib*" -- it's a bit less permissive than "HSC*" (we don't actually need a regular expression to represent this filter...)
logging.debug("Selecting datasets in HSC/calib") | ||
records = butler.registry.queryDatasets( | ||
datasetType=..., collections=re.compile("HSC/calib") | ||
) | ||
contents.saveDatasets(records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is what I asked for, but out of curiosity, how much space do we need to copy all the calibs? Did you look into that with your local test repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, unnecessary re.compile
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I estimate ~825G for these calibs
"HSC/calib/gen2/20180117", | ||
"HSC/calib/DM-28636", | ||
"HSC/calib/gen2/20180117/unbounded", | ||
"HSC/calib/DM-28636/unbounded", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two "unbounded" collections are runs, not calibration collections; can you check whether you actually need them in the chain? I thought I remembered that they were for something DRP-specific, but now I'm wondering whether it's the problem we tried to solve at https://github.com/lsst-dm/prompt_prototype/blob/main/python/activator/middleware_interface.py#L168 (that code will need to be updated either way, since right now it assumes there's an HSC/calib/unbounded
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the other specific collections be replaced with a query for calibration collections that start with HSC/calib/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the other specific collections be replaced with a query for calibration collections that start with HSC/calib/?
Not really, because there are two other CALIBRATION collections in /repo/main
with that prefix, and including them would mean including data that aren't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I don't really need the "unbounded" RUN collections. Datasets in the CALIBRATION collections have pointers to other RUN collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this slack thread it looks to me that https://github.com/lsst-dm/prompt_prototype/blob/main/python/activator/middleware_interface.py#L168 is the right way to go for the moment. To be consistent I'm adding the chain to the new repo (7a817ed )
It also means I'll make prompt processing's central repo more similar to /repo/main , and less similar to ap_verify_ci_cosmos_pdr2.
# Chain rerun collections to templates | ||
current = butler.registry.getCollectionChain("templates") | ||
addition = butler.registry.queryCollections("HSC/runs/*", | ||
collectionTypes=CollectionType.RUN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding a comment that the export script should have guaranteed that there are only coadds in these collections.
bin.src/make_hsc_rc2_export.py
Outdated
logging.debug("Selecting skymaps datasets") | ||
records = butler.registry.queryDatasets( | ||
datasetType=..., collections="skymaps") | ||
contents.saveDatasets(records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, MiddlewareInterface
assumes that there is exactly one skymap in the central repository. Would it be possible to edit this script to return a specific skymap used by RC2? (Alternatively, you could edit MiddlewareInterface
to be less restrictive, but since that involves rewriting how we identify which tract we want in _export_skymap_and_templates
, it would be much harder.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it exports only the skymap used by HSC-RC2.
9fdc7a1
to
b47d98b
Compare
Sorry, the standup meeting reminded me of one more issue: can you please update https://github.com/lsst-dm/prompt_prototype/blob/main/pipelines/HSC/ApPipe.yaml to use |
c1ef512
to
b181ff5
Compare
About the repo assumptions from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes! My only remaining comment is about the pipeline: I think we need to get rid of the calibrate
override and might be able to get rid of the isr
one, but please double-check my reasoning.
pipelines/HSC/ApPipe.yaml
Outdated
coaddName: goodSeeing | ||
tasks: | ||
isr: | ||
class: lsst.ip.isr.IsrTask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having looked a bit closer, I'm pretty sure the calibrate
override in this file is not appropriate for the new refcats -- the included file looks for catalogs called gaia
and panstarrs
, which is the ap_verify
convention. I'm not sure about the isr
override -- do we have all the brighter-fatter and transmission curve HSC calibrations now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I confirm that the new repo will have all those calibration data and special config overrides are no longer needed.
I'm leaving the DECam part of the pipeline configs untouched. Despite that we don't use DECam data for testing at the moment, we might in the future (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use the DECam pipeline for unit(ish) testing; the repository for it is in tests/data/central_repo
.
Sorry, I just spotted one potential problem with the calibs change: it looks like |
Using findFirst so that we can store multiple versions of templates in one chain collection and let butler search for the dataset to use. New collections with the updated templates can be added to the chain collection. The order in the chain collection determines the query results, like the usual butler convention. findFirst is not used in querying calibration because findFirst query in CALIBRATION-type collection is not supported yet.
It makes a butler export file for selected HSC-RC2 dataset.
The current convention is to have a ticket number in the collection names as a workaround to version the unbounded datasets, and have the "HSC/calib/unbounded" (i.e. instrument.makeUnboundedCalibrationRunName()) CHAINED collection point to the latest set.
Throughout the codebase two methods are used and effectively they result in the same name (such as "HSC/defaults"). Just using one same method makes it less confusing.
Previously the top-level calibration collection (e.g. "HSC/calib") in the test repo has been a CALIBRATION collection, so it got exported with other CALIBRATION collections in the above lines. In the new repo, the top-level calibration collection will be a CHAINED collections of selected CALIBRATION collections, so it needs to be exported separatedly.
7a817ed
to
b9d12d0
Compare
The new butler repo will have more calibration data in it. Therefore, config overrides in isr and calibrate are no longer needed. We can turn on the default calibrations, and just use ps1_pv3_3pi_20170110 as the reference catalog like in the default ap_pipe configs.
No description provided.