hsinfang
Collaborator

No description provided.

hsinfang force-pushed the tickets/DM-37072 branch 3 times, most recently from 2e37073 to cf5f45e, on December 19, 2022 22:42
Member

kfindeisen left a comment

Everything looks reasonable, but MiddlewareInterface makes two assumptions that will break with the new repo. One is a hack that should be easy to patch, but the other is a fundamental assumption in how _export_skymap_and_templates handles tracts/patches. I'm not sure what's the best way to deal with that; can you let me know what you think?

  skymap=self.skymap_name,
- where=template_where))
+ where=template_where,
+ findFirst=True))
Member

Would it be possible to add an (unticketed) TODO comment for adding findFirst to the calibs query? I think we would want to add it as soon as it's available.

Comment on lines 72 to 79
# Need all detectors, even those without data, for visit definition
contents.saveDataIds(
butler.registry.queryDataIds(
{"detector"},
collections="HSC/RC2/defaults",
datasets="raw",
).expanded()
)
Member

I don't think this block is necessary here -- the other file had it because ap_verify_ci_cosmos_pdr2 (and similar ap_verify datasets) avoid using full focal planes to keep the download small.

Comment on lines 88 to 82
logging.debug("Selecting refcats datasets")
records = butler.registry.queryDatasets(
datasetType=..., collections="refcats/DM-*"
)
contents.saveDatasets(records)
Member

I don't understand the collections filter on this query. Why not ask for specific collections, like for coadds, or just everything in the refcats chain?

Collaborator Author

I think we want just everything in the refcats chain. So it's now modified and just uses the chain instead of picking the runs.


# Save calibration collection
for collection in butler.registry.queryCollections(
expression=re.compile("^(HSC).*"),
Member

I suggest just searching for "HSC/calib*" -- it's a bit less permissive than "HSC*" (we don't actually need a regular expression to represent this filter...)
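
To illustrate the difference in permissiveness, here is a minimal sketch that compares the two filters on a made-up list of collection names. It uses the standard-library `fnmatch` module as a stand-in for glob matching; the assumption that this mirrors how the registry interprets wildcard strings, and every name in `names`, are illustrative only.

```python
import fnmatch
import re

# Illustrative collection names only; not the contents of any real repo.
names = [
    "HSC/calib",
    "HSC/calib/DM-28636",
    "HSC/defaults",
    "HSC/raw/all",
    "HSC/runs/RC2",
]

# The regular expression from the diff matches anything starting with "HSC".
regex = re.compile("^(HSC).*")
by_regex = [n for n in names if regex.match(n)]

# The glob only matches names that start with "HSC/calib".
by_glob = [n for n in names if fnmatch.fnmatchcase(n, "HSC/calib*")]

print(by_regex)
print(by_glob)
```

In this sketch the regular expression keeps all five names, while the glob keeps only the two under `HSC/calib`.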

Comment on lines 99 to 93
logging.debug("Selecting datasets in HSC/calib")
records = butler.registry.queryDatasets(
datasetType=..., collections=re.compile("HSC/calib")
)
contents.saveDatasets(records)
Member

I realize this is what I asked for, but out of curiosity, how much space do we need to copy all the calibs? Did you look into that with your local test repo?

Member

(Also, unnecessary re.compile).

Collaborator Author

I estimate ~825G for these calibs

"HSC/calib/gen2/20180117",
"HSC/calib/DM-28636",
"HSC/calib/gen2/20180117/unbounded",
"HSC/calib/DM-28636/unbounded",
Member

These two "unbounded" collections are runs, not calibration collections; can you check whether you actually need them in the chain? I thought I remembered that they were for something DRP-specific, but now I'm wondering whether it's the problem we tried to solve at https://github.com/lsst-dm/prompt_prototype/blob/main/python/activator/middleware_interface.py#L168 (that code will need to be updated either way, since right now it assumes there's an HSC/calib/unbounded).

Member

Can the other specific collections be replaced with a query for calibration collections that start with HSC/calib/?

Collaborator Author

> Can the other specific collections be replaced with a query for calibration collections that start with HSC/calib/?

Not really, because there are two other CALIBRATION collections in /repo/main with that prefix, and including them would mean including data that aren't needed.

Collaborator Author

You are right. I don't really need the "unbounded" RUN collections. Datasets in the CALIBRATION collections have pointers to other RUN collections.

Collaborator Author

From this Slack thread, it looks to me like https://github.com/lsst-dm/prompt_prototype/blob/main/python/activator/middleware_interface.py#L168 is the right way to go for the moment. To be consistent, I'm adding the chain to the new repo (7a817ed).

It also means I'll make prompt processing's central repo more similar to /repo/main, and less similar to ap_verify_ci_cosmos_pdr2.

# Chain rerun collections to templates
current = butler.registry.getCollectionChain("templates")
addition = butler.registry.queryCollections("HSC/runs/*",
collectionTypes=CollectionType.RUN)
Member

I suggest adding a comment that the export script should have guaranteed that there are only coadds in these collections.

Comment on lines 94 to 87
logging.debug("Selecting skymaps datasets")
records = butler.registry.queryDatasets(
datasetType=..., collections="skymaps")
contents.saveDatasets(records)
Member

kfindeisen Jan 6, 2023

Currently, MiddlewareInterface assumes that there is exactly one skymap in the central repository. Would it be possible to edit this script to return a specific skymap used by RC2? (Alternatively, you could edit MiddlewareInterface to be less restrictive, but since that involves rewriting how we identify which tract we want in _export_skymap_and_templates, it would be much harder.)
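
To make the "exactly one skymap" assumption concrete, here is a small sketch of the kind of guard that would fail loudly if the central repo ever held more than one. It is not the MiddlewareInterface code; `get_unique_skymap` is a hypothetical helper, and the skymap names in the usage lines are placeholders.

```python
def get_unique_skymap(skymap_names):
    """Return the only skymap name, or raise if there is not exactly one.

    Hypothetical helper for illustration; not part of MiddlewareInterface.
    """
    names = sorted(set(skymap_names))
    if len(names) != 1:
        raise RuntimeError(
            f"Expected exactly one skymap, found {len(names)}: {names}"
        )
    return names[0]


# Placeholder names; a repo exporting a single skymap passes the check.
only = get_unique_skymap(["skymap_a"])

try:
    get_unique_skymap(["skymap_a", "skymap_b"])
    rejected = False
except RuntimeError:
    rejected = True
```

Exporting only the skymap used by RC2 keeps the first case true without touching how tracts are identified.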

Collaborator Author

Now it exports only the skymap used by HSC-RC2.

@kfindeisen
Member

Sorry, the standup meeting reminded me of one more issue: can you please update https://github.com/lsst-dm/prompt_prototype/blob/main/pipelines/HSC/ApPipe.yaml to use coaddName: goodSeeing? I suspect the task configurations in that pipeline also don't match the new calibs and refcats. (The parameters labeled as workarounds for DM-30210 don't need to be updated -- they are obsolete and can be safely deleted.)

hsinfang force-pushed the tickets/DM-37072 branch 2 times, most recently from c1ef512 to b181ff5, on January 11, 2023 20:58
Collaborator Author

hsinfang commented Jan 11, 2023

Regarding the repo assumptions in MiddlewareInterface: by importing just one skymap and adding an HSC/calib/unbounded chained collection, I think they should be satisfied now. Does that sound reasonable?

Member

kfindeisen left a comment

Thanks for the fixes! My only remaining comment is about the pipeline: I think we need to get rid of the calibrate override and might be able to get rid of the isr one, but please double-check my reasoning.

coaddName: goodSeeing
tasks:
isr:
class: lsst.ip.isr.IsrTask
Member

Having looked a bit closer, I'm pretty sure the calibrate override in this file is not appropriate for the new refcats -- the included file looks for catalogs called gaia and panstarrs, which is the ap_verify convention. I'm not sure about the isr override -- do we have all the brighter-fatter and transmission curve HSC calibrations now?

Collaborator Author

You are right. I confirm that the new repo will have all those calibration data and special config overrides are no longer needed.

I'm leaving the DECam part of the pipeline configs untouched. Although we don't use DECam data for testing at the moment, we might in the future (?).

Member

kfindeisen Jan 13, 2023

We use the DECam pipeline for unit(ish) testing; the repository for it is in tests/data/central_repo.

Member

kfindeisen commented Jan 12, 2023

Sorry, I just spotted one potential problem with the calibs change: it looks like MiddlewareInterface exports HSC/defaults and exports the calibration collections, but it does not export HSC/calib now that it's a chained collection. I suggest adding a line to https://github.com/lsst-dm/prompt_prototype/blob/tickets/DM-37072/python/activator/middleware_interface.py#L458.

Use findFirst so that we can store multiple versions of templates
in one chained collection and let butler search for the dataset to use.

New collections with updated templates can be added to the chained
collection. The order in the chain determines the query results,
following the usual butler convention.

findFirst is not used when querying calibrations because find-first
queries on CALIBRATION-type collections are not supported yet.
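
As a rough illustration of the find-first behavior described above, the sketch below models a chained collection as an ordered list and returns the first match. It does not use the butler API; `COLLECTIONS`, `find_first`, and all collection, data ID, and dataset names are hypothetical stand-ins.

```python
# Hypothetical stand-in for a repo: collection name -> {data ID: dataset label}.
# This only illustrates find-first semantics; it is not the butler API.
COLLECTIONS = {
    "templates/new": {"patch-1": "coadd-v2"},
    "templates/old": {"patch-1": "coadd-v1", "patch-2": "coadd-v1"},
}


def find_first(chain, data_id):
    """Return the dataset from the first collection in the chain holding data_id."""
    for name in chain:
        dataset = COLLECTIONS.get(name, {}).get(data_id)
        if dataset is not None:
            return dataset
    return None


# Newer collections go first so that they shadow older ones.
chain = ["templates/new", "templates/old"]
print(find_first(chain, "patch-1"))
print(find_first(chain, "patch-2"))
```

In this sketch, `patch-1` resolves to the newer template because `templates/new` comes first, while `patch-2` falls through to `templates/old`.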

The script makes a butler export file for the selected HSC-RC2 datasets.

The current convention is to put a ticket number in the collection
names as a workaround for versioning the unbounded datasets, and to have
the "HSC/calib/unbounded" (i.e. instrument.makeUnboundedCalibrationRunName())
CHAINED collection point to the latest set.
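
The convention above amounts to a stable alias that gets repointed when a new ticketed run appears. The following sketch models that with a plain dictionary; it is not butler code, `set_chain` and `resolve` are hypothetical helpers, and "HSC/calib/DM-99999/unbounded" is a made-up name for a future run.

```python
def set_chain(chains, alias, children):
    """Point a chained collection (the alias) at an ordered list of children."""
    chains[alias] = list(children)


def resolve(chains, name):
    """Flatten a name into the ordered non-chained collections it reaches.

    Assumes the chains contain no cycles.
    """
    if name not in chains:
        return [name]
    resolved = []
    for child in chains[name]:
        resolved.extend(resolve(chains, child))
    return resolved


chains = {}
alias = "HSC/calib/unbounded"

set_chain(chains, alias, ["HSC/calib/DM-28636/unbounded"])
before = resolve(chains, alias)

# A later ticket adds a new run; only the chain definition changes.
set_chain(chains, alias, ["HSC/calib/DM-99999/unbounded"])
after = resolve(chains, alias)
```

Code that only knows the alias keeps working across the change, which is the point of routing through the chain.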

Two methods are used throughout the codebase, and they effectively
produce the same name (such as "HSC/defaults").
Using a single method makes it less confusing.

Previously, the top-level calibration collection (e.g. "HSC/calib")
in the test repo was a CALIBRATION collection, so it was
exported with the other CALIBRATION collections in the lines above.

In the new repo, the top-level calibration collection will be a
CHAINED collection of selected CALIBRATION collections, so it
needs to be exported separately.
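
A minimal sketch of why the extra export step is needed: selecting by CALIBRATION type alone no longer picks up the top-level collection once it is CHAINED. The `CollectionKind` enum and `collections_to_export` function are hypothetical stand-ins, not the butler's own types or the MiddlewareInterface code; the collection names are taken from this thread.

```python
from enum import Enum


class CollectionKind(Enum):
    """Hypothetical stand-in for the collection types discussed here."""
    RUN = 1
    CALIBRATION = 2
    CHAINED = 3


collections = {
    "HSC/calib": CollectionKind.CHAINED,
    "HSC/calib/DM-28636": CollectionKind.CALIBRATION,
    "HSC/calib/gen2/20180117": CollectionKind.CALIBRATION,
    "HSC/calib/DM-28636/unbounded": CollectionKind.RUN,
}


def collections_to_export(collections, top_level):
    """Select CALIBRATION collections, then add the top-level chain explicitly."""
    selected = [
        name for name, kind in collections.items()
        if kind is CollectionKind.CALIBRATION
    ]
    # A CHAINED top-level collection is missed by the type filter above.
    if collections.get(top_level) is CollectionKind.CHAINED:
        selected.append(top_level)
    return selected


to_export = collections_to_export(collections, "HSC/calib")
```

Without the explicit append, `to_export` would contain the two CALIBRATION collections but not "HSC/calib" itself.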

The new butler repo will have more calibration data in it, so the
config overrides in isr and calibrate are no longer needed. We can
turn on the default calibrations and use ps1_pv3_3pi_20170110 as
the reference catalog, as in the default ap_pipe configs.

hsinfang merged commit 8d18aa6 into main on Jan 13, 2023
hsinfang deleted the tickets/DM-37072 branch on January 13, 2023 21:02