New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-13530: Generalize ingestion to non-HiTS data #26
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Most of my comments are small. I'm really glad we have this handy ingestion utility now 💯
python/lsst/ap/verify/ingestion.py
Outdated
_exclude = exclude if exclude is not None else [] | ||
|
||
allFiles = set() | ||
# TODO: can replace with recursive glob once Python 2 support is withdrawn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be OK to do this now, especially if it will allow you to get rid of the nested for loops.
python/lsst/ap/verify/ingestion.py
Outdated
def _getOutputRepo(outputRoot, outputDir): | ||
"""Return location on disk for one output repository used by ``ap_pipe``. | ||
def _findMatchingFiles(basePath, include, exclude=None): | ||
"""Recursively identify files matching one set of patterns and not matching another. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a handy little function!
The name of a .tar.gz file that contains all the compressed | ||
defect images. | ||
""" | ||
# TODO: clean up implementation after DM-5467 resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this comment, I presume defect ingestion is still necessarily specific to DECam. It would be good to say so, and maybe provide the option to skip it to support future non-DECam ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind re. skipping it - looks like it will indeed skip this step if there's no defectTarball provided. Good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I glanced briefly at DM-5467 and thought it related to the need to have a tar file and/or a dedicated call to IngestCalibsTask
instead of just keeping the defects with the other calibs. Since you were involved in that discussion, can you double-check whether or not it's relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if something in the current code is DECam-specific, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say DECam-specific, I mostly mean that this assumes defects are fits files that need to be ingested at all. If defects exist in a different format (e.g., a text file like HSC https://github.com/lsst/obs_subaru/tree/master/hsc/defects) then they won't be handled at all with this ingestion routine. But that should be OK since HSC's defects aren't "ingested" in the first place. It is true that ingesting DECam defects requires a separate call to IngestCalibsTask
.
python/lsst/ap/verify/ingestion.py
Outdated
workspace.dataRepo, workspace.calibRepo)) | ||
|
||
def _doIngestCalibs(self, repo, calibRepo, calibDataFiles): | ||
"""Ingest callibration images into a calibration repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo - "calibration"
python/lsst/ap/verify/ingestion.py
Outdated
|
||
defectIngester = pexConfig.ConfigurableField( | ||
target = IngestCalibsTask, | ||
doc = "Task used to ingest defects.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good place to mention the format defects are expected in (and/or in the defectTarball doc)
|
||
def _isRepo(repoDir): | ||
"""Test whether a directory has been set up as a repository. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you asked Butler people whether this is an acceptable way to see if something is a repo or not? If this could preemptively agree with whatever gen3 Butler plans to do that would be nice, but not essential.
python/lsst/ap/verify/ingestion.py
Outdated
doc = "Names of calib files (no path; wildcards allowed) to ingest from the dataset.", | ||
) | ||
calibBadFiles = pexConfig.ListField( | ||
dtype = str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, possibly calibIgnoreFiles
python/lsst/ap/verify/ingestion.py
Outdated
# TODO: --output is workaround for DM-11668 | ||
args = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"), | ||
"--mode", "link", "--validity", "999"] | ||
args.extend(calibDataFiles) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DM-11668 bug is... something else. More relevant, should validity
be a config parameter? It sets something about the valid date range for configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else? I got exactly the same error as Colin describes on DM-11668, and it started after I let the raw repo be new-style.
I just copied --validity
from your code, so if you think it should be configurable, I'll do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Something else" as in "wow." Must be a new-style butler thing, yes. Sorry for the noise there.
Since so many other things are configurable now, it seems reasonable to not have validity hard-coded either. But a default value of 999 is fine. This may well be another DECam-ism; in other words, other cameras may rely on the validity parameter to set registry values during ingestion more than DECam typically does.
python/lsst/ap/verify/ingestion.py
Outdated
# TODO: --output is workaround for DM-11668 | ||
defectargs = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"), | ||
"--calibType", "defect", | ||
"--mode", "skip", "--validity", "999"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure the validity
argument can be omitted for defect ingestion. For more on this, look at pipe.tasks.ingestCalibs.fixSubsetValidity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you remember why you included it? Was it based on a tutorial, or because it didn't work if you didn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the README here, originally. https://github.com/lsst/obs_decam/blob/master/README.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out it can't be omitted: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/ingestCalibs.py#L190
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging that up. A default value it is, then.
python/lsst/ap/verify/ingestion.py
Outdated
keytype = str, | ||
itemtype = str, | ||
default = {}, | ||
doc = "Map from a refcat name to a tar.gz file containing the sharded catalog. May be empty.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what situation would an empty refcats make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a dataset doesn't come with refcats?
Also, I can't really default to anything else, since the choice of catalogs and especially filenames is going to be dataset-specific. Arguably this info shouldn't be in a config at all, but I don't see another way to let the dataset set the values.
We no longer deviate from PEP8 on keyword whitespace.
Using each task's parser through its IngestTask.ArgumentParser field will make it easier to configure ingestion, because the user only needs to specify the ingestion task. For each IngestionTask implementation there is only one correct of parser, so automating the choice is a good idea anyway.
The fact that ingestion must parse a fictional command line is difficult to work with. Hide everything except the argument list in a dedicated function.
Refcats are preprocessed, preingested data, so they should be handled similarly to templates.
The config contains all information that the dataset would need to provide to allow it to be ingested. This includes links to obs package specializations of ingestion tasks, but not data provided by obs packages directly.
This commit updates documentation (in ap_verify) corresponding to updates to ap_verify_dataset_template.
Previously, ingestion could exclude bad calibs but not bad science images.
The task contains the code that was previously in free functions, but does not yet use subtasks or have all the unneccessary boilerplate removed.
Despite not being a CmdLineTask, DatasetIngestTask will respect obs-package- and camera-specific override files following standard Stack conventions.
The use of subtasks lets us remove all metadata management from the ingestion code. However, since ingestion tasks were not designed for use as subtasks, some manual work is needed to retain obs package overrides for them (this work has, for now, been absorbed into the dataset-specific config file).
The old method unnecessarily conflated calib (flat, bias, etc.) and defect handling, making the code harder to understand.
Now _ingest* is responsible for skipping logic, dataset enumeration, and status messages. _doIngest* is responsible for the mechanics of ingestion and error handling.
The three main ingestion steps now use the same code to identify files to process. This greatly simplifies _doIngestDefects as well.
Changing Dataset.makeCompatibleRepo from a filesystem copy to a Butler chaining runs afoul of a couple of Butler 2 bugs, but it overall reduces the amount of low-level repository manipulation in ap_verify.
24a6479
to
9fcd047
Compare
This PR rewrites the ingestion module as a task, allowing for DECam-specific settings to be pushed into configs. As side effects the ingestion code has been standardized and simplified, and repository management has been made more abstract.
This PR may not be merged before lsst-dm/legacy-ap_verify_hits2015#6 and lsst/obs_decam#82.