Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-13530: Generalize ingestion to non-HiTS data #26

Merged
merged 19 commits into from Apr 23, 2018
Merged

Conversation

kfindeisen
Copy link
Member

This PR rewrites the ingestion module as a task, allowing for DECam-specific settings to be pushed into configs. As side effects the ingestion code has been standardized and simplified, and repository management has been made more abstract.

This PR may not be merged before lsst-dm/legacy-ap_verify_hits2015#6 and lsst/obs_decam#82.

@kfindeisen kfindeisen requested a review from mrawls April 5, 2018 20:03
Copy link
Contributor

@mrawls mrawls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Most of my comments are small. I'm really glad we have this handy ingestion utility now 💯

_exclude = exclude if exclude is not None else []

allFiles = set()
# TODO: can replace with recursive glob once Python 2 support is withdrawn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OK to do this now, especially if it will allow you to get rid of the nested for loops.

def _getOutputRepo(outputRoot, outputDir):
"""Return location on disk for one output repository used by ``ap_pipe``.
def _findMatchingFiles(basePath, include, exclude=None):
"""Recursively identify files matching one set of patterns and not matching another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a handy little function!

The name of a .tar.gz file that contains all the compressed
defect images.
"""
# TODO: clean up implementation after DM-5467 resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this comment, I presume defect ingestion is still necessarily specific to DECam. It would be good to say so, and maybe provide the option to skip it to support future non-DECam ingestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind re. skipping it - looks like it will indeed skip this step if there's no defectTarball provided. Good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I glanced briefly at DM-5467 and thought it related to the need to have a tar file and/or a dedicated call to IngestCalibsTask instead of just keeping the defects with the other calibs. Since you were involved in that discussion, can you double-check whether or not it's relevant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if something in the current code is DECam-specific, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I say DECam-specific, I mostly mean that this assumes defects are fits files that need to be ingested at all. If defects exist in a different format (e.g., a text file like HSC https://github.com/lsst/obs_subaru/tree/master/hsc/defects) then they won't be handled at all with this ingestion routine. But that should be OK since HSC's defects aren't "ingested" in the first place. It is true that ingesting DECam defects requires a separate call to IngestCalibsTask.

workspace.dataRepo, workspace.calibRepo))

def _doIngestCalibs(self, repo, calibRepo, calibDataFiles):
"""Ingest callibration images into a calibration repository.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - "calibration"


defectIngester = pexConfig.ConfigurableField(
target = IngestCalibsTask,
doc = "Task used to ingest defects.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good place to mention the format defects are expected in (and/or in the defectTarball doc)


def _isRepo(repoDir):
"""Test whether a directory has been set up as a repository.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you asked Butler people whether this is an acceptable way to see if something is a repo or not? If this could preemptively agree with whatever gen3 Butler plans to do that would be nice, but not essential.

doc = "Names of calib files (no path; wildcards allowed) to ingest from the dataset.",
)
calibBadFiles = pexConfig.ListField(
dtype = str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, possibly calibIgnoreFiles

# TODO: --output is workaround for DM-11668
args = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"),
"--mode", "link", "--validity", "999"]
args.extend(calibDataFiles)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DM-11668 bug is... something else. More relevant, should validity be a config parameter? It sets something about the valid date range for configs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something else? I got exactly the same error as Colin describes on DM-11668, and it started after I let the raw repo be new-style.

I just copied --validity from your code, so if you think it should be configurable, I'll do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Something else" as in "wow." Must be a new-style butler thing, yes. Sorry for the noise there.

Since so many other things are configurable now, it seems reasonable to not have validity hard-coded either. But a default value of 999 is fine. This may well be another DECam-ism; in other words, other cameras may rely on the validity parameter to set registry values during ingestion more than DECam typically does.

# TODO: --output is workaround for DM-11668
defectargs = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"),
"--calibType", "defect",
"--mode", "skip", "--validity", "999"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure the validity argument can be omitted for defect ingestion. For more on this, look at pipe.tasks.ingestCalibs.fixSubsetValidity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you remember why you included it? Was it based on a tutorial, or because it didn't work if you didn't?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging that up. A default value it is, then.

keytype = str,
itemtype = str,
default = {},
doc = "Map from a refcat name to a tar.gz file containing the sharded catalog. May be empty.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situation would an empty refcats make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a dataset doesn't come with refcats?

Also, I can't really default to anything else, since the choice of catalogs and especially filenames is going to be dataset-specific. Arguably this info shouldn't be in a config at all, but I don't see another way to let the dataset set the values.

We no longer deviate from PEP8 on keyword whitespace.
Using each task's parser through its IngestTask.ArgumentParser field
will make it easier to configure ingestion, because the user only
needs to specify the ingestion task. For each IngestionTask
implementation there is only one correct of parser, so automating the
choice is a good idea anyway.
The fact that ingestion must parse a fictional command line is
difficult to work with. Hide everything except the argument list in a
dedicated function.
Refcats are preprocessed, preingested data, so they should be handled
similarly to templates.
The config contains all information that the dataset would need to provide to
allow it to be ingested. This includes links to obs package specializations of
ingestion tasks, but not data provided by obs packages directly.
This commit updates documentation (in ap_verify) corresponding to
updates to ap_verify_dataset_template.
Previously, ingestion could exclude bad calibs but not bad science images.
The task contains the code that was previously in free functions, but
does not yet use subtasks or have all the unneccessary boilerplate
removed.
Despite not being a CmdLineTask, DatasetIngestTask will respect
obs-package- and camera-specific override files following standard
Stack conventions.
The use of subtasks lets us remove all metadata management from the
ingestion code. However, since ingestion tasks were not designed for
use as subtasks, some manual work is needed to retain obs package
overrides for them (this work has, for now, been absorbed into the
dataset-specific config file).
The old method unnecessarily conflated calib (flat, bias, etc.) and
defect handling, making the code harder to understand.
Now _ingest* is responsible for skipping logic, dataset enumeration, and
status messages. _doIngest* is responsible for the mechanics of ingestion and
error handling.
The three main ingestion steps now use the same code to identify files
to process. This greatly simplifies _doIngestDefects as well.
Changing Dataset.makeCompatibleRepo from a filesystem copy to a Butler
chaining runs afoul of a couple of Butler 2 bugs, but it overall
reduces the amount of low-level repository manipulation in ap_verify.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants