DM-13530: Generalize ingestion to non-HiTS data #26

kfindeisen · 2018-04-05T20:03:08Z

This PR rewrites the ingestion module as a task, allowing for DECam-specific settings to be pushed into configs. As side effects the ingestion code has been standardized and simplified, and repository management has been made more abstract.

This PR may not be merged before lsst-dm/legacy-ap_verify_hits2015#6 and lsst/obs_decam#82.

mrawls

This looks great. Most of my comments are small. I'm really glad we have this handy ingestion utility now 💯

mrawls · 2018-04-19T20:00:42Z

python/lsst/ap/verify/ingestion.py

+    _exclude = exclude if exclude is not None else []
+
+    allFiles = set()
+    # TODO: can replace with recursive glob once Python 2 support is withdrawn


Should be OK to do this now, especially if it will allow you to get rid of the nested for loops.

mrawls · 2018-04-19T20:01:15Z

python/lsst/ap/verify/ingestion.py

-def _getOutputRepo(outputRoot, outputDir):
-    """Return location on disk for one output repository used by ``ap_pipe``.
+def _findMatchingFiles(basePath, include, exclude=None):
+    """Recursively identify files matching one set of patterns and not matching another.



This is a handy little function!

mrawls · 2018-04-19T20:19:13Z

python/lsst/ap/verify/ingestion.py

+            The name of a .tar.gz file that contains all the compressed
+            defect images.
+        """
+        # TODO: clean up implementation after DM-5467 resolved


Given this comment, I presume defect ingestion is still necessarily specific to DECam. It would be good to say so, and maybe provide the option to skip it to support future non-DECam ingestion.

Nevermind re. skipping it - looks like it will indeed skip this step if there's no defectTarball provided. Good!

Hmm... I glanced briefly at DM-5467 and thought it related to the need to have a tar file and/or a dedicated call to IngestCalibsTask instead of just keeping the defects with the other calibs. Since you were involved in that discussion, can you double-check whether or not it's relevant?

And if something in the current code is DECam-specific, please let me know.

When I say DECam-specific, I mostly mean that this assumes defects are fits files that need to be ingested at all. If defects exist in a different format (e.g., a text file like HSC https://github.com/lsst/obs_subaru/tree/master/hsc/defects) then they won't be handled at all with this ingestion routine. But that should be OK since HSC's defects aren't "ingested" in the first place. It is true that ingesting DECam defects requires a separate call to IngestCalibsTask.

mrawls · 2018-04-19T20:21:00Z

python/lsst/ap/verify/ingestion.py

+                workspace.dataRepo, workspace.calibRepo))
+
+    def _doIngestCalibs(self, repo, calibRepo, calibDataFiles):
+        """Ingest callibration images into a calibration repository.


Typo - "calibration"

mrawls · 2018-04-19T20:27:58Z

python/lsst/ap/verify/ingestion.py

+
+    defectIngester = pexConfig.ConfigurableField(
+        target = IngestCalibsTask,
+        doc = "Task used to ingest defects.",


Good place to mention the format defects are expected in (and/or in the defectTarball doc)

mrawls · 2018-04-19T20:33:35Z

python/lsst/ap/verify/dataset.py

+
+def _isRepo(repoDir):
+    """Test whether a directory has been set up as a repository.
+    """


Have you asked Butler people whether this is an acceptable way to see if something is a repo or not? If this could preemptively agree with whatever gen3 Butler plans to do that would be nice, but not essential.

mrawls · 2018-04-19T20:34:05Z

python/lsst/ap/verify/ingestion.py

+        doc = "Names of calib files (no path; wildcards allowed) to ingest from the dataset.",
+    )
+    calibBadFiles = pexConfig.ListField(
+        dtype = str,


Again, possibly calibIgnoreFiles

mrawls · 2018-04-19T20:40:46Z

python/lsst/ap/verify/ingestion.py

+        # TODO: --output is workaround for DM-11668
+        args = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"),
+                "--mode", "link", "--validity", "999"]
+        args.extend(calibDataFiles)


The DM-11668 bug is... something else. More relevant, should validity be a config parameter? It sets something about the valid date range for configs.

Something else? I got exactly the same error as Colin describes on DM-11668, and it started after I let the raw repo be new-style.

I just copied --validity from your code, so if you think it should be configurable, I'll do that.

"Something else" as in "wow." Must be a new-style butler thing, yes. Sorry for the noise there.

Since so many other things are configurable now, it seems reasonable to not have validity hard-coded either. But a default value of 999 is fine. This may well be another DECam-ism; in other words, other cameras may rely on the validity parameter to set registry values during ingestion more than DECam typically does.

mrawls · 2018-04-19T20:43:44Z

python/lsst/ap/verify/ingestion.py

+        # TODO: --output is workaround for DM-11668
+        defectargs = [repo, "--calib", calibRepo, "--output", os.path.join(calibRepo, "dummy"),
+                      "--calibType", "defect",
+                      "--mode", "skip", "--validity", "999"]


I am pretty sure the validity argument can be omitted for defect ingestion. For more on this, look at pipe.tasks.ingestCalibs.fixSubsetValidity

Do you remember why you included it? Was it based on a tutorial, or because it didn't work if you didn't?

From the README here, originally. https://github.com/lsst/obs_decam/blob/master/README.md

Turns out it can't be omitted: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/ingestCalibs.py#L190

Thanks for digging that up. A default value it is, then.

mrawls · 2018-04-19T20:47:18Z

python/lsst/ap/verify/ingestion.py

+        keytype = str,
+        itemtype = str,
+        default = {},
+        doc = "Map from a refcat name to a tar.gz file containing the sharded catalog. May be empty.",


In what situation would an empty refcats make sense?

If a dataset doesn't come with refcats?

Also, I can't really default to anything else, since the choice of catalogs and especially filenames is going to be dataset-specific. Arguably this info shouldn't be in a config at all, but I don't see another way to let the dataset set the values.

We no longer deviate from PEP8 on keyword whitespace.

Using each task's parser through its IngestTask.ArgumentParser field will make it easier to configure ingestion, because the user only needs to specify the ingestion task. For each IngestionTask implementation there is only one correct of parser, so automating the choice is a good idea anyway.

The fact that ingestion must parse a fictional command line is difficult to work with. Hide everything except the argument list in a dedicated function.

Refcats are preprocessed, preingested data, so they should be handled similarly to templates.

The config contains all information that the dataset would need to provide to allow it to be ingested. This includes links to obs package specializations of ingestion tasks, but not data provided by obs packages directly.

This commit updates documentation (in ap_verify) corresponding to updates to ap_verify_dataset_template.

Previously, ingestion could exclude bad calibs but not bad science images.

The task contains the code that was previously in free functions, but does not yet use subtasks or have all the unneccessary boilerplate removed.

Despite not being a CmdLineTask, DatasetIngestTask will respect obs-package- and camera-specific override files following standard Stack conventions.

The use of subtasks lets us remove all metadata management from the ingestion code. However, since ingestion tasks were not designed for use as subtasks, some manual work is needed to retain obs package overrides for them (this work has, for now, been absorbed into the dataset-specific config file).

The old method unnecessarily conflated calib (flat, bias, etc.) and defect handling, making the code harder to understand.

Now _ingest* is responsible for skipping logic, dataset enumeration, and status messages. _doIngest* is responsible for the mechanics of ingestion and error handling.

The three main ingestion steps now use the same code to identify files to process. This greatly simplifies _doIngestDefects as well.

Changing Dataset.makeCompatibleRepo from a filesystem copy to a Butler chaining runs afoul of a couple of Butler 2 bugs, but it overall reduces the amount of low-level repository manipulation in ap_verify.

kfindeisen requested a review from mrawls April 5, 2018 20:03

mrawls approved these changes Apr 19, 2018

View reviewed changes

kfindeisen added 19 commits April 20, 2018 16:27

Update flake8 config.

c2fb620

We no longer deviate from PEP8 on keyword whitespace.

Factor out argument parsing.

283094b

The fact that ingestion must parse a fictional command line is difficult to work with. Hide everything except the argument list in a dedicated function.

Separate refcats from science data.

24c8caf

Refcats are preprocessed, preingested data, so they should be handled similarly to templates.

Define DatasetIngestConfig.

a5a6720

The config contains all information that the dataset would need to provide to allow it to be ingested. This includes links to obs package specializations of ingestion tasks, but not data provided by obs packages directly.

Add dataset-specific ingestion config to dataset format.

934d77e

This commit updates documentation (in ap_verify) corresponding to updates to ap_verify_dataset_template.

Defer task configuration to DatasetIngestConfig.

1e31a43

Make defect ingestion optional.

11a4c61

Allow science file exclusion.

839c254

Previously, ingestion could exclude bad calibs but not bad science images.

Allow calib validity periods.

823d39b

Create DatasetIngestTask.

f63333e

The task contains the code that was previously in free functions, but does not yet use subtasks or have all the unneccessary boilerplate removed.

Support config overrides for DatasetIngestTask.

e144a7e

Despite not being a CmdLineTask, DatasetIngestTask will respect obs-package- and camera-specific override files following standard Stack conventions.

Remove _doIngestCalibs.

8772e08

The old method unnecessarily conflated calib (flat, bias, etc.) and defect handling, making the code harder to understand.

Standardize DatasetIngestTask's methods.

49daa5e

Now _ingest* is responsible for skipping logic, dataset enumeration, and status messages. _doIngest* is responsible for the mechanics of ingestion and error handling.

Remove redundant code from _ingestDefects.

36891a7

Standardize ingestion file handling.

17fcaac

The three main ingestion steps now use the same code to identify files to process. This greatly simplifies _doIngestDefects as well.

Remove references to DECam data products.

ab5af45

Chain template repo to input repo instead of linking.

9fcd047

Changing Dataset.makeCompatibleRepo from a filesystem copy to a Butler chaining runs afoul of a couple of Butler 2 bugs, but it overall reduces the amount of low-level repository manipulation in ap_verify.

kfindeisen force-pushed the tickets/DM-13530 branch from 24a6479 to 9fcd047 Compare April 23, 2018 19:07

kfindeisen merged commit 9fcd047 into master Apr 23, 2018

kfindeisen mentioned this pull request Aug 9, 2018

DM-14848: ingest_dataset.py must be run from final working directory #44

Merged

kfindeisen deleted the tickets/DM-13530 branch November 30, 2018 22:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-13530: Generalize ingestion to non-HiTS data #26

DM-13530: Generalize ingestion to non-HiTS data #26

kfindeisen commented Apr 5, 2018

mrawls left a comment

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

kfindeisen Apr 19, 2018

kfindeisen Apr 19, 2018 •

edited

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

kfindeisen Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

kfindeisen Apr 19, 2018

mrawls Apr 19, 2018

kfindeisen Apr 19, 2018

mrawls Apr 19, 2018

mrawls Apr 19, 2018

kfindeisen Apr 19, 2018

DM-13530: Generalize ingestion to non-HiTS data #26

DM-13530: Generalize ingestion to non-HiTS data #26

Conversation

kfindeisen commented Apr 5, 2018

mrawls left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kfindeisen Apr 19, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kfindeisen Apr 19, 2018 •

edited