DM-13163: Refactor ap_pipe to use CmdLineTask primitives #17

kfindeisen · 2018-02-28T01:30:18Z

This is a complete refactoring of ap_pipe, preserving existing functionality as much as possible (some changes to the command-line interface were unavoidable). No attempt was made to make use of features enabled by CmdLineTask, such as parallelism support or obs_ package handling; such changes are out of scope for the ticket.

ApPipeTask requires both a custom runner and a custom parser to support the source association database and input template repositories. Unfortunately, due to limited support for subclassing, this meant duplicating a large amount of code from pipe.base.TaskRunner and pipe.base.ArgumentParser. I hope we can remove the duplicate code quickly as the indicated tickets get resolved.

r-owen

Please look into fixing DM-11865. I think it would be much less code than you had to add here to work around the problem.

Also, if you have not already done so, please look into using task metadata instead of returning fullMetadata in various places.

Other than that and a few minor documentation issues it looks fine.

r-owen · 2018-03-20T21:01:01Z

python/lsst/ap/pipe/apPipeParser.py

+        namespace = argparse.Namespace()
+        namespace.input = _fixPath(DEFAULT_INPUT_NAME, args[0])
+        if not os.path.isdir(namespace.input):
+            self.error("Error: input=%r not found" % (namespace.input,))


"not found or not a directory" would be more accurate
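A minimal sketch of the suggested wording, using plain `argparse` rather than the ap_pipe parser. `check_input_dir` and the `input` positional are hypothetical names chosen for illustration. Note that `ArgumentParser.error` already prefixes its message with "error:", so the leading "Error: " in the message may be redundant.

```python
import argparse
import os


def check_input_dir(parser, path):
    """Return the absolute path, or abort parsing if it is not a directory."""
    path = os.path.abspath(path)
    # os.path.isdir is False both for missing paths and for paths that exist
    # but are not directories, so the message should cover both cases.
    if not os.path.isdir(path):
        parser.error("input=%r not found or not a directory" % (path,))
    return path


# Hypothetical stand-in parser with a single positional repository argument.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("input")
args = parser.parse_args([os.getcwd()])
input_dir = check_input_dir(parser, args.input)
```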

r-owen · 2018-03-20T21:09:40Z

python/lsst/ap/pipe/apPipeParser.py

+        args : `list`, optional
+            Argument list; if `None` then ``sys.argv[1:]`` is used.
+        log : `lsst.log.Log`, optional
+            `~lsst.log.Log` instance; if `None` use the default log.


Why the twiddle before lsst.log and various other classes in the doc string?

I don't know. That was copied verbatim from pipe_base.

It's probably a typo there. I suggest fixing it here.

Found it, it suppresses the namespace in the link text: https://developer.lsst.io/restructuredtext/style.html#customizing-the-link-text
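For anyone else puzzled by the twiddle, here is a toy model of the rule. It is not Sphinx code, and `link_text` is a hypothetical helper that only mimics how the displayed text is derived from a cross-reference target.

```python
def link_text(target):
    """Approximate the text Sphinx displays for a Python cross-reference target.

    A leading ``~`` keeps the link pointing at the full name but shows only
    the last dotted component.
    """
    if target.startswith("~"):
        return target[1:].rsplit(".", 1)[-1]
    return target


short = link_text("~lsst.log.Log")   # displayed without the namespace
full = link_text("lsst.log.Log")     # displayed with the namespace
```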

r-owen · 2018-03-20T21:50:48Z

python/lsst/ap/pipe/apPipeParser.py

+
+class ApPipeParser(pipeBase.ArgumentParser):
+    """Custom argument parser to handle multiple input repos.
+    """


This seems like far more work than adding multiple input repo support to lsst.pipe.base.ArgumentParser as per DM-11865. Furthermore, resolving that ticket would result in more general, reusable code. If you have not already done so please look into how difficult it would be to add something like ArgumentParser.addRepoArgument to add an additional argument to specify an input (or input/output) repository. With any luck it would require just copying some of this code and ditching almost all the rest of this code. I am willing to help.

Actually, it's exactly the same amount of work, less the overhead of an RFC and the effort of making sure the solution is generic.

I agree that putting this functionality in pipe_base would be a better solution in the long run. However, I'm not comfortable having this ticket, which blocks all ap_pipe work, be itself blocked on the discussion, implementation, and review of a change to pipe_base, especially if we do get significant (and, I agree, unreasonable) pushback as predicted. I can't imagine the whole process taking less than two weeks.

Given that I already have a workaround, and that it would be easy to delete it once a central solution is available (remove all but __init__, and express --template in terms of addRepoArgument; no interface changes needed), I'd prefer to commit the existing code and then push for changes to pipe_base.

I'm quite surprised, but if so, then I guess it's OK. I thought you had to duplicate a lot of code from the argument parser that would not have to be duplicated if you added the feature to the standard argument parser. I think the RFC and code to fix this would be quick, but it might take a bit of work finding or creating a suitable extra repo to use to test the change.
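To make the proposal concrete, a rough sketch of what an `addRepoArgument`-style helper could look like. This is built on plain `argparse`, not on `lsst.pipe.base.ArgumentParser`; `addRepoArgument` does not exist in pipe_base at the time of this review, and the class name, signature, and behavior here are all assumptions.

```python
import argparse
import os


class RepoArgumentParser(argparse.ArgumentParser):
    """Toy parser with a hypothetical ``addRepoArgument`` helper."""

    def addRepoArgument(self, name, helpText=None):
        # Assumed behavior: an optional extra repository, stored as an
        # absolute path and left as None when the flag is not given.
        self.add_argument(name, type=os.path.abspath, default=None,
                          help=helpText)


parser = RepoArgumentParser(prog="ap_pipe_demo")
parser.add_argument("input", type=os.path.abspath)
parser.addRepoArgument("--template",
                       helpText="path to the template repository")
args = parser.parse_args(["in_repo", "--template", "template_repo"])
```

With a helper along these lines in the base parser, the subclass would presumably shrink to an `__init__` that registers `--template`, as described above.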

r-owen · 2018-03-20T21:52:02Z

python/lsst/ap/pipe/apPipeTaskRunner.py

+class ApPipeTaskRunner(pipeBase.ButlerInitializedTaskRunner):
+
+    def makeTask(self, parsedCmd=None, args=None):
+        """Construct an ApPipeTask with both a Butler and an database.


"and an database" -> "and a database"

r-owen · 2018-03-20T21:52:49Z

python/lsst/ap/pipe/apPipeTaskRunner.py

@@ -0,0 +1,152 @@
+#
+# LSST Data Management System
+# Copyright 2017 LSST Corporation.


Please do a quick scan for outdated copyright dates in modified files and either add 2018 or switch to the new format.

r-owen · 2018-03-20T22:44:31Z

python/lsst/ap/pipe/ap_pipe.py

+        -----
+        The input repository corresponding to ``sensorRef`` must already contain the refcats.
+        By default, the configuration for astrometric reference catalogs uses Gaia
+        and the configuration for photometry reference catalogs uses Pan-STARRS.


I suggest removing this remark about defaults as it is a duplicate of information found elsewhere and so could easily become outdated. For defaults users should read the config.

r-owen · 2018-03-20T22:47:06Z

python/lsst/ap/pipe/ap_pipe.py

+            A list of parsed data IDs for templates to use. Only used if
+            ``config.differencer.getTemplate`` is configured to do so.
+            Data IDs must contain only a visit. Specific implementations of
+            ``getTemplate`` may impose additional restrictions.


Why not allow additional keys? I'm amused by "Specific implementations of getTemplate may impose additional restrictions." because this already seems so draconian that I wonder what's left? What if a different implementation of differencing needs extra keys in order to identify a template?

This requirement is imposed by ImageDifferenceTask: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/imageDifference.py#L792. They don't bother documenting this parameter at all in run, but I don't see why they'd impose it at the command-line parser level if the implementation didn't (potentially?) require it.

This may be a reference to DM-11865: "ImageDifference should allow coadd templates and output to live in different repos"? Templates are treated rather like static data products in ap_pipe and that can create some interesting situations.

This is the data ID used for (so far) calexp templates. I don't think DM-11865 is relevant.

On second thought, this may all be moot, because it occurs to me that I'm using inconsistent levels of abstraction by assuming ImageDifferenceTask but not GetCalexpAsTemplateTask. If I replace the last two sentences with "config.differencer or its subtasks may restrict the allowed dataIds", would there be any problem, Russell?

@kfindeisen I think the help text you linked to is saying that only visit is used; other arguments are ignored. If you are suggesting replacing the text that begins "Data IDs must contain..." then yes that would be fine.

r-owen · 2018-03-20T22:56:29Z

python/lsst/ap/pipe/ap_pipe.py

+        -------
+        fullMetadata : `lsst.daf.base.PropertySet`
+            The metadata produced by the run. Intended as a transitional API
+            for ap_verify, and may be removed later.


How is fullMetadata any different than the usual task metadata that is available to every task as attribute metadata? If this task calls subtasks in the usual way (as it seems to) then they can write to their own metadata attribute and there's no need to pass metadata around and it's all kept in the obvious hierarchy.

This comment applies widely to this ticket as I see fullMetadata in many places.

Task.metadata does not include subtask metadata, and also uses an incompatible convention for keys. It's very different from Task.getFullMetadata().

The problem is that depending on whether ap_verify tries to call pipeline steps individually, calls ApPipeTask.parseAndRun(), and/or tries to recover the metadata from the repository using the broken API for that, we may or may not have access to fullMetadata(). The best way for ap_verify to interact with ap_pipe, including how to handle metadata, is an ongoing design discussion, so at the moment I'm just preserving the old behavior.

Actually, I think I see a way to preserve ap_verify's current functionality without polluting ap_pipe's interface. If it works I can remove these bits.

... and done.
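For readers following the metadata discussion, a toy illustration of the distinction between per-task and full metadata. `ToyTask` is a hypothetical stand-in, not `lsst.pipe.base.Task`; the key convention and the recorded values are made up for the example.

```python
class ToyTask:
    """Stand-in for a task that owns its metadata and a list of subtasks."""

    def __init__(self, name, parent=None):
        self.fullName = name if parent is None else parent.fullName + "." + name
        self.metadata = {}   # only what this task itself recorded
        self.subtasks = []
        if parent is not None:
            parent.subtasks.append(self)

    def getFullMetadata(self):
        """Collect metadata for this task and all subtasks, keyed by task name."""
        full = {self.fullName: dict(self.metadata)}
        for sub in self.subtasks:
            full.update(sub.getFullMetadata())
        return full


pipeline = ToyTask("apPipe")
differencer = ToyTask("differencer", parent=pipeline)
pipeline.metadata["nVisits"] = 1          # illustrative value
differencer.metadata["nSources"] = 42     # illustrative value
fullMetadata = pipeline.getFullMetadata()
```

Here `pipeline.metadata` never sees the differencer's entries, while the full metadata gathers both under different keys, which is why the two cannot be swapped without changing how callers look things up.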

r-owen · 2018-03-20T22:57:18Z

python/lsst/ap/pipe/ap_pipe.py

+            for ap_verify, and may be removed later.
+        taskResults : `lsst.pipe.base.Struct`
+            The output of the task assigned to ``config.differencer`
+            (by default, `lsst.pipe.tasks.ImageDifferenceTask.run`)


I think "the output of the task" is too vague; it's the output of differencer.run, where run is crucial for looking up the expected output. Also, again, it worries me to document a default in multiple places, and the config is canonical.

I would personally say "the output of differencer.run" and leave it at that.

Unfortunately it is not "the output of config.differencer.run" because config.differencer is not an instantiation of the task. If you feel you need to teach users about subtasks, perhaps you could say something like "the output of differencer.run where differencer is the image differencing task specified in the config".

This shows up a lot. I won't note it again, but whatever changes you make here, please look through the method docs to change it everywhere.

"The output of differencer.run" would be fine.

r-owen · 2018-03-20T23:29:22Z

python/lsst/ap/pipe/ap_pipe.py

+            for ap_verify, and may be removed later.
+        taskResults : `lsst.pipe.base.Struct`
+            The output of the task assigned to ``config.ccdProcessor`
+            (by default, `lsst.pipe.tasks.ProcessCcdTask.run`)


It actually returns an lsst.pipe.base.Struct containing the specified named fields. I'm afraid you have omitted any mention of Struct here and in most or all of the other Returns documentation for this class.

The DB is an output of the task, so it should be associated with the task object like the Butler is (also, allowing each run to put results in a different association database rather defeats the purpose). This design is also forward-compatible with the use of the Butler to manage association output; just delete the extra argument. Some hacking was needed to get around the fact that currently the DB location must be specified by config, but that code can be removed with no API changes once AssociationTask doesn't need config-time specification. As an immediate benefit, AssociationTask can now be treated as an ordinary subtask.

ap_verify has access to the task object at the moment, so it can query the full metadata directly. Changes to its task management may require support on the ap_pipe side in the future, but that's best deferred until we have a clear picture of what ap_verify needs.

kfindeisen requested a review from r-owen February 28, 2018 01:30

kfindeisen mentioned this pull request Feb 28, 2018

DM-13163: Refactor ap_pipe to use CmdLineTask primitives lsst/ap_verify#22

Merged

r-owen requested changes Mar 20, 2018

kfindeisen added 23 commits March 21, 2018 17:27

Create ApPipeConfig.

b1832c9

The config settings from _doProcessCcd, _doDiffIm, and _doAssociation have been refactored into ApPipeConfig's defaults. The one exception is _doAssociation's output database file, which must be provided at run time.

Butlerize doProcessCcd.

09f3fe4

doProcessCcd now accepts a dataRef pointing to the raw data. Repository management and skip logic have been centralized to runPipelineAlone.

Factor out dataId parsing.

4b7cb42

Butlerize doDiffIm.

8dc0b60

doDiffIm now accepts a dataRef pointing to the calibrated data. Repository management and skip logic have been centralized to runPipelineAlone.

Butlerize doAssociation.

1279b97

doAssociation now accepts a dataRef pointing to the differenced data. Repository management has been centralized to runPipelineAlone, although doAssociation still needs a path to initialize the database.

Merge output repositories.

4cff71c

CmdLineTask will impose a single output repository (which may also be the input repository) at parse time. The current implementation of runPipelineAlone is not quite as flexible as CmdLineTask, but it's close enough for development purposes.

Minimize number of independent dataRefs.

68e53fb

This minimizes the amount of information that needs to be provided to ApPipeTask to run it. The remainder will be flagged as a workaround pending implementation of RFC-352.

Move doIngestTemplates to ap_verify.

67712a9

This hack has been replaced with a multi-input Butler in ap_pipe. ap_verify still needs it in order to simulate operations repositories.

Butlerize skip tests.

7fca6e2

The new tests no longer require knowledge of repository locations, and as a bonus are less likely to be confused by partial runs or multiple CCDs.

Decouple template preprocessing from dataId parsing.

1f66e02

The template dataRefs can now be given entirely in terms of the science dataRefs.

Remove manual visit parsing.

925e01d

Skip messages should be info level.

249a4c1

Move pipeline code into ApPipeTask.

0f1e469

The current version of the task does not use most of the features of CmdLineTask (e.g., subtasks); these will be factored in in future commits.

Use ApPipeTask subtasks.

074f40d

runAssociation uses a local task instead of a subtask because the output database location depends on the output repository yet must be specified by config.

Move templateId configs to their own file.

0785131

calexpTemplates.py will be necessary once ap_pipe is run as a CmdLineTask.

Use diffIm output self-consistently.

834f765

This change will prevent bugs caused by picking a different output for ImageDifferenceTask.

Delegate to ApPipeTaskRunner.

14ea18a

The current runner has some hardcoded hacks to deal with the lack of a custom command line parser. These will be removed in future commits. It also has some hacks to deal with the database argument to ApPipeTask, though these will likely be longer-lived.

Implement ApPipeParser.

bf18265

The implementation has considerable code duplication with pipe_base in order to support template repositories. I may try to push for moving this support to pipe.base.ArgumentParser in the future.

Replace --skip with --reuse-outputs-from.

9e9a4f1

The --reuse-outputs-from argument is conventional for CmdLineTasks, and provides greater flexibility.

Code style cleanup.

70bd355

Support flake8 testing with Travis.

9777884

kfindeisen force-pushed the tickets/DM-13163 branch from 4bc47c8 to 8a34909 Compare March 22, 2018 18:29

kfindeisen merged commit 8a34909 into master Mar 22, 2018

kfindeisen deleted the tickets/DM-13163 branch February 25, 2019 19:51
