Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-13163: Refactor ap_pipe to use CmdLineTask primitives #17

Merged
merged 23 commits into from Mar 22, 2018

Conversation

kfindeisen
Copy link
Member

This is a complete refactoring of ap_pipe, preserving existing functionality as much as possible (some changes to the command-line interface were unavoidable). No attempt was made to make use of features enabled by CmdLineTask, such as parallelism support or obs_ package handling; such changes are out of scope for the ticket.

ApPipeTask requires both a custom runner and a custom parser to support the source association database and input template repositories. Unfortunately, due to limited support for subclassing, this meant duplicating a large amount of code from pipe.base.TaskRunner and pipe.base.ArgumentParser. I hope we can remove the duplicate code quickly as the indicated tickets get resolved.

Copy link
Contributor

@r-owen r-owen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look into fixing DM-111865. I think it would be much less code than you had to add here to work around the problem.

Also, if you have not already done so, please also look into using task metadata instead of returning fullMetadata in various places.

Other than that and a few minor documentation issues it looks fine.

namespace = argparse.Namespace()
namespace.input = _fixPath(DEFAULT_INPUT_NAME, args[0])
if not os.path.isdir(namespace.input):
self.error("Error: input=%r not found" % (namespace.input,))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"not found or not a directory" would be more accurate

args : `list`, optional
Argument list; if `None` then ``sys.argv[1:]`` is used.
log : `lsst.log.Log`, optional
`~lsst.log.Log` instance; if `None` use the default log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the twiddle before lsst.log and various other classes in the doc string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. That was copied verbatim from pipe_base.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a typo there. I suggest fixing it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it, it suppresses the namespace in the link text: https://developer.lsst.io/restructuredtext/style.html#customizing-the-link-text


class ApPipeParser(pipeBase.ArgumentParser):
"""Custom argument parser to handle multiple input repos.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like far more work than adding multiple input repo support to lsst.pipe.base.ArgumentParser as per DM-11865. Furthermore, resolving that ticket would result in more general, reusable code. If you have not already done so please look into how difficult it would be to add something like ArgumentParser.addRepoArgument to add an additional argument to specify an input (or input/output) repository. With any luck it would require just copying some of this code and ditching almost all the rest of this code. I am willing to help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's exactly the same amount of work, less the overhead of an RFC and the effort of making sure the solution is generic.

I agree that putting this functionality in pipe_base would be a better solution in the long run. However, I'm not comfortable having this ticket, which blocks all ap_pipe work, be itself blocked on the discussion, implementation, and review of a change to pipe_base, especially if we do get significant (and, I agree, unreasonable) pushback as predicted. I can't imagine the whole process taking less than two weeks.

Given that I already have a workaround, and that it would be easy to delete it once a central solution is available (remove all but __init__, and express --template in terms of addRepoArgument; no interface changes needed), I'd prefer to commit the existing code and then push for changes to pipe_base.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite surprised, but if so, then I guess it's OK. I though you had to duplicate a lot of code from the argument parser that would not have to be duplicated if you added the feature to the standard argument parser. I think the RFC and code to fix this would be quick, but it might take a bit of work finding or creating a suitable extra repo to use to test the change.

class ApPipeTaskRunner(pipeBase.ButlerInitializedTaskRunner):

def makeTask(self, parsedCmd=None, args=None):
"""Construct an ApPipeTask with both a Butler and an database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"and an database" -> "and a database"

@@ -0,0 +1,152 @@
#
# LSST Data Management System
# Copyright 2017 LSST Corporation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do a quick scan for outdated copyright dates in modified files and either add 2018 or switch to the new format.

-----
The input repository corresponding to ``sensorRef`` must already contain the refcats.
By default, the configuration for astrometric reference catalogs uses Gaia
and the configuration for photometry reference catalogs uses Pan-STARRS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest removing this remark about defaults as it is a duplicate of information found elsewhere and so could easily become outdated. For defaults users should read the config.

A list of parsed data IDs for templates to use. Only used if
``config.differencer.getTemplate`` is configured to do so.
Data IDs must contain only a visit. Specific implementations of
``getTemplate`` may impose additional restrictions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not allow additional keys? I'm amused by "Specific implementations of getTemplate may impose additional restrictions." because this already seems so draconian that I wonder what's left? What if a different implementation of differencing needs extra keys in order to identify a template?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requirement is imposed by ImageDifferenceTask: https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/imageDifference.py#L792. They don't bother documenting this parameter at all in run, but I don't see why they'd impose it at the command-line parser level if the implementation didn't (potentially?) require it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a reference to DM-11865: "ImageDifference should allow coadd templates and output to live in different repos"? Templates are treated rather like static data products in ap_pipe and that can create some interesting situations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the data ID used for (so far) calexp templates. I don't think DM-11865 is relevant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, this may all be moot, because it occurs to me that I'm using inconsistent levels of abstraction by assuming ImageDifferenceTask but not GetCalexpAsTemplateTask. If I replace the last two sentences with "config.differencer or its subtasks may restrict the allowed dataIds", would there be any problem, Russell?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kfindeisen I think the help text you linked to is saying that only visit is used; other arguments are ignored. If you are suggesting replacing the text that begins "Data IDs must contain..." then yes that would be fine.

-------
fullMetadata : `lsst.daf.base.PropertySet`
The metadata produced by the run. Intended as a transitional API
for ap_verify, and may be removed later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is fullMetadata any different than the usual task metadata that is available to every task as attribute metadata? If this task calls subtasks in the usual way (as it seems to) then they can write to their own metadata attribute and there's no need to pass metadata around and it's all kept in the obvious hierarchy.

This comment applies widely to this ticket as I see fullMetadata in many places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task.metadata does not include subtask metadata, and also uses an incompatible convention for keys. It's very different from Task.getFullMetadata().

The problem is that depending on whether ap_verify tries to call pipeline steps individually, calls ApPipeTask.parseAndRun(), and/or tries to recover the metadata from the repository using the broken API for that, we may or may not have access to fullMetadata(). The best way for ap_verify to interact with ap_pipe, including how to handle metadata, is an ongoing design discussion, so at the moment I'm just preserving the old behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I see a way to preserve ap_verify's current functionality without polluting ap_pipe's interface. If it works I can remove these bits.

... and done.

for ap_verify, and may be removed later.
taskResults : `lsst.pipe.base.Struct`
The output of the task assigned to ``config.differencer`
(by default, `lsst.pipe.tasks.ImageDifferenceTask.run`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "the output of the task" is too vague; it's the output of differencer.run, where run is crucial for looking up the expected output. Also, again, it worries me to document a default in multiple places, and the config is canonical.

I would personally say "the output of differencer.run" and leave it at that.

Unfortunately it is not "the output of config.differencer.run" because config.differencer is not an instantiation of the task. If feel you need to teach users about subtasks, perhaps you could say something like "the output of differencer.run where differencer is the image differencing task specified in the config".

This shows up a lot. I won't note it again, but whatever changes you make here, please look through the method docs to change it everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The output of differencer.run" would be fine.

for ap_verify, and may be removed later.
taskResults : `lsst.pipe.base.Struct`
The output of the task assigned to ``config.ccdProcessor`
(by default, `lsst.pipe.tasks.ProcessCcdTask.run`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually returns an lsst.pipe.base.Struct containing the specified named fields. I'm afraid you have omitted any mention of Struct here and in most or all of the other Returns documentation for this class.

The config settings from _doProcessCcd, _doDiffIm, and _doAssociation
have been refactored into ApPipeConfig's defaults. The one exception
is _doAssociation's output database file, which must be provided at
run time.
doProcessCcd now accepts a dataRef pointing to the raw data.
Repository management and skip logic have been centralized
to runPipelineAlone.
doDiffIm now accepts a dataRef pointing to the calibrated data.
Repository management and skip logic have been centralized
to runPipelineAlone.
doAssociation now accepts a dataRef pointing to the differenced
data. Repository management has been centralized to runPipelineAlone,
although doAssociation still needs a path to initialize the database.
CmdLineTask will impose a single output repository (which may also be
the input repository) at parse time. The current implementation of
runPipelineAlone is not quite as flexible as CmdLineTask, but it's
close enough for development purposes.
This minimizes the amount of information that needs to be provided
to ApPipeTask to run it. The remainder will be flagged as a workaround
pending implementation of RFC-352.
This hack has been replaced with a multi-input Butler in ap_pipe.
ap_verify still needs it in order to simulate operations repositories.
The new tests no longer require knowledge of repository locations,
and as a bonus are less likely to be confused by partial runs or
multiple CCDs.
The template dataRefs can now be given entirely in terms of the
science dataRefs.
The current version of the task does not use most of the features of
CmdLineTask (e.g., subtasks); these will be factored in in
future commits.
runAssociation uses a local task instead of a subtask because the
output database location depends on the output repository yet must be
specified by config.
The DB is an output of the task, so it should be associated with the
task object like the Butler is (also, allowing each run to put results
in a different association database rather defeats the purpose). This
design is also forward-compatible with the use of the Butler to manage
association output; just delete the extra argument.

Some hacking was needed to get around the fact that currently the DB
location must be specified by config, but that code can be removed
with no API changes once AssociationTask doesn't need config-time
specification.

As an immediate benefit, AssociationTask can now be treated as an
ordinary subtask.
calexpTemplates.py will be necessary once ap_pipe is run
as a CmdLineTask.
This change will prevent bugs caused by picking a different output
for ImageDifferenceTask.
The current runner has some hardcoded hacks to deal with the lack of
a custom command line parser. These will be removed in future commits.
It also has some hacks to deal with the database argument to
ApPipeTask, though these will likely be longer-lived.
The implementation has considerable code duplication with pipe_base in
order to support template repositories. I may try to push for moving
this support to pipe.base.ArgumentParser in the future.
The --reuse-outputs-from argument is conventional for CmdLineTasks,
and provides greater flexibility.
ap_verify has access to the task object at the moment, so it can query
the full metadata directly. Changes to its task management may require
support on the ap_pipe side in the future, but that's best deferred
until we have a clear picture of what ap_verify needs.
@kfindeisen kfindeisen merged commit 8a34909 into master Mar 22, 2018
@kfindeisen kfindeisen deleted the tickets/DM-13163 branch February 25, 2019 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants