
DM-14823: Simplify PipelineTask API for common use case #60

Merged

merged 2 commits into master from tickets/DM-14823 on Aug 4, 2018

Conversation

andy-slac (Contributor):

The run() method interface was changed to accept only input data as
keyword arguments; it no longer has arguments for output data. This
should simplify task implementation for the most common use case, a
single output DataId (per dataset type). More complex cases will be
handled by re-implementing the new method runWithOutputs(), which has
arguments for both input data and output DataIds.

A piece of runQuantum() code was refactored into a separate method
_saveStruct(), which can also be re-implemented in a subclass.

TallJimbo (Member) left a comment:

Some small comments, and one request to add another convenience change to the nature of run (and runWithOutputs) args.

inputData : `dict`
Dictionary whose keys are the names of the configuration fields
describing input dataset types and values are lists of
Python-domain data objects retrieved from data butler.
TallJimbo (Member):

It'd be really nice if we could make the dictionary value a single item instead of a list in cases where we know the list couldn't have more than one element. We could try to guess when that's the case by comparing the DataUnits of the Quantum with the DataUnits of this DatasetType, but it's probably clearer and less error-prone to just have a way for PipelineTask to declare that a DatasetType is a scalar (e.g. in InputDatasetConfig).

This is more important for run than runWithOutputs, of course, but I think it makes sense to put that transformation in runQuantum so runWithOutputs would see it as well.

And of course the same is true for outputs as well as inputs.

andy-slac (Contributor, Author):

So, I guess the proposal is to add an extra attribute scalar to In/OutputDatasetConfig and make it False by default?

TallJimbo (Member):

Yup, sounds good. @natelust, will that require any extra work to be included in the syntactic sugar for those config objects?

natelust (Contributor):

I think it would just be one small change to add that to the wrapper function signature; everything else would propagate through.

andy-slac (Contributor, Author):

Yes, I did that and also added something to the docstring for the wrapper method.
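For context, a minimal sketch of what that config change might look like (a sketch only; the class shape and field names here are assumptions based on the discussion, not the merged code):

    import lsst.pex.config as pexConfig

    class InputDatasetConfig(pexConfig.Config):
        """Describes one input dataset type for a PipelineTask (sketch)."""
        name = pexConfig.Field(dtype=str, doc="Dataset type name.")
        units = pexConfig.ListField(dtype=str, default=[],
                                    doc="DataUnit names for this dataset type.")
        # The new flag proposed above: declare that at most one dataset of
        # this type appears per quantum, so run() receives a single object
        # instead of a one-element list. False by default, as agreed.
        scalar = pexConfig.Field(dtype=bool, default=False,
                                 doc="If True, treat this dataset type as scalar.")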

For input dataset types argument values will be lists of the data
object retrieved from data butler. For output dataset types argument
values will be the lists of units from DataRefs in a Quantum.
This method is called by :py:meth:`runQuantum` to operate on input
TallJimbo (Member):

You should just be able to say runQuantum without the :py:meth: business; I'm not sure where the latter is required (@jonathansick?) but I'm pretty sure links in docstrings will work fine without it as long as there are no namespace-qualification issues (and there shouldn't be for a method of the same class).

(Several other instances of this throughout the changeset that should also be fixed.)

natelust (Contributor):

I think that is correct, only the backticks should be needed
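For example, the quoted docstring line above would then read simply:

    This method is called by `runQuantum` to operate on input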

# store produced output data
self._saveStruct(struct, outputDataRefs, butler)

def _saveStruct(self, struct, outputDataRefs, butler):
TallJimbo (Member):

I don't see a good reason to treat _saveStruct and runWithOutputs differently w.r.t. leading underscores - in C++, I'd have called them both (non-pure) virtual protected, and while I don't have a strong opinion about whether that merits an underscore in Python, I think they should be treated consistently.

andy-slac (Contributor, Author):

In my (twisted) mental model _saveStruct() is indeed protected (or even private) and not to be called by clients directly, while runWithOutputs() is public because it is the same as run(), only better :) Anyway, I'll make things uniform: no underscores.
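Concretely, the quoted call above would become (a two-line sketch; surrounding logic elided):

    # store produced output data; saveStruct is public and overridable,
    # with no leading underscore, matching runWithOutputs
    self.saveStruct(struct, outputDataRefs, butler)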

@@ -309,24 +364,47 @@ def runQuantum(self, quantum, butler):
inputs[key] = [butler.get(dataRef.datasetType.name, dataRef.dataId)
for dataRef in dataRefs]

natelust (Contributor):

There is no good way to put a comment on a non-diff line, so I am putting this here. The changes made require that the documentation in the docstring for runQuantum be updated; the described behavior is not what happens any more.

Also in the docstring, "must be lists of data bjects corresponding":
bjects -> objects

@@ -244,22 +244,77 @@ def getDatasetTypes(cls, config, configClass):
dsTypes[key] = cls.makeDatasetType(value)
return dsTypes

def run(self, *args, **kwargs):
def runWithOutputs(self, inputData, outputDataIds):
natelust (Contributor):

I have very strong reservations about a method called runWithOutputs whose default implementation does nothing with the outputDataIds variable, or anything output-related.

TallJimbo (Member):

@natelust and I discussed this a bit in person, and we came up with the following recommendation:

  • Let's rename runWithOutputs to adaptArgsAndRun (open to counter-proposals, but we think that better captures what it's doing).
  • Let's add an inputDataIds argument to adaptArgsAndRun containing a dictionary of input data IDs, with the same structure as outputDataIds. In all of the real cases I could think of where the concrete PipelineTask wants output IDs, it would want the input IDs too. We'd guarantee that the lists of IDs in inputDataIds and the lists of unpersisted objects in inputData are zip-iterable, in case an override of adaptArgsAndRun needs them. We think it's better to keep those as separate-but-related data structures so the default implementation of adaptArgsAndRun stays simple (as it is now).

I still think it'd be good to put that "make guaranteed single-element lists into single objects" transformation I brought up elsewhere on this ticket into runQuantum rather than adaptArgsAndRun (so the dicts passed to adaptArgsAndRun would sometimes have single objects as dict values rather than lists).
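A minimal sketch of the recommended default implementation (signature and behavior as proposed above; the merged code may differ):

    def adaptArgsAndRun(self, inputData, inputDataIds, outputDataIds):
        # inputData: dict mapping config field name -> data object(s).
        # inputDataIds / outputDataIds: parallel dicts of DataIds;
        # inputDataIds[name] is zip-iterable against inputData[name].
        # The default ignores the IDs and forwards the data to run().
        return self.run(**inputData)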

andy-slac force-pushed the tickets/DM-14823 branch 2 times, most recently from e42428b to af17776, on August 3, 2018 18:44
andy-slac (Contributor, Author):

@TallJimbo, @natelust, I think I have addressed all issues, including adding the scalar field. Could you check quickly that it looks OK now?

TallJimbo (Member) left a comment:

Two very minor comments. Happy for you to merge after addressing those; I don't need another look unless you have a question for me.

@@ -30,6 +30,22 @@
from .task import Task


class ScalarError(TypeError):
"""Exception raised when satset type is configured as scalar
TallJimbo (Member):

typo: "satset"

outUnits = {}
dataIds = [dataRef.dataId for dataRef in dataRefs]
data = [butler.get(dataRef.datasetType.name, dataRef.dataId)
for dataRef in dataRefs]
TallJimbo (Member):

After DM-14822 this can just be

data = [butler.get(dataRef) for dataRef in dataRefs]

andy-slac force-pushed the tickets/DM-14823 branch 2 times, most recently from 3318df4 to 4f1c34c, on August 3, 2018 22:06
andy-slac changed the base branch from tickets/DM-15220 to master on August 3, 2018 22:06
andy-slac (Contributor, Author):

Rebased to master after DM-15220 was merged.

The run() method interface was changed to accept only input data as
keyword arguments; it no longer has arguments for output data. This
should simplify task implementation for the most common use case, a
single output DataId (per dataset type). More complex cases will be
handled by re-implementing the new method runWithOutputs(), which has
arguments for both input data and output DataIds.

If the scalar field is set to True in the configuration then for the
corresponding dataset type we always unpack the corresponding DataId or
data list, and we do the opposite when saving data produced by the task.
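A hedged sketch of the unpacking behavior that second commit describes (the helper name and message text are assumptions; ScalarError is the exception added in the diff above):

    def unpackIfScalar(datasetTypeName, isScalar, values):
        """Return the single item for a scalar dataset type, the list otherwise."""
        if not isScalar:
            return values
        if len(values) != 1:
            raise ScalarError("expected exactly one item for scalar dataset "
                              "type {}, got {}".format(datasetTypeName, len(values)))
        return values[0]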
andy-slac merged commit 41f434e into master on Aug 4, 2018
ktlim deleted the tickets/DM-14823 branch on August 25, 2018 06:50