
DM-14823: Simplify PipelineTask API for common use case #60

Merged

merged 2 commits into master from tickets/DM-14823 on Aug 4, 2018

Conversation

andy-slac (Contributor):

The run() method interface was changed to accept only input data as
keyword arguments; it no longer has arguments for output data. This
should simplify task implementation for the most common use case, a
single output DataId (per dataset type). More complex cases will be
handled by re-implementing the new method runWithOutputs(), which has
arguments for both input data and output DataIds.

A piece of runQuantum() code was refactored into a separate method
_saveStruct(), which can also be re-implemented in a subclass.

TallJimbo (Member) left a comment:

Some small comments, and one request to add another convenience change to the nature of run (and runWithOutputs) args.

inputData : `dict`
Dictionary whose keys are the names of the configuration fields
describing input dataset types and values are lists of
Python-domain data objects retrieved from data butler.
TallJimbo (Member):

It'd be really nice if we could make the dictionary value a single item instead of a list in cases where we know the list couldn't have more than one element. We could try to guess when that's the case by comparing the DataUnits of the Quantum with the DataUnits of this DatasetType, but it's probably clearer and less error-prone to just have a way for PipelineTask to declare that a DatasetType is a scalar (e.g. in InputDatasetConfig).

This is more important for run than runWithOutputs, of course, but I think it makes sense to put that transformation in runQuantum so runWithOutputs would see it as well.

And of course the same is true for outputs as well as inputs.

andy-slac (Contributor, Author):

So, I guess the proposal is to add an extra attribute scalar to In/OutputDatasetConfig and make it False by default?

TallJimbo (Member):

Yup, sounds good. @natelust, will that require any extra work to be included in the syntactic sugar for those config objects?

natelust (Contributor):

I think it would just be one small change to add that to the wrapper function signature; everything else would propagate through.

andy-slac (Contributor, Author):

Yes, I did that and also added something to the docstring for the wrapper method.
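For context, a minimal sketch of what that config change might look like (a sketch only; the class shape and field names here are assumptions based on the discussion, not the merged code):

    import lsst.pex.config as pexConfig

    class InputDatasetConfig(pexConfig.Config):
        """Describes one input dataset type for a PipelineTask (sketch)."""
        name = pexConfig.Field(dtype=str, doc="Dataset type name.")
        units = pexConfig.ListField(dtype=str, default=[],
                                    doc="DataUnit names for this dataset type.")
        # The new flag proposed above: declare that at most one dataset of
        # this type appears per quantum, so run() receives a single object
        # instead of a one-element list. False by default, as agreed.
        scalar = pexConfig.Field(dtype=bool, default=False,
                                 doc="If True, treat this dataset type as scalar.")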

For input dataset types argument values will be lists of the data
object retrieved from data butler. For output dataset types argument
values will be the lists of units from DataRefs in a Quantum.
This method is called by :py:meth:`runQuantum` to operate on input
TallJimbo (Member):

You should just be able to say runQuantum without the :py:meth: business; I'm not sure where the latter is required (@jonathansick?) but I'm pretty sure links in docstrings will work fine without it as long as there are no namespace-qualification issues (and there shouldn't be for a method of the same class).

(Several other instances of this throughout the changeset that should also be fixed.)

natelust (Contributor):

I think that is correct, only the backticks should be needed
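For example, the quoted docstring line above would then read simply:

    This method is called by `runQuantum` to operate on input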

# store produced output data
self._saveStruct(struct, outputDataRefs, butler)

def _saveStruct(self, struct, outputDataRefs, butler):
TallJimbo (Member):

I don't see a good reason to treat _saveStruct and runWithOutputs differently w.r.t. leading underscores - in C++, I'd have called them both (non-pure) virtual protected, and while I don't have a strong opinion about whether that merits an underscore in Python, I think they should be treated consistently.

andy-slac (Contributor, Author):

In my (twisted) mental model _saveStruct() is indeed protected (or even private) and not to be called by clients directly, while runWithOutputs() is public because it is the same as run(), only better :) Anyway, I'll make things uniform: no underscores.
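Concretely, the quoted call above would become (a two-line sketch; surrounding logic elided):

    # store produced output data; saveStruct is public and overridable,
    # with no leading underscore, matching runWithOutputs
    self.saveStruct(struct, outputDataRefs, butler)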

@@ -309,24 +364,47 @@ def runQuantum(self, quantum, butler):
inputs[key] = [butler.get(dataRef.datasetType.name, dataRef.dataId)
for dataRef in dataRefs]

natelust (Contributor):

There is no good way to put a comment on a non-diff line, so I am putting this here. The changes made require that the documentation in the docstring for runQuantum be updated; the described behavior is not what happens any more.

Also in the docstring, "must be lists of data bjects corresponding":
bjects -> objects

@@ -244,22 +244,77 @@ def getDatasetTypes(cls, config, configClass):
dsTypes[key] = cls.makeDatasetType(value)
return dsTypes

def run(self, *args, **kwargs):
def runWithOutputs(self, inputData, outputDataIds):
natelust (Contributor):

I have very strong reservations about a method called runWithOutputs whose default implementation does nothing with the outputDataIds variable, or anything output-related.

TallJimbo (Member):

@natelust and I discussed this a bit in person, and we came up with the following recommendation:

  • Let's rename runWithOutputs to adaptArgsAndRun (open to counter-proposals, but we think that better captures what it's doing).
  • Let's add an inputDataIds argument to adaptArgsAndRun containing a dictionary of input data IDs, with the same structure as outputDataIds. In all of the real cases I could think of where the concrete PipelineTask wants output IDs, it would want the input IDs too. We'd guarantee that the lists of IDs in inputDataIds and the lists of unpersisted objects in inputData are zip-iterable, in case an override of adaptArgsAndRun needs them. We think it's better to keep those as separate-but-related data structures so the default implementation of adaptArgsAndRun stays simple (as it is now).

I still think it'd be good to put that "make guaranteed single-element lists into single objects" transformation I brought up elsewhere on this ticket into runQuantum rather than adaptArgsAndRun (so the dicts passed to adaptArgsAndRun would sometimes have single objects as dict values rather than lists).
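A minimal sketch of the recommended default implementation (signature and behavior as proposed above; the merged code may differ):

    def adaptArgsAndRun(self, inputData, inputDataIds, outputDataIds):
        # inputData: dict mapping config field name -> data object(s).
        # inputDataIds / outputDataIds: parallel dicts of DataIds;
        # inputDataIds[name] is zip-iterable against inputData[name].
        # The default ignores the IDs and forwards the data to run().
        return self.run(**inputData)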

andy-slac force-pushed the tickets/DM-14823 branch 2 times, most recently from e42428b to af17776, on August 3, 2018 18:44
andy-slac (Contributor, Author):

@TallJimbo, @natelust, I think I have addressed all issues, including adding the scalar field. Could you check quickly that it looks OK now?

TallJimbo (Member) left a comment:

Two very minor comments. Happy for you to merge after addressing those; I don't need another look unless you have a question for me.

@@ -30,6 +30,22 @@
from .task import Task


class ScalarError(TypeError):
"""Exception raised when satset type is configured as scalar
TallJimbo (Member):

typo: "satset"

outUnits = {}
dataIds = [dataRef.dataId for dataRef in dataRefs]
data = [butler.get(dataRef.datasetType.name, dataRef.dataId)
for dataRef in dataRefs]
TallJimbo (Member):

After DM-14822 this can just be

data = [butler.get(dataRef) for dataRef in dataRefs]

andy-slac force-pushed the tickets/DM-14823 branch 2 times, most recently from 3318df4 to 4f1c34c, on August 3, 2018 22:06
andy-slac changed the base branch from tickets/DM-15220 to master on August 3, 2018 22:06
andy-slac (Contributor, Author):

Rebased to master after DM-15220 was merged.

The run() method interface was changed to accept only input data as
keyword arguments; it no longer has arguments for output data. This
should simplify task implementation for the most common use case, a
single output DataId (per dataset type). More complex cases will be
handled by re-implementing the new method runWithOutputs(), which has
arguments for both input data and output DataIds.

If the scalar field is set to True in the configuration then for the
corresponding dataset type we always unpack the corresponding DataId or
data list, and we do the opposite when saving data produced by the task.
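A hedged sketch of the unpacking behavior that second commit describes (the helper name and message text are assumptions; ScalarError is the exception added in the diff above):

    def unpackIfScalar(datasetTypeName, isScalar, values):
        """Return the single item for a scalar dataset type, the list otherwise."""
        if not isScalar:
            return values
        if len(values) != 1:
            raise ScalarError("expected exactly one item for scalar dataset "
                              "type {}, got {}".format(datasetTypeName, len(values)))
        return values[0]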
andy-slac merged commit 41f434e into master on Aug 4, 2018
ktlim deleted the tickets/DM-14823 branch on August 25, 2018 06:50