DM-14823: Simplify PipelineTask API for common use case #60
Conversation
Some small comments, and one request to add another convenience change to the nature of `run` (and `runWithOutputs`) args.
```
inputData : `dict`
    Dictionary whose keys are the names of the configuration fields
    describing input dataset types and values are lists of
    Python-domain data objects retrieved from data butler.
```
It'd be really nice if we could make the dictionary value a single item instead of a list in cases where we know the list couldn't have more than one element. We could try to guess when that's the case by comparing the DataUnits of the Quantum with the DataUnits of this DatasetType, but it's probably clearer and less error-prone to just have a way for `PipelineTask` to declare that a DatasetType is a scalar (e.g. in `InputDatasetConfig`).

This is more important for `run` than `runWithData`, of course, but I think it makes sense to put that transformation in `runQuantum` so `runWithData` would see it as well.

And of course the same is true for outputs as well as inputs.
So, I guess the proposal is to add an extra attribute `scalar` to `In/OutputDatasetConfig` and make it `False` by default?
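To make the proposal concrete, here is a minimal sketch of the unpacking it implies. The helper name `unpackScalars`, the `scalarFlags` mapping, and the field names are illustrative assumptions, not the actual `pipe_base` API: fields whose config declares `scalar=True` get their single-element list replaced by the element itself, and any other list length is an error.

```python
class ScalarError(TypeError):
    """Raised when a dataset type declared scalar does not have
    exactly one value."""


def unpackScalars(inputData, scalarFlags):
    """Return a copy of inputData with scalar fields unpacked.

    Hypothetical helper: inputData maps field name -> list of data
    objects, scalarFlags maps field name -> bool (the proposed
    ``scalar`` config attribute, False by default).
    """
    unpacked = {}
    for name, values in inputData.items():
        if scalarFlags.get(name, False):
            if len(values) != 1:
                raise ScalarError(
                    f"expected exactly one value for scalar field "
                    f"{name!r}, got {len(values)}")
            unpacked[name] = values[0]
        else:
            unpacked[name] = values
    return unpacked
```

With this, a call like `unpackScalars({"calexp": [img], "srcs": [a, b]}, {"calexp": True})` would leave `"srcs"` as a list and turn `"calexp"` into the bare object; the same mapping applied in reverse would re-wrap scalar outputs before saving.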
Yup, sounds good. @natelust, will that require any extra work to be included in the syntactic sugar for those config objects?
I think it would just be one small change to add that to the wrapper function signature; everything else would propagate through.
Yes, I did that and also added something to the docstring for the wrapper method.
```
For input dataset types argument values will be lists of the data
object retrieved from data butler. For output dataset types argument
values will be the lists of units from DataRefs in a Quantum.
This method is called by :py:meth:`runQuantum` to operate on input
```
You should just be able to say `runQuantum` without the `:py:meth:` business; I'm not sure where the latter is required (@jonathansick?) but I'm pretty sure links in docstrings will work fine without it as long as there are no namespace-qualification issues (and there shouldn't be for a method of the same class).

(Several other instances of this throughout the changeset should also be fixed.)
I think that is correct, only the backticks should be needed
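As an illustration of the two docstring styles being compared (the method name and body here are placeholders, and the bare-backtick form resolving as a link assumes the project's Sphinx setup handles it, as the comments above suggest):

```python
def runWithData(self, inputData):
    """Operate on input data.

    This method is called by `runQuantum`; the plain-backtick form
    replaces the more verbose :py:meth:`runQuantum` role.
    """
```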
```
        # store produced ouput data
        self._saveStruct(struct, outputDataRefs, butler)

    def _saveStruct(self, struct, outputDataRefs, butler):
```
I don't see a good reason to treat `_saveStruct` and `runWithOutputs` differently w.r.t. leading underscores - in C++, I'd have called them both (non-pure) virtual protected, and while I don't have a strong opinion about whether that merits an underscore in Python, I think they should be treated consistently.
In my (twisted) mental model `_saveStruct()` is indeed protected (or even private) and not to be called by clients directly, while `runWithOutputs()` is public because it is the same as `run()`, only better :) Anyways, I'll make things uniform-no-underscores.
```
@@ -309,24 +364,47 @@ def runQuantum(self, quantum, butler):
             inputs[key] = [butler.get(dataRef.datasetType.name, dataRef.dataId)
                            for dataRef in dataRefs]
```
There is no good way to put a comment on a non-diff line, so I am putting this here. The changes made require that the documentation in the docstring for `runQuantum` be updated; the described behavior is not what happens any more.

Also in the docstring, "must be lists of data bjects corresponding": bjects -> objects.
```
@@ -244,22 +244,77 @@ def getDatasetTypes(cls, config, configClass):
             dsTypes[key] = cls.makeDatasetType(value)
         return dsTypes

-    def run(self, *args, **kwargs):
+    def runWithOutputs(self, inputData, outputDataIds):
```
I have very strong reservations about a method called `runWithOutputs` that in the default implementation does nothing with the `outputDataIds` variable, or anything output related.

@natelust and I discussed this a bit in person, and we came up with the following recommendation:

I still think it'd be good to do that "make guaranteed single-element lists into single objects" transformation I brought up elsewhere on this ticket into …
Force-pushed from e42428b to af17776.
@TallJimbo, @natelust, I think I have addressed all issues, including adding …
Two very minor comments. Happy for you to merge after addressing those; I don't need another look unless you have a question for me.
```
@@ -30,6 +30,22 @@
 from .task import Task


 class ScalarError(TypeError):
     """Exception raised when satset type is configured as scalar
```
typo: "satset"
```
        outUnits = {}
        dataIds = [dataRef.dataId for dataRef in dataRefs]
        data = [butler.get(dataRef.datasetType.name, dataRef.dataId)
                for dataRef in dataRefs]
```
After DM-14822 this can just be

```
data = [butler.get(dataRef) for dataRef in dataRefs]
```
Force-pushed from 3318df4 to 4f1c34c.
Rebased to master after DM-15220 was merged.
The run() method interface was changed to accept input data only as keyword arguments and no longer has arguments for output data. This should simplify task implementation for the most common use case with a single output DataId (per dataset type). More complex cases will be handled by re-implementing the new method runWithOutputs(), which has arguments for both input data and output DataIds.

If the scalar field is set to True in configuration then for the corresponding dataset type we always unpack the corresponding DataId or data lists, and we do the opposite when saving data produced by the task.
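A toy illustration of the calling convention just described; the class and field names are assumptions, not the real `lsst.pipe.base` classes. `runWithOutputs()` receives the output DataIds, but the default implementation simply forwards the input data to `run()` as keyword arguments.

```python
class ToyPipelineTask:
    """Stand-in for a PipelineTask-like base class (illustrative only)."""

    def runWithOutputs(self, inputData, outputDataIds):
        # Default: ignore outputDataIds and forward inputs as kwargs;
        # complex tasks override this method instead of run().
        return self.run(**inputData)

    def run(self, calexp=None):
        # Subclasses implement their science logic against named inputs.
        return {"doubled": calexp * 2}


task = ToyPipelineTask()
result = task.runWithOutputs({"calexp": 21}, outputDataIds={})
```

A simple task thus only needs to implement `run()` with one keyword argument per configured input dataset type.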
Force-pushed from 4f1c34c to 41ad3e3.
The run() method interface was changed to accept input data only as keyword arguments and no longer has arguments for output data. This should simplify task implementation for the most common use case with a single output DataId (per dataset type). More complex cases will be handled by re-implementing the new method runWithOutputs(), which has arguments for both input data and output DataIds.

A piece of runQuantum() code was refactored into a separate method _saveStruct() which can also be re-implemented in a subclass.
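The refactoring can be sketched as follows. `Struct`, `FakeButler`, and the DataId handling are simplified stand-ins for the real LSST classes, used only to show how `runQuantum()` delegates persistence to an overridable `_saveStruct()` hook:

```python
class Struct:
    """Minimal stand-in for lsst.pipe.base.Struct."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


class FakeButler:
    """Records put() calls instead of persisting (illustrative only)."""
    def __init__(self):
        self.stored = {}

    def put(self, obj, name, dataId):
        self.stored[(name, dataId)] = obj


class ToyTask:
    def runQuantum(self, inputData, outputDataIds, butler):
        # Run the science code, then hand the results to the
        # persistence hook.
        struct = self.run(**inputData)
        self._saveStruct(struct, outputDataIds, butler)

    def run(self, image=None):
        return Struct(background=image - 1)

    def _saveStruct(self, struct, outputDataIds, butler):
        # Persist every Struct attribute under its output DataId;
        # subclasses override this to customize storage without
        # touching runQuantum().
        for name, value in vars(struct).items():
            butler.put(value, name, outputDataIds[name])


butler = FakeButler()
ToyTask().runQuantum({"image": 5}, {"background": "visit=1"}, butler)
```

Separating the hook this way lets a subclass change how (or whether) outputs are written while reusing the input-fetching logic in `runQuantum()`.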