
DM-20205: Update to Final PipelineTasks API #98

Merged: 4 commits from tickets/DM-20205 into master on Aug 30, 2019

Conversation

natelust (Contributor):

No description provided.

andy-slac (Contributor) left a comment:

With all the travel I managed to look at just one file today; I will continue reviewing later, but I want to checkpoint what has been done so far.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""Module defining a butler like object specified to a specific quantum.
Reviewer comment:

specified -> specialized?

in practice is that the only gets and puts that this class allows
are DatasetRefs that are contained in the quantum.

In the future this class will also be used to record providence on
Reviewer comment:

I wish we had providence as a part of butler, but I guess you mean provenance? 👼

butler : `lsst.daf.butler.Butler`
Butler object from/to which datasets will be get/put
quantum : `lsst.daf.butler.core.Quantum`
Quantum object that describes the data-sets which will
Reviewer comment:

data-sets -> datasets?


Parameters
----------
dataset : `InputQuantizedConnection` or `list` of `lsst.daf.butler.DatasetRef`
Reviewer comment:

I think our latest convention is to write it as

`list` [`lsst.daf.butler.DatasetRef`]

Raises
------
ValueError
If a `DatasetRef` is passed to get that is not defined in the quantum object
Reviewer comment:

Needs the module name for `DatasetRef`. I'd add a `~` for each of those too, to reduce verbosity in the generated docs.
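For illustration, the numpydoc style the reviewer suggests (bracketed element type, `~` prefix so Sphinx renders only the final name component) might look like this in a hypothetical docstring; the method name and summary are illustrative, not the actual source:

```python
def get(self, dataset):
    """Fetch one or more datasets from the butler.

    Parameters
    ----------
    dataset : `list` [`~lsst.daf.butler.DatasetRef`]
        References to the datasets to fetch. The ``~`` prefix makes
        Sphinx render just ``DatasetRef`` rather than the fully
        qualified name.

    Raises
    ------
    ValueError
        If a `~lsst.daf.butler.DatasetRef` is passed that is not
        defined in the quantum object.
    """
```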

all the inputs of a quantum, a list of `lsst.daf.butler.DatasetRef`, or
a single `lsst.daf.butler.DatasetRef`. The function will get and return
the corresponding datasets from the butler.

Reviewer comment:

A Returns section would be useful too.

elif isinstance(dataset, DatasetRef) or isinstance(dataset, DeferredDatasetRef):
return self._get(dataset)
else:
raise TypeError("Dataset argument is not a type that can be used to get")
Reviewer comment:

I'm not a big fan of the isinstance-based switch, it feels a bit fragile for my taste. But I'm not sure anything can be done to avoid it?

Author reply (natelust):

In this case I don't think so, if we are going to try and preserve this sort of overload. That is, until we get to Python 3.8. If (when?) we can take that as the baseline, there is a nice built-in single dispatch for methods. That will make the code cleaner and more extensible, but fundamentally we want this method overloaded based on the type of the input.
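For reference, the Python 3.8 feature alluded to here is `functools.singledispatchmethod`. A minimal sketch of replacing the isinstance switch with registered overloads; the `QuantumGetter` class and its payloads are hypothetical stand-ins, not the real butler API:

```python
from functools import singledispatchmethod


class QuantumGetter:
    """Hypothetical stand-in for the quantum-scoped butler wrapper."""

    @singledispatchmethod
    def get(self, dataset):
        # fallback mirrors the final else branch of the isinstance switch
        raise TypeError("Dataset argument is not a type that can be used to get")

    @get.register
    def _(self, dataset: list):
        # lists recurse into the single-item overload
        return [self.get(item) for item in dataset]

    @get.register
    def _(self, dataset: str):
        # single item; the real code would register DatasetRef here instead
        return f"got {dataset}"
```

Each `@get.register` overload is selected on the runtime type of the first argument after `self`, so supporting a new reference type becomes a new registration rather than another `elif`.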

if isinstance(dataset, OutputQuantizedConnection):
if not isinstance(values, Struct):
raise ValueError("dataset is a OutputQuantizedConnection, a Struct with corresponding"
"attributes must be passed as the values to put")
Reviewer comment:

Needs a space before "attributes".

for i, ref in enumerate(refs):
self._put(valuesAttribute[i], ref)
else:
self._put(valuesAttribute, refs)
Reviewer comment:

I think this could be slightly simplified if _put() supported lists.

Author reply (natelust):

I debated back and forth on this: whether it was better to have a simple function that did one thing, or to duplicate the handling of lists and single values. I also debated having an intermediate function to handle lists. Ultimately I decided on a simpler _put, but I don't really have strong feelings.
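The alternative the reviewer suggests (a `_put` that handles lists itself) might be sketched as follows; `Writer`, `stored`, and the method shape are illustrative stand-ins, not the actual implementation:

```python
class Writer:
    """Hypothetical sketch of a _put that accepts lists or single items."""

    def __init__(self):
        self.stored = []  # records (value, ref) pairs for illustration

    def _put(self, values, refs):
        # parallel lists of values and refs, or a single value/ref pair
        if isinstance(refs, list):
            for value, ref in zip(values, refs):
                self.stored.append((value, ref))
        else:
            self.stored.append((values, refs))
```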

if inout is RefDirection.OUT:
for r in ref:
if (r.datasetType, r.dataId) not in self.allOutputs:
raise ValueError("DatasetRef is not part of the Quantum being processed")
Reviewer comment:

There is repetition in this method; I think you can simplify it if you just pass self.allInputs or self.allOutputs as a parameter instead of the enum (and the enum will not even have to exist in that case).

Author reply (natelust):

Really good point; I changed this so many times I could not see the forest for the trees.
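The suggested refactor, passing the membership container itself rather than a direction enum, might look roughly like this; the `Ref` tuple and class/method names are hypothetical stand-ins for the real `DatasetRef` and quantum-butler types:

```python
from collections import namedtuple

# hypothetical stand-in for lsst.daf.butler.DatasetRef
Ref = namedtuple("Ref", ["datasetType", "dataId"])


class QuantumContext:
    def __init__(self, allInputs, allOutputs):
        self.allInputs = set(allInputs)
        self.allOutputs = set(allOutputs)

    def _checkMembership(self, refs, inout):
        # inout is self.allInputs or self.allOutputs; no enum needed,
        # and the IN/OUT branches collapse into one loop
        for r in refs:
            if (r.datasetType, r.dataId) not in inout:
                raise ValueError("DatasetRef is not part of the Quantum being processed")
```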

andy-slac (Contributor) left a comment:

I have looked at a few more files, but have not finished reviewing yet. I will try to finish tomorrow if nothing urgent happens.

"InitOutputDatasetConfig", "InitOutputDatasetField",
"ResourceConfig", "QuantumConfig", "PipelineTaskConfig"]
__all__ = ["ResourceConfig", "PipelineTaskConfig",
"PipelineTaskConnections"]
Reviewer comment:

PipelineTaskConnections lives in connections, do you want to export it from this module too?

Author reply (natelust):

No, it was just something left over from before I moved it to its own file. Good catch.

Possibly raises a TypeError if attempting to create a factory function
from an incompatible type
This metaclass ensures a `PipelineTaskConnections` class is specified in the class construction
parameters with a parameter name of pipelineConnections. Using The supplied connection class,
Reviewer comment:

The -> the

this metaclass constructs a `lsst.pex.config.Config` instance which can be used to configure the
connections class. This config is added to the config class under declaration with the name
"connections" used as an identifier. The connections config also has a reference to the connections
class used in its construction with associated with an atttribute named "connectionsClass". Finally
Reviewer comment:

Extraneous "with". I'd put connectionsClass in backticks here.

The newly constructed config class (not an instance of it) is assigned to the Config class under
Reviewer comment:

The -> the

construction with the attribute name "ConnectionsConfigClass".
Reviewer comment:

ConnectionsConfigClass in backticks here too.
We have a limit on docstring/comment line length of 79 characters; I think this docstring exceeds it.


Parameters
----------
quantum : `lsst.daf.butler.core.Quantum`
Reviewer comment:

I think `lsst.daf.butler.Quantum` should be used here.

"""Extract and classify the dataset types from a single `PipelineTask`.

Parameters
----------
taskClass: `type`
A concrete `PipelineTask` subclass.
config: `PipelineTaskConfig`
Reviewer comment:

``config`` was removed too; add a description for the ``connectionsInstance`` parameter.

connection fields describing input dataset types. Argument values will
be data objects retrieved from data butler. If a dataset type is
configured with ``multiple`` field set to ``True`` then the argument
value will be a list of objects, otherwise it will be a single objects.
Reviewer comment:

single objects - > single object

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""Simple unit test for ResourceConfig.
Reviewer comment:

ResourceConfig -> connections?

import lsst.pipe.base as pipeBase


class TestConncetionsClass(unittest.TestCase):
Reviewer comment:

Conncetions -> Connections

andy-slac (Contributor) left a comment:

It looks OK (if my understanding of how it works is correct), though a bit too much meta-magic for my taste. I think it probably needs a good user manual, or maybe more code examples in docstrings.

try:
dct['dimensions'] = set(kwargs['dimensions'])
except TypeError:
raise dimensionsValueError
Reviewer comment:

`raise ... from ...` may be better here for a less confusing message; `from None` will suppress the TypeError traceback if you prefer.
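The two chaining options mentioned here, sketched on a standalone helper (the function name and message are illustrative, not the actual source):

```python
def extractDimensions(kwargs):
    """Hypothetical helper mirroring the try/except under review."""
    try:
        return set(kwargs["dimensions"])
    except TypeError as err:
        # "from err" records the TypeError as the explicit __cause__,
        # giving a "The above exception was the direct cause of..." message;
        # "raise ... from None" would suppress the chained traceback instead.
        raise ValueError("dimensions must be an iterable of dimension names") from err
```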

lsst.pipe.base.connectionTypes and are listed as follows:

* InitInput - Defines connections in a quantum graph which are used as inputs to the __init__ function
of the PipelineTask corresponding to this class
Reviewer comment:

The lines should not be indented or they will become indented in sphinx output too. I think you want this:

* InitInput - Defines connections in a quantum graph which are used as inputs to the __init__ function
  of the PipelineTask corresponding to this class

(but also properly wrapped at 79 characters).

You can also try definition list syntax: http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
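For illustration, the same entry rewritten with the reST definition-list syntax linked above might look like (wrapping and wording are a sketch, not the merged docstring):

```rst
InitInput
    Defines connections in a quantum graph which are used as
    inputs to the ``__init__`` function of the `PipelineTask`
    corresponding to this class.
```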

match dimensions that exist in the butler registry which will be used in executing the corresponding
`PipelineTask`.

The second parameter is labeled defaultTemplates and is conditionally optional. The name attributes of
Reviewer comment:

I'd put defaultTemplates in double backticks.

The second parameter is labeled defaultTemplates and is conditionally optional. The name attributes of
connections can be specified as python format strings, with named format arguments. If any of the
name parameters on connections defined in a `PipelineTaskConnections` class contain a template, then
a default template value must be specified in the defaultTemplate argument. This is done by passing
Reviewer comment:

defaultTemplate -> defaultTemplates?

"""PipelineTaskConnections is a class used to declare desired IO when a PipelineTask is run by an
activator

PipelineTaskConnection classes are created by declaring class attributes of types defined in
Reviewer comment:

This long description should probably be moved into Notes section after Parameters.

Once a `PipelineTaskConnections` class is created, it is used in the creation of a
`PipelineTaskConfig`. This is further documented in the documentation of `PipelineTaskConfig`. For the
purposes of this documentation, the relevant information is that the config class allows configuration
of connection names by users when running a pipeline.
Reviewer comment:

One small example of defining connections class and task config class would be super-helpful here.
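To make the request concrete: since the real `lsst.pipe.base` classes are not available here, the following is a stripped-down, self-contained sketch of the declarative pattern only. `Connection`, `ConnectionsMeta`, and all field names are illustrative stand-ins, not the actual API:

```python
class Connection:
    """Illustrative stand-in for the real connection field types."""

    def __init__(self, name, doc=""):
        self.name = name  # dataset type name, configurable in the real API
        self.doc = doc


class ConnectionsMeta(type):
    """Toy metaclass that collects declared Connection attributes."""

    def __new__(mcls, name, bases, dct, dimensions=()):
        # drop the keyword class argument before calling type.__new__
        return super().__new__(mcls, name, bases, dct)

    def __init__(cls, name, bases, dct, dimensions=()):
        super().__init__(name, bases, dct)
        cls.dimensions = set(dimensions)
        cls.allConnections = {k: v for k, v in dct.items()
                              if isinstance(v, Connection)}


class PipelineTaskConnections(metaclass=ConnectionsMeta):
    pass


class ExampleConnections(PipelineTaskConnections,
                         dimensions=("instrument", "visit", "detector")):
    # each class attribute declares one piece of task IO
    inputCatalog = Connection(name="src", doc="Input source catalog")
    outputCatalog = Connection(name="calibrated", doc="Calibrated catalog")
```

The point of the pattern is that the metaclass can introspect `allConnections` at class-creation time, which is what lets the real framework build a matching config class automatically.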

timj changed the title from "Tickets/DM-20205 Update to Final PipelineTasks API" to "DM-20205: Update to Final PipelineTasks API" on Aug 19, 2019
natelust (Author):

> It looks OK (if my understanding of how it works is correct), though a bit too much meta-magic for my taste. I think it probably needs a good user manual or maybe more code examples in docstrings.

As a separate body of work (so as not to hold this up), I am working on documentation that will go in the package documentation, describing how this works and including a long tutorial on how this all gets used to make PipelineTasks.

natelust force-pushed the tickets/DM-20205 branch 3 times, most recently from 2056f2b to 532716c, on August 27, 2019 at 13:21
This commit overhauls the framework for writing and using PipelineTasks.
It introduces a new class for defining all the connections a
PipelineTask expects during execution. It is connected to a
PipelineTaskConfig at declaration, at which point relevant configuration
fields are added to the Config class. The connections class allows for better
introspection from the execution framework, and simplifies the
implementation of a PipelineTask.
doc=docString,
default=default)
# add a reference to the connection class used to create this sub config
configConnectionsNamespace['connectionsClass'] = connectionsClass
TallJimbo (Member), Aug 27, 2019:

While the right capitalization for variables that refer to types is never very clear, "connectionsClass" here is sufficiently similar to "Task.ConfigClass" in usage and context that I think it's a pretty serious problem to capitalize it differently.

TallJimbo merged commit 6c9d669 into master on Aug 30, 2019
TallJimbo deleted the tickets/DM-20205 branch on August 30, 2019 at 12:15