Revert "DM-38953: Dynamic connection support and miscellaneous cleanups" #332

Merged: 1 commit, May 10, 2023
5 changes: 0 additions & 5 deletions doc/changes/DM-38953.feature.md

This file was deleted.

3 changes: 0 additions & 3 deletions doc/changes/DM-38953.misc.md

This file was deleted.

95 changes: 46 additions & 49 deletions doc/lsst.pipe.base/creating-a-pipelinetask.rst
@@ -138,7 +138,7 @@ connection classes make use of connection types defined in
a |PipelineTask| will expect to make use of. Each of these connections documents
what the connection is, what dimensions represent this data product (in this
case they are the same as the task itself, in the
:ref:`PipelineTask-processing-multiple-datasets` section we will cover when
:ref:`PipelineTask-processing-multiple-data-sets` section we will cover when
they are different), what kind of storage class represents this data type on
disk, and the name of the data type itself. In this connections class you have
defined two inputs and an output. The inputs the Task will make use of are
@@ -276,9 +276,9 @@ catalog, and insert the new measurements right into the output catalog.

One thing to note about `~lsst.pipe.base.connectionTypes.InitInput` connections
is that they do not take any dimensions. This is because the sort of data loaded
will correspond to a given dataset type produced by a task, and not by
(possibly multiple) executions of a run method over datasets that have
dimensions. In other words, these datasets are unique to the task itself and
will correspond to a given data-set type produced by a task, and not by
(possibly multiple) executions of a run method over data-sets that have
dimensions. In other words, these data-sets are unique to the task itself and
not tied to the unit of work that the task operates on.

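A minimal sketch of such a connection, assuming ``connectionTypes`` is
imported as in the earlier examples (the doc string and dataset type name
here are illustrative rather than the tutorial's exact values):

.. code-block:: python

    inputSchema = connectionTypes.InitInput(
        doc="Schema associated with the input catalogs",
        name="customAperture_schema",
        storageClass="SourceCatalog",
        # Note that no dimensions argument is given at all.
    )
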
In the same way the input schema from some previous stage of processing was
@@ -323,14 +323,14 @@ activator uses this shared name to know that variable should be persisted.

The complete updated example can be found in :ref:`pipeline-appendix-b`.

Optional and Dynamic Connections
================================
Sometimes it is useful to have a task that optionally uses a dataset. In the
Optional Datasets
==================
Sometimes it is useful to have a task that optionally uses a data-set. In the
case of the example task you have been building, this might be a background
model that was previously removed. You may want your task to add back in the
background so that it can do a new local background estimate. To start, add
the background dataset to our connection class like you did for your other
datasets.
the background data-set to our connection class like you did for your other
data-sets.

.. code-block:: python

@@ -344,8 +344,8 @@ datasets.
...

Now your `PipelineTask` will load the background each time the task is run.
How do you make this optional?
The simplest approach is to add a configuration field in your config class to allow the user to specify if it is to be loaded:
How do you make this optional? First, add a configuration field in your
config class to allow the user to specify if it is to be loaded, like so:

.. code-block:: python

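    # A hedged sketch of the configuration field described above; the field
    # name comes from the surrounding text, while the doc string and default
    # are illustrative rather than the tutorial's exact values.
    class ApertureTaskConfig(
        pipeBase.PipelineTaskConfig, pipelineConnections=ApertureTaskConnections
    ):
        ...

        doLocalBackground = pexConfig.Field(
            doc="Whether to re-add and re-estimate a local background",
            dtype=bool,
            default=False,
        )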
@@ -374,30 +374,27 @@ various configuration options.
"skymap"))
...

def __init__(self, *, config=None):
if not config.doLocalBackground:
del self.background

Your connection class now looks at the value of ``doLocalBackground`` on the ``config`` object and, if it is ``False``, removes that attribute.
Attributes can also be replaced entirely to change (e.g.) their dimensions, storage class, or other properties (connection objects are immutable, so they cannot be modified in place).
New connection attributes can even be added programmatically in ``__init__``, but these cannot have configurations for their dataset type names added automatically, and are recommended only when it is not possible to define a placeholder in the connections class to be modified later, such as when even the number of connections is dynamic.
The dimensions of the task itself can also be modified in ``__init__``, where ``self.dimensions`` is a mutable set that can be modified in place or replaced completely.

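A hedged sketch of the replacement pattern mentioned above; the alternate
dataset type name is purely hypothetical, and the connection is rebuilt from
its existing attributes because connection objects themselves are immutable:

.. code-block:: python

    def __init__(self, *, config=None):
        if not config.doLocalBackground:
            del self.background
        else:
            # Assign a brand-new connection object in place of the old one,
            # here swapping in a different (hypothetical) dataset type name.
            self.background = connectionTypes.Input(
                doc=self.background.doc,
                name="customLocalBackground",
                storageClass=self.background.storageClass,
                dimensions=self.background.dimensions,
            )
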
.. note::
In previous versions, connection attributes could not be deleted or modified directly, and the preferred pattern was to instead remove the name of the connection from a special set, after first delegating to `super`:

.. code-block:: python

def __init__(self, *, config=None):
super().__init__(config=config)
if not config.doLocalBackground:
self.inputs.remove("background")

This still works, but it is more verbose and less intuitive than the now-recommended attribute-manipulation approach.
It is now no longer necessary to delegate to `super` in either case, since the base class ``__init__`` does nothing (the first steps of initialization are instead handled by a metaclass).

if config.doLocalBackground is False:
self.inputs.remove("background")

The last step in modifying this task will be to update the ``run`` method to
Your connection class now looks at the value of ``doLocalBackground`` on the
``config`` object and, if it is ``False``, removes it from the connection
instance's list of input connections. Connection classes keep track of the
connections that are defined in sets. Each set contains the variable names of
the connections, and the sets themselves are identified by the type of
connection they contain. In this example you are modifying the set of input
connections. The names for each of the sets are as follows (a fuller sketch
appears after the list):

* ``initInputs``
* ``initOutputs``
* ``inputs``
* ``prerequisiteInputs``
* ``outputs``

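Putting this together, a minimal sketch of the connection class looks like
the following (other connections are elided, and the dimensions shown are
only illustrative):

.. code-block:: python

    class ApertureTaskConnections(
        pipeBase.PipelineTaskConnections,
        dimensions=("visit", "detector", "band"),
    ):
        ...

        def __init__(self, *, config=None):
            super().__init__(config=config)

            if config.doLocalBackground is False:
                self.inputs.remove("background")
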
The last step in modifying your task will be to update the ``run`` method to
take into account that a background may or may not be supplied.

.. code-block:: python
@@ -462,7 +459,7 @@ Dataset name configuration and templates
Now that you have the option to control results of processing with a
configuration option (turning on and off local background subtraction) it may
be useful for a user who turns on local background subtraction to change the
name of the dataset produced so as to tell what the configuration was
name of the data-set produced so as to tell what the configuration was
without looking at the persisted configurations. The user may make a
configuration override file that looks something like the following:

@@ -480,25 +477,25 @@ the variable names of the connections, with values of those fields corresponding
to the name of a connection. So by default ``config.connections.outputCatalog``
would be ``customAperture`` and ``config.connections.exposure`` would be
``calexp`` etc. Assigning to these `~lsst.pex.config.Field`\ s has the effect of
changing the name of the dataset type defined in the connection.
changing the name of the data-set type defined in the connection.

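As a concrete, hedged sketch, a single assignment of this kind (the new
dataset type name is purely illustrative) would be:

.. code-block:: python

    config.connections.outputCatalog = "customAperture_localBackground"
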
In this config file you are changing the name of the dataset type that will
In this config file you are changing the name of the data-set type that will
be persisted to include the information that local background subtraction was
done. It is interesting to note that there are no hard coded dataset type
done. It is interesting to note that there are no hard coded data-set type
names that must be adhered to; the user is free to pick any name. The only
consequence of changing a dataset type name is that any downstream code
that is to use the output dataset must have its default name changed to
that is to use the output data-set must have its default name changed to
match. As an aside, ``pipetask`` requires that the first time a
dataset name is used the activator command is run with the
data-set name is used the activator command is run with the
`--register-dataset-types` switch. This is to prevent accidental typos
becoming new dataset types.
becoming new data-set types.

Looking at the config file, there are two different `~lsst.pex.config.Field`\
s that are being set to the same value. In the case of other tasks this
number may even be higher. This leads not only to the issue of config
overrides potentially becoming lengthy, but also to the risk that a typo will
leave the fields set inconsistently. |PipelineTask| tasks address this by
providing a way to template dataset type names.
providing a way to template data-set type names.

.. code-block:: python

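    # A hedged sketch of dataset type name templating; the template name,
    # default value, and connections shown here are illustrative rather than
    # the tutorial's exact example.  A ``{outputName}`` placeholder in a
    # connection's dataset type name is expanded from the class-level
    # ``defaultTemplates`` mapping, and a single config override such as
    # ``config.connections.outputName = "customAperture_localBackground"``
    # then renames every connection that uses the template.
    class ApertureTaskConnections(
        pipeBase.PipelineTaskConnections,
        defaultTemplates={"outputName": "customAperture"},
        dimensions=("visit", "detector", "band"),
    ):
        outputCatalog = connectionTypes.Output(
            doc="Aperture measurements",
            name="{outputName}",
            storageClass="SourceCatalog",
            dimensions=("visit", "detector", "band"),
        )
        outputSchema = connectionTypes.InitOutput(
            doc="Schema of the output catalog",
            name="{outputName}_schema",
            storageClass="SourceCatalog",
        )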
@@ -587,9 +584,9 @@ that should only be needed very rarely. As such, they are beyond the scope of
this tutorial, and are only brought up for completeness' sake.


.. _PipelineTask-processing-multiple-datasets:
.. _PipelineTask-processing-multiple-data-sets:

Processing multiple datasets of the same dataset type
Processing multiple data-sets of the same dataset type
------------------------------------------------------
The dimensions you have used in your task up to this point have specified that
the unit of processing is to be done on individual detectors on a per-visit
@@ -650,27 +647,27 @@ follows.
...

The dimensions of your `ApertureTaskConnections` class are now ``visit`` and
``band``. However, all of your input datasets are themselves still defined
``band``. However, all of your input data-sets are themselves still defined
over each of these dimensions, and also ``detector``. That is to say you get
one ``calexp`` for every unique combination of ``exposure``\ 's dimensions.
Because the task's dimensions are a more inclusive set of dimensions (less
specified) you should expect that for a given unit of processing, there will be
multiple values for each of the input dataset types along the ``detector``
multiple values for each of the input data-set types along the ``detector``
dimension. For example, in LSST there will be 189 detectors in each visit. You
indicate to the execution framework that you expect there to be a list of
datasets for each connection by adding ``multiple=True`` to its declaration.
This ensures the values passed will be inside of a list container.

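A hedged sketch of one such connection (the doc string is illustrative, and
the dimensions are those discussed above for the input datasets):

.. code-block:: python

    exposures = connectionTypes.Input(
        doc="Input exposures to make measurements on",
        name="calexp",
        storageClass="ExposureF",
        dimensions=("visit", "detector", "band"),
        multiple=True,
    )
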
As a caveat, depending on the exact data that has been ingested/processed,
there may only be one dataset that matches this combination of dimensions
there may only be one data-set that matches this combination of dimensions
(i.e. only one raw was ingested for a visit), but the ``multiple`` flag will
still ensure that the system passes this one dataset along contained
still ensure that the system passes this one data-set along contained
inside a list. This ensures a uniform API to program against. Make note that
the connection variable names change to add an `s` on connections marked with
`multi` to reflect that they will potentially contain multiple values.

With this in mind, go ahead and make the changes needed to accommodate multiple
datasets in each input to the run method. Because this task is inherently
data-sets in each input to the run method. Because this task is inherently
parallel over detectors, these modifications are not the most natural way to
code this behavior, but are done to demonstrate how to make use of the
``multi`` flag for situations that are not so trivial.
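
As a rough, hedged sketch (the argument and attribute names are illustrative),
the reworked ``run`` method then iterates over the parallel lists it receives:

.. code-block:: python

    def run(self, *, exposures, inputCatalogs, backgrounds=None):
        outputCatalogs = []
        for index, (exposure, catalog) in enumerate(zip(exposures, inputCatalogs)):
            background = None if backgrounds is None else backgrounds[index]
            # ... measure apertures on this one detector's data ...
            outputCatalogs.append(catalog)
        return pipeBase.Struct(outputCatalogs=outputCatalogs)
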
@@ -929,7 +926,7 @@ Overriding Task execution
Overriding the `PipelineTask` method ``runQuantum`` is another advanced tool.
This method is used in task execution and is supplied identifiers for input
dataset references to be used in processing, and output dataset references
for datasets the middleware framework expects to be written at the end of
for data-sets the middleware framework expects to be written at the end of
processing. The ``runQuantum`` method is responsible for fetching inputs,
calling the ``run`` method, and writing outputs using the supplied data ids.
After an introduction to the default implementation of ``runQuantum``, this
@@ -965,7 +962,7 @@ about the unit of data your task will be processing, i.e. the quantum, as well
as extra functionality attached to it.

`~lsst.daf.butler.DatasetRef`\ s that can be used to interact with specific
datasets managed by the butler. See the description of the
data-sets managed by the butler. See the description of the
`InputQuantizedConnection` type in the section
:ref:`PipelineTask-processing-altering-what-is-processed` for more
information on `QuantizedConnection`\ s, noting that
@@ -981,7 +978,7 @@ capabilities of the ``get`` method, put for putting outputs into the butler.

For example's sake, add a `runQuantum` method to your photometry task that
loads in all the input references one connection at a time. Your task only
expects to write a single dataset out, so the `runQuantum` will also put
expects to write a single data-set out, so the `runQuantum` will also put
with that single `lsst.daf.butler.DatasetRef`.

.. code-block:: python
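
    # A hedged sketch of such a ``runQuantum`` override; the connection names
    # here are illustrative rather than the tutorial's exact ones.  Inputs
    # are fetched one connection at a time with ``butlerQC.get``, ``run`` is
    # called as usual, and the single expected output is written back with
    # ``butlerQC.put`` using its DatasetRef.
    def runQuantum(self, butlerQC, inputRefs, outputRefs):
        exposures = butlerQC.get(inputRefs.exposures)
        inputCatalogs = butlerQC.get(inputRefs.inputCatalogs)
        outputs = self.run(exposures=exposures, inputCatalogs=inputCatalogs)
        butlerQC.put(outputs.outputCatalog, outputRefs.outputCatalog)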
2 changes: 1 addition & 1 deletion python/lsst/pipe/base/__init__.py
@@ -1,4 +1,4 @@
from . import automatic_connection_constants, connectionTypes, pipelineIR
from . import connectionTypes, pipelineIR
from ._dataset_handle import *
from ._instrument import *
from ._observation_dimension_packer import *
77 changes: 0 additions & 77 deletions python/lsst/pipe/base/automatic_connection_constants.py

This file was deleted.

5 changes: 1 addition & 4 deletions python/lsst/pipe/base/config.py
@@ -152,7 +152,7 @@ def __new__(
configConnectionsNamespace["ConnectionsClass"] = connectionsClass

# Create a new config class with the fields defined above
Connections = type(f"{name}Connections", (pexConfig.Config,), configConnectionsNamespace)
Connections = type("Connections", (pexConfig.Config,), configConnectionsNamespace)
# add it to the Config class that is currently being declared
dct["connections"] = pexConfig.ConfigField(
dtype=Connections,
@@ -173,9 +173,6 @@ def __init__(
# documentation on metaclasses
super().__init__(name, bases, dct)

ConnectionsClass: type[PipelineTaskConnections]
ConnectionsConfigClass: type[pexConfig.Config]


class PipelineTaskConfig(pexConfig.Config, metaclass=PipelineTaskConfigMeta):
"""Configuration class for `PipelineTask`