Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-16867: Move pieces of pipe_supertask to pipe_base #73

Merged
merged 49 commits into from Jan 4, 2019

Conversation

andy-slac
Copy link
Contributor

@andy-slac andy-slac commented Dec 31, 2018

Modules that have been moved here:

  • configOverrides
  • graph
  • graphBuilder
  • pipeline
  • pipeTools
  • pipelineBuilder
  • new ABC TaskFactory was added
  • all existing unit tests for the above

More or less complete history was copied too, branch tickets/DM-16867-supertask-history contains that history and its HEAD corresponds to the master of pipe_supertask. All migration changes to the code were done after merging that branch into tickets/DM-16867.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left lots of little minor comments, some of which I'm sure you'll want to defer to future tickets.

My only big question for this change is whether this is the right home for the expr_parser subpackage. My understanding was that we were not using this code at all yet, and that makes me a bit wary about putting it on master now, especially since it seems unlikely to be broken by changes to other parts of the codebase, and hence not obviously useful to get into CI now. I was also hoping that these expressions would ultimately be useful in high-level Registry APIs, not Pipeline activators. What do you think about moving this subpackage to a non-master (for now) branch of daf_butler instead?

@@ -19,21 +19,16 @@
# the GNU General Public License along with this program. If not,
# see <http://www.lsstcorp.org/LegalNotices/>.
#
"""Module defining PipelineBuilder class and related methods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to get @jonathansick's opinion on this, but my inclination would be to remove this docstring, as it provides no real information and this module is itself something of an implementation detail (in that all of its public symbols are lifted into the package namespace).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that docstring like that add very little (or very little negative) value to the module. OTOH modules are supposed to have a summary docstring in them. I could either drop them entirely or try to expand them (but that would probably duplicate class docstring content).

@@ -55,123 +50,83 @@


class PipelineBuilder(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not inherit from object in Python 3; that's implied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is leftover of 2-to-3 migration which was never completed for pipe_supertask. It is removed in one of the latter commits.

def makePipeline(self, args):
"""Build pipeline from command line arguments.
Pipeline will be checked for possible inconsistencies before
returning, exception is raised if any problems are detected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to put this in a Raises section and document the exception type raised.


def __init__(self, taskFactory):
self.taskFactory = taskFactory
def pipeline(self, order_pipeline=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keyword arguments in this corner of the codebase should be camelCase. Given that "pipeline" is already the method name, though, I'd just called this argument ordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the same naming rule apply to package names - should my expr_parser package be renamed to exprParser?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so.

pipeline = orderedPipeline

return pipeline
orderedPipeline = pipeTools.orderPipeline(self._pipeline, self._taskFactory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing import for pipeTools?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be there, unit test would fail without that import

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it was fixed on a later commit and I didn't notice.


def _moveTask(self, pipeline, label, new_idx):
def moveTask(self, label, new_idx):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the index of a PipelineTask in a Pipeline or PipelineBuilder really user-visible? I'd think it'd be better for these objects to make the ordering deterministic (i.e. topologically sorted given DatasetType input/output relationships, maybe lexicographical to break ties). I'm not opposed to letting them having an unsorted state with an explicit validation or sort operation, but I don't think it's a good idea for them to expose an index-based interface to the Tasks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index is not visible (though we can make it visible when doing pipetask --show=pipeline), but tasks are ordered in a pipeline as Pipeline is just a list. For execution we'll sort things topologically of course, but my idea was that people may prefer some tasks to be prioritized compared to others (within topological constraints) and one way to do it is re-ordering pipeline. If re-ordering is not useful I can certainly drop it, and there is always an option for an activator to implement its own prioritization later.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

"""Module definig TaskFactory interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo. Also not sure this docstring adds any value.

----------
taskName : `str`
Name of the PipelineTask class, interpretation depends entirely on
activator, e.g. it may or may not include dots.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be appropriate to say that the interpretation now is now entirely up to the TaskFactory and the contexts in which it is used? Or are there other places where assumptions are made about the form of these names?

Copy link
Contributor Author

@andy-slac andy-slac Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point there are no other assumption, task names are just passed from Pipeline to task factory. TaskFactory implemented in ctrl_mpexec expects either something that can be resolved w.r.t. (PYTHONPATH) --package option (e.g. module.TaskClass) or just a task class name (super-inefficient to search). Other factories may use different conventions, e.g. registry can be built which uses abstract task names instead of class names, though that would probably make pipeline definitions non-portable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I may be confusing myself here, I think that logic may be a bit different - TaskFactory is used to resolve any user-provided name to a full class name and that is passed then to a Pipeline. Though even that is an implementation detail now and is not documented anywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we need to enforce this, but I can't actually think of a use case for having different TaskFactory implementations that might expect different kinds of names, and I think it would be desirable for all activators to handle task names consistently.

I gather the reason you didn't want to move the TaskFactory implementation here was because you didn't quite consider it mature (and DM-17044 backs that up). Perhaps after it does mature we should just move the implementation here and remove the ABC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think I actually prefer TaskFactory to be an ABC here, at least it tells clients what minimal interface is needed for the rest of the thing to work. Implementation in ctrl_mpexec does a little bit more that is needed from this interface - it also supports functionality that is needed by pipetask script like making a list of available tasks.

@@ -85,7 +85,7 @@ def __init__(self, taskName, refs):
# ------------------------


class GraphBuilder:
class GraphBuilder(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to inherit from object in Python 3.

pass

def tearDown(self):
pass
Copy link
Member

@TallJimbo TallJimbo Jan 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just leave these methods out if they do nothing, I think.

@TallJimbo
Copy link
Member

Also: I looked at each of the commits since the transfer and commented on them individually (and didn't look carefully at the mass-rename ones). So there may be comments on things you'd already fixed in later commits, and if you want me to look at any of the code you transferred as whole rather than the changes, please let me know.

@andy-slac
Copy link
Contributor Author

Regarding expr_parser stuff - it is used though I'm not very happy with its current "interface" (or rather lack of it). pipetask accepts user expression string which it parses using expr_parser and then makes a string again out of that parsed expression and passes it to a pre-flight. I think better approach would be to have some abstract interface for user expression defined in daf_butler, trouble is that I have no idea how that interface can look like. One possible option is to define an interface as a parsed expression tree (AST) as it is defined now in expr_paser/exprTree.py but that implies that we "freeze" the grammar for the expression. I need to think a little bit about that, maybe having expr_paser/exprTree.py in daf_butler is a good first step and the rest can be kept in pipe_base as a reusable parser for one sort of syntax.

@TallJimbo
Copy link
Member

I think it'd make slightly more sense to just put it all in daf_butler for now - while I agree that we don't want to freeze the grammar now, I think any kind of engineering to make extending the parser easier (like trying to divide interface and implementation across packages) is premature. I am also skeptical that there is any parsing functionality we'd want in pipe_base but not in daf_butler.

@andy-slac
Copy link
Contributor Author

OK, I'm going to move expr_parser to daf_butler then and undo some commits here, hope I will not make too much mess with that.

First sketch, things are very likely to change as we start implementing
and understand more details.
For now Pipeline is just a list, it inherits from list and does not add
anything extra. We'll probably add few convenience method as we start
implementing client of this class.
Deleted obsolete stuff. renamed couple of modules for LSST convention.
Ran pylint on everyhting, fixed few easy-to-fix things.
Moved few utility methods into util.py. Move SuperTaskFactory into
taskFactory module and rename calss into TaskFactory. Add TaskInfo class
to be used as item type in Pipeline.
Instead of pre-flight activator we now define a common "graph builder"
class which takes Pipeline as input and produces "execution graph" as
output. The exact structure of that graph is not clear yet, for now it
is just a list of tuples which is sufficient for command line
nvironment.

Pipeline task was revised slightly to provide more efficient
implementation of task definition when reasonable.
Updated pipeline tests for new TaskDef class. Replaced SuperTask unit
test with nnew version which has tests for defineQuanta/runQuantum
methods.
Add support for specifying more than one task on command line and also
option for reading pipeline definition from pickle file (and also write
it to pickle file). Add a class representing execution graph, or at
least one representation of graph that is used by command-line tool.
Implemented a bunch of --show options.
Added new options for building/modifying pipelines. Also added quantum
graph options, but they are not used yet by cmdLineFwk (will work on
qgraph next). Code that builds pipelines based on command line arguments
was factored out into a PipelineBuilder class.
Still WIP, many updates to parser, SuperTask, SuperTaskConfig, unit test
and examples.
Part of a pre-flight algorithm is in SqlRegistry class in daf_butler.
Core in this package takes data generated by registry, conbines that
into Qunata and builds complete QuantumGraph.

I have added couple of example tasks which are less trivial than
examples that we had before and they also work with actual unit types
that exist in registry today. Can be used to build QGraph based on
ci_hsc registry for example.
Class is actually renamed to PipelineTask. Updated all dependencies here
to use PipelineTask from pipe_base.
The interface of PipelineTask class has changed, updated all
dependencies here.
andy-slac and others added 19 commits January 3, 2019 23:14
Loops are only detected when we order pipeline and previously ordering
was only done when --order-pipeline option was set. Now we do checks
unconditionally. Added one more test for trivial loop.
To run things we have to have all input DataRefs in the same collection
as output, added association of the all quantumGraph inputs with output
collection. Current assumption is that output collection is empty, will
have to extend implementation to support non-empty collections.

Updated example tasks to return data that can be saved to butler, with
that `stac` is now able to run complete chain on `ci_hsc` repo and write
something into that repo.
Data units were renamed in daf_butler schema, Camera becomes Instrument
and Sensor becomes Detector. These names are used in few places in this
package too, follow that renaming in all usage of the unit names.
Consistently use `logging` in Python code, CmdLineFwk configures logging
output to be forwarded to lsst.log.
These methods now return DatasetTypeDescriptor instance instead of plain
DatasetType, need to change all uses of the returned maps.
Change in DatasetType avoids the need to register StorageClass with
factory when StorageClass is not going to be used. Removed all
registration of example StorageClass in unit tests and task examples.
Make the PipelineTask activator able to parse config parameter
overrides which are lists when the overrides are passed from the
command line.
In addition, all DatasetTypes are now always checked against those
in the Registry.
PipelineBuilder is now looking more or less like normal class with
methods rather than something executing external actions.

Few other small changes to unit tests, exclude parser stuff from flake8
testing.
@andy-slac andy-slac merged commit 76b0903 into master Jan 4, 2019
@andy-slac andy-slac deleted the tickets/DM-16867 branch July 10, 2022 02:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants