New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-16867: Move pieces of pipe_supertask to pipe_base #73
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left lots of little minor comments, some of which I'm sure you'll want to defer to future tickets.
My only big question for this change is whether this is the right home for the expr_parser
subpackage. My understanding was that we were not using this code at all yet, and that makes me a bit wary about putting it on master now, especially since it seems unlikely to be broken by changes to other parts of the codebase, and hence not obviously useful to get into CI now. I was also hoping that these expressions would ultimately be useful in high-level Registry
APIs, not Pipeline
activators. What do you think about moving this subpackage to a non-master (for now) branch of daf_butler instead?
@@ -19,21 +19,16 @@ | |||
# the GNU General Public License along with this program. If not, | |||
# see <http://www.lsstcorp.org/LegalNotices/>. | |||
# | |||
"""Module defining PipelineBuilder class and related methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious to get @jonathansick's opinion on this, but my inclination would be to remove this docstring, as it provides no real information and this module is itself something of an implementation detail (in that all of its public symbols are lifted into the package namespace).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that docstring like that add very little (or very little negative) value to the module. OTOH modules are supposed to have a summary docstring in them. I could either drop them entirely or try to expand them (but that would probably duplicate class docstring content).
@@ -55,123 +50,83 @@ | |||
|
|||
|
|||
class PipelineBuilder(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not inherit from object
in Python 3; that's implied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is leftover of 2-to-3 migration which was never completed for pipe_supertask
. It is removed in one of the latter commits.
def makePipeline(self, args): | ||
"""Build pipeline from command line arguments. | ||
Pipeline will be checked for possible inconsistencies before | ||
returning, exception is raised if any problems are detected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to put this in a Raises
section and document the exception type raised.
|
||
def __init__(self, taskFactory): | ||
self.taskFactory = taskFactory | ||
def pipeline(self, order_pipeline=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keyword arguments in this corner of the codebase should be camelCase. Given that "pipeline" is already the method name, though, I'd just called this argument ordered
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the same naming rule apply to package names - should my expr_parser
package be renamed to exprParser
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe so.
pipeline = orderedPipeline | ||
|
||
return pipeline | ||
orderedPipeline = pipeTools.orderPipeline(self._pipeline, self._taskFactory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing import for pipeTools
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be there, unit test would fail without that import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it was fixed on a later commit and I didn't notice.
|
||
def _moveTask(self, pipeline, label, new_idx): | ||
def moveTask(self, label, new_idx): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the index of a PipelineTask
in a Pipeline
or PipelineBuilder
really user-visible? I'd think it'd be better for these objects to make the ordering deterministic (i.e. topologically sorted given DatasetType input/output relationships, maybe lexicographical to break ties). I'm not opposed to letting them having an unsorted state with an explicit validation or sort operation, but I don't think it's a good idea for them to expose an index-based interface to the Tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index is not visible (though we can make it visible when doing pipetask --show=pipeline
), but tasks are ordered in a pipeline as Pipeline is just a list
. For execution we'll sort things topologically of course, but my idea was that people may prefer some tasks to be prioritized compared to others (within topological constraints) and one way to do it is re-ordering pipeline. If re-ordering is not useful I can certainly drop it, and there is always an option for an activator to implement its own prioritization later.
python/lsst/pipe/base/taskFactory.py
Outdated
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
"""Module definig TaskFactory interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo. Also not sure this docstring adds any value.
---------- | ||
taskName : `str` | ||
Name of the PipelineTask class, interpretation depends entirely on | ||
activator, e.g. it may or may not include dots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be appropriate to say that the interpretation now is now entirely up to the TaskFactory
and the contexts in which it is used? Or are there other places where assumptions are made about the form of these names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point there are no other assumption, task names are just passed from Pipeline to task factory. TaskFactory implemented in ctrl_mpexec
expects either something that can be resolved w.r.t. () PYTHONPATH
--package
option (e.g. module.TaskClass
) or just a task class name (super-inefficient to search). Other factories may use different conventions, e.g. registry can be built which uses abstract task names instead of class names, though that would probably make pipeline definitions non-portable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I may be confusing myself here, I think that logic may be a bit different - TaskFactory is used to resolve any user-provided name to a full class name and that is passed then to a Pipeline. Though even that is an implementation detail now and is not documented anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we need to enforce this, but I can't actually think of a use case for having different TaskFactory
implementations that might expect different kinds of names, and I think it would be desirable for all activators to handle task names consistently.
I gather the reason you didn't want to move the TaskFactory implementation here was because you didn't quite consider it mature (and DM-17044 backs that up). Perhaps after it does mature we should just move the implementation here and remove the ABC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think I actually prefer TaskFactory to be an ABC here, at least it tells clients what minimal interface is needed for the rest of the thing to work. Implementation in ctrl_mpexec does a little bit more that is needed from this interface - it also supports functionality that is needed by pipetask
script like making a list of available tasks.
@@ -85,7 +85,7 @@ def __init__(self, taskName, refs): | |||
# ------------------------ | |||
|
|||
|
|||
class GraphBuilder: | |||
class GraphBuilder(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to inherit from object
in Python 3.
tests/test_exprParserLex.py
Outdated
pass | ||
|
||
def tearDown(self): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to just leave these methods out if they do nothing, I think.
Also: I looked at each of the commits since the transfer and commented on them individually (and didn't look carefully at the mass-rename ones). So there may be comments on things you'd already fixed in later commits, and if you want me to look at any of the code you transferred as whole rather than the changes, please let me know. |
b58cae0
to
4d05848
Compare
Regarding |
I think it'd make slightly more sense to just put it all in daf_butler for now - while I agree that we don't want to freeze the grammar now, I think any kind of engineering to make extending the parser easier (like trying to divide interface and implementation across packages) is premature. I am also skeptical that there is any parsing functionality we'd want in pipe_base but not in daf_butler. |
OK, I'm going to move |
First sketch, things are very likely to change as we start implementing and understand more details.
For now Pipeline is just a list, it inherits from list and does not add anything extra. We'll probably add few convenience method as we start implementing client of this class.
Deleted obsolete stuff. renamed couple of modules for LSST convention. Ran pylint on everyhting, fixed few easy-to-fix things.
Moved few utility methods into util.py. Move SuperTaskFactory into taskFactory module and rename calss into TaskFactory. Add TaskInfo class to be used as item type in Pipeline.
Instead of pre-flight activator we now define a common "graph builder" class which takes Pipeline as input and produces "execution graph" as output. The exact structure of that graph is not clear yet, for now it is just a list of tuples which is sufficient for command line nvironment. Pipeline task was revised slightly to provide more efficient implementation of task definition when reasonable.
Updated pipeline tests for new TaskDef class. Replaced SuperTask unit test with nnew version which has tests for defineQuanta/runQuantum methods.
Add support for specifying more than one task on command line and also option for reading pipeline definition from pickle file (and also write it to pickle file). Add a class representing execution graph, or at least one representation of graph that is used by command-line tool. Implemented a bunch of --show options.
Added new options for building/modifying pipelines. Also added quantum graph options, but they are not used yet by cmdLineFwk (will work on qgraph next). Code that builds pipelines based on command line arguments was factored out into a PipelineBuilder class.
Still WIP, many updates to parser, SuperTask, SuperTaskConfig, unit test and examples.
Part of a pre-flight algorithm is in SqlRegistry class in daf_butler. Core in this package takes data generated by registry, conbines that into Qunata and builds complete QuantumGraph. I have added couple of example tasks which are less trivial than examples that we had before and they also work with actual unit types that exist in registry today. Can be used to build QGraph based on ci_hsc registry for example.
Class is actually renamed to PipelineTask. Updated all dependencies here to use PipelineTask from pipe_base.
The interface of PipelineTask class has changed, updated all dependencies here.
Loops are only detected when we order pipeline and previously ordering was only done when --order-pipeline option was set. Now we do checks unconditionally. Added one more test for trivial loop.
To run things we have to have all input DataRefs in the same collection as output, added association of the all quantumGraph inputs with output collection. Current assumption is that output collection is empty, will have to extend implementation to support non-empty collections. Updated example tasks to return data that can be saved to butler, with that `stac` is now able to run complete chain on `ci_hsc` repo and write something into that repo.
Data units were renamed in daf_butler schema, Camera becomes Instrument and Sensor becomes Detector. These names are used in few places in this package too, follow that renaming in all usage of the unit names.
Consistently use `logging` in Python code, CmdLineFwk configures logging output to be forwarded to lsst.log.
These methods now return DatasetTypeDescriptor instance instead of plain DatasetType, need to change all uses of the returned maps.
Change in DatasetType avoids the need to register StorageClass with factory when StorageClass is not going to be used. Removed all registration of example StorageClass in unit tests and task examples.
Make the PipelineTask activator able to parse config parameter overrides which are lists when the overrides are passed from the command line.
In addition, all DatasetTypes are now always checked against those in the Registry.
4d05848
to
d02f863
Compare
PipelineBuilder is now looking more or less like normal class with methods rather than something executing external actions. Few other small changes to unit tests, exclude parser stuff from flake8 testing.
6e051c1
to
3393fcb
Compare
Modules that have been moved here:
configOverrides
graph
graphBuilder
pipeline
pipeTools
pipelineBuilder
TaskFactory
was addedMore or less complete history was copied too, branch
tickets/DM-16867-supertask-history
contains that history and its HEAD corresponds to the master ofpipe_supertask
. All migration changes to the code were done after merging that branch intotickets/DM-16867
.