Implementing a grouping feature to organize flow operations (#114)
* Enables the creation and registering of groups.

* The FlowGroup class was created to represent the group concept and store
the necessary variables and methods.
* The classmethod make_group was created which returns a decorator that
adds a _flow_group label to functions for later registration. In
addition, make_group also adds an entry to the new _GROUPS attribute in
the _FlowProject class.
* A function _register_groups was added that registers groups by using
the _GROUPS and _OPERATIONS attributes of the current and parent
classes.
* The __init__ method of FlowGroup was changed to automatically register
groups and add operations at initialization.
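The labeling-then-registering pattern described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the signac-flow implementation: `MiniProject` and `register_groups` are hypothetical names, and only the `_flow_group` label and the `_GROUPS` attribute are taken from the commit message.

```python
class MiniProject:
    """Hypothetical stand-in for a project class; not the signac-flow API."""

    _GROUPS = []  # names of all groups created through make_group

    @classmethod
    def make_group(cls, name):
        """Record the group name and return a decorator that labels functions."""
        cls._GROUPS.append(name)

        def decorator(func):
            # Attach the label to the function for later registration.
            labels = getattr(func, "_flow_group", [])
            func._flow_group = labels + [name]
            return func

        return decorator


def register_groups(project_cls, functions):
    """Map each recorded group name to the functions labeled with it."""
    return {
        name: [f.__name__ for f in functions
               if name in getattr(f, "_flow_group", [])]
        for name in project_cls._GROUPS
    }


analysis = MiniProject.make_group("analysis")


@analysis
def compute_energy(job):
    return "energy"


@analysis
def compute_pressure(job):
    return "pressure"


registered = register_groups(MiniProject, [compute_energy, compute_pressure])
```

Because the decorator returns the original function unchanged apart from the label, decorated functions stay directly callable.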

* Adding function to create group CL arg.

Adds the group commandline argument to submit, run, and script.

* First functioning group submit.

* _main_submit modified to have group path
* submit_group function added to submit groups
* changes do not directly update status
* no checks have been made on whether directives work

* Adding support for groups in run command.

* changed the way that operation names are gathered for run condition to
aggregate those chosen from -o and -g options.

* Adding script support for group option.

* Raise ValueError if -o/--operation and -g/--group are set together
* Add logic to include JobGroups as script operations if -o/--operations
is set.

* Adding support for command options.

* Adding an options parameter for groups that appends options to the
output of FlowGroup's __call__ method. This allows you to set
options like --num-passes for a group. These options currently
apply only during submit and script operations.

* Adding new methods to FlowGroup.

* FlowGroups now have a complete and eligible method
* FlowGroups now have two cmd's run_cmd and exec_cmd
* FlowGroups.__call__ now takes a flag on cmd mode
* make_group FlowProject method takes correct args and stores them in
  operation function dict correctly
* FlowGroups store operations as a dict now
* _register_groups FlowProject method correctly registers operations
* Temporary changes made to JobGroup to test FlowGroup function

* Adding bundling, exec_mode, op group conversion.

* JobGroups now correctly create unique id.
* JobOperations are removed as they are redundant.
* All _main_script and _main_submit always use group pathway now.
* Exec_mode runs and raises error for groups with more than one job.
* Bundling works with groups of all kinds.

* Changed JobOperation implementation.

* Changed functions that gather operations.
* Changed the structure of JobOperations
* Add method to FlowGroup to create JobOperations.

* Integration of operation and group execution path.

* Submit, run, and script all handle groups and operations through -o
* FlowGroups are now always created with knowledge of path. This is
required because there is not an easy way to find where an object is
instantiated. The current method uses the stack when using the
classmethod make_group, and the class attribute _OPERATION_FUNCTIONS
when creating FlowGroups directly from a FlowOperation.
* Appropriate errors are raised when undefined behavior might occur.

* Adding group support for run command.

* Adding support for groups in submit function.

* Adding requested revisions

* Changing FlowGroups compatible method
* Moving submit and JobOperation back to original location
* Added a groups property to FlowProject class
* Moved all call logic for FlowGroup to __call__ method
* Moved exec_mode command-line argument conversion to string to _main
functions.
* Changed logic in _verify_group_compatibility using python's set type
* renamed submit_operations group argument to operations

* Changing _verify_group_compatibility logic.

* Also doing linting changes that flake8 requested.
* The changes to the `_setup_template_environment` were required to
commit with the flake8 commit hook.

* Adding after condition for groups.

* Working on fixing bugs unit tests reveal.

* Also adding exec mode to relevant tests.

* Changing run behavior always to exec mode.

* Since operations are always expanded from groups before-hand, exec
mode is the correct mode to be in when calling a run command.

* Changing groups behavior to pass tests.

* Changing the `get_id` function in `FlowGroup` class to match previous
id in the case of single operation groups.
* Changing `_dumps_op` and `_loads_op` functions to include
`JobOperation` id since it is now an attribute.
* Created a function `_gather_FlowGroups` that takes a list of
group/operation names and returns a list of instances of `FlowGroups`
with removed duplication and error checking.
* Add argument `mode` to submit function. Previously a kwarg check was
performed. This changes behavior to use the string, which is the behavior
elsewhere.
	* _main_submit was changed to pass the string to submit as well.
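Gathering groups by name with duplicate removal and error checking could look roughly like the sketch below. The function `gather_groups` and the dict-based `registry` are hypothetical; the commit message does not show the actual signature of `_gather_FlowGroups`.

```python
def gather_groups(names, registry):
    """Return the groups named in `names`, in first-seen order, without duplicates."""
    gathered = {}
    for name in names:
        if name not in registry:
            raise ValueError("Unknown group or operation: {!r}".format(name))
        # Dicts keep insertion order (Python 3.7+), so repeats are ignored in place.
        gathered.setdefault(name, registry[name])
    return list(gathered.values())


# Hypothetical registry: each name maps to the operations it contains.
registry = {"op1": ["op1"], "op2": ["op2"], "group1": ["op1", "op2"]}
result = gather_groups(["group1", "op1", "group1"], registry)
```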

* Added small changes in tests to match new implementations.

* All references to `JobOperation.get_id()` are changed to use the
attribute `id`.
* Calls to `script` and `submit` that check output have 'exec' mode
explicitly set.

* Fixing error in retrieving function path.

* Including inheritance hierarchy in gathering operation functions.

* Adding '--exec' option to `_main_script` test.

* Changing yield from to yield with a for loop for py2.

* Raising error if operation is added to group twice.

* Changing way group operations are joined for getting id.

* Adding regex support for selecting groups and operations

* had to refactor operation selection to include fullmatch.
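A minimal sketch of why full matching matters when names are selected by regular expression: with `re.fullmatch`, the pattern `op1` selects only `op1` and not `op10`. The helper `select_names` is hypothetical and only illustrates the idea.

```python
import re


def select_names(patterns, available):
    """Return the names that are fully matched by at least one pattern."""
    return [
        name for name in available
        if any(re.fullmatch(pattern, name) for pattern in patterns)
    ]


names = ["op1", "op10", "group1", "post_op1"]
exact = select_names(["op1"], names)
wildcard = select_names(["op.*", "group1"], names)
```

With a prefix match (`re.match`) the pattern `op1` would also select `op10`, which is usually not what a user typing an exact operation name intends.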

* Making groups resubmit safe

* Also adding necessary tests for group behavior in submit, run, and
script
* Added `operation_ids` as attribute of JobOperation to hold individual
operation ids for resubmitting checks.
* Added `eligible_for_submission` method for JobOperation to check that
none of its operations are currently submitted.
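The resubmission check described above can be pictured as follows. This is a sketch under assumptions: the class `SubmittableUnit` and the `submitted_ids` argument are hypothetical, and only the names `operation_ids` and `eligible_for_submission` come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class SubmittableUnit:
    """Hypothetical stand-in for a job-operation that may wrap several operations."""

    name: str
    operation_ids: list = field(default_factory=list)

    def eligible_for_submission(self, submitted_ids):
        # Eligible only if none of the individual operations is already submitted.
        return all(op_id not in submitted_ids for op_id in self.operation_ids)


submitted = {"job1-op1"}
group = SubmittableUnit("group1", ["job1-op1", "job1-op2"])
single = SubmittableUnit("op3", ["job1-op3"])
```

Here `group` is blocked because one of its operations is already submitted, while `single` remains eligible.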

* Test main_submit and change dynamic tests

* For dynamic tests a workflow with multiple steps is tested. For the
groups test, it is more important to show that resubmission is not
possible. The multistep workflow is shown to work in the previous tests.

* Remove redundant expansion of operation names in _main_run

* test_main_submit for groups with multiple calls.

This is necessary since one assertIn command cannot be used to ensure
all operations are in the command as the ordering in python version 3.5
is not commensurate with python > 3.5.

* Change `FlowGroup.get_id` to `FlowGroup._generate_id`

* Singleton `FlowGroup`s and move directives to `FlowGroup`s

* Currently the singleton groups must have the same name as the
operation name. As this is not changeable by the user, that should be
okay.
* I had to set the directives 'executable' and 'override_path' for
singleton groups with the `@cmd` decorator. This results in some ugly
logic that should work, but should change if there is an easier way.

* Remove group specific status.

* Removed group status from `set_status` command.
* Redefined `JobOperation.get_status` to always return a list of
  operation statuses.
* Logic has yet to be moved to `FlowProject`. I did not see a clear way
  to query whether a particular job-operation pair was currently submitted
  other than directly searching the project.
* The test change was just to correct for the fact that now `get_status`
  returns a list.

* Move schedule status checks to group level.

* Also show operation status through querying group status and expanding
  to composing operations.

* Fix cmd-operations.

* Initial check for prior group submission

* Prevents unnecessary checking if the same group is previously submitted

* Create two paths for gathering JobOperations

* One path for submission
* One path for running

* Remove obsolete doc-string entry.

There is actually no need to further specify what the id *should* look
like. It must be unique, that is the only requirement.

* Fix JobOperation.__repr__().

* Flip the FlowOperation constructor calls.

* Reorganize the FlowGroup exec mode exception flow.

1. Raise error at the point of assumption.
2. Use dedicated exception class to propagate error.
3. Raise SubmitError only during submission.
4. Raise ScriptError during script generation.

* Refactor run-cmd generation and fix one bug.

  * Simplified logic.
  * Use the group name as argument to -o in all cases.

* Misc. improvements and optimizations.

* Change finding of FlowGroup path to use stack.

* may need to change again but fixed a bug on my computer.

* Keep track of module path during FlowProject object instantiation.

* Change test strings to match new output and new pathfinding method

* New pathfinding method solves some problems but doesn't resolve all of
  them. Namely, for our tests the 'path' is always the test file.

* Removing test_group files from branch.

* Refactor run logic in FlowGroups.

* Remove six.PY2 testing
* Fix rebase errors
* Remove old function to find operation path

* Adding path and entrypoint attributes to FlowProject.

* Tests work again with minor refactoring to account for this new API.

* Addressing new path/entrypoint motif with test_templates

* `FlowGroup` and `JobOperation` __repr__ methods fix/add

* Update flow/errors.py

Co-Authored-By: Carl Simon Adorf <csadorf@umich.edu>

* Update docstring JobOperation.

* Setup FlowProject._entrypoint as a dict

* Using yield from to iterate over operations FlowGroups

* Move cmd evaluation to when it is asked for.

* Change add_operation of FlowProject to make group as well

* Adding entrypoint argument to derived classes for tests

* Reverting make_bundles

* Fixing docstring for next_operation

* Fixing store_bundled bug

* str check in JobOperation cmd property

* Callable directive support for entrypoints.

* Using _collect_operations for _register_groups

* Order statuses after grabbing all operation statuses

* Change operations to OrderedDict in _gather_flow_groups

* Add documentation

* Add error catching back to test

* Make eligible_for_submission not user facing

* Add fork whitelisting from directives

* Rebasing missed this

* Fix errors introduced by merge.

Involved FlowGroups respecting ignore conditions

* Fix typos and style errors

* Cleanup ``FlowGroup`` some

* Change ``FlowGroup.intersects`` to ``FlowGroup.isdisjoint``

Matches set API better
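For reference, a group-level `isdisjoint` that mirrors Python's `set.isdisjoint` might look like the sketch below. `OperationGroup` is a hypothetical class; the real `FlowGroup` stores more than a set of names.

```python
class OperationGroup:
    """Hypothetical group holding a set of operation names."""

    def __init__(self, name, operations):
        self.name = name
        self.operations = set(operations)

    def isdisjoint(self, other):
        # True when the two groups share no operations, as with set.isdisjoint.
        return self.operations.isdisjoint(other.operations)


a = OperationGroup("a", ["op1", "op2"])
b = OperationGroup("b", ["op2", "op3"])
c = OperationGroup("c", ["op4"])
```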

* Fix typo in function name

* Move checking of conflicting group names earlier

Name conflicts are now checked at the time the group functor is created.

* Remove ability to specify exec and run command generation

* Remove exec mode

Templates still broken.

* Remove cmd_prefix and get_prefix from templates

* All submitted jobs to ignore_conditions

* Use `FlowGroup.create_run_job_operations` for run

* Add operation specific directives

* Add operations to `FlowProject._GROUP_NAMES`

prevents users from having a group named the same as an operation since
singleton groups with operation name are automatically defined, and
shadowing should not be allowed.

* Change group registering logic

No longer recompute the operation list from `_collect_operations`
Also remove ExecCommandError

* Submission resource aggregation and group default directives

* Change Stampede2Environment for group parallel jobs

will currently break with bundling.

* Fix bug in correctly handling directives with Groups

* Fix group decorator to work with directives

* Make function for resolving FlowGroup directives

* Update flow/project.py

Co-Authored-By: Bradley Dice <bdice@bradleydice.com>

* Fix submission resource aggregation bug

* Create FlowGroupEntry class

Has two decorators: ``__call__`` and ``with_directives``.

* Remove directives from FlowGroups

* Add return type documentation.

* Make default_directives required for create_run_job_operations

* Update creation of submission JobOperations

Remove directives from _submit_cmd

Reduce noise in submission directives when 'ngpu', 'nranks',
or 'omp_num_threads' is set to 0
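One way to drop zero-valued resource directives before they reach a submission command is sketched below. The helper name and its exact behavior are assumptions; the commit message only states that noise is reduced when 'ngpu', 'nranks', or 'omp_num_threads' is set to 0.

```python
def nonzero_resource_directives(
        directives, keys=("ngpu", "nranks", "omp_num_threads")):
    """Return a copy of `directives` without resource keys whose value is 0."""
    return {
        key: value for key, value in directives.items()
        if not (key in keys and value == 0)
    }


cleaned = nonzero_resource_directives(
    {"nranks": 4, "ngpu": 0, "omp_num_threads": 0, "walltime": 1})
```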

* Add tests for directives handling

* fix FlowGroupEntry bug and add documentation

* Fix bug in FlowProject._resolve_directives

* Fix bug on submission command generation

* Apply suggestions from code review

Co-Authored-By: Carl Simon Adorf <csadorf@umich.edu>
Co-Authored-By: vyasr <vramasub@umich.edu>

* Remove ``ScriptError`` (wasn't used).

* Add get_id back to ``JobOperation`` and id property

* Update *Operation docstrings and ``FlowGroup`` docstrings

Also edit other comments

* Change _flow_group to _flow_groups

* Remove redundant calls to .keys() for dict

* the built-in __contains__ or __iter__ is faster.

* Modify `FlowGroupEntry.with_directives` function

* Change default parameter values ``FlowGroup.__init__``

* Change entrypoint priority

* Move to string logic into IgnoreConditions

* Style changes

* Remove redundant deprecated call

* Change group test project class names

* Fix bug in submission cmd string generation

* Apply suggestions from code review

Co-Authored-By: Carl Simon Adorf <csadorf@umich.edu>

* Return function after FlowGroupEntry decorator

Before we did not return the function back at the end

* Add FlowGroup to docs

* Remove get calls in get_status functions

* Edit docstrings and change argument name

Use ignore_conditions_on_execution instead of
ignore_conditions_on_submit

* Fix bug with get_status

* Used define_test_project for groups too

* Update tests for using groups

I had to add operations to the project classes so some tests needed to
be updated

* Fix entrypoint for group tests

* Don't evaluate entrypoint in run_cmd until necessary

* Change tests to use mock_project more fully

* Fix next_operations tests for new project class

* Remove op5 from test project definition

* Fix error from inconsistent naming

* Fix errors with testing

* Fix error on CI.

Entrypoints require we first cd into the test directory

* Revert "Fix error on CI."

This reverts commit aee0fe0.

* Use __file__ to determine entrypoint for tests

* Remove group specific test project

* Apply suggestions from code review

Co-Authored-By: Bradley Dice <bdice@bradleydice.com>
Co-Authored-By: Carl Simon Adorf <csadorf@umich.edu>

* Update FlowGroup and FlowProject API docs

* Rename methods for creating JobOperation from FlowGroup

* Revert previous changes __init__.py and errors.py

* make FlowGroupEntry.with_directives take a single dict

* Change cmd error checking to init : JobOperation

* Update docstrings, help strings, and comments

Also one style change

* Only add --ignore-conditions when necessary

Add logic to FlowGroup._submit_cmd

* Fix error where groups were not run in parallel

* Fix test to work with python 3.5

Statuses with groups are sorted by operation name (this is due to
operation status having to be grabbed through groups). We sort the
output. This just makes the test do the same.

* Change test_main_status to not be order sensitive for operations

* Update flow/project.py

Co-Authored-By: Bradley Dice <bdice@bradleydice.com>

* Fix FlowGroup.options and __repr__

Add test for __repr__

* Add comment explaining use of lazy cmd in JobOperation

* Make TestGroupProject.test_flowgroup_repr not run in Python 3.5

* Make test_flowgroup_repr not order dependent

* Update flow/project.py

Co-Authored-By: vyasr <vramasub@umich.edu>

* Update FlowGroup API docs

* Update template tests

* final changes

* change changelog

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: vyasr <vyas.ramasubramani@gmail.com>
4 people committed Feb 21, 2020
1 parent 1de7ea7 commit d3a46f7
Showing 11 changed files with 1,243 additions and 212 deletions.
1 change: 1 addition & 0 deletions changelog.txt
@@ -14,6 +14,7 @@ next
 Added
 +++++
 
+- Add ``FlowGroup``s (one or more operation grouping within an execution environment)(#114)
 - Add official support for University of Michigan Great Lakes cluster (#185).
 - Add official support for Bridges AI cluster (#222).
 - Add IgnoreConditions option for submit(), run() and script() (#38).
7 changes: 7 additions & 0 deletions doc/api.rst
@@ -37,6 +37,7 @@ The FlowProject
     FlowProject.label
     FlowProject.labels
     FlowProject.main
+    FlowProject.make_group
     FlowProject.next_operation
     FlowProject.next_operations
     FlowProject.operation
@@ -142,3 +143,9 @@ flow.get_environment()
 ----------------------
 
 .. autofunction:: get_environment
+
+The FlowGroup
+-------------
+
+.. autoclass:: flow.project.FlowGroup
+    :members: add_operation, complete, eligible, isdisjoint
14 changes: 9 additions & 5 deletions flow/environments/xsede.py
@@ -51,6 +51,7 @@ class Stampede2Environment(DefaultSlurmEnvironment):
     template = 'stampede2.sh'
     cores_per_node = 48
     mpi_cmd = 'ibrun'
+    offset_counter = 0
 
     @classmethod
     def add_args(cls, parser):
@@ -84,13 +85,16 @@ def _get_mpi_prefix(cls, operation, parallel):
         """
         if operation.directives.get('nranks'):
             if parallel:
-                return '{} -n {} -o {} task_affinity '.format(
-                    cls.mpi_cmd, operation.directives['nranks'],
-                    operation.directives['np_offset'])
+                prefix = '{} -n {} -o {} task_affinity '.format(
+                    cls.mpi_cmd, operation.directives['nranks'],
+                    cls.offset_counter)
+                cls.offset_counter += operation.directives['nranks']
             else:
-                return '{} -n {} '.format(cls.mpi_cmd, operation.directives['nranks'])
+                prefix = '{} -n {} '.format(cls.mpi_cmd,
+                                            operation.directives['nranks'])
         else:
-            return ''
+            prefix = ''
+        return prefix
 
 
 class BridgesEnvironment(DefaultSlurmEnvironment):
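The effect of the class-level counter in the Stampede2 change can be reproduced in isolation with the sketch below. `OffsetPrefixer` is a hypothetical stand-in that takes `nranks` directly instead of an operation object; the format strings follow the diff.

```python
class OffsetPrefixer:
    """Hypothetical stand-in for the environment class; state lives on the class."""

    mpi_cmd = "ibrun"
    offset_counter = 0

    @classmethod
    def mpi_prefix(cls, nranks, parallel):
        if nranks:
            if parallel:
                prefix = "{} -n {} -o {} task_affinity ".format(
                    cls.mpi_cmd, nranks, cls.offset_counter)
                # Advance the offset so the next parallel operation starts
                # after the ranks claimed by this one.
                cls.offset_counter += nranks
            else:
                prefix = "{} -n {} ".format(cls.mpi_cmd, nranks)
        else:
            prefix = ""
        return prefix


first = OffsetPrefixer.mpi_prefix(4, parallel=True)
second = OffsetPrefixer.mpi_prefix(8, parallel=True)
serial = OffsetPrefixer.mpi_prefix(2, parallel=False)
```

Because the counter is stored on the class, it keeps growing across calls and is never reset in this sketch. That may be related to the commit message's note that the Stampede2 change "will currently break with bundling".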