From 81bfeb3d0a6a41889a7b81e1b8256d875ebde4c4 Mon Sep 17 00:00:00 2001
From: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>
Date: Sun, 27 Jun 2021 00:39:04 +0530
Subject: [PATCH] Add documentation for aggregation (#122)

* Add aggregation docs without FlowGroups

* Add documentation for aggregation with FlowGroups

* Address reviews and update docs with current API

* Improve wordings

* Address code review

* Improvements to doc

* Improve wording

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Explain that operations are like aggregate operations acting on aggregates of one job.

Co-authored-by: Carl Simon Adorf

* Add aggregation to table of contents.

* Rename section to match FlowGroup.

* Unitalicize.

* Fix links to pre/post.

* Fix intersphinx references.

* Use :py: role prefix for consistency with other docs.

Co-authored-by: Bradley Dice
Co-authored-by: Carl Simon Adorf
---
 docs/source/aggregation.rst  | 191 +++++++++++++++++++++++++++++++++++
 docs/source/flow-project.rst |   2 +-
 docs/source/index.rst        |   1 +
 3 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/aggregation.rst

diff --git a/docs/source/aggregation.rst b/docs/source/aggregation.rst
new file mode 100644
index 00000000..00c66f20
--- /dev/null
+++ b/docs/source/aggregation.rst
@@ -0,0 +1,191 @@
+.. _aggregation:
+
+===========
+Aggregation
+===========
+
+This chapter provides information about passing aggregates of jobs to operation functions.
+
+
+.. _aggregator_definition:
+
+Definition
+==========
+
+An :py:class:`~flow.aggregator` is used as a decorator for operation functions which accept a variable number of positional arguments, ``*jobs``.
+The argument ``*jobs`` is unpacked into an *aggregate*, defined as an ordered tuple of jobs.
+See also the Python documentation about :ref:`argument unpacking `.
+
+.. code-block:: python
+
+    # project.py
+    from flow import FlowProject, aggregator
+
+    class Project(FlowProject):
+        pass
+
+    @aggregator()
+    @Project.operation
+    def op1(*jobs):
+        print("Number of jobs in aggregate:", len(jobs))
+
+    @Project.operation
+    def op2(job):
+        pass
+
+    if __name__ == "__main__":
+        Project().main()
+
+If :py:class:`~flow.aggregator` is used with the default arguments, it will create a single aggregate containing all the jobs present in the project.
+In the example above, ``op1`` is an *aggregate operation* where all the jobs present in the project are passed as a variable number of positional arguments (via ``*jobs``), while ``op2`` is an operation where only a single job is passed as an argument.
+
+.. tip::
+
+    The concept of aggregation may be easier to understand if one realizes that "normal" operation functions are equivalent to aggregate operation functions with an aggregate group size of one job.
+
+
+.. note::
+
+    For an aggregate operation, all conditions like :py:meth:`~flow.FlowProject.pre` or :py:meth:`~flow.FlowProject.post`, callable directives, and other features are required to take the same number of jobs as the operation as arguments.
+
+.. _types_of_aggregation:
+
+Types of Aggregation
+====================
+
+Currently, **signac-flow** allows users to aggregate jobs in the following ways:
+
+- *All jobs*: All of the project's jobs are passed to the operation function.
+- *Group by state point key(s)*: The aggregates are grouped by one or more state point keys.
+- *Group by arbitrary key function*: The aggregates are grouped by keys determined by a key-function that expects an instance of :py:class:`~.signac.contrib.job.Job` and returns the grouping key.
+- Grouping into aggregates of a specific size.
+- Using a completely custom aggregator function when even greater flexibility is needed.
+
+Group By
+--------
+
+:py:meth:`~flow.aggregator.groupby` allows users to aggregate jobs by grouping them by a state point key, an iterable of state point keys whose values define the groupings, or an arbitrary callable of :py:class:`~signac.contrib.job.Job`.
+
+.. code-block:: python
+
+    @aggregator.groupby("temperature")
+    @Project.operation
+    def op3(*jobs):
+        pass
+
+In the above example, the jobs will be aggregated based on the state point key ``"temperature"``.
+So, all the jobs having the same value of **temperature** in their state point will be aggregated together.
+
+Groups Of
+---------
+
+:py:meth:`~flow.aggregator.groupsof` allows users to aggregate jobs by generating aggregates of a given size.
+
+.. code-block:: python
+
+    @aggregator.groupsof(2)
+    @Project.operation
+    def op4(job1, job2=None):
+        pass
+
+In the above example, the jobs will get aggregated in groups of 2 and hence, up to two jobs will be passed as arguments at once.
+
+.. note::
+
+    In case the number of jobs in the project in this example is odd, there will be one aggregate containing only a single job.
+    In general, the last aggregate from :py:meth:`~flow.aggregator.groupsof` will contain the remaining jobs if the aggregate size does not evenly divide the number of jobs in the project.
+    If a remainder is expected and valid, users should make sure that the operation function can be called with the reduced number of arguments (e.g. by using ``*jobs`` or providing default arguments as shown above).
+
+Sorting jobs for aggregation
+----------------------------
+
+Aggregators allow users to sort the jobs before creating aggregates with the ``sort_by`` parameter.
+The sorting order can be defined with the ``sort_ascending`` parameter.
+By default, when no ``sort_by`` parameter is specified, the order of the jobs will be decided by the iteration order of the **signac** project.
+
+.. code-block:: python
+
+    @aggregator.groupsof(2, sort_by="temperature", sort_ascending=False)
+    @Project.operation
+    def op5(*jobs):
+        pass
+
+.. note::
+
+    In the above example, all the jobs will be sorted by the state point parameter ``"temperature"`` in descending order and then be aggregated as groups of 2.
+
+Selecting jobs for aggregation
+------------------------------
+
+**signac-flow** allows users to selectively choose which jobs to pass into operation functions.
+This can be used to generate aggregates from only the selected jobs, excluding any jobs that do not meet the selection criteria.
+
+.. code-block:: python
+
+    @aggregator(select=lambda job: job.sp.temperature > 0)
+    @Project.operation
+    def op6(*jobs):
+        pass
+
+
+.. _aggregate_id:
+
+Aggregate ID
+============
+
+Similar to the concept of a job id, an aggregate id is a unique hash identifying an aggregate of jobs.
+The aggregate id is sensitive to the order of the jobs in the aggregate.
+
+
+.. note::
+
+    The id of an aggregate containing one job is that job's id.
+
+In order to distinguish between an aggregate id and a job id, the id of aggregates with more than one job will always have a prefix ``agg-``.
+
+Users can generate the aggregate id of an aggregate using :py:func:`flow.get_aggregate_id`.
+
+.. tip::
+
+    Users can also pass an aggregate id to the ``--job-id`` command line flag provided by **signac-flow** in ``run``, ``submit``, and ``exec``.
+
+
+.. _aggregation_with_flow_groups:
+
+Aggregation with FlowGroups
+===========================
+
+In order to associate an aggregator object with a :py:class:`~flow.project.FlowGroup`, **signac-flow** provides a ``group_aggregator`` parameter in :py:meth:`~flow.FlowProject.make_group`.
+By default, no aggregation takes place for a :py:class:`FlowGroup`.
+
+.. note::
+
+    All the operations in a :py:class:`~flow.project.FlowGroup` will use the same :py:class:`~flow.aggregator` object provided to the group's ``group_aggregator`` parameter.
+
+.. code-block:: python
+
+    # project.py
+    from flow import FlowProject, aggregator
+
+    class Project(FlowProject):
+        pass
+
+    group = Project.make_group("agg-group", group_aggregator=aggregator())
+
+    @group
+    @aggregator()
+    @Project.operation
+    def op1(*jobs):
+        pass
+
+    @group
+    @Project.operation
+    def op2(*jobs):
+        pass
+
+    if __name__ == "__main__":
+        Project().main()
+
+In the above example, when the group ``agg-group`` is executed using ``python project.py run -o agg-group``, all the jobs in the project are passed as positional arguments for both ``op1`` and ``op2``.
+If ``op1`` is executed using ``python project.py run -o op1``, all the jobs in the project are passed as positional arguments because a :py:class:`~flow.aggregator` is associated with the operation function ``op1`` (separately from the aggregator used for ``agg-group``).
+If ``op2`` is executed using ``python project.py run -o op2``, only a single job is passed as an argument because no :py:class:`~flow.aggregator` is associated with the operation function ``op2``.
diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst
index 6f6edb53..600f796b 100644
--- a/docs/source/flow-project.rst
+++ b/docs/source/flow-project.rst
@@ -49,7 +49,7 @@ Defining a workflow
 ===================
 
 We will reproduce the simple workflow introduced in the previous section by first copying both the ``greeted()`` condition function and the ``hello()`` *operation* function into the ``project.py`` module.
-We then use the :py:func:`~flow.FlowProject.operation` and the :py:func:`~.flow.FlowProject.post` decorator functions to specify that the ``hello()`` operation function is part of our workflow and that it should only be executed if the ``greeted()`` condition is not met.
+We then use the :py:meth:`~flow.FlowProject.operation` and the :py:meth:`~flow.FlowProject.post` decorator functions to specify that the ``hello()`` operation function is part of our workflow and that it should only be executed if the ``greeted()`` condition is not met.
 
 .. code-block:: python
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 41a3dbe4..9438a1e9 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -48,6 +48,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro
     environments
     templates
     flow-group
+    aggregation
     indexing
     collections
     configuration