From 81bfeb3d0a6a41889a7b81e1b8256d875ebde4c4 Mon Sep 17 00:00:00 2001
From: Hardik Ojha <44747868+kidrahahjo@users.noreply.github.com>
Date: Sun, 27 Jun 2021 00:39:04 +0530
Subject: [PATCH] Add documentation for aggregation (#122)

* Add aggregation docs without FlowGroups

* Add documentation for aggregation with FlowGroups

* Address reviews and update docs with current API

* Improve wordings

* Address code review

* Improvements to doc

* Improve wording

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

* Update docs/source/aggregation.rst

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Update docs/source/aggregation.rst

* Explain that operations are like aggregate operations acting on aggregates of one job.

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>

* Add aggregation to table of contents.

* Rename section to match FlowGroup.

* Unitalicize.

* Fix links to pre/post.

* Fix intersphinx references.

* Use :py: role prefix for consistency with other docs.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
---
 docs/source/aggregation.rst  | 191 +++++++++++++++++++++++++++++++++++
 docs/source/flow-project.rst |   2 +-
 docs/source/index.rst        |   1 +
 3 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/aggregation.rst

diff --git a/docs/source/aggregation.rst b/docs/source/aggregation.rst
new file mode 100644
index 00000000..00c66f20
--- /dev/null
+++ b/docs/source/aggregation.rst
@@ -0,0 +1,191 @@
+.. _aggregation:
+
+===========
+Aggregation
+===========
+
+This chapter provides information about passing aggregates of jobs to operation functions.
+
+
+.. _aggregator_definition:
+
+Definition
+==========
+
+An :py:class:`~flow.aggregator` is used as a decorator for operation functions which accept a variable number of positional arguments, ``*jobs``.
+The operation receives an *aggregate*, defined as an ordered tuple of jobs, which is unpacked into the positional arguments collected by ``*jobs``.
+See also the Python documentation about :ref:`argument unpacking <python:tut-unpacking-arguments>`.
+
+.. code-block:: python
+
+    # project.py
+    from flow import FlowProject, aggregator
+
+    class Project(FlowProject):
+        pass
+
+    @aggregator()
+    @Project.operation
+    def op1(*jobs):
+        print("Number of jobs in aggregate:", len(jobs))
+
+    @Project.operation
+    def op2(job):
+        pass
+
+    if __name__ == "__main__":
+        Project().main()
+
+If :py:class:`~flow.aggregator` is used with the default arguments, it will create a single aggregate containing all the jobs present in the project.
+In the example above, ``op1`` is an *aggregate operation* where all the jobs present in the project are passed as a variable number of positional arguments (via ``*jobs``), while ``op2`` is an operation where only a single job is passed as an argument.
+
+.. tip::
+
+    The concept of aggregation may be easier to understand if one realizes that "normal" operation functions are equivalent to aggregate operation functions whose aggregates each contain exactly one job.
+
+
+.. note::
+
+    For an aggregate operation, all conditions (such as :py:meth:`~flow.FlowProject.pre` or :py:meth:`~flow.FlowProject.post`), callable directives, and other features that receive jobs must accept the same number of jobs as the operation itself.
+
+.. _types_of_aggregation:
+
+Types of Aggregation
+====================
+
+Currently, **signac-flow** allows users to aggregate jobs in the following ways:
+
+- *All jobs*: All of the project's jobs are passed to the operation function.
+- *Group by state point key(s)*: The aggregates are grouped by one or more state point keys.
+- *Group by arbitrary key function*: The aggregates are grouped by keys determined by a key function that expects an instance of :py:class:`~signac.contrib.job.Job` and returns the grouping key.
+- *Groups of a specific size*: The jobs are divided into aggregates containing a given number of jobs.
+- *Custom aggregator function*: A completely custom aggregator function can be used when even greater flexibility is needed.
+
+Group By
+--------
+
+:py:meth:`~flow.aggregator.groupby` allows users to aggregate jobs by grouping them by a state point key, an iterable of state point keys whose values define the groupings, or an arbitrary callable of :py:class:`~signac.contrib.job.Job`.
+
+.. code-block:: python
+
+    @aggregator.groupby("temperature")
+    @Project.operation
+    def op3(*jobs):
+        pass
+
+In the above example, the jobs are aggregated based on the state point key ``"temperature"``.
+All jobs with the same value of ``"temperature"`` in their state point are placed in the same aggregate.
+
+Groups Of
+---------
+
+:py:meth:`~flow.aggregator.groupsof` allows users to aggregate jobs by generating aggregates of a given size.
+
+.. code-block:: python
+
+    @aggregator.groupsof(2)
+    @Project.operation
+    def op4(job1, job2=None):
+        pass
+
+In the above example, the jobs are aggregated in groups of 2, so up to two jobs are passed as arguments at once.
+
+.. note::
+
+    If the number of jobs in the project in this example is odd, there will be one aggregate containing only a single job.
+    In general, the last aggregate from :py:meth:`~flow.aggregator.groupsof` will contain the remaining jobs if the aggregate size does not evenly divide the number of jobs in the project.
+    If a remainder is expected and valid, users should make sure that the operation function can be called with the reduced number of arguments (e.g. by using ``*jobs`` or providing default arguments as shown above).
+
+Sorting jobs for aggregation
+----------------------------
+
+Aggregators allow users to sort the jobs before creating aggregates with the ``sort_by`` parameter.
+The sorting order can be defined with the ``sort_ascending`` parameter.
+By default, when no ``sort_by`` parameter is specified, the order of the jobs will be decided by the iteration order of the **signac** project.
+
+.. code-block:: python
+
+    @aggregator.groupsof(2, sort_by="temperature", sort_ascending=False)
+    @Project.operation
+    def op5(*jobs):
+        pass
+
+.. note::
+
+    In the above example, all the jobs will be sorted by the state point parameter ``"temperature"`` in descending order and then be aggregated as groups of 2.
+
+Selecting jobs for aggregation
+------------------------------
+
+**signac-flow** allows users to choose which jobs to pass into operation functions with the ``select`` parameter.
+This can be used to generate aggregates from only the selected jobs, excluding any jobs that do not meet the selection criteria.
+
+.. code-block:: python
+
+    @aggregator(select=lambda job: job.sp.temperature > 0)
+    @Project.operation
+    def op6(*jobs):
+        pass
+
+
+.. _aggregate_id:
+
+Aggregate ID
+============
+
+Similar to the concept of a job id, an aggregate id is a unique hash identifying an aggregate of jobs.
+The aggregate id is sensitive to the order of the jobs in the aggregate.
+
+
+.. note::
+
+    The id of an aggregate containing one job is that job's id.
+
+To distinguish an aggregate id from a job id, the id of an aggregate with more than one job always has the prefix ``agg-``.
+
+Users can generate the aggregate id of an aggregate using :py:func:`flow.get_aggregate_id`.
+
+.. tip::
+
+    Users can also pass an aggregate id to the ``--job-id`` command line flag provided by **signac-flow** in ``run``, ``submit``, and ``exec``.
+
+
+.. _aggregation_with_flow_groups:
+
+Aggregation with FlowGroups
+===========================
+
+In order to associate an aggregator object with a :py:class:`~flow.project.FlowGroup`, **signac-flow** provides a ``group_aggregator`` parameter in :py:meth:`~flow.FlowProject.make_group`.
+By default, no aggregation takes place for a :py:class:`~flow.project.FlowGroup`.
+
+.. note::
+
+    All the operations in a :py:class:`~flow.project.FlowGroup` will use the same :py:class:`~flow.aggregator` object provided to the group's ``group_aggregator`` parameter.
+
+.. code-block:: python
+
+    # project.py
+    from flow import FlowProject, aggregator
+
+    class Project(FlowProject):
+        pass
+
+    group = Project.make_group("agg-group", group_aggregator=aggregator())
+
+    @group
+    @aggregator()
+    @Project.operation
+    def op1(*jobs):
+        pass
+
+    @group
+    @Project.operation
+    def op2(*jobs):
+        pass
+
+    if __name__ == "__main__":
+        Project().main()
+
+In the above example, when the group ``agg-group`` is executed using ``python project.py run -o agg-group``, all the jobs in the project are passed as positional arguments for both ``op1`` and ``op2``.
+If ``op1`` is executed using ``python project.py run -o op1``, all the jobs in the project are passed as positional arguments because a :py:class:`~flow.aggregator` is associated with the operation function ``op1`` (separately from the aggregator used for ``agg-group``).
+If ``op2`` is executed using ``python project.py run -o op2``, only a single job is passed as an argument because no :py:class:`~flow.aggregator` is associated with the operation function ``op2``.
diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst
index 6f6edb53..600f796b 100644
--- a/docs/source/flow-project.rst
+++ b/docs/source/flow-project.rst
@@ -49,7 +49,7 @@ Defining a workflow
 ===================
 
 We will reproduce the simple workflow introduced in the previous section by first copying both the ``greeted()`` condition function and the ``hello()`` *operation* function into the ``project.py`` module.
-We then use the :py:func:`~flow.FlowProject.operation` and the :py:func:`~.flow.FlowProject.post` decorator functions to specify that the ``hello()`` operation function is part of our workflow and that it should only be executed if the ``greeted()`` condition is not met.
+We then use the :py:meth:`~flow.FlowProject.operation` and the :py:meth:`~flow.FlowProject.post` decorator functions to specify that the ``hello()`` operation function is part of our workflow and that it should only be executed if the ``greeted()`` condition is not met.
 
 .. code-block:: python
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 41a3dbe4..9438a1e9 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -48,6 +48,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro
    environments
    templates
    flow-group
+   aggregation
    indexing
    collections
    configuration