Add documentation for aggregation #122

Merged
merged 30 commits into from
Jun 26, 2021
Changes from 16 commits
197a1d4
Add aggregation docs without FlowGroups
kidrahahjo Feb 24, 2021
61f9c08
Add documentation for aggregation with FlowGroups
kidrahahjo Feb 24, 2021
cf56bd7
Merge remote-tracking branch 'origin/master' into feature/aggregation…
kidrahahjo May 21, 2021
b65f9f3
Address reviews and update docs with current API
kidrahahjo May 21, 2021
cc7de69
Improve wordings
kidrahahjo May 21, 2021
5d4b5c6
Address code review
kidrahahjo May 22, 2021
6d19ec7
Merge remote-tracking branch 'origin/master' into feature/aggregation…
kidrahahjo Jun 24, 2021
120db74
Improvements to doc
kidrahahjo Jun 24, 2021
c527af3
Improve wording
kidrahahjo Jun 24, 2021
a2fa6d8
Update docs/source/aggregation.rst
bdice Jun 25, 2021
971d828
Update docs/source/aggregation.rst
bdice Jun 25, 2021
5a3804e
Update docs/source/aggregation.rst
bdice Jun 25, 2021
1283361
Update docs/source/aggregation.rst
bdice Jun 25, 2021
5d64f3f
Update docs/source/aggregation.rst
bdice Jun 25, 2021
76ec6d4
Update docs/source/aggregation.rst
bdice Jun 25, 2021
c82be52
Update docs/source/aggregation.rst
bdice Jun 25, 2021
7ffd383
Update docs/source/aggregation.rst
bdice Jun 25, 2021
35cbf3b
Update docs/source/aggregation.rst
bdice Jun 25, 2021
c00bb86
Update docs/source/aggregation.rst
bdice Jun 25, 2021
49c31f2
Update docs/source/aggregation.rst
bdice Jun 25, 2021
375b3cd
Update docs/source/aggregation.rst
bdice Jun 25, 2021
7d5b3d5
Update docs/source/aggregation.rst
bdice Jun 25, 2021
573eb10
Update docs/source/aggregation.rst
bdice Jun 25, 2021
16b1c97
Explain that operations are like aggregate operations acting on aggre…
bdice Jun 26, 2021
8047341
Add aggregation to table of contents.
bdice Jun 26, 2021
d16f43a
Rename section to match FlowGroup.
bdice Jun 26, 2021
9cd9995
Unitalicize.
bdice Jun 26, 2021
23576c6
Fix links to pre/post.
bdice Jun 26, 2021
abd7544
Fix intersphinx references.
bdice Jun 26, 2021
aca6752
Use :py: role prefix for consistency with other docs.
bdice Jun 26, 2021
184 changes: 184 additions & 0 deletions docs/source/aggregation.rst
@@ -0,0 +1,184 @@
.. _aggregation:

===========
Aggregation
===========

This chapter provides information about passing aggregates of jobs to operation functions.


.. _aggregator_definition:

aggregator
==========

An :class:`~flow.aggregator` is used as a decorator for operation functions which accept a variable number of positional arguments, ``*jobs``.
The argument ``*jobs`` is unpacked into an *aggregate*, defined as an ordered tuple of jobs.
See also the Python documentation about :ref:`argument unpacking <python:tut-unpacking-arguments>`.
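
As a plain-Python reminder of how ``*jobs`` behaves, here is a minimal sketch that does not use **signac-flow** at all. The function name ``count_jobs`` and the placeholder strings standing in for jobs are assumptions made for illustration.

.. code-block:: python

    def count_jobs(*jobs):
        # Positional arguments are collected into a tuple, preserving order.
        return type(jobs).__name__, len(jobs)


    aggregate = ("job_a", "job_b", "job_c")  # placeholder strings, not real jobs
    # The leading * unpacks the tuple into three positional arguments.
    result = count_jobs(*aggregate)
    print(result)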

.. code-block:: python

    # project.py
    from flow import FlowProject, aggregator

    class Project(FlowProject):
        pass

    @aggregator()
    @Project.operation
    def op1(*jobs):
        print("Number of jobs in aggregate:", len(jobs))

    @Project.operation
    def op2(job):
        pass

    if __name__ == '__main__':
        Project().main()

If :class:`~flow.aggregator` is used with the default arguments, it will create a single aggregate containing all the jobs present in the project.
In the example above, ``op1`` is an *aggregate operation* where all the jobs present in the project are passed as a variable number of positional arguments (via ``*jobs``), while ``op2`` is a normal operation where only a single job is passed as an argument.

.. note::

    For an aggregate operation, all conditions such as :class:`~flow.FlowProject.pre` or :class:`~flow.FlowProject.post`, callable directives, and other features must accept the same number of jobs as the operation.
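
A condition for an aggregate operation therefore needs the same ``*jobs`` signature as the operation it guards. The following is a minimal sketch in plain Python: the name ``all_converged``, the ``"converged"`` document key, and the ``SimpleNamespace`` stand-ins for jobs are hypothetical and only illustrate the calling convention.

.. code-block:: python

    from types import SimpleNamespace


    def all_converged(*jobs):
        """Condition that accepts an aggregate, like the operation it guards."""
        return all(job.doc.get("converged", False) for job in jobs)


    # Hypothetical stand-ins for jobs that expose a dict-like ``doc``.
    job_a = SimpleNamespace(doc={"converged": True})
    job_b = SimpleNamespace(doc={"converged": False})

    print(all_converged(job_a))         # True
    print(all_converged(job_a, job_b))  # False

A callable of this shape could then be passed to a condition decorator such as ``Project.pre`` on an aggregate operation.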

.. _types_of_aggregation:

Types of Aggregation
====================

Currently, **signac-flow** allows users to aggregate jobs in the following ways:

- *All jobs*: All of the project's jobs are passed to the operation function.
- *Group by state point key(s)*: The aggregates are grouped by one or more state point keys.
- *Group by arbitrary key function*: The aggregates are grouped by keys determined by a key function that expects an instance of :class:`~signac.contrib.job.Job` and returns the grouping key.
- *Groups of a specific size*: The jobs are split into aggregates of a given size.
- *Custom aggregator function*: A completely custom aggregator function can be used when even greater flexibility is needed.

Group By
--------

:class:`~flow.aggregator.groupby` allows users to aggregate jobs by grouping them by a state point key, an iterable of state point keys whose values define the groupings, or an arbitrary callable of :class:`~signac.contrib.job.Job`.

.. code-block:: python

    @aggregator.groupby("temperature")
    @Project.operation
    def op3(*jobs):
        pass

In the above example, the jobs are aggregated by the state point key ``"temperature"``: all jobs with the same value of ``temperature`` in their state point form one aggregate.
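
The grouping idea can be sketched in plain Python with ``itertools.groupby``. This illustrates the concept only and is not presented as how **signac-flow** implements it; the helper ``group_jobs`` and the ``SimpleNamespace`` stand-ins for jobs are hypothetical. The second call shows how a key function returning a tuple groups by more than one state point key.

.. code-block:: python

    from itertools import groupby
    from types import SimpleNamespace

    # Hypothetical stand-ins for jobs with a state point ``sp``.
    jobs = [
        SimpleNamespace(id="a", sp={"temperature": 300, "pressure": 1}),
        SimpleNamespace(id="b", sp={"temperature": 350, "pressure": 1}),
        SimpleNamespace(id="c", sp={"temperature": 300, "pressure": 2}),
    ]


    def group_jobs(jobs, key):
        """Return a list of aggregates (tuples), one per distinct key value."""
        ordered = sorted(jobs, key=key)  # groupby only merges adjacent items
        return [tuple(group) for _, group in groupby(ordered, key=key)]


    by_temperature = group_jobs(jobs, key=lambda job: job.sp["temperature"])
    print([[job.id for job in agg] for agg in by_temperature])
    # [['a', 'c'], ['b']]

    by_both = group_jobs(
        jobs, key=lambda job: (job.sp["temperature"], job.sp["pressure"])
    )
    print(len(by_both))  # 3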

Groups Of
---------

:class:`~flow.aggregator.groupsof` allows users to aggregate jobs by generating aggregates of a given size.

.. code-block:: python

    @aggregator.groupsof(2)
    @Project.operation
    def op4(job1, job2=None):
        pass

In the above example, the jobs are aggregated in groups of 2, so up to two jobs are passed to the operation at once.

.. note::

    If the number of jobs in the project is odd in this example, there will be one aggregate containing only a single job.
    In general, the last aggregate from :class:`~flow.aggregator.groupsof` will contain the remaining jobs if the aggregate size does not evenly divide the number of jobs in the project.
    If a remainder is expected and valid, users should make sure that the operation function can be called with the reduced number of arguments (e.g. by using ``*jobs`` or providing default arguments as shown above).
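
The remainder behavior can be sketched in plain Python. The helper ``groups_of`` and the placeholder ids are hypothetical; the sketch only illustrates the chunking described above.

.. code-block:: python

    def groups_of(items, size):
        """Split ``items`` into consecutive tuples of at most ``size`` elements."""
        return [tuple(items[i:i + size]) for i in range(0, len(items), size)]


    job_ids = ["a", "b", "c", "d", "e"]  # placeholder ids for five jobs
    aggregates = groups_of(job_ids, 2)
    print(aggregates)
    # [('a', 'b'), ('c', 'd'), ('e',)]

With five jobs and a size of 2, the last aggregate holds a single job, which is why ``op4`` above gives ``job2`` a default value.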

@bdice bdice Jun 25, 2021

We could add more examples for some or all of the following:

  • Group by state point keys: The aggregates are grouped by multiple state point keys.
  • Group by arbitrary key function: The aggregates are grouped by keys determined by a key-function that expects an instance of :class:~.signac.contrib.job.Job and return the grouping key.
  • Using a completely custom aggregator function when even greater flexibility is needed.
  • Using sorting/selection in conjunction with other aggregator parameters.

@bdice bdice Jun 26, 2021

I created a new issue for this. #146.

Sorting jobs for aggregation
----------------------------

The ``sort_by`` parameter of an aggregator sorts the jobs before the aggregates are created.
The sort direction is set with the ``sort_ascending`` parameter.
By default, when no ``sort_by`` parameter is specified, the order of the jobs will be decided by the iteration order of the **signac** project.

.. code-block:: python

    @aggregator.groupsof(2, sort_by="temperature", sort_ascending=False)
    @Project.operation
    def op5(*jobs):
        pass

.. note::

    In the above example, all the jobs will be sorted by the state point parameter ``"temperature"`` in descending order and then be aggregated as groups of 2.
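
The effect of sorting before grouping can be sketched in plain Python. The ``SimpleNamespace`` stand-ins for jobs and the temperature values are made up for illustration.

.. code-block:: python

    from types import SimpleNamespace

    # Hypothetical stand-ins for jobs, in arbitrary project order.
    jobs = [SimpleNamespace(sp={"temperature": t}) for t in (300, 450, 350, 400)]

    # Sort in descending order first, then split into groups of 2.
    ordered = sorted(jobs, key=lambda job: job.sp["temperature"], reverse=True)
    pairs = [tuple(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]
    temps = [[job.sp["temperature"] for job in pair] for pair in pairs]
    print(temps)
    # [[450, 400], [350, 300]]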

Selecting jobs for aggregation
------------------------------

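**signac-flow** allows users to choose which jobs are passed into operation functions with the ``select`` parameter, a callable that takes a job and returns ``True`` if the job should be included.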

.. code-block:: python

    @aggregator(select=lambda job: job.sp.temperature > 0)
    @Project.operation
    def op6(*jobs):
        pass
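
The same predicate can be tried out in plain Python. In this minimal sketch the ``SimpleNamespace`` stand-ins for jobs and the name ``is_positive`` are hypothetical; only jobs for which the predicate returns ``True`` remain in the aggregate.

.. code-block:: python

    from types import SimpleNamespace

    # Hypothetical stand-ins for jobs with attribute-style state points.
    jobs = [
        SimpleNamespace(sp=SimpleNamespace(temperature=t)) for t in (-10, 0, 25, 300)
    ]


    def is_positive(job):
        """Same test as the lambda passed to ``select`` above."""
        return job.sp.temperature > 0


    selected = tuple(filter(is_positive, jobs))
    print([job.sp.temperature for job in selected])
    # [25, 300]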


.. _aggregate_id:

Aggregate ID
============

Similar to the concept of a job id, an aggregate id is a unique hash identifying an aggregate of jobs.
The aggregate id is sensitive to the order of the jobs in the aggregate.


.. note::

    The id of an aggregate containing one job is that job's id.

To distinguish an aggregate id from a job id, the id of an aggregate containing more than one job always has the prefix ``agg-``.

Users can generate the aggregate id of an aggregate using :func:`flow.get_aggregate_id`.
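
The properties described above (single-job passthrough, the ``agg-`` prefix, and order sensitivity) can be illustrated with a toy function. The hashing scheme below, an MD5 digest of the comma-joined job ids, is an assumption chosen for illustration and may differ from what ``flow.get_aggregate_id`` computes; the name ``sketch_aggregate_id`` and the ids are hypothetical.

.. code-block:: python

    from hashlib import md5


    def sketch_aggregate_id(job_ids):
        """Illustrative only: not necessarily signac-flow's hashing scheme."""
        if len(job_ids) == 1:
            return job_ids[0]  # a one-job aggregate reuses the job id
        digest = md5(",".join(job_ids).encode("utf-8")).hexdigest()
        return f"agg-{digest}"


    single = sketch_aggregate_id(["0123abcd"])
    forward = sketch_aggregate_id(["0123abcd", "4567ef01"])
    backward = sketch_aggregate_id(["4567ef01", "0123abcd"])

    print(single)                      # 0123abcd
    print(forward.startswith("agg-"))  # True
    print(forward != backward)         # True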

.. tip::

    Users can also pass an aggregate id to the ``--job-id`` command-line flag provided by **signac-flow** in ``run``, ``submit``, and ``exec``.


.. _aggregation_with_flow_groups:

Aggregation with FlowGroups
===========================

To associate an aggregator object with a :py:class:`FlowGroup`, **signac-flow** provides a ``group_aggregator`` parameter in :meth:`~flow.FlowProject.make_group`.
By default, no aggregation takes place for a :py:class:`FlowGroup`.

.. note::

    All the operations in a :py:class:`FlowGroup` will use the same :class:`~flow.aggregator` object provided to the group's ``group_aggregator`` parameter.

.. code-block:: python

    # project.py
    from flow import FlowProject, aggregator

    class Project(FlowProject):
        pass

    group = Project.make_group('agg-group', group_aggregator=aggregator())

    @group
    @aggregator()
    @Project.operation
    def op1(*jobs):
        pass

    @group
    @Project.operation
    def op2(*jobs):
        pass

    if __name__ == '__main__':
        Project().main()

In the above example, when the group ``agg-group`` is executed using ``python project.py run -o agg-group``, all the jobs in the project are passed as positional arguments for both ``op1`` and ``op2``.
If ``op2`` is executed using ``python project.py run -o op2``, only a single job is passed as an argument because no :class:`~flow.aggregator` is associated with the operation function ``op2``.