Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for aggregation #122

Merged
merged 30 commits into from
Jun 26, 2021
Merged

Add documentation for aggregation #122

merged 30 commits into from
Jun 26, 2021

Conversation

kidrahahjo
Copy link
Member

Description

Add documentation for the newly introduced aggregation feature.

Motivation and Context

This should be merged after glotzerlab/signac-flow#464 gets merged into master.

Checklist:

@kidrahahjo kidrahahjo requested review from a team as code owners February 24, 2021 18:24
@kidrahahjo kidrahahjo self-assigned this Feb 24, 2021
@kidrahahjo kidrahahjo requested review from csadorf and Charlottez112 and removed request for a team February 24, 2021 18:24
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long for this. Here is my first pass!

docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
.. note::

In case the number of jobs in the project is odd, there will be one aggregate containing only a single
job and hence users should be careful while defining the parameters for an *aggregate operation*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could show an example on how to do that, e.g.:

def op4(job1, job2=None):
    pass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since in the above example the parity of total jobs is not defined, I think it'll be more appropriate to include this in the above example itself

.. note::

In case the number of jobs in the project is odd, there will be one aggregate containing only a single
job and hence users should be careful while defining the parameters for an *aggregate operation*.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to be a bit more precise here. What does "careful" mean? This may or may not be a valid use case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
============

Similar to the concept of a job id, an aggregate id is a unique hash identifying an aggregate of jobs.
The aggregate id is sensitive to the order of the jobs in the aggregate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the default order then? Should be documented here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
.. note::

Currently, **signac-flow** only allows single :class:`~flow.aggregator` per group, i.e. all the operations present
in a :py:class:`FlowGroup` will be using a same :class:`~flow.aggregator` object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in a :py:class:`FlowGroup` will be using a same :class:`~flow.aggregator` object.
in a :py:class:`FlowGroup` will be using the same :class:`~flow.aggregator` object.

class Project(FlowProject):
pass

group = Project.make_group('agg-group', aggregator_obj=aggregator())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example would be easier to follow if you provided an implementation for such an aggregator class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csadorf, I am not sure I completely follow what you say. Can you please clarify?

@bdice bdice requested a review from DomFijan May 13, 2021 17:15
@b-butler
Copy link
Member

@kidrahahjo can you still work on this? If not (which is fine) we could assign someone else so that we can prioritize getting this updated on the documentation.

@bdice
Copy link
Member

bdice commented May 21, 2021

I spoke with @DomFijan and he expressed interest in helping as well.

@kidrahahjo
Copy link
Member Author

I'll try to update this PR by addressing @csadorf 's reviews within a few hours. I don't have any problems with anyone else working on this. Please feel free to contribute.

@DomFijan
Copy link

When I was trying out aggregation yesterday and it wasn't immediately obvious to me that functions passed to @Project.post/pre need to now have a different input argument (job vs *jobs). If one wants to use aggregation to "embarrassingly parallelise" certain set of jobs (which I foresee will be a common application of this feature as bundling is often so problematic) the whole "workflow" of the operation one wishes to aggregate needs to be changed. This involves writing filters to filter out jobs which might have already been finished, rewriting pre/post condition functions for aggregated operation as outlined above, and finally splitting the MPI communicator for each job that one wishes to aggregate, and possibly some more...
My question is - should this be metioned somewhere in the documentation? Perhaps not in a form of an example in this place, but in such a capacity that user is aware that some modifications are needed, and simply adding an aggregate decorator will not just work automagically in such cases? Perhaps a followup example somewhere else in the docs might be useful to demonstrate this capability to provide a more well-rounded example of usefulness of this feature.

@Charlottez112
Copy link
Contributor

Thank you @kidrahahjo so much for the documentation! As someone who's not super familiar with aggregation, I found the examples very helpful. I just made a few comments on things that were not immediately clear to me. Could you also resolve the comments/ suggestions you've applied to the doc?

Copy link
Member

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments below. Overall this is great and covers the right topics! Just need to clean it up a bit. I think we should show fewer examples of def my_op(job1, job2) and replace them with def my_op(*jobs). I think that is a better practice to demonstrate for most users' needs.

Also, I have seen several times that newer Python programmers are confused by how *args and **kwargs become a tuple / dictionary inside the function body. I would suggest that some of the function bodies containing pass should be replaced by something acting on the jobs, like print("Number of jobs in aggregate:", len(jobs)), so that users can understand how the *jobs argument gets translated into a tuple.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
if __name__ == '__main__':
Project().main()

If :class:`~flow.aggregator` is used with the default arguments, an aggregate of all the jobs present in the project will be created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If :class:`~flow.aggregator` is used with the default arguments, an aggregate of all the jobs present in the project will be created.
If :class:`~flow.aggregator` is used with the default arguments, it will create a single aggregate containing all the jobs present in the project.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved

@aggregator.groupsof(2, sort_by='temperature', sort_ascending=False)
@Project.operation
def op5(job1, job2):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use *jobs or job2=None to support a final aggregate with one job?
(I also worry that showing examples with job1, job2 will be more confusing than *jobs.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've made our point of using non default arguments carefully, I think we should go with *jobs here.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
@csadorf csadorf removed their request for review June 22, 2021 07:13
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
In case the number of jobs in the project in this example is odd, there will be one aggregate containing only a single job.
In general, the last aggregate from :class:`~flow.aggregator.groupsof` will contain the remaining jobs if the aggregate size does not evenly divide the number of jobs in the project.
If a remainder is expected and valid, users should make sure that the operation function can be called with the reduced number of arguments (e.g. by using ``*jobs`` or providing default arguments as shown above).

Copy link
Member

@bdice bdice Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add more examples for some or all of the following:

  • Group by state point keys: The aggregates are grouped by multiple state point keys.
  • Group by arbitrary key function: The aggregates are grouped by keys determined by a key-function that expects an instance of :class:~.signac.contrib.job.Job and return the grouping key.
  • Using a completely custom aggregator function when even greater flexibility is needed.
  • Using sorting/selection in conjunction with other aggregator parameters.

Copy link
Member

@bdice bdice Jun 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new issue for this. #146.

docs/source/aggregation.rst Outdated Show resolved Hide resolved
docs/source/aggregation.rst Outdated Show resolved Hide resolved
Copy link
Member

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is mostly complete, so I am approving it. All of my remaining suggestions could be addressed in a separate PR if desired.

Copy link
Contributor

@Charlottez112 Charlottez112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referring to Python unpacking is very helpful!

…gates of one job.

Co-authored-by: Carl Simon Adorf <carl.simon.adorf@gmail.com>
@bdice
Copy link
Member

bdice commented Jun 26, 2021

I created #146, #147 to address my remaining comments on this PR. With two approvals, I think it's ready to merge. @csadorf I think we've addressed your comments sufficiently, but feel free to follow up if you wanted to make another round of changes. Thanks for your work on this, @kidrahahjo. 👍

edit: I did some final touches to add this to the table of contents and fix the intersphinx references.

@bdice bdice merged commit 81bfeb3 into master Jun 26, 2021
@bdice bdice deleted the feature/aggregation-docs branch June 26, 2021 19:09
@csadorf
Copy link
Contributor

csadorf commented Jun 26, 2021

Thx a lot @kidrahahjo !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants