WIP: Enable Aggregate Operations #289

Closed

Conversation


@kidrahahjo kidrahahjo commented May 10, 2020

Description

This pull request adds a new feature that allows users to operate on a group of jobs at once.

Motivation and Context

This pull request addresses issue #266.
Users will be able to use this feature as shown below:

def grouper_function(jobs):
    # Returns the jobs grouped in a user-specified manner.
    ...

aggregate_example = aggregate(grouper=grouper_function, sort='b', reverse=True)

@select(filterby=lambda job: job.sp.a > 25)
@aggregate_example
@FlowProject.operation
def example_operation(jobs):
    for job in jobs:
        print(job)
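For illustration, a grouper that pairs consecutive jobs could be sketched as follows (a hypothetical example, not part of this PR's API; plain strings stand in for signac jobs):

```python
from itertools import zip_longest

def pairwise_grouper(jobs):
    """Yield jobs in pairs; the last pair is padded with None if needed."""
    iterators = [iter(jobs)] * 2
    return zip_longest(*iterators, fillvalue=None)

groups = list(pairwise_grouper(["j1", "j2", "j3", "j4", "j5"]))
# groups == [("j1", "j2"), ("j3", "j4"), ("j5", None)]
```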

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change [1]

[1] The change breaks (or has the potential to break) existing functionality.

Work to be done:

  • Support for aggregate and select classes.
  • Introduce functionality where all operations act as aggregate operations.
  • Support for all pre/post conditions.
  • Support for aggregate-groups, where you cannot add aggregate operations having different aggregate parameters in a single group.
  • Enable execution of operations and groups using run, exec.
  • Enable checking of jobs to be executed for any operation using next.
  • Support for FlowCmdOperation.
  • Enable submission of operations.
  • Prevent resubmission.
  • Ensure that jobs are aggregated in the same manner every time we run or submit operations.
  • Enable status printing of operations.
  • Write tests

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@kidrahahjo kidrahahjo changed the title Aggregation in a very simple way WIP: Aggregation - A Rough Sketch May 12, 2020

csadorf commented May 12, 2020

@kidrahahjo Make sure to explicitly ping us when you want feedback/ input on this. Otherwise I'm just going to assume that this is pure drafting/ WIP and ignore related notifications.

@kidrahahjo (Member Author)

> @kidrahahjo Make sure to explicitly ping us when you want feedback/ input on this. Otherwise I'm just going to assume that this is pure drafting/ WIP and ignore related notifications.

@csadorf Yes, I'll make sure to do that.

@mikemhenry mikemhenry added aggregation enhancement New feature or request labels May 18, 2020
@kidrahahjo kidrahahjo changed the title WIP: Aggregation - A Rough Sketch WIP: Enable Aggregate Operations May 27, 2020
@csadorf csadorf added the GSoC Google Summer of Code label May 27, 2020
@kidrahahjo (Member Author)

An implementation that enables the execution of aggregate operations is now in place. The current iteration does not differentiate between normal and aggregate operations: every operation is an aggregate operation. To maintain a consistent interface for users, however, aggregates of length 1 are unpacked so that users can refer to the job's statepoint directly rather than indexing into the aggregate (jobs[0]).


kidrahahjo commented May 28, 2020

A few problems remain in execution.

  • While operations are executed as expected, I am facing an unexpected problem while executing groups.
    If I run python project.py run -o some_operation, where that operation's grouper function is the pairwise function, the code runs as expected. But when I run python project.py run -o group_name, only the first operation of that group runs as expected.
    I traced the problem to the list of jobs (built according to the grouper function) created in the _create_run_job_operations method.
    Generally, the structure for groups of N jobs is [[a1, a2, ..., aN], [b1, b2, ..., bN], ...].
    But when executed for groups, the list created for every operation except the first is wrapped in an extra level: [[[a1, a2, ..., aN], [b1, b2, ..., bN], ...]]. (I am not sure why this extra nesting takes place.)

  • While running jobs in parallel using the Pool, there is a pickling problem for the grouper function along with the operation function.

  • There is also a problem with the groupby method. I'll fix this today.

  • Handling of the fillvalue and default attributes of the aggregate class is not done well.

  • Implementation of the Select class and (key, reverse) sorting for the aggregate class
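To make the extra-nesting symptom from the first bullet concrete, here is a minimal illustration (strings stand in for jobs; the unwrap helper is hypothetical, a workaround rather than a root-cause fix):

```python
# Expected structure for groups of N jobs:
expected = [["a1", "a2"], ["b1", "b2"]]

# Structure observed for every operation except the first in a group run:
nested = [[["a1", "a2"], ["b1", "b2"]]]

def unwrap_extra_nesting(groups):
    """Strip one accidental level of nesting, if present."""
    if len(groups) == 1 and groups[0] and isinstance(groups[0][0], list):
        return groups[0]
    return groups

assert unwrap_extra_nesting(nested) == expected
assert unwrap_extra_nesting(expected) == expected
```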

flow/project.py Outdated
Comment on lines 375 to 378
jobstr = ""
for job in self.jobs:
    jobstr += "{} ".format(job.id)
return "{}({})".format(self.name, jobstr.strip())
Contributor

Suggested change
-jobstr = ""
-for job in self.jobs:
-    jobstr += "{} ".format(job.id)
-return "{}({})".format(self.name, jobstr.strip())
+return "{}({})".format(self.name, ", ".join(map(str, self.jobs)))

Member Author

Oh yes, that's a nice way to do this. Thanks! I'll make the change.

flow/project.py Outdated
type=type(self).__name__,
name=self.name,
-job=str(self.job),
+job=jobstr.strip(),

Member Author

I'll make the change

flow/project.py Outdated
@@ -345,7 +422,7 @@ def set_status(self, value):
def get_status(self):
    "Retrieve the operation's last known status."
    try:
-        return JobStatus(self.job._project.document['_status'][self.id])
+        return JobStatus(self.jobs._project.document['_status'][self.id])
Contributor

How is that supposed to work?

Member Author

This method was not used in the execution part, hence I haven't changed it. Though self.jobs won't work here, since jobs is a list.

Member

You should have it work with at least one job in this PR.

"Return an id, which identifies this group with respect to this job."
-project = job._project
+project = jobs[0]._project
Contributor

We can't just ignore the other jobs here, can we?

Member Author

We're assuming that the jobs are from the same project. When we're done executing this, we filter the JobOperation instance and there we check whether every job is in the same project. Though it seems that I should also test here whether the jobs are in the same project.

Contributor

Might be good to make that assumption explicit with an assert. This could be a significant performance drag, but now is not the time for optimization, now is the time for code safety.

Member Author

That's true, I'll insert an assertion.

Member

Eventually, I think this can be covered by tests, but for now safety is the right approach.

Contributor

Does this mean that if two aggregates have the same first job in their lists, their ids would be identical?

Member Author

@atravitz
This can be a problem if we run something like python project.py run -o op_group_of_2 -j J1 J1 J1 J1.
This behaviour seems fair to me because even if we generate identical ids, we'll just overwrite the id in the project document.
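One way to avoid such collisions would be to derive the id from the operation name and every job id in the aggregate, for example (a hypothetical sketch, not the id scheme actually used in flow):

```python
import hashlib

def aggregate_id(operation_name, job_ids):
    """Derive a stable id from the operation name and *all* job ids."""
    digest = hashlib.sha1()
    digest.update(operation_name.encode())
    for job_id in job_ids:
        digest.update(job_id.encode())
    return digest.hexdigest()

id_a = aggregate_id("op", ["j1", "j2"])
id_b = aggregate_id("op", ["j1", "j3"])
# Same first job, different aggregates, distinct ids: id_a != id_b
```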

flow/project.py Outdated
for job in jobs:
    eligible = False
    for j in job:
        if not type(j) is signac.contrib.job.Job:
Contributor

Why could it be any other type?

Member Author

I made jobs a nested list.
If we want to create groups of 2 using the grouper function, we will eventually iterate over something like this: [[Job1, Job2], [Job3, Job4], [Job5, Job6], ...]
Hence every element must be of type signac.contrib.job.Job.

Contributor

If it must be of that type, no need to check.

Member Author

@csadorf I forgot to mention that this also helps when we handle None values inside a group.
For instance, [[j1, j2, j3], [j4, j5, j6], ..., [j10, None, None]]

Contributor

Then it is better to explicitly check with is None.

flow/project.py Outdated
@@ -2138,11 +2258,11 @@ class _PickleError(Exception):

@staticmethod
def _dumps_op(op):
-    return (op.id, op.name, op.job._id, op.cmd, op.directives)
+    return (op.id, op.name, op.jobs, op.cmd, op.directives)
Contributor

We're dumping ids here to avoid serializing the whole job instance, we should do the same for multiple jobs.

Member Author

Oh okay, I have a question though. Is this the reason I was facing a pickling error?

Contributor

Probably.

Member Author

This worked, thank you!
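The approach can be sketched with stand-in objects (not flow's actual classes): serialize job ids rather than Job instances, since ids are plain strings and always picklable.

```python
from types import SimpleNamespace

class _FakeJob:
    """Stand-in for a signac job exposing _id (for illustration only)."""
    def __init__(self, job_id):
        self._id = job_id

def dumps_op(op):
    """Reduce an operation to picklable primitives: job ids, not Job objects."""
    return (op.id, op.name, tuple(job._id for job in op.jobs), op.cmd)

op = SimpleNamespace(id="abc", name="example_op",
                     jobs=[_FakeJob("j1"), _FakeJob("j2")], cmd="python run.py")
payload = dumps_op(op)
# payload == ("abc", "example_op", ("j1", "j2"), "python run.py")
```

On the loading side, the jobs would be reopened from their ids via the project, which avoids shipping whole Job instances to worker processes.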

flow/project.py Outdated
Comment on lines 2446 to 2447
job_in_project = False
break
Contributor

If this constitutes a "fatal" (read unrecoverable) condition, then we should raise an exception right here instead of just breaking and ignoring the rest of the jobs.

Member Author

I'll raise an error here.

flow/project.py Outdated
@@ -299,7 +303,7 @@ def keyfunction(job):

def grouper(jobs):
    for key, group in groupby(sorted(jobs, key=keyfunction), key=keyfunction):
-        yield group
+        yield list(group)
Contributor

I think it would be much better to not rely on the fact that this is a list. Users might yield iterables as well. You can always create a list where you call the generator function.

Member Author

That's valid, I'll make this compatible.

Contributor

Also worth noting that you should only create a list, when you really need a list. There could be literally millions of jobs in that list which would take up a significant amount of memory.

Member Author

This means my code also needs optimization, because there are places where I created a list where it's not even necessary.

Contributor

Ok! Just be conscious of that, no need to optimize the code right now.
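For reference, the sharing pitfall behind the list(group) discussion can be seen in a small itertools.groupby example (plain strings stand in for jobs):

```python
from itertools import groupby

data = ["a1", "a2", "b1", "b2", "b3"]  # already sorted by key
key = lambda s: s[0]

# Materializing each group inside the loop works:
eager = [(k, list(g)) for k, g in groupby(data, key=key)]
# eager == [("a", ["a1", "a2"]), ("b", ["b1", "b2", "b3"])]

# Collecting the group iterators first and consuming them later does not:
# groupby shares one underlying iterator, so stale groups come back empty.
stale = [g for _, g in groupby(data, key=key)]
leftover = [list(g) for g in stale]
# leftover == [[], []]
```

This is why yielding list(group) (or materializing where the generator is called) is necessary, while lists elsewhere can stay lazy to save memory.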

flow/project.py Outdated
self.grouper = grouper

@classmethod
def groupsof(cls, num=1, fillvalue=None):
Member Author

How should the fillvalue attribute be treated by users? What values should it take? Should we check pre/post conditions for that particular value?

Member Author

I think it's best to skip these.
For instance, if I want a group of 3 and I get [joba, None, None], then that was probably intended. Hence we shouldn't interfere much when a user asks for user-specified arguments.
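For reference, the groupsof padding plus skip-the-fill behavior discussed here can be sketched with zip_longest (hypothetical helper names; strings stand in for jobs):

```python
from itertools import zip_longest

def groupsof(jobs, num=1, fillvalue=None):
    """Group jobs into chunks of `num`, padding the last chunk with fillvalue."""
    iterators = [iter(jobs)] * num
    return zip_longest(*iterators, fillvalue=fillvalue)

def strip_padding(group, fillvalue=None):
    """Drop the padding so conditions only ever see real jobs."""
    return [job for job in group if job is not fillvalue]

groups = [strip_padding(g) for g in groupsof(["j1", "j2", "j3", "j4", "j5"], num=3)]
# groups == [["j1", "j2", "j3"], ["j4", "j5"]]
```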

flow/project.py Outdated
Comment on lines 41 to 44
import six
from six.moves import zip_longest
if six.PY2:
    from collections import Iterable
Member

We no longer support Python 2, so six should be unnecessary.

flow/project.py Outdated
except KeyError:
    raise KeyError("The key '{}' was not found in statepoint "
                   "parameters of the job {}.".format(sort, job))
self.grouper = grouper
Member

grouper should be a private member.

flow/project.py Outdated
@@ -3401,7 +3887,8 @@ def _register_groups(self):
operation_directives = getattr(func, '_flow_group_operation_directives', dict())
for group_name in func._flow_groups:
self._groups[group_name].add_operation(
-    op_name, op, operation_directives.get(group_name, None))
+    op_name, op, operation_directives.get(group_name, None),
+    flow_aggregate[op_name], flow_select[op_name])
Member

See comment in FlowGroup.add_operation. I don't see a need for flow_aggregate or flow_select being parameters.

flow/project.py Outdated
Comment on lines 3865 to 3873
flow_aggregate = dict()
flow_select = dict()
for op_name, op in self._operations.items():
    try:
        flow_aggregate[op_name] = self._operation_functions[op_name]._flow_aggregate
        flow_select[op_name] = self._operation_functions[op_name]._flow_select
    except KeyError:
        flow_aggregate[op_name] = op.cmd._flow_aggregate
        flow_select[op_name] = op.cmd._flow_select
Member

This code likely isn't necessary. This can be done in the loop below that uses (op_name, op) and the func variable.

@@ -3401,7 +3887,8 @@ def _register_groups(self):
operation_directives = getattr(func, '_flow_group_operation_directives', dict())
for group_name in func._flow_groups:
self._groups[group_name].add_operation(
-    op_name, op, operation_directives.get(group_name, None))
+    op_name, op, operation_directives.get(group_name, None),
+    flow_aggregate[op_name], flow_select[op_name])

# For singleton groups add directives
self._groups[op_name].operation_directives[op_name] = getattr(func,
Member

We need to set the singleton groups aggregation and selection here.

Member Author

I'm not sure what you mean by this. Can you please explain?

flow/project.py Outdated
Comment on lines 3907 to 3908
def _eligible_for_submission(self, flow_group, jobs):
"""Determine if a flow_group is eligible for submission with the given jobs.
Member

We will need to consider that the ordering of jobs can change.

flow/project.py Outdated
Comment on lines 3983 to 3986
print("Eligible aggregates: ", end=" ")
for job in op.jobs:
    print(job, end=" ")
print()
b-butler (Member) commented Jun 26, 2020

Suggested change
-print("Eligible aggregates: ", end=" ")
-for job in op.jobs:
-    print(job, end=" ")
-print()
+print("Eligible aggregates:", *op.jobs)

flow/project.py Outdated
-def operation_function(job):
-    cmd = operation(job).format(job=job)
+def operation_function(*jobs):
+    cmd = operation(jobs).format(jobs=jobs)
Member

There is no need for the format here, it exists in FlowCmdOperation.

flow/project.py Outdated
Comment on lines 4075 to 4080
try:
    filter = operation_function._flow_select
    grouper, sort = operation_function._flow_aggregate
except AttributeError:
    filter = operation.cmd._flow_select
    grouper, sort = operation.cmd._flow_aggregate
Member

This seems to be checking indirectly the operation class of the given operation. We should just check that directly then. Also, I think we want to allow for the group to determine the aggregation and selection behavior. This does not allow for that since we cannot override that in exec.

flow/project.py Outdated
Comment on lines 4082 to 4093
jobs = list(jobs)
jobs_list = filter(jobs)
if sort is not None:
    jobs_list = sort(jobs_list)
jobs_list = grouper([job for job in jobs_list])

for job_list in jobs_list:
    job_list = list(job_list)
    for i, job in enumerate(job_list):
        if job is None:
            del job_list[i:]
            break
Member

This code is repeated frequently. We should combine this logic into a function or class.

Member Author

I'll merge this concept with the aggregate class.
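The consolidated pipeline could live in a single helper, roughly as follows (a sketch of the consolidation being suggested, not flow's final API; integers stand in for jobs):

```python
def make_aggregates(jobs, select=None, sort_key=None, reverse=False, grouper=None):
    """Apply select -> sort -> group -> strip-None in one place."""
    jobs = list(jobs)
    if select is not None:
        jobs = [job for job in jobs if select(job)]
    if sort_key is not None:
        jobs = sorted(jobs, key=sort_key, reverse=reverse)
    groups = grouper(jobs) if grouper is not None else [jobs]
    for group in groups:
        # Drop fillvalue padding so conditions only see real jobs.
        aggregate = [job for job in group if job is not None]
        if aggregate:
            yield aggregate

result = list(make_aggregates(
    [5, 2, 9, 1],
    select=lambda j: j > 1,
    sort_key=lambda j: j,
    grouper=lambda jobs: [jobs[i:i + 2] for i in range(0, len(jobs), 2)],
))
# result == [[2, 5], [9]]
```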

b-butler (Member) left a comment

I left more comments on changes I think would improve the code. Some of the errors, I believe I found, would be caught by some simple tests you could write. Writing tests will help you to figure out if your assumptions about the code are true (which is really helpful for developing code).

Some high level things,

  • I think we should have an Aggregate class that handles the logic for taking a list of jobs and spitting back out individual aggregates. This prevents the logic from being in multiple places, and having tuples where each element represents only part of the aggregation logic.
  • We need to find some consistency in the naming jobs v. job_list v. jobs_list. The ambiguity here makes the code hard to parse at times.
  • We need to decide on how to deal with currently user facing functions that accept lists of jobs versus those that pass in individual positional arguments (i.e. jobs v. *jobs). We are not consistent with this currently, and I think this will lead to a lot of user confusion.
  • I know there is a separate branch for submitting, but this is definitively still broken in this branch (this is more of a note).

@kidrahahjo kidrahahjo mentioned this pull request Jul 16, 2020
12 tasks
@kidrahahjo kidrahahjo mentioned this pull request Aug 4, 2020
12 tasks
@kidrahahjo (Member Author)

This pull request has served its purpose of guiding me through my project, as #336 is now ready for review.
@b-butler @csadorf @atravitz @bdice I'm now closing this pull request.
After #336 gets merged into feature/enable-aggregation our final step will be to merge feature/enable-aggregation into master after resolving merge conflicts.

@kidrahahjo kidrahahjo closed this Aug 14, 2020