WIP: Enable Aggregate Operations #289

Closed

Conversation


@kidrahahjo kidrahahjo commented May 10, 2020

Description

This pull request adds a new feature that allows users to operate on a group of jobs at once.

Motivation and Context

This pull request addresses issue #266.
Users will be able to use this feature as shown below:

def grouper_function(jobs):
    # Returns the jobs grouped in a user-specified manner.
    ...

aggregate_example = aggregate(grouper=grouper_function, sort='b', reverse=True)

@select(filterby=lambda job: job.sp.a > 25)
@aggregate_example
@FlowProject.operation
def example_operation(jobs):
    for job in jobs:
        print(job)
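For illustration, a grouper that pairs consecutive jobs could be sketched as follows (a hypothetical example, not part of this PR's API; plain strings stand in for signac jobs):

```python
from itertools import zip_longest

def pairwise_grouper(jobs):
    """Yield jobs in pairs; the last pair is padded with None if needed."""
    iterators = [iter(jobs)] * 2
    return zip_longest(*iterators, fillvalue=None)

groups = list(pairwise_grouper(["j1", "j2", "j3", "j4", "j5"]))
# groups == [("j1", "j2"), ("j3", "j4"), ("j5", None)]
```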

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change [1]

[1] The change breaks (or has the potential to break) existing functionality.

Work to be done:

  • Support for aggregate and select classes.
  • Introduce functionality where all operations act as aggregate operations.
  • Support for all pre/post conditions.
  • Support for aggregate-groups, where you cannot add aggregate operations having different aggregate parameters in a single group.
  • Enable execution of operations and groups using run, exec.
  • Enable checking of jobs to be executed for any operation using next.
  • Support for FlowCmdOperation.
  • Enable submission of operations.
  • Prevent resubmission.
  • Ensure that jobs are aggregated in the same manner every time we run or submit operations.
  • Enable status printing of operations.
  • Write tests

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@kidrahahjo kidrahahjo changed the title Aggregation in a very simple way WIP: Aggregation - A Rough Sketch May 12, 2020

csadorf commented May 12, 2020

@kidrahahjo Make sure to explicitly ping us when you want feedback/ input on this. Otherwise I'm just going to assume that this is pure drafting/ WIP and ignore related notifications.

@kidrahahjo (Member Author)

> @kidrahahjo Make sure to explicitly ping us when you want feedback/ input on this. Otherwise I'm just going to assume that this is pure drafting/ WIP and ignore related notifications.

@csadorf Yes, I'll make sure to do that.

@mikemhenry mikemhenry added aggregation enhancement New feature or request labels May 18, 2020
@kidrahahjo kidrahahjo changed the title WIP: Aggregation - A Rough Sketch WIP: Enable Aggregate Operations May 27, 2020
@csadorf csadorf added the GSoC Google Summer of Code label May 27, 2020
@kidrahahjo (Member Author)

An implementation that enables the execution of aggregate operations is now in place. The current iteration does not differentiate between normal and aggregate operations: every operation is an aggregate operation. To maintain a consistent interface for users, however, aggregates of length 1 are unpacked so that users can refer to the job's statepoint directly rather than indexing into the aggregate (jobs[0]).


kidrahahjo commented May 28, 2020

A few problems remain in execution.

  • While operations are executed as expected, I am facing an unexpected problem while executing groups.
    If I run python project.py run -o some_operation, where that operation's grouper function is the pairwise function, the code runs as expected. But when I run python project.py run -o group_name, only the first operation of that group runs as expected.
    I traced the problem to the list of jobs (built according to the grouper function) created in the _create_run_job_operations method.
    Generally, the structure for groups of N jobs is [[a1, a2, ..., aN], [b1, b2, ..., bN], ...].
    But when executed for groups, the list created for every operation except the first is wrapped in an extra level: [[[a1, a2, ..., aN], [b1, b2, ..., bN], ...]]. (I am not sure why this extra nesting takes place.)

  • While running jobs in parallel using the Pool, there is a pickling problem for the grouper function along with the operation function.

  • There is also a problem with the groupby method. I'll fix this today.

  • Handling of the fillvalue and default attributes of the aggregate class is not done well.

  • Implementation of the Select class and (key, reverse) sorting for the aggregate class
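To make the extra-nesting symptom from the first bullet concrete, here is a minimal illustration (strings stand in for jobs; the unwrap helper is hypothetical, a workaround rather than a root-cause fix):

```python
# Expected structure for groups of N jobs:
expected = [["a1", "a2"], ["b1", "b2"]]

# Structure observed for every operation except the first in a group run:
nested = [[["a1", "a2"], ["b1", "b2"]]]

def unwrap_extra_nesting(groups):
    """Strip one accidental level of nesting, if present."""
    if len(groups) == 1 and groups[0] and isinstance(groups[0][0], list):
        return groups[0]
    return groups

assert unwrap_extra_nesting(nested) == expected
assert unwrap_extra_nesting(expected) == expected
```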

flow/project.py Outdated
Comment on lines 375 to 378
jobstr = ""
for job in self.jobs:
    jobstr += "{} ".format(job.id)
return "{}({})".format(self.name, jobstr.strip())
Contributor

Suggested change
-jobstr = ""
-for job in self.jobs:
-    jobstr += "{} ".format(job.id)
-return "{}({})".format(self.name, jobstr.strip())
+return "{}({})".format(self.name, ", ".join(map(str, self.jobs)))

Member Author

Oh yes, that's a nice way to do this. Thanks! I'll make the change.

flow/project.py Outdated
type=type(self).__name__,
name=self.name,
-job=str(self.job),
+job=jobstr.strip(),

Member Author

I'll make the change

flow/project.py Outdated
@@ -345,7 +422,7 @@ def set_status(self, value):
def get_status(self):
    "Retrieve the operation's last known status."
    try:
-        return JobStatus(self.job._project.document['_status'][self.id])
+        return JobStatus(self.jobs._project.document['_status'][self.id])
Contributor

How is that supposed to work?

Member Author

This method was not used in the execution part, hence I haven't changed it. Though self.jobs won't work here, since jobs is a list.

Member

You should have it work with at least one job in this PR.

"Return an id, which identifies this group with respect to this job."
-project = job._project
+project = jobs[0]._project
Contributor

We can't just ignore the other jobs here, can we?

Member Author

We're assuming that the jobs are from the same project. When we're done executing this, we filter the JobOperation instance and there we check whether every job is in the same project. Though it seems that I should also test here whether the jobs are in the same project.

Contributor

Might be good to make that assumption explicit with an assert. This could be a significant performance drag, but now is not the time for optimization, now is the time for code safety.

Member Author

That's true, I'll insert an assertion.

Member

Eventually, I think this can be covered by tests, but for now safety is the right approach.

Contributor

Does this mean that if two aggregates have the same first job in their lists, their ids would be identical?

Member Author

@atravitz
This can be a problem if we run something like python project.py run -o op_group_of_2 -j J1 J1 J1 J1.
This behaviour seems fair to me because even if we generate identical ids, we'll just overwrite the id in the project document.
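One way to avoid such collisions would be to derive the id from the operation name and every job id in the aggregate, for example (a hypothetical sketch, not the id scheme actually used in flow):

```python
import hashlib

def aggregate_id(operation_name, job_ids):
    """Derive a stable id from the operation name and *all* job ids."""
    digest = hashlib.sha1()
    digest.update(operation_name.encode())
    for job_id in job_ids:
        digest.update(job_id.encode())
    return digest.hexdigest()

id_a = aggregate_id("op", ["j1", "j2"])
id_b = aggregate_id("op", ["j1", "j3"])
# Same first job, different aggregates, distinct ids: id_a != id_b
```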

flow/project.py Outdated
for job in jobs:
    eligible = False
    for j in job:
        if not type(j) is signac.contrib.job.Job:
Contributor

Why could it be any other type?

Member Author

I made jobs a nested list.
If we want to create groups of 2 using the grouper function, we will eventually iterate over something like this: [[Job1, Job2], [Job3, Job4], [Job5, Job6], ...]
Hence every element must be of type signac.contrib.job.Job.

Contributor

If it must be of that type, no need to check.

Member Author

@csadorf I forgot to mention that this also helps when we handle None values inside a group.
For instance, [[j1, j2, j3], [j4, j5, j6], ..., [j10, None, None]]

Contributor

Then it is better to explicitly check with is None.

flow/project.py Outdated
@@ -2138,11 +2258,11 @@ class _PickleError(Exception):

@staticmethod
def _dumps_op(op):
-    return (op.id, op.name, op.job._id, op.cmd, op.directives)
+    return (op.id, op.name, op.jobs, op.cmd, op.directives)
Contributor

We're dumping ids here to avoid serializing the whole job instance, we should do the same for multiple jobs.

Member Author

Oh okay, I have a question though. Is this the reason I was facing a pickling error?

Contributor

Probably.

Member Author

This worked, thank you!
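The approach can be sketched with stand-in objects (not flow's actual classes): serialize job ids rather than Job instances, since ids are plain strings and always picklable.

```python
from types import SimpleNamespace

class _FakeJob:
    """Stand-in for a signac job exposing _id (for illustration only)."""
    def __init__(self, job_id):
        self._id = job_id

def dumps_op(op):
    """Reduce an operation to picklable primitives: job ids, not Job objects."""
    return (op.id, op.name, tuple(job._id for job in op.jobs), op.cmd)

op = SimpleNamespace(id="abc", name="example_op",
                     jobs=[_FakeJob("j1"), _FakeJob("j2")], cmd="python run.py")
payload = dumps_op(op)
# payload == ("abc", "example_op", ("j1", "j2"), "python run.py")
```

On the loading side, the jobs would be reopened from their ids via the project, which avoids shipping whole Job instances to worker processes.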

flow/project.py Outdated
Comment on lines 2446 to 2447
job_in_project = False
break
Contributor

If this constitutes a "fatal" (read unrecoverable) condition, then we should raise an exception right here instead of just breaking and ignoring the rest of the jobs.

Member Author

I'll raise an error here.

flow/project.py Outdated
@@ -299,7 +303,7 @@ def keyfunction(job):

def grouper(jobs):
    for key, group in groupby(sorted(jobs, key=keyfunction), key=keyfunction):
-        yield group
+        yield list(group)
Contributor

I think it would be much better to not rely on the fact that this is a list. Users might yield iterables as well. You can always create a list where you call the generator function.

Member Author

That's valid, I'll make this compatible.

Contributor

Also worth noting that you should only create a list, when you really need a list. There could be literally millions of jobs in that list which would take up a significant amount of memory.

Member Author

This means my code also needs optimization, because there are places where I created a list where it's not even necessary.

Contributor

Ok! Just be conscious of that, no need to optimize the code right now.
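For reference, the sharing pitfall behind the list(group) discussion can be seen in a small itertools.groupby example (plain strings stand in for jobs):

```python
from itertools import groupby

data = ["a1", "a2", "b1", "b2", "b3"]  # already sorted by key
key = lambda s: s[0]

# Materializing each group inside the loop works:
eager = [(k, list(g)) for k, g in groupby(data, key=key)]
# eager == [("a", ["a1", "a2"]), ("b", ["b1", "b2", "b3"])]

# Collecting the group iterators first and consuming them later does not:
# groupby shares one underlying iterator, so stale groups come back empty.
stale = [g for _, g in groupby(data, key=key)]
leftover = [list(g) for g in stale]
# leftover == [[], []]
```

This is why yielding list(group) (or materializing where the generator is called) is necessary, while lists elsewhere can stay lazy to save memory.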

flow/project.py Outdated
self.grouper = grouper

@classmethod
def groupsof(cls, num=1, fillvalue=None):
Member Author

How should the fillvalue attribute be treated by users? What values should it take? Should we check pre/post conditions for that particular value?

Member Author

I think it's best to skip these.
For instance, if I want a group of 3 and I get [joba, None, None], then that was probably intended. Hence we shouldn't interfere much when a user asks for user-specified arguments.
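For reference, the groupsof padding plus skip-the-fill behavior discussed here can be sketched with zip_longest (hypothetical helper names; strings stand in for jobs):

```python
from itertools import zip_longest

def groupsof(jobs, num=1, fillvalue=None):
    """Group jobs into chunks of `num`, padding the last chunk with fillvalue."""
    iterators = [iter(jobs)] * num
    return zip_longest(*iterators, fillvalue=fillvalue)

def strip_padding(group, fillvalue=None):
    """Drop the padding so conditions only ever see real jobs."""
    return [job for job in group if job is not fillvalue]

groups = [strip_padding(g) for g in groupsof(["j1", "j2", "j3", "j4", "j5"], num=3)]
# groups == [["j1", "j2", "j3"], ["j4", "j5"]]
```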

flow/project.py Outdated
Comment on lines 41 to 44
import six
from six.moves import zip_longest
if six.PY2:
    from collections import Iterable
Member

We no longer support Python 2, so six should be unnecessary.

flow/project.py Outdated
except KeyError:
    raise KeyError("The key '{}' was not found in statepoint "
                   "parameters of the job {}.".format(sort, job))
self.grouper = grouper
Member

grouper should be a private member.

flow/project.py Outdated
@@ -3401,7 +3887,8 @@ def _register_groups(self):
operation_directives = getattr(func, '_flow_group_operation_directives', dict())
for group_name in func._flow_groups:
self._groups[group_name].add_operation(
-    op_name, op, operation_directives.get(group_name, None))
+    op_name, op, operation_directives.get(group_name, None),
+    flow_aggregate[op_name], flow_select[op_name])
Member

See comment in FlowGroup.add_operation. I don't see a need for flow_aggregate or flow_select being parameters.

flow/project.py Outdated
Comment on lines 3865 to 3873
flow_aggregate = dict()
flow_select = dict()
for op_name, op in self._operations.items():
    try:
        flow_aggregate[op_name] = self._operation_functions[op_name]._flow_aggregate
        flow_select[op_name] = self._operation_functions[op_name]._flow_select
    except KeyError:
        flow_aggregate[op_name] = op.cmd._flow_aggregate
        flow_select[op_name] = op.cmd._flow_select
Member

This code likely isn't necessary. This can be done in the loop below that uses (op_name, op) and the func variable.

@@ -3401,7 +3887,8 @@ def _register_groups(self):
operation_directives = getattr(func, '_flow_group_operation_directives', dict())
for group_name in func._flow_groups:
self._groups[group_name].add_operation(
-    op_name, op, operation_directives.get(group_name, None))
+    op_name, op, operation_directives.get(group_name, None),
+    flow_aggregate[op_name], flow_select[op_name])

# For singleton groups add directives
self._groups[op_name].operation_directives[op_name] = getattr(func,
Member

We need to set the singleton groups aggregation and selection here.

Member Author

I'm not sure what you mean by this. Can you please explain?

flow/project.py Outdated
Comment on lines 3907 to 3908
def _eligible_for_submission(self, flow_group, jobs):
"""Determine if a flow_group is eligible for submission with the given jobs.
Member

We will need to consider that the ordering of jobs can change.

flow/project.py Outdated
Comment on lines 3983 to 3986
print("Eligible aggregates: ", end=" ")
for job in op.jobs:
    print(job, end=" ")
print()
b-butler (Member) commented Jun 26, 2020

Suggested change
-print("Eligible aggregates: ", end=" ")
-for job in op.jobs:
-    print(job, end=" ")
-print()
+print("Eligible aggregates:", *op.jobs)

flow/project.py Outdated
-def operation_function(job):
-    cmd = operation(job).format(job=job)
+def operation_function(*jobs):
+    cmd = operation(jobs).format(jobs=jobs)
Member

There is no need for the format here, it exists in FlowCmdOperation.

flow/project.py Outdated
Comment on lines 4075 to 4080
try:
    filter = operation_function._flow_select
    grouper, sort = operation_function._flow_aggregate
except AttributeError:
    filter = operation.cmd._flow_select
    grouper, sort = operation.cmd._flow_aggregate
Member

This seems to be checking indirectly the operation class of the given operation. We should just check that directly then. Also, I think we want to allow for the group to determine the aggregation and selection behavior. This does not allow for that since we cannot override that in exec.

flow/project.py Outdated
Comment on lines 4082 to 4093
jobs = list(jobs)
jobs_list = filter(jobs)
if sort is not None:
    jobs_list = sort(jobs_list)
jobs_list = grouper([job for job in jobs_list])

for job_list in jobs_list:
    job_list = list(job_list)
    for i, job in enumerate(job_list):
        if job is None:
            del job_list[i:]
            break
Member

This code is repeated frequently. We should combine this logic into a function or class.

Member Author

I'll merge this concept with the aggregate class.
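The consolidated pipeline could live in a single helper, roughly as follows (a sketch of the consolidation being suggested, not flow's final API; integers stand in for jobs):

```python
def make_aggregates(jobs, select=None, sort_key=None, reverse=False, grouper=None):
    """Apply select -> sort -> group -> strip-None in one place."""
    jobs = list(jobs)
    if select is not None:
        jobs = [job for job in jobs if select(job)]
    if sort_key is not None:
        jobs = sorted(jobs, key=sort_key, reverse=reverse)
    groups = grouper(jobs) if grouper is not None else [jobs]
    for group in groups:
        # Drop fillvalue padding so conditions only see real jobs.
        aggregate = [job for job in group if job is not None]
        if aggregate:
            yield aggregate

result = list(make_aggregates(
    [5, 2, 9, 1],
    select=lambda j: j > 1,
    sort_key=lambda j: j,
    grouper=lambda jobs: [jobs[i:i + 2] for i in range(0, len(jobs), 2)],
))
# result == [[2, 5], [9]]
```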

b-butler (Member) left a comment

I left more comments on changes I think would improve the code. Some of the errors, I believe I found, would be caught by some simple tests you could write. Writing tests will help you to figure out if your assumptions about the code are true (which is really helpful for developing code).

Some high level things,

  • I think we should have an Aggregate class that handles the logic for taking a list of jobs and spitting back out individual aggregates. This prevents the logic from being in multiple places, and having tuples where each element represents only part of the aggregation logic.
  • We need to find some consistency in the naming jobs v. job_list v. jobs_list. The ambiguity here makes the code hard to parse at times.
  • We need to decide on how to deal with currently user facing functions that accept lists of jobs versus those that pass in individual positional arguments (i.e. jobs v. *jobs). We are not consistent with this currently, and I think this will lead to a lot of user confusion.
  • I know there is a separate branch for submitting, but this is definitively still broken in this branch (this is more of a note).

@kidrahahjo kidrahahjo mentioned this pull request Jul 16, 2020
12 tasks
@kidrahahjo kidrahahjo mentioned this pull request Aug 4, 2020
12 tasks
@kidrahahjo (Member Author)

This pull request has served its purpose of guiding me through my project, as #336 is now ready for review.
@b-butler @csadorf @atravitz @bdice I'm now closing this pull request.
After #336 gets merged into feature/enable-aggregation our final step will be to merge feature/enable-aggregation into master after resolving merge conflicts.

@kidrahahjo kidrahahjo closed this Aug 14, 2020