Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

light-weight, param-driven implementation of parallelism using multiproc #99

Merged
merged 22 commits into from Jan 4, 2019
Merged

Conversation

jasondemorrow
Copy link
Contributor

For a version of population unit testing that compares execution time between evaluation with and without multiproc, see here.

@koaning koaning self-assigned this Jan 1, 2019
@koaning
Copy link
Contributor

koaning commented Jan 1, 2019

First of all - happy new year! 🎆

So I've had a brief look and I like the approach. Previously I've tried using multiprocessing from within a Population but this was a bit awkward because it involved making a curried function that could not be pickled. By moving this to the Individual you prevent this and you should keep it easy to perform certain operations (mainly evaluate and mutate, which are the main things to parallelise).

There's also some small things that I like:

  • the VSCode to the .gitignore
  • this approach can easily be moved to the ContestPopulation as well where I can imagine the speedup is really significant

There's a few concerns:

  • Do we need the pathos dependency? it currently needs to be added as a dependency (setup_requires needs to be added in setup.py but if we could live without the dependency that would be preferable.
  • How do we test the parallelism? I'd like to prevent somebody accidentally breaking the performance boost.

@rogiervandergeer thoughts?

@jasondemorrow
Copy link
Contributor Author

The pathos dependency is how I got around the issue of pickling functions. It's basically an exact port of the mp module that uses dill as a drop-in replacement for pickle. It can serialize functions that pickle can't.

@jasondemorrow
Copy link
Contributor Author

jasondemorrow commented Jan 1, 2019

As for testing, I took a stab at unit testing the performance here. This test is far from complete coverage and it adds significant time to the test run, so I ended up taking it out. But there's no reason you couldn't reduce the sleep to something much smaller, then compare the wall-clock time of this run to the test that doesn't use concurrency. I'll try adding this back in a sane way.

@jasondemorrow
Copy link
Contributor Author

Timing added to unit tests. Tried making the latency large enough to maintain consistency on any number of cores while keeping it small enough that the test time isn't dramatically impacted. The times aren't compared at all if running on a single core, even though the functionality is still tested for multi-proc.

@rogiervandergeer
Copy link
Collaborator

Thank you @jasondemorrow for this PR!

We've been meaning to implement something like this for a while, and now you beat us to it! 😊

I agree with @koaning though that we should prefer to do without introducing any new dependencies. I think the fact that we do not have any dependencies so far is one of the strong points of evol. If we were to use standard multiprocessing instead of pathos, the only downside would be that we can't accept any lambda functions. Given that multiprocessing is probably only useful for more complicated problems, and that for complicated problems one is much less likely to use lambda's anyway, would it be a big problem if we were to use standard multiprocessing and apply proper handling of the pickle errors?

@koaning
Copy link
Contributor

koaning commented Jan 2, 2019

@jasondemorrow could you add pathos to the setup.py as a dependency for now? travis is currently showing this error:

Processing /home/travis/build/godatadriven/evol
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-j9hxt9jv-build/setup.py", line 1, in <module>
        import evol
      File "/tmp/pip-j9hxt9jv-build/evol/__init__.py", line 117, in <module>
        from .individual import Individual
      File "/tmp/pip-j9hxt9jv-build/evol/individual.py", line 10, in <module>
        from pathos.multiprocessing import Pool
    ModuleNotFoundError: No module named 'pathos'

I haven't decided fully yet on pathos but in the meantime it would be nice to have travis confirm that the tests indeed are all green :)

@jasondemorrow
Copy link
Contributor Author

Will work on the setup.py tonight; my initial guess (online documentation on the latest format is sparse for some reason) on how to add this seems wrong. As for pathos, it's for more than just lambdas. Pickle can't even serialize instance methods, meaning that my implementation doesn't work at all with the regular multiprocessing/pickle library. The fact that evaluation_function is assigned to a member of Population is what pickle can't handle. Using the python lib would require de-coupling the eval function from any instance. I'm not even immediately sure how you do that, but it would probably break your OO design.

@jasondemorrow
Copy link
Contributor Author

At this point I'm a little stumped as to why the build fails. The package multiprocess is public on PyPi. I notice your build is using an old version of pip; is there any way to upgrade it?

@jasondemorrow
Copy link
Contributor Author

Build passes. I reduced the dependency footprint by using just the multiprocess module of the pathos library, which can be installed as a standalone library.

@koaning
Copy link
Contributor

koaning commented Jan 3, 2019

@rogiervandergeer right now multiprocess is used which should not be confused with multiprocessing. there's still an external dependency in there but it has a much smaller footprint.

i'll admit that i like the implementation, another implementation with multiprocessing might be more of a pain to maintain. it also seems that the pathos is active enough to expect support. I do think that evol needs to have easy parallelism support in there.

If you're still doubting @rogiervandergeer; there might also be some middle ground. I can imagine that advanced users who need the parallelism can install evol via pip install evol[all] which does include the dependency while pip install evol merely contains the base evol. Part of me has mixed feelings about this approach because it would involve splitting up the library, probably by having ParallelPopulation and a ParallelContestPopulation but this would allow us to keep the base package without dependencies.

Copy link
Collaborator

@rogiervandergeer rogiervandergeer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your efforts to make evol better!

evol/population.py Outdated Show resolved Hide resolved
evol/individual.py Outdated Show resolved Hide resolved
.travis.yml Outdated Show resolved Hide resolved
evol/individual.py Outdated Show resolved Hide resolved
evol/population.py Outdated Show resolved Hide resolved
tests/test_individual.py Show resolved Hide resolved
tests/test_population.py Outdated Show resolved Hide resolved
@rogiervandergeer
Copy link
Collaborator

@koaning I also considered making the dependency optional, for example by not listing it as a requirement but only allowing concurrent_workers when multiprocess is installed. You could even use standard multiprocessing when multiprocess isn't available.

In the end I think it is too much of a hassle. In any case I do not think that having multiprocessing functionality available by default is a bad idea. Also multiprocess seems to have a small enough footprint.

@jasondemorrow
Copy link
Contributor Author

The problem is that setup.py imports evol, which in turn transitively imports multiprocess. So running "pip install ." fails because it attempts to import deps before installing them. I guess you guys never ran into this before because you were using only native libraries. The good news is, you're only using the package to get version, which could be defined someplace else. If I were to move that, any suggestions on where it should go? Examples I've seen simply defined directly it in setup.py.

@rogiervandergeer
Copy link
Collaborator

rogiervandergeer commented Jan 3, 2019

That explains. I never liked that construction anyway.

I've fixed that problem in #100. Please rebase, either after we've merged that, or on that branch.

@koaning
Copy link
Contributor

koaning commented Jan 3, 2019

I've just merged #100. Good find that odd bug.

@jasondemorrow
Copy link
Contributor Author

Perfect, thanks!

@koaning
Copy link
Contributor

koaning commented Jan 3, 2019

i have just done a review and i have only one comment.

@rogiervandergeer do we also want to implement parallelism for the ContestPopulation or is that something for another PR?

if we don't mind implementing it in another PR then i am fine with mergeing.

@jasondemorrow
Copy link
Contributor Author

I went ahead and took a stab at adding concurrency to ContestPopulation. I'm not 100% confident in the change, since this code uses a lot of itertools stuff I'm not terribly familiar with.

@koaning
Copy link
Contributor

koaning commented Jan 4, 2019

LGTM. @rogiervandergeer if you don't have any comments i'll gladly merge.

Copy link
Collaborator

@rogiervandergeer rogiervandergeer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! I have a few more small comments but overall I think this is good to merge. What I do think is that we need more testing (e.g. parametrize all tests for Population with one and multiple cpus); but perhaps it is better to do that after we merge -- it doesn't make sense to delay the merge.

@@ -327,6 +340,36 @@ def _update_documented_best(self):
self.documented_best = copy(current_best)


class Contest(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python3 we no longer need to inherit from object; i.e. class Contest: should suffice.

self.eval_function = eval_function
self.pool = process_pool

def post_evaluate(self, scores):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be private, in my opinion.

@koaning
Copy link
Contributor

koaning commented Jan 4, 2019

Just mentioning the issue here: #47.

@rogiervandergeer rogiervandergeer merged commit 4237fe9 into godatadriven:master Jan 4, 2019
@jasondemorrow
Copy link
Contributor Author

Thanks for the prompt and thorough reviews!

@rogiervandergeer
Copy link
Collaborator

rogiervandergeer commented Jan 5, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants