Allow multi-processing of results via a pool parameter #516
Conversation
I'm not fully convinced that we should have this code at Lynx. First, we make a lot of decisions here about data flow control: in this implementation the whole obstable would be pickled and transferred, the whole result dataframe would be pickled and transferred, and all of the results (doubled) would be held in memory at a few points.
I see several other pipeline control decisions that could potentially be made:
- Pre-generate parameters, pre-select obstable values.
- Do not transfer the results back; just dump them to a Parquet file.
- `pool.map` is a common interface, but others exist. A `.submit`/`.result` interface would allow a progress bar!
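As a rough illustration of that last point, the standard `concurrent.futures` API already supports this: submit one future per batch and advance a progress bar as each one completes. A minimal sketch, assuming `tqdm` is installed and using a stand-in `run_batch` worker rather than anything from this PR:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

from tqdm import tqdm


def run_batch(batch):
    """Stand-in for the real per-batch simulation work."""
    return sum(batch)


def run_all(param_batches, max_workers=4):
    """Submit one task per batch and collect results as each future finishes."""
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run_batch, batch) for batch in param_batches]
        # as_completed yields futures out of order, which is what lets the
        # progress bar advance as soon as any worker finishes.
        for future in tqdm(as_completed(futures), total=len(futures)):
            results.append(future.result())
    return results


if __name__ == "__main__":
    print(run_all([[1, 2], [3, 4], [5, 6]]))
```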
I would also love to see batch-size-independent randomness, but I do understand that it could be tricky to achieve.
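For reference, one common way to get batch-size-independent randomness is to derive each object's random stream from its global index rather than from its batch, e.g. with numpy's `SeedSequence`. This is only a sketch of the general technique, not a proposal for how it should be wired into this codebase:

```python
import numpy as np


def rng_for_object(base_seed, object_index):
    """Return a Generator whose stream depends only on the object's global index."""
    seed_seq = np.random.SeedSequence(entropy=base_seed, spawn_key=(object_index,))
    return np.random.default_rng(seed_seq)


# The same object index yields the same draws no matter how the batches are cut.
rng_a = rng_for_object(base_seed=12345, object_index=7)
rng_b = rng_for_object(base_seed=12345, object_index=7)
assert rng_a.normal() == rng_b.normal()
```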
Allow the core simulation function to take an executor parameter and distribute the tasks via that executor. This allows the simulation to be parallelized by a variety of mechanisms including ProcessPoolExecutor, Dask, and Ray.
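Roughly, the pattern is the one sketched below; the names (`simulate`, `simulate_batch`) and the serial fallback are illustrative stand-ins, not the exact API added by this PR:

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import partial
from typing import Optional


def simulate_batch(model, batch):
    """Stand-in for the real per-batch simulation."""
    return [model * x for x in batch]


def simulate(model, batches, executor: Optional[Executor] = None):
    """Run all batches, optionally spreading them across the given executor."""
    if executor is None:
        return [simulate_batch(model, batch) for batch in batches]
    # Any executor exposing a concurrent.futures-style .map() works here:
    # ProcessPoolExecutor, or Dask/Ray shims that implement the same interface.
    return list(executor.map(partial(simulate_batch, model), batches))


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(simulate(2, [[1, 2], [3, 4]], executor=pool))
```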
This PR also provides the option (via an argument) to output the results to a file instead of returning the NestedFrame, so that we can distribute the computation without worrying about all of the results fitting in memory together.
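A minimal sketch of the per-batch file-output idea; the `write_batch` helper, the directory layout, and the Parquet backend (pandas' `to_parquet` needs pyarrow or fastparquet) are assumptions for illustration, not the PR's actual interface:

```python
from pathlib import Path

import pandas as pd


def write_batch(result_frame: pd.DataFrame, out_dir: str, batch_index: int) -> None:
    """Write one batch's results to disk so nothing large is returned to the caller."""
    out_path = Path(out_dir) / f"batch_{batch_index:05d}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    result_frame.to_parquet(out_path)


if __name__ == "__main__":
    write_batch(pd.DataFrame({"flux": [1.0, 2.0]}), "results", batch_index=0)
```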