[MRG] BUG: send length of data on MPI child completion #196
Conversation
Instead of implicitly assuming that all data has been received after the child process terminates, verify that it matches the expected length. This changes the signals between processes to 1) end_of_sim and 2) end_of_data:[#bytes]. Upon completion, verify that the length of the base64 byte string matches this number. It turns out that padding is necessary, so code was added back to add only the minimal amount of padding (e.g. '=').
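The parent-side check described above could be sketched as follows. This is an illustrative reconstruction of the handshake, not the exact hnn_core code; the signal framing and function names are assumptions:

```python
# Hypothetical sketch of the parent-side completion check; the child is
# assumed to send 'end_of_data:[#bytes]' after the base64 payload.
END_OF_DATA_PREFIX = 'end_of_data:'


def expected_length(signal):
    """Parse an 'end_of_data:[#bytes]' signal and return the byte count."""
    if not signal.startswith(END_OF_DATA_PREFIX):
        raise ValueError('not an end_of_data signal: %r' % signal)
    return int(signal[len(END_OF_DATA_PREFIX):])


def verify_received(data, signal):
    """Raise if the received base64 string differs from the announced length."""
    n_expected = expected_length(signal)
    if len(data) != n_expected:
        raise RuntimeError('expected %d bytes, received %d'
                           % (n_expected, len(data)))
```

With this in place, a truncated transfer surfaces as an explicit error instead of a silent decode failure later.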
Codecov Report
@@ Coverage Diff @@
## master #196 +/- ##
==========================================
+ Coverage 67.90% 68.06% +0.15%
==========================================
Files 19 20 +1
Lines 2047 2123 +76
==========================================
+ Hits 1390 1445 +55
- Misses 657 678 +21
Continue to review full report at Codecov.
@blakecaldwell let me know when you are ready to merge
Refactor mpi_child.py into a proper Python class. Also reuse _clone_and_simulate() between backends.
- add skip_MPI_import to MPISimulation
- refactor MPIBackend to separate _process_child_data for testing
Still raise a custom exception for troubleshooting if it does arise
1033662
to
d1372c2
Compare
hnn_core/parallel_backends.py
Outdated
def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
    """Run a simulation including building the network.

    This is used by both backends. MPIBackend calls this in mpi_child.py, once
    for each trial (blocking), and JoblibBackend calls this for each trial
    (non-blocking).
    """

    # avoid relative lookups after being forked (Joblib)
    from hnn_core.network_builder import NetworkBuilder
    from hnn_core.network_builder import _simulate_single_trial

    # XXX this should be built into NetworkBuilder
    # update prng_seedcore params to provide jitter between trials
    for param_key in net.params['prng_*'].keys():
        net.params[param_key] += trial_idx

    neuron_net = NetworkBuilder(net)
    dpl = _simulate_single_trial(neuron_net, trial_idx)

    spikedata = neuron_net.get_data_from_neuron()

    return dpl, spikedata
@jasmainak @rythorpe Note that this code is shared between MPIBackend and JoblibBackend. Should allow for easier modification of how simulations get seeded in the future
This is great @blakecaldwell, thanks!
Where is the prng_seedcore_initial argument used in this function? I'm a little surprised that all the tests pass without _clone_and_simulate() referencing the initial seedcore params when setting new seedcore params for each trial.
Oops! Thanks for the suggestion below.
@jasmainak I'm satisfied with this version. Open to comments (note new commits)
@@ -150,15 +161,17 @@ class MPIBackend(object):

    """
    def __init__(self, n_procs=None, mpi_cmd='mpiexec'):
        self.n_procs = n_procs
        n_logical_cores = multiprocessing.cpu_count()
Might be better to use the psutil module? I've run into issues on computing clusters where multiprocessing.cpu_count() returns all cores on a node, rather than the cores available to the user. This could be something left to the user to define explicitly, but in any case the replacement would be:

    n_logical_cores = len(psutil.Process().cpu_affinity())
Keen observation. Yes, I've run into the same, which is why I use os.sched_getaffinity(0) a few lines below. I think both ways achieve the same end result.
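The distinction discussed in this thread can be illustrated with a small sketch. On a cluster node where the scheduler restricts CPU affinity, cpu_count() reports the whole node while the affinity-based calls report only the cores the process may actually use (note os.sched_getaffinity is Linux-only):

```python
import multiprocessing
import os

# All logical cores on the machine, regardless of affinity restrictions.
n_logical_cores = multiprocessing.cpu_count()

# Cores this process is actually allowed to run on (Linux-specific).
n_available = len(os.sched_getaffinity(0))

# The psutil equivalent (third-party, cross-platform) would be:
#   import psutil
#   n_available = len(psutil.Process().cpu_affinity())

print(n_logical_cores, n_available)
```

On an unrestricted workstation the two numbers match; under a batch scheduler that pins jobs to a subset of cores, n_available is the smaller, correct value to pass to MPI.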
Thanks for reviewing @ntolley!
@jasmainak updated with documentation! resolved comments are addressed in the new commits.
@@ -32,7 +32,7 @@ This backend will use MPI (Message Passing Interface) on the system to split neu

**MacOS Dependencies**::

-    $ conda install yes openmpi mpi4py
+    $ conda install -y openmpi mpi4py
the contributing guide still recommends the pip install. Perhaps we should update it and point here?
I can't really tell why mpi4py is included in "Building the Documentation". It seems like it would fail on macs. Also, it may not even be necessary for just building the docs. I think the plot_simulate_evoked.py example will fall back to the Joblib backend.

I did add a link to this page at the top of the contribution guide.
Use the parameter prng_seedcore_initial
5804e96
to
c40ebcd
Compare
_BACKEND = None


def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
fantastic!
Instead of implicitly assuming that all data has been received after
the child process terminates, verify that it matches the expected
length. This changes the signals between processes to 1) end_of_sim
and 2) end_of_data:[#bytes]. Upon completion, verify that the length
of the base64 byte string matches this number.
Turns out that padding is necessary. Added back code to only add the
minimal amount of padding (e.g. '=').
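The minimal-padding step mentioned in the commit message can be sketched like this; an illustrative reconstruction under the assumption that the received base64 string may arrive without its trailing '=' characters, not the exact hnn_core code:

```python
import base64


def pad_and_decode(b64_bytes):
    """Pad a base64 byte string to a multiple of 4 with the minimal
    number of '=' characters, then decode it.

    -len % 4 yields exactly the number of characters needed to reach
    the next multiple of 4 (0 if already aligned).
    """
    b64_bytes += b'=' * (-len(b64_bytes) % 4)
    return base64.b64decode(b64_bytes)
```

For example, base64.b64encode(b'hello') is b'aGVsbG8=' (8 chars); if the 7-char unpadded form b'aGVsbG8' is received, pad_and_decode adds one '=' and recovers b'hello'.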