
[MRG] BUG: send length of data on MPI child completion #196

Merged
merged 10 commits into jonescompneurolab:master from mpi_data_passing on Oct 21, 2020

Conversation

blakecaldwell
Member

Instead of implicitly assuming that all data has been received after
the child process terminates, verify that it matches the expected
length. This changes the signals between processes to 1) end_of_sim
and 2) end_of_data:[#bytes]. Upon completion, verify that the length
of the base64 byte string matches this number.

It turns out that padding is necessary, so code was added back to append
only the minimal amount of padding (e.g. '=').

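To make the handshake concrete, here is a minimal hypothetical sketch of the idea (the function names are illustrative, not the actual hnn_core implementation): the child reports how many base64 bytes it wrote, and the parent checks that count and restores any stripped '=' padding before decoding.

import base64
import pickle


def _encode_child_data(sim_data):
    """Child side: pickle the results and base64-encode them for transfer."""
    data_bytes = base64.b64encode(pickle.dumps(sim_data))
    # the completion signal carries the expected payload size
    return data_bytes, 'end_of_data:%d' % len(data_bytes)


def _decode_child_data(data_str, expected_len):
    """Parent side: verify the length, restore padding, then unpickle."""
    if len(data_str) != expected_len:
        raise RuntimeError("Expected %d bytes, received %d bytes" %
                           (expected_len, len(data_str)))
    # base64 data must be a multiple of 4 characters long; if the trailing
    # '=' padding was stripped in transit, add back only what is missing
    data_str += '=' * (-len(data_str) % 4)
    return pickle.loads(base64.b64decode(data_str))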
@blakecaldwell blakecaldwell changed the title BUG: send length of data on MPI child completion [WIP] BUG: send length of data on MPI child completion Oct 19, 2020
@codecov-io

codecov-io commented Oct 19, 2020

Codecov Report

Merging #196 into master will increase coverage by 0.15%.
The diff coverage is 39.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #196      +/-   ##
==========================================
+ Coverage   67.90%   68.06%   +0.15%     
==========================================
  Files          19       20       +1     
  Lines        2047     2123      +76     
==========================================
+ Hits         1390     1445      +55     
- Misses        657      678      +21     
Impacted Files Coverage Δ
hnn_core/mpi_child.py 0.00% <0.00%> (ø)
hnn_core/parallel_backends.py 19.33% <26.19%> (+4.27%) ⬆️
hnn_core/tests/test_mpi_child.py 100.00% <100.00%> (ø)
hnn_core/tests/test_parallel_backends.py 96.05% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f12489...c40ebcd.

@jasmainak
Collaborator

@blakecaldwell let me know when you are ready to merge

Blake Caldwell added 5 commits October 19, 2020 13:05
Refactor mpi_child.py into a proper Python class.

Also reuse _clone_and_simulate() between backends.
- add skip_MPI_import to MPISimulation
- refactor MPIBackend to separate _process_child_data for testing
Still raise a custom exception for troubleshooting if it does arise
@blakecaldwell blakecaldwell changed the title [WIP] BUG: send length of data on MPI child completion [MRG] BUG: send length of data on MPI child completion Oct 19, 2020
Comment on lines 20 to 44
def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
    """Run a simulation including building the network

    This is used by both backends. MPIBackend calls this in mpi_child.py, once
    for each trial (blocking), and JoblibBackend calls this for each trial
    (non-blocking)
    """

    # avoid relative lookups after being forked (Joblib)
    from hnn_core.network_builder import NetworkBuilder
    from hnn_core.network_builder import _simulate_single_trial

    # XXX this should be built into NetworkBuilder
    # update prng_seedcore params to provide jitter between trials
    for param_key in net.params['prng_*'].keys():
        net.params[param_key] += trial_idx

    neuron_net = NetworkBuilder(net)
    dpl = _simulate_single_trial(neuron_net, trial_idx)

    spikedata = neuron_net.get_data_from_neuron()

    return dpl, spikedata


Member Author


@jasmainak @rythorpe Note that this code is shared between MPIBackend and JoblibBackend. Should allow for easier modification of how simulations get seeded in the future

Contributor


This is great @blakecaldwell, thanks!

Where is the prng_seedcore_initial argument used in this function? I'm a little surprised that all the tests pass without _clone_and_simulate() referencing the initial seedcore params when setting new seedcore params for each trial.

Member Author


Oops! Thanks for the suggestion below.
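For illustration only, one way prng_seedcore_initial could enter the jitter loop (a hypothetical sketch, not necessarily the fix that was adopted here) is to offset each seed from its initial value instead of incrementing the params in place:

# hypothetical variant of the seeding loop in _clone_and_simulate(); resetting
# from the initial values keeps the seeds reproducible across repeated calls
for param_key, initial_seed in prng_seedcore_initial.items():
    net.params[param_key] = initial_seed + trial_idx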

@blakecaldwell
Member Author

@jasmainak I'm satisfied with this version. Open to comments (note new commits)

hnn_core/mpi_child.py (outdated comment thread, resolved)
@@ -150,15 +161,17 @@ class MPIBackend(object):

    """
    def __init__(self, n_procs=None, mpi_cmd='mpiexec'):
        self.n_procs = n_procs
        n_logical_cores = multiprocessing.cpu_count()
Contributor


Might be better to use psutil module? I've run into issues on computing clusters where multiprocessing.cpu_count() returns all cores on a node, rather than cores available to the user.

This could be something left to the user to define explicitly, but in any case the replacement would be:

 n_logical_cores = len(psutil.Process().cpu_affinity())

Member Author


Keen observation. Yes, I've run into the same, which is why I use os.sched_getaffinity(0) a few lines below. I think both ways achieve the same end result.
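For context, a short sketch of the two approaches mentioned in this thread, with a fallback to the total core count on platforms where affinity queries are unavailable (os.sched_getaffinity() is Linux-only, and psutil's cpu_affinity() is not implemented on macOS):

import multiprocessing
import os

# total logical cores on the node; on a shared cluster this can exceed
# the cores actually allocated to the job
n_logical_cores = multiprocessing.cpu_count()

# cores the scheduler will actually let this process run on
if hasattr(os, 'sched_getaffinity'):           # Linux
    n_available = len(os.sched_getaffinity(0))
else:
    try:
        import psutil                          # Linux, Windows, BSD
        n_available = len(psutil.Process().cpu_affinity())
    except (ImportError, AttributeError):      # e.g. macOS
        n_available = n_logical_cores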

Member Author


Thanks for reviewing @ntolley!

@blakecaldwell
Member Author

@jasmainak updated with documentation! resolved comments are addressed in the new commits.

@@ -32,7 +32,7 @@ This backend will use MPI (Message Passing Interface) on the system to split neu

**MacOS Dependencies**::

-    $ conda install yes openmpi mpi4py
+    $ conda install -y openmpi mpi4py
Collaborator


the contributing guide still recommends the pip install. Perhaps we should update it and point here?

Member Author


I can't really tell why mpi4py is included in "Building the Documentation". It seems like it would fail on macs. Also, it may not even be necessary for just building the docs. I think the plot_simulate_evoked.py example will fall back to the Joblib backend.

I did add a link to this page at the top of the contribution guide.

@blakecaldwell blakecaldwell force-pushed the mpi_data_passing branch 2 times, most recently from 5804e96 to c40ebcd Compare October 20, 2020 23:30

_BACKEND = None


def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
Collaborator


fantastic!

@jasmainak
Collaborator

@ntolley @rythorpe please feel free to merge if you're happy

@rythorpe rythorpe merged commit 737739e into jonescompneurolab:master Oct 21, 2020