[MRG] BUG: send length of data on MPI child completion #196
Conversation
Instead of implicitly assuming that all data has been received after the child process terminates, verify that it matches the expected length. This changes the signals between processes to 1) end_of_sim and 2) end_of_data:[#bytes]. Upon completion, verify that the length of the base64 byte string matches this number. It turns out that padding is necessary, so code was added back to add only the minimal amount of padding (e.g. '=').
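The parent-side check described above could be sketched as follows. This is an illustrative reconstruction of the handshake, not the exact hnn_core code; the signal framing and function names are assumptions:

```python
# Hypothetical sketch of the parent-side completion check; the child is
# assumed to send 'end_of_data:[#bytes]' after the base64 payload.
END_OF_DATA_PREFIX = 'end_of_data:'


def expected_length(signal):
    """Parse an 'end_of_data:[#bytes]' signal and return the byte count."""
    if not signal.startswith(END_OF_DATA_PREFIX):
        raise ValueError('not an end_of_data signal: %r' % signal)
    return int(signal[len(END_OF_DATA_PREFIX):])


def verify_received(data, signal):
    """Raise if the received base64 string differs from the announced length."""
    n_expected = expected_length(signal)
    if len(data) != n_expected:
        raise RuntimeError('expected %d bytes, received %d'
                           % (n_expected, len(data)))
```

With this in place, a truncated transfer surfaces as an explicit error instead of a silent decode failure later.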
Codecov Report
@@ Coverage Diff @@
## master #196 +/- ##
==========================================
+ Coverage 67.90% 68.06% +0.15%
==========================================
Files 19 20 +1
Lines 2047 2123 +76
==========================================
+ Hits 1390 1445 +55
- Misses 657 678 +21
Continue to review full report at Codecov.
@blakecaldwell let me know when you are ready to merge
Refactor mpi_child.py into a proper Python class. Also reuse _clone_and_simulate() between backends.
- add skip_MPI_import to MPISimulation
- refactor MPIBackend to separate _process_child_data for testing
Still raise a custom exception for troubleshooting if it does arise
1033662
to
d1372c2
Compare
hnn_core/parallel_backends.py
Outdated
def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
    """Run a simulation including building the network.

    This is used by both backends. MPIBackend calls this in mpi_child.py, once
    for each trial (blocking), and JoblibBackend calls this for each trial
    (non-blocking).
    """

    # avoid relative lookups after being forked (Joblib)
    from hnn_core.network_builder import NetworkBuilder
    from hnn_core.network_builder import _simulate_single_trial

    # XXX this should be built into NetworkBuilder
    # update prng_seedcore params to provide jitter between trials
    for param_key in net.params['prng_*'].keys():
        net.params[param_key] += trial_idx

    neuron_net = NetworkBuilder(net)
    dpl = _simulate_single_trial(neuron_net, trial_idx)

    spikedata = neuron_net.get_data_from_neuron()

    return dpl, spikedata
@jasmainak @rythorpe Note that this code is shared between MPIBackend and JoblibBackend. Should allow for easier modification of how simulations get seeded in the future
This is great @blakecaldwell, thanks!
Where is the prng_seedcore_initial argument used in this function? I'm a little surprised that all the tests pass without _clone_and_simulate() referencing the initial seedcore params when setting new seedcore params for each trial.
Oops! Thanks for the suggestion below.
@jasmainak I'm satisfied with this version. Open to comments (note new commits)
@@ -150,15 +161,17 @@ class MPIBackend(object):

    """
    def __init__(self, n_procs=None, mpi_cmd='mpiexec'):
        self.n_procs = n_procs
        n_logical_cores = multiprocessing.cpu_count()
Might be better to use the psutil module? I've run into issues on computing clusters where multiprocessing.cpu_count() returns all cores on a node, rather than the cores available to the user. This could be something left to the user to define explicitly, but in any case the replacement would be:

    n_logical_cores = len(psutil.Process().cpu_affinity())
Keen observation. Yes, I've run into the same, which is why I use os.sched_getaffinity(0) a few lines below. I think both ways achieve the same end result.
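The distinction discussed in this thread can be illustrated with a small sketch. On a cluster node where the scheduler restricts CPU affinity, cpu_count() reports the whole node while the affinity-based calls report only the cores the process may actually use (note os.sched_getaffinity is Linux-only):

```python
import multiprocessing
import os

# All logical cores on the machine, regardless of affinity restrictions.
n_logical_cores = multiprocessing.cpu_count()

# Cores this process is actually allowed to run on (Linux-specific).
n_available = len(os.sched_getaffinity(0))

# The psutil equivalent (third-party, cross-platform) would be:
#   import psutil
#   n_available = len(psutil.Process().cpu_affinity())

print(n_logical_cores, n_available)
```

On an unrestricted workstation the two numbers match; under a batch scheduler that pins jobs to a subset of cores, n_available is the smaller, correct value to pass to MPI.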
Thanks for reviewing @ntolley!
@jasmainak updated with documentation! resolved comments are addressed in the new commits.
@@ -32,7 +32,7 @@ This backend will use MPI (Message Passing Interface) on the system to split neu

**MacOS Dependencies**::

-    $ conda install yes openmpi mpi4py
+    $ conda install -y openmpi mpi4py
the contributing guide still recommends the pip install. Perhaps we should update it and point here?
I can't really tell why mpi4py is included in "Building the Documentation". It seems like it would fail on macs. Also, it may not even be necessary for just building the docs. I think the plot_simulate_evoked.py example will fall back to the Joblib backend.

I did add a link to this page at the top of the contribution guide.
Use the parameter prng_seedcore_initial
5804e96
to
c40ebcd
Compare
_BACKEND = None


def _clone_and_simulate(net, trial_idx, prng_seedcore_initial):
fantastic!
Instead of implicitly assuming that all data has been received after
the child process terminates, verify that it matches the expected
length. This changes the signals between processes to 1) end_of_sim
and 2) end_of_data:[#bytes]. Upon completion, verify that the length
of the base64 byte string matches this number.
Turns out that padding is necessary. Added back code to only add the
minimal amount of padding (e.g. '=').
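The minimal-padding step mentioned in the commit message can be sketched like this; an illustrative reconstruction under the assumption that the received base64 string may arrive without its trailing '=' characters, not the exact hnn_core code:

```python
import base64


def pad_and_decode(b64_bytes):
    """Pad a base64 byte string to a multiple of 4 with the minimal
    number of '=' characters, then decode it.

    -len % 4 yields exactly the number of characters needed to reach
    the next multiple of 4 (0 if already aligned).
    """
    b64_bytes += b'=' * (-len(b64_bytes) % 4)
    return base64.b64decode(b64_bytes)
```

For example, base64.b64encode(b'hello') is b'aGVsbG8=' (8 chars); if the 7-char unpadded form b'aGVsbG8' is received, pad_and_decode adds one '=' and recovers b'hello'.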