Fixed desynchronization of MPI processes due to waveform relaxation (issue #600) #607

Merged
merged 11 commits into from Jan 10, 2017

Conversation

@janhahne
Contributor

janhahne commented Dec 27, 2016

This PR fixes issue #600 by

  • adding a function wfr_synchrony that communicates the local values of any_node_uses_wfr and returns a global one
  • removing static from the GapJunctionEvent variable coeff_length_ and communicating it instead (if any MPI process does not contain a neuron that uses wfr the static variable is not set correctly when it is used to read GapJunctionEvents from the receive buffer in deliver_events(). This resulted in a segmentation fault.)

Furthermore this PR

  • removes the possibility for the user to set node_uses_wfr (as this is not required and leads to undefined/wrong results)
  • adds a unittests issue-600.sli

I suggest @heplesser and @jakobj as reviewers.

@heplesser

@janhahne Nice work. I have one comment on a function name and a possible MPI optimisation.

nestkernel/mpi_manager.cpp
@@ -713,6 +713,30 @@ nest::MPIManager::grng_synchrony( unsigned long process_rnd_number )
return true;
}
+// wfr_synchrony: called at the beginning of each simulate
+bool
+nest::MPIManager::wfr_synchrony( bool any_node_uses_wfr )

This comment has been minimized.

@heplesser

heplesser Jan 2, 2017

Contributor

I think we should be more abstract at the level of MPIManage and choose a method name that expresses the logical operation, not the semantic purpose in one special case. If I understand the code and context right, this method takes a single bool, exchanges with all other processes, and returns true if one or more processes provide true. In NumPy-parlance, this is any().

I am wondering about two things:

  1. What would be a good name? I fear any() is a bit too short.
  2. Is there an MPI function that provides this type of aggregation built-in?

I realize that I wrote the similarly problematic grng_synchrony() myself ...

@heplesser

heplesser Jan 2, 2017

Contributor

I think we should be more abstract at the level of MPIManage and choose a method name that expresses the logical operation, not the semantic purpose in one special case. If I understand the code and context right, this method takes a single bool, exchanges with all other processes, and returns true if one or more processes provide true. In NumPy-parlance, this is any().

I am wondering about two things:

  1. What would be a good name? I fear any() is a bit too short.
  2. Is there an MPI function that provides this type of aggregation built-in?

I realize that I wrote the similarly problematic grng_synchrony() myself ...

This comment has been minimized.

@janhahne

janhahne Jan 4, 2017

Contributor

@heplesser You understand it right and I agree. I made an attempt to improve the naming

  1. any is indeed very short. I chose any_true instead. This indicates that this is an operation with boolean parameters.
  2. I took a look at the MPI standard but did not find anything like this.
@janhahne

janhahne Jan 4, 2017

Contributor

@heplesser You understand it right and I agree. I made an attempt to improve the naming

  1. any is indeed very short. I chose any_true instead. This indicates that this is an operation with boolean parameters.
  2. I took a look at the MPI standard but did not find anything like this.
@janhahne

This comment has been minimized.

Show comment
Hide comment
@janhahne

janhahne Jan 4, 2017

Contributor

@heplesser Thank you for the review! I addressed your comments, please have another look.

Contributor

janhahne commented Jan 4, 2017

@heplesser Thank you for the review! I addressed your comments, please have another look.

@heplesser

@janhahne That looks much better now and inspired me to a few more suggestions.

nestkernel/node_manager.h
@@ -205,11 +205,16 @@ class NodeManager : public ManagerInterface
void finalize_nodes();
/**
- *
+ * Returns if any node uses waveform relaxation

This comment has been minimized.

@heplesser

heplesser Jan 4, 2017

Contributor

"if" -> "whether"

@heplesser

heplesser Jan 4, 2017

Contributor

"if" -> "whether"

nestkernel/simulation_manager.cpp
+ // across all MPI processes
+ kernel().node_manager.set_any_node_uses_wfr( kernel().mpi_manager.any_true(
+ kernel().node_manager.any_node_uses_wfr() ) );
+ }

This comment has been minimized.

@heplesser

heplesser Jan 4, 2017

Contributor

This code is less legible than I had hoped. Especially, it is not really clear that we exchange local information to get a global picture. I wonder if it would not be better to replace the entire if block with a single call

kernel().node_manager.check_wfr_use();

and move the actual check to the NodeManager class, replacing set_any_node_uses_wfr() with check_wfr_use(), which would just be

{
  wfr_is_used_ = kernel().mpi_manager.any_true( wfr_is_used_ );
}

I would also change the any_node_uses_wfr variable and method names to wfr_is_used, which I think is a little clearer, and would move the check for multiple processes into the any_true() method.

@heplesser

heplesser Jan 4, 2017

Contributor

This code is less legible than I had hoped. Especially, it is not really clear that we exchange local information to get a global picture. I wonder if it would not be better to replace the entire if block with a single call

kernel().node_manager.check_wfr_use();

and move the actual check to the NodeManager class, replacing set_any_node_uses_wfr() with check_wfr_use(), which would just be

{
  wfr_is_used_ = kernel().mpi_manager.any_true( wfr_is_used_ );
}

I would also change the any_node_uses_wfr variable and method names to wfr_is_used, which I think is a little clearer, and would move the check for multiple processes into the any_true() method.

@janhahne

This comment has been minimized.

Show comment
Hide comment
@janhahne

janhahne Jan 5, 2017

Contributor

@heplesser Applied your suggestions. It indeed improved readability. Again: Please have another look

Contributor

janhahne commented Jan 5, 2017

@heplesser Applied your suggestions. It indeed improved readability. Again: Please have another look

@jakobj

nice work @janhahne. sorry, it took so long, I've now added my comments.

nestkernel/event.h
@@ -1015,6 +1017,7 @@ class GapJunctionEvent : public SecondaryEvent
{
size_t s = number_of_uints_covered< synindex >();
s += number_of_uints_covered< index >();
+ s += number_of_uints_covered< size_t >();

This comment has been minimized.

@jakobj

jakobj Jan 8, 2017

Contributor

doesn't this increases the size of all secondary events by 8byte? did you perform any benchmarks? or is this negligible compared to the coefficient array? why do you need to communicate it for every event?

@jakobj

jakobj Jan 8, 2017

Contributor

doesn't this increases the size of all secondary events by 8byte? did you perform any benchmarks? or is this negligible compared to the coefficient array? why do you need to communicate it for every event?

nestkernel/mpi_manager.cpp
+ if ( get_num_processes() == 1 )
+ return my_bool;
+
+ int my_int = 0;

This comment has been minimized.

@jakobj

jakobj Jan 8, 2017

Contributor

this could be written as

int my_int = my_bool;

using implicit conversion bool -> int

@jakobj

jakobj Jan 8, 2017

Contributor

this could be written as

int my_int = my_bool;

using implicit conversion bool -> int

nestkernel/mpi_manager.cpp
+// any_true: takes a single bool, exchanges with all other processes,
+// and returns "true" if one or more processes provide "true"
+bool
+nest::MPIManager::any_true( bool my_bool )

This comment has been minimized.

@jakobj

jakobj Jan 8, 2017

Contributor

argument could be const

@jakobj

jakobj Jan 8, 2017

Contributor

argument could be const

nestkernel/node_manager.cpp
@@ -843,6 +843,12 @@ NodeManager::finalize_nodes()
}
+void
+NodeManager::check_wfr_use( void )

This comment has been minimized.

@jakobj

jakobj Jan 8, 2017

Contributor

please remove the void

@jakobj

jakobj Jan 8, 2017

Contributor

please remove the void

@janhahne

This comment has been minimized.

Show comment
Hide comment
@janhahne

janhahne Jan 8, 2017

Contributor

@jakobj Thank you for the suggestions. I addressed all of them and also improved the handling of the coeff_lenght_ parameter. Now it isn't communicated anymore. This is possible since the size of the coeff_array can also be computed in the check_wfr_use function. Would you please have another look?

Contributor

janhahne commented Jan 8, 2017

@jakobj Thank you for the suggestions. I addressed all of them and also improved the handling of the coeff_lenght_ parameter. Now it isn't communicated anymore. This is possible since the size of the coeff_array can also be computed in the check_wfr_use function. Would you please have another look?

@jakobj

This comment has been minimized.

Show comment
Hide comment
@jakobj

jakobj Jan 9, 2017

Contributor

very nice! i'm happy. 👍

Contributor

jakobj commented Jan 9, 2017

very nice! i'm happy. 👍

@jakobj

jakobj approved these changes Jan 9, 2017

@heplesser

@janhahne Nice that you now avoid communication, but I think we should protect the static variable a little more, see comment below.

nestkernel/mpi_manager.cpp
- int my_int = 0;
- if ( my_bool )
- my_int = 1;
+ int my_int = my_bool;

This comment has been minimized.

@heplesser

heplesser Jan 9, 2017

Contributor

I think it would be useful to add a comment here, explaining why you need to convert to an int here. Two years from now a reader may not remember that MPI does not support bool. The declaration can also be const.

Then on line 734 below (I cannot add comments there in Github), I think it would be even safer to check all_int[i] != 0) instead of checking for == 1.

@heplesser

heplesser Jan 9, 2017

Contributor

I think it would be useful to add a comment here, explaining why you need to convert to an int here. Two years from now a reader may not remember that MPI does not support bool. The declaration can also be const.

Then on line 734 below (I cannot add comments there in Github), I think it would be even safer to check all_int[i] != 0) instead of checking for == 1.

This comment has been minimized.

@janhahne

janhahne Jan 9, 2017

Contributor

I will add a comment on this and change the check to != 0. This indeed seems safer, although the C++ standard actually does guarantee that a boolean true is converted to 1

@janhahne

janhahne Jan 9, 2017

Contributor

I will add a comment on this and change the check to != 0. This indeed seems safer, although the C++ standard actually does guarantee that a boolean true is converted to 1

nestkernel/event.h
+ static void
+ set_coeff_length( const size_t coeff_length )
+ {
+ coeff_length_ = coeff_length;

This comment has been minimized.

@heplesser

heplesser Jan 9, 2017

Contributor

From the context it seems certain that this will never be called from a thread-parallel context, but if that should happen, we would be in trouble. Can we add an assertion here that we are running single-threaded, or explicitly mark this as critical section?

@heplesser

heplesser Jan 9, 2017

Contributor

From the context it seems certain that this will never be called from a thread-parallel context, but if that should happen, we would be in trouble. Can we add an assertion here that we are running single-threaded, or explicitly mark this as critical section?

This comment has been minimized.

@janhahne

janhahne Jan 9, 2017

Contributor

I gave it another thought:

Something like

static void
set_coeff_length( const size_t coeff_length )
{
  #pragma omp critical
  coeff_length_ = coeff_length;
}

would not solve this problem: There is no parallel region within this function. Also checking the number of threads in this function is no option (this would always be 1).

One just needs to make sure that this function is never called from several threads at the same time. As the check_wfr_use function is in a omp single region this is true here. So from my point of view there is nothing we could do here, do you agree?

@janhahne

janhahne Jan 9, 2017

Contributor

I gave it another thought:

Something like

static void
set_coeff_length( const size_t coeff_length )
{
  #pragma omp critical
  coeff_length_ = coeff_length;
}

would not solve this problem: There is no parallel region within this function. Also checking the number of threads in this function is no option (this would always be 1).

One just needs to make sure that this function is never called from several threads at the same time. As the check_wfr_use function is in a omp single region this is true here. So from my point of view there is nothing we could do here, do you agree?

This comment has been minimized.

@heplesser

heplesser Jan 9, 2017

Contributor

@janhahne I think

#ifdef _OPENMP
  assert( omp_get_num_threads(void) == 1 );
#endif

should make sure this function breaks if it is ever called from a thread-parallel context (see
https://computing.llnl.gov/tutorials/openMP/#OMP_GET_NUM_THREADS).

@heplesser

heplesser Jan 9, 2017

Contributor

@janhahne I think

#ifdef _OPENMP
  assert( omp_get_num_threads(void) == 1 );
#endif

should make sure this function breaks if it is ever called from a thread-parallel context (see
https://computing.llnl.gov/tutorials/openMP/#OMP_GET_NUM_THREADS).

This comment has been minimized.

@janhahne

janhahne Jan 9, 2017

Contributor

You are right! I tested it by calling the function from a thread-parallel region, it works. I also added this assert to the other two functions from GapJunctionEvent that modify static variables

@janhahne

janhahne Jan 9, 2017

Contributor

You are right! I tested it by calling the function from a thread-parallel region, it works. I also added this assert to the other two functions from GapJunctionEvent that modify static variables

@janhahne

This comment has been minimized.

Show comment
Hide comment
@janhahne

janhahne Jan 9, 2017

Contributor

@heplesser Applied your suggestions. Please have a (hopefully) final look.

Contributor

janhahne commented Jan 9, 2017

@heplesser Applied your suggestions. Please have a (hopefully) final look.

@heplesser

This comment has been minimized.

Show comment
Hide comment
@heplesser

heplesser Jan 10, 2017

Contributor

@janhahne Very nice! I just sent you a PR towards you repo with a little tidy-up. It will automatically appear here when you merge it into your repo and then I will approve and merge this PR.

Contributor

heplesser commented Jan 10, 2017

@janhahne Very nice! I just sent you a PR towards you repo with a little tidy-up. It will automatically appear here when you merge it into your repo and then I will approve and merge this PR.

Merge pull request #2 from heplesser/jan_hahne_wfr_mpi_fix
Moved single-thread assertion to method
@janhahne

This comment has been minimized.

Show comment
Hide comment
@janhahne

janhahne Jan 10, 2017

Contributor

@heplesser Thank you!

Contributor

janhahne commented Jan 10, 2017

@heplesser Thank you!

@heplesser heplesser merged commit 28ed7b5 into nest:master Jan 10, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@janhahne janhahne deleted the janhahne:wfr_mpi_fix branch Jan 10, 2017

janhahne added a commit to janhahne/nest-simulator that referenced this pull request May 10, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment