Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed desynchronization of MPI processes due to waveform relaxation (issue #600) #607

Merged
merged 11 commits into from Jan 10, 2017

Conversation

@janhahne
Copy link

janhahne commented Dec 27, 2016

This PR fixes issue #600 by

  • adding a function wfr_synchrony that communicates the local values of any_node_uses_wfr and returns a global one
  • removing static from the GapJunctionEvent variable coeff_length_ and communicating it instead (if any MPI process does not contain a neuron that uses wfr the static variable is not set correctly when it is used to read GapJunctionEvents from the receive buffer in deliver_events(). This resulted in a segmentation fault.)

Furthermore this PR

  • removes the possibility for the user to set node_uses_wfr (as this is not required and leads to undefined/wrong results)
  • adds a unittests issue-600.sli

I suggest @heplesser and @jakobj as reviewers.

janhahne added 2 commits Dec 27, 2016
janhahne
@janhahne janhahne force-pushed the janhahne:wfr_mpi_fix branch from 670baef to fdc4d55 Dec 27, 2016
Copy link
Contributor

heplesser left a comment

@janhahne Nice work. I have one comment on a function name and a possible MPI optimisation.

@@ -713,6 +713,30 @@ nest::MPIManager::grng_synchrony( unsigned long process_rnd_number )
return true;
}

// wfr_synchrony: called at the beginning of each simulate
bool
nest::MPIManager::wfr_synchrony( bool any_node_uses_wfr )

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 2, 2017

Contributor

I think we should be more abstract at the level of MPIManage and choose a method name that expresses the logical operation, not the semantic purpose in one special case. If I understand the code and context right, this method takes a single bool, exchanges with all other processes, and returns true if one or more processes provide true. In NumPy-parlance, this is any().

I am wondering about two things:

  1. What would be a good name? I fear any() is a bit too short.
  2. Is there an MPI function that provides this type of aggregation built-in?

I realize that I wrote the similarly problematic grng_synchrony() myself ...

This comment has been minimized.

Copy link
@janhahne

janhahne Jan 4, 2017

Author

@heplesser You understand it right and I agree. I made an attempt to improve the naming

  1. any is indeed very short. I chose any_true instead. This indicates that this is an operation with boolean parameters.
  2. I took a look at the MPI standard but did not find anything like this.
janhahne
@janhahne
Copy link
Author

janhahne commented Jan 4, 2017

@heplesser Thank you for the review! I addressed your comments, please have another look.

Copy link
Contributor

heplesser left a comment

@janhahne That looks much better now and inspired me to a few more suggestions.

@@ -205,11 +205,16 @@ class NodeManager : public ManagerInterface
void finalize_nodes();

/**
*
* Returns if any node uses waveform relaxation

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 4, 2017

Contributor

"if" -> "whether"

// across all MPI processes
kernel().node_manager.set_any_node_uses_wfr( kernel().mpi_manager.any_true(
kernel().node_manager.any_node_uses_wfr() ) );
}

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 4, 2017

Contributor

This code is less legible than I had hoped. Especially, it is not really clear that we exchange local information to get a global picture. I wonder if it would not be better to replace the entire if block with a single call

kernel().node_manager.check_wfr_use();

and move the actual check to the NodeManager class, replacing set_any_node_uses_wfr() with check_wfr_use(), which would just be

{
  wfr_is_used_ = kernel().mpi_manager.any_true( wfr_is_used_ );
}

I would also change the any_node_uses_wfr variable and method names to wfr_is_used, which I think is a little clearer, and would move the check for multiple processes into the any_true() method.

janhahne
@janhahne
Copy link
Author

janhahne commented Jan 5, 2017

@heplesser Applied your suggestions. It indeed improved readability. Again: Please have another look

Copy link
Contributor

jakobj left a comment

nice work @janhahne. sorry, it took so long, I've now added my comments.

@@ -1015,6 +1017,7 @@ class GapJunctionEvent : public SecondaryEvent
{
size_t s = number_of_uints_covered< synindex >();
s += number_of_uints_covered< index >();
s += number_of_uints_covered< size_t >();

This comment has been minimized.

Copy link
@jakobj

jakobj Jan 8, 2017

Contributor

doesn't this increases the size of all secondary events by 8byte? did you perform any benchmarks? or is this negligible compared to the coefficient array? why do you need to communicate it for every event?

if ( get_num_processes() == 1 )
return my_bool;

int my_int = 0;

This comment has been minimized.

Copy link
@jakobj

jakobj Jan 8, 2017

Contributor

this could be written as

int my_int = my_bool;

using implicit conversion bool -> int

// any_true: takes a single bool, exchanges with all other processes,
// and returns "true" if one or more processes provide "true"
bool
nest::MPIManager::any_true( bool my_bool )

This comment has been minimized.

Copy link
@jakobj

jakobj Jan 8, 2017

Contributor

argument could be const

@@ -843,6 +843,12 @@ NodeManager::finalize_nodes()
}

void
NodeManager::check_wfr_use( void )

This comment has been minimized.

Copy link
@jakobj

jakobj Jan 8, 2017

Contributor

please remove the void

@janhahne
Copy link
Author

janhahne commented Jan 8, 2017

@jakobj Thank you for the suggestions. I addressed all of them and also improved the handling of the coeff_lenght_ parameter. Now it isn't communicated anymore. This is possible since the size of the coeff_array can also be computed in the check_wfr_use function. Would you please have another look?

@jakobj
Copy link
Contributor

jakobj commented Jan 9, 2017

very nice! i'm happy. 👍

@jakobj
jakobj approved these changes Jan 9, 2017
Copy link
Contributor

heplesser left a comment

@janhahne Nice that you now avoid communication, but I think we should protect the static variable a little more, see comment below.

int my_int = 0;
if ( my_bool )
my_int = 1;
int my_int = my_bool;

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 9, 2017

Contributor

I think it would be useful to add a comment here, explaining why you need to convert to an int here. Two years from now a reader may not remember that MPI does not support bool. The declaration can also be const.

Then on line 734 below (I cannot add comments there in Github), I think it would be even safer to check all_int[i] != 0) instead of checking for == 1.

This comment has been minimized.

Copy link
@janhahne

janhahne Jan 9, 2017

Author

I will add a comment on this and change the check to != 0. This indeed seems safer, although the C++ standard actually does guarantee that a boolean true is converted to 1

static void
set_coeff_length( const size_t coeff_length )
{
coeff_length_ = coeff_length;

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 9, 2017

Contributor

From the context it seems certain that this will never be called from a thread-parallel context, but if that should happen, we would be in trouble. Can we add an assertion here that we are running single-threaded, or explicitly mark this as critical section?

This comment has been minimized.

Copy link
@janhahne

janhahne Jan 9, 2017

Author

I gave it another thought:

Something like

static void
set_coeff_length( const size_t coeff_length )
{
  #pragma omp critical
  coeff_length_ = coeff_length;
}

would not solve this problem: There is no parallel region within this function. Also checking the number of threads in this function is no option (this would always be 1).

One just needs to make sure that this function is never called from several threads at the same time. As the check_wfr_use function is in a omp single region this is true here. So from my point of view there is nothing we could do here, do you agree?

This comment has been minimized.

Copy link
@heplesser

heplesser Jan 9, 2017

Contributor

@janhahne I think

#ifdef _OPENMP
  assert( omp_get_num_threads(void) == 1 );
#endif

should make sure this function breaks if it is ever called from a thread-parallel context (see
https://computing.llnl.gov/tutorials/openMP/#OMP_GET_NUM_THREADS).

This comment has been minimized.

Copy link
@janhahne

janhahne Jan 9, 2017

Author

You are right! I tested it by calling the function from a thread-parallel region, it works. I also added this assert to the other two functions from GapJunctionEvent that modify static variables

@janhahne janhahne force-pushed the janhahne:wfr_mpi_fix branch from f262d3d to 8e2a1ab Jan 9, 2017
@janhahne
Copy link
Author

janhahne commented Jan 9, 2017

@heplesser Applied your suggestions. Please have a (hopefully) final look.

@heplesser
Copy link
Contributor

heplesser commented Jan 10, 2017

@janhahne Very nice! I just sent you a PR towards you repo with a little tidy-up. It will automatically appear here when you merge it into your repo and then I will approve and merge this PR.

Moved single-thread assertion to method
@janhahne
Copy link
Author

janhahne commented Jan 10, 2017

@heplesser Thank you!

@heplesser heplesser merged commit 28ed7b5 into nest:master Jan 10, 2017
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@janhahne janhahne deleted the janhahne:wfr_mpi_fix branch Jan 10, 2017
janhahne pushed a commit to janhahne/nest-simulator that referenced this pull request May 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.