music_event_out_proxy broken for local_num_threads > 1 #696

Closed
mhoff opened this Issue Mar 28, 2017 · 13 comments

@mhoff (Contributor) commented Mar 28, 2017

Description of problem:
When using more than one thread, the music_event_out_proxy fails to map its internal event port with the correct width for some specific values of local_num_threads.

Git commit: 58fd190
How reproducible: Always (assuming local_num_threads > 1)
Steps to Reproduce: mpirun -np 2 music run.music (using the attached testcase)
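
The attached testcase itself is not included here. Purely for orientation, a minimal PyNEST sketch of the kind of sender-side script involved might look roughly like the following; all model names, parameters, and the port name are illustrative assumptions, not the contents of the attachment:

# hypothetical send.py -- illustration only, not the attached testcase;
# run.music (not shown) would route this process' 20-channel output port
# to a receiving MUSIC application that prints incoming spikes
import nest

nest.ResetKernel()
nest.SetKernelStatus({'local_num_threads': 2})  # vary this value to reproduce

proxy = nest.Create('music_event_out_proxy', 1, params={'port_name': 'spikes_out'})

pg = nest.Create('poisson_generator', params={'rate': 100.0})
neurons = nest.Create('parrot_neuron', 20)
nest.Connect(pg, neurons)

# map each sending neuron onto its own MUSIC channel
for channel, gid in enumerate(neurons):
    nest.Connect([gid], proxy, 'one_to_one', {'music_channel': channel})

nest.Simulate(100.0)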

Actual results:

Testing `local_num_threads = 1`
Testing `local_num_threads = 2`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 3`
Testing `local_num_threads = 4`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 5`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 6`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 7`
Testing `local_num_threads = 8`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 9`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 10`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 11`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 12`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 13`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 14`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 15`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)
Testing `local_num_threads = 16`
Error in MUSIC library: sender and receiver width mismatch (0 != 20)

Expected results:

[...]
Received spike from neuron 3: 0.001
Received spike from neuron 4: 0.0005
Received spike from neuron 5: 0.0004
Received spike from neuron 6: 0.0003
Received spike from neuron 10: 0.0003
Received spike from neuron 10: 0.001
Received spike from neuron 11: 0.0003
Received spike from neuron 12: 0.0008
Received spike from neuron 15: 0.0005
[...]

Additional info:
MUSIC version: 1.1.15
MPI version: (Open MPI) 1.6.5

  1. This error did not arise with v2.10.0; thread-level parallelism in combination with MUSIC therefore worked correctly before.
  2. By connecting a population larger than the port width to the proxy, we found that the order of elements in index_map_ is apparently subject to a race condition. For local_num_threads == 1 the first offending entry was always width + 1, and for local_num_threads > 1 it could be any index > width (but, interestingly, always the same value for a fixed local_num_threads).

@mhoff mhoff changed the title from music_event_out_proxy broken for local_num_threads > 1 to music_event_out_proxy broken for `local_num_threads > 1` Mar 28, 2017

@mhoff mhoff changed the title from music_event_out_proxy broken for `local_num_threads > 1` to music_event_out_proxy broken for local_num_threads > 1 Mar 28, 2017

@heplesser (Contributor) commented Apr 3, 2017

@jougs Could you take a look at this?

@mhoff (Contributor) commented Apr 3, 2017

@jougs I had another look at the issue and noticed that some settings for local_num_threads do actually work. See the original issue for the updated description.
@kappeld FYI

@heplesser (Contributor) commented Apr 3, 2017

@mdjurfeldt Maybe you could also take a look, although the problem is quite likely on the NEST side, since things worked with 2.10.

@mhoff (Contributor) commented Apr 3, 2017

The testcase now contains a small helper script for testing the parameter range: out_proxy_bug.zip. It would be nice if someone could confirm this behaviour.

I also updated the original report, because I had accidentally used a different version of the testcase for the actual results. The attached testcase is now consistent with the above results.

@mhoff (Contributor) commented Apr 7, 2017

@kappeld and I tracked down the issue to the ConnectionManager:

void
nest::ConnectionManager::connect( index sgid,
  Node* target,
  thread target_thread,
  index syn,
  double d,
  double w )
{
  const thread tid = kernel().vp_manager.get_thread_id();
  Node* source = kernel().node_manager.get_node( sgid, target_thread );
  // [...]
    if ( source->has_proxies() ) // normal neuron->device connection
    {
      connect_( *source, *target, sgid, target_thread, syn, d, w );
    }
    else // create device->device connections on suggested thread of target
    {
      // ##########################################################
      // ######### REMOVE THESE LINES TO FIX* THE ERROR ##########
      target_thread = kernel().vp_manager.vp_to_thread(
        kernel().vp_manager.suggest_vp( target->get_gid() ) );
      // ##########################################################
      if ( target_thread == tid )
      {
        source = kernel().node_manager.get_node( sgid, target_thread );
        target = kernel().node_manager.get_node( target->get_gid(), target_thread );
        connect_( *source, *target, sgid, target_thread, syn, d, w );
      }
    }
  // [...]
}

If the conditions

  • not target->has_proxies() and target->local_receiver()
  • not source->is_proxy() and not source->has_proxies()

are met, the parameter target_thread is overwritten by a suggested thread for the target gid, and the connection only takes place if this thread corresponds to the thread currently executing the function.

As overwriting target_thread seemed unintuitive to us, we tested the system behavior without the override (more precisely, without line 413). Our test case appears to work flawlessly with this change; spikes from all source neurons are received successfully.

(The same code can be found in lines 475 and 543; there we applied the same fix.)

@jougs @heplesser @mdjurfeldt We believe this error affects not only the music_event_out_proxy, but all connections which satisfy the above conditions when multi-threading is used. Also, the error appears to be located entirely on the NEST side, so MUSIC does not seem to be responsible at all.

EDIT: I'd like to point out that removing these lines fixes the testcase, but does not fix the problem in general. In a more complex setup, which is way too big for a testcase, I still receive error messages like

Error in MUSIC library: sender and receiver width mismatch (16 != 20)

which presumably originates from improperly populating index_map_ from several threads.

@heplesser (Contributor) commented Apr 7, 2017

This code was introduced with #221, which fixed #211, more precisely with 73ad9f8. Removing lines 413, 475, and 513 should make regressiontests/issue-211.sli fail.

The background is as follows: Devices, e.g. spike generators and spike detectors, are replicated on all VPs (all threads). So if we run on two threads and create one SG and one SD,

/sg /spike_generator Create def
/sd /spike_detector Create def

we actually get two SGs and two SDs, although they are mostly nicely hidden from the user (but the two SDs will write to two different files). Now, if we did

sg sd Connect

then without precautions we would get not one but two connections: on each thread, the local SG clone connects to the local SD clone.

The code on line 413 etc. avoids this: it first determines on which thread the target would (exclusively) exist if it were a normal neuron. This value becomes the new target_thread. A connection is then only created between the sender clone and the target clone on that thread, so that we get exactly one connection.
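
For intuition, the effect of this guard can be sketched in a few lines of Python (a toy model only; the modulo-based thread suggestion below is an assumed simplification, not NEST's actual suggest_vp()/vp_to_thread() code):

# toy illustration only -- not NEST code
def connect_device_to_device(target_gid, local_num_threads):
    connections = []
    for tid in range(local_num_threads):            # Connect runs on every thread
        suggested = target_gid % local_num_threads  # where the target would live if it were a neuron (assumed round-robin)
        if tid == suggested:                        # guard: only this one thread creates the connection
            connections.append(('sg_clone', 'sd_clone', tid))
    return connections

print(connect_device_to_device(target_gid=7, local_num_threads=4))
# -> [('sg_clone', 'sd_clone', 3)]  -- exactly one connection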

Now, when you connect a generator, such as the poisson_generator in your test case, to the event out proxy, you end up in the device->device branch of connect(). But the event out proxy isn't a normal device; it exists only once per MPI process. Therefore, I think the right fix would be to change line 407

if ( source->has_proxies() or target->one_node_per_process() )
{
  // normal neuron->device or any -> music proxy connection
  connect_( *source, *target, sgid, target_thread, syn, d, w );
}

and correspondingly for the other two cases.

@jougs Comments?

@mhoff @kappeld Would you prepare a PR? It would be great if you could add your test case to the testsuite/musictest directory as regression test issue-696.*.

@mhoff (Contributor) commented Apr 8, 2017

Thank you for the explanation.

[...] Therefore, I think the right fix would be to change line 407

if ( source->has_proxies() or target->one_node_per_process() )
{
  // normal neuron->device or any -> music proxy connection
  connect_( *source, *target, sgid, target_thread, syn, d, w );
}

and correspondingly for the other two cases.

Assuming this is what you meant, I can already say that it results in the same behavior as the previous "fix". The testcase works fine, but my more complex example produces

Error in MUSIC library: sender and receiver width mismatch (16 != 20)

@mhoff @kappeld Would you prepare a PR? It would be great if you could add your test case to the testsuite/musictest directory as regression test issue-696.*.

Certainly. As soon as the problem is fixed I can provide the PR.
Meanwhile I will try to create a simple testcase which reproduces the mismatching widths.

@kappeld (Contributor) commented Apr 10, 2017

Hi, I think I understand now what is going wrong here. The patch that @heplesser proposed almost fixes the problem, but there is one condition under which the function may still exit without creating a connection. In line 463 of connection_manager.cpp, the condition source->get_thread() != target_thread can be true if the source is a normal neuron that happens not to run on the same thread as the target (i.e. the music_event_out_proxy). In that case the function just returns at line 466 and does not create a synapse at all, which explains why we get a seemingly random number of connections, but exactly the same number in every run with the same number of local threads.

It seems to me that a correct fix would be to create the synapse in this branch whenever target->one_node_per_process() == true, regardless of how the source is parametrized. This seems to fix the bug in all our test scenarios (also in the more complex setting). I therefore suggest the following patch:

IGITUGraz@4618fdd

@heplesser and @jougs can you please look at this patch?

Please review this carefully since I don't know all the details here but at least in our case this creates the correct behaviour. Are there other nodes in NEST with has_proxies==False, local_receiver()==True and one_node_per_process==True for which this may break something?

If you agree on these changes we would prepare a pull request with the fix and our test cases added to the test suite as discussed.

@mhoff (Contributor) commented Apr 10, 2017

Meanwhile I will try to create a simple testcase which reproduces the mismatching widths.

The regression test can be found here: IGITUGraz@71c755d
Without the fix, it does produce the mismatching-widths behavior.

@heplesser (Contributor) commented Apr 10, 2017

@mhoff @kappeld Thank you for your detective work! one_node_per_process() == true for all four music_* models and == false for all other models in NEST master.

I think your analysis is essentially correct, but you should place the source->is_proxy() check first:

// make sure source is on this MPI rank
if ( source->is_proxy() )
{
  return;
}

if ( target->one_node_per_process() )
{
  // music proxy connection or similar device with one node per process
  connect_( *source, *target, sgid, target_thread, syn, d, w );
  return;
}

This ensures that only local neurons (same MPI process) will send spikes to the music proxy.

To make further discussion easier (and more in keeping with normal practice in the NEST GitHub repo), it would be great if you could create a PR now, so that we can discuss the code in the PR.

kappeld added a commit to IGITUGraz/nest-simulator that referenced this issue Apr 11, 2017

@mhoff (Contributor) commented Apr 11, 2017

We will do one final test on our side and file the PR presumably tomorrow.

@jougs (Contributor) commented Apr 11, 2017

@kappeld, @mhoff: many thanks for tracking this down. Other than the MUSIC proxies, there are no nodes with this specific combination of flags. And if there were, I assume they would need to be treated in the same way.

Can you please start a PR with your changes? I'll review it positively and quickly if you do.

@heplesser heplesser added P: PR Created and removed P: Pending labels Apr 18, 2017

jougs added a commit that referenced this issue Apr 25, 2017

Merge pull request #710 from IGITUGraz/iss696
Fix connections to music_event_out_proxy for local_num_threads > 1 (#710, #696)

@heplesser (Contributor) commented Jun 29, 2017

Closing as the issue has been fixed by #710.

@heplesser heplesser closed this Jun 29, 2017
