New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEST encounters bad_alloc exception when changing tic length #644

Closed
heplesser opened this Issue Jan 26, 2017 · 8 comments

Comments

Projects
None yet
4 participants
@heplesser
Contributor

heplesser commented Jan 26, 2017

The following causes a bad_malloc exception (per hash 1cb0529, OSX 10.11 with gcc 6.2):

SLI ] 0 << /tics_per_ms 1. /resolution 1. >> SetStatus
SLI ] 0 [[/tics_per_ms /resolution]] get ==
[1.000000e+00 1.000000e+00]
SLI ] 1 Simulate
nest(23226,0x7fff7cf0a000) malloc: *** mach_vm_map(size=9223372036857856) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

Jan 26 17:08:24 Simulate_d [Error]: C++Exception
    std::bad_alloc

Interestingly, a number of examples and tests, e.g. ArtificialSynchrony.sli and test_iaf_ps_dc_t_accuracy.sli set resolution and tics in this way and work. In those cases, a full network is built before Simulate is called. Creating just a single neuron does not get rid of the exception.

See also #643.

@heplesser

This comment has been minimized.

Show comment
Hide comment
@heplesser

heplesser Jan 26, 2017

Contributor

@gtrensch This is a slightly mysterious bug that may provide an interesting opportunity to delve into the NEST kernel---interested?

Contributor

heplesser commented Jan 26, 2017

@gtrensch This is a slightly mysterious bug that may provide an interesting opportunity to delve into the NEST kernel---interested?

@gtrensch

This comment has been minimized.

Show comment
Hide comment
@gtrensch

gtrensch Jan 30, 2017

Contributor

@heplesser yes, I like mysterious bugs.

Contributor

gtrensch commented Jan 30, 2017

@heplesser yes, I like mysterious bugs.

@apeyser

This comment has been minimized.

Show comment
Hide comment
@apeyser

apeyser Feb 20, 2017

Contributor

So the question is which array that is binned by tic size isn't properly resized on resetting tics?

Contributor

apeyser commented Feb 20, 2017

So the question is which array that is binned by tic size isn't properly resized on resetting tics?

@gtrensch

This comment has been minimized.

Show comment
Hide comment
@gtrensch

gtrensch Feb 20, 2017

Contributor

@apeyser It's the spike buffer. Something goes wrong with min_delay calculation in case tics_per_ms is < 1000.0.

Contributor

gtrensch commented Feb 20, 2017

@apeyser It's the spike buffer. Something goes wrong with min_delay calculation in case tics_per_ms is < 1000.0.

@gtrensch

This comment has been minimized.

Show comment
Hide comment
@gtrensch

gtrensch Feb 22, 2017

Contributor

@heplesser before I create a pull request, can you please have a look at the text below, in particular my remarks at the end, and let me know what you think. Perhaps we can also discuss this in Oslo during the hackathon.

@apeyser many thanks for your hints and the debug session !!!

The problem is the following:
Let's assume that tics_per_ms was set to 100. and it is:

TICS_PER_STEP = 100
MS_PER_STEP = 1.000000 

This will set following max limits (LIM_MAX):

tics (rounded to step boundary) = 1152921504606846900
steps = 11529215046068469 
ms = 11529215046068468.000000 

During the instantiation of the DelayCheckers, the value of tics of the min_delay_ object is preset with the positive infinity value (not intuitive) by calling Time::pos_inf(). Which means:
( ==> LONG_LONG_MAX / 8 + 1 ==> 1152921504606846976 ==> 0x1000000000000000 ==> the infinity value LIM_POS_INF.tics)

DelayChecker::Calibrate destroys this infinity value when TimeConverter is called. The problem is not visible for TICS_PER_STEP >= 1000. A boundary check hides the issue when a new Time-object is instantiated. See the nested macro below: The absolute infinity value contained in t.t won't be smaller than LIM_MAX.ms.

Time
TimeConverter::from_old_tics( tic_t t_old ) const
{
  double ms = t_old / OLD_TICS_PER_MS;
  return Time::ms( ms );     // <-- return Time object
}
Boudary check:

Time( ms t ) : tics( ( time_abs( t.t ) < LIM_MAX.ms ) ? 
    static_cast< tic_t >( t.t * Range::TICS_PER_MS + 0.5 ) : 
    ( t.t < 0 ) ? LIM_NEG_INF.tics : LIM_POS_INF.tics)

For TICS_PER_STEP < 1000 this condition is not given any longer. In the example above, it will be:

t.t = 1152921504606847 this is < ms (11529215046068468.000000)

... causing a new tics value of 115292150460684704 after calibration. This is not an infinity value anymore and the if-statment below in ConnectionManager::update_delay_extrema_ which checks against infinity will fail !

if ( min_delay_ == Time::pos_inf().get_steps() )
    min_delay_ = Time::get_resolution().get_steps();

Based on the invalid tics value an inappropriate steps value is calculated which in turn is used to resize the moduli_ array during simulation start causing the bad_alloc.

Solution:
I go along with @apeyser who came up with the idea to simply preserve the infinity values in the TimeConverter class. This would fix the issue.

Some general remarks:
I take a critical perspective on how infinity values are handeled in the code. The program flow relies on numbers in data fields. This is potentially error-prone. I understand the concept of doing calculations with infinity values, but wouldn't it be better to keep the state of a variable separated from the data?

In my opinion the TimerConverter belongs to the Time class, I would merge them.

The Time class itself is very hard to read and to debug. E.g.: Nested macros in initializers should be better functions. What looks like constants is sometimes overwritten. Some of the names are ambiguous and not descriptive (min_delay, for example, is used for different things).

Will say, I suggest a small conservative refactoring.

Contributor

gtrensch commented Feb 22, 2017

@heplesser before I create a pull request, can you please have a look at the text below, in particular my remarks at the end, and let me know what you think. Perhaps we can also discuss this in Oslo during the hackathon.

@apeyser many thanks for your hints and the debug session !!!

The problem is the following:
Let's assume that tics_per_ms was set to 100. and it is:

TICS_PER_STEP = 100
MS_PER_STEP = 1.000000 

This will set following max limits (LIM_MAX):

tics (rounded to step boundary) = 1152921504606846900
steps = 11529215046068469 
ms = 11529215046068468.000000 

During the instantiation of the DelayCheckers, the value of tics of the min_delay_ object is preset with the positive infinity value (not intuitive) by calling Time::pos_inf(). Which means:
( ==> LONG_LONG_MAX / 8 + 1 ==> 1152921504606846976 ==> 0x1000000000000000 ==> the infinity value LIM_POS_INF.tics)

DelayChecker::Calibrate destroys this infinity value when TimeConverter is called. The problem is not visible for TICS_PER_STEP >= 1000. A boundary check hides the issue when a new Time-object is instantiated. See the nested macro below: The absolute infinity value contained in t.t won't be smaller than LIM_MAX.ms.

Time
TimeConverter::from_old_tics( tic_t t_old ) const
{
  double ms = t_old / OLD_TICS_PER_MS;
  return Time::ms( ms );     // <-- return Time object
}
Boudary check:

Time( ms t ) : tics( ( time_abs( t.t ) < LIM_MAX.ms ) ? 
    static_cast< tic_t >( t.t * Range::TICS_PER_MS + 0.5 ) : 
    ( t.t < 0 ) ? LIM_NEG_INF.tics : LIM_POS_INF.tics)

For TICS_PER_STEP < 1000 this condition is not given any longer. In the example above, it will be:

t.t = 1152921504606847 this is < ms (11529215046068468.000000)

... causing a new tics value of 115292150460684704 after calibration. This is not an infinity value anymore and the if-statment below in ConnectionManager::update_delay_extrema_ which checks against infinity will fail !

if ( min_delay_ == Time::pos_inf().get_steps() )
    min_delay_ = Time::get_resolution().get_steps();

Based on the invalid tics value an inappropriate steps value is calculated which in turn is used to resize the moduli_ array during simulation start causing the bad_alloc.

Solution:
I go along with @apeyser who came up with the idea to simply preserve the infinity values in the TimeConverter class. This would fix the issue.

Some general remarks:
I take a critical perspective on how infinity values are handeled in the code. The program flow relies on numbers in data fields. This is potentially error-prone. I understand the concept of doing calculations with infinity values, but wouldn't it be better to keep the state of a variable separated from the data?

In my opinion the TimerConverter belongs to the Time class, I would merge them.

The Time class itself is very hard to read and to debug. E.g.: Nested macros in initializers should be better functions. What looks like constants is sometimes overwritten. Some of the names are ambiguous and not descriptive (min_delay, for example, is used for different things).

Will say, I suggest a small conservative refactoring.

@apeyser

This comment has been minimized.

Show comment
Hide comment
@apeyser

apeyser Feb 23, 2017

Contributor

I think the issue is that their exist a large number of time units: steps (int), ms (float), tics (int), delay (naked int)... there should be a universal unit, and all inputs should immediately be converted into it. It should be a very simple class without all the object oriented complexity.

It needs to be transparent and fast -- the previous iteration was eating 50% of simulation time in constructed and handling time objects.

Unfortunately, this is a lot of work.

Contributor

apeyser commented Feb 23, 2017

I think the issue is that their exist a large number of time units: steps (int), ms (float), tics (int), delay (naked int)... there should be a universal unit, and all inputs should immediately be converted into it. It should be a very simple class without all the object oriented complexity.

It needs to be transparent and fast -- the previous iteration was eating 50% of simulation time in constructed and handling time objects.

Unfortunately, this is a lot of work.

@heplesser

This comment has been minimized.

Show comment
Hide comment
@heplesser

heplesser Mar 2, 2017

Contributor

@gtrensch @apeyser Thank you very much for your detective work! I agree that the best fix for now is to modify the TimeConverter class so that it returns arguments equal to positive/negative infinity unchanged. It is unfortunate that integer types, in contrast to IEEE754 doubles, do not have properly defined infinity values.

The reason for initializing with positive/negative infinity is that it makes comparisons easy---otherwise, one would always have to check against invalid values.

The Time class is probably ready for an overhaul, but that is a different issue. But we need time in al three representations (ms, steps, tics) for consistency and accuracy.

@gtrensch Would you create a PR?

Contributor

heplesser commented Mar 2, 2017

@gtrensch @apeyser Thank you very much for your detective work! I agree that the best fix for now is to modify the TimeConverter class so that it returns arguments equal to positive/negative infinity unchanged. It is unfortunate that integer types, in contrast to IEEE754 doubles, do not have properly defined infinity values.

The reason for initializing with positive/negative infinity is that it makes comparisons easy---otherwise, one would always have to check against invalid values.

The Time class is probably ready for an overhaul, but that is a different issue. But we need time in al three representations (ms, steps, tics) for consistency and accuracy.

@gtrensch Would you create a PR?

@heplesser heplesser added this to the NEST 2.12.1 milestone Jun 29, 2017

@gtrensch

This comment has been minimized.

Show comment
Hide comment
@gtrensch

gtrensch Jul 18, 2017

Contributor

PR has been created. The TimeConverter class has not been merged into the Time class. This would be better done in the course of a refactoring task.
This issue can be closed after merge.

Contributor

gtrensch commented Jul 18, 2017

PR has been created. The TimeConverter class has not been merged into the Time class. This would be better done in the course of a refactoring task.
This issue can be closed after merge.

@terhorstd terhorstd added P: PR Created and removed P: Pending labels Jul 31, 2017

@terhorstd terhorstd closed this in 7e25df7 Aug 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment