Connections Optimization #161

Open
ctrl-z-9000-times opened this issue Dec 14, 2018 · 12 comments

@ctrl-z-9000-times

commented Dec 14, 2018

I think that computeActivity could be made faster with better book-keeping. For reference:

void Connections::computeActivity(
    vector<UInt32> &numActiveConnectedSynapsesForSegment,
    vector<UInt32> &numActivePotentialSynapsesForSegment,
    const vector<CellIdx> &activePresynapticCells,
    Permanence connectedPermanence) const {

  for (CellIdx cell : activePresynapticCells) {
    if (synapsesForPresynapticCell_.count(cell)) {
      for (Synapse synapse : synapsesForPresynapticCell_.at(cell)) {
        const SynapseData &synapseData = synapses_[synapse];
        ++numActivePotentialSynapsesForSegment[synapseData.segment];

        if (synapseData.permanence >= connectedPermanence - EPSILON) {
          ++numActiveConnectedSynapsesForSegment[synapseData.segment];
        }
      }
    }
  }
}

It should not check the permanence here. Synapses don't change connected state very often, and when they do, the Connections class will know. Instead, the Connections class should keep two lists of synapses for each presynaptic cell.

  • Replace synapsesForPresynapticCell_ with connectedSynapsesForPresynapticCell_ and potentialSynapsesForPresynapticCell_.

  • The compute method can then remove the line if (synapseData.permanence >= connectedPermanence). Replace it with two for-loops: first iterate through connectedSynapsesForPresynapticCell_ to compute numActiveConnectedSynapsesForSegment. Then copy numActiveConnectedSynapsesForSegment into numActivePotentialSynapsesForSegment. Finally, iterate through potentialSynapsesForPresynapticCell_ to finish computing numActivePotentialSynapsesForSegment (see the sketch after this list). This reduces the amount of random-access incrementing by the size of numActiveConnectedSynapsesForSegment, in favour of copying a numSegments-sized buffer.
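
A minimal sketch of what that restructured loop might look like, assuming the proposed connectedSynapsesForPresynapticCell_ / potentialSynapsesForPresynapticCell_ maps exist and that the potential map holds only the not-yet-connected synapses (the names and container types here are illustrative, not a final implementation):

void Connections::computeActivity(
    vector<UInt32> &numActiveConnectedSynapsesForSegment,
    vector<UInt32> &numActivePotentialSynapsesForSegment,
    const vector<CellIdx> &activePresynapticCells) const {

  // 1) Connected synapses only -- no permanence check in the hot loop.
  for (CellIdx cell : activePresynapticCells) {
    const auto connected = connectedSynapsesForPresynapticCell_.find(cell);
    if (connected == connectedSynapsesForPresynapticCell_.end()) continue;
    for (Synapse synapse : connected->second) {
      ++numActiveConnectedSynapsesForSegment[synapses_[synapse].segment];
    }
  }

  // 2) The potential counts start as a copy of the connected counts...
  numActivePotentialSynapsesForSegment = numActiveConnectedSynapsesForSegment;

  // 3) ...then add the not-yet-connected (potential-only) synapses.
  for (CellIdx cell : activePresynapticCells) {
    const auto potential = potentialSynapsesForPresynapticCell_.find(cell);
    if (potential == potentialSynapsesForPresynapticCell_.end()) continue;
    for (Synapse synapse : potential->second) {
      ++numActivePotentialSynapsesForSegment[synapses_[synapse].segment];
    }
  }
}

In this sketch connectedPermanence no longer needs to be a parameter: the threshold is applied once, at learning time, when a synapse moves between the two lists.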

Furthermore, the SpatialPooler uses numActiveConnectedPresyn but does not use numActivePotentialPresyn, so by splitting this into two methods, the SpatialPooler's compute method can avoid seeing the potential synapses altogether (until it learns, of course).

Furthermore, other parts of the code want to be notified when a synapse changes state, such as the connectedCount book-keeping and possibly ConnectionsEventHandler::onUpdateSynapsePermanence. So it makes sense to have a single place in Connections that checks for this event and handles it: the method Connections::updateSynapsePermanence.
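
A rough sketch of what that single place could look like, assuming the connected/potential split from above and a stored threshold member (connectedPermanence_ and moveSynapseBetweenLists_ are hypothetical names, not existing API):

void Connections::updateSynapsePermanence(Synapse synapse, Permanence permanence) {
  SynapseData &data = synapses_[synapse];
  const bool wasConnected = data.permanence >= connectedPermanence_ - EPSILON;
  const bool isConnected  = permanence      >= connectedPermanence_ - EPSILON;
  data.permanence = permanence;

  if (wasConnected != isConnected) {
    // The single place that reacts to a connected-state change:
    //  - move the synapse between its presynaptic cell's connected/potential lists
    //  - update the connectedCount book-keeping
    //  - fire ConnectionsEventHandler::onUpdateSynapsePermanence, if registered
    moveSynapseBetweenLists_(synapse, isConnected);   // hypothetical helper
  }
}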

EDIT:
Next steps:

I assume all your work has been merged? Can we close this as resolved?

Ideas for optimization which are not yet implemented:

@breznak

Member

commented Dec 14, 2018

So it makes sense to have a single place in connections where it checks for this event and deals with it: method Connections::updateSynapsePermanence.

Very good analysis and I like this approach very much!
But please wait until after #153 for the next step.

@ctrl-z-9000-times

Author

commented Dec 15, 2018

But, please wait

Ha ha, yes. This is going to be a big change if/when it happens. By my estimate it will probably take 2 weeks to 2 months to do. I intend to work on other tasks after #153 and return to this one the next time the speed of connections::compute becomes an issue. I'm sure that will happen at some point when we attempt to scale up the program.

@ctrl-z-9000-times

Author

commented Dec 15, 2018

Note: in order to quickly switch a synapse's connected state, the synapse needs to know its presynaptic input cell, and also where it sits within that cell's synapse list. This way, removing a synapse from the presynaptic cell's list does not require searching the list for it. The trick is to overwrite the removed synapse with the final synapse in the list, so that none of the adjacent synapses move around. After popping the end of the list on top of the (now removed) synapse, the moved synapse needs to update its presynaptic-location index.

This amounts to an extra 32-bit integer in the synapse's data structure.
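
For illustration, the O(1) removal could look like this (the field presynapticCellIndex and the method name are hypothetical):

// Remove `synapse` from its presynaptic cell's list without searching the list.
void Connections::removeFromPresynapticList_(vector<Synapse> &list, Synapse synapse) {
  const UInt32 index = synapses_[synapse].presynapticCellIndex;  // the extra 32-bit field
  const Synapse moved = list.back();
  list[index] = moved;                            // overwrite the removed synapse in place
  synapses_[moved].presynapticCellIndex = index;  // fix the moved synapse's back-reference
  list.pop_back();
}

This also works when the removed synapse happens to be the last element: it overwrites itself before the pop.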

@ctrl-z-9000-times

Author

commented Dec 15, 2018

Extra: presynaptic cells keep a list of synapses, when what the compute loop really needs is the segment each synapse connects to. Connections could keep another structure, running parallel to the presynaptic synapse lists, that stores the target segments. The compute method could then omit the SynapseData lookup entirely; the presynaptic synapse data would only be used when a synapse changes connected state.
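
For illustration, the hot loop could then look something like this, where connectedSegmentsForPresynapticCell_ is a hypothetical structure kept in parallel with the connected-synapse lists:

for (CellIdx cell : activePresynapticCells) {
  const auto it = connectedSegmentsForPresynapticCell_.find(cell);
  if (it == connectedSegmentsForPresynapticCell_.end()) continue;
  for (Segment segment : it->second) {
    ++numActiveConnectedSynapsesForSegment[segment];  // no synapses_[...] lookup
  }
}

The cost is keeping that parallel structure in sync whenever a synapse is created, destroyed, or changes connected state, which is exactly the event updateSynapsePermanence already detects.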

@ctrl-z-9000-times

Author

commented Dec 15, 2018

Another area for optimization: method Connections::updateSynapsePermanence currently clips permanence values into the proper range: [0, 1]

@breznak
my idea was a micro-optimization - to avoid one op for max and one op for assign. I wanted to validate that in Connections the permanence is always within the [0, 1] range, so even this clipping could be removed.
(the assert would be there to make the code clearer, and to check this in debug builds)
But maybe that's overcomplicating things

@ctrl-z-9000-times
[I agree.] Multiple variants of this function could be useful and optimized (see the sketch after this list):

  • We could have an increment variant, and a decrement variant, and these would be able to omit a single bounds check each.
  • A variant with both checks for creature comfort.
  • A common version with no checks, for the other variants to call.
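
As a sketch, assuming an unchecked common version exists (all names below are hypothetical; std::min/std::max come from <algorithm>):

void Connections::incrementSynapsePermanence(Synapse syn, Permanence delta) {
  // An increment can only violate the upper bound.
  updateSynapsePermanenceUnchecked_(syn,
      std::min(synapses_[syn].permanence + delta, (Permanence)1.0));
}

void Connections::decrementSynapsePermanence(Synapse syn, Permanence delta) {
  // A decrement can only violate the lower bound.
  updateSynapsePermanenceUnchecked_(syn,
      std::max(synapses_[syn].permanence - delta, (Permanence)0.0));
}

void Connections::updateSynapsePermanence(Synapse syn, Permanence p) {
  // Both checks, for callers that may pass anything.
  updateSynapsePermanenceUnchecked_(syn,
      std::min(std::max(p, (Permanence)0.0), (Permanence)1.0));
}
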
@breznak

Member

commented Dec 16, 2018

Related issues:

  • generic performance optimizations #158
  • possibly orthogonal issue, #154 - makes the logic nicer, maybe slower (maybe not)

@breznak breznak added this to the optimization milestone Dec 19, 2018

@ctrl-z-9000-times

Author

commented Dec 19, 2018

Another area for optimization: the presynaptic-cell structure uses a map when it could use a list (a plain vector indexed by cell). This would add the constraint on inputs that their IDs start at zero, are contiguous, and that the total number of inputs is known ahead of time. The advantage of a list over a map is memory and, to a lesser extent, speed.
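
For illustration, the change would amount to something like this (numInputs is a hypothetical construction-time parameter):

// A dense vector indexed by presynaptic cell id, instead of a map keyed by cell id.
std::vector<std::vector<Synapse>> synapsesForPresynapticCell_;

// At construction time, once the number of inputs is known:
synapsesForPresynapticCell_.resize(numInputs);

// The hot-loop lookup then becomes a direct index instead of a map lookup:
//   for (Synapse synapse : synapsesForPresynapticCell_[cell]) { ... }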

@breznak

Member

commented Dec 19, 2018

The advantage of a list over a map is memory and, to a lesser extent, speed.

This would probably need a micro-benchmark; I wonder how significant/justified such a change would be.

@ctrl-z-9000-times

Author

commented Jan 2, 2019

Update

I wanted my program to run faster so I did two of these optimizations. These changes are prototypes, not ready for merging.

  • Presynaptic cells keep two sets of synapses (connected & potential).
  • Presynaptic cells keep list of target segments.

The code is located on branch mnist because I used my MNIST dataset solver to profile the connections class. To profile I used valgrind --tool=cachegrind mnist_sp -v and then cg_annotate. See: http://valgrind.org/

Results

The program mnist_sp achieves ~94% accuracy on the MNIST dataset. These performance optimizations have no effect on functionality.

Runtime:

  • Before: ~30 minutes
  • After: ~7 minutes

Instruction Reads:

  • Before: 3.16 trillion
  • After: 2.26 trillion

In method Connections::computeActivity, cache-miss rate = L1 data-cache misses / data reads:

  • Before: 104 billion / 570 billion = 18% cache misses
  • After: 20 billion / 170 billion = 12% cache misses

Analysis

By keeping separate lists of connected & potential presynapses, the innermost loop of method Connections::computeActivity() is shorter and simpler. I did not run valgrind's branch-predictor tool, but I expect that the check permanence >= connectedThreshold was a major pain point, and it has now been moved out of the critical path, from compute to learn. Furthermore, I would expect (but have not verified) that checking whether a synapse has changed state is predictably false (so branch prediction works well and fast), whereas checking whether a synapse is connected is unpredictable.

By keeping the target segments for each presynaptic cell, the innermost loop of the compute method can omit a memory read of the synapse's data and instead directly access the segment's data. This optimization, combined with moving the permanence check and not iterating over disconnected synapses, yielded a 70% reduction in memory reads. The cache miss rate decreased by 35%.

@breznak

Member

commented Jan 14, 2019

@ctrl-z-9000-times this is HUGE! Big KUDOs! 👍 👍

Runtime:

Before: ~30 minutes
After: ~7 minutes

Instruction Reads:

Before: 3.16 trillion
After: 2.26 trillion

Also, the code looks much cleaner thanks to the finer grained paths.
I assume all your work (from MNIST branch) has been merged as part of the PR #190?
Can we close this as resolved?

@ctrl-z-9000-times

Author

commented Jan 14, 2019

I assume all your work has been merged? Can we close this as resolved?

Ideas for optimization which are not yet implemented:

@ctrl-z-9000-times

Author

commented Jan 19, 2019

Another area for optimization:

Connections::compute(
    UInt32 *numActiveConnectedSynapsesForSegment,
    UInt32 *numActivePotentialSynapsesForSegment)

Both could probably be made into arrays of UInt16 or even UInt8. Would need to verify gains, and also check that they don't overflow. This would reduce the memory footprint of a performance critical data structure.
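
For illustration, a narrower counter with a debug-time overflow check might look like this (the names are hypothetical; assert and numeric_limits come from <cassert> and <limits>):

using SynapseCount = UInt16;  // or UInt8, if a segment never has that many active synapses

inline void incrementChecked(std::vector<SynapseCount> &counts, Segment segment) {
  assert(counts[segment] < std::numeric_limits<SynapseCount>::max());  // catch overflow in debug builds
  ++counts[segment];
}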
