Abolish the SegmentOverlap. Embrace references. #3307
By analyzing the blame information on this pull request, we identified @andrewmalta13, @chetan51 and @david-ragazzi to be potential reviewers.
EDIT: For anyone seeing this at a later date: the following comments came from my misunderstanding of what the internals of Segments represent and from mistaking an "ordinal" for an index. They no longer apply.

I'd like to use the previous code to point out a subtle After spending some time adjusting subtle differences in Java to get exactly the same output as with the Python version, I want to point out that the previous code had a slight functional bug that doesn't get exposed by the unit tests. At the end of The key: really was arbitrary (due to the cell index being multiplied to These sorted segments were used in

At first, I instinctively grouped by the ascending columns of the cells the segments belonged to, and I assumed that the above key was doing the same. I was getting superior results but not equal results (my anomaly scores were better), which was a weird, unwanted result because I was shooting for absolute parity. Then I took another look at the key and examined what it was actually producing, which led to my discovery of the bug.

Summary:
You are doing a great job, @mrcslws! Keep up the great work!
A couple quick points:
cont'd here so as not to clutter this PR with loosely related commentary 😉
@@ -187,12 +161,9 @@ def segmentsForCell(self, cell):
@return (list) Segment objects representing segments on the given cell
I need to change this docstring's return type. It's now a generator.
Current status: I'm thinking about the Connections interface. Right now it's a little confusing having direct access to all the data, and having to know which data you can/can't access. For example, right now you're supposed to read

I think I've made the Connections interface deceptive. It doesn't guide you to do the right thing.

This PR is not ready for review.
Verdict: I think this is a good change, it just needs to explicitly mark which parts of the

We could add a layer of indirection, using integers to refer to segments and synapses. The benefit of this would be to avoid incorrect use of the API. But I can't justify that -- using direct references to the data will always be faster. We've seen that these lookups are expensive in Python.
@scottpurdy ready for review. My comment above:
no longer applies. I reverted parts of my change and then squashed my commits to avoid churning the blame info. You might prefer to review the two commits individually, rather than the full diff.
@@ -210,36 +210,6 @@ def testReinforceCorrectlyActiveSegments(self):
self.assertAlmostEqual(.42, tm.connections.dataForSynapse(is1).permanence)

def testNoGrowthOnCorrectlyActiveSegments(self):
Why isn't this a valid test anymore?
It stopped being valid with the recent algorithm change #3293
It was explicitly checking that we don't grow synapses on predicted columns. Now we do.
The test was flawed (my bad) -- it should be failing now, but instead it only sometimes fails, depending on the random number generator. It didn't do a good job of controlling the winner cells, so sometimes no synapse growth happens in today's code. If I were writing this test today, I'd use 1 cell per column to control the winner cells.
To make sure I understand this, it doesn't change the algorithm, just how the segment-related data is stored?
Correct.
to be consistent after serialize / deserialize.

"""
return ((self.idx, self.cell, self._synapses, self._numDestroyedSynapses,
Personally, I find this difficult to read.
I prefer the following, even if it's more verbose:
    return (self.idx == other.idx and
            self.cell == other.cell and
            self._synapses == other._synapses and
            self._numDestroyedSynapses == other._numDestroyedSynapses and
            self._destroyed == other._destroyed and
            self._lastUsedIteration == other._lastUsedIteration)

or:

    return all((self.idx == other.idx,
                self.cell == other.cell,
                self._synapses == other._synapses,
                self._numDestroyedSynapses == other._numDestroyedSynapses,
                self._destroyed == other._destroyed,
                self._lastUsedIteration == other._lastUsedIteration))
👍
Fixes #3306
This makes two changes to the Connections and TemporalMemory.
Change 1: Abolish the SegmentOverlap
This problem is described in #3306
The solution is to keep the `numActivePotentialSynapsesForSegment` list that was already being created inside of `computeActivity`. Now the TM can look up the "overlap" for any segment by simply doing `numActivePotentialSynapsesForSegment[segment.flatIdx]`.

This makes the code simpler and faster. Part of the reason it's faster is because we're not instantiating a `SegmentOverlap` class for every single active / matching segment.

Now the `activeSegments` and `matchingSegments` are just simple lists of segments.

Change 2: Use `Segment` and `Synapse` instances for both data storage and identity

This gets rid of the `SegmentData` and the `SynapseData`. Now a `Segment` is both the type used inside of the Connections for data storage and the type used to refer to a segment. Same for the `Synapse`.

The separation between a `Segment` and a `SegmentData` makes sense in C++, where structs are passed around by value. In C++ there's an incentive to keep the `Segment` struct small, and it also wouldn't make sense to include large amounts of data on the `Segment` struct because changing it won't have any effect on the Connections instance's internal copy.

In Python, we pass around references (because that's what you do in all garbage-collected languages). When you pass around references, it doesn't matter how big the data structure is, and the reference always points to the most up-to-date information. Also in Python, every time you create a new instance of any class, you create something that has to be garbage-collected later, so there's a lot more overhead than in C++ where a struct is just a blob of bits, no different from any other value.
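The difference between holding a reference and holding a by-value copy can be seen in a small sketch. The classes below are simplified, hypothetical stand-ins (they are not the actual `Connections` or `Segment` implementation from this PR), and `copy.deepcopy` is used only to imitate what passing a struct by value would give you.

```python
import copy


class Segment(object):
    """Hypothetical stand-in: one object used for both storage and identity."""

    def __init__(self, cell, flatIdx):
        self.cell = cell
        self.flatIdx = flatIdx
        self._synapses = []


class Connections(object):
    """Toy container for illustration; not the real Connections class."""

    def __init__(self):
        self._segmentsForCell = {}
        self._nextFlatIdx = 0

    def createSegment(self, cell):
        # In this sketch, this is the only place a Segment is instantiated.
        segment = Segment(cell, self._nextFlatIdx)
        self._nextFlatIdx += 1
        self._segmentsForCell.setdefault(cell, []).append(segment)
        return segment

    def segmentsForCell(self, cell):
        # Yields the stored instances themselves; no wrapper objects are created.
        for segment in self._segmentsForCell.get(cell, ()):
            yield segment


connections = Connections()
created = connections.createSegment(cell=7)

looked_up = next(connections.segmentsForCell(7))
snapshot = copy.deepcopy(looked_up)  # imitates a by-value copy

created._synapses.append("synapse-0")  # placeholder for a Synapse instance

same_object = looked_up is created
reference_sees_update = len(looked_up._synapses)
copy_sees_update = len(snapshot._synapses)
```

Here the looked-up segment is the very object that was created, so it observes the new synapse, while the copied snapshot does not.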
With this change, we only instantiate a `Segment` or `Synapse` when creating a new segment or synapse. Afterward we just stop instantiating stuff. When you call `segmentsForCell`, it returns existing `Segment` instances.

Perf results
Before:
After:
Measuring in timesteps-per-second, this is ~60% faster.
TM algorithm results
This does not affect results. It's the same TemporalMemory algorithm.
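As a rough illustration of Change 1, the following sketch shows how a flat list indexed by `segment.flatIdx` can replace a per-segment overlap wrapper. The `Segment` class, its `presynapticCells` field, and the `countActivePotentialSynapses` helper are hypothetical simplifications made up for this example; the real `computeActivity` has a different signature and also accounts for permanences.

```python
class Segment(object):
    """Hypothetical minimal segment; only flatIdx and its inputs matter here."""

    def __init__(self, flatIdx, presynapticCells):
        self.flatIdx = flatIdx
        self.presynapticCells = presynapticCells


def countActivePotentialSynapses(segments, activeCells):
    """Return a list where entry i is the overlap of the segment with flatIdx i."""
    counts = [0] * len(segments)
    for segment in segments:
        counts[segment.flatIdx] = sum(
            1 for cell in segment.presynapticCells if cell in activeCells)
    return counts


segments = [
    Segment(0, [1, 2, 3, 4]),
    Segment(1, [3, 5]),
    Segment(2, [9]),
]
activeCells = {2, 3, 4}
matchingThreshold = 2  # assumed value, chosen only for this example

numActivePotentialSynapsesForSegment = countActivePotentialSynapses(
    segments, activeCells)

# Matching segments are a plain list of segments, with no wrapper objects.
matchingSegments = [
    s for s in segments
    if numActivePotentialSynapsesForSegment[s.flatIdx] >= matchingThreshold
]

# The overlap for any segment is a single list lookup.
bestOverlap = numActivePotentialSynapsesForSegment[matchingSegments[0].flatIdx]
```

Because the counts live in one list, nothing needs to be allocated per active or matching segment in order to look up its overlap later.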