
Abolish the SegmentOverlap. Embrace references. #3307

Merged
4 commits merged into numenta:master on Sep 2, 2016

Conversation

@mrcslws (Contributor) commented Aug 28, 2016

Fixes #3306

This makes two changes to the Connections and TemporalMemory.

Change 1: Abolish the SegmentOverlap

This problem is described in #3306.

The solution is to keep the numActivePotentialSynapsesForSegment list that was already being created inside of computeActivity. Now the TM can look up the "overlap" for any segment by simply doing numActivePotentialSynapsesForSegment[segment.flatIdx].

This makes the code simpler and faster. Part of the reason it's faster is that we're no longer instantiating a SegmentOverlap class for every single active / matching segment.

Now the activeSegments and matchingSegments are just simple lists of segments.
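
To make Change 1 concrete, here is a minimal sketch of the lookup. Only segment.flatIdx and the numActivePotentialSynapsesForSegment name come from the PR description; the helper function and the commented usage are illustrative assumptions, not the actual TM code.

# Sketch only: the per-segment potential-synapse counts are indexed by
# flatIdx, so a segment's "overlap" becomes a plain list lookup instead of
# a field on a SegmentOverlap object.
def overlapForSegment(segment, numActivePotentialSynapsesForSegment):
  return numActivePotentialSynapsesForSegment[segment.flatIdx]

# Hypothetical usage inside the TM's learning step:
#
#   for segment in matchingSegments:
#     overlap = overlapForSegment(segment, numActivePotentialSynapsesForSegment)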

Change 2: Use Segment and Synapse instances for both data storage and identity

This gets rid of the SegmentData and the SynapseData. Now a Segment is both the type used inside of the Connections for data storage and the type used to refer to a segment. Same for the Synapse.

The separation between a Segment and a SegmentData makes sense in C++, where structs are passed around by value. In C++ there's an incentive to keep the Segment struct small, and it also wouldn't make sense to put large amounts of data on the Segment struct, because changing it would have no effect on the Connections instance's internal copy.

In Python, we pass around references (because that's what you do in all garbage-collected languages). When you pass around references, it doesn't matter how big the data structure is, and the reference always points to the most up-to-date information. Also, in Python every new class instance is something that has to be garbage-collected later, so there's a lot more overhead than in C++, where a struct is just a blob of bits, no different from any other value.

With this change, we only instantiate a Segment or Synapse when creating a new segment or synapse; after that, no new instances are created. When you call segmentsForCell, it returns existing Segment instances.
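
As a rough illustration of the reference-based approach, here is a simplified sketch. It is not the actual Connections implementation; the class and attribute names below (ConnectionsSketch, _cellSegments, _nextFlatIdx) are assumptions chosen for the example.

class Segment(object):
  """Sketch: one object is both the stored data and the segment's identity."""

  def __init__(self, cell, flatIdx):
    self.cell = cell
    self.flatIdx = flatIdx
    self._synapses = []  # internal state, managed by the Connections


class ConnectionsSketch(object):
  """Sketch of the parts of Connections relevant to this change."""

  def __init__(self, numCells):
    self._cellSegments = [[] for _ in range(numCells)]
    self._nextFlatIdx = 0

  def createSegment(self, cell):
    # The only place a Segment gets instantiated.
    segment = Segment(cell, self._nextFlatIdx)
    self._nextFlatIdx += 1
    self._cellSegments[cell].append(segment)
    return segment

  def segmentsForCell(self, cell):
    # Hands back the existing Segment instances -- no copies, no new objects.
    return self._cellSegments[cell]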

Perf results

Before:

> python scripts/temporal_memory_performance_benchmark.py -i tm_py -t hotgym
Test: hotgym

tm_py: 24.044563s

After:

> python scripts/temporal_memory_performance_benchmark.py -i tm_py -t hotgym
Test: hotgym

tm_py: 14.601669s

Measuring in timesteps per second, this is roughly 65% faster (24.044563 s / 14.601669 s ≈ 1.65×).

TM algorithm results

This does not affect results. It's the same TemporalMemory algorithm.

@numenta-ci (Contributor):

By analyzing the blame information on this pull request, we identified @andrewmalta13, @chetan51 and @david-ragazzi to be potential reviewers

@cogmission (Contributor) commented Aug 29, 2016

EDIT: For anyone seeing this at a later date: the following comments came from my lack of understanding of what the internals of Segments represent, and from mistaking an "ordinal" for an index. They no longer apply.

@mrcslws

I'd like to use the previous code to point out a subtle groupby2() behavior.

After spending some time adjusting subtle differences in Java to get exactly the same output as with the Python version, I want to point out that the previous code had a slight functional bug that doesn't get exposed by the unit tests.

At the end of Connections.computeActivity(), the SegmentOverlaps for both the matching and active segments were being sorted by a "bad" key.

The key, (s.segment.cell * self.maxSegmentsPerCell + s.segment.idx), really was arbitrary (because the cell index is multiplied by self.maxSegmentsPerCell, 255 by default), and s.segment.idx was actually always zero and had no effect! (The "flatIndex" should have been used instead.)

These sorted segments were used at t+1 as the entries to the new groupby2() function at the beginning of TemporalMemory.compute(), and they totally determine the quality and efficiency of the groupby2 groupings, which in turn determine which columns' cells get marked as "active/matching" and which don't.

At first, I instinctively grouped by the ascending columns of the cells the segments belonged to, and I assumed that the above key was doing the same. I was getting superior results but not equal results (my anomaly scores were better), which was a weird, unwanted outcome because I was shooting for absolute parity. Then I took another look at the key and examined what it was actually producing, which led to my discovery of the bug.

Summary:

I am no longer a fan of the groupby2() technique (I prefer your first "walking" technique with the ExcitedColumnsGenerator), because groupby2 hides a very subtle potential weirdness in the selection of the active and winner cells.

If we "go" with this "groupby2" technique (because it is faster), please be very careful to sort the source lists (active/matching lists) at the end of Connections.computeActivity() carefully and prove to yourself that we are getting the "group orderings" that allow for the best TM results (best in terms of quality - not speed)?

Quality == Prediction Accuracy

You are doing a great job, @mrcslws! Keep up the great work!

@mrcslws (Contributor, Author) commented Aug 29, 2016

A couple quick points:

  • This PR already has a big description, and your comment has nothing to do with this PR...
  • That key function is correct. Using the flatIdx would have been incorrect and would cause bugs. If all of the segments' idx values are 0, that just means no cell has multiple segments. If you've ever found two segments with identical cell and idx, that's a bug. The flatIdx is a magic number that is totally uncorrelated with where the segment is. The computeActivity function, as-is, should output segments in the correct sorted order for use in groupby2.
    • That key function is equivalent to checking a.cell < b.cell and then a.idx < b.idx as the secondary check. Andrew converted this check into a single scalar check because Python magically makes sorting faster when you return an integer from your key function. It uses the maxSegmentsPerCell because that guarantees correct key ordering -- I thought this was a pretty clever trick on Andrew's part. (This code still exists after this PR; it's just in the TM's activateDendrites method. See the sketch after this list.)
  • It sounds like the bug you're describing is located in computeActivity. This has nothing to do with groupby2. If there were a bug in this key function, it would also affect the ExcitedColumnsGenerator which is really just a specific case of groupby2. groupby2 is the generalized ExcitedColumnsGenerator.
  • The change to groupby2 had nothing to do with speed. It was a pleasant surprise that it was faster. We changed to it because it's better code, code that describes what to do using reusable primitives. There's no longer this big advanced ExcitedColumnsGenerator thingy that people need to understand. They just have to know what a groupby is. "groupby" is built into multiple programming languages.
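
Here is a standalone sketch of the key-function point above. The SimpleSegment tuple and the sample data are made up for illustration; the point is that sorting by the scalar cell * maxSegmentsPerCell + idx gives exactly the same order as sorting by the tuple (cell, idx), as long as idx < maxSegmentsPerCell.

from collections import namedtuple

# Hypothetical stand-in for a segment: a cell index plus a per-cell idx.
SimpleSegment = namedtuple("SimpleSegment", ["cell", "idx"])

maxSegmentsPerCell = 255  # the default mentioned above

segments = [SimpleSegment(cell=3, idx=1),
            SimpleSegment(cell=1, idx=0),
            SimpleSegment(cell=3, idx=0),
            SimpleSegment(cell=2, idx=4)]

# Scalar key, as in the code under discussion.
byScalar = sorted(segments, key=lambda s: s.cell * maxSegmentsPerCell + s.idx)

# Equivalent tuple key: primary sort on cell, secondary sort on idx.
byTuple = sorted(segments, key=lambda s: (s.cell, s.idx))

assert byScalar == byTuple  # identical ordering when idx < maxSegmentsPerCell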

@cogmission (Contributor) commented Aug 29, 2016

cont'd here so as not to clutter this PR with loosely related commentary 😉

@@ -187,12 +161,9 @@ def segmentsForCell(self, cell):
@return (list) Segment objects representing segments on the given cell

@mrcslws (Contributor, Author):

I need to change this docstring's return type. It's now a generator.

@mrcslws (Contributor, Author) commented Aug 30, 2016

Current status: I'm thinking about the Connections interface. Right now it's a little confusing having direct access to all the data, and having to know which data you can/can't access. For example, right now you're supposed to read synapse.permanence and synapse.presynapticCell and segment.cell but you're not supposed to read segment.synapses. Even if we remove destroyed synapses, it will be important that you not manually add to segment.synapses, and instead rely on Connections.createSynapse so that it can do extra bookkeeping.

I think I've made the Connections interface deceptive. It doesn't guide you to do the right thing.

This PR is not ready for review.

@mrcslws (Contributor, Author) commented Aug 31, 2016

Verdict: I think this is a good change; it just needs to explicitly mark which parts of the Segment and Synapse are internal state (e.g. segment._synapses), and callers should avoid editing these objects directly (e.g. synapse.permanence = 0.7).

We could add a layer of indirection, using integers to refer to segments and synapses. The benefit of this would be to avoid incorrect use of the API. But I can't justify that -- using direct references to the data will always be faster. We've seen that these lookups are expensive in Python.
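
As a tiny sketch of what "avoid editing these objects directly" could look like in practice (the class names and the updateSynapsePermanence method below are hypothetical, used only to illustrate the convention, not an actual Connections API):

class SynapseSketch(object):
  def __init__(self, presynapticCell, permanence):
    self.presynapticCell = presynapticCell
    self.permanence = permanence  # read freely; write through Connections


class ConnectionsApiSketch(object):
  """Hypothetical wrapper showing why mutations should go through Connections."""

  def updateSynapsePermanence(self, synapse, permanence):
    # One central place to clamp, validate, or record the change.
    synapse.permanence = max(0.0, min(1.0, permanence))

# Preferred:   connections.updateSynapsePermanence(synapse, 0.7)
# Discouraged: synapse.permanence = 0.7  (bypasses any bookkeeping)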

@mrcslws (Contributor, Author) commented Aug 31, 2016

@scottpurdy ready for review.

My comment above ("I think I've made the Connections interface deceptive") no longer applies. I reverted parts of my change and then squashed my commits to avoid churning the blame info.

You might prefer to review the two commits individually, rather than the full diff.

@@ -210,36 +210,6 @@ def testReinforceCorrectlyActiveSegments(self):
self.assertAlmostEqual(.42, tm.connections.dataForSynapse(is1).permanence)


def testNoGrowthOnCorrectlyActiveSegments(self):

Contributor:

Why isn't this a valid test anymore?

@mrcslws (Contributor, Author):

It stopped being valid with the recent algorithm change #3293.

It was explicitly checking that we don't grow synapses on predicted columns. Now we do.

The test was flawed (my bad) -- it should be failing now, but instead it only sometimes fails, depending on the random number generator. It didn't do a good job of controlling the winner cells, so sometimes no synapse growth happens in today's code. If I were writing this test today, I'd use 1 cell per column to control the winner cells.

@scottpurdy (Contributor):

To make sure I understand this, it doesn't change the algorithm, just how the segment-related data is stored?

@mrcslws (Contributor, Author) commented Sep 1, 2016

Correct.

to be consistent after serialize / deserialize.

"""
return ((self.idx, self.cell, self._synapses, self._numDestroyedSynapses,

Contributor:

Personally, I find this difficult to read.

I prefer the following, even if it's more verbose:

return (self.idx == other.idx and
        self.cell == other.cell and
        self._synapses == other._synapses and
        self._numDestroyedSynapses == other._numDestroyedSynapses and
        self._destroyed == other._destroyed and
        self._lastUsedIteration == other._lastUsedIteration)

or:

return all((self.idx == other.idx,
            self.cell == other.cell,
            self._synapses == other._synapses,
            self._numDestroyedSynapses == other._numDestroyedSynapses,
            self._destroyed == other._destroyed,
            self._lastUsedIteration == other._lastUsedIteration))

@scottpurdy (Contributor):

👍

@mrcslws merged commit d2aad35 into numenta:master on Sep 2, 2016