
Core speedups #1629

Merged: 6 commits into nengo:master from atait:core-speedups, Aug 20, 2020

Conversation

@atait (Contributor) commented Aug 4, 2020

Motivation and context:
General performance improvements to Nengo core simulations. Forum topic

  1. Some minor code tricks
  • caching the result of SignalDict misses
  • avoiding the Python-level overhead of np.clip on NumPy >= 1.17 (see the sketch after this list)
  • initializing outside of the step loop
  2. New LinearFilter step for one-dimensional ensembles
  • gives about an order-of-magnitude speedup on that step
  • since this is a common operation, it gave a 40% time reduction running test_examples.py
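
As an illustration of the clip trick, here is a minimal sketch of the version-gated dispatch (names assumed, not the exact Nengo source; np.core.umath.clip is the ufunc NumPy 1.17 introduced underneath np.clip):

import numpy as np
from distutils.version import LooseVersion

# Minimal sketch: NumPy >= 1.17 exposes clip as a ufunc, which skips
# the Python-level dispatch overhead of np.clip
if LooseVersion(np.__version__) >= LooseVersion("1.17"):
    clip = np.core.umath.clip
else:
    clip = np.clip

x = np.linspace(-2.0, 2.0, 5)
print(clip(x, -1.0, 1.0))  # same result as np.clip(x, -1.0, 1.0)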

How has this been tested?

  1. Watts & Strogatz network with ~30k neurons, ~900k synapses, 1 Ensemble, no NEF
  2. Distributed network with ~200 Ensembles of 10 neurons each, heavy use of NEF (unpublished)

How long should this take to review?

  • Lengthy

Changes are simple, but they are fairly deep within core. More extensive testing should be done to

  1. ensure nothing breaks
  2. independently verify performance across a broader benchmark suite

Where should a reviewer start?

  1. Reproduce the speedup. I saw a 40% reduction from these commands:

     git checkout master
     time pytest nengo/tests/test_examples.py
     git checkout <this PR>
     time pytest nengo/tests/test_examples.py

  2. More extensive testing: use a different benchmark suite.
  3. More detail: use a function profiler. I use %%prun in Jupyter (a script-based sketch follows this list).
  4. Assessing each change: make a new branch and cherry-pick commits off of this PR. They should each work independently (I did not test this).

Types of changes:

  • Bug fix (non-breaking change which fixes an issue)
    • in the sense of removing unnecessary time spent during simulation, not in terms of correctness
  • New feature (non-breaking change which adds functionality)
    • OneXOneDim: acts as a drop-in for existing functionality

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly.
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • I have run the test suite locally and all tests passed. (Except for linter missing a "noqa" code in nengo/utils/builder.py)

Still to do:
None that I'm aware of.

@tbekolay (Member) left a comment


Saw a few style nitpicks while scanning this PR, looks great though!

Review comments (resolved): nengo/neurons.py, nengo/processes.py, nengo/synapses.py
@drasmuss (Member) commented Aug 19, 2020

Pushed some minor fixups:

  • style fixups
  • combined the clip changes into one commit
  • I left the umath check without an upper bound, since I suspect we'd forget to update it when the next minor release comes out, and we don't want things to slow down because of that
  • renamed OneXOneDim to OneXScalar (just thought that was slightly clearer)

Then I added two new changes that I discovered while looking into this:

  • we were accidentally running all the slow examples during testing
  • the .get Parameter system lookups were eating up a non-trivial amount of time, so I streamlined those a bit

For those curious, I also checked how much impact each of these changes had (using the examples as a benchmark). The vast majority of the improvement comes from moving the BSR initialization outside the step function (a schematic of that change follows the list):

  • moving BSR outside loop (~50s)
  • save SignalDict misses (~8s)
  • more efficient clip (~10s)
  • specialized LinearFilter step (~5s)
  • more efficient param lookup (~12s)
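
As a schematic of that biggest win (not the actual Nengo code; bsr_matrix stands in for scipy.sparse.bsr_matrix and the signal names are placeholders):

from scipy.sparse import bsr_matrix

# Before: the sparse matrix was rebuilt on every timestep
def make_step_slow(A, indices, indptr, X, Y):
    def step():
        mat_A = bsr_matrix((A, indices, indptr))  # rebuilt each call
        Y[...] += mat_A.dot(X)
    return step

# After: build it once when the step is created, reuse it every timestep
def make_step_fast(A, indices, indptr, X, Y):
    mat_A = bsr_matrix((A, indices, indptr))  # built once
    def step():
        Y[...] += mat_A.dot(X)
    return step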

Commits pushed:
  • Change ndarray.clip to np.clip or, where possible, np.maximum
  • Change all uses of np.clip to np.umath.clip in numpy>=1.17
@atait (Contributor, Author) commented Aug 19, 2020

Thanks for the review and changes. Those benchmarking results are interesting because I saw the opposite on my particular test problems, although both sped up. Really nice independent verification!

I went back through it again and have one recommended change. @drasmuss, I'll let you do the modify and force-push how you'd like.

Plasticity with BsrDotInc:
It is possible for BSR's A to end up writable if the constituent matrices are writable (see the initialization of A). I checked this, and it does lead to a change in behavior in this PR (i.e. mat_A does not maintain a reference to the data of A). The good news is that writability is shrewdly tracked, so the fix for BsrDotInc is simple (as recommended in the forum):

    def make_step(self, signals, dt, rng):
        X = signals[self.X]
        A = signals[self.A]
        Y = signals[self.Y]

        if A.flags.writeable:
            # A may be modified in place (e.g. by learning rules),
            # so rebuild the sparse matrix on every step
            def step_dotinc():
                mat_A = self.bsr_matrix((A, self.indices, self.indptr))
                inc = mat_A.dot(X)
                if self.reshape:
                    inc = inc.reshape(Y.shape)
                Y[...] += inc
        else:
            # A is read-only, so the sparse matrix can be built once here
            mat_A = self.bsr_matrix((A, self.indices, self.indptr))

            def step_dotinc():
                inc = mat_A.dot(X)
                if self.reshape:
                    inc = inc.reshape(Y.shape)
                Y[...] += inc

        return step_dotinc

Stressing umath.clip:
I tried pretty hard to break it, because it would be very annoying for a current user feeding bad data to something with clip to suddenly get an error. I couldn't break it. For the record, here's what I checked (a couple of standalone spot checks are sketched after this list):

  1. umath.clip fails when the bounds are strange: None, nan, byteswapped. I tried feeding bad bounds into objects that use clip, for example ee = nengo.dists.Exponential(2, high=None). This fails eagerly a la NumberParam, so that's good.
  2. umath.clip does cast floats and ints in the first argument when necessary, so that will still work.
  3. nengo.utils.numpy.clip is opt-in, so any user-defined neurons or dependent code using bad bounds won't be affected.
  4. Nengo's source opts in, and in this PR all usages have good bounds, so that is also fine.
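
For anyone who wants to reproduce the spot checks, a minimal standalone sketch (assuming np.core.umath.clip is the ufunc backing np.clip on NumPy >= 1.17):

import numpy as np

x = np.linspace(-2.0, 2.0, 5)

# Sane bounds: matches np.clip
assert np.array_equal(np.core.umath.clip(x, -1.0, 1.0), np.clip(x, -1.0, 1.0))

# Integer input with float bounds gets promoted rather than erroring
print(np.core.umath.clip(np.arange(5), 0.5, 3.5))

# Strange bounds (e.g. None) fail loudly, hence the eager validation upstream
try:
    np.core.umath.clip(x, None, 1.0)
except Exception as err:
    print("bad bounds rejected:", type(err).__name__)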

@drasmuss (Member)

Do you have a test case that shows the behaviour change (I'm guessing it would involve using online learning rules with a matrix that gets merged into a BsrDotInc?). It'd be good to add that as a unit test, since our existing tests didn't pick up that case.

@tbekolay tbekolay dismissed their stale review August 19, 2020 20:07

Changes implemented, but then more changes

@atait (Contributor, Author) commented Aug 19, 2020

Actually, I might have drawn the wrong conclusion, sorry. I don't have a test case; I just hacked it with this:

# make a multi-ensemble network above
break_it = True  # or False
sim = nengo.Simulator(network, optimize=True)
if break_it:
    # force each BsrDotInc's A signal to be writeable, then zero it out
    for op in sim.step_order:
        if type(op).__name__ == 'BsrDotInc':
            arr = sim.signals[op.A]
            arr.setflags(write=1)
            arr[...] = 0
with sim:
    sim.run(1)

What is supposed to happen is that it does in fact disconnect everything and behave differently when break_it=True. That does happen on both master and core-speedups, which means mat_A is getting the correct view into the data of A, and no changes should be needed. My mistake. Changes to the sparsity structure (i.e. adding a new connection outside a valid block after building) won't work, but I'm pretty sure that's already disallowed.
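
For the record, the no-copy view behavior can also be checked in isolation with SciPy directly (a hypothetical standalone check, not part of this PR):

import numpy as np
from scipy.sparse import bsr_matrix

# Hypothetical 2x2 matrix built from two 1x1 blocks
data = np.ones((2, 1, 1))
indices = np.array([0, 1])
indptr = np.array([0, 1, 2])
mat = bsr_matrix((data, indices, indptr))

assert np.shares_memory(mat.data, data)  # no copy was made
data[...] = 0  # in-place edits to the source array...
assert mat.toarray().sum() == 0  # ...are visible through the matrix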

I don't use learning rules that much, so it would take me some time. Do you think you could help write a realistic test case along the lines of the following?

# give_network_with_learning / give_network_without_learning are placeholders
datas = [None] * 2
for i, learning in enumerate([True, False]):
    if learning:
        network = give_network_with_learning()
    else:
        network = give_network_without_learning()
    sim = nengo.Simulator(network, optimize=True)
    with sim:
        sim.run(1)
    datas[i] = sim.data[some_probe]
assert not np.allclose(datas[0], datas[1])

@drasmuss (Member)

I double checked and this is covered by our existing tests (i.e. we have tests with a BsrDotInc with a writeable A). So I think this is good to go.

@drasmuss drasmuss merged commit eb98cc8 into nengo:master Aug 20, 2020
@drasmuss (Member)

Merged, thanks again for the PR!

@atait atait deleted the core-speedups branch August 20, 2020 20:18