
Core speedups #1629

Merged: 6 commits into nengo:master from atait:core-speedups, Aug 20, 2020

Conversation

@atait (Contributor) commented Aug 4, 2020

Motivation and context:
General performance improvements to Nengo core simulations. Forum topic

  1. Some minor code tricks
  • caching the result of SignalDict misses
  • avoiding the Python-level overhead of np.clip on NumPy >= 1.17 (see the sketch after this list)
  • initializing outside of the step loop
  2. New LinearFilter step for one-dimensional ensembles
  • gives about an order-of-magnitude speedup on that step
  • since this is a common operation, it gave a 40% time reduction running test_examples.py
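
As an illustration of the clip trick, here is a minimal sketch of the version-gated dispatch (names assumed, not the exact Nengo source; np.core.umath.clip is the ufunc NumPy 1.17 introduced underneath np.clip):

import numpy as np
from distutils.version import LooseVersion

# Minimal sketch: NumPy >= 1.17 exposes clip as a ufunc, which skips
# the Python-level dispatch overhead of np.clip
if LooseVersion(np.__version__) >= LooseVersion("1.17"):
    clip = np.core.umath.clip
else:
    clip = np.clip

x = np.linspace(-2.0, 2.0, 5)
print(clip(x, -1.0, 1.0))  # same result as np.clip(x, -1.0, 1.0)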

How has this been tested?

  1. Watts & Strogatz network with ~30k neurons, ~900k synapses, 1 Ensemble, no NEF
  2. Distributed network with ~200 Ensembles of 10 neurons each, heavy use of NEF (unpublished)

How long should this take to review?

  • Lengthy

Changes are simple, but they are fairly deep within core. More extensive testing should be done to

  1. ensure nothing breaks
  2. independently verify performance across a broader benchmark suite

Where should a reviewer start?

  1. Reproduce the speedup. I saw a 40% reduction from these commands:

     git checkout master
     time pytest nengo/tests/test_examples.py
     git checkout <this PR>
     time pytest nengo/tests/test_examples.py

  2. More extensive testing: use a different benchmark suite.
  3. More detail: use a function profiler. I use %%prun in Jupyter (a script-based sketch follows this list).
  4. Assessing each change: make a new branch and cherry-pick commits off of this PR. They should each work independently (I did not test this).

Types of changes:

  • Bug fix (non-breaking change which fixes an issue)
    • in the sense of removing unnecessary time spent during simulation, not in terms of correctness
  • New feature (non-breaking change which adds functionality)
    • OneXOneDim: acts as a drop-in for existing functionality

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly.
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • I have run the test suite locally and all tests passed. (Except for linter missing a "noqa" code in nengo/utils/builder.py)

Still to do:
None that I'm aware of.

@tbekolay (Member) left a comment


Saw a few style nitpicks while scanning this PR, looks great though!

Review comments (resolved): nengo/neurons.py, nengo/processes.py, nengo/synapses.py
@drasmuss (Member) commented Aug 19, 2020

Pushed some minor fixups:

  • style fixups
  • combined the clip changes into one commit
  • I left the umath check without an upper bound, since I suspect we'd forget to update it when the next minor release comes out, and we don't want things to slow down because of that
  • renamed OneXOneDim to OneXScalar (just thought that was slightly clearer)

Then I added two new changes that I discovered while looking into this:

  • we were accidentally running all the slow examples during testing
  • the .get Parameter system lookups were eating up a non-trivial amount of time, so I streamlined those a bit

For those curious, I also checked how much impact each of these changes had (using the examples as a benchmark). The vast majority of the improvement comes from moving the BSR initialization outside the step function (a schematic of that change follows the list):

  • moving BSR outside loop (~50s)
  • save SignalDict misses (~8s)
  • more efficient clip (~10s)
  • specialized LinearFilter step (~5s)
  • more efficient param lookup (~12s)
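
As a schematic of that biggest win (not the actual Nengo code; bsr_matrix stands in for scipy.sparse.bsr_matrix and the signal names are placeholders):

from scipy.sparse import bsr_matrix

# Before: the sparse matrix was rebuilt on every timestep
def make_step_slow(A, indices, indptr, X, Y):
    def step():
        mat_A = bsr_matrix((A, indices, indptr))  # rebuilt each call
        Y[...] += mat_A.dot(X)
    return step

# After: build it once when the step is created, reuse it every timestep
def make_step_fast(A, indices, indptr, X, Y):
    mat_A = bsr_matrix((A, indices, indptr))  # built once
    def step():
        Y[...] += mat_A.dot(X)
    return step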

Commits pushed:
  • Change ndarray.clip to np.clip or, where possible, np.maximum
  • Change all uses of np.clip to np.umath.clip in numpy>=1.17
@atait (Contributor, Author) commented Aug 19, 2020

Thanks for the review and changes. Those benchmarking results are interesting because I saw the opposite on my particular test problems, although both sped up. Really nice independent verification!

I went back through it again and have one recommended change. @drasmuss, I'll let you do the modify and force-push how you'd like.

Plasticity with BsrDotInc:
It is possible for BSR's A to end up writable if the constituent matrices are writable (see the initialization of A). I checked this, and it does lead to a change in behavior in this PR (i.e. mat_A does not maintain a reference to the data of A). The good news is that writability is shrewdly tracked, so the fix for BsrDotInc is simple (as recommended in the forum):

    def make_step(self, signals, dt, rng):
        X = signals[self.X]
        A = signals[self.A]
        Y = signals[self.Y]

        if A.flags.writeable:
            # A may be modified in place (e.g. by learning rules),
            # so rebuild the sparse matrix on every step
            def step_dotinc():
                mat_A = self.bsr_matrix((A, self.indices, self.indptr))
                inc = mat_A.dot(X)
                if self.reshape:
                    inc = inc.reshape(Y.shape)
                Y[...] += inc
        else:
            # A is read-only, so the sparse matrix can be built once here
            mat_A = self.bsr_matrix((A, self.indices, self.indptr))

            def step_dotinc():
                inc = mat_A.dot(X)
                if self.reshape:
                    inc = inc.reshape(Y.shape)
                Y[...] += inc

        return step_dotinc

Stressing umath.clip:
I tried pretty hard to break it, because it would be very annoying for a current user feeding bad data to something with clip to suddenly get an error. I couldn't break it. For the record, here's what I checked (a couple of standalone spot checks are sketched after this list):

  1. umath.clip fails when the bounds are strange: None, nan, byteswapped. I tried feeding bad bounds into objects that use clip, for example ee = nengo.dists.Exponential(2, high=None). This fails eagerly a la NumberParam, so that's good.
  2. umath.clip does cast floats and ints in the first argument when necessary, so that will still work.
  3. nengo.utils.numpy.clip is opt-in, so any user-defined neurons or dependent code using bad bounds won't be affected.
  4. Nengo's source opts in, and in this PR all usages have good bounds, so that is also fine.
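
For anyone who wants to reproduce the spot checks, a minimal standalone sketch (assuming np.core.umath.clip is the ufunc backing np.clip on NumPy >= 1.17):

import numpy as np

x = np.linspace(-2.0, 2.0, 5)

# Sane bounds: matches np.clip
assert np.array_equal(np.core.umath.clip(x, -1.0, 1.0), np.clip(x, -1.0, 1.0))

# Integer input with float bounds gets promoted rather than erroring
print(np.core.umath.clip(np.arange(5), 0.5, 3.5))

# Strange bounds (e.g. None) fail loudly, hence the eager validation upstream
try:
    np.core.umath.clip(x, None, 1.0)
except Exception as err:
    print("bad bounds rejected:", type(err).__name__)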

@drasmuss (Member)

Do you have a test case that shows the behaviour change (I'm guessing it would involve using online learning rules with a matrix that gets merged into a BsrDotInc?). It'd be good to add that as a unit test, since our existing tests didn't pick up that case.

@tbekolay tbekolay dismissed their stale review August 19, 2020 20:07

Changes implemented, but then more changes

@atait (Contributor, Author) commented Aug 19, 2020

Actually, I might have drawn the wrong conclusion, sorry. I don't have a test case; I just hacked it with this:

# make a multi-ensemble network above
break_it = True  # or False
sim = nengo.Simulator(network, optimize=True)
if break_it:
    # force each BsrDotInc's A signal to be writeable, then zero it out
    for op in sim.step_order:
        if type(op).__name__ == 'BsrDotInc':
            arr = sim.signals[op.A]
            arr.setflags(write=1)
            arr[...] = 0
with sim:
    sim.run(1)

What is supposed to happen is that it does in fact disconnect everything and behave differently when break_it=True. That does happen on both master and core-speedups, which means mat_A is getting the correct view into the data of A, and no changes should be needed. My mistake. Changes to the sparsity structure (i.e. adding a new connection outside a valid block after building) won't work, but I'm pretty sure that's already disallowed.
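
For the record, the no-copy view behavior can also be checked in isolation with SciPy directly (a hypothetical standalone check, not part of this PR):

import numpy as np
from scipy.sparse import bsr_matrix

# Hypothetical 2x2 matrix built from two 1x1 blocks
data = np.ones((2, 1, 1))
indices = np.array([0, 1])
indptr = np.array([0, 1, 2])
mat = bsr_matrix((data, indices, indptr))

assert np.shares_memory(mat.data, data)  # no copy was made
data[...] = 0  # in-place edits to the source array...
assert mat.toarray().sum() == 0  # ...are visible through the matrix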

I don't use learning rules that much, so it would take me some time. Do you think you could help write a realistic test case along the lines of the following?

# give_network_with_learning / give_network_without_learning are placeholders
datas = [None] * 2
for i, learning in enumerate([True, False]):
    if learning:
        network = give_network_with_learning()
    else:
        network = give_network_without_learning()
    sim = nengo.Simulator(network, optimize=True)
    with sim:
        sim.run(1)
    datas[i] = sim.data[some_probe]
assert not np.allclose(datas[0], datas[1])

@drasmuss (Member)

I double checked and this is covered by our existing tests (i.e. we have tests with a BsrDotInc with a writeable A). So I think this is good to go.

@drasmuss drasmuss merged commit eb98cc8 into nengo:master Aug 20, 2020
@drasmuss (Member)

Merged, thanks again for the PR!

@atait atait deleted the core-speedups branch August 20, 2020 20:18