
Operator graph optimizer #1035

Merged: 7 commits merged into master from op-opt on Apr 11, 2017
Conversation

@jgosmann (Collaborator) commented Apr 25, 2016

I updated this description with the new template. The old description can be found below.

Description:
This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the number of items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.
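To make the merging idea concrete, here is a minimal NumPy-only sketch (made-up data, not the PR's actual classes): many small signals become views into one contiguous array, so a single vectorized operation replaces many Python-level iterations.

    import numpy as np

    # Three small signals that would otherwise be touched by three
    # separate operators in a Python loop.
    initial_values = [np.zeros(50), np.zeros(50), np.zeros(50)]

    # Merge: one contiguous block; each old signal becomes a view.
    merged = np.concatenate(initial_values)
    views, offset = [], 0
    for v in initial_values:
        views.append(merged[offset:offset + v.size])
        offset += v.size

    # One vectorized operation now updates all former signals at once.
    merged += 1.0
    assert all(np.all(view == 1.0) for view in views)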

Motivation and context:

Not everyone has a powerful GPU and can use nengo_ocl (especially at the summer school it is common to have a laptop without a decent GPU). For those people this PR can significantly improve the run time.

This is also useful for running on high-performance clusters (in a serial-farming setup) like Sharcnet, which mostly consist of CPU clusters with only a few GPU clusters (and usually many more CPU cores than GPUs are provided).

Furthermore, when using large ensemble arrays (as in SPA, especially with spaopt #941, dot products, circular convolutions) this reduces the overhead that is introduced by many small ensembles.

Interactions with other PRs:
No real interactions afaik, but #941 will benefit.

How has this been tested?
The optimizer is enabled by default and all the tests still pass. Furthermore, I am using the optimizer on my TCM model and on @ikajic's spa-rat model. The latter in particular has a large array of small ensembles (5018 elements in a cleanup memory) and uncovered a number of memory and runtime issues that have since been fixed. @xchoo tested a slightly older version of the optimizer on Spaun. Read through the comments for quotes on run time improvements and additions to the build time (the quoted build time for Spaun might be lower now with recent improvements).

I did not add a lot of unit tests because for the most part it is not clear to me how to test the optimizer. It is exercised by the existing tests (all those models still work), which verifies its correctness in a way, but doesn't test that it actually speeds things up or isn't a no-op. Good ideas welcome.

Where should a reviewer start?
optimizer.py contains the main algorithm and is structured going from the high level to the low level of the algorithm. I tried to document it well, but it might not be sufficient to understand everything. Let me know where more comments would be helpful.

Maybe it is best to go through the code with me as it is somewhat complex.

How long should this take to review?

  • Lengthy (more than 150 lines changed or changes are complicated)

Types of changes:

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly. (More documentation should be added; see the separate issue "Document how to get the most speed out of Nengo" #1119.)
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Still to do:
#1119, but I think it should be a separate PR.

Old description for reference

This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.

At the moment the only operator that's being merged is SimNeurons, but I will look and see if this can be done for other operators too.

There is a trade-off, though. Running the simulation gets faster, but the build process will be slower. The algorithmic complexity of the optimizer is O(n^2 + e) where n is the number of operators and e the number of edges in the dependency graph.

A few "benchmarks" I ran:

  • Simulating an ensemble array with 100 one-dimensional ensembles with 50 neurons: Pure simulation time decreases from 18 to 4 seconds (4.5x speedup! 💨 )
  • py.test -n 6 from 60s down to 51s
  • py.test -n 6 --slow from 415s down to 274s
  • py.test from 83s down to 71s

I haven't tested it on any more complicated models yet, but that should definitely be done. It would be nice to have #937 implemented to easily see the effect on build time.

It probably would be a good idea to have an option to turn the optimization on and off, probably with an rc setting, but I feel like this could be something people want to change per instantiation of the Simulator. I am also wondering whether there should be some more fine-grained control over which operators to merge (or, if we add other optimizations, which subset of them should be applied). One possible shape is sketched below.
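One possible shape for such a switch (purely illustrative; the `optimize` keyword is an assumption here, not this PR's settled API):

    import nengo

    with nengo.Network() as model:
        nengo.Ensemble(n_neurons=50, dimensions=1)

    # Hypothetical per-instantiation switch; an rc setting could
    # supply the default value instead.
    sim = nengo.Simulator(model, optimize=False)
    sim.run(1.0)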

I think there was an issue about adding optimizations to the simulator, but I can't find it right now. Found it: this addresses #778.

TODO

  • Merging of views has been implemented for Copy in the op-opt-view branch with a lot of code duplication. Refactor this.
  • Look at view merging for other operators
  • Look at general merging for other operators
  • Add more documentation about how the optimization is performed and some of the finicky details (views get broken in the first pass, then fixed later, then view merging is done because the first pass introduces more views)
  • Could we need more than two passes? In other words, could view merging introduce more views because of associated non-view signals? (Only relevant if view merging is supported for things other than Copy.)
  • Add a test for signal merging (this should be simple enough to unit test and is part of the public Signal API). Not sure if it makes sense to test the optimizer in any other way than checking that it does not break any of the existing tests; complex interactions are hard to cover in any other way anyway. Not sure if it is worth constructing and checking simple cases. Maybe check that certain optimizations happen at all?
  • Config options
  • Under some circumstances the optimizer seems to require a lot of memory. I should investigate this.
  • The optimizer always performs a minimum of two passes, even if the first pass didn't lead to a reduction in the number of operators.
  • This currently uses the np.stack function in one place, which requires NumPy 1.10. Replace it with an equivalent available in earlier versions (a compatible sketch follows this list). Also, add a test, because that part of the code is apparently never hit in the tests (no failure on Travis-CI).
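For the np.stack item, an equivalent that works on NumPy older than 1.10 (a sketch, not necessarily the replacement the PR ends up using):

    import numpy as np

    def stack_compat(arrays, axis=0):
        # np.stack equivalent for NumPy < 1.10: give each array a new
        # length-1 axis, then concatenate along that axis.
        expanded = [np.expand_dims(a, axis) for a in arrays]
        return np.concatenate(expanded, axis=axis)

    a, b = np.ones((2, 3)), np.zeros((2, 3))
    assert stack_compat([a, b]).shape == (2, 2, 3)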

@jgosmann (Collaborator Author) commented Apr 26, 2016

General note:

  • The optimization algorithm is greedy in figuring out what to merge; finding the optimal solution is equivalent to finding cliques in a graph, which is NP-complete. (A sketch of the greedy strategy follows.)
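To illustrate the greedy strategy (a sketch with a made-up compatibility predicate, not the PR's actual code): each operator joins the first existing group it is compatible with, instead of searching for a globally optimal grouping.

    def greedy_groups(ops, compatible):
        """Greedily assign each op to the first group it can merge with."""
        groups = []
        for op in ops:
            for group in groups:
                if compatible(group, op):
                    group.append(op)
                    break
            else:
                groups.append([op])
        return groups

    # Toy stand-in for the real constraints (same type, no shared signals):
    same_type = lambda group, op: type(group[0]) is type(op)
    print(greedy_groups([1, 2.0, 3, "a"], same_type))  # [[1, 3], [2.0], ['a']]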

Important note to self to fix:

  • If two operators access the same signal and both get merged with other operators, things will (probably) break.

@jgosmann (Collaborator Author) commented Apr 26, 2016

Another todo:

  • add some logging for debugging purposes
  • The check for shared signals between operators is too inefficient.
  • Split up the operator list by type to reduce runtime (different types can never be merged; O(a_1^2 + a_2^2 + ... + a_n^2) is better than O((a_1 + a_2 + ... + a_n)^2)); see the sketch after this list.
  • Don't even process lists of operator types that can never be merged
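As a sketch of the "split up by type" item above (an illustrative helper, not the PR's code):

    from collections import defaultdict

    def group_by_type(ops):
        # Different operator types can never merge, so pairwise
        # compatibility checks only need to run within each group:
        # O(a_1^2 + ... + a_n^2) instead of O((a_1 + ... + a_n)^2).
        by_type = defaultdict(list)
        for op in ops:
            by_type[type(op)].append(op)
        return by_type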

@jgosmann mentioned this pull request Apr 26, 2016
@Seanny123 (Contributor)
Whenever I hear NP-complete, I think "throw simulated annealing at it!". Would this be feasible for this optimisation, and would it offer any significant advantage over the greedy algorithm?

@jgosmann (Collaborator Author)
It's probably more effort than it's worth (either in developer time or computation time). Given the constraints under which a merge is performed, it might even be that the greedy algorithm is already optimal (at least I can't easily construct a counterexample). If you loosened those constraints, such situations could occur more easily, but that would also make correct merging of the operators much more complicated (and probably wouldn't allow for a single NumPy operation anymore, which is half the reason for the merge).

@jgosmann force-pushed the op-opt branch 2 times, most recently from 270ee35 to 87657ab on April 26, 2016
@jgosmann (Collaborator Author) commented Apr 26, 2016

Continuing my todo list:

  • Merging of views has been implemented for Copy in the op-opt-view branch with a lot of code duplication. Refactor this.
  • Look at view merging for other operators
  • Look at general merging for other operators
  • Add more documentation about how the optimization is performed and some of the finicky details (views get broken in the first pass, then fixed later, then view merging is done because the first pass introduces more views)
  • Could we need more than two passes? In other words, could view merging introduce more views because of associated non-view signals? (Only relevant if view merging is supported for things other than Copy.)
  • Add a test for signal merging (this should be simple enough to unit test and is part of the public Signal API). Not sure if it makes sense to test the optimizer in any other way than checking that it does not break any of the existing tests; complex interactions are hard to cover in any other way anyway. Not sure if it is worth constructing and checking simple cases. Maybe check that certain optimizations happen at all?
  • Config options

@jgosmann (Collaborator Author)
I figured out how to merge DotInc operations when SciPy is installed (using bsr_matrix) and realized how to merge even more by correctly propagating memory layouts from one operation to another (though this requires somewhat more optimization passes and makes the build a bit slower). On my circular convolution test case this reduces the simulation time from 75 to 5 seconds (15x speedup! 💨 ). A sketch of the bsr_matrix idea follows below.

Still have to clean up my code quite a bit before I can commit this, though.
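The bsr_matrix idea in isolation (a minimal sketch assuming equally sized weight matrices; not the PR's actual implementation): pack many small matrices into one block-diagonal sparse matrix so that a single mat-vec replaces many Python-level DotInc steps.

    import numpy as np
    from scipy.sparse import bsr_matrix

    k, m, n = 100, 10, 10           # 100 small (10 x 10) weight matrices
    blocks = np.random.randn(k, m, n)
    xs = np.random.randn(k, n)      # one input vector per matrix

    # Block-diagonal BSR matrix: block row i holds block i in block column i.
    A = bsr_matrix((blocks, np.arange(k), np.arange(k + 1)),
                   shape=(k * m, k * n))

    # One sparse mat-vec instead of k separate np.dot calls.
    y = A.dot(xs.ravel())
    expected = np.concatenate([B.dot(x) for B, x in zip(blocks, xs)])
    assert np.allclose(y, expected)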

@hunse (Collaborator) commented Apr 28, 2016

How fast is your convolution benchmark without combining the DotIncs?

@jgosmann (Collaborator Author)
18s (5227 operators instead of just 79).

@hunse (Collaborator) commented Apr 28, 2016

That's pretty good. Looking at it, it actually just calls gemv for all the individual matrices. But the fact that it's in a tight C loop removes all the Python overhead, and that obviously really helps. I was kind of worried it would be using some specialized sparse matrix code that wouldn't be as fast as BLAS for large matrices, but that's obviously not the case.

The only drawback is that all the matrices have to be the same size, but that is often true because typically we have quite a few EnsembleArrays in large models. So overall, this should really help. Nice!

@jgosmann (Collaborator Author) commented May 4, 2016

According to @xchoo, the build times for Spaun are reasonable with the latest improvements to the algorithm (though I have to get back to him for exact numbers). The simulation time is apparently cut down by 90%. 💨 However, that is still slower than nengo_ocl. But then again, there might be situations where people cannot or don't want to install nengo_ocl, or don't have a supported GPU. Also, maybe some of this can be used to speed up nengo_ocl as well?

@jgosmann (Collaborator Author) commented May 4, 2016

If I remember correctly, the operator count for Spaun was decreased from a couple of million to a few hundred thousand.

@xchoo (Member) commented May 4, 2016

Specifically, from 1,334,851 operators to 245,760 operators. The build time was 2588s, and the runtime for 0.1s of simulation was 147.3s.

On the master nengo branch, the build time is typically ~1200s, and the runtime for 0.1s is ~700s.

@jgosmann (Collaborator Author) commented May 4, 2016

How many optimization passes?

@xchoo (Member) commented May 4, 2016

15 passes.

@jgosmann (Collaborator Author) commented Jun 2, 2016

  • Under some circumstances the optimizer seems to require a lot of memory. I should investigate this.
  • Always performs a minimum of two passes even if the first pass didn't lead to a reduction in the number of operators.

@jgosmann (Collaborator Author)
I could use a memory allocation guru ... under certain circumstances the optimizer will lead to very high memory usage (tripled usage during the first pass). My guess right now is that this is due to memory fragmentation: a lot of small memory segments are being merged into one big memory segment, which does not fit in any of the areas freed by the small segments.

My best idea to solve this is to limit the size of newly created memory segments. Unfortunately, I have no idea what a good limit would be.

@jgosmann added this to the 2.2.0 release milestone Jun 27, 2016
@Seanny123 (Contributor) commented Apr 1, 2017 via email

@tbekolay (Member) commented Apr 3, 2017

Sorry for the slow reply, this is indeed ready for a final review! @Seanny123 do you think you can get to this in the next few days?

@Seanny123 (Contributor)
Yep, I'll do a review in the next few days!

@Seanny123 (Contributor)
Btw, the fix to the Python 3.6 error got merged and should be in the next release of Python.

(excerpt from the diff under review)

    SigMerger.check_views([s1[:1], s1[1:]])

    # compatible along second axis
    SigMerger.check_views([s1[0:1, :1], s1[0:1, 1:]], axis=1)
Contributor:

Nitpick: why write 0:1 instead of :1? If there's no reason, then I'll make a fixup commit to address it.

Member:

Can't think of a reason off the top of my head!

(excerpt from the diff under review)

    self.merged_dependents = set()
    self.opinfo = OpInfo()

    def __call__(self, only_merge_ops_with_view):
@Seanny123 (Contributor) commented Apr 5, 2017:

Silly inexperienced coder question: why make this a __call__ method instead of a regular method, named something like run_pass?

Collaborator Author:

@tbekolay might say more on why he decided on this approach; what comes to my mind is:

  • It highlights the main entry point and sets it apart from the other methods, which are not supposed to be called from outside the class.
  • The class is used more like a function (on steroids, because it is actually an object and has some state), so it might make sense to make it a function object.

Member:

👉 👉 nailed it, those were my reasons. Really, the whole class could be seen as one big long function, but writing it that way would be really hard to follow and nested 5 or 6 indentation levels deep, hence the methods to organize things. But the "user" (optimize in this case) should never call those methods; it can treat the class as though it were just one big long function, even though the work is organized inside the class.

(excerpt from the diff under review)

    with Timer() as t:
        single_pass(only_merge_ops_with_view)

    after = len(dg)
Contributor:

Just to make sure I'm understanding this correctly: dg is passed into OpMergePass by reference, like all things in Python. Once OpMergePass is called, dg is modified.

This was a bit hard for me to follow, because if something is modified, I kind of expect:
a) it to be returned, or
b) to be passed in as an argument to a function call

Is this a weird expectation to have? Hypothetically, if the call returned and overwrote the dg instance, would it affect performance?

Contributor:

To be clear, the reason I have this expectation is that I have a mental model of classes as containers for data that are interacted with only by accessing attributes or calling methods. I wouldn't expect the data inside the class to be connected to the outside by a reference, nor would I expect calling the class to modify an outside object.

Collaborator Author:

Is this a weird expectation to have?

I think those are pretty reasonable expectations, but unfortunately things are a bit more complicated here.

if the call would return and over-write the dg instance, would it affect performance?

Yes, it would increase memory usage, potentially by a lot for large models. The dependency graph itself can already get quite large, but it also references the initial set of operators, and those reference the initial set of signals, which require even more memory for their initial values. Keeping the old dependency graph around would prevent all of those objects from being freed.

Maybe it would be clearer to use single_pass.dg.forward instead of dg to make it clear that it is an attribute of the single_pass instance that might get modified by the object?

I have a mental model of Classes as containers for data

I think that is an appropriate mental model in many cases, but not all. Here, OpMergePass is more like a function object (sometimes called a functor). It is used pretty much like a function, but using an object lets us split the algorithm into smaller functions while still grouping them together. It also makes it easy to share data between those functions and to preserve some state across invocations. A bare-bones version of the pattern is sketched below.
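A bare-bones version of that pattern (illustrative names and placeholder logic, not the PR's actual class):

    class OpMergePassSketch:
        """Function object: called like a function, but its helper
        methods share state through ``self`` across invocations."""

        def __init__(self, dg):
            self.dg = dg       # mutated in place to keep memory usage low
            self.n_merged = 0  # state preserved across calls

        def __call__(self, only_merge_ops_with_view):
            # The main (and only public) entry point.
            for group in self._find_mergeable(only_merge_ops_with_view):
                self._merge(group)

        def _find_mergeable(self, only_views):
            return []  # placeholder for the actual merge logic

        def _merge(self, group):
            self.n_merged += len(group)

    single_pass = OpMergePassSketch(dg={})
    single_pass(only_merge_ops_with_view=True)  # used like a function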

Member:

I initially had a similar thought, that it would be nice to return an optimized version of the dg and model rather than mutating them, but as Jan says, it affects performance significantly. There's a difficult tradeoff to make in this PR, as optimized code tends to be less readable than slow but obvious code. I think this PR manages to strike a good balance between readable and efficient, though it's of course good to get more opinions, which is why we want multiple reviewers for each PR 😄

@Seanny123 (Contributor) left a review:

I left a few minor comments that I would like to discuss. Other than those few easily addressable concerns, I think this pull request is ready to be merged.

@Seanny123 (Contributor)
I added my tiny fixup commits. If they look good to @jgosmann then this is ready to merge!

@jgosmann (Collaborator Author) commented Apr 6, 2017

LGTM. I think @tbekolay wanted to drop bd5bb5a (and e525dba accordingly)?

tbekolay and others added 5 commits April 10, 2017 10:48
This makes it easier to debug issues in which the Signal fails an assert in the `__init__` prior to the name being set.

This is equivalent to `Signal.itemsize * Signal.size`, but more explicit.

Reuses set instances, which can reduce memory usage considerably in some cases.
jgosmann and others added 2 commits April 10, 2017 13:14
The optimizer merges together operators at build time in order to
reduce simulation time for many types of common models.

The optimizer required some modifications to existing graph functions.
In particular, `reverse_edges` was changed to return the complete
graph (the old implementation would remove nodes without edges).
Also, we now consistently use sets for collections of graph edges.

This commit includes some internal tests, as well as an integration
test ensuring that complex models are simulated exactly the same.
If we uncover any models that do change with the optimizer in the
future, it is easy to add tests for those cases.

By doing this, we ensure that no attributes of an Op, other than
the `sets`, `incs`, `reads`, and `updates` lists, contain signals.
This speeds up the optimizer by removing the need to look at all
properties of every operator when replacing signals.
@tbekolay (Member) left a review:

Rebased this to master, made some history modifications, and added the optimizer to the docs, so this is good to go for me.

@Seanny123 could you file a review if it looks good to you? I'll merge on your positive review.

Sidenote: @jgosmann I noticed while going through the history that some commits had the wrong authorship information (e.g. 161a69d, 2ae579c). In the future could you make sure to make commits with --author="Author <e@mail.com>" if moving around the history loses authorship info? Thanks! 😄 🌈

@jgosmann (Collaborator Author)
I'm not sure what caused the authorship information to get lost ... I was expecting rebases to keep it, so I didn't even consider checking that.

@tbekolay (Member)
If I had to guess, it would be situations where you split a commit into multiple commits... in the past when I've done that, only one of the commits retained the authorship information. At least in the way that I was splitting commits at the time ¯\_(ツ)_/¯ In any case, no bigs!

@jgosmann (Collaborator Author)
I'll try to keep it in mind for the next complicated rebase.

@Seanny123 (Contributor)
Yep. Everything still looks good to me. Merge away!

@tbekolay merged commit 0693da3 into master Apr 11, 2017
@tbekolay deleted the op-opt branch April 11, 2017 15:52
@jgosmann mentioned this pull request May 19, 2017