
Operator graph optimizer #1035

Merged: 7 commits merged into master from op-opt on Apr 11, 2017
Conversation

@jgosmann (Collaborator) commented Apr 25, 2016

I updated this description with the new template. The old description can be found below.

Description:
This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the number of items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.
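To make the merging idea concrete, here is a minimal NumPy-only sketch (made-up data, not the PR's actual classes): many small signals become views into one contiguous array, so a single vectorized operation replaces many Python-level iterations.

    import numpy as np

    # Three small signals that would otherwise be touched by three
    # separate operators in a Python loop.
    initial_values = [np.zeros(50), np.zeros(50), np.zeros(50)]

    # Merge: one contiguous block; each old signal becomes a view.
    merged = np.concatenate(initial_values)
    views, offset = [], 0
    for v in initial_values:
        views.append(merged[offset:offset + v.size])
        offset += v.size

    # One vectorized operation now updates all former signals at once.
    merged += 1.0
    assert all(np.all(view == 1.0) for view in views)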

Motivation and context:

Not everyone has a powerful GPU and can use nengo_ocl (especially at the summer school it is common to have a laptop without a decent GPU). For those people this PR can significantly improve the run time.

This is also useful for running on high-performance clusters (in a serial-farming setup) like Sharcnet, which mostly consist of CPU clusters with only a few GPU clusters (and usually many more CPU cores than GPUs are provided).

Furthermore, when using large ensemble arrays (as in SPA, especially with spaopt #941, dot products, circular convolutions) this reduces the overhead that is introduced by many small ensembles.

Interactions with other PRs:
No real interactions afaik, but #941 will benefit.

How has this been tested?
The optimizer is enabled by default and all the tests still pass. Furthermore, I am using the optimizer on my TCM model and on @ikajic's spa-rat model. The latter in particular has a large array of small ensembles (5018 elements in a cleanup memory) and uncovered a number of memory and runtime issues that have since been fixed. @xchoo tested a slightly older version of the optimizer on Spaun. Read through the comments for quotes on run time improvements and additions to the build time (the quoted build time for Spaun might be lower now with recent improvements).

I did not add a lot of unit tests because for the most part it is not clear to me how to test the optimizer. It is exercised by the existing tests (all those models still work), which verifies its correctness in a way, but doesn't test that it actually speeds things up or isn't a no-op. Good ideas welcome.

Where should a reviewer start?
optimizer.py contains the main algorithm and is structured going from the high level to the low level of the algorithm. I tried to document it well, but it might not be sufficient to understand everything. Let me know where more comments would be helpful.

Maybe it is best to go through the code with me as it is somewhat complex.

How long should this take to review?

  • Lengthy (more than 150 lines changed or changes are complicated)

Types of changes:

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly. (More documentation should be added; see the separate issue "Document how to get the most speed out of Nengo" #1119.)
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Still to do:
#1119, but I think it should be a separate PR.

Old description for reference

This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.

At the moment the only operator that's being merged is SimNeurons, but I will look and see if this can be done for other operators too.

There is a trade-off, though. Running the simulation gets faster, but the build process will be slower. The algorithmic complexity of the optimizer is O(n^2 + e) where n is the number of operators and e the number of edges in the dependency graph.

A few "benchmarks" I ran:

  • Simulating an ensemble array with 100 one-dimensional ensembles with 50 neurons: Pure simulation time decreases from 18 to 4 seconds (4.5x speedup! 💨 )
  • py.test -n 6 from 60s down to 51s
  • py.test -n 6 --slow from 415s down to 274s
  • py.test from 83s down to 71s

I haven't tested it on any more complicated models yet, but that should definitely be done. It would be nice to have #937 implemented to easily see the effect on build time.

It probably would be a good idea to have an option to turn the optimization on and off, probably with an rc setting, but I feel like this could be something people want to change per instantiation of the Simulator. I am also wondering whether there should be some more fine-grained control over which operators to merge (or, if we add other optimizations, which subset of them should be applied). One possible shape is sketched below.
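One possible shape for such a switch (purely illustrative; the `optimize` keyword is an assumption here, not this PR's settled API):

    import nengo

    with nengo.Network() as model:
        nengo.Ensemble(n_neurons=50, dimensions=1)

    # Hypothetical per-instantiation switch; an rc setting could
    # supply the default value instead.
    sim = nengo.Simulator(model, optimize=False)
    sim.run(1.0)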

I think there was an issue about adding optimizations to the simulator, but I can't find it right now. Found it: this addresses #778.

TODO

  • Merging of views has been implemented for Copy in the op-opt-view branch with a lot of code duplication. Refactor this.
  • Look at view merging for other operators
  • Look at general merging for other operators
  • Add more documentation about how the optimization is performed and some of the finicky details (views get broken in the first pass, then fixed later, then view merging is done because the first pass introduces more views)
  • Could we need more than two passes? In other words, could view merging introduce more views because of associated non-view signals? (Only relevant if view merging is supported for things other than Copy.)
  • Add a test for signal merging (this should be simple enough to unit test and is part of the public Signal API). Not sure if it makes sense to test the optimizer in any other way than checking that it does not break any of the existing tests; complex interactions are hard to cover in any other way anyway. Not sure if it is worth constructing and checking simple cases. Maybe check that certain optimizations happen at all?
  • Config options
  • Under some circumstances the optimizer seems to require a lot of memory. I should investigate this.
  • The optimizer always performs a minimum of two passes, even if the first pass didn't lead to a reduction in the number of operators.
  • This currently uses the np.stack function in one place, which requires NumPy 1.10. Replace it with an equivalent available in earlier versions (a compatible sketch follows this list). Also, add a test, because that part of the code is apparently never hit in the tests (no failure on Travis-CI).
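For the np.stack item, an equivalent that works on NumPy older than 1.10 (a sketch, not necessarily the replacement the PR ends up using):

    import numpy as np

    def stack_compat(arrays, axis=0):
        # np.stack equivalent for NumPy < 1.10: give each array a new
        # length-1 axis, then concatenate along that axis.
        expanded = [np.expand_dims(a, axis) for a in arrays]
        return np.concatenate(expanded, axis=axis)

    a, b = np.ones((2, 3)), np.zeros((2, 3))
    assert stack_compat([a, b]).shape == (2, 2, 3)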

@jgosmann (Collaborator Author) commented Apr 26, 2016

General note:

  • The optimization algorithm is greedy in figuring out what to merge; finding the optimal solution is equivalent to finding cliques in a graph, which is NP-complete. (A sketch of the greedy strategy follows.)
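To illustrate the greedy strategy (a sketch with a made-up compatibility predicate, not the PR's actual code): each operator joins the first existing group it is compatible with, instead of searching for a globally optimal grouping.

    def greedy_groups(ops, compatible):
        """Greedily assign each op to the first group it can merge with."""
        groups = []
        for op in ops:
            for group in groups:
                if compatible(group, op):
                    group.append(op)
                    break
            else:
                groups.append([op])
        return groups

    # Toy stand-in for the real constraints (same type, no shared signals):
    same_type = lambda group, op: type(group[0]) is type(op)
    print(greedy_groups([1, 2.0, 3, "a"], same_type))  # [[1, 3], [2.0], ['a']]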

Important note to self to fix:

  • If two operators access the same signal and both get merged with other operators, things will (probably) break.

@jgosmann (Collaborator Author) commented Apr 26, 2016

Another todo:

  • add some logging for debugging purposes
  • The check for shared signals between operators is too inefficient.
  • Split up the operator list by type to reduce runtime (different types can never be merged; O(a_1^2 + a_2^2 + ... + a_n^2) is better than O((a_1 + a_2 + ... + a_n)^2)); see the sketch after this list.
  • Don't even process lists of operator types that can never be merged
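As a sketch of the "split up by type" item above (an illustrative helper, not the PR's code):

    from collections import defaultdict

    def group_by_type(ops):
        # Different operator types can never merge, so pairwise
        # compatibility checks only need to run within each group:
        # O(a_1^2 + ... + a_n^2) instead of O((a_1 + ... + a_n)^2).
        by_type = defaultdict(list)
        for op in ops:
            by_type[type(op)].append(op)
        return by_type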

@jgosmann mentioned this pull request Apr 26, 2016
@Seanny123 (Contributor)
Whenever I hear NP-complete, I think "throw simulated annealing at it!". Would this be feasible for this optimisation, and would it offer any significant advantage over the greedy algorithm?

@jgosmann (Collaborator Author)
It's probably more effort than it's worth (either in developer time or computation time). Given the constraints under which a merge is performed, it might even be that the greedy algorithm is already optimal (at least I can't easily construct a counterexample). If you loosened those constraints, such situations could occur more easily, but that would also make correct merging of the operators much more complicated (and probably wouldn't allow for a single NumPy operation anymore, which is half the reason for the merge).

@jgosmann force-pushed the op-opt branch 2 times, most recently from 270ee35 to 87657ab on April 26, 2016
@jgosmann (Collaborator Author) commented Apr 26, 2016

Continuing my todo list:

  • Merging of views has been implemented for Copy in the op-opt-view branch with a lot of code duplication. Refactor this.
  • Look at view merging for other operators
  • Look at general merging for other operators
  • Add more documentation about how the optimization is performed and some of the finicky details (views get broken in the first pass, then fixed later, then view merging is done because the first pass introduces more views)
  • Could we need more than two passes? In other words, could view merging introduce more views because of associated non-view signals? (Only relevant if view merging is supported for things other than Copy.)
  • Add a test for signal merging (this should be simple enough to unit test and is part of the public Signal API). Not sure if it makes sense to test the optimizer in any other way than checking that it does not break any of the existing tests; complex interactions are hard to cover in any other way anyway. Not sure if it is worth constructing and checking simple cases. Maybe check that certain optimizations happen at all?
  • Config options

@jgosmann (Collaborator Author)
I figured out how to merge DotInc operations when SciPy is installed (using bsr_matrix) and realized how to merge even more by correctly propagating memory layouts from one operation to another (though this requires somewhat more optimization passes and makes the build a bit slower). On my circular convolution test case this reduces the simulation time from 75 to 5 seconds (15x speedup! 💨 ). A sketch of the bsr_matrix idea follows below.

Still have to clean up my code quite a bit before I can commit this, though.
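The bsr_matrix idea in isolation (a minimal sketch assuming equally sized weight matrices; not the PR's actual implementation): pack many small matrices into one block-diagonal sparse matrix so that a single mat-vec replaces many Python-level DotInc steps.

    import numpy as np
    from scipy.sparse import bsr_matrix

    k, m, n = 100, 10, 10           # 100 small (10 x 10) weight matrices
    blocks = np.random.randn(k, m, n)
    xs = np.random.randn(k, n)      # one input vector per matrix

    # Block-diagonal BSR matrix: block row i holds block i in block column i.
    A = bsr_matrix((blocks, np.arange(k), np.arange(k + 1)),
                   shape=(k * m, k * n))

    # One sparse mat-vec instead of k separate np.dot calls.
    y = A.dot(xs.ravel())
    expected = np.concatenate([B.dot(x) for B, x in zip(blocks, xs)])
    assert np.allclose(y, expected)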

@hunse (Collaborator) commented Apr 28, 2016

How fast is your convolution benchmark without combining the DotIncs?

@jgosmann (Collaborator Author)
18s (5227 operators instead of just 79).

@hunse (Collaborator) commented Apr 28, 2016

That's pretty good. Looking at it, it actually just calls gemv for all the individual matrices. But the fact that it's in a tight C loop removes all the Python overhead, and that obviously really helps. I was kind of worried it would be using some specialized sparse matrix code that wouldn't be as fast as BLAS for large matrices, but that's obviously not the case.

The only drawback is that all the matrices have to be the same size, but that is often true because typically we have quite a few EnsembleArrays in large models. So overall, this should really help. Nice!

@jgosmann (Collaborator Author) commented May 4, 2016

According to @xchoo, the build times for Spaun are reasonable with the latest improvements to the algorithm (though I have to get back to him for exact numbers). The simulation time is apparently cut down by 90%. 💨 However, that is still slower than nengo_ocl. But then again, there might be situations where people cannot or don't want to install nengo_ocl, or don't have a supported GPU. Also, maybe some of this can be used to speed up nengo_ocl as well?

@jgosmann (Collaborator Author) commented May 4, 2016

If I remember correctly, the operator count for Spaun was decreased from a couple of million to a few hundred thousand.

@xchoo (Member) commented May 4, 2016

Specifically, from 1,334,851 operators to 245,760 operators. The build time was 2588s, and the runtime for 0.1s of simulation was 147.3s.

On the master nengo branch, the build time is typically ~1200s, and the runtime for 0.1s is ~700s.

@jgosmann (Collaborator Author) commented May 4, 2016

How many optimization passes?

@xchoo (Member) commented May 4, 2016

15 passes.

@jgosmann (Collaborator Author) commented Jun 2, 2016

  • Under some circumstances the optimizer seems to require a lot of memory. I should investigate this.
  • Always performs a minimum of two passes even if the first pass didn't lead to a reduction in the number of operators.

@jgosmann (Collaborator Author)
I could use a memory allocation guru ... under certain circumstances the optimizer will lead to very high memory usage (tripled usage during the first pass). My guess right now is that this is due to memory fragmentation: a lot of small memory segments are being merged into one big memory segment, which does not fit in any of the areas freed by the small segments.

My best idea to solve this is to limit the size of newly created memory segments. Unfortunately, I have no idea what a good limit would be.

@jgosmann added this to the 2.2.0 release milestone Jun 27, 2016
@Seanny123 (Contributor) commented Apr 1, 2017 via email

@tbekolay (Member) commented Apr 3, 2017

Sorry for the slow reply, this is indeed ready for a final review! @Seanny123 do you think you can get to this in the next few days?

@Seanny123 (Contributor)
Yep, I'll do a review in the next few days!

@Seanny123 (Contributor)
Btw, the fix to the Python 3.6 error got merged and should be in the next release of Python.

(excerpt from the diff under review)

    SigMerger.check_views([s1[:1], s1[1:]])

    # compatible along second axis
    SigMerger.check_views([s1[0:1, :1], s1[0:1, 1:]], axis=1)
Contributor:

Nitpick: why write 0:1 instead of :1? If there's no reason, then I'll make a fixup commit to address it.

Member:

Can't think of a reason off the top of my head!

(excerpt from the diff under review)

    self.merged_dependents = set()
    self.opinfo = OpInfo()

    def __call__(self, only_merge_ops_with_view):
@Seanny123 (Contributor) commented Apr 5, 2017:

Silly inexperienced coder question: why make this a __call__ method instead of a regular method, named something like run_pass?

Collaborator Author:

@tbekolay might say more on why he decided on this approach; what comes to my mind is:

  • It highlights the main entry point and sets it apart from the other methods, which are not supposed to be called from outside the class.
  • The class is used more like a function (on steroids, because it is actually an object and has some state), so it might make sense to make it a function object.

Member:

👉 👉 nailed it, those were my reasons. Really, the whole class could be seen as one big long function, but writing it that way would be really hard to follow and nested 5 or 6 indentation levels deep, hence the methods to organize things. But the "user" (optimize in this case) should never call those methods; it can treat the class as though it were just one big long function, even though the work is organized inside the class.

(excerpt from the diff under review)

    with Timer() as t:
        single_pass(only_merge_ops_with_view)

    after = len(dg)
Contributor:

Just to make sure I'm understanding this correctly: dg is passed into OpMergePass by reference, like all things in Python. Once OpMergePass is called, dg is modified.

This was a bit hard for me to follow, because if something is modified, I kind of expect:
a) it to be returned, or
b) to be passed in as an argument to a function call

Is this a weird expectation to have? Hypothetically, if the call returned and overwrote the dg instance, would it affect performance?

Contributor:

To be clear, the reason I have this expectation is that I have a mental model of classes as containers for data that are interacted with only by accessing attributes or calling methods. I wouldn't expect the data inside the class to be connected to the outside by a reference, nor would I expect calling the class to modify an outside object.

Collaborator Author:

Is this a weird expectation to have?

I think those are pretty reasonable expectations, but unfortunately things are a bit more complicated here.

if the call would return and over-write the dg instance, would it affect performance?

Yes, it would increase memory usage, potentially by a lot for large models. The dependency graph itself can already get quite large, but it also references the initial set of operators, and those reference the initial set of signals, which require even more memory for their initial values. Keeping the old dependency graph around would prevent all of those objects from being freed.

Maybe it would be clearer to use single_pass.dg.forward instead of dg to make it clear that it is an attribute of the single_pass instance that might get modified by the object?

I have a mental model of Classes as containers for data

I think that is an appropriate mental model in many cases, but not all. Here, OpMergePass is more like a function object (sometimes called a functor). It is used pretty much like a function, but using an object lets us split the algorithm into smaller functions while still grouping them together. It also makes it easy to share data between those functions and to preserve some state across invocations. A bare-bones version of the pattern is sketched below.
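A bare-bones version of that pattern (illustrative names and placeholder logic, not the PR's actual class):

    class OpMergePassSketch:
        """Function object: called like a function, but its helper
        methods share state through ``self`` across invocations."""

        def __init__(self, dg):
            self.dg = dg       # mutated in place to keep memory usage low
            self.n_merged = 0  # state preserved across calls

        def __call__(self, only_merge_ops_with_view):
            # The main (and only public) entry point.
            for group in self._find_mergeable(only_merge_ops_with_view):
                self._merge(group)

        def _find_mergeable(self, only_views):
            return []  # placeholder for the actual merge logic

        def _merge(self, group):
            self.n_merged += len(group)

    single_pass = OpMergePassSketch(dg={})
    single_pass(only_merge_ops_with_view=True)  # used like a function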

Member:

I initially had a similar thought, that it would be nice to return an optimized version of the dg and model rather than mutating them, but as Jan says, it affects performance significantly. There's a difficult tradeoff to make in this PR, as optimized code tends to be less readable than slow but obvious code. I think this PR manages to strike a good balance between readable and efficient, though it's of course good to get more opinions, which is why we want multiple reviewers for each PR 😄

@Seanny123 (Contributor) left a review:

I left a few minor comments that I would like to discuss. Other than those few easily addressable concerns, I think this pull request is ready to be merged.

@Seanny123 (Contributor)
I added my tiny fixup commits. If they look good to @jgosmann then this is ready to merge!

@jgosmann (Collaborator Author) commented Apr 6, 2017

LGTM. I think @tbekolay wanted to drop bd5bb5a (and e525dba accordingly)?

tbekolay and others added 5 commits April 10, 2017 10:48
This makes it easier to debug issues in which the Signal fails an assert in the `__init__` prior to the name being set.

This is equivalent to `Signal.itemsize * Signal.size`, but more explicit.

Reuses set instances, which can reduce memory usage considerably in some cases.
jgosmann and others added 2 commits April 10, 2017 13:14
The optimizer merges together operators at build time in order to
reduce simulation time for many types of common models.

The optimizer required some modifications to existing graph functions.
In particular, `reverse_edges` was changed to return the complete
graph (the old implementation would remove nodes without edges).
Also, we now consistently use sets for collections of graph edges.

This commit includes some internal tests, as well as an integration
test ensuring that complex models are simulated exactly the same.
If we uncover any models that do change with the optimizer in the
future, it is easy to add tests for those cases.

By doing this, we ensure that no attributes of an Op, other than
the `sets`, `incs`, `reads`, and `updates` lists, contain signals.
This speeds up the optimizer by removing the need to look at all
properties of every operator when replacing signals.
@tbekolay (Member) left a review:

Rebased this to master, made some history modifications, and added the optimizer to the docs, so this is good to go for me.

@Seanny123 could you file a review if it looks good to you? I'll merge on your positive review.

Sidenote: @jgosmann I noticed while going through the history that some commits had the wrong authorship information (e.g. 161a69d, 2ae579c). In the future could you make sure to make commits with --author="Author <e@mail.com>" if moving around the history loses authorship info? Thanks! 😄 🌈

@jgosmann (Collaborator Author)
I'm not sure what caused the authorship information to get lost ... I was expecting rebases to keep it, so I didn't even consider checking that.

@tbekolay (Member)
If I had to guess, it would be situations where you split a commit into multiple commits... in the past when I've done that, only one of the commits retained the authorship information. At least in the way that I was splitting commits at the time ¯\_(ツ)_/¯ In any case, no bigs!

@jgosmann (Collaborator Author)
I'll try to keep it in mind for the next complicated rebase.

@Seanny123 (Contributor)
Yep. Everything still looks good to me. Merge away!

@tbekolay merged commit 0693da3 into master Apr 11, 2017
@tbekolay deleted the op-opt branch April 11, 2017 15:52
@jgosmann mentioned this pull request May 19, 2017