Operator graph optimizer #1035
Conversation
Whenever I hear NP-complete, I think "throw simulated annealing at it!" Would that be feasible for this optimisation, and would it offer any significant advantage over the greedy algorithm?

It's probably more effort than it's worth (either in developer time or computation time). Given the constraints under which a merge is performed, it might even be that the greedy algorithm is already optimal (at least I can't easily construct a counterexample). If you loosened those constraints, such situations could occur more easily, but that would also make correct merging of the operators much more complicated (and probably wouldn't allow for a single NumPy operation anymore, which is half the reason for the merge).
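For readers unfamiliar with the strategy under discussion, a greedy merge pass can be sketched roughly like this. This is a hypothetical illustration, not nengo's actual implementation; the `compatible` and `merge` callbacks stand in for the real merge constraints:

```python
def greedy_merge(ops, compatible, merge):
    """Greedy merge pass sketch (not nengo's actual implementation).

    Repeatedly takes the first remaining op, gathers every later op that
    is pairwise compatible with the whole group, and merges the group
    into a single op. `compatible` and `merge` are caller-supplied.
    """
    merged = []
    remaining = list(ops)
    while remaining:
        group, leftover = [remaining[0]], []
        for op in remaining[1:]:
            if all(compatible(g, op) for g in group):
                group.append(op)
            else:
                leftover.append(op)
        merged.append(merge(group) if len(group) > 1 else group[0])
        remaining = leftover
    return merged

# Toy example: "compatible" = same parity, "merge" = sum.
print(greedy_merge([1, 3, 2, 5, 4], lambda a, b: a % 2 == b % 2, sum))
# [9, 6]
```

Because each op is merged into the first compatible group found, the result depends on ordering; under tight compatibility constraints (as described above), that greedy choice may well already be optimal.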
Force-pushed from 270ee35 to 87657ab (compare)
Continuing my todo list:

I figured out how to merge. Still have to clean up my code quite a bit before I can commit this, though.
How fast is your convolution benchmark without combining the DotIncs?

18s (5227 operators instead of just 79).

That's pretty good. Looking at it, it actually just calls gemv for all the individual matrices. But the fact that it's in a tight C loop removes all the Python overhead, which obviously really helps. I was kind of worried it would be using some specialized sparse matrix code that wouldn't be as fast as BLAS for large matrices, but that's obviously not the case. The only drawback is that all the matrices have to be the same size, but that is often true because we typically have quite a few EnsembleArrays in large models. So overall, this should really help. Nice!
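To illustrate why combining same-size DotIncs pays off: many small matrix-vector products done one at a time from Python can be replaced by a single batched NumPy call over one contiguous block of memory. A minimal sketch with hypothetical sizes (not the actual nengo operators):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 16  # hypothetical: 100 small ensembles, 16-dimensional each
As = rng.standard_normal((n, d, d))  # weight matrices, all the same size
xs = rng.standard_normal((n, d))     # input vectors
ys = np.zeros((n, d))

# Unmerged: one Python-level gemv call per operator.
for i in range(n):
    ys[i] += As[i].dot(xs[i])

# Merged: a single batched matmul over one contiguous 3-D block.
ys_merged = np.matmul(As, xs[:, :, None])[:, :, 0]

assert np.allclose(ys, ys_merged)
```

The batched form requires all matrices to share a shape, which is exactly the "same size" constraint mentioned above; in exchange, the Python loop overhead disappears and the data layout becomes cache-friendly.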
According to @xchoo the build times for Spaun are reasonable with the latest improvements to the algorithm (though I have to get back to him for exact numbers). The simulation time is apparently cut down by 90%. 💨 However, that is still slower than nengo_ocl. But then again, there might be situations where people cannot or don't want to install nengo_ocl, or don't have a supported GPU. Also, maybe some of this can be used to speed up nengo_ocl as well?

If I remember correctly, the operator count for Spaun was decreased from a couple of million to a few hundred thousand.

Specifically, from 1,334,851 operators to 245,760 operators. The build time was 2588s, and the runtime for 0.1s was 147.3s. On the master nengo branch, the build time is typically ~1200s, and the runtime for 0.1s is ~700s.

How many optimization passes?

15 passes.
Force-pushed from 76b0e1e to a24a239 (compare)
I could use a memory allocation guru ... under certain circumstances the optimizer leads to very high memory usage (tripled during the first pass). My guess right now is that this is due to memory fragmentation: a lot of small memory segments are merged into one big segment, which does not fit into any of the areas freed by the small segments. My best idea to solve this is to limit the size of newly created memory segments. Unfortunately, I have no idea what a good limit would be.
So is this ready for a final review? *rolls up sleeves and puts on code-reading goggles*

> On Sat, Apr 1, 2017, Jan Gosmann wrote: The issue with Python 3.6 has been identified and there is an issue in the Python bug tracker <https://bugs.python.org/issue29949>. It seems that the change introducing this issue might get reverted. There might also be workarounds possible on our side by constructing sets/frozensets in a different way, but I would rather not increase the complexity of our code (and there might also be a cost in runtime).
Sorry for the slow reply, this is indeed ready for a final review! @Seanny123 do you think you can get to this in the next few days?

Yep, I'll do a review in the next few days!

Btw, the fix to the Python 3.6 error got merged and should be in the next release of Python.
```python
SigMerger.check_views([s1[:1], s1[1:]])

# compatible along second axis
SigMerger.check_views([s1[0:1, :1], s1[0:1, 1:]], axis=1)
```
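The intuition behind mergeable views, illustrated with plain NumPy (a sketch, not the `SigMerger` internals): adjacent slices of the same base array alias one contiguous buffer, so operations on them can be replaced by one operation on the combined slice without copying:

```python
import numpy as np

base = np.arange(6.0)
a, b = base[:3], base[3:]  # adjacent views of one base array

# Both views alias the same contiguous buffer, so an operation on the
# combined view base[:6] is equivalent to separate operations on a and b.
assert np.shares_memory(a, base) and np.shares_memory(b, base)
a += 1
b += 1
assert np.array_equal(base, np.arange(6.0) + 1)
```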
Nitpick: why write `0:1` instead of `:1`? If there's no reason, then I'll make a fixup commit to address it.
Can't think of a reason off the top of my head!
```python
self.merged_dependents = set()
self.opinfo = OpInfo()

def __call__(self, only_merge_ops_with_view):
```
Silly inexperienced coder question: why make this a `__call__` method instead of a regular method named something like `run_pass`?
@tbekolay might say more on why he decided on this approach; what comes to my mind is:
- It highlights the main entry point and sets it apart from the other methods that are not supposed to be called from outside the class.
- The class is used more like a function (on steroids, because it is actually an object and has some state), so it might make sense to make it a function object.
👉 👉 Nailed it, those were my reasons. Really, the whole class could be seen as one big long function, but doing it that way would be really hard to follow and nested 5 or 6 indentation levels deep, hence the methods to organize things. But the "user" (`optimize` in this case) should never call those methods; they can treat the class as though it were just one big long function, even though it does some stuff inside the class.
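For anyone new to the pattern, a tiny self-contained sketch of a function object (names hypothetical, unrelated to the nengo code): it is called like a function, but being an object it keeps state across invocations:

```python
class RunningMean:
    """Function object: callable like a function, but keeps state
    across invocations (illustrative example, not nengo code)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def __call__(self, x):
        # State persists between calls, unlike a plain function's locals.
        self.total += x
        self.count += 1
        return self.total / self.count

mean = RunningMean()
mean(2.0)           # 2.0
result = mean(4.0)  # 3.0
```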
nengo/builder/optimizer.py (outdated)
```python
with Timer() as t:
    single_pass(only_merge_ops_with_view)

after = len(dg)
```
Just to make sure I'm understanding this correctly: `dg` is passed into `OpMergePass` by reference, like all things in Python. Once `OpMergePass` is called, the `dg` is modified.

This was a bit hard for me to follow, because if something is modified, I kind of expect:
a) it to be returned, or
b) it to be passed in as an argument to a function call.

Is this a weird expectation to have? Hypothetically, if the call would return and overwrite the `dg` instance, would it affect performance?
To be clear, the reason I have this expectation is that I have a mental model of classes as containers for data that are only interacted with by accessing attributes or calling methods. I wouldn't expect the data inside the class to be connected to the outside by a reference, nor would I expect calling the class to update an outside object.
> Is this a weird expectation to have?

I think those are pretty reasonable expectations, but unfortunately things are a bit more complicated here.

> if the call would return and over-write the dg instance, would it affect performance?

Yes, it would increase memory usage, potentially by a lot for large models. The dependency graph can get quite large by itself, but it also references the initial set of operators, and those reference the initial set of signals, which require even more memory for their initial values. Keeping the old dependency graph around would prevent all those objects from being freed.
Maybe it would be clearer to use `single_pass.dg.forward` instead of `dg` to make it clear that it is an attribute of the `single_pass` instance that might get modified by the object?
> I have a mental model of Classes as containers for data

I think that is an appropriate mental model in many cases, but not all. Here especially, `OpMergePass` is more like a function object (sometimes called a functor). It is used pretty much like a function, but using an object allows us to split the algorithm into smaller functions while still grouping them. It also makes it easy to share data between those functions and preserve some state across invocations.
I initially had a similar thought that it would be nice to return an optimized version of the `dg` and `model` rather than mutating them, but as Jan says, it affects performance significantly. There's a difficult tradeoff to make in this PR, as optimized code tends to be less readable than slow but obvious code. I think this PR strikes a good balance between readable and efficient, though it's of course good to get more opinions, which is why we want multiple reviewers for each PR 😄
I left a few minor comments that I would like to discuss. Other than those few easily addressable concerns, I think this pull request is ready to be merged.
I added my tiny fixup commits. If they look good to @jgosmann then this is ready to merge!
This makes it easier to debug issues in which the Signal fails an assert in the `__init__` prior to the name being set.
This is equivalent to `Signal.itemsize * Signal.size`, but more explicit.
Reuses set instances, which can reduce memory usage considerably in some cases.
The optimizer merges together operators at build time in order to reduce simulation time for many types of common models. The optimizer required some modifications to existing graph functions. In particular, `reverse_edges` was changed to return the complete graph (the old implementation would remove nodes without edges). Also, we now consistently use sets for collections of graph edges. This commit includes some internal tests, as well as an integration test ensuring that complex models are simulated exactly the same. If we uncover any models that do change with the optimizer in the future, it is easy to add tests for those cases.
By doing this, we ensure that no attributes of an Op, other than the `sets`, `incs`, `reads`, and `updates` lists, contain signals. This speeds up the optimizer by removing the need to look at all properties of every operator when replacing signals.
Rebased this to master, made some history modifications, and added the optimizer to the docs, so this is good to go for me.
@Seanny123 could you file a review if it looks good to you? I'll merge on your positive review.
Sidenote: @jgosmann I noticed while going through the history that some commits had the wrong authorship information (e.g. 161a69d, 2ae579c). In the future, could you make sure to make commits with `--author="Author <e@mail.com>"` if moving around the history loses authorship info? Thanks! 😄 🌈
I'm not sure what caused the authorship information to get lost ... I was expecting rebases to keep it? So I didn't even consider checking that.

If I had to guess, it would be in situations where you split a commit into multiple commits... in the past when I've done that, only one of the commits retains the authorship information. At least in the way that I was splitting commits at the time ¯\\\_(ツ)\_/¯ In any case, no bigs!

I'll try to keep it in mind for the next complicated rebase.

Yep. Everything still looks good to me. Merge away!
I updated this description with the new template. The old description can be found below.
Description:
This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.
Motivation and context:
Not everyone has a powerful GPU and can use nengo_ocl (especially at the summer school it is common to have a laptop without a decent GPU). For those people this PR can significantly improve the run time.

This is also useful for running on high-performance clusters (in a serial-farming setup) like Sharcnet, which mostly consist of CPU clusters and only a few GPU clusters (and usually many more CPU cores than GPUs are provided).

Furthermore, when using large ensemble arrays (as in SPA, especially with spaopt #941, dot products, circular convolutions), this reduces the overhead introduced by many small ensembles.
Interactions with other PRs:
No real interactions afaik, but #941 will benefit.
How has this been tested?
The optimizer is enabled by default and all the tests still pass. Furthermore, I am using the optimizer on my TCM model and on @ikajic's spa-rat model. The latter especially has large arrays of small ensembles (5018 elements in a cleanup memory) and uncovered a number of memory and runtime issues that have been fixed. @xchoo tested a slightly older version of the optimizer on Spaun. Read through the comments for quotes on run time improvements and additions to the build time (the quoted build time for Spaun might be lower now with recent improvements).

I did not add a lot of unit tests because for the most part it is not clear to me how to test the optimizer. It is tested through the existing tests (all those models still work), which verifies its correctness in a way, but doesn't test that it actually speeds things up or isn't a no-op. Good ideas welcome.
Where should a reviewer start?
`optimizer.py` is the main algorithm and should be structured going from the high level to the low level of the algorithm. I tried to document it well, but it might not be sufficient to understand everything. Let me know where more comments would be helpful. Maybe it is best to go through the code with me, as it is somewhat complex.
How long should this take to review?
Types of changes:
Checklist:
Still to do:
#1119, but I think it should be a separate PR.
Old description for reference
This adds an optimizer to the simulator that goes through the operator graph and merges operators and their signals. This reduces the items to iterate over in slow Python (vs. fast NumPy) and gives larger blocks of sequentially aligned memory for improved CPU caching and prefetching.
At the moment the only operator that's being merged is `SimNeurons`, but I will look and see if this can be done for other operators too. There is a trade-off, though: running the simulation gets faster, but the build process will be slower. The algorithmic complexity of the optimizer is O(n^2 + e), where n is the number of operators and e is the number of edges in the dependency graph.
A few "benchmarks" I ran:

- `py.test -n 6` from 60s down to 51s
- `py.test -n 6 --slow` from 415s down to 274s
- `py.test` from 83s down to 71s

I haven't tested it on any more complicated models yet, but that should definitely be done. It would be nice to have #937 implemented to easily see the effect on build time.
It probably would be a good idea to have an option to turn the optimization on and off. Probably with an rc setting, but I feel like this could be something people want to change per instantiation of the `Simulator`. Also wondering if there should be some more fine-grained control over which operators to merge (or, if we add other optimizations, which set of them should be done). I think there was an issue about adding optimizations to the simulator, but I can't find it right now. Found it, this addresses #778.

TODO
- `Copy` in the op-opt-view branch with a lot of code duplication. Refactor this. (`Copy`)
- (`Signal` API). Not sure if it makes sense to test the optimizer in any other way than checking that it does not break any of the existing tests. Complex interactions are hard to cover in any other way anyways. Not sure if it is worth it to construct and check simple cases. Maybe check that certain optimizations are done at all?
- The `np.stack` function is used in one place, which requires NumPy 1.10. Replace it with an equivalent available in earlier versions. Also, add a test, because that part of the code is apparently never hit in the tests (no failure on Travis-CI).