Numba LIF model reduces simulation time by 30% or more #1482
Conversation
FYI, I tried speeding up the LIF neuron model using Numba.
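For context, here is a rough sketch of what a `nopython`-jitted LIF update could look like. The function name, signature, and explicit loop are illustrative only, not the PR's actual `NumbaLIF.step_math`:

```python
import math

import numpy as np
from numba import njit


@njit(cache=True)
def lif_step(J, voltage, refractory_time, spiked,
             dt, tau_rc, tau_ref, min_voltage, amplitude):
    """Illustrative LIF update compiled by Numba; arrays are modified in place."""
    for i in range(J.shape[0]):
        # shrink the remaining refractory period and integrate only over the
        # part of the timestep that falls outside of it
        refractory_time[i] -= dt
        delta_t = dt - refractory_time[i]
        if delta_t < 0.0:
            delta_t = 0.0
        elif delta_t > dt:
            delta_t = dt

        # membrane voltage decays exponentially toward the input current
        voltage[i] -= (J[i] - voltage[i]) * math.expm1(-delta_t / tau_rc)

        if voltage[i] > 1.0:
            # interpolate the spike time within the timestep
            t_spike = dt + tau_rc * math.log1p(
                -(voltage[i] - 1.0) / (J[i] - 1.0))
            spiked[i] = amplitude / dt
            voltage[i] = 0.0
            refractory_time[i] = tau_ref + t_spike
        else:
            spiked[i] = 0.0
            if voltage[i] < min_voltage:
                voltage[i] = min_voltage
```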
Benchmarking snippet for future reference:

```python
import nengo
from nengo.utils.least_squares_solvers import Cholesky, NumbaCholesky
from nengo.solvers import LstsqL2
import numpy as np

N = 3000
M = 3000
D = 256

rng = np.random.RandomState(seed=0)
A = rng.randn(M, N)
Y = rng.randn(M, D)

for subsolver in (Cholesky(), NumbaCholesky()):
    solver = LstsqL2(solver=subsolver)
    d, info = solver(A, Y)
    print(type(subsolver).__name__, info['time'])
```
Edit: Found a work-around using closures that doesn't hurt performance. I don't completely like this solution, but now we at least get the same performance graph from above, without duplicating the `LIF.step_math` code.
Is it possible to substitute the `NumbaLIF` in for `LIF` automatically when `numba` is installed?
Should be doable. But wondering if there is any reason to allow the standard numpy implementation to still be accessible, even when `numba` is installed?
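A minimal sketch of what such a substitution could look like, assuming the `NumbaLIF` class from this branch; the factory name here is hypothetical:

```python
import nengo


def make_lif(**kwargs):
    """Hypothetical factory: prefer the Numba-accelerated LIF when available."""
    try:
        import numba  # noqa: F401 -- only checking that numba is installed
        return nengo.neurons.NumbaLIF(**kwargs)  # class added in this branch
    except (ImportError, AttributeError):
        # fall back to the standard numpy implementation
        return nengo.LIF(**kwargs)
```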
Very cool! And would be even cooler if we can apply it to more aspects of the sim!
Do you have a few possible ways forward in mind that we can vote on? I can think of a few possibilities, but figured that you have more of a sense of what's possible, having implemented it.
With some help, I figured out the correct way to extend Numba to support the missing `np.clip` function.
I also tried a few variations. Then I added another test to this PR that compares the spike trains of `LIF` and `NumbaLIF`.
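For reference, the usual way to teach Numba's nopython mode about a function it does not support is `numba.extending.overload`. Here is a minimal sketch along those lines, applied to a stand-in `custom_clip` helper since the PR's actual implementation isn't shown here:

```python
import numpy as np
from numba import njit
from numba.extending import overload


def custom_clip(a, a_min, a_max):
    # plain-Python fallback used outside of jitted code
    return np.clip(a, a_min, a_max)


@overload(custom_clip)
def custom_clip_overload(a, a_min, a_max):
    # called at compile time with argument *types*; returns the implementation
    # that Numba should compile for calls to custom_clip in nopython code
    def impl(a, a_min, a_max):
        # only handles 1-D arrays with scalar bounds in this sketch
        out = np.empty_like(a)
        for i in range(a.shape[0]):
            x = a[i]
            if x < a_min:
                x = a_min
            elif x > a_max:
                x = a_max
            out[i] = x
        return out
    return impl


@njit
def demo(a):
    return custom_clip(a, 0.0, 1.0)
```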
Here's my current thinking:
I would lean towards 1 or 2 right now (no strong preference between them), but with an eye to eventually go to option 5. I could also see going straight to 5.

I'm not too worried about the problem of letting the users know that they could get a speedup if they install numba -- we currently have exactly that same situation with letting people know that if they install an optimized numpy, then things will go much faster. We should have a "how to make your models go faster" document somewhere and include both of those two things as suggestions.
After a quick look, they should be compatible with each other, in the sense that you could subclass a network from both of them. All my network does is change a couple of defaults. Maybe a better solution is needed if people want to do this though.
My understanding is you actually don't! If you use
With this information, option 4 seems viable to me too, given thorough prior testing on as many different systems as we can reasonably do.
I'm surprised too.

That said, I'm still wary of introducing another required dependency. While a lot of people will want the speed, for others (perhaps interested in trying some tutorials or running some small models to get started), speed is not so important.

The other thing to remember is that while the plot at the top shows a 25-30% speedup in the neuron model, this is in the neuron model only. For many (most?) models, the majority of the computation is spent on multiplying weights, so the overall speedup might be more like 10%.

I think my long-term ideal is option 6. I like having the choice to turn it off, if only for debugging and development. I don't think adding an extra config option is a big burden.

Short-term, I'm not sure if we need to use 1 or 2 as a stepping stone or not. It would be good to test this out on some other models and with other backends. My worry with 1 or 2 is that we'll put it there but none of us will really test it. I think I'd rather go with option 6 but have it default to disabled, as Aaron suggested. All of us can then set our config files to use it (we should perhaps also have a print statement just to remind/ensure that it's being used). Once it's been in use on our end for a bit, then we can default it to enabled.
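A minimal sketch of what option 6's switch might look like, assuming a hypothetical `[numba] enabled` entry in the nengorc file; the section and option names are made up here, while `nengo.rc.rc` is the ConfigParser-style object Nengo already uses for its RC settings:

```python
from nengo.rc import rc


def numba_enabled():
    # default to disabled unless the (hypothetical) [numba] section opts in
    if rc.has_section("numba") and rc.has_option("numba", "enabled"):
        return rc.getboolean("numba", "enabled")
    return False
```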
You are right that I haven't done enough here to explore the impact outside of a model that contains one ensemble and no connections. Below, I've tried something a bit more thorough by running the complete suite of nengo-benchmarks.

Each benchmark, except for "Parsing" (*) and "MNIST" (raises an error), performed 7-48% better in simulation time, averaged across 10 trials. Lorenz had the best improvement at 48%.

(*) There is considerable variability trial-to-trial. Running the "Parsing" benchmark again with more trials happened to give a 24% improvement this time around. The number of trials needs to be increased and confidence intervals should be considered; otherwise take each sample with a grain of salt.

The mean across all benchmarks is 16% -- pretty consistent with your guess of 10%. I'm optimistic that adding JIT support to other important computations such as dot-products and elementwise-products will make further improvements, consistent with the intuition that the majority of time is spent multiplying weights.

```python
from collections import defaultdict

import nengo
import nengo_benchmarks
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

benchmarks = [
    nengo_benchmarks.CommunicationChannel,
    nengo_benchmarks.ConvolutionCleanup,
    nengo_benchmarks.CircularConvolution,
    nengo_benchmarks.InhibitionTrial,
    nengo_benchmarks.LearningSpeedup,
    nengo_benchmarks.Lorenz,
    nengo_benchmarks.MatrixMultiply,
    nengo_benchmarks.SemanticMemory,
    nengo_benchmarks.SemanticMemoryWithRecall,
    # nengo_benchmarks.MNIST,
    nengo_benchmarks.Oscillator,
    nengo_benchmarks.Parsing,
    nengo_benchmarks.SPASequence,
    nengo_benchmarks.SPASequenceRouted,
]

data = defaultdict(list)
kwargs = {
    'verbose': False,
}

for benchmark in benchmarks:
    for trial in range(10):
        result_ref = benchmark().run(**kwargs)
        result_numba = benchmark().run(
            neuron_type=nengo.neurons.NumbaLIF(), **kwargs)

        name = benchmark.__name__
        print(name, trial)

        data['Benchmark'].append(name)
        data['Trial'].append(trial)
        # speed is inversely proportional to time, so this computes numba's improvement
        data['Improvement'].append(result_numba['speed'] / result_ref['speed'])

df = DataFrame(data)

plt.figure(figsize=(16, 6))
plt.hlines(1, -1, len(benchmarks), linestyle='--')
sns.boxplot(data=df, x='Benchmark', y='Improvement')
plt.xticks(rotation=50)
plt.show()

avg = np.mean(df['Improvement'])

plt.figure()
sns.kdeplot(df['Improvement'], shade=True)
plt.vlines([avg], 0, 1, linestyle='--')
plt.text(avg, 1.1, "%.2f" % avg, horizontalalignment='center')
plt.xlabel("Improvement")
plt.ylabel("Density")
plt.show()

for benchmark in df['Benchmark'].unique():
    print(benchmark, np.mean(df[df['Benchmark'] == benchmark]['Improvement']))
```
Dot products are already done by a highly-optimized BLAS library, though, assuming you've set up your Numpy correctly. The main way I can see saving there is doing the things that I think the optimizer already does, namely grouping the operations together so that you need fewer calls. Elementwise multiply might be slightly easier for the JIT compiler to optimize, but again it's essentially one Numpy call per Operator, so the overhead isn't huge.

I think what makes neuron models such a good target is that there's a number of very simple Numpy calls within one neuron model (e.g. logs, inverses, multiplies, indexing), making for a lot of overhead that the JIT compiler can optimize out. Anyway, I'm not saying it's not worth trying the other operations, I'm just managing expectations.

It is interesting that some models (like Lorenz) have such big improvements. That's an even bigger improvement than you saw with a raw population of neurons (I assume that's what's in the benchmark at the top).
Commenting from the view of building & running large models (i.e., Spaun) with the reference backend: as context, running Spaun using a BLAS installation of nengo takes about an hour, and even a 48% improvement does not compare to the sub-minute runtimes in nengo_ocl.
It can be an issue if you're forced to use the reference backend, e.g. because of learning rules not implemented in Nengo OCL.
For what it's worth, you can check whether numpy is using OpenBLAS via `np.__config__.openblas_info`.
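For example (assuming the pre-1.26 NumPy build-info attributes that the script below also relies on):

```python
import numpy as np

# non-empty dict only if numpy was built against OpenBLAS; absent otherwise
print(getattr(np.__config__, "openblas_info", {}))
np.__config__.show()  # lists all BLAS/LAPACK libraries detected at build time
```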
In my experiment below, I found that Numba is 70% faster at dot-products than Numpy with OpenBLAS.

```python
import time

import numpy as np
from numba import njit


def np_dot(A, B):
    return np.dot(A, B)


@njit
def numba_dot(A, B):
    return np.dot(A, B)


trials = 500
n = 1000
d = 10
m = 5000

A = np.random.randn(n, d)
B = np.random.randn(d, m)

print(np.__config__.openblas_info)

for func in (np_dot, numba_dot):
    t = time.time()
    for i in range(trials):
        func(A, B)
    td = time.time() - t
    print(func.__name__, td)
```
If we can optimize away any loops within the reference simulator, there should be more gains, although I am running into some challenges elsewhere.
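To make the "optimize away loops" idea concrete, here is an illustrative comparison (function names are mine, not from the PR) between a chain of small elementwise NumPy calls and one fused, jitted loop over the same data:

```python
import numpy as np
from numba import njit


def numpy_chain(x, decay, bias):
    # three separate NumPy calls, each with its own temporary array
    y = x * decay
    y = y + bias
    return np.maximum(y, 0.0)


@njit(cache=True)
def fused_chain(x, decay, bias):
    # one pass over the data with no intermediate allocations
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        y = x[i] * decay + bias
        out[i] = y if y > 0.0 else 0.0
    return out
```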
For the plot at the very top, 30% was the worst-case improvement.
From this perspective, supposing Numba can be successfully applied to core parts of the builder, such as sampling, collection of tuning curves, solving for decoders, etc., then options 1 and 2 would in fact prevent the builder from having access to Numba and these potential improvements in build time. Would this change your vote? If build time is the big win for Spaun, I can try to prove out some improvements on the builder side before taking this to a vote.
The instructions are here, just for anyone else looking for this information in the future.
Ah, thank you. Specifically where it says "If speed is an issue and you know your way around a terminal, installing NumPy from source is flexible and performant. See the detailed instructions [here]." I was looking for keywords such as "BLAS", "ATLAS", "LAPACK", or Fortran. Ironically, the 2.0.0 PyPI readme does have this keyword, because the hyperlink markup is not parsed, which exposes the keyword in the URL to Google.
I tried that numba dot script on my machine, and got the following:
I think that Numpy with some sort of BLAS is much easier to install now than in the past. I've got the OpenBLAS binaries installed, and I'm actually finding the opposite result for the dot-product script.

EDIT: Perhaps we should have new instructions for installing fast Numpy, since it does appear to be much easier now (my old ones that we still link to make it seem overly complicated). On any Ubuntu variant, I think it should be straightforward.
There could certainly be some differences from machine to machine. I encourage others to also run both the benchmark test in this branch (see first post) and the script above (#1482 (comment)). Here are the details of my architecture:
I tried running the script from a fresh environment, via:
and got the same results as before. One thing I discovered, though, is that without the
If you do
Wow! Using
compared to before, using
My workstation also has a 3.2x speedup with Numba on the dot script with
I get similar results on my workstation:
Without MKL:
With MKL:
You can see I've done two tests, one with smaller matrices and one with larger. Both
After some internal discussion, we decided to go with option 1 (moving this to
Motivation and context:
The neuron model is one of the inner-most loops within the reference backend. Speeding this code up with JIT compilation would offer a significant reduction in simulation times to virtually all Nengo models using the default simulator.
To this end, `nengo.neurons.NumbaLIF` works as a drop-in replacement for `nengo.LIF`. This new neuron model uses Numba's `nopython` just-in-time compilation feature to execute optimized machine code (instead of using the Python interpreter). See details: http://numba.pydata.org/numba-doc/latest/user/performance-tips.html

How has this been tested?
```
pip install numba
py.test nengo/tests/test_neurons.py::test_numbalif_benchmark --plots=plots --slow
xdg-open plots/test_neurons.test_numbalif_benchmark.pdf
```
How long should this take to review?
Where should a reviewer start?
This is a work in progress. How should this be exposed to the user? If `numba` is already installed, should a factory substitute this in for `LIF`? Should the config system use this as the default if available?

Types of changes:
Checklist:
Still to do:
- `seaborn` and `pandas` are required for the benchmark test.
- `NumbaLIF` duplicates the `LIF.step_math` code.
- ~~Move `custom_clip` to a utility file for numba-related functions?~~ Extended Numba to support `np.clip`.