Numba LIF model reduces simulation time by 30% or more #1482

Closed · wants to merge 2 commits

Conversation

@arvoelke (Collaborator) commented Oct 31, 2018

Motivation and context:
The neuron model update is one of the innermost loops in the reference backend. Speeding this code up with JIT compilation would offer a significant reduction in simulation time for virtually all Nengo models using the default simulator.

To this end, nengo.neurons.NumbaLIF works as a drop-in replacement for nengo.LIF. This new neuron model uses Numba's nopython just-in-time compilation feature to execute optimized machine code (instead of using the Python interpreter). See details: http://numba.pydata.org/numba-doc/latest/user/performance-tips.html.
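A minimal usage sketch, assuming this branch and numba are installed (NumbaLIF is the class added by this PR):

import nengo

with nengo.Network() as model:
    stim = nengo.Node(0.5)
    ens = nengo.Ensemble(
        n_neurons=1000,
        dimensions=1,
        neuron_type=nengo.neurons.NumbaLIF(),  # drop-in for nengo.LIF()
    )
    nengo.Connection(stim, ens)

with nengo.Simulator(model) as sim:
    sim.run(1.0)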

How has this been tested?

  • pip install numba
  • py.test nengo/tests/test_neurons.py::test_numbalif_benchmark --plots=plots --slow
  • xdg-open plots/test_neurons.test_numbalif_benchmark.pdf

[Plot: profile_numbalif — simulation-time benchmark of NumbaLIF vs. LIF]

How long should this take to review?

  • Lengthy (more than 150 lines changed or changes are complicated)

Where should a reviewer start?
This is a work in progress. How should this be exposed to the user? If numba is already installed, should a factory substitute this in for LIF? Should the config system use this as the default if available?

Types of changes:

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly.
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • I have run the test suite locally and all tests passed.

Still to do:

  • Test for correctness / consistency
  • Test depends on seaborn and pandas
  • Extend to other models in Nengo, including synapse models
  • Rerun existing Nengo test suite against NumbaLIF
  • Refactor to eliminate duplication of the LIF.step_math code.
  • Move custom_clip to a utility file for numba-related functions? (Update: extended Numba to support np.clip; see the comments below.)

@arvoelke (Collaborator Author) commented Nov 1, 2018

FYI, I tried speeding up the Cholesky sub-solver using Numba, in three different ways, none of which worked:

  • Apply @njit to:

    L = np.linalg.cholesky(G)
    L = np.linalg.inv(L.T)
    X = np.dot(L, np.dot(L.T, b))

    This ended up being slower than scipy.

  • Apply @njit to:

    factor = scipy.linalg.cho_factor(G, overwrite_a=True)
    X = scipy.linalg.cho_solve(factor, b)

    This gave an unsupported method error.

  • Apply @jit to the previous code.
    This gave no noticeable improvement for large systems.
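For reference, here is roughly what the first (pure-NumPy) variant looks like when wrapped in @njit; the NumbaCholesky subsolver used in the snippet below is a local experiment on this branch, not part of released Nengo:

import numpy as np
from numba import njit

@njit
def numba_cho_solve(G, b):
    # Solve G X = b via the Cholesky factorization G = L L^T,
    # using only np.linalg calls that Numba supports in nopython mode.
    L = np.linalg.cholesky(G)
    L = np.linalg.inv(L.T)   # L now holds inv(L^T) = L^{-T}
    return np.dot(L, np.dot(L.T, b))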

Benchmarking snippet for future reference:

import nengo
from nengo.utils.least_squares_solvers import Cholesky, NumbaCholesky
from nengo.solvers import LstsqL2

import numpy as np

N = 3000
M = 3000
D = 256

rng = np.random.RandomState(seed=0)
A = rng.randn(M, N)
Y = rng.randn(M, D)

for subsolver in (Cholesky(), NumbaCholesky()):
    solver = LstsqL2(solver=subsolver)
    d, info = solver(A, Y)
    print(type(subsolver).__name__, info['time'])

@arvoelke (Collaborator Author) commented Nov 1, 2018

Also, I tried to deduplicate the code between LIF and NumbaLIF by allowing the custom clip method to be provided as an additional argument to the static _lif_step_math method. However, it seems that passing a function pointer into a numba-compiled function has a significant associated cost: it halved the NumbaLIF improvement at 10,000 neurons, without changing the time for LIF.

When I tried to get around this by wrapping the function with the clip argument partially applied, Numba's type-inference machinery hit a snag when compiling the wrapper function.

Edit: I found a work-around using closures that doesn't hurt performance. I don't completely like this solution, but we now at least get the same performance graph as above without duplicating the step_math code.
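For the record, the closure idea looks roughly like this (a sketch with illustrative names, not the exact code in this branch): the clip function is captured as a closure variable, so Numba specializes the compiled step function instead of dispatching through a function pointer at run time.

import numpy as np
from numba import njit

@njit
def custom_clip(a, a_min, a_max):
    # Stand-in for np.clip, which Numba did not support at the time.
    return np.minimum(np.maximum(a, a_min), a_max)

def make_step_math(clip):
    # `clip` is resolved at compile time as a closure variable, so each call to
    # the factory produces a specialized step function with the clip inlined.
    @njit
    def step_math(dt, J, voltage, refractory_time, tau_rc, tau_ref):
        delta_t = clip(dt - refractory_time, 0.0, dt)
        voltage -= (J - voltage) * np.expm1(-delta_t / tau_rc)
        # ... spike detection and voltage/refractory reset as in LIF.step_math ...
    return step_math

numba_step_math = make_step_math(custom_clip)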

@jgosmann (Collaborator) commented Nov 1, 2018

How should this be exposed to the user? If numba is already installed, should a factory substitute this in for LIF?

Is it possible to substitute the numba implementation if the numba module can be imported, and otherwise use the default implementation?
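Something along these lines, perhaps (a hypothetical sketch, not code from this PR):

# Fall back to the plain-NumPy LIF when numba is unavailable.
try:
    import numba  # noqa: F401
    HAS_NUMBA = True
except ImportError:
    HAS_NUMBA = False

if HAS_NUMBA:
    from nengo.neurons import NumbaLIF as DefaultLIF
else:
    from nengo.neurons import LIF as DefaultLIF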

@arvoelke (Collaborator Author) commented Nov 1, 2018

Is it possible to substitute the numba implementation if the numba module can be imported, and otherwise use the default implementation?

Should be doable. But I'm wondering whether there is any reason to keep the standard numpy implementation accessible even when numba is installed. And how should this work for inheritance (e.g., AdaptiveLIF calls LIF.step_math)? I also want to figure out what everyone would be happy with before applying this to the other neuron models.

@tbekolay (Member) commented Nov 2, 2018

Very cool! And it would be even cooler if we could apply it to more aspects of the sim!

want to figure out what everyone would be happy with

Do you have a few possible ways forward in mind that we can vote on? I can think of a few possibilities but figured that you have more of a sense of what's possible having implemented it.

@arvoelke (Collaborator Author) commented Nov 12, 2018

With some help, I figured out the correct way to extend Numba to support the missing np.clip implementation. I pushed this as a PR to Numba: numba/numba#3468 and included the relevant (self-contained) portion in this PR to avoid depending on some future Numba release. With this utility file, it basically becomes a one-line change to make the LIF model Numba-compatible.
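For the curious, extending Numba this way goes through numba.extending.overload; a simplified sketch (not the exact code in numba/numba#3468 or in this PR's utility file) is:

import numpy as np
from numba import njit
from numba.extending import overload

@overload(np.clip)
def np_clip(a, a_min, a_max):
    # Register a nopython implementation of np.clip in terms of ufuncs Numba
    # already supports. (Newer Numba versions ship np.clip natively.)
    def impl(a, a_min, a_max):
        return np.minimum(np.maximum(a, a_min), a_max)
    return impl

@njit
def clipped_refractory(delta, dt):
    return np.clip(delta, 0.0, dt)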

Note that numba's only dependencies are llvmlite and numpy>=1.9 (more on installation below).

I also tried changing LIF entirely over to the Numba implementation and re-ran the test suite locally. All tests passed except for nengo/tests/test_neurons.py::test_lif_builtin, because it uses a 2D array for the current J. See #1419 and #1437 for relevant documentation fixes / changes that might affect this test. This gives some assurance that the implementation is up to spec.

Then I added another test to this PR that compares the spike-trains of LIF against NumbaLIF for numerical precision. This new test passes on my machine, which gives more supporting evidence that the implementation is correct.
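For anyone who wants to reproduce that kind of check without this branch's test file, the comparison can be as simple as building the same seeded model twice and diffing the probed spike trains (a hedged sketch, not the exact test added to this PR):

import numpy as np
import nengo

def run_model(neuron_type, seed=0, simtime=1.0):
    with nengo.Network(seed=seed) as model:
        stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))
        ens = nengo.Ensemble(100, 1, neuron_type=neuron_type)
        nengo.Connection(stim, ens)
        probe = nengo.Probe(ens.neurons)  # raw spike trains
    with nengo.Simulator(model) as sim:
        sim.run(simtime)
    return sim.data[probe]

spikes_ref = run_model(nengo.LIF())
spikes_numba = run_model(nengo.neurons.NumbaLIF())
assert np.allclose(spikes_ref, spikes_numba)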

@arvoelke (Collaborator Author) commented Nov 12, 2018

Do you have a few possible ways forward in mind that we can vote on? I can think of a few possibilities but figured that you have more of a sense of what's possible having implemented it.

Here's my current thinking:

  1. Move this to nengo_extras. This would mean the user must make the effort to install numba, nengo_extras, and then switch instances of nengo.LIF() over to nengo_extras.NumbaLIF() either on a per-ensemble basis or at the config level. It may be difficult or infeasible to extend Numba-compatibility to other parts of the reference simulator.

  2. Move this to nengolib. This would be the same as option 1, except I could provide the support in terms of documentation / testing / maintenance, and I can also make it the default neuron model (if numba is installed) via nengolib.Network().

  3. Keep this in nengo as a separate neuron model, as the PR currently stands. This would be similar to option 1, but could improve user-discovery and reduce the amount of work needed to install (one fewer package).

  4. Replace nengo.LIF() with nengo.NumbaLIF() and force users to install numba by adding it as a pip dependency. If they can't install it, then they can't use the reference simulator. This does have the benefit of reducing the surface area for maintenance / testing / debugging / etc. (relative to options 5 and 6). It also forces discovery of the improvement. Lastly, it makes it easiest to extend Numba-compatibility to any part of the simulator that we want, either now or in the future.

  5. Replace nengo.LIF() with nengo.NumbaLIF() if numba is already installed, and otherwise fall back to the standard implementation. Assuming the implementation is okay, users may experience a free speed-up without having to do much. However, they would need to be made aware of the suggestion to install numba. We may also want to expose some flag that reveals whether or not the switch was made.

  6. Replace nengo.LIF() with nengo.NumbaLIF() if numba is already installed and some flag is set to True in the config. This flag could default to either value, e.g., perhaps defaulting to False for the first couple of releases, and then switched over to True if there prove to be no serious issues. This has the benefits of option 5, while allowing a user to turn it off for whatever reason, even if numba is still installed. This is the most flexible, but also the most complicated, which means more maintenance / testing / etc.

@tcstewar (Contributor) commented Nov 13, 2018

I would lean towards 1 or 2 right now (no strong preference between them), but with an eye to eventually going to option 5. I could also see going straight to 5. I'm not too worried about letting users know that they could get a speedup if they install numba -- we currently have exactly the same situation with letting people know that things will go much faster if they install an optimized numpy. We should have a "how to make your models go faster" document somewhere and include both of those suggestions.

@jgosmann (Collaborator) commented Nov 13, 2018

  • 1 + 2: For any serious large-scale model, one would want the most performant install. Each additional step to achieve this is an annoyance to me.
  • 2: How would this work together with nengo_spa.Network?
  • 4: I assume this might require a working C compiler? That would be quite a departure from not having to worry about such things once Numpy is running.
  • 5: This is the option I favour. We could consider giving a warning if Numba isn't installed (we do the same with Scipy in some cases, which gives better accuracy in the basal ganglia and additional speed-ups with the optimizer). The optimized Numpy mentioned by @tcstewar is a slightly different case because afaik we cannot detect whether the installed Numpy is optimized.
  • 6: My first reaction is that it's better not to introduce a new config option. But that might only be because I'm currently dealing with a system at work where pretty much everything is configurable, and it turns out to be pretty annoying.

@arvoelke (Collaborator Author) commented Nov 13, 2018

  • 2: How would this work together with nengo_spa.Network?

After a quick look, they should be compatible with each other, in the sense that you could subclass a network from both of them. All my network does is change a couple of defaults. Maybe a better solution is needed if people want to do this though.

  • 4: I assume this might require a working C compiler? That would be quite a departure from not having to worry about such things once Numpy is running.

My understanding is that you actually don't need one! If you use pip or conda, you get a pre-built binary in which llvmlite has been statically linked against the required subset of the LLVM compiler. Only if you want to build the dev branch of llvmlite do you have to worry about having a compiler, AFAIK. llvmlite is the only dependency other than numpy>=1.9. Details on compatibility here: http://numba.pydata.org/numba-doc/latest/user/installing.html

@jgosmann (Collaborator) commented Nov 13, 2018

With this information, 4 seems viable to me too, given thorough prior testing on as many different systems as we can reasonably manage.

@hunse (Collaborator) commented Nov 14, 2018

I'm surprised too. Getting numba installed seems much easier than I expected.

That said, I'm still wary of introducing another required dependency. While a lot of people will want the speed, for others (perhaps interested in trying some tutorials or running some small models to get started) speed is not so important. The other thing to remember is that while the plot at the top shows a 25-30% speedup in the neuron model, this is in the neuron model only. For many (most?) models, the majority of the computation is spent on multiplying weights, so the overall speedup might be more like 10%.

I think my long term ideal is option 6. I like having the choice to turn it off, if only for debugging and development. I don't think adding an extra config option is a big burden.

Short-term, I'm not sure if we need to use 1 or 2 as a stepping stone or not. It would be good to test this out on some other models and with other backends. My worry with 1 or 2 is that we'll put it there but none of us will really test it. I think I'd rather go with option 6 but have it default to disabled as Aaron suggested. All of us can then set our config files to use it (we should perhaps also have a print statement just to remind/ensure that it's being used). Once it's been in use on our end for a bit, then we can default it to enabled.

@arvoelke (Collaborator Author) commented Nov 14, 2018

The other thing to remember is that while the plot at the top shows a 25-30% speedup in the neuron model, this is in the neuron model only. For many (most?) models, the majority of the computation is spent on multiplying weights, so the overall speedup might be more like 10%.

You are right that I haven't done enough here to explore the impact outside of a model that contains one ensemble and no connections. Below, I've tried something a bit more thorough by running the complete suite of nengo-benchmarks. Every benchmark except "Parsing" (*) and "MNIST" (which raises an error) ran 7-48% faster in simulation time, averaged across 10 trials. Lorenz had the best improvement at 48%.

(*) There is considerable trial-to-trial variability. Running the "Parsing" benchmark again with more trials happened to give a 24% improvement this time around. The number of trials needs to be increased and confidence intervals should be considered; otherwise, take each sample with a grain of salt.

The mean across all benchmarks is 16% -- pretty consistent with your guess of 10%. I'm optimistic that adding JIT support to other important computations such as dot-products and elementwise-products will make further improvements, consistent with the intuition that the majority of time is spent multiplying weights.

[Plot: benchmark_individual — per-benchmark improvement boxplots]

[Plot: benchmark_combined — distribution of improvements across all benchmarks]

from collections import defaultdict

import nengo
import nengo_benchmarks

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

benchmarks = [
    nengo_benchmarks.CommunicationChannel,
    nengo_benchmarks.ConvolutionCleanup,
    nengo_benchmarks.CircularConvolution,
    nengo_benchmarks.InhibitionTrial,
    nengo_benchmarks.LearningSpeedup,
    nengo_benchmarks.Lorenz,
    nengo_benchmarks.MatrixMultiply,
    nengo_benchmarks.SemanticMemory,
    nengo_benchmarks.SemanticMemoryWithRecall,
    # nengo_benchmarks.MNIST,
    nengo_benchmarks.Oscillator,
    nengo_benchmarks.Parsing,
    nengo_benchmarks.SPASequence,
    nengo_benchmarks.SPASequenceRouted,
]

data = defaultdict(list)

kwargs = {
    'verbose': False,
}

for benchmark in benchmarks:
    for trial in range(10):
        result_ref = benchmark().run(**kwargs)
        result_numba = benchmark().run(neuron_type=nengo.neurons.NumbaLIF(), **kwargs)

        name = benchmark.__name__
        print(name, trial)

        data['Benchmark'].append(name)
        data['Trial'].append(trial)
        # speed is inversely proportional to time, so this computes numba's improvement
        data['Improvement'].append(result_numba['speed'] / result_ref['speed'])

df = DataFrame(data)

plt.figure(figsize=(16, 6))
plt.hlines(1, -1, len(benchmarks), linestyle='--')
sns.boxplot(data=df, x='Benchmark', y='Improvement')
plt.xticks(rotation=50)
plt.show()

avg = np.mean(df['Improvement'])

plt.figure()
sns.kdeplot(df['Improvement'], shade=True)
plt.vlines([avg], 0, 1, linestyle='--')
plt.text(avg, 1.1, "%.2f" % avg, horizontalalignment='center')
plt.xlabel("Improvement")
plt.ylabel("Density")
plt.show()

for benchmark in df['Benchmark'].unique():
    print(benchmark, np.mean(df[df['Benchmark'] == benchmark]['Improvement']))
...
('CommunicationChannel', 1.0711993000693962)
('ConvolutionCleanup', 1.083407656191113)
('CircularConvolution', 1.1739388343539425)
('InhibitionTrial', 1.2655979857474466)
('LearningSpeedup', 1.1582004179285226)
('Lorenz', 1.4812901392236884)
('MatrixMultiply', 1.15535191082224)
('SemanticMemory', 1.171235946916829)
('SemanticMemoryWithRecall', 1.1197507715888813)
('Oscillator', 1.265969636887784)
('Parsing', 0.8497689418473513)
('SPASequence', 1.1256255668578001)
('SPASequenceRouted', 1.1679054635787818)

@hunse (Collaborator) commented Nov 14, 2018

Dot products are already done by a highly-optimized BLAS library though, assuming you've set up your Numpy correctly. The main way I can see to save there is doing the things that I think the optimizer already does, namely grouping operations together so that you need fewer calls. Elementwise multiply might be slightly easier for the JIT compiler to optimize, but again it's essentially one Numpy call per Operator, so the overhead isn't huge. I think what makes neuron models such a good target is that there are a number of very simple Numpy calls within one neuron model (e.g., logs, inverses, multiplies, indexing), making for a lot of overhead that the JIT compiler can optimize out. Anyway, I'm not saying it's not worth trying the other operations; I'm just managing expectations.

It is interesting that some models (like Lorenz) have such big improvements. That's an even bigger improvement than you saw with a raw population of neurons (I assume that's what's in test_numbalif_benchmark), which is surprising.

@xchoo (Member) commented Nov 14, 2018

Commenting from the view of building & running large models (i.e., spaun), with the nengo_ocl backend, simulation time actually is not an issue compared to the time it takes to actually construct the model (30s simulation vs 45min build time). That being the case, I'd favour the solution that prefers a slimmer nengo install (i.e., options 1 & 2).

As context, running Spaun using a blas installation of nengo takes about an hour, and even a 48% improvement does not compare to the sub-minute runtimes in nengo_ocl.

@jgosmann (Collaborator) commented Nov 14, 2018

Commenting from the view of building large models (i.e., spaun), with the nengo_ocl backend, simulation time actually is not an issue compared to the time it takes to actually construct the model.

It can be an issue if you're forced to use the reference backend, e.g., because of learning rules not implemented in Nengo OCL.

@arvoelke (Collaborator Author) commented Nov 14, 2018

afaik we cannot detect whether the installed Numpy is optimized

For what it's worth, you can check whether numpy is using OpenBLAS via np.__config__.show() or np.__config__.openblas_info (numpy/numpy#3912).

Dot products are already done by a highly-optimized BLAS library though, assuming you've set up your Numpy correctly

In my experiment below, I found that Numba is 70% faster at dot-products than Numpy with OpenBLAS. np.__config__.openblas_info returns information on openblas, while np.__config__.show() does not show any info on ATLAS, and so I think I do have it properly installed and configured. I remember this being relatively difficult to do, and think it would be much easier for an average person to pip install numba. And I could only find the instructions linked from the original nengo==2.0.0 pypi release readme: https://pypi.org/project/nengo/2.0.0/. I could not find instructions elsewhere in the Nengo documentation.

import time

import numpy as np
from numba import njit

def np_dot(A, B):
    return np.dot(A, B)

@njit
def numba_dot(A, B):
    return np.dot(A, B)

trials = 500
n = 1000
d = 10
m = 5000

A = np.random.randn(n, d)
B = np.random.randn(d, m)

print(np.__config__.openblas_info)
for func in (np_dot, numba_dot):
    t = time.time()
    for i in range(trials):
        func(A, B)
    td = time.time() - t
    print(func.__name__, td)
Output:

{'define_macros': [('HAVE_CBLAS', None)],
 'language': 'c',
 'libraries': ['openblas', 'openblas'],
 'library_dirs': ['/usr/local/lib']}
('np_dot', 14.827863931655884)
('numba_dot', 8.773243188858032)

If we can optimize away any loops within the reference simulator, there should be more gains, although I am running into some challenges elsewhere.

It is interesting that some models (like Lorenz) have such big improvements. That's an even bigger improvement than you saw with a raw population of neurons (I assume that's what's in test_numbalif_benchmark), which is surprising.

For the plot at the very top, 30% was the worst-case improvement. For n=2000 neurons, which is how many the Lorenz benchmark uses, the improvement is closer to 50%. This also makes sense considering the Lorenz benchmark is a single ensemble with one connection and a probe.

Commenting from the view of building & running large models (i.e., spaun), with the nengo_ocl backend, simulation time actually is not an issue compared to the time it takes to actually construct the model (30s simulation vs 45min build time). That being the case, I'd favour the solution that prefers a slimmer nengo install (i.e., options 1 & 2).

From this perspective, supposing Numba can be successfully applied to core parts of the builder (sampling, collecting tuning curves, solving for decoders, etc.), options 1 and 2 would in fact prevent the builder from having access to Numba and these potential build-time improvements. Would this change your vote? If build time is the big win for Spaun, I can try to prove out some improvements on the builder side before taking this to a vote.

@drasmuss (Member) commented Nov 14, 2018

And I could only find the instructions linked from the original nengo==2.0.0 pypi release readme: https://pypi.org/project/nengo/2.0.0/. I could not find instructions elsewhere in the Nengo documentation.

The instructions are here, just for anyone else looking for this information in the future 😃
https://www.nengo.ai/nengo/getting_started.html#installing-numpy

@arvoelke (Collaborator Author) commented Nov 14, 2018

The instructions are here, just for anyone else looking for this information in the future
https://www.nengo.ai/nengo/getting_started.html#installing-numpy

Ah, thank you. Specifically where it says, "If speed is an issue and you know your way around a terminal, installing NumPy from source is flexible and performant. See the detailed instructions [here]." I was looking for keywords such as "BLAS", "ATLAS", "LAPACK", or Fortran. Ironically, the 2.0.0 pypi readme does contain such a keyword, because the hyperlink markup is not parsed, which exposes the keyword in the URL to Google.

@hunse (Collaborator) commented Nov 14, 2018

I tried that numba dot script on my machine, and got the following:

{'define_macros': [('HAVE_CBLAS', None)], 'library_dirs': ['/usr/local/lib'], 'libraries': ['openblas', 'openblas'], 'language': 'c'}
np_dot 16.1305673122406
numba_dot 19.067010164260864

I think that Numpy with some sort of BLAS is much easier to install now than in the past. I've got the OpenBLAS binaries from apt-get installed, and if I do pip install numpy in a fresh virtualenv it detects them and uses them.

I'm actually finding the opposite for numba. When I try to pip install it in a fresh virtualenv, it doesn't find the pre-built wheels for some reason and tries to build from source, which fails because it can't find llvm-config. So I wasn't actually successful in comparing them in a fresh virtualenv, which is what I set out to do.

EDIT: Perhaps we should have new instructions for installing fast Numpy, since it does appear to be much easier now (my old ones that we still link to make it seem overly complicated). On any Ubuntu variant, I think it should be as straightforward as sudo apt-get install libopenblas-dev and then pip install numpy. Other OSes might be best to go with Conda. I'm also curious what it would do if you just pip install numpy with no OpenBLAS on there.

@arvoelke (Collaborator Author) commented Nov 14, 2018

There could certainly be some differences from machine to machine. I encourage others to also run both the benchmark test in this branch (see the first post) and the script above (#1482 (comment)). Here are the details of my architecture, given by the lscpu command on Ubuntu 16.04 LTS:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Stepping:              3
CPU MHz:               3265.195
CPU max MHz:           3900.0000
CPU min MHz:           800.0000
BogoMIPS:              6784.94
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d

I tried running the script from a fresh environment, via:

conda create --name temp python=3.6
conda activate temp
pip install numpy scipy numba
python -c "<copy-script-here>"

and got the same results as before. One thing I discovered, though, is that without scipy installed this produces the error scipy 0.16+ is required for linear algebra, which curiously only seems to be documented in a "Note" here: http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html#linear-algebra. This only applies to the above script, which uses np.dot (not to the njit'd code in this branch, which uses only standard ufuncs and array indexing).

@xchoo (Member) commented Nov 14, 2018

If you do conda install numpy instead of pip install numpy, conda will also install prebuilt MKL packages that should improve the run time of numpy code.

@arvoelke (Collaborator Author) commented Nov 14, 2018

If you do conda install numpy instead of pip install numpy, conda will also install prebuilt MKL packages that should improve the run time of numpy code.

Wow! Using MKL instead of OpenBLAS gives me way better results for both numba and numpy, while providing an even larger relative Numba speedup -- in the dot-product script, Numba is over 3.2x faster than numpy!

mkl:

np_dot 9.89888334274292
numba_dot 3.046924352645874

compared to before, using openblas:

np_dot 14.827863931655884
numba_dot 8.773243188858032

np.__config__.show()

mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']

@arvoelke (Collaborator Author) commented Nov 15, 2018

My workstation also has a 3.2x speedup with Numba on the dot script with mkl.

lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Stepping:            2
CPU MHz:             1199.489
CPU max MHz:         3800.0000
CPU min MHz:         1200.0000
BogoMIPS:            6996.69
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts flush_l1d
conda create --yes --name temp python=3.6 numpy scipy numba
conda activate temp
python -c "<copy-script-here-with-np.__config__.show()>"
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/arvoelke/anaconda3/envs/temp/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/arvoelke/anaconda3/envs/temp/include']
np_dot 7.823083162307739
numba_dot 2.4039652347564697

@hunse (Collaborator) commented Nov 15, 2018

I get similar results on my workstation:
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Stepping:              3
CPU MHz:               800.047
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              8016.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

Without MKL:

Test: A (1000, 10), B (10, 5000), trials=500
np_dot 10.026745319366455
numba_dot 4.355346918106079
Test: A (3000, 3000), B (3000, 5000), trials=10
np_dot 4.72563362121582
numba_dot 4.389530181884766

With MKL:

Test: A (1000, 10), B (10, 5000), trials=500
np_dot 7.384211301803589
numba_dot 2.3511486053466797
Test: A (3000, 3000), B (3000, 5000), trials=10
np_dot 4.1761345863342285
numba_dot 3.9460630416870117

You can see I've done two tests, one with smaller matrices and one with larger. Both numba and MKL seem to make more of a difference in the case where we have many smaller matrices. Arguably, though, this is more like what we typically do in Nengo (though the optimizer does change that some by grouping things together). Anyway, definitely worth trying numba across more of Nengo to see how much it helps.
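For reference, here's a reconstruction of the two-size variant of the earlier dot-product script (matching the shapes and trial counts printed above; not necessarily the exact script I ran):

import time

import numpy as np
from numba import njit

def np_dot(A, B):
    return np.dot(A, B)

@njit
def numba_dot(A, B):
    return np.dot(A, B)

rng = np.random.RandomState(0)

# Many small products vs. a few large ones: (n, d, m, trials)
for n, d, m, trials in [(1000, 10, 5000, 500), (3000, 3000, 5000, 10)]:
    A = rng.randn(n, d)
    B = rng.randn(d, m)
    print("Test: A %s, B %s, trials=%d" % (A.shape, B.shape, trials))
    for func in (np_dot, numba_dot):
        t = time.time()
        for _ in range(trials):
            func(A, B)
        print(func.__name__, time.time() - t)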

@drasmuss (Member) commented May 24, 2019

After some internal discussion we decided to go with option 1 (moving this to nengo_extras), for now. See nengo/nengo-extras#86. That doesn't preclude moving this back to nengo core in the future if we find ourselves using it a lot. nengo-extras just seemed like a more lightweight first step for numba integration, before we start adding numba logic (even if optional) into the core codebase.

@drasmuss closed this May 24, 2019
@hunse deleted the numba branch May 24, 2019