Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numba support #2

Open
MainRo opened this issue Jun 21, 2020 · 13 comments
Open

Numba support #2

MainRo opened this issue Jun 21, 2020 · 13 comments

Comments

@MainRo
Copy link
Member

MainRo commented Jun 21, 2020

From belm0:

Performance wise, I consider pypy as the only option today on a real application.

I'm not sure what this means. Pypy has limitations (it is not a 1:1 replacement for CPython), and there are legacy applications which cannot transition to pypy easily, or which are heavily dependent on numpy. For such applications, a numpy + numba implementation is useful. Having such an implementation does not preclude having a pure Python implementation which supports Pypy. They can exist along side each other.

I do not want to use numpy because on a streaming application it will never allow CPython to close the gap with pypy, an numpy just breaks pypy's jit. However I am interested in numba support,

A numba-only implementation will not perform, because numba does not support fast mode with Python arrays. The only way to get performance on this algorithm with numba is via numpy arrays.

I adapted the distogram bench to streamhist to compare the update function. On CPython distogram is 25% faster, and on pypy distogram is 13 times faster.

I measured my pure Python implementation (no numa or numpy) vs. distogram. It is 20% faster (and less code, but I didn't compare closely). The implementation uses a "maintain cost function array" approach just as distogram does. So distogram appears to have some room for improvement.

My numba+numpy implementation is 20x faster than streamhist (with 64 bins).

@MainRo
Copy link
Member Author

MainRo commented Jun 21, 2020

To be honest I did not try numba yet but only read the doc and viewed some talks on it. My understanding is that its jit can unroll loops to simd instructions. So I expect this to work with plain python types, but I may be wrong. I this is correct I think that numba will be interesting also with pypy, hence my interest in it.
Concerning your implementation, I did not find it in your streamhist fork. Did you publish it somewhere ?
For numpy, it do not want its support not clutter the code. btw I am interested in why numpy speeds up your implementation while it slows down mine even on CPython (maybe because I did not try to optimize for it by explicitly using numpy functions).

@MainRo MainRo mentioned this issue Jun 21, 2020
@belm0
Copy link

belm0 commented Jun 22, 2020

After spending many weeks on this last year, I'm fairly confident saying 1) numba does not operate in its fast mode with python arrays, and 2) numpy by itself will not be fast because this isn't a vectorized algorithm (the data point additions are incremental). Notably, the numpy.insert/delete performance is terrible.

I haven't published my approximate histogram code yet. It isn't derived from streamhist or anything else. I guess I'll finally get around to it since nothing else is filling my needs.

As a quick example, here is timing of stdlib bisect_left vs. your hand-coded one (4x slower) vs. njit of your hand-coded one (not faster).

import numba
import bisect

def _bisect_left(a, x):
    lo = 0
    hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    return lo

_bisect_left_njit = numba.njit(_bisect_left)

a = list(range(1000))
na = numba.typed.List(a)

%timeit bisect.bisect_left(a, 15)
%timeit _bisect_left(a, 15)
%timeit _bisect_left_njit(na, 15)
381 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.53 µs ± 69.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.88 µs ± 52.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@belm0
Copy link

belm0 commented Jun 22, 2020

shows clearly that numba only works well with numpy arrays:

import numba
import numpy as np

def argmin_diff(a):
    i, i_m = 1, -1
    m = inf
    last_item = a[0]
    for item in a[1:]:
        d = item - last_item
        if d < m:
            m = d
            i_m = i-1
        last_item = item
        i += 1
    return i_m, m

argmin_diff_jit = njit(argmin_diff)

a = list(range(1000))
na = numba.typed.List(a)
npa = np.array(a)
%timeit argmin_diff(a)
%timeit argmin_diff_jit(na)
%timeit argmin_diff_jit(npa)
89.5 µs ± 557 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65.3 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.68 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@MainRo
Copy link
Member Author

MainRo commented Jun 22, 2020

As a quick example, here is timing of stdlib bisect_left vs. your hand-coded one (4x slower)

There is something wrong here, because the bisect code in distogram is a copy of the stdlib. So it is exactly the same code:
https://github.com/python/cpython/blob/master/Lib/bisect.py#L50
Performances should be the same.

@belm0
Copy link

belm0 commented Jun 23, 2020

There is something wrong here, because the bisect code in distogram is a copy of the stdlib

read below that stdlib function definition:

# Overwrite above definitions with a fast C implementation
try:
    from _bisect import *
except ImportError:
    pass

(if tuple values are used, there is no reason to re-implement stdlib bisect in the first place)

@MainRo
Copy link
Member Author

MainRo commented Jun 24, 2020

ok thanks, I did not see that previously.
I copied the bisect function so that I can work on bins directly. Otherwise I have to convert the bins to a value-only list. I guess that to be able to use it, i have to split the bin in two value/count lists. But this means two inserts/removal instead of one in some branches of trim. I will check this.

@belm0
Copy link

belm0 commented Jun 24, 2020

Otherwise I have to convert the bins to a value-only list

In the implementations I've seen, (point, count) tuple works fine and gives the most performance opportunities in pure Python. To update an item, simply write a new tuple.

@belm0
Copy link

belm0 commented Jun 30, 2020

My ApproximateHistogram in Python using only stdlib is now contained within a performance timer project I've published:

https://github.com/belm0/perf-timer/blob/063ef693702a9a8dc227d7be14b89c096c643873/src/perf_timer/_histogram.py

@belm0
Copy link

belm0 commented Jul 23, 2020

So using stdlib bisect / tuples made a huge difference, or the old numbers were mistaken?

Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
- CPython 3.7   update      no            65763
- CPython 3.7   update      yes           39277
+ CPython 3.7   update      no           436709
+ CPython 3.7   update      yes          251603

@MainRo
Copy link
Member Author

MainRo commented Jul 23, 2020

Yes, for CPython this is a huge bump.

@belm0
Copy link

belm0 commented Jul 24, 2020

It looks like the laptop numbers need to be updated?

On a i7-9800X Intel CPU, performances are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
CPython 3.7   update      no           436709
CPython 3.7   update      yes          251603
============  ==========  =======  ==========

On a modest 2014 13" macbook pro, performances are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
CPython 3.7   update      no           107345
CPython 3.7   update      yes           71843
============  ==========  =======  ==========

@belm0
Copy link

belm0 commented Mar 4, 2021

Yes, for CPython this is a huge bump.

@MainRo would you consider adding me to the README credits as I helped make this package about 10x faster and produce exact histograms when under capacity?

@MainRo
Copy link
Member Author

MainRo commented Mar 6, 2021

sure, I just updated the credits section of the README.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants