
Use cupy / numba to speed up computation? #48

Open
janpipek opened this issue Oct 3, 2018 · 6 comments

@janpipek
Owner

janpipek commented Oct 3, 2018

Use https://github.com/cupy/cupy as a computing backend? Given its API similarity to numpy, it should be trivial.

Investigate whether numba can lead to any speed-up.
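Because CuPy mirrors the NumPy API, a backend swap could in principle be a one-line import switch. A minimal sketch of that idea (the try/except fallback and the `xp` alias are illustrative conventions, not physt code; it falls back to NumPy when no GPU is available):

```python
import numpy as np

# Pick the GPU backend if CuPy is installed, otherwise plain NumPy.
try:
    import cupy as xp
except ImportError:
    xp = np

data = xp.asarray([0.1, 0.4, 0.4, 0.9])
counts, edges = xp.histogram(data, bins=2, range=(0.0, 1.0))

# Normalise back to a NumPy array regardless of backend
# (CuPy arrays expose .get() to copy device -> host).
counts = np.asarray(counts.get() if hasattr(counts, "get") else counts)
print(counts)  # [3 1]
```

The same call works unchanged on either backend, which is what makes the "trivial" claim plausible for the simple fixed-binning cases.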

@janpipek janpipek added this to the 0.4 milestone Oct 3, 2018
@janpipek janpipek self-assigned this Oct 3, 2018
@janpipek
Owner Author

janpipek commented Oct 3, 2018

https://github.com/astrofrog/fast-histogram also looks promising
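fast-histogram gets its speed by assuming regular (fixed-width) bins, which turns bin lookup into plain index arithmetic instead of a bin-edge search. A pure-NumPy sketch of that trick (the function name `regular_histogram2d` is made up for illustration; fast_histogram itself is compiled C):

```python
import numpy as np

def regular_histogram2d(x, y, bins, ranges):
    """Fixed-width 2D histogram via index arithmetic, in pure NumPy."""
    (x0, x1), (y0, y1) = ranges
    nx, ny = bins
    # Map each value directly to its bin index (no searchsorted needed)
    ix = np.floor((x - x0) * nx / (x1 - x0)).astype(np.int64)
    iy = np.floor((y - y0) * ny / (y1 - y0)).astype(np.int64)
    # Drop out-of-range points, then count flattened cell indices
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[ok] * ny + iy[ok]
    return np.bincount(flat, minlength=nx * ny).reshape(nx, ny)

x = np.array([-0.5, 0.5, 0.5, 2.0])  # last point falls outside the range
y = np.array([-0.5, 0.5, 0.5, 0.0])
H = regular_histogram2d(x, y, bins=(2, 2), ranges=((-1, 1), (-1, 1)))
print(H.sum())  # 3 -- the out-of-range point is dropped
```

This matches np.histogram2d on interior points; only values exactly on the upper edge are treated differently (np.histogram2d includes them in the last bin).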

@henryiii

henryiii commented Oct 31, 2018

By the way, I looked at speed a bit, and I think there's something to be gained! I had 128 arrays of 1M elements each that I needed 2D histograms from, and I had switched to physt to make it simple to rebin and sum them (which it did!). But I found that histogramming with numpy and then handing the result to physt gave 3x better performance, and redoing it in numba gave 8-9x on top of that. I hadn't tried fast-histogram before, but it looks pretty similar to the numba version.

Here are the timings. I originally ran this on a Xeon Phi, which has 64 very slow cores with 4x hyperthreading; obviously, none of these solutions uses it very well (to be honest, not much does). I reran it on my laptop too.

import math

import numba
import numpy as np
import fast_histogram
from physt import h2
from physt.histogram_nd import Histogram2D

bins = np.asarray((100, 100)).astype(np.int64)
ranges = np.asarray(((-1, 1), (-1, 1))).astype(np.float64)

edges = (np.linspace(*ranges[0, :], bins[0] + 1),
         np.linspace(*ranges[1, :], bins[1] + 1))

# This is 8-9 times faster than np.histogram.
# Passing an explicit signature to njit would precompile it.
# In theory, this could even be threaded (nogil!)
@numba.njit(nogil=True)
def fast_histogram2d(tracks, bins, ranges):
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    delta = 1 / ((ranges[:, 1] - ranges[:, 0]) / bins)
    for t in range(tracks.shape[1]):
        i = math.floor((tracks[0, t] - ranges[0, 0]) * delta[0])
        j = math.floor((tracks[1, t] - ranges[1, 0]) * delta[1])
        if 0 <= i < bins[0] and 0 <= j < bins[1]:
            H[i, j] += 1
    return H

vals = np.random.normal(size=[2, 1000000]).astype(np.float32)
%%timeit
h2(*vals, bins=edges)

Xeon Phi: 1.14 s ± 556 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 293 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Histogram2D(ledges, H)

Xeon Phi: 437 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 114 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1,1),(-1,1)))
Histogram2D(edges, H)

Xeon Phi: 53.7 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
MacBook Pro: 10.4 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
H = fast_histogram2d(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)

Xeon Phi: 41.6 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 10.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@henryiii

henryiii commented Oct 31, 2018

I also think it would be helpful to add the numpy-to-physt example to the docs, since there's a speed benefit.

@henryiii

henryiii commented Oct 31, 2018

Extra example: with numba you can also release the GIL, though for some reason the threaded speed-up only shows up on my MacBook or a normal Xeon:

from concurrent.futures import ThreadPoolExecutor
from functools import partial

%%timeit
splits = 4
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunks = (vals[:,i*chunk:(i+1)*chunk] for i in range(splits))
    f = partial(fast_histogram2d, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
    results = sum(results)
MacBook Pro: 4.23 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
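The snippet above relies on the numba kernel defined earlier. A self-contained variant of the same chunk-and-sum idea using np.histogram2d per chunk (note: plain NumPy mostly holds the GIL, so this sketch demonstrates the correctness of splitting, not the threading speed-up, which needs a nogil kernel like the numba one):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial

rng = np.random.default_rng(0)
vals = rng.normal(size=(2, 100_000)).astype(np.float32)
bins = (100, 100)
ranges = ((-1.0, 1.0), (-1.0, 1.0))

def hist_chunk(chunk, bins, ranges):
    # Histogram one chunk; counts are additive over disjoint chunks
    H, _, _ = np.histogram2d(chunk[0], chunk[1], bins=bins, range=ranges)
    return H

splits = 4
chunks = np.array_split(vals, splits, axis=1)
f = partial(hist_chunk, bins=bins, ranges=ranges)
with ThreadPoolExecutor(max_workers=splits) as pool:
    H = sum(pool.map(f, chunks))
```

Summing the per-chunk histograms reproduces the single-pass result exactly, which is what makes this decomposition safe.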

@janpipek
Owner Author

janpipek commented Nov 1, 2018

Hi Henry,

Thanks a lot for the ideas. I know that physt is a bit slow (especially for cases that are trivial with numpy). In the new design (if it ever comes), I aim to distinguish the cases that require a special approach (various clever binnings) from those that don't (where the computation can be passed to numpy directly). By the way, a very basic kick-off for supporting various backends (and their prioritisation) has started in the branch https://github.com/janpipek/physt/tree/v04-engines/.
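The backend prioritisation mentioned above could look roughly like the following sketch. Everything here (`register_backend`, the `_BACKENDS` registry, the `histogram` dispatcher) is hypothetical for illustration, not the actual v04-engines API:

```python
import numpy as np

_BACKENDS = []

def register_backend(name, priority, can_handle, compute):
    """Register a histogramming backend; higher priority wins."""
    _BACKENDS.append((priority, name, can_handle, compute))
    _BACKENDS.sort(key=lambda entry: entry[0], reverse=True)

def histogram(data, **kwargs):
    # Try backends in priority order; fall through if one declines
    for _prio, name, can_handle, compute in _BACKENDS:
        if can_handle(data, **kwargs):
            return name, compute(data, **kwargs)
    raise RuntimeError("no backend available")

# numpy fallback: always applicable, lowest priority; a numba or
# fast-histogram backend would register itself with higher priority
# only for the regular-binning cases it can handle.
register_backend(
    "numpy", 0,
    can_handle=lambda data, **kw: True,
    compute=lambda data, **kw: np.histogram(data, **kw)[0],
)

name, counts = histogram(np.array([0.1, 0.2, 0.8]), bins=2, range=(0, 1))
print(name, counts)  # numpy [2 1]
```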

I'd like to incorporate numba as well (as an optional dependency - I don't want it as a hard one).

Unfortunately, I won't be able to touch the code (or even join a discussion about it) until the end of November. I am putting this issue in my internal queue and will get to it (and to you) immediately afterwards. Sorry for the unavoidable delay :-(

Cheers,
Jan

@henryiii

henryiii commented Nov 1, 2018

Here's my post on histogram performance: https://iscinumpy.gitlab.io/post/histogram-speeds-in-python/ (all examples eventually produce a Physt histogram)

@janpipek janpipek added this to Backlog in Core and API Redesign Dec 21, 2018
@janpipek janpipek removed this from the 0.5 milestone Jan 20, 2021