
2.5 times slower using jit for a combined groupby and rank #4620

Closed

davidwynter opened this issue Sep 26, 2019 · 11 comments

Comments

@davidwynter

davidwynter commented Sep 26, 2019

I have created some test data like this:

import random
import pandas as pd

random.seed(4)
group_ids = ['aa', 'ab', 'ac', 'ad', 'ae', 'bg', 'bh', 'bi', 'bj', 'bk', 'er', 'es', 'et', 'eu', 'ev', 'ew', 'ex', 'ey']
rows = []
for i in range(500000):
    value = random.uniform(0, 10)
    grp_by = group_ids[random.randint(0, 17)]  # pick one of the 18 group ids at random
    rows.append({'grp_by': grp_by, 'values': value})

df = pd.DataFrame(rows, columns=['grp_by', 'values'])

Tested pandas like this:

%%timeit
df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0, method='dense')

Output:

313 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That was too slow, so I experimented with Numba. The grp_by values are strings, so they need to be hashed to integers before being passed to the jitted function. The code looks like this:

import numpy as np
import xxhash

grp_by = df['grp_by'].values
f = lambda x: xxhash.xxh64(x).intdigest()  # 64-bit integer hash of each group label
grp = np.array([f(xi) for xi in grp_by])

%%timeit
values = df['values'].to_numpy()
output = np.zeros(len(grp))
grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)

Output:

895 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here is my function doing the groupby and the ranking (I tested it, and it produces the same output as the pandas call above):

def group_by_rank(grp_by, values, desc, output):
    # visit the rows in group order, so each group's rows are contiguous
    inx = grp_by.argsort()
    prev = -1
    current_group = []
    current_idx = []
    count = 0

    for r in inx:
        count = count + 1
        if grp_by[r] == prev:
            # still inside the current group: accumulate value and original row index
            current_group.append(values[r])
            current_idx.append(r)
            if count == len(values):
                # last row overall: flush the final group
                if desc == 0:
                    ranks = np.array(current_group)[::-1].argsort()
                else:
                    ranks = np.array(current_group).argsort()

                for i in range(len(current_group)):
                    output[current_idx[i]] = ranks[i] + 1

        else:
            # group boundary: rank the group just finished and write its results
            if len(current_group) > 0:
                if desc == 0:
                    ranks = np.array(current_group)[::-1].argsort()
                else:
                    ranks = np.array(current_group).argsort()

                for i in range(len(current_group)):
                    output[current_idx[i]] = ranks[i] + 1

            # start a new group with this row
            current_group = [values[r]]
            current_idx = [r]
            prev = grp_by[r]

It runs now, but it is slower than both pandas and the plain-Python group_by_rank (717 ms). Is there a better way?

In fact, it would be very useful to have examples in the repo showing the Numba equivalents of pandas ranking, groupby, and a combined groupby with ranking.
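(Editorial note: one loop-based formulation that is nopython-friendly, shown here as an illustrative sketch rather than code from this thread, is to walk the rows in value order and keep a per-group counter of distinct values. It assumes the group labels have already been converted to integer codes 0..n_groups-1, e.g. via pandas.factorize; the names group_dense_rank and n_groups are hypothetical.)

import numpy as np
import numba

@numba.njit
def group_dense_rank(labels, values, n_groups, ascending):
    # walk rows from smallest to largest value (reversed for descending);
    # within each group, every new distinct value bumps that group's counter,
    # so ties share a rank and ranks have no gaps, i.e. a dense rank
    order = np.argsort(values)
    if not ascending:
        order = order[::-1]
    counter = np.zeros(n_groups, dtype=np.int64)
    last = np.empty(n_groups)
    out = np.empty(values.shape[0], dtype=np.int64)
    for r in order:
        g = labels[r]
        if counter[g] == 0 or values[r] != last[g]:
            counter[g] += 1
            last[g] = values[r]
        out[r] = counter[g]
    return out

With codes, uniques = pd.factorize(df['grp_by']), calling group_dense_rank(codes, df['values'].to_numpy(), len(uniques), False) would mirror rank(ascending=0, method='dense').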

@stuartarchibald
Contributor

Thanks for the report. I think the first question is answered by this:

In [8]: a = [10]

In [9]: type(a)
Out[9]: list

In [10]: a += np.zeros((1,))[0]

In [11]: type(a)
Out[11]: numpy.ndarray

and by the fact that Numba does not reproduce that behaviour.

@davidwynter davidwynter changed the title Question on jit for a combined groupby and rank Slow performance on jit for a combined groupby and rank Sep 27, 2019
@davidwynter davidwynter changed the title Slow performance on jit for a combined groupby and rank 200 times slower using jit for a combined groupby and rank Sep 29, 2019
@davidwynter
Author

Would I reach a wider audience of people experienced with Numba if I also posted this on Stack Overflow? I need to find a solution ASAP that beats pandas for groupby, rank, and the combination of the two.

@stuartarchibald
Contributor

@davidwynter Stack Overflow might be a better forum for this. It's not clear that these algorithms are directly comparable, or that the algorithm posted is optimal, so it's hard to verify that this is a performance issue caused by Numba.

@esc
Member

esc commented Oct 1, 2019

I recall an older blog post which might help figure out what is going on here: https://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
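(Editorial note: a minimal NumPy sketch of the sort-based grouping idea that post describes, applied to dense ranking; this is illustrative, not code from the thread, and the names are hypothetical.)

import numpy as np

def group_dense_rank_np(labels, values, ascending=True):
    # sort by (label, value) so each group is contiguous and internally ordered
    order = np.lexsort((values if ascending else -values, labels))
    sorted_labels = labels[order]
    sorted_values = values[order]
    n = len(labels)
    new_group = np.ones(n, dtype=bool)   # True where a new group starts
    new_group[1:] = sorted_labels[1:] != sorted_labels[:-1]
    new_value = np.ones(n, dtype=bool)   # True where the dense rank increments
    new_value[1:] = new_group[1:] | (sorted_values[1:] != sorted_values[:-1])
    dense = np.cumsum(new_value)
    # index of the first sorted row of the group each sorted row belongs to
    group_start = np.maximum.accumulate(np.where(new_group, np.arange(n), 0))
    out = np.empty(n, dtype=np.int64)
    out[order] = dense - dense[group_start] + 1  # rank relative to the group start
    return out

On the hashed (or factorized) integer labels from earlier in the thread, this should match rank(method='dense') per group, with the work done in a handful of vectorized passes.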

@davidwynter
Author

I tested the group_by_rank function with the same variables you see in the dataframe construction at the top, and the outputs are equivalent. I read that Numba likes loops, and group_by_rank is built around one, so from the documentation it looks like I did the right things. I could use pandas' factorize function instead of the hash function, but having moved the hashing outside the timing loop I found it made a minuscule difference. So at this point Stack Overflow is my only hope; if not there, then back to databases.
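(Editorial note: for readers unfamiliar with it, factorize maps each label to a dense integer code, so it could stand in for the xxhash step above. A sketch, assuming the df built at the top of the thread:)

import pandas as pd

# codes[i] is a small integer identifying df['grp_by'][i]; uniques maps codes back
codes, uniques = pd.factorize(df['grp_by'])
grp = codes  # integer group labels, usable in place of the hashed grp array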

@esc
Member

esc commented Oct 2, 2019

@davidwynter it sounds like a plan. Good luck and let us know if you manage to resolve it!

@davidwynter
Author

davidwynter commented Oct 3, 2019

I altered my issue so that anyone can reproduce the same results with a decent-sized dataset. @stuartarchibald suggested the algorithm may not be optimal; can he point me at some resource that explains how to make a function optimal for use with Numba? Will @guvectorize give me better results, for example?

Update: to answer my own question about @guvectorize:

190 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Still not the gains I am looking for.
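(Editorial note: for context on what @guvectorize involves, it compiles a kernel and broadcasts it over the array layout declared in a signature. A toy sketch of the decorator's shape, using a running maximum rather than the thread's actual ranking kernel:)

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], float64[:])'], '(n)->(n)', nopython=True)
def running_max(x, out):
    # out[i] is the maximum of x[0..i]
    m = x[0]
    for i in range(x.shape[0]):
        if x[i] > m:
            m = x[i]
        out[i] = m

a = np.random.rand(4, 10)
res = running_max(a)  # the kernel runs once per row, over the last axis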

@davidwynter davidwynter changed the title 200 times slower using jit for a combined groupby and rank 2.5 times slower using jit for a combined groupby and rank Oct 3, 2019
@stuartarchibald
Contributor

@davidwynter it looks like you are including compilation time in the execution time for Numba:

%%timeit
values = df['values'].to_numpy()
output = np.zeros(len(grp))
grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)

I would recommend reading http://numba.pydata.org/numba-doc/latest/user/5minguide.html#how-to-measure-the-performance-of-numba

If timed correctly, I find:

%%timeit
df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0, method='dense')
283 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs

%%timeit
grp_by_rank_numba(grp, values, 0, output)
223 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
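(Editorial note: spelling out the pattern from the guide linked above, as an illustrative sketch: trigger compilation with one warm-up call in its own cell, then time only the subsequent calls.)

grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)  # warm-up call: compiles for these argument types

Then, in a separate cell:

%%timeit
grp_by_rank_numba(grp, values, 0, output)  # reuses the compiled specialisation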

@davidwynter
Author

davidwynter commented Oct 4, 2019 via email

@stuartarchibald
Contributor

stuartarchibald commented Oct 4, 2019

Numba is a JIT compiler: when you call your JIT-decorated function, it looks at the data types and compiles a specialisation of the function for those types. If, in the same process, you call the function again with the same data types (not the same data, just the same types), it will fetch the compiled specialisation from an in-memory cache and use it, i.e. it will not have to compile a new one.

If you want to persist the cached functions to disk, to save having to keep compiling them for data types that don't change, cache=True can be supplied to the JIT decorator (docs). If you are in a situation where you can precompile functions, Numba also supports ahead-of-time compilation.

Whatever happens, in both the Pandas case and the Numba case someone somewhere has to pay the cost of compiling. It just so happens that Pandas is precompiled, so the cost isn't yours; with Numba you get a compiled function specialised to your hardware, and there is a cost to pay for that, but there are plenty of ways to amortise it, as noted.

If you are genuinely churning data types on every call and actually need to include compilation time in the total time, then this is a really hard problem: you are at the mercy of many moving parts, a lot of which are non-trivial to control.
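(Editorial note: a minimal sketch of the cache=True option mentioned above, using a toy function rather than the full group_by_rank:)

import numba
import numpy as np

@numba.jit(nopython=True, cache=True)  # persist the compiled specialisation to disk
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

x = np.random.rand(1000)
total(x)  # first call in a fresh process: compile, or load from the on-disk cache
total(x)  # later calls reuse the in-memory specialisation

With cache=True the compilation cost is paid once per machine-and-types combination rather than once per process.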

@esc
Member

esc commented Sep 10, 2020

This issue hasn't seen any activity recently, so I am assuming it has been resolved and will close it. If this is not the case, please feel free to open a discussion on Discourse: https://numba.discourse.group/ - Thanks!

@esc esc closed this as completed Sep 10, 2020