2.5 times slower using jit for a combined groupby and rank #4620
Comments
Thanks for the report. I think the first question is answered by this:

and that Numba is not performing that behaviour.
Will I get a wider audience of people experienced with Numba if I also post on Stack Overflow? I need to find a way to speed up the group by, the rank, and the combination of both relative to pandas as soon as possible.
@davidwynter Stack Overflow might be a better forum for this. It's not clear that these algorithms are directly comparable, or that the algorithm posted is optimal, so it's hard to verify that this is a performance issue caused by Numba.
I recall an older blog post which might help figure out what is going on here: https://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
I tested the group_by_rank function with the same variables you see in the dataframe construction at the top, and the results are equivalent. I read that Numba likes loops, and the group_by_rank function uses one, so going by the documentation provided it looks like I did the right things. I could use pandas's factorize function instead of the hash function, but having moved that outside the timing loop I found it made a minuscule difference. So at this point Stack Overflow is my only hope; if not there, then back to databases.
@davidwynter that sounds like a plan. Good luck, and let us know if you manage to resolve it!
I altered my issue so that anyone can reproduce the same results with a decent sized dataset. @stuartarchibald suggested the algorithm may not be optimal; can he point me at some resource that explains how to make a function optimal for use with Numba? Will @guvectorize give me better results, for example?

Update: still not the gains I am looking for.
@davidwynter it looks like you are including compilation time in the execution time for Numba:

```python
%%timeit
values = df['values'].to_numpy()
output = np.zeros(len(grp))
grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)
```

I would recommend reading http://numba.pydata.org/numba-doc/latest/user/5minguide.html#how-to-measure-the-performance-of-numba. If timed correctly, I find:

```python
%%timeit
df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0, method='dense')
```

283 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs

```python
%%timeit
grp_by_rank_numba(grp, values, 0, output)
```

223 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am trying to time the equivalent of the dataframe groupby, and thus it is valid to include the preparation steps for the source dataframe to make it suitable for calling the numba function.
> On Fri, 4 Oct 2019 at 10:42, stuartarchibald wrote:
>
> @davidwynter it looks like you are including compilation time in the execution time for Numba:
>
> ```python
> %%timeit
> values = df['values'].to_numpy()
> output = np.zeros(len(grp))
> grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
> grp_by_rank_numba(grp, values, 0, output)
> ```
>
> I would recommend reading http://numba.pydata.org/numba-doc/latest/user/5minguide.html#how-to-measure-the-performance-of-numba. If timed correctly, I find:
>
> ```python
> %%timeit
> df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0,method='dense')
> ```
>
> 283 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> vs
>
> ```python
> %%timeit
> grp_by_rank_numba(grp, values, 0, output)
> ```
>
> 223 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba is a JIT compiler: when you call your JIT decorated function, it looks at the data types and compiles a specialisation of the function for those types. If, in the same process, you call the function again with the same data types (not the same data, just the same types), it will fish the compiled specialisation out of an in-memory cache and use that, i.e. it will not have to compile a new one. If you want to persist the cached functions to disk, to save having to keep compiling them for data types that don't change, you can pass cache=True to the jit decorator. If you are genuinely churning data types on every call and actually need to include compilation time in the total time, then this is a really hard problem: you are at the mercy of many working parts, a lot of which are non-trivial to control.
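The warm-up behaviour described above can be demonstrated with a minimal timing sketch. This is an illustration, not code from the issue; the function is plain Python here so the sketch runs even without Numba installed, but with Numba you would decorate it as shown in the comment:

```python
import time
import numpy as np

# With Numba installed you would write:
#   from numba import jit
#   @jit(nopython=True, cache=True)   # cache=True persists compiled code to disk
# A plain Python function stands in here so the sketch is self-contained.
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

values = np.random.default_rng(0).random(10_000)

# First call: with a jitted function this would include compilation time.
t0 = time.perf_counter()
first_result = total(values)
first_call = time.perf_counter() - t0

# Second call with the same argument types: execution time only, because the
# compiled specialisation would be fetched from the in-memory cache.
t0 = time.perf_counter()
second_result = total(values)
second_call = time.perf_counter() - t0

print(first_call, second_call)
```

The point of the pattern is simply to exclude the one-off compilation from the steady-state measurement, as the 5-minute guide linked above recommends.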
This issue hasn't seen any activity recently, so I am assuming it has been resolved and will close it. If this is not the case, please feel free to open a discussion on Discourse here: https://numba.discourse.group/ - Thanks!
I have created some test data like this:
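The original code block was not captured in this copy of the issue. A plausible reconstruction, based on the column names (`grp_by`, `values`) used elsewhere in the thread and the string groups mentioned below; the row count, number of groups, and label format are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # "decent sized dataset"; the actual size is an assumption

# grp_by: string group labels, values: floats to be ranked within each group
df = pd.DataFrame({
    "grp_by": rng.choice([f"group_{i}" for i in range(1000)], size=n),
    "values": rng.random(n),
})
```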
Tested pandas like this:
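The pandas code block itself is missing from this copy, but the call is quoted verbatim later in the thread (`rank(ascending=0, method='dense')`). Reproduced here on a small hypothetical dataframe so it stands alone:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "grp_by": rng.choice(list("abc"), size=10),
    "values": rng.integers(0, 5, size=10).astype(float),
})

# Dense rank of `values` within each `grp_by` group; ascending=0 means the
# highest value in a group gets rank 1.
df["rank"] = df.groupby(df["grp_by"])["values"].rank(ascending=0, method="dense")
print(df)
```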
Output:
Too slow, so I experimented with Numba. Most of the grp_by values are strings, so they need to be hashed to an integer. The code looks like this:
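The hashing code was not captured here either. A hedged sketch of the preparation step: the hash-based encoding is a guess at what was done, and the `pd.factorize` alternative mentioned later in the thread is shown alongside:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"grp_by": ["x", "y", "x", "z", "y"],
                   "values": [1.0, 2.0, 3.0, 4.0, 5.0]})

# One possibility: Python's hash() per label (the issue says the strings were
# hashed to integers, but the exact function used is not shown).
grp_hashed = np.array([hash(s) for s in df["grp_by"]], dtype=np.int64)

# The alternative mentioned in the thread: pandas.factorize produces dense
# integer codes 0..n_groups-1, which are friendlier as array indices and can
# be computed once, outside the timed region.
grp, labels = pd.factorize(df["grp_by"])
```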
Output:
Here is my function doing the groupby and ranking (I tested it, and it produces the same output as the pandas function you see above):
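The function body was lost in this copy. Below is a sketch of what such a function might look like: the signature mirrors the call `grp_by_rank_numba(grp, values, 0, output)` quoted later in the thread, but the body is a reconstruction of the dense-rank semantics, not the author's code, and it is written for clarity rather than Numba-friendliness (a plain Python dict is not usable in nopython mode without `numba.typed.Dict`):

```python
import numpy as np

def group_by_rank(grp, values, ascending, output):
    """Dense rank of `values` within each integer group code in `grp`.

    ascending == 0 means the highest value gets rank 1, matching the pandas
    call rank(ascending=0, method='dense'). Results are written into the
    preallocated `output` array.
    """
    for g in np.unique(grp):
        idx = np.nonzero(grp == g)[0]
        uniq = np.unique(values[idx])   # distinct values, sorted ascending
        if not ascending:
            uniq = uniq[::-1]           # descending: largest value first
        rank_of = {v: r + 1 for r, v in enumerate(uniq)}
        for i in idx:
            output[i] = rank_of[values[i]]
    return output

# Tiny usage example with two groups; ties share a dense rank.
grp = np.array([0, 0, 0, 1, 1])
values = np.array([10.0, 20.0, 20.0, 5.0, 1.0])
output = np.zeros(len(grp))
group_by_rank(grp, values, 0, output)
# output is now [2, 1, 1, 1, 2]
```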
It runs now, but it is slower than both pandas and the straight Python group_by_rank (717 ms). Is there a better way?
In fact, it would be very useful to have examples in the repo of pandas-equivalent ranking, groupby, and combined groupby-with-ranking.