
2.5 times slower using jit for a combined groupby and rank #4620

Closed

davidwynter opened this issue Sep 26, 2019 · 11 comments

Comments

@davidwynter

davidwynter commented Sep 26, 2019

I have created some test data like this:

import random
import pandas as pd

random.seed(4)
group_ids = ['aa', 'ab', 'ac', 'ad', 'ae', 'bg', 'bh', 'bi', 'bj', 'bk', 'er', 'es', 'et', 'eu', 'ev', 'ew', 'ex', 'ey']
rows = []
for i in range(500000):
    value = random.uniform(0, 10)
    grp_by = group_ids[random.randint(0, 17)]  # pick one of the 18 group ids at random
    rows.append({'grp_by': grp_by, 'values': value})

df = pd.DataFrame(rows, columns=['grp_by', 'values'])

Tested pandas like this:

%%timeit
df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0, method='dense')

Output:

313 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That was too slow, so I experimented with Numba. The grp_by values are strings, so they need to be hashed to integers before being passed to the jitted function. The code looks like this:

import numpy as np
import xxhash

grp_by = df['grp_by'].values
f = lambda x: xxhash.xxh64(x).intdigest()  # 64-bit integer hash of each group label
grp = np.array([f(xi) for xi in grp_by])

%%timeit
values = df['values'].to_numpy()
output = np.zeros(len(grp))
grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)

Output:

895 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here is my function doing the groupby and the ranking (I tested it, and it produces the same output as the pandas call above):

def group_by_rank(grp_by, values, desc, output):
    # visit the rows in group order, so each group's rows are contiguous
    inx = grp_by.argsort()
    prev = -1
    current_group = []
    current_idx = []
    count = 0

    for r in inx:
        count = count + 1
        if grp_by[r] == prev:
            # still inside the current group: accumulate value and original row index
            current_group.append(values[r])
            current_idx.append(r)
            if count == len(values):
                # last row overall: flush the final group
                if desc == 0:
                    ranks = np.array(current_group)[::-1].argsort()
                else:
                    ranks = np.array(current_group).argsort()

                for i in range(len(current_group)):
                    output[current_idx[i]] = ranks[i] + 1

        else:
            # group boundary: rank the group just finished and write its results
            if len(current_group) > 0:
                if desc == 0:
                    ranks = np.array(current_group)[::-1].argsort()
                else:
                    ranks = np.array(current_group).argsort()

                for i in range(len(current_group)):
                    output[current_idx[i]] = ranks[i] + 1

            # start a new group with this row
            current_group = [values[r]]
            current_idx = [r]
            prev = grp_by[r]

It runs now, but it is slower than both pandas and the plain-Python group_by_rank (717 ms). Is there a better way?

In fact, it would be very useful to have examples in the repo showing the Numba equivalents of pandas ranking, groupby, and a combined groupby with ranking.
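(Editorial note: one loop-based formulation that is nopython-friendly, shown here as an illustrative sketch rather than code from this thread, is to walk the rows in value order and keep a per-group counter of distinct values. It assumes the group labels have already been converted to integer codes 0..n_groups-1, e.g. via pandas.factorize; the names group_dense_rank and n_groups are hypothetical.)

import numpy as np
import numba

@numba.njit
def group_dense_rank(labels, values, n_groups, ascending):
    # walk rows from smallest to largest value (reversed for descending);
    # within each group, every new distinct value bumps that group's counter,
    # so ties share a rank and ranks have no gaps, i.e. a dense rank
    order = np.argsort(values)
    if not ascending:
        order = order[::-1]
    counter = np.zeros(n_groups, dtype=np.int64)
    last = np.empty(n_groups)
    out = np.empty(values.shape[0], dtype=np.int64)
    for r in order:
        g = labels[r]
        if counter[g] == 0 or values[r] != last[g]:
            counter[g] += 1
            last[g] = values[r]
        out[r] = counter[g]
    return out

With codes, uniques = pd.factorize(df['grp_by']), calling group_dense_rank(codes, df['values'].to_numpy(), len(uniques), False) would mirror rank(ascending=0, method='dense').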

@stuartarchibald
Contributor

Thanks for the report. I think the first question is answered by this:

In [8]: a = [10]

In [9]: type(a)
Out[9]: list

In [10]: a += np.zeros((1,))[0]

In [11]: type(a)
Out[11]: numpy.ndarray

and by the fact that Numba does not reproduce that behaviour.

@davidwynter davidwynter changed the title Question on jit for a combined groupby and rank Slow performance on jit for a combined groupby and rank Sep 27, 2019
@davidwynter davidwynter changed the title Slow performance on jit for a combined groupby and rank 200 times slower using jit for a combined groupby and rank Sep 29, 2019
@davidwynter
Author

Would I reach a wider audience of people experienced with Numba if I also posted this on Stack Overflow? I need to find a solution ASAP that beats pandas for groupby, rank, and the combination of the two.

@stuartarchibald
Contributor

@davidwynter Stack Overflow might be a better forum for this. It's not clear that these algorithms are directly comparable, or that the algorithm posted is optimal, so it's hard to verify that this is a performance issue caused by Numba.

@esc
Member

esc commented Oct 1, 2019

I recall an older blog post which might help figure out what is going on here: https://wesmckinney.com/blog/mastering-high-performance-data-algorithms-i-group-by/
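(Editorial note: a minimal NumPy sketch of the sort-based grouping idea that post describes, applied to dense ranking; this is illustrative, not code from the thread, and the names are hypothetical.)

import numpy as np

def group_dense_rank_np(labels, values, ascending=True):
    # sort by (label, value) so each group is contiguous and internally ordered
    order = np.lexsort((values if ascending else -values, labels))
    sorted_labels = labels[order]
    sorted_values = values[order]
    n = len(labels)
    new_group = np.ones(n, dtype=bool)   # True where a new group starts
    new_group[1:] = sorted_labels[1:] != sorted_labels[:-1]
    new_value = np.ones(n, dtype=bool)   # True where the dense rank increments
    new_value[1:] = new_group[1:] | (sorted_values[1:] != sorted_values[:-1])
    dense = np.cumsum(new_value)
    # index of the first sorted row of the group each sorted row belongs to
    group_start = np.maximum.accumulate(np.where(new_group, np.arange(n), 0))
    out = np.empty(n, dtype=np.int64)
    out[order] = dense - dense[group_start] + 1  # rank relative to the group start
    return out

On the hashed (or factorized) integer labels from earlier in the thread, this should match rank(method='dense') per group, with the work done in a handful of vectorized passes.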

@davidwynter
Author

I tested the group_by_rank function with the same variables you see in the dataframe construction at the top, and the outputs are equivalent. I read that Numba likes loops, and group_by_rank is built around one, so from the documentation it looks like I did the right things. I could use pandas' factorize function instead of the hash function, but having moved the hashing outside the timing loop I found it made a minuscule difference. So at this point Stack Overflow is my only hope; if not there, then back to databases.
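(Editorial note: for readers unfamiliar with it, factorize maps each label to a dense integer code, so it could stand in for the xxhash step above. A sketch, assuming the df built at the top of the thread:)

import pandas as pd

# codes[i] is a small integer identifying df['grp_by'][i]; uniques maps codes back
codes, uniques = pd.factorize(df['grp_by'])
grp = codes  # integer group labels, usable in place of the hashed grp array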

@esc
Member

esc commented Oct 2, 2019

@davidwynter it sounds like a plan. Good luck and let us know if you manage to resolve it!

@davidwynter
Author

davidwynter commented Oct 3, 2019

I altered my issue so that anyone can reproduce the same results with a decent-sized dataset. @stuartarchibald suggested the algorithm may not be optimal; can he point me at some resource that explains how to make a function optimal for use with Numba? Will @guvectorize give me better results, for example?

Update: to answer my own question about @guvectorize:

190 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Still not the gains I am looking for.
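(Editorial note: for context on what @guvectorize involves, it compiles a kernel and broadcasts it over the array layout declared in a signature. A toy sketch of the decorator's shape, using a running maximum rather than the thread's actual ranking kernel:)

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], float64[:])'], '(n)->(n)', nopython=True)
def running_max(x, out):
    # out[i] is the maximum of x[0..i]
    m = x[0]
    for i in range(x.shape[0]):
        if x[i] > m:
            m = x[i]
        out[i] = m

a = np.random.rand(4, 10)
res = running_max(a)  # the kernel runs once per row, over the last axis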

@davidwynter davidwynter changed the title 200 times slower using jit for a combined groupby and rank 2.5 times slower using jit for a combined groupby and rank Oct 3, 2019
@stuartarchibald
Contributor

@davidwynter it looks like you are including compilation time in the execution time for Numba:

%%timeit
values = df['values'].to_numpy()
output = np.zeros(len(grp))
grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)

I would recommend reading http://numba.pydata.org/numba-doc/latest/user/5minguide.html#how-to-measure-the-performance-of-numba

If timed correctly, I find:

%%timeit
df['rank'] = df.groupby(df['grp_by'])['values'].rank(ascending=0, method='dense')
283 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs

%%timeit
grp_by_rank_numba(grp, values, 0, output)
223 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
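(Editorial note: spelling out the pattern from the guide linked above, as an illustrative sketch: trigger compilation with one warm-up call in its own cell, then time only the subsequent calls.)

grp_by_rank_numba = numba.jit(group_by_rank, nopython=True)
grp_by_rank_numba(grp, values, 0, output)  # warm-up call: compiles for these argument types

Then, in a separate cell:

%%timeit
grp_by_rank_numba(grp, values, 0, output)  # reuses the compiled specialisation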

@davidwynter
Author

davidwynter commented Oct 4, 2019 via email

@stuartarchibald
Contributor

stuartarchibald commented Oct 4, 2019

Numba is a JIT compiler: when you call your JIT-decorated function, it looks at the data types and compiles a specialisation of the function for those types. If, in the same process, you call the function again with the same data types (not the same data, just the same types), it will fetch the compiled specialisation from an in-memory cache and use it, i.e. it will not have to compile a new one.

If you want to persist the cached functions to disk, to save having to keep compiling them for data types that don't change, cache=True can be supplied to the JIT decorator (docs). If you are in a situation where you can precompile functions, Numba also supports ahead-of-time compilation.

Whatever happens, in both the Pandas case and the Numba case someone somewhere has to pay the cost of compiling. It just so happens that Pandas is precompiled, so the cost isn't yours; with Numba you get a compiled function specialised to your hardware, and there is a cost to pay for that, but there are plenty of ways to amortise it, as noted.

If you are genuinely churning data types on every call and actually need to include compilation time in the total time, then this is a really hard problem: you are at the mercy of many moving parts, a lot of which are non-trivial to control.
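(Editorial note: a minimal sketch of the cache=True option mentioned above, using a toy function rather than the full group_by_rank:)

import numba
import numpy as np

@numba.jit(nopython=True, cache=True)  # persist the compiled specialisation to disk
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

x = np.random.rand(1000)
total(x)  # first call in a fresh process: compile, or load from the on-disk cache
total(x)  # later calls reuse the in-memory specialisation

With cache=True the compilation cost is paid once per machine-and-types combination rather than once per process.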

@esc
Member

esc commented Sep 10, 2020

This issue hasn't seen any activity recently, so I am assuming it has been resolved and will close it. If this is not the case, please feel free to open a discussion on Discourse: https://numba.discourse.group/ - Thanks!

@esc esc closed this as completed Sep 10, 2020