
Very slow aggregation performance on object type column with Decimal datatype values #25168

Open
joetl opened this issue Feb 5, 2019 · 6 comments
Labels
Apply Apply, Aggregate, Transform, Map Groupby Performance Memory or execution speed performance

Comments

joetl commented Feb 5, 2019

Problem description

I have a dataframe with a million rows and 10 columns. I want to group by 3 columns and sum 1 column. If the column being aggregated is float, it takes less than a second to get the result. But if the column is object dtype holding Decimal values, it takes 70 seconds to return the result. I know it is not a numpy-dtype column and will take longer, but does 70 seconds to aggregate a million rows seem reasonable? Is there any way to get results faster with Decimal?
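A minimal sketch of the setup described above (column names and sizes are illustrative assumptions, not taken from the original data): one float column and one object column holding decimal.Decimal values, grouped and summed.

```python
# Minimal sketch (assumed names/sizes): float column vs. object column of
# decimal.Decimal, grouped by a small key and summed.
import decimal
import pandas as pd

N = 10_000  # small here; the report uses ~1,000,000 rows
df = pd.DataFrame({
    "key": [i % 4 for i in range(N)],                      # grouping key
    "as_float": [float(i) for i in range(N)],              # float64 dtype
    "as_decimal": [decimal.Decimal(i) for i in range(N)],  # object dtype
})

# Fast path: the float64 column goes through pandas' cythonized groupby sum.
fast = df.groupby("key")["as_float"].sum()

# Slow path: the object column falls back to a pure-Python reduction.
slow = df.groupby("key")["as_decimal"].sum()

print(df.dtypes)
```

Both calls return the same totals; the difference the report describes is purely in how long the object-dtype fallback takes at scale.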

WillAyd (Member) commented Feb 5, 2019

What kind of performance are you seeing if just using the NumPy array alone?

@gfyoung gfyoung added Groupby Performance Memory or execution speed performance Needs Info Clarification about behavior needed to assess issue labels Feb 7, 2019
nmusolino (Contributor) commented

Just to clarify, are you grouping on a Decimal column?

I am seeing that a Decimal column does impose a performance penalty, but the case below has a runtime of 1.5 seconds for 1,000,000 rows and 4 columns, which is much less than 70 seconds.

Can you provide a runnable example that exhibits a 70-second groupby time?

I suspect that most of the extra cost from Decimal values comes from allocation costs while assembling a data frame for each group.

In [5]: N = 1000000

In [21]: df = pandas.DataFrame({'a': list(itertools.islice(itertools.cycle([0, 1, 2, 3]), N)), 'b': list(range(N))})

In [22]: df = df.assign(d=df['a'].apply(decimal.Decimal), e=df['b'].apply(decimal.Decimal))

In [27]: df.dtypes
Out[27]: 
a     int64
b     int64
d    object
e    object
dtype: object

In [23]: df.head(6)
Out[23]: 
   a  b  d  e
0  0  0  0  0
1  1  1  1  1
2  2  2  2  2
3  3  3  3  3
4  0  4  0  4
5  1  5  1  5

# Base case: group by int and sum int column.
In [28]: %time df.groupby('a').sum()
CPU times: user 55.5 ms, sys: 6.39 ms, total: 61.9 ms
Wall time: 61.3 ms
Out[28]: 
              b
a              
0  124999500000
1  124999750000
2  125000000000
3  125000250000

# Group by int and sum decimal values.
In [24]: %time df.groupby('a').apply(lambda df: df.sum())
CPU times: user 1.45 s, sys: 48.4 ms, total: 1.5 s
Wall time: 1.5 s
Out[24]: 
          a             b         d             e
a                                                
0       0.0  1.249995e+11       0.0  1.249995e+11
1  250000.0  1.249998e+11  250000.0  1.249998e+11
2  500000.0  1.250000e+11  500000.0  1.250000e+11
3  750000.0  1.250002e+11  750000.0  1.250002e+11

# Group by decimal column.
In [26]: %time df.groupby('d').apply(lambda df: df.sum())
CPU times: user 1.6 s, sys: 49.2 ms, total: 1.65 s
Wall time: 1.65 s
Out[26]: 
          a             b         d             e
d                                                
0       0.0  1.249995e+11       0.0  1.249995e+11
1  250000.0  1.249998e+11  250000.0  1.249998e+11
2  500000.0  1.250000e+11  500000.0  1.250000e+11
3  750000.0  1.250002e+11  750000.0  1.250002e+11

jreback (Contributor) commented Feb 22, 2019

you can do a lot better by using DecimalArray here
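A sketch of what this might look like — my assumption, not spelled out in the thread: DecimalArray here presumably refers to the example extension array that ships in pandas' own test suite (`pandas.tests.extension.decimal`). It is not public, stable API, so the import path and behavior may change between pandas versions.

```python
# Sketch only: DecimalArray is the example extension array from pandas' test
# suite, not public API; treat this as illustrative, not supported usage.
import decimal
import pandas as pd
from pandas.tests.extension.decimal import DecimalArray

values = [decimal.Decimal(i) for i in range(8)]
s = pd.Series(DecimalArray(values))
print(s.dtype)  # a custom "decimal" extension dtype rather than plain object
```

Because the values live in a dedicated extension array instead of a generic object column, pandas can dispatch operations to the array type rather than falling back element by element.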

TomAugspurger (Contributor) commented Feb 22, 2019 via email

joetl (Author) commented Feb 28, 2019

The performance is very slow if the number of groups is very high.
In the above example, if we group by "b" instead of "a", it takes minutes as opposed to milliseconds. For a given record count, as the number of groups increases, aggregation takes far longer on decimal columns.

import pandas as pd
import itertools
N = 1000000
df = pd.DataFrame({'a': list(itertools.islice(itertools.cycle([0, 1, 2, 3]), N)), 'b': list(range(N))})
import decimal
df = df.assign(d=df['a'].apply(decimal.Decimal), e=df['b'].apply(decimal.Decimal))
df.dtypes
a     int64
b     int64
d    object
e    object
dtype: object
from time import time
time(); df.groupby("a").d.agg(sum); time()
1551322304.14313
a
0 0
1 250000
2 500000
3 750000
Name: d, dtype: object
1551322304.535903
time();df.groupby("b").d.agg(sum);time()
1551322322.8474317
b
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
8 0
9 1
10 2
11 3
12 0
13 1
14 2
15 3
16 0
17 1
18 2
19 3
20 0
21 1
22 2
23 3
24 0
25 1
26 2
27 3
28 0
29 1
..
999970 2
999971 3
999972 0
999973 1
999974 2
999975 3
999976 0
999977 1
999978 2
999979 3
999980 0
999981 1
999982 2
999983 3
999984 0
999985 1
999986 2
999987 3
999988 0
999989 1
999990 2
999991 3
999992 0
999993 1
999994 2
999995 3
999996 0
999997 1
999998 2
999999 3
Name: d, dtype: object
1551322486.930871
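One possible workaround — an assumption on my part, not something proposed in this thread: if every Decimal in the column shares a known, fixed scale (e.g. two fractional digits), the values can be converted exactly to scaled int64, summed through the fast cythonized integer groupby path, and converted back to Decimal afterwards. The `SCALE` constant and column names below are hypothetical.

```python
# Hedged workaround sketch: assumes a known fixed scale for every value.
# 1.25 -> 125 as int64, sum in the fast path, then 200 -> Decimal("2.00").
import decimal
import pandas as pd

SCALE = decimal.Decimal("0.01")  # assumed fixed scale of the data

df = pd.DataFrame({
    "key": [0, 1, 0, 1],
    "amount": [decimal.Decimal("1.25"), decimal.Decimal("2.50"),
               decimal.Decimal("0.75"), decimal.Decimal("0.10")],
})

# Exact integer representation; no float rounding is involved.
as_int = df["amount"].map(lambda d: int(d / SCALE)).astype("int64")

summed = as_int.groupby(df["key"]).sum()                    # fast int64 path
result = summed.map(lambda n: decimal.Decimal(n) * SCALE)   # back to Decimal
print(result)
```

This only works when the scale is uniform and the scaled values fit in int64; otherwise the object-dtype fallback (or an extension array) is still needed.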

joetl (Author) commented Feb 28, 2019

I am just looking at DecimalArray. We get a protobuf object, which we convert to a dict and then to a pandas dataframe. Is there a small example of converting the decimal columns to a DecimalArray when creating a pandas dataframe from a dictionary?
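A possible shape for this — again assuming the non-public DecimalArray from pandas' test suite, with a made-up dict standing in for the protobuf-derived data: wrap each decimal column in a DecimalArray while building the DataFrame from the dict.

```python
# Hypothetical sketch: `record` stands in for the dict converted from
# protobuf; DecimalArray is pandas' test-suite example array, not public API.
import decimal
import pandas as pd
from pandas.tests.extension.decimal import DecimalArray

record = {
    "qty": [1, 2],
    "price": [decimal.Decimal("9.99"), decimal.Decimal("1.50")],
}

df = pd.DataFrame({
    "qty": record["qty"],                      # plain int64 column
    "price": DecimalArray(record["price"]),    # decimal extension column
})
print(df.dtypes)
```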

@mroeschke mroeschke removed the Needs Info Clarification about behavior needed to assess issue label Mar 8, 2020
@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 26, 2021
@jbrockmendel jbrockmendel removed the ExtensionArray Extending pandas with custom dtypes or arrays. label Jul 27, 2023

8 participants