
Very slow aggregation performance on object type column with Decimal datatype values #25168

Open
joetl opened this issue Feb 5, 2019 · 6 comments
Labels
Apply Apply, Aggregate, Transform, Map Groupby Performance Memory or execution speed performance

Comments

joetl commented Feb 5, 2019

Problem description

I have a dataframe with a million rows and 10 columns. I want to group by 3 columns and sum 1 column. If the column being aggregated is float, it takes less than a second to get the result. But if the column is object dtype holding Decimal values, it takes 70 seconds to return the result. I know it is not a numpy-dtype column and will take longer, but does 70 seconds to aggregate a million rows seem reasonable? Is there any way to get results faster with Decimal?
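A minimal sketch of the setup described above (column names and sizes are illustrative assumptions, not taken from the original data): one float column and one object column holding decimal.Decimal values, grouped and summed.

```python
# Minimal sketch (assumed names/sizes): float column vs. object column of
# decimal.Decimal, grouped by a small key and summed.
import decimal
import pandas as pd

N = 10_000  # small here; the report uses ~1,000,000 rows
df = pd.DataFrame({
    "key": [i % 4 for i in range(N)],                      # grouping key
    "as_float": [float(i) for i in range(N)],              # float64 dtype
    "as_decimal": [decimal.Decimal(i) for i in range(N)],  # object dtype
})

# Fast path: the float64 column goes through pandas' cythonized groupby sum.
fast = df.groupby("key")["as_float"].sum()

# Slow path: the object column falls back to a pure-Python reduction.
slow = df.groupby("key")["as_decimal"].sum()

print(df.dtypes)
```

Both calls return the same totals; the difference the report describes is purely in how long the object-dtype fallback takes at scale.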

WillAyd (Member) commented Feb 5, 2019

What kind of performance are you seeing if just using the NumPy array alone?

@gfyoung gfyoung added Groupby Performance Memory or execution speed performance Needs Info Clarification about behavior needed to assess issue labels Feb 7, 2019
nmusolino (Contributor) commented

Just to clarify, are you grouping on a Decimal column?

I am seeing that a Decimal column does impose a performance penalty, but the case below has a runtime of 1.5 seconds for 1,000,000 rows and 4 columns, which is much less than 70 seconds.

Can you provide a runnable example that exhibits a 70-second groupby time?

I suspect that most of the extra cost from Decimal values comes from allocation costs while assembling a data frame for each group.

In [5]: N = 1000000

In [21]: df = pandas.DataFrame({'a': list(itertools.islice(itertools.cycle([0, 1, 2, 3]), N)), 'b': list(range(N))})

In [22]: df = df.assign(d=df['a'].apply(decimal.Decimal), e=df['b'].apply(decimal.Decimal))

In [27]: df.dtypes
Out[27]: 
a     int64
b     int64
d    object
e    object
dtype: object

In [23]: df.head(6)
Out[23]: 
   a  b  d  e
0  0  0  0  0
1  1  1  1  1
2  2  2  2  2
3  3  3  3  3
4  0  4  0  4
5  1  5  1  5

# Base case: group by int and sum int column.
In [28]: %time df.groupby('a').sum()
CPU times: user 55.5 ms, sys: 6.39 ms, total: 61.9 ms
Wall time: 61.3 ms
Out[28]: 
              b
a              
0  124999500000
1  124999750000
2  125000000000
3  125000250000

# Group by int and sum decimal values.
In [24]: %time df.groupby('a').apply(lambda df: df.sum())
CPU times: user 1.45 s, sys: 48.4 ms, total: 1.5 s
Wall time: 1.5 s
Out[24]: 
          a             b         d             e
a                                                
0       0.0  1.249995e+11       0.0  1.249995e+11
1  250000.0  1.249998e+11  250000.0  1.249998e+11
2  500000.0  1.250000e+11  500000.0  1.250000e+11
3  750000.0  1.250002e+11  750000.0  1.250002e+11

# Group by decimal column.
In [26]: %time df.groupby('d').apply(lambda df: df.sum())
CPU times: user 1.6 s, sys: 49.2 ms, total: 1.65 s
Wall time: 1.65 s
Out[26]: 
          a             b         d             e
d                                                
0       0.0  1.249995e+11       0.0  1.249995e+11
1  250000.0  1.249998e+11  250000.0  1.249998e+11
2  500000.0  1.250000e+11  500000.0  1.250000e+11
3  750000.0  1.250002e+11  750000.0  1.250002e+11

jreback (Contributor) commented Feb 22, 2019

you can do a lot better by using DecimalArray here
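A sketch of what this might look like — my assumption, not spelled out in the thread: DecimalArray here presumably refers to the example extension array that ships in pandas' own test suite (`pandas.tests.extension.decimal`). It is not public, stable API, so the import path and behavior may change between pandas versions.

```python
# Sketch only: DecimalArray is the example extension array from pandas' test
# suite, not public API; treat this as illustrative, not supported usage.
import decimal
import pandas as pd
from pandas.tests.extension.decimal import DecimalArray

values = [decimal.Decimal(i) for i in range(8)]
s = pd.Series(DecimalArray(values))
print(s.dtype)  # a custom "decimal" extension dtype rather than plain object
```

Because the values live in a dedicated extension array instead of a generic object column, pandas can dispatch operations to the array type rather than falling back element by element.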

TomAugspurger (Contributor) commented Feb 22, 2019 via email

joetl (Author) commented Feb 28, 2019

The performance is very slow if the number of groups is very high.
In the above example, if we group by "b" instead of "a", it takes minutes as opposed to milliseconds. For a given record count, as the number of groups increases, aggregation takes far longer on decimal columns.

import pandas as pd
import itertools
N = 1000000
df = pd.DataFrame({'a': list(itertools.islice(itertools.cycle([0, 1, 2, 3]), N)), 'b': list(range(N))})
import decimal
df = df.assign(d=df['a'].apply(decimal.Decimal), e=df['b'].apply(decimal.Decimal))
df.dtypes
a     int64
b     int64
d    object
e    object
dtype: object
from time import time
time(); df.groupby("a").d.agg(sum); time()
1551322304.14313
a
0 0
1 250000
2 500000
3 750000
Name: d, dtype: object
1551322304.535903
time();df.groupby("b").d.agg(sum);time()
1551322322.8474317
b
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
8 0
9 1
10 2
11 3
12 0
13 1
14 2
15 3
16 0
17 1
18 2
19 3
20 0
21 1
22 2
23 3
24 0
25 1
26 2
27 3
28 0
29 1
..
999970 2
999971 3
999972 0
999973 1
999974 2
999975 3
999976 0
999977 1
999978 2
999979 3
999980 0
999981 1
999982 2
999983 3
999984 0
999985 1
999986 2
999987 3
999988 0
999989 1
999990 2
999991 3
999992 0
999993 1
999994 2
999995 3
999996 0
999997 1
999998 2
999999 3
Name: d, dtype: object
1551322486.930871
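One possible workaround — an assumption on my part, not something proposed in this thread: if every Decimal in the column shares a known, fixed scale (e.g. two fractional digits), the values can be converted exactly to scaled int64, summed through the fast cythonized integer groupby path, and converted back to Decimal afterwards. The `SCALE` constant and column names below are hypothetical.

```python
# Hedged workaround sketch: assumes a known fixed scale for every value.
# 1.25 -> 125 as int64, sum in the fast path, then 200 -> Decimal("2.00").
import decimal
import pandas as pd

SCALE = decimal.Decimal("0.01")  # assumed fixed scale of the data

df = pd.DataFrame({
    "key": [0, 1, 0, 1],
    "amount": [decimal.Decimal("1.25"), decimal.Decimal("2.50"),
               decimal.Decimal("0.75"), decimal.Decimal("0.10")],
})

# Exact integer representation; no float rounding is involved.
as_int = df["amount"].map(lambda d: int(d / SCALE)).astype("int64")

summed = as_int.groupby(df["key"]).sum()                    # fast int64 path
result = summed.map(lambda n: decimal.Decimal(n) * SCALE)   # back to Decimal
print(result)
```

This only works when the scale is uniform and the scaled values fit in int64; otherwise the object-dtype fallback (or an extension array) is still needed.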

joetl (Author) commented Feb 28, 2019

I am just looking at DecimalArray. We get a protobuf object, which we convert to a dict and then to a pandas dataframe. Is there a small example of converting the decimal columns to a DecimalArray when creating a pandas dataframe from a dictionary?
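A possible shape for this — again assuming the non-public DecimalArray from pandas' test suite, with a made-up dict standing in for the protobuf-derived data: wrap each decimal column in a DecimalArray while building the DataFrame from the dict.

```python
# Hypothetical sketch: `record` stands in for the dict converted from
# protobuf; DecimalArray is pandas' test-suite example array, not public API.
import decimal
import pandas as pd
from pandas.tests.extension.decimal import DecimalArray

record = {
    "qty": [1, 2],
    "price": [decimal.Decimal("9.99"), decimal.Decimal("1.50")],
}

df = pd.DataFrame({
    "qty": record["qty"],                      # plain int64 column
    "price": DecimalArray(record["price"]),    # decimal extension column
})
print(df.dtypes)
```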

@mroeschke mroeschke removed the Needs Info Clarification about behavior needed to assess issue label Mar 8, 2020
@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map ExtensionArray Extending pandas with custom dtypes or arrays. labels Jun 26, 2021
@jbrockmendel jbrockmendel removed the ExtensionArray Extending pandas with custom dtypes or arrays. label Jul 27, 2023

8 participants