DataFrame groupby is extremely slow when grouping by a column of pandas Period values #18053

Closed
nmusolino opened this issue Oct 31, 2017 · 7 comments
Labels: Performance (Memory or execution speed), Period (Period data type)

Comments


nmusolino commented Oct 31, 2017

Steps to reproduce

In [1]: import pandas

In [2]: import datetime

In [3]: months = [(2017, month) for month in range(1, 11)] * 10000

In [5]: month_pydates = pandas.Series([datetime.date(year, month, 1) for year, month in months])

In [7]: df = pandas.DataFrame({
    'x': list(range(len(months))),
    'month_periods': pandas.to_datetime(month_pydates).dt.to_period('M'),
    'month_pydates': month_pydates,
    'month_int': [year * 100 + month for year, month in months]})

In [8]: df.head()
Out[8]:
   month_int month_periods month_pydates  x
0     201701       2017-01    2017-01-01  0
1     201702       2017-02    2017-02-01  1
2     201703       2017-03    2017-03-01  2
3     201704       2017-04    2017-04-01  3
4     201705       2017-05    2017-05-01  4

In [9]: df.dtypes
Out[9]:
month_int         int64
month_periods    object
month_pydates    object
x                 int64
dtype: object

In [9]: df.loc[0, 'month_periods']
Out[9]: Period('2017-01', 'M')

In [10]: %timeit  df.groupby('month_int')['x'].sum()
100 loops, best of 3: 2.32 ms per loop

In [11]: %timeit  df.groupby('month_pydates')['x'].sum()
100 loops, best of 3: 6.7 ms per loop

In [12]: %timeit  df.groupby('month_periods')['x'].sum()
1 loop, best of 3: 2.37 s per loop

Problem description

When a DataFrame column contains pandas.Period values and the user groups by that column, the operation is extremely slow compared to grouping by columns of integers or of other Python objects.

In the example above, a DataFrame with 120,000 rows is created, and a groupby operation is performed on three columns. On the integer column, the groupby-sum took 2.3 milliseconds; on the column containing datetime.date objects, the groupby-sum took 6.7 milliseconds; and on the column containing pandas.Period objects, the groupby-sum took 2.4 seconds.

Note that in this case, the dtype of the 'month_periods' column is object. I attempted to convert this column to a period-specific data type using df['month_periods'].astype('period[M]'), but this led to a TypeError: data type "period[M]" not understood.

In any case, the series was returned by .dt.to_period('M'), so I would expect this to be a well-formed series of periods.
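For reference, pandas 0.24 later introduced a proper extension PeriodDtype, so on recent releases the astype('period[M]') conversion that failed above does work; a minimal sketch:

```python
import pandas as pd

months = [(2017, m) for m in range(1, 11)]
# Force object dtype to mirror the situation in the report.
s = pd.Series([pd.Period(year=y, month=m, freq="M") for y, m in months],
              dtype=object)

# On pandas >= 0.24 astype understands the period[M] extension dtype,
# so the object series of Periods converts cleanly.
converted = s.astype("period[M]")
assert str(converted.dtype) == "period[M]"
```

Once the column has the period dtype, groupby can use the array's integer codes instead of comparing Python objects.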

Expected Behavior

When grouping on a period column, it should be possible to group by the underlying integer values used for storing periods, and thus the performance should roughly match the performance of grouping by integers.

In the worst case, performance should match that of grouping by small Python objects (i.e. those with trivial __eq__ methods).

Workaround

Making the column categorical avoids the performance hit, and roughly matches the integer column performance:

In [21]: df['month_periods'] = df['month_periods'].astype('category')

In [22]: %timeit  df.groupby('month_periods')['x'].sum()
100 loops, best of 3: 1.97 ms per loop
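The categorical conversion is fast because it factorizes the Periods once, replacing repeated object comparisons with integer codes. A sketch of doing that factorization explicitly with pandas.factorize, on a small reconstruction of the example frame:

```python
import pandas as pd

months = [(2017, m) for m in range(1, 11)] * 100
dates = pd.to_datetime([f"{y}-{m:02d}-01" for y, m in months])
df = pd.DataFrame({
    "x": range(len(months)),
    "month_periods": pd.Series(dates).dt.to_period("M").astype(object),
})

# factorize labels each distinct Period exactly once, which is
# essentially what astype('category') does under the hood.
codes, uniques = pd.factorize(df["month_periods"])
result = df.groupby(codes)["x"].sum()
result.index = uniques[result.index]
assert len(result) == 10
```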

Output of pd.show_versions()


In [16]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

jreback commented Oct 31, 2017

of course, Periods are object dtypes.

@jreback added the Difficulty Intermediate, Performance (Memory or execution speed), and Period (Period data type) labels on Oct 31, 2017
@jreback added this to the Next Major Release milestone on Oct 31, 2017

jreback commented Oct 31, 2017

Well, it's also related to the offset caching issues that @jbrockmendel has been looking at:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   400016    1.299    0.000    1.859    0.000 offsets.py:297(_params)
   400016    0.273    0.000    0.273    0.000 offsets.py:302(<listcomp>)
   200008    0.242    0.000    2.166    0.000 offsets.py:352(__eq__)
        1    0.175    0.175    2.596    2.596 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
   200021    0.137    0.000    0.186    0.000 offsets.py:487(freqstr)
   400016    0.129    0.000    0.129    0.000 {built-in method builtins.sorted}
  1200048    0.098    0.000    0.098    0.000 {method 'items' of 'dict' objects}
   200008    0.068    0.000    2.234    0.000 offsets.py:366(__ne__)
   400334    0.065    0.000    0.065    0.000 {built-in method builtins.isinstance}
   400016    0.060    0.000    0.060    0.000 {built-in method builtins.vars}
   200021    0.029    0.000    0.029    0.000 offsets.py:483(rule_code)
   200021    0.021    0.000    0.021    0.000 offsets.py:508(_offset_str)

@jbrockmendel commented

We're a few steps away from having this fixed. In the interim the workaround I've been using is casting to strings, then groupby/sort/whatever, then cast back to Period.
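A sketch of that string round-trip workaround, on a small reconstruction of the example frame: cast the Periods to strings, group on the strings, then rebuild a PeriodIndex from the result.

```python
import pandas as pd

months = [(2017, m) for m in range(1, 11)] * 100
dates = pd.to_datetime([f"{y}-{m:02d}-01" for y, m in months])
df = pd.DataFrame({
    "x": range(len(months)),
    "month_periods": pd.Series(dates).dt.to_period("M").astype(object),
})

# Cast Periods to strings ("2017-01", ...), group on the strings,
# then convert the resulting index back to Periods.
as_str = df["month_periods"].astype(str)
result = df.groupby(as_str)["x"].sum()
result.index = pd.PeriodIndex(result.index, freq="M")
assert len(result) == 10
```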


nmusolino commented Oct 31, 2017

@jreback, it is fine that a series of pandas Periods has dtype object.

But grouping by pandas.Period objects is about 300 times slower than grouping by other series with dtype: object, such as series of datetime.date objects or simple tuples. (I'm comparing 2.4 seconds to about 7 milliseconds; see the second timing invocation in the original report, or the example below.)

In [25]: df['month_tuples'] = months

In [26]: df[['month_tuples', 'month_periods']].head()
Out[26]:
  month_tuples month_periods
0    (2017, 1)       2017-01
1    (2017, 2)       2017-02
2    (2017, 3)       2017-03
3    (2017, 4)       2017-04
4    (2017, 5)       2017-05

In [27]: %timeit  df.groupby('month_tuples')['x'].sum()
100 loops, best of 3: 7.18 ms per loop

In [28]: %timeit  df.groupby('month_periods')['x'].sum()
1 loop, best of 3: 2.36 s per loop

In [29]: df[['month_tuples', 'month_periods']].dtypes
Out[29]:
month_tuples     object
month_periods    object
dtype: object


jreback commented Oct 31, 2017

@nmusolino and you are welcome to have a look

@TomAugspurger commented

Some updated timings.

In [2]: %timeit  df.groupby('month_int')['x'].sum()
1.82 ms ± 65.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit  df.groupby('month_pydates')['x'].sum()
6.13 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [2]: %timeit  df.groupby('month_periods')['x'].sum()
161 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On the PeriodArray PR, we're down to

In [4]: %timeit  df.groupby('month_periods')['x'].sum()
6.2 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

I think that's good enough to call this closed, though @nmusolino, if you want to do more optimizations after #22862 lands, feel free.

@nmusolino commented

That's great news, thank you very much!

orenshk added a commit to orenshk/lifetimes that referenced this issue Aug 13, 2019
Pandas Period objects are slow to groupby:
pandas-dev/pandas#18053

Fix by using `dt.to_period` and converting to strings before groupby.