Skip to content

DataFrame.groupby() is terribly slow when grouping by multiple columns and one of them is Categorical #24333

@pawel-wiejacha

Description

@pawel-wiejacha

Code Sample, a copy-pastable example if possible

import pandas as pd

df_str = pd.DataFrame(dict(key_cat=['a', 'b', 'c', 'd'] * 250, key_num1=range(1000), key_num2=range(1000), val=[1]*1000))
df_cat = df_str.copy() 
df_cat.key_cat = pd.Categorical(df_cat.key_cat) 

%%time
df_cat.groupby(["key_cat", "key_num1", "key_num2"]).count()
# CPU times: user 10.8 s, sys: 3 s, total: 13.8 s
# Wall time: 13.8 s # 14 sec for a dataframe with 1000 elements?!!

%%time
df_str.groupby(["key_cat", "key_num1", "key_num2"]).count()
# CPU times: user 0 ns, sys: 4 ms, total: 4 ms
# Wall time: 4.05 ms # expected performance when using a column with string type

Problem description

Changing column dtype to categorical makes groupby() operation 3500 times slower.

The problem occurs both in pandas-0.23.4 and in pandas-0.24.0 (untagged.1.g216986d)

Expected Output

Similar execution times.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.6.3
pip: 18.0
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.8.0
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.0
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: 0.2.1
pandas_datareader: None

Profiling output

The output of %%prun -s cumtime:

         14574 function calls (14438 primitive calls) in 14.267 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.267   14.267 {built-in method builtins.exec}
        1    0.140    0.140   14.267   14.267 <string>:2(<module>)
        1    0.000    0.000   14.125   14.125 groupby.py:4843(count)
        1    0.000    0.000   14.122   14.122 groupby.py:4739(_wrap_agged_blocks)
        1    0.000    0.000   14.118   14.118 groupby.py:4757(_reindex_output)
        2    0.000    0.000   13.833    6.917 _decorators.py:185(wrapper)
        2    0.000    0.000   13.833    6.917 frame.py:3552(reindex)
        2    0.000    0.000   13.833    6.917 generic.py:3647(reindex)
        2    0.000    0.000   13.833    6.917 frame.py:3489(_reindex_axes)
        1    0.000    0.000   13.833   13.833 frame.py:3505(_reindex_index)
        1    0.000    0.000   13.823   13.823 multi.py:2023(reindex)
        1    0.000    0.000   13.823   13.823 multi.py:1981(get_indexer)
        1    3.043    3.043   13.823   13.823 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.000    0.000    7.333    7.333 category.py:560(get_indexer)
        1    1.277    1.277    6.440    6.440 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
      400    0.029    0.000    5.163    0.013 fromnumeric.py:1173(resize)
      403    5.120    0.013    5.120    0.013 {built-in method numpy.core.multiarray.concatenate}
        1    0.000    0.000    2.788    2.788 base.py:912(__iter__)
        1    0.000    0.000    2.788    2.788 base.py:893(tolist)
    33/29    0.009    0.000    2.743    0.095 base.py:677(_values)
        3    0.000    0.000    2.734    0.911 multi.py:807(values)
        1    2.621    2.621    2.621    2.621 {pandas._libs.lib.fast_zip}
       12    0.000    0.000    0.863    0.072 base.py:3219(get_indexer)
        1    0.008    0.008    0.687    0.687 category.py:247(equals)
       40    0.030    0.001    0.680    0.017 base.py:4914(_ensure_index)
    13/10    0.000    0.000    0.651    0.065 base.py:255(__new__)
        1    0.000    0.000    0.628    0.628 category.py:218(_is_dtype_compat)
        1    0.000    0.000    0.491    0.491 base.py:3471(isin)
      2/1    0.000    0.000    0.491    0.491 algorithms.py:384(isin)
        1    0.010    0.010    0.491    0.491 categorical.py:2292(isin)
        2    0.230    0.115    0.371    0.185 base.py:850(_try_convert_to_int_index)
       12    0.271    0.023    0.271    0.023 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
  504/500    0.202    0.000    0.263    0.001 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.248    0.248 multi.py:1878(sortlevel)
        1    0.008    0.008    0.238    0.238 algorithms.py:426(<lambda>)
        1    0.005    0.005    0.230    0.230 arraysetops.py:438(in1d)
        1    0.000    0.000    0.227    0.227 sorting.py:177(indexer_from_factorized)
        2    0.000    0.000    0.225    0.112 arraysetops.py:121(unique)
        2    0.049    0.024    0.225    0.112 arraysetops.py:268(_unique1d)
    84/82    0.000    0.000    0.209    0.003 numeric.py:433(asarray)
       11    0.208    0.019    0.208    0.019 {pandas._libs.lib.infer_dtype}
        8    0.000    0.000    0.151    0.019 common.py:301(_asarray_tuplesafe)
       14    0.000    0.000    0.150    0.011 category.py:139(_create_categorical)
       25    0.143    0.006    0.143    0.006 {method 'astype' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.143    0.143 sorting.py:321(get_group_index_sorter)
        1    0.143    0.143    0.143    0.143 {pandas._libs.algos.groupsort_indexer}
       12    0.000    0.000    0.138    0.012 categorical.py:267(__init__)
        2    0.000    0.000    0.136    0.068 categorical.py:2420(_get_codes_for_values)
    23/22    0.010    0.000    0.130    0.006 algorithms.py:1548(take_nd)
        2    0.001    0.000    0.117    0.058 missing.py:189(_isna_ndarraylike)
        2    0.000    0.000    0.117    0.058 missing.py:32(isna)
        2    0.000    0.000    0.117    0.058 missing.py:112(_isna_new)
        1    0.116    0.116    0.116    0.116 {pandas._libs.missing.isnaobj}
        8    0.000    0.000    0.081    0.010 algorithms.py:224(_get_data_algo)
        2    0.062    0.031    0.080    0.040 sorting.py:20(get_group_index)
       11    0.077    0.007    0.077    0.007 {method 'argsort' of 'numpy.ndarray' objects}
        3    0.000    0.000    0.062    0.021 categorical.py:1248(__array__)
        3    0.000    0.000    0.060    0.020 multi.py:1007(_get_level_values)
      437    0.055    0.000    0.055    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      406    0.001    0.000    0.055    0.000 fromnumeric.py:49(_wrapfunc)
        1    0.000    0.000    0.053    0.053 category.py:302(get_values)
        1    0.000    0.000    0.053    0.053 categorical.py:1451(get_values)
        2    0.053    0.026    0.053    0.026 {method 'lookup' of 'pandas._libs.hashtable.StringHashTable' objects}
        3    0.000    0.000    0.051    0.017 missing.py:376(array_equivalent)
        1    0.000    0.000    0.051    0.051 category.py:342(__array__)
       14    0.000    0.000    0.047    0.003 algorithms.py:48(_ensure_data)
        1    0.045    0.045    0.045    0.045 {method 'sort' of 'numpy.ndarray' objects}
        1    0.044    0.044    0.044    0.044 {method 'tolist' of 'numpy.ndarray' objects}
       36    0.000    0.000    0.040    0.001 common.py:1578(is_bool_dtype)
        3    0.000    0.000    0.040    0.013 base.py:1964(inferred_type)
        1    0.000    0.000    0.037    0.037 multi.py:1332(from_product)
        3    0.034    0.011    0.034    0.011 {pandas._libs.algos.take_1d_object_object}
       38    0.033    0.001    0.033    0.001 {built-in method pandas._libs.algos.ensure_int64}
        2    0.011    0.006    0.033    0.016 multi.py:58(_codes_to_ints)
        2    0.000    0.000    0.032    0.016 fromnumeric.py:2092(cumsum)
        2    0.032    0.016    0.032    0.016 {method 'cumsum' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.024    0.024 util.py:16(cartesian_product)
        1    0.000    0.000    0.023    0.023 util.py:63(<listcomp>)
        5    0.022    0.004    0.022    0.004 {method 'repeat' of 'numpy.ndarray' objects}
        2    0.022    0.011    0.022    0.011 {method 'flatten' of 'numpy.ndarray' objects}
        3    0.000    0.000    0.020    0.007 fromnumeric.py:404(repeat)
        7    0.020    0.003    0.020    0.003 {method 'take' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.020    0.020 multi.py:1940(<listcomp>)
        6    0.000    0.000    0.018    0.003 {method 'max' of 'numpy.ndarray' objects}
        6    0.000    0.000    0.018    0.003 _methods.py:26(_amax)
        1    0.000    0.000    0.018    0.018 categorical.py:1774(take_nd)
       39    0.015    0.000    0.015    0.000 {built-in method numpy.core.multiarray.empty}
        4    0.013    0.003    0.014    0.004 sorting.py:55(maybe_lift)
        4    0.000    0.000    0.013    0.003 multi.py:212(__new__)
       14    0.000    0.000    0.013    0.001 base.py:510(_shallow_copy)
        5    0.000    0.000    0.013    0.003 category.py:193(_shallow_copy)
        9    0.000    0.000    0.013    0.001 category.py:178(_simple_new)
        5    0.000    0.000    0.013    0.003 categorical.py:717(_set_dtype)
       10    0.013    0.001    0.013    0.001 {pandas._libs.algos.take_1d_int64_int64}
        6    0.000    0.000    0.013    0.002 categorical.py:2437(_recode_for_categories)
        1    0.000    0.000    0.012    0.012 algorithms.py:1454(take)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions