-
-
Notifications
You must be signed in to change notification settings - Fork 19k
Description
Code Sample, a copy-pastable example if possible
import pandas as pd
df_str = pd.DataFrame(dict(key_cat=['a', 'b', 'c', 'd'] * 250, key_num1=range(1000), key_num2=range(1000), val=[1]*1000))
df_cat = df_str.copy()
df_cat.key_cat = pd.Categorical(df_cat.key_cat)
%%time
df_cat.groupby(["key_cat", "key_num1", "key_num2"]).count()
# CPU times: user 10.8 s, sys: 3 s, total: 13.8 s
# Wall time: 13.8 s # 14 sec for a dataframe with 1000 elements?!!
%%time
df_str.groupby(["key_cat", "key_num1", "key_num2"]).count()
# CPU times: user 0 ns, sys: 4 ms, total: 4 ms
# Wall time: 4.05 ms # expected performance when using a column with string type
Problem description
Changing column dtype to categorical makes groupby()
operation 3500 times slower.
The problem occurs both in pandas-0.23.4 and in pandas-0.24.0 (untagged.1.g216986d)
Expected Output
Similar execution times.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.6.3
pip: 18.0
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.8.0
xarray: None
IPython: 6.3.1
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.0
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: 0.2.1
pandas_datareader: None
Profiling output
The output of %%prun -s cumtime
:
14574 function calls (14438 primitive calls) in 14.267 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 14.267 14.267 {built-in method builtins.exec}
1 0.140 0.140 14.267 14.267 <string>:2(<module>)
1 0.000 0.000 14.125 14.125 groupby.py:4843(count)
1 0.000 0.000 14.122 14.122 groupby.py:4739(_wrap_agged_blocks)
1 0.000 0.000 14.118 14.118 groupby.py:4757(_reindex_output)
2 0.000 0.000 13.833 6.917 _decorators.py:185(wrapper)
2 0.000 0.000 13.833 6.917 frame.py:3552(reindex)
2 0.000 0.000 13.833 6.917 generic.py:3647(reindex)
2 0.000 0.000 13.833 6.917 frame.py:3489(_reindex_axes)
1 0.000 0.000 13.833 13.833 frame.py:3505(_reindex_index)
1 0.000 0.000 13.823 13.823 multi.py:2023(reindex)
1 0.000 0.000 13.823 13.823 multi.py:1981(get_indexer)
1 3.043 3.043 13.823 13.823 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
1 0.000 0.000 7.333 7.333 category.py:560(get_indexer)
1 1.277 1.277 6.440 6.440 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
400 0.029 0.000 5.163 0.013 fromnumeric.py:1173(resize)
403 5.120 0.013 5.120 0.013 {built-in method numpy.core.multiarray.concatenate}
1 0.000 0.000 2.788 2.788 base.py:912(__iter__)
1 0.000 0.000 2.788 2.788 base.py:893(tolist)
33/29 0.009 0.000 2.743 0.095 base.py:677(_values)
3 0.000 0.000 2.734 0.911 multi.py:807(values)
1 2.621 2.621 2.621 2.621 {pandas._libs.lib.fast_zip}
12 0.000 0.000 0.863 0.072 base.py:3219(get_indexer)
1 0.008 0.008 0.687 0.687 category.py:247(equals)
40 0.030 0.001 0.680 0.017 base.py:4914(_ensure_index)
13/10 0.000 0.000 0.651 0.065 base.py:255(__new__)
1 0.000 0.000 0.628 0.628 category.py:218(_is_dtype_compat)
1 0.000 0.000 0.491 0.491 base.py:3471(isin)
2/1 0.000 0.000 0.491 0.491 algorithms.py:384(isin)
1 0.010 0.010 0.491 0.491 categorical.py:2292(isin)
2 0.230 0.115 0.371 0.185 base.py:850(_try_convert_to_int_index)
12 0.271 0.023 0.271 0.023 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
504/500 0.202 0.000 0.263 0.001 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.248 0.248 multi.py:1878(sortlevel)
1 0.008 0.008 0.238 0.238 algorithms.py:426(<lambda>)
1 0.005 0.005 0.230 0.230 arraysetops.py:438(in1d)
1 0.000 0.000 0.227 0.227 sorting.py:177(indexer_from_factorized)
2 0.000 0.000 0.225 0.112 arraysetops.py:121(unique)
2 0.049 0.024 0.225 0.112 arraysetops.py:268(_unique1d)
84/82 0.000 0.000 0.209 0.003 numeric.py:433(asarray)
11 0.208 0.019 0.208 0.019 {pandas._libs.lib.infer_dtype}
8 0.000 0.000 0.151 0.019 common.py:301(_asarray_tuplesafe)
14 0.000 0.000 0.150 0.011 category.py:139(_create_categorical)
25 0.143 0.006 0.143 0.006 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.143 0.143 sorting.py:321(get_group_index_sorter)
1 0.143 0.143 0.143 0.143 {pandas._libs.algos.groupsort_indexer}
12 0.000 0.000 0.138 0.012 categorical.py:267(__init__)
2 0.000 0.000 0.136 0.068 categorical.py:2420(_get_codes_for_values)
23/22 0.010 0.000 0.130 0.006 algorithms.py:1548(take_nd)
2 0.001 0.000 0.117 0.058 missing.py:189(_isna_ndarraylike)
2 0.000 0.000 0.117 0.058 missing.py:32(isna)
2 0.000 0.000 0.117 0.058 missing.py:112(_isna_new)
1 0.116 0.116 0.116 0.116 {pandas._libs.missing.isnaobj}
8 0.000 0.000 0.081 0.010 algorithms.py:224(_get_data_algo)
2 0.062 0.031 0.080 0.040 sorting.py:20(get_group_index)
11 0.077 0.007 0.077 0.007 {method 'argsort' of 'numpy.ndarray' objects}
3 0.000 0.000 0.062 0.021 categorical.py:1248(__array__)
3 0.000 0.000 0.060 0.020 multi.py:1007(_get_level_values)
437 0.055 0.000 0.055 0.000 {method 'reduce' of 'numpy.ufunc' objects}
406 0.001 0.000 0.055 0.000 fromnumeric.py:49(_wrapfunc)
1 0.000 0.000 0.053 0.053 category.py:302(get_values)
1 0.000 0.000 0.053 0.053 categorical.py:1451(get_values)
2 0.053 0.026 0.053 0.026 {method 'lookup' of 'pandas._libs.hashtable.StringHashTable' objects}
3 0.000 0.000 0.051 0.017 missing.py:376(array_equivalent)
1 0.000 0.000 0.051 0.051 category.py:342(__array__)
14 0.000 0.000 0.047 0.003 algorithms.py:48(_ensure_data)
1 0.045 0.045 0.045 0.045 {method 'sort' of 'numpy.ndarray' objects}
1 0.044 0.044 0.044 0.044 {method 'tolist' of 'numpy.ndarray' objects}
36 0.000 0.000 0.040 0.001 common.py:1578(is_bool_dtype)
3 0.000 0.000 0.040 0.013 base.py:1964(inferred_type)
1 0.000 0.000 0.037 0.037 multi.py:1332(from_product)
3 0.034 0.011 0.034 0.011 {pandas._libs.algos.take_1d_object_object}
38 0.033 0.001 0.033 0.001 {built-in method pandas._libs.algos.ensure_int64}
2 0.011 0.006 0.033 0.016 multi.py:58(_codes_to_ints)
2 0.000 0.000 0.032 0.016 fromnumeric.py:2092(cumsum)
2 0.032 0.016 0.032 0.016 {method 'cumsum' of 'numpy.ndarray' objects}
1 0.000 0.000 0.024 0.024 util.py:16(cartesian_product)
1 0.000 0.000 0.023 0.023 util.py:63(<listcomp>)
5 0.022 0.004 0.022 0.004 {method 'repeat' of 'numpy.ndarray' objects}
2 0.022 0.011 0.022 0.011 {method 'flatten' of 'numpy.ndarray' objects}
3 0.000 0.000 0.020 0.007 fromnumeric.py:404(repeat)
7 0.020 0.003 0.020 0.003 {method 'take' of 'numpy.ndarray' objects}
1 0.000 0.000 0.020 0.020 multi.py:1940(<listcomp>)
6 0.000 0.000 0.018 0.003 {method 'max' of 'numpy.ndarray' objects}
6 0.000 0.000 0.018 0.003 _methods.py:26(_amax)
1 0.000 0.000 0.018 0.018 categorical.py:1774(take_nd)
39 0.015 0.000 0.015 0.000 {built-in method numpy.core.multiarray.empty}
4 0.013 0.003 0.014 0.004 sorting.py:55(maybe_lift)
4 0.000 0.000 0.013 0.003 multi.py:212(__new__)
14 0.000 0.000 0.013 0.001 base.py:510(_shallow_copy)
5 0.000 0.000 0.013 0.003 category.py:193(_shallow_copy)
9 0.000 0.000 0.013 0.001 category.py:178(_simple_new)
5 0.000 0.000 0.013 0.003 categorical.py:717(_set_dtype)
10 0.013 0.001 0.013 0.001 {pandas._libs.algos.take_1d_int64_int64}
6 0.000 0.000 0.013 0.002 categorical.py:2437(_recode_for_categories)
1 0.000 0.000 0.012 0.012 algorithms.py:1454(take)