PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395

Closed
topper-123 opened this issue Mar 17, 2018 · 2 comments

@topper-123
Contributor

commented Mar 17, 2018

EDIT: After #21369 was merged the result of %timeit df2.loc['b'] has improved to 3.8 ms.
EDIT: After #21618 was merged the result of %timeit df2.loc['b'] has improved to 3.3 ms.
EDIT: After #21659 was merged the result of %timeit df2.loc['b'] has improved to 1.6 ms.
EDIT: After #23235 was merged the result of %timeit df2.loc['b'] has improved to 159 µs. Issue resolved.

Code Sample

>>> import pandas as pd
>>> n = 100_000
>>> df1 = pd.DataFrame(dict(A=range(n*3)), index=list('a'*n + 'b'*n + 'c'*n))
>>> df1.index.is_monotonic_increasing
True
>>> df2 = df1.copy()
>>> df2.index = pd.CategoricalIndex(df2.index)
>>> df2.index.is_monotonic_increasing
True
>>> %timeit df1.loc['b']
125 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df2.loc['b']
13.8 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
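For anyone who wants to reproduce this outside IPython, here is a rough standalone sketch of the same comparison using the standard-library `timeit` module. The helper names `build_frames` and `time_loc` are made up for illustration, and the frame is smaller than in the session above so it runs quickly; absolute numbers will depend on the pandas version and machine.

```python
import timeit

import pandas as pd


def build_frames(n: int):
    """Build the same frame twice: plain Index and CategoricalIndex."""
    labels = list("a" * n + "b" * n + "c" * n)
    df_plain = pd.DataFrame({"A": range(n * 3)}, index=labels)
    df_cat = df_plain.copy()
    df_cat.index = pd.CategoricalIndex(df_cat.index)
    return df_plain, df_cat


def time_loc(df: pd.DataFrame, label: str, number: int = 100) -> float:
    """Return the mean seconds per `df.loc[label]` call."""
    return timeit.timeit(lambda: df.loc[label], number=number) / number


df_plain, df_cat = build_frames(10_000)
print(f"Index:            {time_loc(df_plain, 'b') * 1e6:.1f} µs")
print(f"CategoricalIndex: {time_loc(df_cat, 'b') * 1e6:.1f} µs")
```

Both frames hold identical data, so any difference in the printed timings comes from the index type alone.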

Problem description

Selecting on a CategoricalIndex is roughly 100x slower than selecting on a normal Index (13.8 ms vs. 125 µs in the sample above).

I've tested this on master (a few days old) and on v0.22, with the same result on both. The lookup is even slower than a full column scan:

>>> df3 = df2.reset_index()
>>> %timeit df3[df3['index'] == 'b']
6.58 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My guess is that the binary search is bypassed and a full index scan is done instead, plus some extra work, which would explain why it is even slower than a plain full column scan.
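To make the "binary search" idea concrete, here is a minimal sketch of what such a lookup could look like on a monotonic CategoricalIndex, using only public API (`categories.get_loc`, `codes`) and `numpy.searchsorted`. The function name `sorted_codes_slice` is hypothetical, and this is not a claim about how pandas implements `.loc` internally; it only assumes the codes are sorted and contain no missing values.

```python
import numpy as np
import pandas as pd


def sorted_codes_slice(index: pd.CategoricalIndex, label) -> slice:
    """Locate `label` in a monotonic CategoricalIndex via binary search."""
    # Map the label to its integer code (its position in the categories).
    code = index.categories.get_loc(label)
    codes = np.asarray(index.codes)
    # Two binary searches give the half-open range of matching rows.
    start = int(np.searchsorted(codes, code, side="left"))
    stop = int(np.searchsorted(codes, code, side="right"))
    return slice(start, stop)


n = 1_000
idx = pd.CategoricalIndex(list("a" * n + "b" * n + "c" * n))
rows = sorted_codes_slice(idx, "b")
```

The resulting slice can be passed to `df.iloc[...]`, so the cost is two O(log n) searches rather than a comparison against every row.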

Expected Output

The output is correct, but the lookup is very slow for CategoricalIndex.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: a7a7f8c
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0.dev0+870.ga7a7f8c
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.5
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@topper-123

Contributor Author

commented May 9, 2018

I've retested this on RC2, and the issue is still there. Can anyone confirm it?

Is this a known limitation of using CategoricalIndex rather than Index?

@david-liu-brattle-1

Contributor

commented May 18, 2018

It looks to me like CategoricalIndex is slower at every step of the way, even with the improvements in #21022:

key = df1.loc.obj.index._engine.get_loc('b')
result = df1.loc.obj.iloc[key]
%timeit key = df1.loc.obj.index._engine.get_loc('b')

9.86 µs ± 635 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit result = df1.loc.obj.iloc[key]

73.9 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

codes = df2.loc.obj.index.categories.get_loc('b')
key = df2.loc.obj.index._engine.get_loc(codes)
result = df2.loc.obj.iloc[key]
%timeit codes = df2.loc.obj.index.categories.get_loc('b')
%timeit key = df2.loc.obj.index._engine.get_loc(codes)

2.15 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
857 µs ± 144 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit result = df2.loc.obj.iloc[key]

429 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
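The breakdown above goes through the private `_engine` attribute. As a rough alternative that sticks to public API, the sketch below times the two broad stages separately: resolving the label with `Index.get_loc`, then taking the rows with `iloc`. The helper name `time_steps` is made up for illustration, and the split is coarser than the one above since `get_loc` covers both the category lookup and the engine lookup.

```python
import timeit

import pandas as pd


def time_steps(df: pd.DataFrame, label, number: int = 100) -> dict:
    """Time label resolution and row selection separately (seconds per call)."""
    index = df.index
    # get_loc may return an int, a slice, or a boolean mask depending on the index.
    key = index.get_loc(label)
    lookup = timeit.timeit(lambda: index.get_loc(label), number=number) / number
    take = timeit.timeit(lambda: df.iloc[key], number=number) / number
    return {"get_loc": lookup, "iloc": take}


n = 1_000
labels = list("a" * n + "b" * n + "c" * n)
df_plain = pd.DataFrame({"A": range(n * 3)}, index=labels)
df_cat = df_plain.copy()
df_cat.index = pd.CategoricalIndex(df_cat.index)

for name, frame in [("Index", df_plain), ("CategoricalIndex", df_cat)]:
    steps = time_steps(frame, "b")
    print(name, {k: f"{v * 1e6:.1f} µs" for k, v in steps.items()})
```

Comparing the two printed dictionaries shows which stage accounts for the gap on a given pandas version.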
