PERF: remove_unused_levels is very slow #16556

Closed
rhendric opened this Issue May 31, 2017 · 6 comments

Comments

Projects
None yet
2 participants
@rhendric
Contributor

rhendric commented May 31, 2017

Code Sample

import numpy as np
import pandas as pd
series = pd.DataFrame(dict(
    A=np.random.randint(0, 10000, 100000),
    B=np.random.randint(0, 10000, 100000),
    V=np.random.rand(100000))).groupby(['A', 'B']).V.sum()
filtered_series = series[series < 0.1]
%time x = filtered_series.index.remove_unused_levels()
%time y = filtered_series.reset_index().set_index(['A', 'B']).index

Problem description

On my laptop, x takes 20 to 40 times as long as y, despite y doing the extra work of sorting the second level and reindexing the series in the process. The outputs, except for the sorting of the second level, are identical. Why is remove_unused_levels so slow?

Expected Output

remove_unused_levels should be at least as fast on large indexes as the reset_index/set_index hack.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.2-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 31, 2017

Contributor

look at my comment here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/multi.py#L1314

you are welcome to provide another implementation if you'd like.

Contributor

jreback commented May 31, 2017

look at my comment here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/multi.py#L1314

you are welcome to provide another implementation if you'd like.

@jreback jreback added this to the Next Major Release milestone May 31, 2017

@jreback jreback changed the title from remove_unused_levels is very slow to PERF: remove_unused_levels is very slow May 31, 2017

@rhendric

This comment has been minimized.

Show comment
Hide comment
@rhendric

rhendric May 31, 2017

Contributor

Cool, I may take a swing at that. Would a replacement implementation have to produce level lists with the same order as the current implementation, or does that not matter as long as .values remains the same as the input index? Docstring reads slightly ambiguously to me; it says "same .values and ordering", which I take to mean ordering of .values but might mean ordering of levels.

Contributor

rhendric commented May 31, 2017

Cool, I may take a swing at that. Would a replacement implementation have to produce level lists with the same order as the current implementation, or does that not matter as long as .values remains the same as the input index? Docstring reads slightly ambiguously to me; it says "same .values and ordering", which I take to mean ordering of .values but might mean ordering of levels.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 31, 2017

Contributor

@rhendric

I think you DO need to produce the same orderings, otherwise things won't be sorted. But not 100% sure this is a strict guarantee. We have some tests which exercise this (prob need a few more).

The before and after have to be equal. So technically your example doesn't meet this test. However it may be possible to recompute the missing levels, then simply reorder them. (again the internals), not the .values

Contributor

jreback commented May 31, 2017

@rhendric

I think you DO need to produce the same orderings, otherwise things won't be sorted. But not 100% sure this is a strict guarantee. We have some tests which exercise this (prob need a few more).

The before and after have to be equal. So technically your example doesn't meet this test. However it may be possible to recompute the missing levels, then simply reorder them. (again the internals), not the .values

@rhendric

This comment has been minimized.

Show comment
Hide comment
@rhendric

rhendric May 31, 2017

Contributor

Hm. Then I think there's not just a performance issue, but an actual bug. filtered_series.index.remove_unused_levels().equals(filtered_series.index) is false for me, when prepared as above.

Should I open a second issue or will this one serve for both?

Contributor

rhendric commented May 31, 2017

Hm. Then I think there's not just a performance issue, but an actual bug. filtered_series.index.remove_unused_levels().equals(filtered_series.index) is false for me, when prepared as above.

Should I open a second issue or will this one serve for both?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 1, 2017

Contributor

no these won't compare equal
the level indexers have changed

Contributor

jreback commented Jun 1, 2017

no these won't compare equal
the level indexers have changed

@rhendric

This comment has been minimized.

Show comment
Hide comment
@rhendric

rhendric Jun 1, 2017

Contributor

I think you're misreading me? I'm comparing the input and output of remove_unused_levels here, not comparing it with the different index from my original example.

Contributor

rhendric commented Jun 1, 2017

I think you're misreading me? I'm comparing the input and output of remove_unused_levels here, not comparing it with the different index from my original example.

rhendric added a commit to rhendric/pandas that referenced this issue Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Add inplace functionality

* Always return (if not inplace) at least a view instead of the original
index

rhendric added a commit to rhendric/pandas that referenced this issue Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Add inplace functionality

* Always return (if not inplace) at least a view instead of the original
index

@jreback jreback modified the milestones: 0.20.2, Next Major Release Jun 1, 2017

rhendric added a commit to rhendric/pandas that referenced this issue Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Always return at least a view instead of the original index

rhendric added a commit to rhendric/pandas that referenced this issue Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Always return at least a view instead of the original index

rhendric added a commit to rhendric/pandas that referenced this issue Jun 1, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Always return at least a view instead of the original index

rhendric added a commit to rhendric/pandas that referenced this issue Jun 2, 2017

BUG: reimplement MultiIndex.remove_unused_levels
* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix #16556, a performance issue with the previous implementation

* Always return at least a view instead of the original index

@jreback jreback closed this in #16565 Jun 2, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment