Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: MultiIndex._shallow_copy #32669

Merged
merged 3 commits into from
Mar 14, 2020

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Mar 12, 2020

Improves performance of MultiIndex._shallow_copy. Example:

>>> n = 100_000
>>> df = pd.DataFrame({'a': range(n), 'b': range(1, n+1)})
>>> mi = pd.MultiIndex.from_frame(df)
>>> mi.is_lexsorted()
True
>>> mi.get_loc(mi[0])  # also sets up the cache
>>> %timeit mi._shallow_copy().get_loc(mi[0])
8.56 ms ± 127 µs per loop  # master
75.1 µs ± 2.3 µs per loop  # this PR, first commit
46.9 µs ± 792 ns per loop  # this PR, second commit

Also adds tests for _shallow_copy for all index types. This ensures that this issue has been resolved for all index types.

@topper-123 topper-123 force-pushed the perf_mulit_index_shallow_copy branch 2 times, most recently from d13a162 to b22bca4 Compare March 12, 2020 22:50
@@ -689,7 +690,7 @@ def __len__(self) -> int:
# --------------------------------------------------------------------
# Levels Methods

@cache_readonly
@property
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the levels are actually not readonly, bacise each level's name attribute is writeable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't mind treating it as immutable, it was just that the test using pandas/tests/indexes/multi/test_names.py::check_level_names failed.

CategoricalIndex and CategoricalDtype actually have the same issue already:

>>> dtype = pd.CategoricalDtype(['a', 'b', 'c'])
>>> dtype.categories.name = "x"  # dtypes a re immutable, this is not supposed to work
>>> dtype.categories
Index(['a', 'b', 'c'], dtype='object', name='x')

IMO this issue is a wart, but nothing serious, because MultiIndex.names are now seperate from the levels, so it doesn't interfere with anything and the same issue is the same as with CategoricalIndex.categories.name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the history here? Did we deliberately make this cache_readonly to avoid a performance regression.

I also don't see why check_level_names would be failing when this is cache_readonly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was changed in #31651 because of a performance regression.

I've made a new version that does a shallow_copy of _cache["levels"] also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using cache_readonly is faster than property. I've updated the example to reflect that.

@topper-123 topper-123 added the Performance Memory or execution speed performance label Mar 13, 2020
@topper-123 topper-123 added this to the 1.1 milestone Mar 13, 2020
@@ -991,7 +992,13 @@ def _shallow_copy(self, values=None, **kwargs):
# discards freq
kwargs.pop("freq", None)
return MultiIndex.from_tuples(values, names=names, **kwargs)
return self.copy(**kwargs)

result = self.copy(**kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment pointing back to this thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@topper-123 topper-123 force-pushed the perf_mulit_index_shallow_copy branch 4 times, most recently from 61dc8cd to 3612860 Compare March 13, 2020 22:43
@topper-123
Copy link
Contributor Author

topper-123 commented Mar 13, 2020

Updated.

It was needlessly complex to do ops in the "levels" in the cache. Simpler to delete the "levels" key and just let it be recreated if/when the .levels attribute is accessed in the new shallow_copied index.

@topper-123 topper-123 force-pushed the perf_mulit_index_shallow_copy branch from 3612860 to 9c5a56c Compare March 14, 2020 06:55
@jreback jreback merged commit 8111d64 into pandas-dev:master Mar 14, 2020
@jreback
Copy link
Contributor

jreback commented Mar 14, 2020

thanks @topper-123 very nice

@topper-123 topper-123 deleted the perf_mulit_index_shallow_copy branch March 14, 2020 16:47
@topper-123
Copy link
Contributor Author

Yeah, this potentially makes copying thing around much cheaper. Will be interesting to see if there are some specific use cases where performances improve significantly.

@jacobaustin123
Copy link
Contributor

This PR has introduced a small bug. I'm not sure how deep it goes, but I've encountered it in PeriodIndex objects inside a MultiIndex. Here is a minimal reproduction:

import pandas as pd
idx = pd.MultiIndex.from_arrays([pd.PeriodIndex([pd.Period("2019Q1"), pd.Period("2019Q2")], name='b')])
idx2 = pd.MultiIndex.from_arrays([idx._get_level_values(level) for level in range(idx.nlevels)])
print(all(x.is_monotonic for x in idx2.levels)) # raises an error

There's a simple fix for this, which involves changing pandas/core/indexes/period.py:260 to result._cache = {}. I don't understand the internals enough to know what is causing this issue or whether it extends farther than this case. The problem seems to be that the _cache caches the _engine attribute, and since we no longer clear the cache when copying the object using _simple_new, _engine is left in an invalid state. The error is caused by idx._simple_new()._engine.vgetter() returning None, which causes issues in is_monotonic, among other things.

@topper-123
Copy link
Contributor Author

Thanks for the report, @jacobaustin123. This is a bit tricky as you mention, but is caused by the use of weakref(self) is PeriodIndex._engine, where the weakreffed self in this case is released too early.

This can be solved by deleting PeriodIndex._engine and falling back to the base Index._engine.

@jbrockmendel , I think you made the new Index._engine method including Index._get_engine_target. Do you agree that PeriodIndex._engine should just be deleted?

@jbrockmendel
Copy link
Member

Do you agree that PeriodIndex._engine should just be deleted?

Eventually, yes. The reason why PeriodIndex._engine exists is because PeriodEngine needs to get the .freq attribute of the PeriodIndex. However, the PeriodEngine methods that actually use .freq are never actually reached, because the relevant PeriodIndex methods route around them, and it isn't obvious how intentional that is.

So it should be removed, but there is some untangling to do first.

@topper-123
Copy link
Contributor Author

topper-123 commented Mar 29, 2020

Ok. a fix to the current problem seems to be to replace weakref(self) with weakref(self._values) (haven't run the test suite yet, though). We could do that until the fix, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants