PERF: MultiIndex._shallow_copy #32669

topper-123 · 2020-03-12T22:48:47Z

xref PERF: Index._shallow_copy doesn't copy ._engine #28584, PERF: copy cached attributes on index shallow_copy #32568, PERF: copy cached attributes on extension index shallow_copy #32640
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Improves performance of MultiIndex._shallow_copy. Example:

>>> n = 100_000
>>> df = pd.DataFrame({'a': range(n), 'b': range(1, n+1)})
>>> mi = pd.MultiIndex.from_frame(df)
>>> mi.is_lexsorted()
True
>>> mi.get_loc(mi[0])  # also sets up the cache
>>> %timeit mi._shallow_copy().get_loc(mi[0])
8.56 ms ± 127 µs per loop  # master
75.1 µs ± 2.3 µs per loop  # this PR, first commit
46.9 µs ± 792 ns per loop  # this PR, second commit

Also adds tests for _shallow_copy for all index types. This ensures that this issue has been resolved for all index types.

topper-123 · 2020-03-12T23:00:24Z

pandas/core/indexes/multi.py

@@ -689,7 +690,7 @@ def __len__(self) -> int:
    # --------------------------------------------------------------------
    # Levels Methods

-    @cache_readonly
+    @property


The levels are actually not read-only, because each level's name attribute is writable.

Hmm I thought these were readonly as of 1.0 though?

https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#backwards-incompatible-api-changes

@TomAugspurger

I actually don't mind treating it as immutable, it was just that the test using pandas/tests/indexes/multi/test_names.py::check_level_names failed.

CategoricalIndex and CategoricalDtype actually have the same issue already:

>>> dtype = pd.CategoricalDtype(['a', 'b', 'c'])
>>> dtype.categories.name = "x"  # dtypes are immutable, this is not supposed to work
>>> dtype.categories
Index(['a', 'b', 'c'], dtype='object', name='x')

IMO this issue is a wart, but nothing serious, because MultiIndex.names are now separate from the levels, so it doesn't interfere with anything, and it is the same issue as with CategoricalIndex.categories.name.

Can you check the history here? Did we deliberately make this cache_readonly to avoid a performance regression?

I also don't see why check_level_names would be failing when this is cache_readonly.

Yeah it was changed in #31651 because of a performance regression.

I've made a new version that does a shallow_copy of _cache["levels"] also.

Using cache_readonly is faster than property. I've updated the example to reflect that.
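To illustrate why a cached attribute tends to beat a plain property here, the following is a minimal sketch. It uses the standard library's functools.cached_property as a stand-in for pandas' cache_readonly (an assumption, the two are not identical), and the ToyLevels class and its attribute names are hypothetical.

```python
from functools import cached_property


class ToyLevels:
    """Hypothetical object that derives 'levels' from raw columns."""

    def __init__(self, columns):
        self._columns = columns
        self.property_calls = 0
        self.cached_calls = 0

    @property
    def levels_property(self):
        # Recomputed on every attribute access.
        self.property_calls += 1
        return [sorted(set(col)) for col in self._columns]

    @cached_property
    def levels_cached(self):
        # Computed once, then stored on the instance.
        self.cached_calls += 1
        return [sorted(set(col)) for col in self._columns]


obj = ToyLevels([[2, 1, 2], ["b", "a", "b"]])
for _ in range(3):
    obj.levels_property
    obj.levels_cached
```

After three accesses the property body has run three times and the cached body once, which is the kind of saving the comment above refers to.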

jbrockmendel · 2020-03-13T16:28:41Z

pandas/core/indexes/multi.py

@@ -991,7 +992,13 @@ def _shallow_copy(self, values=None, **kwargs):
            # discards freq
            kwargs.pop("freq", None)
            return MultiIndex.from_tuples(values, names=names, **kwargs)
-        return self.copy(**kwargs)
+
+        result = self.copy(**kwargs)


comment pointing back to this thread?

topper-123 · 2020-03-13T22:50:56Z

Updated.

It was needlessly complex to do ops in the "levels" in the cache. Simpler to delete the "levels" key and just let it be recreated if/when the .levels attribute is accessed in the new shallow_copied index.
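The approach described above can be sketched roughly as follows. This is a toy model, not the pandas implementation: ToyIndex, its dict-based _cache, and the shallow_copy method are hypothetical names chosen for illustration.

```python
class ToyIndex:
    """Hypothetical index with a dict-based cache of derived attributes."""

    def __init__(self, codes, names):
        self.codes = codes
        self.names = names
        self._cache = {}

    @property
    def levels(self):
        # Built lazily and stored in the cache on first access.
        if "levels" not in self._cache:
            self._cache["levels"] = [sorted(set(c)) for c in self.codes]
        return self._cache["levels"]

    @property
    def engine(self):
        # Stand-in for an expensive lookup structure.
        if "engine" not in self._cache:
            rows = zip(*self.codes)
            self._cache["engine"] = {row: i for i, row in enumerate(rows)}
        return self._cache["engine"]

    def shallow_copy(self):
        result = ToyIndex(self.codes, list(self.names))
        # Reuse the cached attributes so they need not be rebuilt ...
        result._cache = self._cache.copy()
        # ... except "levels", which is dropped and rebuilt on demand.
        result._cache.pop("levels", None)
        return result


idx = ToyIndex([[0, 1, 2], [1, 2, 3]], ["a", "b"])
idx.levels
idx.engine
new = idx.shallow_copy()
```

In this sketch the copy shares the cached engine with the original, while its levels are recreated the first time new.levels is accessed.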

jreback · 2020-03-14T16:29:29Z

thanks @topper-123 very nice

topper-123 · 2020-03-14T16:57:33Z

Yeah, this potentially makes copying things around much cheaper. It will be interesting to see if there are specific use cases where performance improves significantly.

jacobaustin123 · 2020-03-28T22:00:18Z

This PR has introduced a small bug. I'm not sure how deep it goes, but I've encountered it in PeriodIndex objects inside a MultiIndex. Here is a minimal reproduction:

import pandas as pd
idx = pd.MultiIndex.from_arrays([pd.PeriodIndex([pd.Period("2019Q1"), pd.Period("2019Q2")], name='b')])
idx2 = pd.MultiIndex.from_arrays([idx._get_level_values(level) for level in range(idx.nlevels)])
print(all(x.is_monotonic for x in idx2.levels)) # raises an error

There's a simple fix for this, which involves changing pandas/core/indexes/period.py:260 to result._cache = {}. I don't understand the internals enough to know what is causing this issue or whether it extends farther than this case. The problem seems to be that the _cache caches the _engine attribute, and since we no longer clear the cache when copying the object using _simple_new, _engine is left in an invalid state. The error is caused by idx._simple_new()._engine.vgetter() returning None, which causes issues in is_monotonic, among other things.

topper-123 · 2020-03-29T12:00:34Z

Thanks for the report, @jacobaustin123. This is a bit tricky as you mention, but is caused by the use of weakref(self) in PeriodIndex._engine, where the weakreffed self in this case is released too early.
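The failure mode can be reproduced in miniature without pandas. The sketch below is an assumption-laden toy: ToyEngine, ToyPeriodIndex and shallow_copy are hypothetical names, and the engine simply returns None once its weakly referenced owner is gone, loosely mirroring the None result reported above.

```python
import gc
import weakref


class ToyEngine:
    def __init__(self, owner_ref):
        # Weak reference to the index that created this engine.
        self._owner_ref = owner_ref

    def values(self):
        owner = self._owner_ref()
        return None if owner is None else owner.data


class ToyPeriodIndex:
    def __init__(self, data):
        self.data = data
        self._cache = {}

    @property
    def engine(self):
        if "engine" not in self._cache:
            self._cache["engine"] = ToyEngine(weakref.ref(self))
        return self._cache["engine"]

    def shallow_copy(self):
        result = ToyPeriodIndex(self.data)
        # The cached engine still points (weakly) at the original index.
        result._cache = self._cache.copy()
        return result


original = ToyPeriodIndex([1, 2, 3])
original.engine                      # populate the cache
copied = original.shallow_copy()
before = copied.engine.values()      # original is still alive here
del original
gc.collect()
after = copied.engine.values()       # the weak reference is now dead
```

While the original object is alive the copied engine works; once it is collected, the engine inherited through the cache can no longer reach its owner.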

This can be solved by deleting PeriodIndex._engine and falling back to the base Index._engine.

@jbrockmendel , I think you made the new Index._engine method including Index._get_engine_target. Do you agree that PeriodIndex._engine should just be deleted?

jbrockmendel · 2020-03-29T16:04:14Z

Do you agree that PeriodIndex._engine should just be deleted?

Eventually, yes. The reason why PeriodIndex._engine exists is because PeriodEngine needs to get the .freq attribute of the PeriodIndex. However, the PeriodEngine methods that actually use .freq are never actually reached, because the relevant PeriodIndex methods route around them, and it isn't obvious how intentional that is.

So it should be removed, but there is some untangling to do first.

topper-123 · 2020-03-29T17:47:57Z

OK, a fix to the current problem seems to be to replace weakref(self) with weakref(self._values) (I haven't run the test suite yet, though). We could do that as a stopgap until the proper fix, I think.
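A rough sketch of why pointing the weak reference at the shared values could help, again using hypothetical toy classes rather than the real PeriodIndex or PeriodEngine. ToyValues is a list subclass only because plain Python lists cannot be weakly referenced.

```python
import gc
import weakref


class ToyValues(list):
    """List subclass so that instances support weak references."""


class ToyEngine:
    def __init__(self, values_ref):
        # Weak reference to the underlying values, not to the index.
        self._values_ref = values_ref

    def values(self):
        return self._values_ref()


class ToyPeriodIndex:
    def __init__(self, values):
        self._values = values
        self._cache = {}

    @property
    def engine(self):
        if "engine" not in self._cache:
            self._cache["engine"] = ToyEngine(weakref.ref(self._values))
        return self._cache["engine"]

    def shallow_copy(self):
        # The copy holds a strong reference to the same values object.
        result = ToyPeriodIndex(self._values)
        result._cache = self._cache.copy()
        return result


original = ToyPeriodIndex(ToyValues([1, 2, 3]))
original.engine                      # populate the cache
copied = original.shallow_copy()
del original
gc.collect()
survived = copied.engine.values()
```

Because the copy keeps the values alive, the weak reference held by the shared engine stays valid after the original index is gone. Whether this holds in every pandas code path is not something this toy can show.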

topper-123 force-pushed the perf_mulit_index_shallow_copy branch 2 times, most recently from d13a162 to b22bca4 Compare March 12, 2020 22:50

WillAyd added the MultiIndex label Mar 13, 2020

topper-123 added the Performance (memory or execution speed performance) label Mar 13, 2020

topper-123 added this to the 1.1 milestone Mar 13, 2020

jbrockmendel reviewed Mar 13, 2020

topper-123 force-pushed the perf_mulit_index_shallow_copy branch 4 times, most recently from 61dc8cd to 3612860 Compare March 13, 2020 22:43

topper-123 added 3 commits March 14, 2020 06:55

PERF: MultiIndex._shallow_copy

1127370

keep cache_readonly

dcbcb1b

better comments and tests

9c5a56c

topper-123 force-pushed the perf_mulit_index_shallow_copy branch from 3612860 to 9c5a56c Compare March 14, 2020 06:55

jreback merged commit 8111d64 into pandas-dev:master Mar 14, 2020

topper-123 deleted the perf_mulit_index_shallow_copy branch March 14, 2020 16:47

topper-123 mentioned this pull request Mar 21, 2020

PERF/REF: MultiIndex.copy #32883

Merged

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

PERF: MultiIndex._shallow_copy (pandas-dev#32669)

d0f3b27

jacobaustin123 mentioned this pull request Mar 28, 2020

ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting #27237

Merged

4 tasks

This was referenced Mar 29, 2020

BUG: Copying PeriodIndex levels on MultiIndex loses weakrefs #33131

Closed

BUG: create new MI from MultiIndex._get_level_values #33134

Merged

rhshadrach mentioned this pull request Oct 2, 2023

REGR: groupby.transform with a UDF performance #55256

Open
