BUG: Fix getitem dtype preservation with multiindexes #51895

m-richards · 2023-03-11T01:01:31Z

closes BUG: Partial multiindex columns selection breaks categorical dtypes #51261
Tests added and passed
All code checks passed.
~~Added type annotations to new arguments/methods/functions.~~ NA
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature. Will wait to see if the approach is okay before adding the changelog entry.

jorisvandenbossche

Thanks for looking into this!

jorisvandenbossche · 2023-03-11T08:17:41Z

pandas/core/frame.py

-                )
-                result = result.__finalize__(self)
+                result = self.iloc[:, loc]
+                result.columns = result_columns


If we do away with going through self.values (which is a good catch! That probably stems from the time when we only had consolidated numpy dtypes, so whenever you had a single dtype we assumed a single numpy array), we might as well combine the if self._is_mixed_type: .. else: .. paths? I am not sure there is any benefit to using iloc over reindex.

Having a look at this locally, it seems like these separate paths are now redundant and the reindex block can be used in both.
(There was no deliberate choice in my original change to use iloc instead of reindex; I am just less familiar with reindex.) Unfortunately, I also don't know how they compare performance-wise.

They should be equivalent, I think (especially since we afterwards still set the resulting column names); generally they should end up using the same code under the hood. Since the existing code was using reindex, I think it's fine to continue using that.
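A minimal sketch of why going through self.values loses the dtype, using only public pandas API. The two-column frame and the variable names `via_values` and `via_iloc` are illustrative assumptions, not code from this PR:

```python
import pandas as pd

# Two categorical columns under the same top-level key "A".
columns = pd.MultiIndex.from_tuples(
    [("A", "B"), ("A", "C")], names=["lvl1", "lvl2"]
)
df = pd.DataFrame([["x", "y"]], columns=columns).astype("category")

# Positions of the columns selected by the partial key "A".
loc = df.columns.get_loc("A")

# Old single-dtype path: .values converts to a plain numpy array,
# so the categorical extension dtype cannot survive the round trip.
via_values = pd.DataFrame(df.values[:, loc], index=df.index)

# Positional selection on the frame keeps each column's dtype.
via_iloc = df.iloc[:, loc]

print(via_values.dtypes.tolist())
print(via_iloc.dtypes.tolist())
```

The same reasoning applies to any extension dtype, because a numpy array cannot carry it.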

pandas/core/frame.py

pandas/tests/indexing/multiindex/test_multiindex.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

mroeschke · 2023-03-13T17:45:19Z

pandas/tests/indexing/multiindex/test_multiindex.py

+        # GH51261
+        columns = MultiIndex.from_tuples([("A", "B")], names=["lvl1", "lvl2"])
+        df = DataFrame(["value"], columns=columns).astype("category")
+        assert (df["A"].dtypes == "category").all()


Could you use pandas.core.dtypes.common.is_categorical_dtype(df["A"]) here?

mroeschke · 2023-03-13T17:46:21Z

pandas/tests/indexing/multiindex/test_multiindex.py

+                ["x", "y"],
+            ],
+        ).assign(bools=Series([True, False], dtype="boolean"))
+        assert df["bools"].dtype == "boolean"


Could you use pandas.core.dtypes.common.is_bool_dtype(df["bools"]) here?

For this case we actually want to be more explicit to check the extension dtype (is_bool_dtype would also pass with numpy bool dtype). Possible alternative is isinstance(.., BooleanDtype)
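To illustrate the distinction, here is a small sketch assuming the public `pandas.api.types.is_bool_dtype` and `pd.BooleanDtype`. The Series names are made up for the example:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

numpy_bools = pd.Series([True, False])  # numpy bool dtype
nullable_bools = pd.Series([True, False], dtype="boolean")  # extension dtype

# is_bool_dtype accepts both, so it cannot detect a silent fallback
# from the nullable extension dtype to numpy bool.
both_pass = is_bool_dtype(numpy_bools) and is_bool_dtype(nullable_bools)

# An isinstance check on the dtype object only accepts the extension dtype.
numpy_is_extension = isinstance(numpy_bools.dtype, pd.BooleanDtype)
nullable_is_extension = isinstance(nullable_bools.dtype, pd.BooleanDtype)

print(both_pass, numpy_is_extension, nullable_is_extension)
```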

phofl · 2023-03-13T23:01:01Z

pandas/core/frame.py

-                result = self.reindex(columns=new_columns)
-                result.columns = result_columns
-            else:
-                new_values = self.values[:, loc]


I guess this is faster than iloc? Did not profile but that's the only reason I can think of that it is done this way. To preserve the performance, I guess we can check something like _can_fast_transpose here as well?

But in the case where this matters, I expect that loc is a slice? And then I would expect iloc to be fast as well? (It has separate paths for single blocks AFAIR.)
But yes, this is something to verify. Indexing the numpy array will always be faster since you don't have the overhead of our indexing machinery, but if we also end up doing the same thing, the difference should be acceptable.

(And I think a bit of extra overhead is also acceptable to avoid needing the custom add_references code here, as in #51944.)

Sounds good as long as it is not too much overhead. We should at least check. That was my reasoning for adding the reference code instead of switching to iloc.

Small test:

    columns = pd.MultiIndex.from_product([range(10), range(20)])
    df = pd.DataFrame(np.random.randn(10000, len(columns)), columns=columns).copy()
    slice = df.columns.get_loc(0)  # gives slice(0, 20, None)

    In [47]: %timeit df.iloc[:, slice]
    96.2 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [48]: %timeit pd.DataFrame(df.values[:, slice], index=df.index)
    33.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So iloc is a bit slower because it has a bit more overhead (mostly in creating the resulting MultiIndex columns, which we then override ..). But this is mostly fixed overhead regardless of the size of the data, and we are speaking about microseconds.
The actual subsetting still happens with a slice. And just to be sure, here is a comparison with a non-slice case (where we do an actual "take") selecting the same subset:

    In [49]: idx = np.arange(20)

    In [50]: %timeit df.iloc[:, idx]
    576 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
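For anyone wanting to repeat the comparison outside IPython, a rough sketch using the standard-library `timeit` module follows. The `candidates` dict, the labels, and the repeat count are assumptions made for this example, and absolute numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([range(10), range(20)])
df = pd.DataFrame(np.random.randn(10_000, len(columns)), columns=columns)

loc = df.columns.get_loc(0)  # positions of the columns under key 0
idx = np.arange(20)          # the same columns as an integer array

# The three selection strategies discussed in this thread.
candidates = {
    "iloc with slice": lambda: df.iloc[:, loc],
    "values with slice": lambda: pd.DataFrame(df.values[:, loc], index=df.index),
    "iloc with integer array": lambda: df.iloc[:, idx],
}

repeats = 200
timings = {}
for label, func in candidates.items():
    # Average wall-clock time per call, in seconds.
    timings[label] = timeit.timeit(func, number=repeats) / repeats
    print(f"{label}: {timings[label] * 1e6:.1f} µs per call")
```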

github-actions · 2023-04-16T00:05:44Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2023-04-21T17:30:17Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

jorisvandenbossche · 2023-04-21T17:34:14Z

This is actually waiting on an approval / merge (or more comments), although it needs a merge of main now.

mroeschke · 2023-04-24T18:41:39Z

pandas/tests/indexing/multiindex/test_multiindex.py

+        columns = MultiIndex.from_tuples([("A", "B")], names=["lvl1", "lvl2"])
+        df = DataFrame(["value"], columns=columns).astype("category")
+        df_no_multiindex = df["A"]
+        assert is_categorical_dtype(df_no_multiindex["B"])


Suggested change

-        assert is_categorical_dtype(df_no_multiindex["B"])
+        assert isinstance(df_no_multiindex["B"].dtype, CategoricalDtype)

CategoricalDtype will need an import at the top

mroeschke

Could you add a whatsnew entry in v.2.1.0.rst?

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

m-richards · 2023-04-25T02:27:54Z

> Could you add a whatsnew entry in v.2.1.0.rst?

Sure, I've had a go at this. I originally left it out because I wasn't sure how to clearly describe what's actually being fixed.

mroeschke · 2023-04-25T17:27:12Z

Thanks @m-richards

…on with multiindexes) (#53121)

* BUG: Fix getitem dtype preservation with multiindexes (#51895)
* BUG/TST fix dtype preservation with multindex
* lint
* Update pandas/tests/indexing/multiindex/test_multiindex.py

  Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
* cleanups
* switch to iloc, reindex fails in some cases
* suggestions from code review
* address code review comments

  Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
(cherry picked from commit 194b6bb)

* Add whatsnew

---------

Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>

* BUG/TST fix dtype preservation with multindex
* lint
* Update pandas/tests/indexing/multiindex/test_multiindex.py

  Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
* cleanups
* switch to iloc, reindex fails in some cases
* suggestions from code review
* address code review comments

  Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

m-richards added 3 commits March 11, 2023 11:50

BUG/TST fix dtype preservation with multindex

a459276

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

eab4c6a

lint

a3fc02f

m-richards mentioned this pull request Mar 11, 2023

BUG: Fix GeoDataFrames with MultiIndex as columns do not support CRS geopandas/geopandas#2088

Merged

m-richards marked this pull request as ready for review March 11, 2023 01:35

jorisvandenbossche reviewed Mar 11, 2023

m-richards and others added 4 commits March 11, 2023 21:22

Update pandas/tests/indexing/multiindex/test_multiindex.py

1be4d25

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

cleanups

e865548

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

fda3df8

switch to iloc, reindex fails in some cases

533e0d8

mroeschke reviewed Mar 13, 2023

mroeschke added MultiIndex Dtype Conversions Unexpected or buggy dtype conversions labels Mar 13, 2023

jorisvandenbossche mentioned this pull request Mar 13, 2023

BUG: CoW not tracking references when indexing midx with slice #51944

Merged

5 tasks

phofl reviewed Mar 13, 2023

m-richards and others added 3 commits March 14, 2023 21:13

suggestions from code review

1d221b1

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

968228a

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

f56e76c

github-actions bot added the Stale label Apr 16, 2023

mroeschke closed this Apr 21, 2023

jorisvandenbossche reopened this Apr 21, 2023

m-richards added 3 commits April 23, 2023 14:22

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

e3c17af

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

6f260fe

Merge remote-tracking branch 'upstream/main' into fix_multiindex_dtype

079d543

mroeschke reviewed Apr 24, 2023

address code review comments

6badc9e

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

mroeschke approved these changes Apr 25, 2023

mroeschke merged commit 194b6bb into pandas-dev:main Apr 25, 2023

m-richards deleted the fix_multiindex_dtype branch September 16, 2023 01:26
