Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix getitem dtype preservation with multiindexes #51895

Merged
merged 14 commits into from
Apr 25, 2023

Conversation

m-richards
Copy link
Contributor

@m-richards m-richards commented Mar 11, 2023

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this!

)
result = result.__finalize__(self)
result = self.iloc[:, loc]
result.columns = result_columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do away with going through self.values (which is a good catch! that probably stems from the time we only had consolidated numpy dtypes, so whenever you had a single dtype we assumed a single numpy array), we might as well combine the if self._is_mixed_type: .. else: .. paths? I am not sure there is any benefit over using iloc vs reindex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a look at this locally, it seems like these separate paths are now redundant and the reindex block can be used in both.
(There was not deliberate choice in my original change to use iloc instead of reindex, I am just less familiar with reindex). I'm also unfortunately ignorant of how they compare performance wise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be equivalent, I think (especially since we afterwards still set the resulting column names), generally they should end up using the same code under the hood. Since the existing code was using reindex, I think it's fine to continue using that.

pandas/core/frame.py Outdated Show resolved Hide resolved
pandas/tests/indexing/multiindex/test_multiindex.py Outdated Show resolved Hide resolved
# GH51261
columns = MultiIndex.from_tuples([("A", "B")], names=["lvl1", "lvl2"])
df = DataFrame(["value"], columns=columns).astype("category")
assert (df["A"].dtypes == "category").all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use pandas.core.dtypes.common.is_categorical_dtype(df["A"]) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

["x", "y"],
],
).assign(bools=Series([True, False], dtype="boolean"))
assert df["bools"].dtype == "boolean"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use pandas.core.dtypes.common.is_bool_dtype(df["bools"]) here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case we actually want to be more explicit to check the extension dtype (is_bool_dtype would also pass with numpy bool dtype). Possible alternative is isinstance(.., BooleanDtype)

result = self.reindex(columns=new_columns)
result.columns = result_columns
else:
new_values = self.values[:, loc]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is faster than iloc? Did not profile but that's the only reason I can think of that it is done this way. To preserve the performance, I guess we can check something like _can_fast_transpose here as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in the case where this matters, I expect that locis a slice? And then I would expect iloc to be fast as well? (it has separate paths for single blocks AFAIR)
But yes, something to verify (indexing the numpy array will always be faster since you don't have the overhead of our indexing machinery. But if we also end up doing the same, the difference should be acceptable).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and I think a bit of extra overhead is also acceptable to avoid needing the custom add_references code here (as in #51944)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good as long as is not to much overhead. We should check at least. That was my reasoning for adding the reference code instead of switching to iloc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small test:

columns = pd.MultiIndex.from_product([range(10), range(20)])
df = pd.DataFrame(np.random.randn(10000, len(columns)), columns=columns).copy()
slice = df.columns.get_loc(0)
# gives slice(0, 20, None)

In [47]: %timeit df.iloc[:, slice]
96.2 µs ± 4.58 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [48]: %timeit pd.DataFrame(df.values[:, slice], index=df.index)
33.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So iloc is a bit slower, having a bit more overhead (mostly in creating the resulting MultiIndex columns, which we then override ..). But this mostly fixed overhead regardless of the size of the data, and we are speaking about microseconds.
The actual subsetting still happens with a slice. And just to be sure to compare to a non-slice case (where we do an actual "take") selecting the same subset:

In [49]: idx = np.arange(20)

In [50]: %timeit df.iloc[:, idx]
576 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 16, 2023
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@jorisvandenbossche
Copy link
Member

This is actually waiting on an approval / merge (or more comments), although needs a merge of main now

columns = MultiIndex.from_tuples([("A", "B")], names=["lvl1", "lvl2"])
df = DataFrame(["value"], columns=columns).astype("category")
df_no_multiindex = df["A"]
assert is_categorical_dtype(df_no_multiindex["B"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert is_categorical_dtype(df_no_multiindex["B"])
assert isinstance(df_no_multiindex["B"].dtype, CategoricalDtype)

CategoricalDtype will need an import at the top

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew entry in v.2.1.0.rst?

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@m-richards
Copy link
Contributor Author

Could you add a whatsnew entry in v.2.1.0.rst?

Sure, I've had a go at this, I originally left it because I wasn't sure how to clearly decribe what's actually being fixed.

@mroeschke mroeschke merged commit 194b6bb into pandas-dev:main Apr 25, 2023
@mroeschke
Copy link
Member

Thanks @m-richards

phofl pushed a commit to phofl/pandas that referenced this pull request May 6, 2023
* BUG/TST fix dtype preservation with multindex

* lint

* Update pandas/tests/indexing/multiindex/test_multiindex.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* cleanups

* switch to iloc, reindex fails in some cases

* suggestions from code review

* address code review comments

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

(cherry picked from commit 194b6bb)
mroeschke pushed a commit that referenced this pull request May 8, 2023
…on with multiindexes) (#53121)

* BUG: Fix getitem dtype preservation with multiindexes (#51895)

* BUG/TST fix dtype preservation with multindex

* lint

* Update pandas/tests/indexing/multiindex/test_multiindex.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* cleanups

* switch to iloc, reindex fails in some cases

* suggestions from code review

* address code review comments

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

(cherry picked from commit 194b6bb)

* Add whatsnew

---------

Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
* BUG/TST fix dtype preservation with multindex

* lint

* Update pandas/tests/indexing/multiindex/test_multiindex.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* cleanups

* switch to iloc, reindex fails in some cases

* suggestions from code review

* address code review comments

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
* BUG/TST fix dtype preservation with multindex

* lint

* Update pandas/tests/indexing/multiindex/test_multiindex.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

* cleanups

* switch to iloc, reindex fails in some cases

* suggestions from code review

* address code review comments

Co-Authored-By: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
@m-richards m-richards deleted the fix_multiindex_dtype branch September 16, 2023 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions MultiIndex Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Partial multiindex columns selection breaks categorical dtypes
4 participants