Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: find_common_type incorrect for Categorical with NaNs #38240

Open
jbrockmendel opened this issue Dec 2, 2020 · 7 comments
Open

BUG: find_common_type incorrect for Categorical with NaNs #38240

jbrockmendel opened this issue Dec 2, 2020 · 7 comments
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint

Comments

@jbrockmendel
Copy link
Member

import pandas as pd
from pandas.core.dtypes.cast import find_common_type

left = pd.Index([1, 2, 3, 4])
right = pd.CategoricalIndex([1, None], categories=left)

dtype = find_common_type([left.dtype, right.dtype])

>>> right.astype(dtype)
ValueError: Cannot convert float NaN to integer

xref #37930; I expect this would also be relevant for #19371

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 2, 2020
@jorisvandenbossche jorisvandenbossche changed the title BUG: find_common_type incorrect for Categorical with NAs BUG: find_common_type incorrect for Categorical with NaNs Dec 6, 2020
@jorisvandenbossche
Copy link
Member

This is not a "bug", but rather unfortunate but intentional behaviour. We handle this specifically here:

def _cast_to_common_type(arr: ArrayLike, dtype: DtypeObj) -> ArrayLike:
"""
Helper function for `arr.astype(common_dtype)` but handling all special
cases.
"""
if (
is_categorical_dtype(arr.dtype)
and isinstance(dtype, np.dtype)
and np.issubdtype(dtype, np.integer)
):
# problem case: categorical of int -> gives int as result dtype,
# but categorical can contain NAs -> fall back to object dtype
try:
return arr.astype(dtype, copy=False)
except ValueError:
return arr.astype(object, copy=False)

See this comment on the original PR about exactly this case: #33607 (comment)

This whole _cast_to_common_type function is meant to deal with the current inconsistencies (it also handles a few other cases as well).
Of course, if you use find_common_type outside of the concat.py / _cast_to_common_type context (eg like you did in #38122), then it can give problems. We might need to use this _cast_to_common_type then in those places as well, instead of a plain astype.


We should also see if we can actually change this behaviour, to get rid of the value-dependent behaviour (which I think we generally want). But that doesn't seem easy to deprecate, so might be 2.0 material.

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 6, 2020
@mroeschke mroeschke added Categorical Categorical Data Type Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Aug 14, 2021
@jbrockmendel
Copy link
Member Author

Marking as blocker for 1.4rc; we should make a decision about deprecation before then.

@jbrockmendel jbrockmendel added the Blocker for rc Blocking issue or pull request for release candidate label Dec 22, 2021
@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

I agree i am not sure how you would deprecate this.

@jbrockmendel
Copy link
Member Author

How about I put together a PR for what it would look like to change directly in 2.0 and based on that see if a decent way of doing a deprecation for 1.4 pops out?

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

How about I put together a PR for what it would look like to change directly in 2.0 and based on that see if a decent way of doing a deprecation for 1.4 pops out?

sgtm

@jorisvandenbossche
Copy link
Member

We could raise a deprecation warning when the categorical gets cast to int and that succeeds (so not falling back to object). People could avoid the deprecation warning by casting to object dtype (or to int dtype if they want that) in advance.

I don't think this will be a very common case, so it might not be problematic to actually warn about this.

@jbrockmendel
Copy link
Member Author

No real progress on a branch here. Removing blocker-for-RC label.

@jbrockmendel jbrockmendel removed the Blocker for rc Blocking issue or pull request for release candidate label Dec 30, 2021
@jbrockmendel jbrockmendel added the PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint
Projects
None yet
Development

No branches or pull requests

4 participants