How should isin behave with pd.NA? #31990

Open
dsaxton opened this issue Feb 15, 2020 · 4 comments
Labels
Bug, isin (isin method), NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

@dsaxton
Member

dsaxton commented Feb 15, 2020

```python
import pandas as pd

arr = pd.array(["a", "b", pd.NA], dtype="string")
s = pd.Series(["a", "b", "c"])

print(s.isin(arr))
# 0     True
# 1     True
# 2    False
# dtype: bool

print(pd.Series(arr).isin(["a", "b"]))
# 0     True
# 1     True
# 2    False
# dtype: bool
```

I think a case could be made that the actual output is not correct in these cases, and that both should return the nullable boolean pd.Series(pd.array([True, True, pd.NA])). In the first case we don't know that "c" is not in arr, and in the second case we don't know if pd.NA happens to be "a" or "b", so again we should have pd.NA.

This is obviously an edge case but may be worth considering for the sake of consistency with the other three-valued logic operations (since isin is essentially an "or" statement).
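To make the "or" analogy concrete, here is a minimal sketch of a three-valued membership test written in plain Python. The helpers `kleene_eq`, `kleene_or`, and `kleene_isin` are hypothetical names used only for illustration; they are not part of pandas and are not meant to suggest how pandas would implement this.

```python
from functools import reduce

import pandas as pd


def kleene_eq(a, b):
    # Equality is unknown as soon as either side is missing.
    if a is pd.NA or b is pd.NA:
        return pd.NA
    return bool(a == b)


def kleene_or(a, b):
    # A single True decides the result; otherwise any NA leaves it unknown.
    if a is True or b is True:
        return True
    if a is pd.NA or b is pd.NA:
        return pd.NA
    return False


def kleene_isin(values, candidates):
    # isin as an "or" over element-wise equality checks.
    # For an empty candidate list the fold simply returns False.
    return [
        reduce(kleene_or, (kleene_eq(v, c) for c in candidates), False)
        for v in values
    ]


# The two cases from the report above.
first = kleene_isin(["a", "b", "c"], ["a", "b", pd.NA])
second = kleene_isin(["a", "b", pd.NA], ["a", "b"])

print(pd.array(first, dtype="boolean"))
print(pd.array(second, dtype="boolean"))
```

Under these assumptions both calls give True, True, NA, which is the nullable boolean result proposed above.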

@jreback
Contributor

jreback commented Feb 15, 2020

can show a similar example with a float array with np.nan
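A float analogue might look like the sketch below. The variable names are illustrative, and the behaviour is worth re-checking on your own pandas version: here `isin` treats `np.nan` as an ordinary value that matches other `np.nan` entries, rather than as an unknown.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, 2.0, np.nan])

# The missing element is reported as "not in" the values, not as unknown.
in_values = floats.isin([1.0, 2.0]).tolist()

# np.nan in the values matches the np.nan element.
nan_lookup = floats.isin([np.nan]).tolist()

print(in_values)
print(nan_lookup)
```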

@jorisvandenbossche
Member

Yes, I think we should indeed propagate the NA here, similar to how we do it with other operations that result in booleans (like comparisons).

And the result here should be a nullable boolean anyway (even when not propagating the NA), as we should try as much as possible to return nullable dtypes from operations on columns with nullable dtypes.
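For reference, this is roughly what NA propagation looks like for comparisons on nullable dtypes. It is a small sketch with illustrative variable names, not a statement about how `isin` is or will be implemented.

```python
import pandas as pd

ints = pd.array([1, 2, pd.NA], dtype="Int64")
strings = pd.array(["a", "b", pd.NA], dtype="string")

# Comparisons on nullable arrays return a nullable boolean array,
# with NA wherever the input element is missing.
gt_one = ints > 1
is_a = strings == "a"

print(gt_one.dtype, gt_one.isna().tolist())
print(is_a.dtype, is_a.isna().tolist())
```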

@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Feb 15, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Feb 15, 2020
@cottrell
Contributor

Jumping into this a bit late in the game ... but some questions about pd.NA came up on gitter and I see it is still experimental. If there is some place for discussion about pd.NA please point me there.

This use case suggests that allowing pd.NA to change the underlying numpy array dtype to object could be pretty bad.

If it were simply implemented as a mask attribute (sparse or dense, optionally), everyone might have a better time down the road. For example, isin could be lightly modified with no significant performance hit if a pd.NA mask is present.

If I understand the origins, I'm imagining this is largely to allow for missing integers and booleans?

@jorisvandenbossche
Member

For integers and booleans, it does use a mask (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/masked.py). But the string dtype does not (or not yet).
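As a rough illustration of how a mask-based layout could support an NA-aware isin, here is a NumPy-only sketch. The function `masked_isin` and its `candidates_have_na` flag are hypothetical and are not taken from `masked.py`; the sketch also assumes a non-empty list of candidates.

```python
import numpy as np


def masked_isin(data, mask, candidates, candidates_have_na=False):
    """Hypothetical mask-aware isin; returns (values, result_mask)."""
    data = np.asarray(data)
    mask = np.asarray(mask, dtype=bool)

    # Plain membership test on the underlying data.
    hits = np.isin(data, candidates)

    # Missing elements stay unknown.
    result_mask = mask.copy()
    if candidates_have_na:
        # A miss is only a "maybe" when the candidates contain a missing value.
        result_mask |= ~hits

    # Clear values under the mask so they carry no meaning.
    values = hits & ~result_mask
    return values, result_mask


values, na = masked_isin([1, 2, 0], [False, False, True], [1])
values_na, na_na = masked_isin(
    [1, 2, 0], [False, False, True], [1], candidates_have_na=True
)
print(values, na)
print(values_na, na_na)
```

The extra work on top of `np.isin` is a couple of boolean array operations, which is in line with the point above that a mask keeps the cost low.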

For general questions, we don't really have a good place to ask them; discussing further on gitter is fine, or otherwise use the mailing list. And if you have specific issues (bug reports, enhancement requests), new issues are fine!

@mroeschke mroeschke added the Bug label Apr 28, 2020
@jreback jreback modified the milestones: 1.1, Contributions Welcome Jul 11, 2020
@jbrockmendel jbrockmendel added the isin isin method label Oct 30, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

6 participants