BUG: __eq__ raises NotImplementedError for new arrow string dtype #56008

Closed
phofl opened this issue Nov 16, 2023 · 2 comments · Fixed by #56245
Labels
Arrow pyarrow functionality Bug Strings String extension data type and string data

@phofl
Member

phofl commented Nov 16, 2023

Reproducible Example

import pandas as pd

ser = pd.Series([False, True])
ser2 = pd.Series(["a", "b"], dtype="string[pyarrow_numpy]")
ser == ser2

Issue Description

This shouldn't raise if we want to emulate object dtype behavior
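For comparison, here is a minimal sketch of the object dtype behavior referred to above. It assumes the same values as the reproducible example but stores the strings with `dtype=object`; the variable names are illustrative and not taken from the issue.

```python
import pandas as pd

# Same values as the reproducible example, but the strings use
# plain object dtype instead of the arrow-backed string dtype.
bools = pd.Series([False, True])
strings_as_object = pd.Series(["a", "b"], dtype=object)

# No boolean equals a string, so every element compares unequal
# and the comparison returns a result instead of raising.
result = bools == strings_as_object
print(result.tolist())
```

With object dtype the comparison falls back to element-wise Python equality, which is why it yields False for every row rather than an error.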

Expected Behavior

The comparison should return an all-False boolean Series.
cc @jorisvandenbossche thoughts?

Installed Versions

Replace this line with the output of pd.show_versions()

@phofl phofl added Bug Strings String extension data type and string data Arrow pyarrow functionality labels Nov 16, 2023
@phofl phofl added this to the 3.0 milestone Nov 16, 2023
@jorisvandenbossche
Member

Regardless of exactly emulating object dtype, we should maybe ask ourselves the more general question: how does pandas handle equality comparisons of incompatible dtypes in general? (or at least with the non-arrow dtypes)

At the moment, it seems we just never raise an error, with one exception for comparing two categoricals (based on a quick experiment, didn't include every possible dtype):

import pandas as pd

serieses = [
    pd.Series(["a", "b"]),
    pd.Series(["a", "b"], dtype="string"),
    pd.Series([True, False]),
    pd.Series([True, False], dtype="boolean"),
    pd.Series([1, 2]),
    pd.Series([1, 2], dtype="Int64"),
    pd.Series([0.1, 0.2]),
    pd.Series([0.1, 0.2], dtype="Float64"),
    pd.Series(pd.date_range("2012-01-01", periods=2)),
    pd.Series(pd.timedelta_range("1 days", periods=2)),
    pd.Series(pd.period_range("2012", periods=2)),
    pd.Series(['a', 'b'], dtype="category"),
    pd.Series([1, 2], dtype="category"),
]

for left in serieses:
    for right in serieses:
        try:
            left == right
        except Exception:
            print(f"Exception for {left.dtype} and {right.dtype}")

gives

Exception for category and category
Exception for category and category

So if we want to keep that "rule" consistent, then I think the new default string dtype should also never raise in comparisons, but give Falses instead.

@jbrockmendel
Member

In general we follow Python semantics for non-comparable dtypes: == returns all-False, != returns all-True, and inequalities raise. A boilerplate version of this logic is in ops.invalid_comparison.
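The Python semantics mentioned here can be sketched in plain Python, without pandas. `elementwise_compare` below is a hypothetical helper written for illustration only; it is not `ops.invalid_comparison` and does not reproduce its implementation.

```python
import operator


def elementwise_compare(left, right, op):
    """Hypothetical helper: apply a comparison operator pairwise."""
    return [op(a, b) for a, b in zip(left, right)]


bools = [False, True]
strings = ["a", "b"]

# Equality between a bool and a str is defined and is always False.
eq_result = elementwise_compare(bools, strings, operator.eq)

# Inequality is the mirror image and is always True.
ne_result = elementwise_compare(bools, strings, operator.ne)

# Ordering comparisons are not defined between bool and str,
# so Python raises TypeError instead of returning a value.
try:
    elementwise_compare(bools, strings, operator.lt)
    lt_raised = False
except TypeError:
    lt_raised = True

print(eq_result, ne_result, lt_raised)
```

Under this rule, the `ser == ser2` call from the original report would be expected to return all-False, while something like `ser < ser2` would still be expected to raise.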
