Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pandas): make case work for non-RangeIndex dataframes #9083

Merged
merged 2 commits into from
Apr 30, 2024

Conversation

dlovell
Copy link
Contributor

@dlovell dlovell commented Apr 30, 2024

Description of changes

This PR makes PandasExecutor create the Series with an index that matches the incoming data. Currently, when the incoming data does not use a RangeIndex, the output index is a union of a RangeIndex and the incoming data index.

@dlovell dlovell force-pushed the pandas-cases-non-RangeIndex branch from e5f99d5 to a3c59cc Compare April 30, 2024 11:34
@cpcloud
Copy link
Member

cpcloud commented Apr 30, 2024

What's the use case that's enabled here that requires this change? What can you not do with the current codebase?

@dlovell
Copy link
Contributor Author

dlovell commented Apr 30, 2024

What's the use case that's enabled here that requires this change? What can you not do with the current codebase?

register a dataframe with non-range index and have cases work

@dlovell
Copy link
Contributor Author

dlovell commented Apr 30, 2024

alternatively, should dataframes' index be sanitized on registration? is there a specification of what should be true about dataframes that are registered?

@cpcloud
Copy link
Member

cpcloud commented Apr 30, 2024

register a dataframe with non-range index and have cases work

It would be good to have a less abstract example documented in this PR. Doesn't really even have to be code, just some description that helps justify why we should take on any additional code to the pandas backend.

@dlovell
Copy link
Contributor Author

dlovell commented Apr 30, 2024

register a dataframe with non-range index and have cases work

It would be good to have a less abstract example documented in this PR. Doesn't really even have to be code, just some description that helps justify why we should take on any additional code to the pandas backend.

Does this match what you're looking for?

When I run this code

def do_replace(col):
    return (
        col
        .cases(
            (
                (1, "one"),
                (2, "two"),
            ),
            default="unk",
        )
    )


df = pd.DataFrame({
    "A": pd.Series({i: i % 3 for i in (0, 1, 2, 4)}),
    "B": 0,
})
expr = ibis.pandas.connect({"t": df}).table("t")


print("Input")
print(len(expr.execute()))
print(expr.execute())
print()


print("Current results")
x = expr.mutate(**{"A": lambda t: t["A"].pipe(do_replace)}).execute()
print(len(x))
print(x)
print()

I get these results

Input
4
   A  B
0  0  0
1  1  0
2  2  0
4  1  0

Current results
5
     A    B
0  unk  0.0
1  one  0.0
2  two  0.0
3  one  NaN
4  NaN  0.0

@cpcloud
Copy link
Member

cpcloud commented Apr 30, 2024

Heh, yeah, that does :)

I'll just keep suggesting that the pandas backend should probably be avoided.

I somewhat reluctantly will accept this PR, fully realizing that creating the pandas backend was probably a doomed idea from the start 😂

@cpcloud cpcloud added this to the 9.0 milestone Apr 30, 2024
@cpcloud cpcloud added bug Incorrect behavior inside of ibis pandas The pandas backend labels Apr 30, 2024
@cpcloud cpcloud enabled auto-merge (squash) April 30, 2024 12:12
@cpcloud cpcloud merged commit 73dd685 into ibis-project:main Apr 30, 2024
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis pandas The pandas backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants