Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

if_else shouldn't coerce nan to strings #179

Closed
machow opened this issue Jan 25, 2020 · 2 comments
Closed

if_else shouldn't coerce nan to strings #179

machow opened this issue Jan 25, 2020 · 2 comments

Comments

@machow
Copy link
Owner

machow commented Jan 25, 2020

Edit: Removed pasted code snipper, replaced with reproducible example.

Problem: if_else can end up converting np.nan to a string: "nan".

Cause: if_else tries to downcast using numpy at the very end

Example

import numpy as np
from siuba import if_else

if_else(
    pd.Series([True, True, False, False]),
    np.nan,
    pd.Series([123, 456, "Number", "Percent"]),
)

# Result: array(['nan', 'nan', 'Number', 'Percent'], dtype='<U32')

The root cause in is the method used to downcast at the end. Here is the problem shown just with numpy...

np.array([np.nan, np.nan, "Number", "Percent"])

# Result: array(['nan', 'nan', 'Number', 'Percent'], dtype='<U32')
@machow
Copy link
Owner Author

machow commented Jan 26, 2020

I think the solution to this problem is to replace the very last line...

    return np.array(list(result))

with

    return pd.Series(result)

There are two reasons...

  1. It fixes the issue with converting np.nan to a string
  2. Since this is if_else dispatching on a pd.Series, it makes sense for it to return a Series

Alternatively, we could engage whatever mechanism pandas uses for converting the final array, since we don't really need all the Series machinery (e.g. an index). However, I suspect as pandas supports String arrays, etc.., through Series, it will become harder to avoid them.

@machow machow changed the title if_else shouldn't coerce to strings if_else shouldn't coerce nan to strings Jan 26, 2020
@machow
Copy link
Owner Author

machow commented Feb 5, 2020

Instead of fix 2 (returning a Series), I opted to just keep the np array dtype "O". The reasoning is..

  • inferring a type to convert to is one of pandas slowest operations
  • as a result, anything that infers a type won't scale when run N_group times

(this is why pandas goes to great lengths to wait to convert types, until the very end of a set of operations)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

1 participant