Bug? Replacing NaN values based on a condition. #8669

Closed
ozhogin opened this Issue Oct 29, 2014 · 4 comments

ozhogin commented Oct 29, 2014

I have a DataFrame with two columns, A and B. Wherever a value in B is larger than the corresponding value in A, I want to replace it with the value from A. I used to do this with df.B[df.B > df.A] = df.A, but after a recent pandas upgrade this chained assignment started giving a SettingWithCopyWarning. The official documentation recommends using .loc instead.

Okay, I said, and switched to df.loc[df.B > df.A, 'B'] = df.A. It all works fine unless every value in column B is NaN. Then something weird happens:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2 NaN
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808

Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:

In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   4
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A   B
0  1 NaN
1  2   2
2  3 NaN

But if none of B's elements satisfy the condition, then all NaNs get replaced with -9223372036854775808:

In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})

In [2]: df
Out[2]: 
   A   B
0  1 NaN
1  2   1
2  3 NaN

In [3]: df.loc[df.B > df.A, 'B'] = df.A

In [4]: df
Out[4]: 
   A                    B
0  1 -9223372036854775808
1  2                    1
2  3 -9223372036854775808

Am I doing something wrong, or is this a bug?

pandas: 0.15.0
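
For readers hitting the same problem, here is a possible alternative that avoids assigning through an indexer at all. It is a minimal sketch, not something proposed in this thread, and it assumes a reasonably recent pandas in which `Series.mask` accepts a replacement value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, 4, np.nan]})

# Series.mask returns a new Series in which values are replaced wherever the
# condition is True; rows where the condition is False are left untouched.
df["B"] = df["B"].mask(df["B"] > df["A"], df["A"])

# The same operation when no row matches: comparisons against NaN are always
# False, so the column should come back unchanged.
all_nan = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]})
all_nan["B"] = all_nan["B"].mask(all_nan["B"] > all_nan["A"], all_nan["A"])

print(df)
print(all_nan.dtypes)
```

Because the result is built as a new Series and then assigned to the whole column, B stays float64 and keeps its NaNs even when the condition matches nothing.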

onesandzeroes commented Oct 29, 2014

I think there's some weird type conversion going on. I can't figure out why B converts to int64 here:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
df
Out[6]: 
   A   B
0  1 NaN
1  2 NaN
2  3 NaN

df.dtypes
Out[10]: 
A      int64
B    float64
dtype: object

# This loc shouldn't match any rows
df.loc[df.B > df.A, 'B'] = df.A
df
Out[8]: 
   A                    B
0  1 -9223372036854775808
1  2 -9223372036854775808
2  3 -9223372036854775808

# Why has this become int? Is this expected
# behaviour for this assignment?
df.dtypes
Out[12]: 
A    int64
B    int64
dtype: object

jreback commented Oct 29, 2014

so the reason this shows up is that

(Pdb) p np.array([np.nan]).astype(np.int64)
array([-9223372036854775808])

is happening, i.e. a float64 is being coerced to an int64. It shouldn't be coercing because the indexer is empty (i.e. df.B > df.A is all False). So this is a bug.
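
To make the connection to the output above explicit: the value that replaced the NaNs is exactly the smallest value an int64 can hold. The sketch below checks that and shows one way to refuse such a cast. The helper `to_int64_checked` is hypothetical and is not part of pandas or NumPy. Note also that the result of casting NaN to an integer is not something NumPy guarantees, so other platforms or versions may produce a different value or a warning.

```python
import numpy as np

# The value seen in the corrupted column is the minimum of int64.
INT64_MIN = np.iinfo(np.int64).min
print(INT64_MIN)  # -9223372036854775808


def to_int64_checked(values):
    """Hypothetical helper: cast to int64 only when no NaN is present."""
    arr = np.asarray(values, dtype=np.float64)
    if np.isnan(arr).any():
        # int64 has no representation for NaN, so fail loudly instead of
        # letting the cast produce an arbitrary integer.
        raise ValueError("cannot represent NaN in an int64 array")
    return arr.astype(np.int64)


print(to_int64_checked([1.0, 2.0]))
```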

jreback commented Oct 29, 2014

fixed in #8671

Don't ask me to explain that; there are probably 10+ cases of what to do with a value on the rhs of an assignment when you need to coerce dtypes / infer. I don't fully understand some of the cases. Good thing we have a comprehensive test suite.

Not really sure it CAN be any simpler, as pandas allows like anything to be set :)
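
For anyone wanting to confirm the behaviour on their own install, a check along these lines might do. It is only a sketch: the function name is made up for illustration, and this is not the test that was added in #8671.

```python
import numpy as np
import pandas as pd


def check_empty_mask_keeps_dtype():
    """Hypothetical check: an assignment with an all-False mask is a no-op."""
    df = pd.DataFrame({"A": [1, 2, 3], "B": [np.nan, np.nan, np.nan]})
    mask = df["B"] > df["A"]  # all False, since NaN never compares greater
    assert not mask.any()
    df.loc[mask, "B"] = df["A"]  # nothing selected, so nothing should change
    return df


result = check_empty_mask_keeps_dtype()
print(result.dtypes)
```

On a pandas version that includes the fix, B would be expected to remain float64 with all values still NaN.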

ozhogin commented Oct 29, 2014

Thank you!
