BUG: None is not equal to None #20442

Open
mapio opened this Issue Mar 21, 2018 · 17 comments

@mapio
mapio commented Mar 21, 2018

If you have None in a series, it will not be considered equal to None (even in the same series).

A = pd.Series([None]*3)
A == A

gives

0    False
1    False
2    False
dtype: bool

but, of course

A.values == A.values

gives

array([ True,  True,  True])

I can't find any reference to this behaviour in the documentation, and it looks quite unnatural.

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel

Member

jschendel commented Mar 21, 2018

Also a bit strange that == is inconsistent with the eq method in this case:

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'

In [3]: s = pd.Series([None]*3)

In [4]: s
Out[4]:
0    None
1    None
2    None
dtype: object

In [5]: s == s
Out[5]:
0    False
1    False
2    False
dtype: bool

In [6]: s.eq(s)
Out[6]:
0    True
1    True
2    True
dtype: bool
@mroeschke

Member

mroeschke commented Mar 21, 2018

I believe this occurs because this comparison hits this block:

pandas/pandas/_libs/ops.pyx

Lines 162 to 163 in 17c1fad

if checknull(x) or checknull(y):
result[i] = False

And the checknull function resolves down to this path, which considers None to be null:

cdef inline bint _checknull(object val):
    try:
        return val is None or (cpython.PyFloat_Check(val) and val != val)
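In plain Python terms, the check that helper performs can be sketched like this (a simplified illustration, not the actual pandas implementation):

```python
def checknull_sketch(val):
    """Rough Python equivalent of the Cython _checknull helper.

    A value counts as null if it is None, or if it is a float that
    is not equal to itself (the IEEE 754 NaN property).
    """
    return val is None or (isinstance(val, float) and val != val)

checknull_sketch(None)          # True
checknull_sketch(float("nan"))  # True
checknull_sketch("a")           # False
```

Since both operands in `s == s` are None, every elementwise comparison hits the `checknull` branch and yields False.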

@jreback

Contributor

jreback commented Mar 22, 2018

see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal footing (not consistent with Python, but consistent with NumPy).

So this is not a bug, rather a consequence of straddling two conventions.

I suppose the documentation could be slightly enhanced.

@jreback

Contributor

jreback commented Mar 22, 2018

however this should be consistent with this:

In [7]: s = pd.Series([np.nan]*3)

In [8]: s == s
Out[8]: 
0    False
1    False
2    False
dtype: bool

In [9]: s.eq(s)
Out[9]: 
0    False
1    False
2    False
dtype: bool

so [6] from @jschendel should match [9] here. The incorrect case is actually .eq

@jreback

Contributor

jreback commented Mar 22, 2018

here's the dichotomy in scalars

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

@jreback jreback added this to the Next Major Release milestone Mar 22, 2018

@jreback jreback changed the title None is not equal to None BUG: None is not equal to None Mar 22, 2018

@mapio


mapio commented Mar 22, 2018

I do not agree with @jreback and the current status of equality for None in pandas.

I've seen the warning box and I completely agree that, according to IEEE floating-point standard, "a NaN is never equal to any other number (including another NaN)" (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).

If the data type of a series is numeric, I have no problem accepting that NaNs are not equal, and even that Nones get converted to NaNs.

But if a series contains objects (be they strings, lists, or whatever), equality should be consistent with Python equality for such objects.

It's a basic application of the principle of least astonishment. I don't see any benefit in mangling None if the data type is non-numeric.

@jreback

Contributor

jreback commented Mar 22, 2018

@mapio you are missing the point. None is NOT the default missing indicator for anything. np.nan IS. None is accepted for compatibility.

@mapio


mapio commented Mar 22, 2018

Where in the documentation is it stated that np.nan is the "missing indicator"? And, if this is the case, I find it incredibly confusing that

>>> pd.Series([1,None,2])
0    1.0
1    NaN
2    2.0
dtype: float64

but

>>> pd.Series(['a', None, 'b'])
0       a
1    None
2       b
dtype: object

If you pick np.nan as the missing data indicator, then you must convert every None in a series to it, not only in numeric series but also in series of objects. To be clear, I expect the second example to look like this:

>>> pd.Series(['a', None, 'b'])
0       a
1    NaN
2       b
dtype: object

Given such a convention I would probably still have asked why you decided to translate None to NaN, but I would not have been surprised at all to get False when comparing the second entries of the two series.

Maybe I'm missing the point, but the current situation is, at best, inconsistent (and as such, very confusing).

@TomAugspurger

Contributor

TomAugspurger commented Mar 22, 2018

http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

@mapio how familiar are you with the NumPy / pandas type system? When you do Series([1, None, 2]), we infer it as floats. If you want Python objects, specify that.

In [5]: s = pd.Series([1, None, 2], dtype=object)

In [6]: s
Out[6]:
0       1
1    None
2       2
dtype: object

But as you've seen, many places in pandas, especially in ops like ==, treat None as missing.
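For example, isna() treats both None and NaN as missing in an object series, even though the dtype preserves them as distinct objects (a small illustration; output checked against recent pandas versions):

```python
import pandas as pd

s = pd.Series(['a', None, float('nan')], dtype=object)

# Both the None and the NaN are flagged as missing,
# even though the object dtype stores them as-is.
print(s.isna().tolist())  # [False, True, True]
```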

@mapio


mapio commented Mar 22, 2018

I'm quite familiar with NumPy. My point is the exact opposite of what you say.

When you build a series from objects, pandas correctly infers the object data type and Nones are left as such. But when you compare them (and only then), they are silently treated as NaNs and (due to the IEEE way of handling equality among NaNs) they turn out to be different.

This makes complete sense in numeric series, where Nones are clearly converted to NaNs.

My point is:

  • it's ok to convert to NaNs if you want to have NaN the only way to represent missing values,
  • it's ok to compare NaNs as different (of course),
  • you must perform such conversion visibly every time you set a series element to None!

If you keep None as an element of object series but convert them silently to NaNs before comparing them (with == or eq), you are surprising your users (and also your developers, given that this was reported as a bug with regard to .eq).

No implicit conversion to NaN should happen. If you store objects, you should compare them as Python does.

@jreback

Contributor

jreback commented Mar 23, 2018

this is a long-ago argument. np.nan is the missing value indicator for ALL dtypes, including object. This is not going to change. Yes, we allow None to be explicitly passed for object dtypes and be preserved (but only in this case), as object can essentially hold anything.

So comparisons must follow the np.nan convention.

I'm leaving this issue open for the sole purpose of fixing [9] above.

@mapio


mapio commented Mar 23, 2018

If you choose to convert missing values to NaNs for ALL (allcaps yours) dtypes, a decision I can understand and second, I still don't see why this does not hold for the object dtype!

Please consider converting None to NaN for object series as well; this would make missing value handling and equality consistent. I can't see any drawback…

@TomAugspurger

Contributor

TomAugspurger commented Mar 23, 2018

I think this will be hard to change as pandas is currently structured. We allow both None and NaN as the missing value indicator in object columns.

In [14]: a = pd.Series(['a', None, float('nan')])

In [15]: a
Out[15]:
0       a
1    None
2     NaN
dtype: object

In [16]: a == a
Out[16]:
0     True
1    False
2    False
dtype: bool

I can't see any drawback…

IIRC, people found it useful to not immediately convert None to NaN there. They wanted to keep their python scalars. I don't recall details though.

If we wanted to change this, which I'm not sure we do, I don't see a deprecation path to changing None to compare equal to itself, though perhaps you do.

How is this affecting you negatively?

@mapio


mapio commented Mar 23, 2018

I think that people find it useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some Python types (like None)!

Consider this piece of code:

In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(a == b for a, b in zip(A, B))
Out[5]: True

In [6]: all(A.values == B.values)
Out[6]: True
In [7]: all(a == b for a, b in zip(A.values, B.values))
Out[7]: True

I expect In [4] to tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently it says they are not! But if I test equality member by member as in In [5], I get (as expected) that the series are equal.

Now, if I turn the series into numpy arrays as in In [6], I get the expected result. Here NumPy is consistent with the member-by-member test in In [7].

As you can clearly see, there is an inconsistency that is very hard to understand; you have to know (and there is no place in the docs where this is clearly stated) that:

  • pandas converts all missing values except None to NaN,
  • pandas (following IEEE 754) treats NaNs as different from themselves.

Having this inconsistent behaviour, leaving Nones as such but comparing them as different, violates the principle of least astonishment.

I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) figuring out why there was a difference when, looking at the corresponding rows, all the values were equal!

I don't think anyone can really think that an inconsistent behaviour is a good thing…
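For what it's worth, a missing-aware elementwise comparison can be built on top of isna() as a workaround (a sketch with a hypothetical helper name, not an official pandas API):

```python
import pandas as pd

def eq_or_both_missing(a, b):
    """Elementwise equality over two equal-length series that treats
    two missing values (None or NaN) in the same position as equal."""
    return (a == b) | (a.isna() & b.isna())

A = pd.Series(['a', None, 2], dtype=object)
B = pd.Series(['a', None, 2], dtype=object)

print(eq_or_both_missing(A, B).tolist())  # [True, True, True]
```

This matches the Python-level member-by-member comparison while still using vectorized pandas operations.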

@TomAugspurger

Contributor

TomAugspurger commented Mar 23, 2018

@mapio


mapio commented Mar 23, 2018

I am just a pandas user, not a developer. What I can suggest is to look for the code that converts None to NaN in the numeric series constructors and copy it to the object series constructors. And document it in the section of the documentation where the handling of missing values is described.

Observe that this is not likely to cause any particular harm. If one puts a None in a series, one is not likely to invoke methods on it, so changing it to np.nan will not make much of a difference. Especially since Series.isnull is (as stated in the documentation) an alias of Series.isna, code that was looking for None will still find it, whichever method it calls to test for it.
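The normalization suggested above can be approximated today at construction time (a sketch of the idea, not what the pandas constructor actually does):

```python
import numpy as np
import pandas as pd

raw = ['a', None, 'b']
# Replace None with np.nan up front, mirroring what the numeric
# constructors do, so the object series has a single missing marker.
s = pd.Series([np.nan if v is None else v for v in raw], dtype=object)

print(s[1] is None)       # False: the missing value is now a float NaN
print((s == s).tolist())  # [True, False, True], consistent with numeric series
```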

If some developer offers to fix the code, I volunteer to fix the docs 😄

@walkerh


walkerh commented Apr 9, 2018

There is no way to address this without a major point release, since any change here will break backwards compatibility.

Nevertheless, this issue does imply a clear need for documentation improvement: Nones selectively behave as NaNs.
