BUG: comparisons fail for NaT in DataFrame #15697

Closed
adbull opened this issue Mar 16, 2017 · 6 comments

@adbull
Contributor

commented Mar 16, 2017

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> nat = pd.NaT
>>> x = pd.Series([nat])
>>> x.eq(nat)
0    False
dtype: bool
>>> x == nat
0    False
dtype: bool
>>> y = pd.DataFrame(dict(x=x))
>>> y.eq(nat)
       x
0    NaT
>>> y == nat
       x
0    True

Problem description

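Comparisons on a DataFrame containing a single NaT give incorrect answers: y.eq(nat) returns NaT and y == nat returns True, while the equivalent Series comparisons correctly return False. Note this occurs with both datetime and timedelta NaTs.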

Expected Output

0    False
dtype: bool

0    False
dtype: bool

       x
0  False

       x
0  False
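The four comparisons from the code sample can be rerun in one go. This is a minimal sketch; `nat_equality_report` is a hypothetical helper name, not part of pandas. On a pandas build without this bug, every value it reports should be False.

```python
import pandas as pd


def nat_equality_report():
    """Compare a single NaT against NaT via Series and DataFrame."""
    nat = pd.NaT
    x = pd.Series([nat])
    y = pd.DataFrame(dict(x=x))
    # The DataFrame entries are returned unconverted so that a buggy
    # result (NaT or True) shows up in the report as-is.
    return {
        "Series.eq": x.eq(nat).iloc[0],
        "Series ==": (x == nat).iloc[0],
        "DataFrame.eq": y.eq(nat).iloc[0, 0],
        "DataFrame ==": (y == nat).iloc[0, 0],
    }


report = nat_equality_report()
for name, value in report.items():
    print(f"{name}: {value!r}")
```

Any entry that is not False points at the inconsistency described above.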

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.8-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.19.0+579.g4ce9c0c
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback

Contributor

commented Mar 16, 2017

This is as expected: NaT != NaT, just as nan != nan.

See the big red box: http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

In [1]: nat = pd.NaT
   ...: x = pd.Series([nat])
   ...: 

In [2]: x.eq(nat)
Out[2]: 
0    False
dtype: bool

In [3]: x.isnull()
Out[3]: 
0    True
dtype: bool

In [4]: x = pd.Series([np.nan])
   ...: 
   ...: 

In [5]: x.eq(np.nan)
Out[5]: 
0    False
dtype: bool

In [6]: x.isnull()
Out[6]: 
0    True
dtype: bool

@jreback jreback closed this Mar 16, 2017

@jreback jreback added this to the No action milestone Mar 16, 2017

@jreback

Contributor

commented Mar 16, 2017

Actually, you are right: this is broken for a DataFrame with NaT, but works for np.nan.

So I'll mark it, though you should never do this. Maybe we should just raise.

In [15]: y = DataFrame(dict(x=[np.nan]))

In [16]: y.eq(np.nan)
Out[16]: 
       x
0  False

In [17]: y == np.nan
Out[17]: 
       x
0  False

In [18]: y = DataFrame(dict(x=[pd.NaT]))

In [19]: y.eq(pd.NaT)
Out[19]: 
    x
0 NaT

In [20]: y == pd.NaT
Out[20]: 
      x
0  True

@jreback jreback reopened this Mar 16, 2017

@jreback jreback modified the milestones: Next Major Release, No action Mar 16, 2017

@adbull

Contributor Author

commented Mar 16, 2017

Agreed; as above, the correct result of NaT == NaT is False. The bug is that for a DataFrame, we instead get NaT or True, depending on how we test for equality.

Is there a reason why this call should never be made? At worst, we could just call apply() and use the Series methods, which work fine.
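A rough sketch of the apply() fallback mentioned above, assuming Series.eq behaves correctly for NaT as in the original code sample. `eq_by_column` is a hypothetical helper name, not an existing pandas function.

```python
import pandas as pd


def eq_by_column(frame, value):
    """Compare each column to `value` using Series.eq, one column at a time."""
    return frame.apply(lambda column: column.eq(value))


y = pd.DataFrame(dict(x=[pd.NaT]))
result = eq_by_column(y, pd.NaT)
print(result)
```

This sidesteps the DataFrame-level comparison entirely, at the cost of a Python-level loop over columns.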

@jreback

Contributor

commented Mar 16, 2017

> Is there a reason why this call should never be made? At worst, we could just call apply() and use the Series methods, which work fine.

You don't want to compare against a null value; it's not intuitive, as nan != nan is just plain confusing to most people.

The more explicit df[df.isnull()] is much more obvious.

So it's not that you shouldn't do it if you know what you are doing; it's just non-obvious from reading. Further, it can provide lots of opportunities for odd bugs. Consider:

for x in ['foo', np.nan]:
     df.eq(x)

this will give totally unexpected results.
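One way to avoid that trap is to route null lookups through isnull() explicitly. The sketch below is an illustration only; `match_value` is a hypothetical helper, not something pandas provides.

```python
import numpy as np
import pandas as pd


def match_value(frame, value):
    """Boolean mask of cells matching `value`, treating null as 'is missing'."""
    if pd.isnull(value):
        # eq(nan) would be False everywhere, so ask for missing cells instead.
        return frame.isnull()
    return frame.eq(value)


df = pd.DataFrame({"x": ["foo", np.nan]})
for value in ["foo", np.nan]:
    print(repr(value), match_value(df, value)["x"].tolist())

# For contrast, the plain comparison never matches the missing cell.
plain = df.eq(np.nan)["x"].tolist()
print(plain)
```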

@adbull

Contributor Author

commented Mar 16, 2017

Sure, that specific call would be better as .isnull(), but NaTs break comparisons in DataFrames more generally.

>>> nat = pd.NaT
>>> now = pd.to_datetime('now')
>>> nat < now
False
>>> pd.DataFrame([[nat]]) < now
      0
0  True

@jreback

Contributor

commented Mar 16, 2017

Here is the issue for Series: #9005

This is not very common on frames; you generally cannot compare frames unless they are of a single dtype. You would usually select out a portion and then compare.

So it's a bug. Pull requests are welcome!
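As a sketch of "select out a portion and then compare", the example below pulls the datetime columns out of a mixed-dtype frame before comparing. The frame, column names, and cutoff are made up for illustration, and the result shown by print depends on running a pandas build where the NaT comparison behaves correctly.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one object column, one datetime column.
df = pd.DataFrame({
    "name": ["a", "b"],
    "when": [pd.Timestamp("2017-01-01"), pd.NaT],
})
cutoff = pd.Timestamp("2017-03-16")

# Keep only the datetime columns so the comparison sees a single dtype.
datetimes = df.select_dtypes(include=["datetime"])
before_cutoff = datetimes < cutoff
print(before_cutoff)
```

Since NaT is not less than any timestamp, the row holding NaT should come out False.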
