BUG: comparisons fail for NaT in DataFrame #15697

Closed
Tracked by #18824
adbull opened this issue Mar 16, 2017 · 6 comments · Fixed by #22163
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Timedelta (Timedelta data type); Timeseries

@adbull
Contributor

adbull commented Mar 16, 2017

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> nat = pd.NaT
>>> x = pd.Series([nat])
>>> x.eq(nat)
0    False
dtype: bool
>>> x == nat
0    False
dtype: bool
>>> y = pd.DataFrame(dict(x=x))
>>> y.eq(nat)
       x
0    NaT
>>> y == nat
       x
0    True

Problem description

Comparisons in a dataframe containing a single nat give incorrect answers. Note this occurs with both datetime and timedelta nats.

Expected Output

0    False
dtype: bool

0    False
dtype: bool

       x
0  False

       x
0  False

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.8-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.19.0+579.g4ce9c0c
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 16, 2017

This is as expected: NaT != NaT, just as nan != nan.

see the big red box: http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

In [1]: nat = pd.NaT
   ...: x = pd.Series([nat])
   ...: 

In [2]: x.eq(nat)
Out[2]: 
0    False
dtype: bool

In [3]: x.isnull()
Out[3]: 
0    True
dtype: bool

In [4]: x = pd.Series([np.nan])
   ...: 
   ...: 

In [5]: x.eq(np.nan)
Out[5]: 
0    False
dtype: bool

In [6]: x.isnull()
Out[6]: 
0    True
dtype: bool

@jreback closed this as completed Mar 16, 2017
@jreback added this to the No action milestone Mar 16, 2017
@jreback added the Missing-data and Usage Question labels Mar 16, 2017
@jreback
Contributor

jreback commented Mar 16, 2017

Actually you are right: this is broken for a DataFrame with NaT, but works for np.nan.

so I'll mark it, though you should never do this. Maybe we should just raise.

In [15]: y = DataFrame(dict(x=[np.nan]))

In [16]: y.eq(np.nan)
Out[16]: 
       x
0  False

In [17]: y == np.nan
Out[17]: 
       x
0  False

In [18]: y = DataFrame(dict(x=[pd.NaT]))

In [19]: y.eq(pd.NaT)
Out[19]: 
    x
0 NaT

In [20]: y == pd.NaT
Out[20]: 
      x
0  True

@jreback reopened this Mar 16, 2017
@jreback modified the milestones: Next Major Release, No action Mar 16, 2017
@jreback added the Bug, Indexing, Timedelta and Timeseries labels and removed the Usage Question label Mar 16, 2017
@adbull
Contributor Author

adbull commented Mar 16, 2017

Agreed; as above, the correct result of NaT == NaT is False. The bug is that for a DataFrame, we instead get NaT or True, depending on how we test for equality.

Is there a reason why this call should never be made? At worst, we could just call apply() and use the Series methods, which work fine.
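The apply() fallback mentioned above could look roughly like the following. This is only a sketch of the workaround, not the fix that was eventually merged; the frame contents and the name `per_column` are made up for illustration.

```python
import pandas as pd

# Illustrative frame with one missing and one valid timestamp.
frame = pd.DataFrame({"x": [pd.NaT, pd.Timestamp("2017-03-16")]})

# Column-by-column fallback: apply() hands each column to the lambda as a
# Series, so the comparison goes through Series.eq, which treats NaT as
# unequal to everything, including NaT itself.
per_column = frame.apply(lambda col: col.eq(pd.NaT))

print(per_column)
```

Every entry of `per_column` should be False, which matches the expected output listed in the report.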

@jreback
Contributor

jreback commented Mar 16, 2017

> Is there a reason why this call should never be made? At worst, we could just call apply() and use the Series methods, which work fine.

You don't want to compare against a null value; it's not intuitive to do this, as nan != nan is just plain confusing to most people.

The more explicit df[df.isnull()] is much more obvious.

So it's not that you shouldn't do it if you know what you are doing; it's just non-obvious from reading. Further, it can provide lots of opportunities for odd bugs. Consider:

for x in ['foo', np.nan]:
     df.eq(x)

this will give totally unexpected results.
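To make the surprise concrete, here is a small sketch of that loop on a made-up one-column frame (the column name `a` and the explicit object dtype are assumptions for the example). It also shows the isnull() alternative for the missing-value case.

```python
import numpy as np
import pandas as pd

# Illustrative object-dtype column holding one string and one missing value.
df = pd.DataFrame({"a": pd.Series(["foo", np.nan], dtype=object)})

# The same .eq() call behaves very differently depending on the value:
# 'foo' matches the row that holds 'foo', but np.nan matches nothing,
# not even the row that holds np.nan.
results = [df.eq(x)["a"].tolist() for x in ["foo", np.nan]]

# Asking for missing values explicitly finds the row that eq(np.nan) misses.
missing = df.isnull()["a"].tolist()

print(results)
print(missing)
```

A reader skimming the loop would probably expect the second iteration to flag the nan row; only the isnull() call does.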

@adbull
Contributor Author

adbull commented Mar 16, 2017

Sure, that specific call would be better as .isnull(), but nats break comparisons in dataframes more generally.

>>> nat = pd.NaT
>>> now = pd.to_datetime('now')
>>> nat < now
False
>>> pd.DataFrame([[nat]]) < now
      0
0  True
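One way to check a given pandas build for this behaviour is to compare the frame-level result against a column-by-column Series comparison, which is taken here as the reference. This is a hedged sketch: the fixed timestamp stands in for pd.to_datetime('now'), and the names `expected`, `actual` and `frames_agree` are illustrative.

```python
import pandas as pd

# Fixed stand-in for pd.to_datetime('now') so the example is repeatable.
now = pd.Timestamp("2017-03-16")
frame = pd.DataFrame({"x": [pd.NaT, pd.Timestamp("2017-01-01")]})

# Reference answer: compare each column as a Series, where NaT < now is False.
expected = pd.DataFrame({name: col < now for name, col in frame.items()})

# Frame-level comparison, the code path reported as broken in this issue.
actual = frame < now

# True when the frame-level comparison matches the per-column reference.
frames_agree = actual.equals(expected)
print(frames_agree)
```

On a build affected by this bug, `frames_agree` would be expected to come out False because the NaT row is reported as True.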

@jreback
Contributor

jreback commented Mar 16, 2017

Here is the issue for Series: #9005

This is not very common on frames; you generally cannot compare frames unless they are of a single dtype. You would usually select out a portion and then compare.

So it's a bug. Pull requests are welcome!


2 participants