Inconsistency, NaT included in result of groupby method first but not NaN #10590

Closed
larvian opened this Issue Jul 15, 2015 · 5 comments

Contributor

larvian commented Jul 15, 2015

NaT is included in the result of the groupby method first, while NaN is not. I expect first to skip both NaN and NaT and return the first value for which pandas.isnull is False.
Demonstration of the inconsistency (note that both NaT and NaN in the data frame are produced by np.nan; the difference is that the d_t column contains date values):

import numpy as np
import pandas as pd
from datetime import datetime as dt

testFrame = pd.DataFrame({'IX': ['A', 'A'], 'num': [np.nan, 100], 'd_t': [np.nan, dt.now()]})

Resulting data frame:

  IX                     d_t  num
0  A                     NaT  NaN
1  A 2015-07-15 22:47:10.635  100

Grouping this data frame on the IX column and calling the first method produces the following data frame, which shows the inconsistency between the d_t and num columns.

testFrame.groupby('IX').first()

Resulting dataframe:

        d_t  num
IX              
A       NaT  100
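
For comparison, the expected behaviour described above can be sketched as follows. This is only an illustration of the expectation, not pandas' internal implementation; the helper name first_non_null is hypothetical, and a fixed timestamp stands in for dt.now() so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Same shape as the frame above, with a fixed timestamp instead of dt.now().
test_frame = pd.DataFrame({
    "IX": ["A", "A"],
    "num": [np.nan, 100],
    "d_t": pd.to_datetime([None, "2015-07-15 22:47:10.635"]),
})


def first_non_null(series):
    """Hypothetical helper: first value for which pd.isnull is False."""
    valid = series[~pd.isnull(series)]
    return valid.iloc[0] if len(valid) else None


# Apply the helper per group, column by column.
expected_num = test_frame.groupby("IX")["num"].apply(first_non_null)
expected_d_t = test_frame.groupby("IX")["d_t"].apply(first_non_null)

print(expected_num)
print(expected_d_t)
```

With this helper, both columns skip their missing entry in row 0 and return the value from row 1.
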
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Contributor

jreback commented Jul 15, 2015

hmm, @sinhrks I thought this was fixed?

jreback added this to the 0.17.0 milestone Jul 15, 2015

Member

sinhrks commented Jul 15, 2015

I remember some issues when NaT is in the group key, but I was not aware of the aggregation issue. .min might also be affected.

Contributor

jreback commented Jul 15, 2015

Probably needs some adjustment for the comparison vs iNaT in first/last (though I thought it was there).

Member

sinhrks commented Jul 16, 2015

Confirmed that .min is also affected. @larvian a PR would be appreciated :)

testFrame.groupby('IX').min()
#    d_t  num
# IX         
# A  NaT  100
Contributor

larvian commented Jul 18, 2015

@sinhrks :) I see. Well, I have started to look into GitHub and the pandas source code a bit, but unfortunately I don't have much time available right now, so waiting for me could require patience.
If I get to the point where I feel confident that I can contribute to the solution, I will open a PR. If someone else does it, I am OK with that too :)

larvian added a commit to larvian/pandas that referenced this issue Jul 26, 2015

larvian BUG: Fix issue with incorrect groupby handling of NaT #10590
For groupby, the timestamps get converted to the integer value tslib.iNaT,
which is -9223372036854775808. The aggregation is then done using this
value, with an incorrect result as a consequence. The solution proposed here
is to replace that value with np.nan when the column is a datetime or timedelta.
f7f030e
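
The effect described in this commit message can be reproduced with NumPy alone. The sketch below is an illustration under stated assumptions, not the code from the commit: it masks out the sentinel instead of replacing it with np.nan, and the names INAT, naive_min and masked_min are made up for the example.

```python
import numpy as np

# A datetime column with one missing value, like d_t in the report.
values = np.array(["NaT", "2015-07-15T22:47:10"], dtype="datetime64[ns]")

# Viewed as int64, NaT is stored as the smallest representable integer.
as_int = values.view("i8")
INAT = np.iinfo(np.int64).min  # -9223372036854775808

# Aggregating on the raw integers lets the sentinel win the comparison.
naive_min = as_int.min()

# Excluding the sentinel first gives the real minimum timestamp.
masked_min = as_int[as_int != INAT].min()

print(naive_min == INAT)
print(masked_min.astype("datetime64[ns]"))
```

Because the sentinel is the minimum int64 value, any min-style comparison on the raw integers returns it unless missing entries are filtered out or replaced first.
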

jreback modified the milestone: Next Major Release, 0.17.0 Sep 1, 2015

jreback added the Prio-medium label Sep 1, 2015

larvian added a commit to larvian/pandas that referenced this issue Sep 2, 2015

larvian BUG: Fix issue with incorrect groupby handling of NaT #10590
9c2d1a6

jreback modified the milestone: 0.17.0, Next Major Release Sep 2, 2015

jreback closed this in #10625 Sep 3, 2015

nickeubank added a commit to nickeubank/pandas that referenced this issue Sep 29, 2015

larvian + nickeubank BUG: Fix issue with incorrect groupby handling of NaT #10590
0b0469a