Resampler.nunique counting data more than once #13453
Comments
May be related to pydata#10914.

Interestingly, everything seems to work fine if
sinhrks added the Resample and Bug labels on Jun 15, 2016

CC: @behzadnouri
jreback referenced this issue on Jul 26, 2016 (closed):
BUG: resample nunique calculation incorrect #13795
jreback added the Difficulty Intermediate and Effort Low labels on Jul 26, 2016
jreback added this to the Next Major Release milestone on Jul 26, 2016
mgalbright commented on Oct 8, 2016:
I think the root cause of the problem is in `groupby.nunique()`, which I believe is eventually called by `resample.nunique()`. Note that `groupby.nunique()` has the same bug:

```python
import pandas as pd
from pandas import Timestamp

data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'),
        Timestamp('2016-06-28 16:09:30'),
        Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# wrong counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].nunique())

# correct counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].apply(pd.Series.nunique))
```

This gives:

I believe the problem is in the second-to-last line of `groupby.nunique()`: `res[ids] = out`. I suspect

pd.show_versions()
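For anyone re-running this comparison today: `pd.TimeGrouper` has since been removed from pandas, so here is the same check sketched with `pd.Grouper`. On an affected 0.18.x the two results differ; on a fixed version both report 1 unique value in the 09:00 bin and 2 in the 16:00 bin.

```python
import pandas as pd

data = ['1', '2', '3']
time = [pd.Timestamp('2016-06-28 09:35:35'),
        pd.Timestamp('2016-06-28 16:09:30'),
        pd.Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

grouped = test.set_index('time').groupby(pd.Grouper(freq='h'))['data']
fast = grouped.nunique()                 # optimized grouped path (buggy on 0.18.x)
slow = grouped.apply(pd.Series.nunique)  # slower per-group path, always correct
```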
@mgalbright why don't you submit a pull request with your test examples (and those from the issue) and the proposed fix? See if that breaks anything else. It would be greatly appreciated!
Hey, is there any advancement on this? I just realized that a report I've been building is giving the wrong results, and I believe it's due to this. I can't share all the code, but here's a comparison of `unique`, `nunique`, and `count` on the same grouping:

```python
In [216]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].unique().tail(1)
Out[216]:
startdate
2016-11-12    [550A00000033DHUIA2]
Freq: W-SAT, Name: ent_id, dtype: object

In [217]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].nunique().tail(1)
Out[217]:
startdate
2016-11-12    7
Freq: W-SAT, Name: ent_id, dtype: int64

In [218]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].count().tail(1)
Out[218]:
startdate
2016-11-12    1
Freq: W-SAT, Name: ent_id, dtype: int64
```
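A quick way to spot this kind of miscount in a report is to cross-check `nunique()` against the length of `unique()` per bin. The frame below is a hypothetical stand-in for the `ents` data above (the real data isn't shareable); on a fixed pandas the two agree bin-by-bin, while under the bug `nunique()` could report more values than are actually present.

```python
import pandas as pd

# Hypothetical stand-in: one entity seen three times in the week
# ending Saturday 2016-11-12.
ents = pd.DataFrame({
    'startdate': pd.to_datetime(['2016-11-07', '2016-11-08', '2016-11-09']),
    'ent_id': ['550A00000033DHUIA2'] * 3,
})

g = ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id']
# True for any bin where nunique() disagrees with the unique values present.
mismatch = (g.nunique() != g.unique().apply(len))
```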
@aiguofer pull requests to fix this are welcome.
hantusk commented on Feb 6, 2017:
Not really adding anything, but I just ran into this issue for a work report as well (pandas version 0.19.2). Passing to `.agg(pd.Series.nunique)` works great, thanks for the tip!
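For reference, the workaround mentioned here sketched end-to-end on the resample side: passing the plain `pd.Series.nunique` callable to `.agg` routes each bin through the ordinary Series method instead of the grouped fast path that was miscounting.

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2016-06-28 09:35', '2016-06-28 16:09',
                            '2016-06-28 16:46']),
    'data': ['1', '2', '3'],
})

# Aggregate each hourly bin with the plain Series method; this avoids
# the optimized nunique() path affected by the bug.
counts = df.set_index('time').resample('h')['data'].agg(pd.Series.nunique)
```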
aiguofer added a commit (0daab80) to aiguofer/pandas that referenced this issue on Feb 15, 2017
aiguofer referenced this issue on Feb 15, 2017 (closed):

Ensure the right values are set in SeriesGroupBy.nunique #15418
Took a look at @mgalbright's comment and suggestion, and if I'm understanding the code correctly, the above PR should fix it. I ran
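If I'm reading the thread right, the miscount is the classic NumPy fancy-assignment pitfall: when the label array used in an assignment like `res[ids] = out` repeats some positions and skips others, writes collide and counts land in the wrong slots. A pandas-free illustration (the arrays here are made up for the example, not pandas internals):

```python
import numpy as np

res = np.zeros(4)            # one slot per bin
ids = np.array([0, 0, 2])    # bin labels: 0 repeated, bin 1 never written
out = np.array([2.0, 3.0, 1.0])

# Fancy assignment with duplicate indices: only the last write to a
# repeated index survives, so values end up misplaced across bins.
res[ids] = out
```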
aiguofer added a commit (c53bd70) to aiguofer/pandas that referenced this issue on Feb 16, 2017
jreback closed this in 5a8883b on Feb 16, 2017
jreback modified the milestones: 0.20.0, Next Major Release on Feb 16, 2017
AnkurDedania added a commit (bd9c1d2, co-authored aiguofer + AnkurDedania) to AnkurDedania/pandas that referenced this issue on Mar 21, 2017
jcrist commented on Jun 15, 2016 (edited by jreback):

xref addtl example in #13795

Pandas `Resampler.nunique` appears to be putting the same data in multiple bins. In pandas 0.18.1 and 0.18.0 these don't give the same results, when they should. In pandas 0.17.0 and 0.17.1 (adjusting to the old-style resample syntax), the `nunique` one fails with "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of `internals.py`. If I go back to 0.16.2, I do get the same result for each.

I'm not sure what's going on here. Since the `nunique` results sum to larger than the length, it appears data is being counted more than once.
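The "counted more than once" observation gives a handy invariant: each row can contribute at most one unique value to its own bin, so the per-bin `nunique` values can never legitimately sum past the number of rows. A minimal sketch of that sanity check (data made up for illustration; on a fixed pandas the check holds, while the bug reported here violated it):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'],
              index=pd.to_datetime(['2016-06-28 09:05', '2016-06-28 09:50',
                                    '2016-06-28 16:10', '2016-06-28 16:45']))

nun = s.resample('h').nunique()
# Each observation adds at most 1 to its bin's unique count, so the
# per-bin counts can sum to at most len(s).
ok = nun.sum() <= len(s)
```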