Resampler.nunique counting data more than once #13453

Closed
jcrist opened this Issue Jun 15, 2016 · 9 comments

Contributor

jcrist commented Jun 15, 2016 edited by jreback

xref: additional example in #13795

Pandas Resampler.nunique appears to be putting the same data in multiple bins:

import pandas as pd

# Create a series with a datetime index
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Since all elements are unique, `count` and `nunique` should give the same result
count = s.resample('M').count()
nunique = s.resample('M').nunique()

In pandas 0.18.1 and 0.18.0 these don't give the same results, though they should:

In [3]: count
Out[3]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [4]: nunique
Out[4]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64

In pandas 0.17.0 and 0.17.1 (adjusting to the old-style resample syntax), the nunique call fails with "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of internals.py. If I go back to 0.16.2, I do get the same result for each.

I'm not sure what's going on here. Since the nunique results sum to larger than the length, it appears data is being counted more than once.
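The invariant the report relies on can be checked directly. A minimal sketch of the repro with explicit assertions (hedged for newer pandas, where the month-end alias 'M' was renamed to 'ME' in 2.2; on a version with the fix both assertions pass):

```python
import pandas as pd

index = pd.date_range('2000-01-01', '2000-02-15', freq='h')
index2 = pd.date_range('2000-04-15', '2000-05-15', freq='h')
s = pd.Series(range(len(index) + len(index2)), index=index.append(index2))

# Month-end alias is 'M' up to pandas 2.1 and 'ME' from 2.2 onward
try:
    resampled = s.resample('ME')
except ValueError:
    resampled = s.resample('M')

count = resampled.count()
nunique = resampled.nunique()

# Every value in s is distinct, so the two aggregations must agree
# bin by bin, and the per-bin distinct counts must sum to len(s)
assert count.equals(nunique)
assert int(nunique.sum()) == len(s)
```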


In [19]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
Contributor

jcrist commented Jun 15, 2016

May be related to pydata#10914.

Contributor

jcrist commented Jun 15, 2016

Interestingly, everything seems to work fine if agg is used with pd.Series.nunique instead:

In [11]: r = s.resample('M')

In [12]: r.agg(pd.Series.nunique)
Out[12]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [13]: r.nunique()    # same result as r.agg('nunique')
Out[13]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64
Member

sinhrks commented Jun 15, 2016

jreback added this to the Next Major Release milestone Jul 26, 2016

mgalbright commented

I think the root cause of the problem is in groupby.nunique(), which I believe is eventually called by resample.nunique(). Note that groupby.nunique() has the same bug:

import pandas as pd
from pandas import Timestamp

data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'),
        Timestamp('2016-06-28 16:09:30'),
        Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# wrong counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].nunique())
# correct counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].apply(pd.Series.nunique))

This gives

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    1  
Freq: H, Name: data, dtype: int64   

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    2  
Freq: H, Name: data, dtype: int64  

I believe the problem is in the second to last line of groupby.nunique(),
i.e. line 2955 in groupby.py

res[ids] = out

I suspect ids should not be used for the indexing: it has different dimensions from out.
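The dimension mismatch can be illustrated with a toy numpy sketch (this is not pandas internals, just a model of the suspected indexing bug and of the boundary-based indexing later adopted in the fix):

```python
import numpy as np

# Toy model: 5 rows fall into 3 groups.
ids = np.array([0, 0, 1, 1, 2])   # per-row group label, length 5
out = np.array([2, 2, 1])         # per-group nunique, length 3
res = np.zeros(3, dtype=int)

# Buggy pattern: per-row labels index a per-group result.  Here the
# lengths differ and numpy raises; when they happen to coincide, the
# values are silently scattered into the wrong bins instead.
try:
    res[ids] = out
except ValueError as e:
    print('mismatch:', e)

# Indexing by each group's first row (the group boundaries) lines up
# the left- and right-hand sides one entry per group.
idx = np.r_[0, np.nonzero(np.diff(ids))[0] + 1]   # [0, 2, 4]
res[ids[idx]] = out
print(res.tolist())  # [2, 2, 1]
```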

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1
Contributor

jreback commented Oct 9, 2016

@mgalbright why don't you submit a pull-request with your test examples (and those from the issue), and the proposed fix. See if that breaks anything else. Would be greatly appreciated!

Contributor

aiguofer commented Nov 14, 2016 edited

Hey, is there any progress on this? I just realized that a report I've been building is giving the wrong results, and I believe it's due to this bug. I can't share all the code, but here's a comparison of groupby.unique, groupby.nunique, and groupby.count:

In [216]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].unique().tail(1)
Out[216]: 

startdate
2016-11-12    [550A00000033DHUIA2]
Freq: W-SAT, Name: ent_id, dtype: object

In [217]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].nunique().tail(1)
Out[217]: 

startdate
2016-11-12    7
Freq: W-SAT, Name: ent_id, dtype: int64

In [218]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].count().tail(1)
Out[218]: 

startdate
2016-11-12    1
Freq: W-SAT, Name: ent_id, dtype: int64
Contributor

jreback commented Nov 14, 2016

@aiguofer pull-requests are welcome to fix.

hantusk commented Feb 6, 2017 edited

Not really adding anything, but I just ran into this issue in a work report as well (pandas version 0.19.2). Passing .agg(pd.Series.nunique) works great; thanks for the tip.
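On affected versions, the workaround mentioned above can be sketched as follows; .agg(pd.Series.nunique) applies the Series-level method per bin, sidestepping the SeriesGroupBy.nunique fast path that mis-assigned results (a minimal sketch on synthetic data):

```python
import pandas as pd

# 48 hourly values, all distinct, spanning exactly two days
idx = pd.date_range('2000-01-01', periods=48, freq='h')
s = pd.Series(range(len(idx)), index=idx)

# Daily bins; pd.Series.nunique is evaluated per bin in Python,
# so each day counts only its own 24 distinct values
workaround = s.resample('D').agg(pd.Series.nunique)
print(workaround.tolist())  # [24, 24]
```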

@aiguofer added a commit to aiguofer/pandas that referenced this issue Feb 15, 2017

@aiguofer Ensure the right values are set in SeriesGroupBy.nunique
We only need to use the group boundaries as the index for `res` so that
the dimensions match those of `out`. Fixes #13453
0daab80
Contributor

aiguofer commented Feb 15, 2017

Took a look at @mgalbright's comment and suggestion, and if I'm understanding the code correctly, the above PR should fix it. I ran nosetests pandas/tests/groupby and only had one unrelated test failure (test_series_groupby_value_counts() takes exactly 2 arguments (0 given)).

@aiguofer added a commit to aiguofer/pandas that referenced this issue Feb 16, 2017

@aiguofer aiguofer Add test for #13453 in test_resample and add note to whatsnew c53bd70

jreback closed this in 5a8883b Feb 16, 2017

@jreback modified the milestone: 0.20.0, Next Major Release Feb 16, 2017

@AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

@aiguofer + @AnkurDedania BUG: Ensure the right values are set in SeriesGroupBy.nunique
closes #13453

Author: Diego Fernandez <difernan@redhat.com>

Closes #15418 from aiguofer/gh_13453 and squashes the following commits:

c53bd70 [Diego Fernandez] Add test for #13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
bd9c1d2