API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14336

Closed
xflr6 opened this Issue Oct 3, 2016 · 5 comments

Contributor

xflr6 commented Oct 3, 2016

When exploring a data set, I often need to run df.apply(pd.Series.nunique) or df.apply(lambda x: x.nunique()). How about adding this as a nunique() method parallel to DataFrame.count()? (count and unique are also the two most basic pieces of information displayed by DataFrame.describe().)
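
A minimal sketch of the workaround described above, using the small frame from the groupby example in this report. Both spellings should give the same per-column counts:

```python
import pandas as pd

# Same four-row frame as in the groupby example of this report.
df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})

# Apply Series.nunique to each column in turn (the two spellings are equivalent).
by_method = df.apply(pd.Series.nunique)
by_lambda = df.apply(lambda x: x.nunique())

print(by_method)
```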

I think there are also use cases for this as a groupby method, for example when checking whether a candidate primary key maps to differing rows (values):

>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})
>>> df.groupby('id').filter(lambda g: (g.apply(pd.Series.nunique) > 1).any())
     id  value
0  spam      1
3  spam      2
Member

shoyer commented Oct 3, 2016

Agreed, I think this would be welcome functionality.

Contributor

jreback commented Oct 3, 2016

Note that these are already defined for Series.

In [9]: df.groupby('id').value.nunique()
Out[9]: 
id
eggs    1
spam    2
Name: value, dtype: int64

In [10]: df.groupby('id').value.unique()
Out[10]: 
id
eggs       [5]
spam    [1, 2]
Name: value, dtype: object
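
Since the issue draws a parallel with count(), the Series-level groupby methods can also be combined in a single call. A minimal sketch, reusing the four-row frame from the first comment (the variable name `summary` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam'], 'value': [1, 5, 5, 2]})

# One row per id: number of distinct values next to the number of non-null values.
summary = df.groupby('id')['value'].agg(['nunique', 'count'])
print(summary)
```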

jreback added this to the Next Major Release milestone Oct 3, 2016

Contributor

xflr6 commented Oct 4, 2016

Of course, extending the groupby-example:

>>> df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam', 'ham', 'ham'],
...                    'value1': [1, 5, 5, 2, 5, 5], 'value2': list('abbaxy')})
>>> df
     id  value1 value2
0  spam       1      a
1  eggs       5      b
2  eggs       5      b
3  spam       2      a
4   ham       5      x
5   ham       5      y
>>> df.groupby('id').filter(lambda g: (g.apply(pd.Series.nunique) > 1).any())
     id  value1 value2
0  spam       1      a
3  spam       2      a
4   ham       5      x
5   ham       5      y

jreback modified the milestone: 0.20.0, Next Major Release Jan 2, 2017

Any news?

jreback closed this in a1b6587 Jan 23, 2017

Contributor

jreback commented Jan 23, 2017

just merged.
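
A short sketch of how the merged methods might be used on the extended example from this thread, assuming pandas 0.20.0 or later (the milestone named above). The value columns are selected explicitly before grouping, on the assumption that whether the grouping column shows up in the grouped result can differ between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'eggs', 'eggs', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5], 'value2': list('abbaxy')})

# Distinct values per column, replacing df.apply(pd.Series.nunique).
per_column = df.nunique()

# Distinct values per column within each group.
per_group = df.groupby('id')[['value1', 'value2']].nunique()

# Keep only the groups in which some column takes more than one value.
varying = df.groupby('id').filter(lambda g: (g.nunique() > 1).any())

print(per_group)
```

The filter call keeps the same rows as the apply-based version earlier in the thread (index 0, 3, 4 and 5).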

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

xflr6 + AnkurDedania: API: add DataFrame.nunique() and DataFrameGroupBy.nunique()
closes #14336

Author: Sebastian Bank <sebastian.bank@uni-leipzig.de>

Closes #14376 from xflr6/nunique and squashes the following commits:

a0558e7 [Sebastian Bank] use apply()-kwargs instead of partial, more tests, better examples
c8d3ac4 [Sebastian Bank] extend docs and tests
fd0f22d [Sebastian Bank] add simple benchmarks
5c4b325 [Sebastian Bank] API: add DataFrame.nunique() and DataFrameGroupBy.nunique()
51e32d0