API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376

Closed
wants to merge 4 commits
14 changes: 14 additions & 0 deletions asv_bench/benchmarks/frame_methods.py
@@ -433,6 +433,20 @@ def time_frame_from_records_generator_nrows(self):



#-----------------------------------------------------------------------------
# nunique

class frame_nunique(object):

def setup(self):
self.data = np.random.randn(10000, 1000)
self.df = DataFrame(self.data)

def time_frame_nunique(self):
self.df.nunique()



#-----------------------------------------------------------------------------
# duplicated

16 changes: 16 additions & 0 deletions asv_bench/benchmarks/groupby.py
@@ -251,6 +251,22 @@ def time_groupby_int_count(self):
self.df.groupby(['key1', 'key2']).count()


#----------------------------------------------------------------------
# nunique() speed

class groupby_nunique(object):

def setup(self):
self.n = 10000
self.df = DataFrame({'key1': randint(0, 500, size=self.n),
'key2': randint(0, 100, size=self.n),
'ints': randint(0, 1000, size=self.n),
'ints2': randint(0, 1000, size=self.n), })

def time_groupby_nunique(self):
self.df.groupby(['key1', 'key2']).nunique()


#----------------------------------------------------------------------
# group with different functions per column

3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -97,6 +97,9 @@ Other enhancements
^^^^^^^^^^^^^^^^^^
- ``Series.sort_index`` accepts parameters ``kind`` and ``na_position`` (:issue:`13589`, :issue:`14444`)

- ``DataFrame`` has gained a ``nunique()`` method as short-cut for ``.apply(lambda x: x.nunique())`` (counting the distinct values over an axis) (:issue:`14336`).
- New ``DataFrame.groupby().nunique()`` method as short-cut for ``.apply(lambda g: g.apply(lambda x: x.nunique()))`` (counting the distinct values for all columns within each group) (:issue:`14336`).

- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
- Multiple offset aliases with decimal points are now supported (e.g. '0.5min' is parsed as '30s') (:issue:`8419`)
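As a quick illustration of the ``nunique`` enhancement notes above, a minimal sketch (assuming a pandas version that includes this feature, i.e. 0.20.0+) checking that the shortcut and the longhand ``apply`` form it replaces agree:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# new shortcut vs. the apply form it replaces
shortcut = df.nunique()
longhand = df.apply(lambda x: x.nunique())
print(shortcut.equals(longhand))  # the two Series are identical
```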

32 changes: 32 additions & 0 deletions pandas/core/frame.py
@@ -4969,6 +4969,38 @@ def f(x):

return Series(result, index=labels)

def nunique(self, axis=0, dropna=True):
"""
Return Series with number of distinct observations over requested
axis.

.. versionadded:: 0.20.0

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
0 or 'index' for row-wise, 1 or 'columns' for column-wise
Member: row-wise / column-wise is probably used a lot in other methods here as well (can you check?), but I find it in this case a bit confusing, as I would interpret 'column-wise' as "distinct observations for each column". And this is not correct, as that is the default of axis=0/'index'. So axis=1 is more 'over/along the columns'. But English is not my mother tongue. What do you think?

Contributor (author): As is, the wording is consistent with all the other methods such as count(). Wouldn't it be better to have a dedicated PR for that, in case all the axis docstrings are to be improved?

Contributor: @xflr6 yes, we could have a PR for improving these kinds of things in general (we already use shared_docs for this type of thing anyhow), so these are pretty general. Here it is not, as .nunique has separate doc-strings for Series/DataFrame, which is why @jorisvandenbossche is asking. OK with actually fixing that (so this would hook into our more general doc-string system).

Contributor: Actually, Series.nunique is defined in pandas.core.base (so it is the same for Index). But these could easily hook into the same doc-string system as I said above.

Member: Note that it is not fully consistent throughout frame.py either. Some methods explain this differently (e.g. apply, mode; corrwith actually switches row- and column-wise: "0 or 'index' to compute column-wise, 1 or 'columns' for row-wise"). The explanation can also differ depending on what the function does, I think.

Contributor: @xflr6 can you make the doc-string references to the axis consistent with other methods, e.g. this example from another method:

        axis : {0 or 'index', 1 or 'columns'}, default 0
            Sort index/rows versus columns

I think you can simply drop the 2nd line of the axis param.
dropna : boolean, default True
Don't include NaN in the counts.

Returns
-------
nunique : Series

Contributor: can you add an Examples section.

Examples
--------
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df.nunique()
A 3
B 1

>>> df.nunique(axis=1)
0 1
1 2
2 2
"""
return self.apply(Series.nunique, axis=axis, dropna=dropna)
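A sketch exercising both keywords of the new ``DataFrame.nunique`` (the expected counts follow from the NaN in column 'C'; this mirrors the tests added further down in this PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1],
                   'B': [1, 2, 3],
                   'C': [1, np.nan, 3]})

# axis=0 (default): distinct values per column, NaN dropped
per_column = df.nunique()            # A: 1, B: 3, C: 2
# dropna=False: NaN counts as its own value
with_nan = df.nunique(dropna=False)  # A: 1, B: 3, C: 3
# axis=1: distinct values per row
per_row = df.nunique(axis=1)         # 1, 2, 2
```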

def idxmin(self, axis=0, skipna=True):
"""
Return index of first occurrence of minimum over requested axis.
48 changes: 48 additions & 0 deletions pandas/core/groupby.py
@@ -3899,6 +3899,54 @@ def count(self):

return self._wrap_agged_blocks(data.items, list(blk))

def nunique(self, dropna=True):
"""
Return DataFrame with number of distinct observations per group for
each column.

.. versionadded:: 0.20.0

Parameters
----------
dropna : boolean, default True
Don't include NaN in the counts.

Returns
-------
nunique : DataFrame

Examples
--------
>>> df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam',
... 'ham', 'ham'],
... 'value1': [1, 5, 5, 2, 5, 5],
... 'value2': list('abbaxy')})
>>> df
id value1 value2
0 spam 1 a
1 egg 5 b
2 egg 5 b
3 spam 2 a
4 ham 5 x
5 ham 5 y

>>> df.groupby('id').nunique()
id value1 value2
id
egg 1 1 1
ham 1 1 2
spam 1 2 1

# check for rows with the same id but conflicting values
>>> df.groupby('id').filter(lambda g: (g.nunique() > 1).any())
Member: This is a nice example (although more of the filter method and DataFrame.nunique), but can you add one sentence introducing it? (explaining what we are going to do in the next example)

id value1 value2
0 spam 1 a
3 spam 2 a
4 ham 5 x
5 ham 5 y
"""
Contributor: can you add Returns and Examples sections.

return self.apply(lambda g: g.apply(Series.nunique, dropna=dropna))
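The one-liner above wraps the nested ``apply`` shown in the whatsnew entry; a sketch checking the two agree on the PR's own example data. The value columns are selected explicitly here, since whether the grouping column itself appears in the result has varied across pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5],
                   'value2': list('abbaxy')})

cols = ['value1', 'value2']
result = df.groupby('id')[cols].nunique()
# equivalent longhand form the method wraps
longhand = df.groupby('id')[cols].apply(lambda g: g.apply(lambda x: x.nunique()))
print(result.equals(longhand))
```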


from pandas.tools.plotting import boxplot_frame_groupby # noqa
DataFrameGroupBy.boxplot = boxplot_frame_groupby
16 changes: 16 additions & 0 deletions pandas/tests/frame/test_analytics.py
@@ -16,6 +16,7 @@
MultiIndex, date_range, Timestamp)
import pandas as pd
import pandas.core.nanops as nanops
import pandas.core.algorithms as algorithms
import pandas.formats.printing as printing

import pandas.util.testing as tm
@@ -410,6 +411,21 @@ def test_count(self):
expected = Series(0, index=[])
tm.assert_series_equal(result, expected)

def test_nunique(self):
f = lambda s: len(algorithms.unique1d(s.dropna()))
self._check_stat_op('nunique', f, has_skipna=False,
check_dtype=False, check_dates=True)

Member: Can you add tests for the dropna and axis keywords as well?

df = DataFrame({'A': [1, 1, 1],
'B': [1, 2, 3],
'C': [1, np.nan, 3]})
tm.assert_series_equal(df.nunique(), Series({'A': 1, 'B': 3, 'C': 2}))
tm.assert_series_equal(df.nunique(dropna=False),
Series({'A': 1, 'B': 3, 'C': 3}))
tm.assert_series_equal(df.nunique(axis=1), Series({0: 1, 1: 2, 2: 2}))
tm.assert_series_equal(df.nunique(axis=1, dropna=False),
Series({0: 1, 1: 3, 2: 2}))
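The reference function ``f`` in the test above relies on the internal ``algorithms.unique1d``; an equivalent standalone sketch using only public API (``pd.unique`` stands in for the internal helper) behaves the same:

```python
import numpy as np
import pandas as pd

def nunique_ref(s, dropna=True):
    # count distinct values, optionally ignoring missing ones
    vals = s.dropna() if dropna else s
    return len(pd.unique(vals))

s = pd.Series([1, 1, np.nan, 3])
print(nunique_ref(s), s.nunique())                            # both 2
print(nunique_ref(s, dropna=False), s.nunique(dropna=False))  # both 3
```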

def test_sum(self):
self._check_stat_op('sum', np.sum, has_numeric_only=True)

38 changes: 33 additions & 5 deletions pandas/tests/groupby/test_groupby.py
@@ -2800,6 +2800,34 @@ def test_count_cross_type(self):  # GH8169
result = df.groupby(['c', 'd']).count()
tm.assert_frame_equal(result, expected)

def test_nunique(self):
df = DataFrame({
'A': list('abbacc'),
'B': list('abxacc'),
'C': list('abbacx'),
})

expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
Contributor: test with both as_index=True and False

result = df.groupby('A', as_index=False).nunique()
tm.assert_frame_equal(result, expected)
Contributor: also can you test with dropna=True and False


# as_index
expected.index = list('abc')
expected.index.name = 'A'
result = df.groupby('A').nunique()
tm.assert_frame_equal(result, expected)

# with na
result = df.replace({'x': None}).groupby('A').nunique(dropna=False)
tm.assert_frame_equal(result, expected)

# dropna
expected = DataFrame({'A': [1] * 3, 'B': [1] * 3, 'C': [1] * 3},
index=list('abc'))
expected.index.name = 'A'
result = df.replace({'x': None}).groupby('A').nunique()
tm.assert_frame_equal(result, expected)
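The dropna interplay tested above can be reproduced standalone: replacing 'x' with None introduces a missing value, which dropna=False counts as one distinct value while the default excludes it (columns are selected explicitly, since whether the grouping key shows up in the output has varied across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'A': list('abbacc'),
                   'B': list('abxacc'),
                   'C': list('abbacx')})
na_df = df.replace({'x': None})

# default dropna=True: the None is ignored, every group has one distinct value
print(na_df.groupby('A')[['B', 'C']].nunique())
# dropna=False: None counts, restoring the original per-group counts
print(na_df.groupby('A')[['B', 'C']].nunique(dropna=False))
```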

def test_non_cython_api(self):

# GH5610
@@ -5150,11 +5178,11 @@ def test_tab_completion(self):
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
'mean', 'median', 'min', 'name', 'ngroups', 'nth', 'ohlc', 'plot',
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
'head', 'irow', 'describe', 'cummax', 'quantile', 'rank',
'cumprod', 'tail', 'resample', 'cummin', 'fillna', 'cumsum',
'cumcount', 'all', 'shift', 'skew', 'bfill', 'ffill', 'take',
'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith', 'cov',
'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
'nunique', 'head', 'irow', 'describe', 'cummax', 'quantile',
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
'cumsum', 'cumcount', 'all', 'shift', 'skew', 'bfill', 'ffill',
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding'])
self.assertEqual(results, expected)
