API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4969,6 +4969,38 @@ def f(x): | |
|
||
return Series(result, index=labels) | ||
|
||
def nunique(self, axis=0, dropna=True): | ||
""" | ||
Return Series with number of distinct observations over requested | ||
axis. | ||
|
||
.. versionadded:: 0.20.0 | ||
|
||
Parameters | ||
---------- | ||
axis : {0 or 'index', 1 or 'columns'}, default 0 | ||
0 or 'index' for row-wise, 1 or 'columns' for column-wise | ||
dropna : boolean, default True | ||
Don't include NaN in the counts. | ||
|
||
Returns | ||
Review comment: can you add an Examples section? |
||
------- | ||
nunique : Series | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]}) | ||
>>> df.nunique() | ||
A 3 | ||
B 1 | ||
|
||
>>> df.nunique(axis=1) | ||
0 1 | ||
1 2 | ||
2 2 | ||
""" | ||
return self.apply(Series.nunique, axis=axis, dropna=dropna) | ||
|
||
def idxmin(self, axis=0, skipna=True): | ||
""" | ||
Return index of first occurrence of minimum over requested axis. | ||
|
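The hunk above implements `DataFrame.nunique` by applying `Series.nunique` along the requested axis. The following is a minimal sketch of that equivalence, assuming a pandas version that ships `DataFrame.nunique` (0.20.0 or later, per the docstring). The sample frame is illustrative and not part of the patch.

```python
import numpy as np
import pandas as pd

# Illustrative frame; column C contains one missing value.
df = pd.DataFrame({'A': [1, 1, 1],
                   'B': [1, 2, 3],
                   'C': [1, np.nan, 3]})

# Built-in method: one distinct-value count per column.
builtin = df.nunique()

# The same computation spelled out the way the patch implements it.
via_apply = df.apply(pd.Series.nunique, axis=0, dropna=True)

# With dropna=False, NaN is counted as a value of its own.
with_nan = df.nunique(dropna=False)

print(builtin.to_dict())
print(with_nan.to_dict())
```

The only difference between `builtin` and `with_nan` should be column `C`, whose missing value is counted only when `dropna=False`.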
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3899,6 +3899,54 @@ def count(self): | |
|
||
return self._wrap_agged_blocks(data.items, list(blk)) | ||
|
||
def nunique(self, dropna=True): | ||
""" | ||
Return DataFrame with number of distinct observations per group for | ||
each column. | ||
|
||
.. versionadded:: 0.20.0 | ||
|
||
Parameters | ||
---------- | ||
dropna : boolean, default True | ||
Don't include NaN in the counts. | ||
|
||
Returns | ||
------- | ||
nunique: DataFrame | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', | ||
... 'ham', 'ham'], | ||
... 'value1': [1, 5, 5, 2, 5, 5], | ||
... 'value2': list('abbaxy')}) | ||
>>> df | ||
id value1 value2 | ||
0 spam 1 a | ||
1 egg 5 b | ||
2 egg 5 b | ||
3 spam 2 a | ||
4 ham 5 x | ||
5 ham 5 y | ||
|
||
>>> df.groupby('id').nunique() | ||
id value1 value2 | ||
id | ||
egg 1 1 1 | ||
ham 1 1 2 | ||
spam 1 2 1 | ||
|
||
# check for rows with the same id but conflicting values | ||
>>> df.groupby('id').filter(lambda g: (g.nunique() > 1).any()) | ||
Review comment: This is a nice example (although more of the filter method and DataFrame.nunique), but can you add one sentence introducing it, explaining what we are going to do in the next example? |
||
id value1 value2 | ||
0 spam 1 a | ||
3 spam 2 a | ||
4 ham 5 x | ||
5 ham 5 y | ||
""" | ||
Review comment: can you add Returns and Examples sections? |
||
return self.apply(lambda g: g.apply(Series.nunique, dropna=dropna)) | ||
|
||
|
||
from pandas.tools.plotting import boxplot_frame_groupby # noqa | ||
DataFrameGroupBy.boxplot = boxplot_frame_groupby | ||
|
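The docstring examples in this hunk can be run as a standalone script. This is a sketch rather than the patch's own test. It selects the value columns explicitly, on the assumption that whether the grouping column `id` appears in the result can differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5],
                   'value2': list('abbaxy')})

# Distinct values per group for each selected column.
counts = df.groupby('id')[['value1', 'value2']].nunique()

# Keep only the rows of groups in which at least one column has more
# than one distinct value, i.e. same id but conflicting values.
conflicts = df.groupby('id').filter(lambda g: (g.nunique() > 1).any())

print(counts)
print(conflicts)
```

`filter` returns the matching rows in their original order, so the `egg` rows (which agree on every column) are the only ones dropped.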
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ | |
MultiIndex, date_range, Timestamp) | ||
import pandas as pd | ||
import pandas.core.nanops as nanops | ||
import pandas.core.algorithms as algorithms | ||
import pandas.formats.printing as printing | ||
|
||
import pandas.util.testing as tm | ||
|
@@ -410,6 +411,21 @@ def test_count(self): | |
expected = Series(0, index=[]) | ||
tm.assert_series_equal(result, expected) | ||
|
||
def test_nunique(self): | ||
f = lambda s: len(algorithms.unique1d(s.dropna())) | ||
self._check_stat_op('nunique', f, has_skipna=False, | ||
check_dtype=False, check_dates=True) | ||
|
||
Review comment: Can you add tests for the dropna and axis keywords as well? |
||
df = DataFrame({'A': [1, 1, 1], | ||
'B': [1, 2, 3], | ||
'C': [1, np.nan, 3]}) | ||
tm.assert_series_equal(df.nunique(), Series({'A': 1, 'B': 3, 'C': 2})) | ||
tm.assert_series_equal(df.nunique(dropna=False), | ||
Series({'A': 1, 'B': 3, 'C': 3})) | ||
tm.assert_series_equal(df.nunique(axis=1), Series({0: 1, 1: 2, 2: 2})) | ||
tm.assert_series_equal(df.nunique(axis=1, dropna=False), | ||
Series({0: 1, 1: 3, 2: 2})) | ||
|
||
def test_sum(self): | ||
self._check_stat_op('sum', np.sum, has_numeric_only=True) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2800,6 +2800,34 @@ def test_count_cross_type(self): # GH8169 | |
result = df.groupby(['c', 'd']).count() | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_nunique(self): | ||
df = DataFrame({ | ||
'A': list('abbacc'), | ||
'B': list('abxacc'), | ||
'C': list('abbacx'), | ||
}) | ||
|
||
expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]}) | ||
Review comment: test with both as_index=True and False |
||
result = df.groupby('A', as_index=False).nunique() | ||
tm.assert_frame_equal(result, expected) | ||
Review comment: also can you test with dropna=True and False |
||
|
||
# as_index | ||
expected.index = list('abc') | ||
expected.index.name = 'A' | ||
result = df.groupby('A').nunique() | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# with na | ||
result = df.replace({'x': None}).groupby('A').nunique(dropna=False) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# dropna | ||
expected = DataFrame({'A': [1] * 3, 'B': [1] * 3, 'C': [1] * 3}, | ||
index=list('abc')) | ||
expected.index.name = 'A' | ||
result = df.replace({'x': None}).groupby('A').nunique() | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_non_cython_api(self): | ||
|
||
# GH5610 | ||
|
@@ -5150,11 +5178,11 @@ def test_tab_completion(self): | |
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max', | ||
'mean', 'median', 'min', 'name', 'ngroups', 'nth', 'ohlc', 'plot', | ||
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count', | ||
'head', 'irow', 'describe', 'cummax', 'quantile', 'rank', | ||
'cumprod', 'tail', 'resample', 'cummin', 'fillna', 'cumsum', | ||
'cumcount', 'all', 'shift', 'skew', 'bfill', 'ffill', 'take', | ||
'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith', 'cov', | ||
'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin', | ||
'nunique', 'head', 'irow', 'describe', 'cummax', 'quantile', | ||
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna', | ||
'cumsum', 'cumcount', 'all', 'shift', 'skew', 'bfill', 'ffill', | ||
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith', | ||
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin', | ||
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding']) | ||
self.assertEqual(results, expected) | ||
|
||
|
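The `dropna` cases exercised by `test_nunique` above can be reproduced outside the test suite. The sketch below puts `np.nan` directly into the frame instead of calling `replace({'x': None})`, and selects the value columns explicitly. Both choices are assumptions made for portability across pandas versions and are not what the test itself does.

```python
import numpy as np
import pandas as pd

# Same layout as the test frame, with NaN where the test has 'x'.
df = pd.DataFrame({'A': list('abbacc'),
                   'B': ['a', 'b', np.nan, 'a', 'c', 'c'],
                   'C': ['a', 'b', 'b', 'a', 'c', np.nan]})

grouped = df.groupby('A')[['B', 'C']]

# Default dropna=True: missing values are ignored.
dropped = grouped.nunique()

# dropna=False: a missing value counts as one more distinct value.
kept = grouped.nunique(dropna=False)

print(dropped)
print(kept)
```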
row-wise / column-wise is probably used a lot in other methods here as well (can you check?), but I find it a bit confusing in this case. I would interpret 'column-wise' as "distinct observations for each column", and that is not correct here, since that is what the default axis=0/'index' does. So axis=1 is more 'over/along the columns'.
But English is not my mother tongue. What do you think?
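To make the two readings of the `axis` wording concrete, here is a small sketch using the frame from the docstring example. It only shows which labels index the result for each `axis` value and takes no position on how the docstring should be worded.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# axis=0 / 'index': one count per column, result indexed by column labels.
per_column = df.nunique(axis=0)

# axis=1 / 'columns': one count per row, result indexed by row labels.
per_row = df.nunique(axis=1)

print(per_column)
print(per_row)
```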
As is, the wording is consistent with all the other methods such as `count()`. Wouldn't it be better to have a dedicated PR for that, in case all the `axis` docstrings are to be improved?
@xflr6 yes, we could have a PR for improving these kinds of things in general (we already use shared_docs for this type of thing anyhow), so these are pretty general. Here it is not, as `.nunique` has separate doc-strings for Series/DataFrame, which is why @jorisvandenbossche is asking. OK with actually fixing that (so this would hook into our more general doc-string system).
Actually, `Series.nunique` is defined in `pandas.core.base` (so it's the same for `Index`). But these could easily hook into the same doc-string system, as I said above.
Note that it is not fully consistent throughout frame.py as well. There are also some methods that explain this differently (e.g. `apply`, `mode`. `corrwith` actually switches the row and column-wise: "0 or 'index' to compute column-wise, 1 or 'columns' for row-wise").
The thing is also that the explanation can be different depending on what the function does, I think.
@xflr6 can you make the doc-string references to the axis consistent with other methods, e.g. following the example from another method?
I think you can simply drop the 2nd line of the axis param.