
API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376

Closed
wants to merge 4 commits into from

Conversation

@xflr6 (Contributor) commented Oct 7, 2016

@codecov-io commented Oct 7, 2016

Current coverage is 84.77% (diff: 75.00%)

Merging #14376 into master will decrease coverage by <.01%

@@             master     #14376   diff @@
==========================================
  Files           145        145          
  Lines         51090      51133    +43   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43315      43347    +32   
- Misses         7775       7786    +11   
  Partials          0          0          

Powered by Codecov. Last update 3e3434b...a0558e7

@jreback (Contributor) commented Oct 7, 2016

These are not going to be very efficient. Please add some benchmarks.

@xflr6 (Contributor, Author) commented Oct 31, 2016

I could not get the benchmarks to run with asv under win32, so for now there is something that basically follows what is already there for count().

There is no 'fast-path' in the implementation of nunique() for describe(include='all') either, right?
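For reference, a minimal sketch of what the proposed DataFrame.nunique() computes: column-wise distinct counts, equivalent to applying Series.nunique to each column (toy data, not from the PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, np.nan]})

# Column-wise distinct counts (default axis=0); NaN is excluded by default.
print(df.nunique())  # -> A: 3, B: 1

# Equivalent to applying Series.nunique to each column:
print(df.apply(pd.Series.nunique))
```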

@jreback (Contributor) commented Dec 16, 2016

Can you rebase and post some asv numbers?

@xflr6 (Contributor, Author) commented Dec 18, 2016

OK, rebased. None of the proposed methods to run asv on this seem to work on my win machine, e.g.:

(C:\Users\User\Miniconda2) C:\Users\User\Desktop\pandas\asv_bench>asv run 4c13e..9ab44 -b _nunique
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash 9ab44eec:
[  0.00%] ·· Building for conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Error running C:\Users\User\Desktop\pandas\asv_bench\env\eebfa40b3935b6d282712e13c756274b\python.exe setup.py build
              STDOUT -------->
[...]
              build\temp.win-amd64-2.7\Release\pandas\src\datetime\np_datetime_strings.o:np_datetime_strings.c:(.text+0x12): undefined reference to `localtime_r'
              collect2.exe: error: ld returned 1 exit status
              error: command 'C:\\Users\\User\\Miniconda2\\Scripts\\gcc.bat' failed with exit status 1

[  0.00%] ·· Skipping conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 33.33%] ··· Benchmark groupby.groupby_nunique.time_groupby_nunique skipped
[ 66.67%] ··· Benchmark groupby.GroupBySuite.time_nunique skipped
[100.00%] ··· Benchmark frame_methods.frame_nunique.time_frame_nunique skipped

----------
dropna : boolean, default True
Don't include NaN in the counts.
"""
Contributor: Can you add Returns and Examples sections?

dropna : boolean, default True
Don't include NaN in the counts.

Returns
Contributor: Can you add an Examples section?

'B': list('abxacc'),
'C': list('abbacx'),
})
expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
Contributor: Test with both as_index=True and as_index=False.

})
expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
result = df.groupby('A', as_index=False).nunique()
tm.assert_frame_equal(result, expected)
Contributor: Also, can you test with both dropna=True and dropna=False?
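A sketch of what the requested dropna tests would cover (toy data; the PR's actual test code differs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('xxyyzz'),
                   'B': [1, np.nan, 2, 2, 3, 3]})

# dropna=True (the default): NaN is not counted as a distinct value.
print(df.groupby('A')['B'].nunique())              # x: 1, y: 1, z: 1

# dropna=False: NaN counts as a value of its own.
print(df.groupby('A')['B'].nunique(dropna=False))  # x: 2, y: 1, z: 1
```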

@jreback (Contributor) commented Dec 20, 2016

@xflr6 ok lgtm. just some added tests / docs. ping on green.

don't worry about the benchmarks.

@xflr6 (Contributor, Author) commented Dec 28, 2016

Thanks, extended the docs and tests, CI passed.

@jorisvandenbossche (Member) left a review:

@xflr6 Nice enhancement! I added some more comments.


Examples
--------
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
Member: DataFrame -> pd.DataFrame

Author: Changed (note that there are still some other example sections with bare DataFrame).

Member: Yep, but we try to hold the standards for new code a bit higher :-)


Examples
--------
>>> df = DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
Member: DataFrame -> pd.DataFrame

ham 1 1 2
spam 1 2 1

>>> df.groupby('id').filter(lambda g: (g.nunique() > 1).any())
Member: This is a nice example (although more of the filter method and DataFrame.nunique), but can you add one sentence introducing it, explaining what the example is going to do?
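The snippet above is only a fragment; a self-contained version of the filter example might look like this (the value1/value2 columns are assumed here, chosen to reproduce the counts shown above):

```python
import pandas as pd

df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5],   # assumed column
                   'value2': list('abbaxy')})      # assumed column

# Distinct counts per column within each group:
print(df.groupby('id').nunique())

# Keep only the groups where some column has more than one distinct value,
# i.e. drop the 'egg' rows, which are exact duplicates of each other:
print(df.groupby('id').filter(lambda g: (g.nunique() > 1).any()))
```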

"""
from functools import partial
func = partial(Series.nunique, dropna=dropna)
return self.apply(lambda g: g.apply(func))
Member: Is the use of functools.partial actually needed here? An extra keyword like dropna passed to apply will normally be passed through to the function.

Author: TIL about kwargs-passing in apply(); simplified both nunique methods with that.
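The kwargs-passing mentioned here can be illustrated (toy data, for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan], 'B': [1, 1]})

# Extra keywords given to DataFrame.apply are forwarded to the applied
# function, so no functools.partial wrapper is needed:
print(df.apply(pd.Series.nunique, dropna=False))  # A: 2, B: 1
```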

f = lambda s: len(algorithms.unique1d(s.dropna()))
self._check_stat_op('nunique', f, has_skipna=False,
check_dtype=False, check_dates=True)

Member: Can you add tests for the dropna and axis keywords as well?
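A sketch of what tests for the axis and dropna keywords might exercise (toy data, not the PR's test code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, np.nan], 'B': [1, 2, 2]})

# axis=1 counts distinct values along each row (NaN still excluded):
print(df.nunique(axis=1))        # rows: 1, 2, 1

# dropna=False counts NaN as a distinct value:
print(df.nunique(dropna=False))  # A: 2, B: 2
```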

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
0 or 'index' for row-wise, 1 or 'columns' for column-wise
Member: row-wise / column-wise is probably used in a lot of the other methods here as well (can you check?), but I find it a bit confusing in this case. I would interpret 'column-wise' as "distinct observations for each column", and that is not correct, since that is the default of axis=0/'index'. So axis=1 is more 'over/along the columns'. But English is not my mother tongue; what do you think?

Author: As is, the wording is consistent with the other methods such as count(). Wouldn't it be better to have a dedicated PR for that, in case all the axis docstrings are to be improved?

Contributor: @xflr6 Yes, we could have a PR for improving these kinds of things in general (we already use shared_docs for this type of thing anyhow), so these are pretty general. That is not the case here, as .nunique has separate docstrings for Series/DataFrame, which is why @jorisvandenbossche is asking. OK with actually fixing that (so this would hook into our more general docstring system).

Contributor: Actually, Series.nunique is defined in pandas.core.base (so it is the same for Index). But these could easily hook into the same docstring system, as I said above.

Member: Note that it is not fully consistent throughout frame.py either. Some methods explain this differently (e.g. apply, mode; corrwith actually switches row- and column-wise: "0 or 'index' to compute column-wise, 1 or 'columns' for row-wise"). The thing is also that the explanation can differ depending on what the function does, I think.

Contributor: @xflr6 Can you make the docstring references to the axis consistent with the other methods? E.g., an example from another method:

        axis : {0 or 'index', 1 or 'columns'}, default 0
            Sort index/rows versus columns

I think you can simply drop the second line of the axis param.

@jorisvandenbossche (Member) commented:

@xflr6 As @jreback said, it is no problem that you did not get the benchmarks working for this PR. But, just to try to solve it: something seems to go wrong while building the environment. Can you build pandas master in a conda env yourself?

@jreback jreback added this to the 0.20.0 milestone Jan 2, 2017
@jreback (Contributor) commented Jan 2, 2017

LGTM. @jorisvandenbossche, let me know if you have any final comments.

@jreback jreback closed this in a1b6587 Jan 23, 2017
@jreback (Contributor) commented Jan 23, 2017

thanks @xflr6

@jorisvandenbossche (Member) commented:

One remaining remark is that I think row-wise / column-wise is a wrong explanation in this case.

@jreback (Contributor) commented Jan 23, 2017

@jorisvandenbossche I took that out on the merge, FYI

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#14336

Author: Sebastian Bank <sebastian.bank@uni-leipzig.de>

Closes pandas-dev#14376 from xflr6/nunique and squashes the following commits:

a0558e7 [Sebastian Bank] use apply()-kwargs instead of partial, more tests, better examples
c8d3ac4 [Sebastian Bank] extend docs and tests
fd0f22d [Sebastian Bank] add simple benchmarks
5c4b325 [Sebastian Bank] API: add DataFrame.nunique() and DataFrameGroupBy.nunique()