DataFrame.nunique and Series.nunique not consistent when Empty #28202

Closed
ZaxR opened this issue Aug 28, 2019 · 3 comments · Fixed by #28213
ZaxR commented Aug 28, 2019

Code Sample, a copy-pastable example if possible

Example A:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
>>> assert df.nunique().tolist() == [df[col].nunique() for col in df.columns]
# Both equal [2, 2, 2]

Example B:

>>> df = pd.DataFrame(columns=['a', 'b', 'c'])
>>> df.nunique()
# Empty DataFrame
# Columns: [a, b, c]
# Index: []

>>> [df[col].nunique() for col in df.columns]
# [0, 0, 0]

Problem description

In Example A, where the DataFrame isn't empty, nunique is consistent between the DataFrame and Series approaches; however, when the DataFrame is empty (Example B), DataFrame.nunique returns an empty DataFrame, while the per-column Series approach returns 0 for each column.

Expected Output

I would expect df.nunique() to return 0 for each column, consistent with how a Series behaves; to my mind, an empty object by definition has 0 unique elements.
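
As a stopgap until this is fixed, a minimal workaround sketch (nunique_per_column is just a helper name I made up, not a pandas API): compute nunique column by column, which stays consistent whether or not the frame has any rows.

>>> import pandas as pd
>>> def nunique_per_column(df, dropna=True):
...     # Series.nunique returns 0 for an empty column, so the result has the
...     # same shape for empty and non-empty frames.
...     return pd.Series({col: df[col].nunique(dropna=dropna) for col in df.columns},
...                      dtype="int64")
>>> nunique_per_column(pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})).tolist()
# [2, 2, 2]
>>> nunique_per_column(pd.DataFrame(columns=["a", "b", "c"])).tolist()
# [0, 0, 0]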

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 18.1
setuptools : 41.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Thank you!

ZaxR commented Aug 28, 2019

I also tried this in 0.25.1 and the behavior is the same.

TomAugspurger added this to the Contributions Welcome milestone Aug 28, 2019
TomAugspurger commented

Agreed, the expected output should be

In [56]: pd.Series(0, index=df.columns)
Out[56]:
a    0
b    0
c    0
dtype: int64

dsaxton commented Aug 29, 2019

Looks like something funny is happening with apply (https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L7966):

In [35]: df.apply(Series.nunique, axis=0, dropna=True)
Out[35]:
Empty DataFrame
Columns: [a, b, c]
Index: []

In [36]: df.apply(Series.nunique, axis=0, dropna=False)
Out[36]:
Empty DataFrame
Columns: [a, b, c]
Index: []

In [37]: df.apply(Series.nunique, axis=0)
Out[37]:
a   NaN
b   NaN
c   NaN
dtype: float64

You get similar behavior if you try taking a sum for instance:

In [51]: df.apply(np.sum, axis=0, dropna=True)
Out[51]:
Empty DataFrame
Columns: [a, b, c]
Index: []

In [52]: np.sum(df["a"])
Out[52]: 0
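
Whatever the fix turns out to be, a consistency check along these lines could guard against a regression (just a sketch of the expected behavior using pandas.testing; not the actual test that will land with the fix):

import pandas as pd
import pandas.testing as tm

def check_nunique_consistency(df):
    # Frame-level nunique should match computing nunique column by column,
    # including for a frame with zero rows.
    expected = pd.Series({col: df[col].nunique() for col in df.columns}, dtype="int64")
    tm.assert_series_equal(df.nunique(), expected)

check_nunique_consistency(pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}))  # passes
check_nunique_consistency(pd.DataFrame(columns=["a", "b", "c"]))  # raises on 0.25.x per this issue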

jreback modified the milestones: Contributions Welcome, 1.0 Sep 19, 2019