Different return type when using groupby with nunique #21090

Closed
RaulPL opened this issue May 16, 2018 · 2 comments · Fixed by #34012

RaulPL commented May 16, 2018

Code Sample

I have the following code:

import pandas as pd
print(pd.__version__) # 0.22
df = pd.DataFrame(
    {'A': ['Jane', 'Jane', 'Charles', 'Charles'], 
     'B': ['red', 'blue', 'green', 'green']})

# here I would like to group by one of the columns (A in this case), and aggregate the other. 
# These two lines return a pandas DataFrame
df.groupby('A', as_index=False).agg({'B': pd.Series.count})  # pd.DataFrame
df.groupby('A', as_index=False).agg({'B': pd.Series.nunique})  # pd.DataFrame

# But when I do it in this way I don't know why I am getting a pandas Series in the last line
df.groupby('A', as_index=False).B.count()  # pd.DataFrame
df.groupby('A', as_index=False).B.nunique()  # pd.Series

Problem description

I get a pandas Series when aggregating with the "col.nunique()" notation and as_index set to False. The returned Series also drops the values of the grouped column.

Expected Output

I think that the last line of code should return a pandas DataFrame in order to be consistent.
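In the meantime, one possible workaround is to group with the default as_index=True and then call reset_index(), which gives a DataFrame that keeps the grouped column. This is only a sketch using the example data above; the variable name `result` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': ['Jane', 'Jane', 'Charles', 'Charles'],
     'B': ['red', 'blue', 'green', 'green']})

# Group with the default as_index=True, so the result is a Series
# indexed by the group keys, then move the keys back into a column.
result = df.groupby('A')['B'].nunique().reset_index()

print(type(result).__name__)
print(result)
```

This avoids relying on how as_index=False is handled by a particular aggregation, since reset_index() on a named Series always returns a DataFrame.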

I am happy to help with this issue if it's possible. I am not an expert, but I would like to contribute.

Thanks a lot, this is an awesome library =).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-41-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.1.0
pyarrow: 0.9.0
xarray: 0.10.3
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.4
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

WillAyd commented May 16, 2018

Interesting indeed. I think the pattern is that aggregation functions implemented internally in Cython (e.g. sum, count, min, max) work, but functions that go down the apply route do not. To illustrate further:

In [38]: df.groupby('A', as_index=False).B.min()  # Cythonized min func
Out[38]: 
         A      B
0  Charles  green
1     Jane   blue

In [39]: df.groupby('A', as_index=False).B.apply(min) # apply route
Out[39]: 
0    green
1     blue
dtype: object

Admittedly this might be tough for a first contribution, but if you want to give it a shot, the details of this implementation will be in pandas.core.groupby.groupby.py
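As a rough illustration of what "the apply route" means conceptually: a Python function is called once per group and the per-group results are combined afterwards. The sketch below is not the pandas implementation; `apply_per_group` is a hypothetical helper written only for this example, and it shows why the grouped column has to be re-attached explicitly when results are collected this way.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': ['Jane', 'Jane', 'Charles', 'Charles'],
     'B': ['red', 'blue', 'green', 'green']})


def apply_per_group(frame, key, column, func):
    """Hypothetical helper: call func on each group's values, one group at a time."""
    results = {}
    # Iterating over a grouped column yields (group key, Series of values) pairs.
    for group_key, values in frame.groupby(key)[column]:
        results[group_key] = func(values)
    # The combined result only knows the group keys as its index.
    return pd.Series(results, name=column)


per_group = apply_per_group(df, 'A', 'B', pd.Series.nunique)

# Re-attaching the grouped column is a separate, explicit step.
as_frame = per_group.rename_axis('A').reset_index()

print(per_group)
print(as_frame)
```

The point of the sketch is only that the function sees each group in isolation, so turning the keys back into a column has to happen in the combining step.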

RaulPL commented May 17, 2018

What do you mean by the apply route? I will start reading about it.
