Deprecate groupby() squeeze option #32380

Closed
dechamps opened this issue Feb 29, 2020 · 12 comments · Fixed by #33218
Labels
Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@dechamps

Code Sample

import pandas as pd
print(pd.DataFrame([{
    'A': 1,
    'B': 1,
}, {
    'A': 2,
    'B': 2,
}, {
    'A': 2,
    'B': 3,
}]).groupby('A', squeeze=True).count())

Problem description

I expected .groupby(squeeze=True) to, well, squeeze, and count() to return a Series. Instead squeeze=True doesn't seem to do anything, and count() returns a DataFrame.

A workaround is to write .groupby('A').count().squeeze(), which does work.
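A minimal sketch of that workaround, using the same data as the code sample. The variable names are illustrative, and passing "columns" to squeeze() is an assumption added here so that only the column axis is collapsed and the result stays a Series even when there is a single group.

```python
import pandas as pd

# Same data as the code sample in this report.
df = pd.DataFrame({"A": [1, 2, 2], "B": [1, 2, 3]})

# count() on the groupby returns a DataFrame with the single column B.
counts = df.groupby("A").count()

# Collapse the one-column DataFrame into a Series named B.
squeezed = counts.squeeze("columns")

print(type(squeezed).__name__)
print(squeezed)
```

Calling squeeze() with no argument also works for this data, but it collapses every length-1 axis, so a 1x1 result would become a scalar instead of a Series.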

Expected Output

A
1    1
2    2
Name: B, dtype: int64

Actual output

   B
A
1  1
2  2

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-4-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 18.1
setuptools : 44.0.0
Cython : None
pytest : 4.6.9
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.9
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@MarcoGorelli MarcoGorelli added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Feb 29, 2020
@TomAugspurger
Contributor

It doesn't appear to be documented, and I'm not familiar with it. @dechamps are you interested in walking through the code to see what it's intended for?

@jreback
Contributor

jreback commented Mar 2, 2020

we should deprecate this option - I don't think the original use cases that I added are worth it

@dechamps
Author

dechamps commented Mar 2, 2020

It doesn't appear to be documented

Well the groupby() squeeze parameter does have documentation, it seems.

Personally I don't care much about the parameter - it's simple enough to just call squeeze() later anyway. However it is of course confusing if the parameter is there but does nothing.

@WillAyd
Member

WillAyd commented Mar 3, 2020

+1 to deprecate as well; seems out of place as an argument here

@WillAyd WillAyd added the Deprecate Functionality to remove in pandas label Mar 3, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone Mar 3, 2020
@WillAyd WillAyd changed the title from "The groupby() squeeze option doesn't seem to do anything" to "Deprecate groupby() squeeze option" Mar 3, 2020
@MarcoGorelli
Member

However it is of course confusing if the parameter is there but does nothing.

Here's an example of where it does something

In [2]: from pandas import DataFrame

In [3]: df3 = DataFrame(
   ...:     [
   ...:         {"val1": 1, "val2": 20, "val3": 1},
   ...:         {"val1": 1, "val2": 20, "val3": 2},
   ...:         {"val1": 1, "val2": 20, "val3": 3},
   ...:         {"val1": 1, "val2": 20, "val3": 4},
   ...:     ]
   ...: )

In [4]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=True).apply(sum)
Out[4]:
val3    10
dtype: int64

In [5]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=False).apply(sum)
Out[5]:
           val3
val1 val2
1    20      10

Anyway, @dechamps , are you interested in submitting a pull request to deprecate it? If so, see https://pandas.pydata.org/docs/development/contributing.html - else, I'd happily take it and see the groupby logic simplified :)

@dechamps
Author

Honestly I'd prefer you do it - I have zero familiarity with pandas development workflows.

@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 22, 2020
@mlyons-tcc

mlyons-tcc commented Sep 29, 2020

Here's a rant: You deprecated squeeze in 1.1.0 in violation of your own deprecation policy introduced in 1.0.0.

edit: Perhaps it was intended to throw a DeprecationWarning instead of the FutureWarning that was used. FutureWarning indicates that it has already been deprecated and the user is still using it.

@jreback
Contributor

jreback commented Sep 30, 2020

we can and will deprecate things in almost every version

the policy is not to remove deprecated features until the next major version, e.g. 2.0

we don't use DeprecationWarning because it's not shown by default and IMHO just useless

FutureWarning is visible

you don't have to change your code and can continue to use it if you would like

@mlyons-tcc

mlyons-tcc commented Sep 30, 2020

Great to hear that it is not going away until 2.0! I very much appreciate the deprecation policy you provided, and I took the FutureWarning to mean something else since I expected a DeprecationWarning.

Thanks for providing the rationale of FutureWarning.

Arguments Against Using FutureWarning for Deprecations
As I mentioned, I found the usage of FutureWarning to be ambiguous in terms of what the intentions were. It seemed to me that the feature could go away at any point in time and that a major release post deprecation had already happened. Or, more scary, that behavior was going to change, since there is "existing use of FutureWarning to warn about constructs that will remain valid code in the future, but will have different semantics" (PEP 565). Otherwise, wouldn't I get a DeprecationWarning?

I think the big cause of confusion is that Python changed its definition/recommendation of Deprecation and Future warnings in PEP 565, implemented in Python 3.7. Now instead of differentiating based on behavior, they are differentiating based on audience: "intended for other Python developers" as opposed to "intended for end users of applications that are written in Python". I think PEP 565 has provided clear guidance that the type of warning that pandas should be providing for deprecations should in fact be a DeprecationWarning.

With regards to visibility, as of 3.7, DeprecationWarnings are only visible by default if triggered from "__main__". That's great for hiding the warnings from application users. The recommendation is to use a test suite to make them visible. Warning visibility is also controllable in a number of ways, so I definitely would not consider DeprecationWarning useless; it's actually quite proper.
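As one illustration of controlling visibility, here is a small sketch using only the standard library warnings module. The function legacy_api and its message are hypothetical stand-ins for a library call that raises a DeprecationWarning; nothing here is taken from pandas itself.

```python
import warnings


def legacy_api():
    # Hypothetical library function that announces its own deprecation.
    warnings.warn("legacy_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# Inside this block DeprecationWarning is always reported, regardless of
# the interpreter's default filters, and captured instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    result = legacy_api()

seen = [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]
print(result, seen)
```

The same effect can be had process-wide with the -W command line option or the PYTHONWARNINGS environment variable, which is typically what a test suite configuration does.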

Argument in Support of FutureWarning
Unfortunately, none of the REPLs are making DeprecationWarnings generated from modules visible as far as I can tell. IPython even went so far as to say they want to hide deprecation warnings from modules because some of their dependencies produce a bunch :eyeroll:. Many in the scientific computing landscape depend solely on Jupyter so it would be unfortunate if they never get visibility into these warnings as they code/execute in a notebook environment. Because of that, as much as I truly believe these should be DeprecationWarnings, I see the importance for them to be FutureWarning if it is of most importance to surface these warnings to that community of users that are not first and foremost software devs.

Final Thought
Using FutureWarning does cause a problem in that someone using pandas to create applications cannot easily ignore pandas deprecation warnings intended for developers without also ignoring warnings intended for the user of the application. If/when Jupyter decides to start surfacing DeprecationWarnings by default, I think it would be a good time to change the type of warning generated in pandas.
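To make the difficulty concrete, here is a rough sketch of what an application author can do today, assuming they are willing to match on message text. Both functions and both messages are hypothetical; the filter relies only on the standard library warnings.filterwarnings.

```python
import warnings


def emit_library_warning():
    # Hypothetical stand-in for a library deprecation raised as FutureWarning.
    warnings.warn("The squeeze parameter is deprecated", FutureWarning, stacklevel=2)


def emit_user_warning():
    # Hypothetical application warning meant for the end user.
    warnings.warn("Input file format will change", FutureWarning, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filters added later are checked first, so this "ignore" rule wins
    # over the blanket "always" rule for messages mentioning squeeze.
    warnings.filterwarnings("ignore", message=r".*squeeze.*", category=FutureWarning)
    emit_library_warning()
    emit_user_warning()

remaining = [str(w.message) for w in caught]
print(remaining)
```

Filtering by category alone would silence both warnings, which is the problem described above; matching on message (or on the module argument) is more selective but brittle if the wording changes.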

@jreback
Contributor

jreback commented Oct 1, 2020

we are unlikely to change the warning type as visibility is most important

@AlexanderNenninger

Hi,

I need exactly this behavior when applying functions to the GroupBy. Is there guidance on alternatives?

However it is of course confusing if the parameter is there but does nothing.

Here's an example of where it does something

In [2]: from pandas import DataFrame

In [3]: df3 = DataFrame(
   ...:     [
   ...:         {"val1": 1, "val2": 20, "val3": 1},
   ...:         {"val1": 1, "val2": 20, "val3": 2},
   ...:         {"val1": 1, "val2": 20, "val3": 3},
   ...:         {"val1": 1, "val2": 20, "val3": 4},
   ...:     ]
   ...: )

In [4]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=True).apply(sum)
Out[4]:
val3    10
dtype: int64

In [5]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=False).apply(sum)
Out[5]:
           val3
val1 val2
1    20      10

Anyway, @dechamps , are you interested in submitting a pull request to deprecate it? If so, see https://pandas.pydata.org/docs/development/contributing.html - else, I'd happily take it and see the groupby logic simplified :)

freshtuo added a commit to freshtuo/GRCF_daily_work that referenced this issue Mar 27, 2023
…re. Besides, the current code is cause error when working with a newer pandas version 1.4.4. So using a work around as recommended in pandas-dev/pandas#32380
@brandonrwin

brandonrwin commented Aug 1, 2023

Hi,

I need exactly this behavior when applying functions to the GroupBy. Is there guidance on alternatives?

I believe it's

df3.set_index(['val1', 'val2']).groupby(['val1', 'val2']).apply(sum).squeeze()

DataFrameGroupBy.apply() returns a DataFrame, and you use DataFrame.squeeze() on that.
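A small sketch of this suggestion on the df3 data from earlier in the thread. As an assumption for portability across pandas versions, it swaps .apply(sum) for the built-in groupby .sum() and groups by index level; the variable names are illustrative. It also shows that the axis passed to squeeze() matters when the result has a single row and a single column.

```python
import pandas as pd

df3 = pd.DataFrame(
    [
        {"val1": 1, "val2": 20, "val3": 1},
        {"val1": 1, "val2": 20, "val3": 2},
        {"val1": 1, "val2": 20, "val3": 3},
        {"val1": 1, "val2": 20, "val3": 4},
    ]
)

# One group, one value column: the aggregated result is a 1x1 DataFrame.
summed = df3.set_index(["val1", "val2"]).groupby(level=["val1", "val2"]).sum()

# Squeezing every length-1 axis collapses the 1x1 frame to a scalar.
as_scalar = summed.squeeze()

# Squeezing only the row axis keeps a Series indexed by column name,
# which is closer to the shape squeeze=True produced.
as_series = summed.squeeze(axis="index")

print(as_scalar)
print(as_series)
```

With more than one group the row axis no longer has length 1, so squeeze(axis="index") returns the DataFrame unchanged.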
