BUG: Getting FutureWarning for Groupby.mean when using .pivot_table #49247

Sierra-MC · 2022-10-22T05:00:50Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({"C1": ["a", "b", "c"],
                   "C2": [1, 2, 3]})
table = pd.pivot_table(df, columns=['C2'])

Issue Description

Getting FutureWarning:

":1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function."

Expected Behavior

pivot_table is internally using DataFrameGroupBy.mean, but does not allow a user to pass a numeric_only argument as suggested in the FutureWarning

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.9.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-19041-Microsoft
Version : #1237-Microsoft Sat Sep 11 14:32:00 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.5
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2022.10.0
gcsfs : None
matplotlib : 3.6.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.2
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

The text was updated successfully, but these errors were encountered:

phofl · 2022-10-22T11:46:30Z

cc @rhshadrach we should probably specify numeric_only, correct?

rhshadrach · 2022-10-22T20:46:41Z

What's the expected output here? Currently on 1.5.x I get an empty DataFrame. Within the __internal_pivot_table function, the groupby is using the entire frame (i.e. C1 is included) and grouping on C2. Taking the mean of a string column fails, hence the warning.

If this is the desired behavior for pivot table - that is, this call should group on C2 and use all other columns to aggregate, then I think everything here is correct. Namely, this is going to fail in 2.0 instead of silently dropping columns, hence the warning. But the groupby call itself (in particular, grouping on C2) looks odd to me, though I'm not very familiar with pivot tables.

phofl · 2022-10-23T17:35:36Z

Not sure, but I think we should raise a more specific warning, if this is not an unintended side effect.

Edit: To keep compatibility, we should probably specify numeric_only so that behaviour does not change in 2.0. is this possible?

rhshadrach · 2022-10-23T18:21:40Z

Not sure, but I think we should raise a more specific warning, if this is not an unintended side effect.

Yea, agreed. I can take this up.

To keep compatibility, we should probably specify numeric_only so that behaviour does not change in 2.0. is this possible?

Assuming the intention of this op is "groupby by C2 and aggregate all other columns", I don't we should be specifying numeric_only=True internally (this is how I interpret your suggestion above - but let me know if I got it wrong). You run into issues with dtypes that are non-numeric but can still be aggregated. In particular, first or last with any dtype, but also sum with strings and and timedeltas. Specifying numeric_only=True, such columns would disappear on the user even though they can be aggregated, and I think that would be unexpected. Also, if my description is the operation is correct, then I think pandas should fail when it is asked to do something that it is not able to do.

One thing we could do here is add numeric_only to pivot_table (defaulting to False for consistency with the rest of the API). Not averse to this, but I'm not able to really reason about it without understanding more about pivot tables.

Sierra-MC · 2022-10-24T15:07:13Z

The example I provided was the most simple I could come up with to create the warning: a dataframe with a numeric and non-numeric column. It is not the best example to show how pivot_tables are supposed to work/what they are for. Because I'm grouping by the numeric column and the non-numeric column can't be grouped (can't take the mean of a non-numeric column) it should come back as an empty dataframe in this case. (Not sure if this is desired behavior without a warning to the user of why they've received an empty dataframe back, but that's another discussion). The issue is that the FutureWarning coming up can't be avoided by the user based on the suggested action (adding a numeric_only keyword). And the warning appears to be for a function the user is not entering (it's used within the pivot_table function).

The more realistic version of this case (and what is coming up in my code) would be when you have a larger dataframe with many numeric and non-numeric columns. In this case, when creating your pivot table, it is expected that pandas will ignore all non-numeric columns and output the pivot table with the numeric values only. Some examples of this are shown here: https://datagy.io/python-pivot-tables/ (the first example under "Creating a Pivot Table in Pandas" shows the columns with non-numeric data are simply not in the pivot table).

Hopefully, this helps clear up any confusion. Apologies for the unclear example.

rhshadrach · 2022-10-25T21:45:42Z

Hopefully, this helps clear up any confusion. Apologies for the unclear example.

Thanks! For the record, I don't think your example is unclear.

The more realistic version of this case (and what is coming up in my code) would be when you have a larger dataframe with many numeric and non-numeric columns. In this case, when creating your pivot table, it is expected that pandas will ignore all non-numeric columns and output the pivot table with the numeric values only.

pandas is able to support an arbitrary aggregation with pivot tables. If the aggregation was first or last, would you still expect it to drop any non-numeric column? What about a time delta with sum?

Sierra-MC added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2022

phofl added this to the 1.5.2 milestone Oct 22, 2022

phofl added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Warnings Warnings that appear or should be added to pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2022

rhshadrach self-assigned this Oct 23, 2022

rhshadrach mentioned this issue Nov 10, 2022

REGR: Better warning in pivot_table when dropping nuisance columns #49615

Merged

5 tasks

mroeschke closed this as completed in #49615 Nov 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Getting FutureWarning for Groupby.mean when using .pivot_table #49247

BUG: Getting FutureWarning for Groupby.mean when using .pivot_table #49247

Sierra-MC commented Oct 22, 2022

INSTALLED VERSIONS

phofl commented Oct 22, 2022 •

edited

Loading

rhshadrach commented Oct 22, 2022 •

edited

Loading

phofl commented Oct 23, 2022 •

edited

Loading

rhshadrach commented Oct 23, 2022 •

edited

Loading

Sierra-MC commented Oct 24, 2022 •

edited

Loading

rhshadrach commented Oct 25, 2022

BUG: Getting FutureWarning for Groupby.mean when using .pivot_table #49247

BUG: Getting FutureWarning for Groupby.mean when using .pivot_table #49247

Comments

Sierra-MC commented Oct 22, 2022

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

phofl commented Oct 22, 2022 • edited Loading

rhshadrach commented Oct 22, 2022 • edited Loading

phofl commented Oct 23, 2022 • edited Loading

rhshadrach commented Oct 23, 2022 • edited Loading

Sierra-MC commented Oct 24, 2022 • edited Loading

rhshadrach commented Oct 25, 2022

phofl commented Oct 22, 2022 •

edited

Loading

rhshadrach commented Oct 22, 2022 •

edited

Loading

phofl commented Oct 23, 2022 •

edited

Loading

rhshadrach commented Oct 23, 2022 •

edited

Loading

Sierra-MC commented Oct 24, 2022 •

edited

Loading