Dataframe.mean with numeric_only=False results in error with strings #26927

Closed
smangham opened this issue Jun 18, 2019 · 12 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Docs Dtype Conversions Unexpected or buggy dtype conversions Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Reduction Operations sum, mean, min, max, etc.

Comments

@smangham

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})
df.mean(axis=0, numeric_only=False, skipna=False)

Problem description

Rather than returning NaN for a non-numeric column of strings when taking the mean, it throws TypeError: could not convert string 'abc' to float.

Expected Output

I would expect this to output a series with values [1, NaN, 5].

My work-around is currently df.apply(pd.to_numeric, args=['coerce']).mean(axis=0, skipna=False), which outputs the expected result.
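A minimal sketch of the work-around above, written with the keyword form errors="coerce" rather than the positional args=['coerce']. The variable names coerced and result are illustrative only and are not part of the original report.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"], "C": [4, 5, 6]})

# Coerce every column to numeric; values that cannot be parsed become NaN.
coerced = df.apply(pd.to_numeric, errors="coerce")

# With skipna=False, a column that contains any NaN has a NaN mean,
# so the all-string column B yields NaN instead of raising.
result = coerced.mean(axis=0, skipna=False)
print(result)
```

Note that this silently turns every unparseable value into NaN, which may hide genuine data problems in columns that are supposed to be numeric.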

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.2.1
pyarrow: 0.12.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

I'm not sure I agree with NaN being the expected output for the mean of string columns.

@smangham
Author

Fair enough, though it seems to me within the context of numeric_only=False it would be a reasonable response, as the average of a non-numeric column is not (usually) a number.

I was expecting it to mostly be useful for making sure the output series length matched the column index length, taking non-numeric columns into account.

Is it the case that numeric_only is instead really intended for custom types where you've defined addition and float conversion yourself?
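If the goal is only an output whose length matches the column index, one possible approach (an assumption about the use case, not something pandas does for you) is to take the mean of the numeric columns and then reindex against the original columns. The names numeric_means and aligned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"], "C": [4, 5, 6]})

# Drop non-numeric columns from the reduction instead of raising.
numeric_means = df.mean(axis=0, numeric_only=True)

# Reindex against the original columns so the result has one entry per
# column; columns excluded from the reduction appear as NaN.
aligned = numeric_means.reindex(df.columns)
print(aligned)
```

Unlike coercing with pd.to_numeric, this leaves the data untouched and only fills in NaN for columns that were excluded from the reduction.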

@TomAugspurger
Contributor

TomAugspurger commented Jun 18, 2019 via email

@smangham
Author

Would elaborating on the documentation to make the behaviour of numeric_only=False more explicit (i.e. non-numeric dtypes that do not support both the + and / operators will result in an exception) be a reasonable response?

@jreback
Contributor

jreback commented Jun 25, 2019

yes and/or doc-string additions

@sergeny

sergeny commented Aug 9, 2019

We have found a much bigger problem:

pd.Series(['1', '2', '3', '4']).mean() returns 308.5 instead of 2.5 (pandas 0.23.4).
This is because it divides 1234 by 4.
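The arithmetic behind that number can be reproduced in plain Python. This is only an illustration of the reported result (string addition concatenates, and the concatenated text is then read as a number); it does not call pandas and is not a claim about how any particular pandas version implements mean.

```python
values = ["1", "2", "3", "4"]

# Adding strings concatenates them, so the "sum" is the text "1234".
concatenated = "".join(values)

# Reading that text as a float and dividing by the count gives the
# misleading value reported above.
mean_like = float(concatenated) / len(values)

# Converting each element first gives the intended mean.
correct = sum(float(v) for v in values) / len(values)

print(mean_like, correct)
```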

@TomAugspurger
Contributor

@sergeny that sounds like a different issue.

@TomAugspurger TomAugspurger added Docs Dtype Conversions Unexpected or buggy dtype conversions labels Aug 13, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 13, 2019
@jbrockmendel jbrockmendel added Reduction Operations sum, mean, min, max, etc. Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Sep 21, 2020
@jsignell
Contributor

It seems to me that a good behavior would be to just include a bit more information on the error. Maybe something like:

TypeError: Could not convert ['abc'] to numeric. Select only valid columns before calling the reduction or drop nuisance columns with 'numeric_only=True'.
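A rough sketch of the "select only valid columns before calling the reduction" option mentioned in that message, using DataFrame.select_dtypes. The names numeric, dropped and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"], "C": [4, 5, 6]})

# Keep only columns with a numeric dtype.
numeric = df.select_dtypes(include="number")

# Record which columns were left out, so nothing is dropped silently.
dropped = [col for col in df.columns if col not in numeric.columns]

result = numeric.mean(axis=0, skipna=False)
print(dropped)
print(result)
```

Selecting columns explicitly makes the choice visible at the call site, whereas numeric_only=True makes the same selection implicitly.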

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 13, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021
@ES208

ES208 commented Aug 14, 2022

Change the axis from 0 to 1 and you will get some output.

However, I ran the code below:
import numpy as np
import pandas as pd
a = [1, "Name", np.nan]
b = [3, 75, 0]
c = [6, 80, 90]
df = pd.DataFrame({'A': a, 'B': b, 'C': c})
df["row_mean"] = df.mean(axis=1, numeric_only=True)

Output:

A | B | C | row_mean
1 | 3 | 6 | 4.5 --> expected: 3.3333 (column A ignored)
Name | 75 | 80 | 77.5 --> correct
NaN | 0 | 90 | 45.0 --> correct

In the above case, the computation ignored column A completely, including its numeric value in the first row.
So be careful when using numeric_only=True on your data frame.
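One likely explanation for the result above is that column A mixes an integer, a string and NaN, so it is stored with object dtype and numeric_only=True excludes the whole column. A sketch of one way to get the expected row means, assuming it is acceptable to treat unparseable values as missing (the names a_dtype, coerced and row_mean are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, "Name", np.nan], "B": [3, 75, 0], "C": [6, 80, 90]})

# Column A holds mixed Python objects, so it is not a numeric column.
a_dtype = df["A"].dtype
print(a_dtype)

# Coerce each column to numeric; "Name" becomes NaN, 1 stays 1.0.
coerced = df.apply(pd.to_numeric, errors="coerce")

# skipna defaults to True, so NaN entries are ignored row by row.
row_mean = coerced.mean(axis=1)
print(row_mean)
```

With the coerced frame, the first row averages all three values, while the second and third rows average only B and C.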

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@miraculixx

miraculixx commented Oct 25, 2023

this is now the default behavior since #49915. I'm not sure it should be since mean() fails on non-numeric values.

@rhshadrach
Member

@miraculixx - I'm not sure what you mean by "this is now the default behavior". Can you clarify?

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Oct 26, 2023
@phofl
Member

phofl commented Mar 18, 2024

Yeah, let's close this. mean shouldn't work on strings and definitely shouldn't return NaN for them; it should fail loudly instead.

@phofl phofl closed this as completed Mar 18, 2024