BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

rosnfeld · 2018-09-08T23:13:00Z

Note that all the columns in this example are string columns:

In [32]: df1 = pd.DataFrame(data={'a':['1', '2'], 'b':['3', '4']})

In [33]: df1.sum(axis=1)
Out[33]: 
0    13.0
1    24.0
dtype: float64

In [34]: df2 = pd.DataFrame(data={'a':['i', 'j'], 'b':['m', 'n']})

In [35]: df2.sum(axis=1)
Out[35]: 
0    im
1    jn
dtype: object

Problem description

It would seem a bug, or at least very surprising behavior, that the sum() operation would depend on the contents of the strings when summing the columns.

Not sure if this has been reported before - I wasn't sure exactly what to search on, but wasn't able to find anything.

Expected Output

I would expect the columns to be concatenated as strings and then left as strings, just the same as if we did df1.a + df1.b.

In [25]: df1.a + df1.b
Out[25]: 
0    13
1    24
dtype: object

Output of `pd.show_versions()`

On current master as of this filing.

In [26]: pd.show_versions() No module named 'dask'

INSTALLED VERSIONS

commit: 996f361
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.0.dev0+562.g996f361
pytest: 3.7.4
pip: 10.0.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.9.0
xarray: 0.10.8
IPython: 6.5.0
sphinx: 1.7.8
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.1.2

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-09-11T00:40:24Z

I suppose the behavior you are looking for is actually achieved at some point though there is a coercion at the end to float.

Using cat would be a much more idiomatic approach here, i.e. this:

In [14]: df1['a'].str.cat(df1['b'])
Out[14]: 
0    13
1    24
Name: a, dtype: object

I believe the auto cast to float for the numeric is desired in other operations, so not sure there's anything to be done for the sum use case here. Will see what others think

ghost · 2018-09-14T08:01:41Z

I believe the auto cast to float for the numeric is desired in other operations

@WillAyd Which operations do you have in mind, where automatic conversion is desired?

WillAyd · 2018-09-14T16:05:47Z

@apnewberry when working with an object type holds numeric data. Taking the example above:

In [12]: df1 = df1.astype(int)  
In [12]: df1.sum(axis=1)
Out[12]: 
0    4.0
1    6.0
dtype: float64

ghost · 2018-09-14T18:38:11Z

On pandas 0.23.4 I'm seeing a different behavior: the resulting dtype is int64, rather than float64. Given the astype(int) my expectation would be that the resulting dtype is an int rather than a float. Is there automatic conversion going on here that I'm missing?

In [8]: pd.__version__
Out[8]: '0.23.4'

In [9]: df1 = pd.DataFrame(data={'a':['1', '2'], 'b':['3', '4']})

In [10]: df1.astype(int).sum(axis=1)
Out[10]: 
0    4
1    6
dtype: int64

WillAyd · 2018-09-14T21:35:33Z

Hmm OK that's what I have on master. Somewhat orthogonal to the original post but if you want to submit a PR with a test case for that would certainly take it (if there's not one already).

Regardless going back to the OP going from object to a numeric type where applicable is certainly desired as it can yield memory and will definitely yield performance improvements. I don't see a way to reasonably account for what is being asked without impacting a larger use case and there's already a more idiomatic approach to get what is desired so I'm -1 on any change here

ghost · 2018-09-15T02:52:00Z

Could you help me understand what you mean about performance improvements from casting object columns to numeric?

Here are a few timings in case it's helpful.

In [6]: n = 1_000_000
        df = pd.DataFrame(data={'a':['1' for _ in range(n)], 'b':['2' for _ in range(n)]})

In [9]: %timeit df.sum(axis=1)
209 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit df.astype(str).sum(axis=1)
288 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.astype(float).sum(axis=1)
147 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.astype(object).sum(axis=1)
227 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit df.sum(axis=1).astype(float)
209 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.sum(axis=1).astype(str)
421 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit df.sum(axis=1).astype(object)
236 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

WillAyd · 2018-09-15T03:00:46Z

@apnewberry if you have usage questions you should ask on SO or Gitter - please don't continue the conversation here as it hijacks the thread.

With that said, take the axis argument out of your first three calls and you'll see.

WillAyd added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Needs Discussion Requires discussion from core team before further action labels Sep 11, 2018

mroeschke added Bug Strings String extension data type and string data Numeric Operations Arithmetic, Comparison, and Logical operations and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020

jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 21, 2020

mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 22, 2021

kinshukdua mentioned this issue Oct 21, 2021

BUG: make mean() raise an exception for strings #44131

Closed

9 tasks

jreback added this to the 1.4 milestone Nov 13, 2021

jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

rosnfeld commented Sep 8, 2018 •

edited

INSTALLED VERSIONS

WillAyd commented Sep 11, 2018

ghost commented Sep 14, 2018

WillAyd commented Sep 14, 2018

ghost commented Sep 14, 2018

WillAyd commented Sep 14, 2018

ghost commented Sep 15, 2018

WillAyd commented Sep 15, 2018

BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

Comments

rosnfeld commented Sep 8, 2018 • edited

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Sep 11, 2018

ghost commented Sep 14, 2018

WillAyd commented Sep 14, 2018

ghost commented Sep 14, 2018

WillAyd commented Sep 14, 2018

ghost commented Sep 15, 2018

WillAyd commented Sep 15, 2018

rosnfeld commented Sep 8, 2018 •

edited

Output of `pd.show_versions()`