Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df.sum() of string columns depends on whether or not they can be coerced to numbers #22642

Open
rosnfeld opened this issue Sep 8, 2018 · 7 comments
Labels
Bug Needs Discussion Requires discussion from core team before further action Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data

Comments

@rosnfeld
Copy link
Contributor

rosnfeld commented Sep 8, 2018

Note that all the columns in this example are string columns:

In [32]: df1 = pd.DataFrame(data={'a':['1', '2'], 'b':['3', '4']})

In [33]: df1.sum(axis=1)
Out[33]: 
0    13.0
1    24.0
dtype: float64

In [34]: df2 = pd.DataFrame(data={'a':['i', 'j'], 'b':['m', 'n']})

In [35]: df2.sum(axis=1)
Out[35]: 
0    im
1    jn
dtype: object

Problem description

It would seem a bug, or at least very surprising behavior, that the sum() operation would depend on the contents of the strings when summing the columns.

Not sure if this has been reported before - I wasn't sure exactly what to search on, but wasn't able to find anything.

Expected Output

I would expect the columns to be concatenated as strings and then left as strings, just the same as if we did df1.a + df1.b.

In [25]: df1.a + df1.b
Out[25]: 
0    13
1    24
dtype: object

Output of pd.show_versions()

On current master as of this filing.

In [26]: pd.show_versions() No module named 'dask'

INSTALLED VERSIONS

commit: 996f361
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.0.dev0+562.g996f361
pytest: 3.7.4
pip: 10.0.1
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.9.0
xarray: 0.10.8
IPython: 6.5.0
sphinx: 1.7.8
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.6
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.0
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.1.2

@WillAyd
Copy link
Member

WillAyd commented Sep 11, 2018

I suppose the behavior you are looking for is actually achieved at some point though there is a coercion at the end to float.

Using cat would be a much more idiomatic approach here, i.e. this:

In [14]: df1['a'].str.cat(df1['b'])
Out[14]: 
0    13
1    24
Name: a, dtype: object

I believe the auto cast to float for the numeric is desired in other operations, so not sure there's anything to be done for the sum use case here. Will see what others think

@WillAyd WillAyd added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Needs Discussion Requires discussion from core team before further action labels Sep 11, 2018
@ghost
Copy link

ghost commented Sep 14, 2018

I believe the auto cast to float for the numeric is desired in other operations

@WillAyd Which operations do you have in mind, where automatic conversion is desired?

@WillAyd
Copy link
Member

WillAyd commented Sep 14, 2018

@apnewberry when working with an object type holds numeric data. Taking the example above:

In [12]: df1 = df1.astype(int)  
In [12]: df1.sum(axis=1)
Out[12]: 
0    4.0
1    6.0
dtype: float64

@ghost
Copy link

ghost commented Sep 14, 2018

On pandas 0.23.4 I'm seeing a different behavior: the resulting dtype is int64, rather than float64. Given the astype(int) my expectation would be that the resulting dtype is an int rather than a float. Is there automatic conversion going on here that I'm missing?

In [8]: pd.__version__
Out[8]: '0.23.4'

In [9]: df1 = pd.DataFrame(data={'a':['1', '2'], 'b':['3', '4']})

In [10]: df1.astype(int).sum(axis=1)
Out[10]: 
0    4
1    6
dtype: int64

@WillAyd
Copy link
Member

WillAyd commented Sep 14, 2018

Hmm OK that's what I have on master. Somewhat orthogonal to the original post but if you want to submit a PR with a test case for that would certainly take it (if there's not one already).

Regardless going back to the OP going from object to a numeric type where applicable is certainly desired as it can yield memory and will definitely yield performance improvements. I don't see a way to reasonably account for what is being asked without impacting a larger use case and there's already a more idiomatic approach to get what is desired so I'm -1 on any change here

@ghost
Copy link

ghost commented Sep 15, 2018

Could you help me understand what you mean about performance improvements from casting object columns to numeric?

Here are a few timings in case it's helpful.

In [6]: n = 1_000_000
        df = pd.DataFrame(data={'a':['1' for _ in range(n)], 'b':['2' for _ in range(n)]})

In [9]: %timeit df.sum(axis=1)
209 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit df.astype(str).sum(axis=1)
288 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.astype(float).sum(axis=1)
147 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.astype(object).sum(axis=1)
227 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit df.sum(axis=1).astype(float)
209 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.sum(axis=1).astype(str)
421 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit df.sum(axis=1).astype(object)
236 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

@WillAyd
Copy link
Member

WillAyd commented Sep 15, 2018

@apnewberry if you have usage questions you should ask on SO or Gitter - please don't continue the conversation here as it hijacks the thread.

With that said, take the axis argument out of your first three calls and you'll see.

@mroeschke mroeschke added Bug Strings String extension data type and string data Numeric Operations Arithmetic, Comparison, and Logical operations and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020
@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 21, 2020
@mroeschke mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 22, 2021
@jreback jreback added this to the 1.4 milestone Nov 13, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Discussion Requires discussion from core team before further action Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants