New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str.cat does not align on index? #18657

Closed
h-vetinari opened this Issue Dec 6, 2017 · 5 comments

Comments

Projects
None yet
3 participants
@h-vetinari
Contributor

h-vetinari commented Dec 6, 2017

The implicit index-matching of pandas for operations between different DataFrame/Series is great and most of the times, it just works. It does so consistently enough, that the expectation (for me) is that different Series will be aligned before an operation is performed.

For some reason, str.cat does not seem to do so.

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
col = pd.Series(['a','b','c','d','e','f','g','h','i','j'])

# choose random subsets
ss1 = [8, 1, 2, 0, 6] # list(col.sample(5).index) 
ss2 = [4, 0, 9, 2, 6] # list(col.sample(5).index)

# perform str.cat
col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index()
# 0    ac <-- UNMATCHED!
# 1    ba <-- UNMATCHED!
# 2    cj <-- UNMATCHED!
# 6    gg <-- correct by sheer luck
# 8    ie <-- UNMATCHED!

# compared for example with Boolean operations on unmatched series
# (matching indices and returning Series with union of both indices),
# this is inconsistent!
b = col.loc[ss1].astype(bool) & col.loc[ss2].astype(bool)
b
# 0     True
# 1    False
# 2     True
# 4    False
# 6     True
# 8    False
# 9    False

# if we manually align the Series
# (easy here by masking from the Series we just subsampled, hard in practice),
# then the NaNs are handled as expected:
m = col.where(np.isin(col.index, ss1)).str.cat(col.where(np.isin(col.index, ss2)), sep = '')
m
# 0     aa
# 1    NaN
# 2     cc
# 3    NaN
# 4    NaN
# 5    NaN
# 6     gg
# 7    NaN
# 8    NaN
# 9    NaN

# based on the normal pandas-behaviour for unmatched Series
# (for example as for Boolean "and" above), the following would be
# the expected result of col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index() !
m.loc[b.index]
# 0     aa <-- MATCHED!
# 1    NaN
# 2     cc <-- MATCHED!
# 4    NaN
# 6     gg <-- MATCHED!
# 8    NaN
# 9    NaN

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Dec 6, 2017

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Dec 6, 2017

Yes, this should probably align when others is a Series. It'll require some care to get right, since it's an API breaking change.

I'd favor a keyword like align for controlling it. For now it'll be None by default and if others is a series it will not align, but warn that the default will change in the future to align.

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Dec 6, 2017

Do we have any other string operations that take a Series / DataFrame and should be handled similarly?

@h-vetinari

This comment has been minimized.

Contributor

h-vetinari commented Dec 6, 2017

Cool! I think your proposal for dealing with the API-breaking sounds good, except maybe calling it align = False by default first, and changing to True at some point in the future (not unlike the expand-keyword that .str.extract gained in 0.18)

Looking at https://pandas-docs.github.io/pandas-docs-travis/text.html#method-summary,
it seems that .str.cat is the only method that deals with two different Series. Maybe that's why the alignment for it fell through the cracks?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Dec 6, 2017

@h-vetinari

This comment has been minimized.

Contributor

h-vetinari commented May 3, 2018

@TomAugspurger @jreback
This can now be closed.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 3, 2018

@jreback jreback closed this May 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment