Series.str.replace() is not actually the same as str.replace() #16808
In : import pandas as pd
In : series = pd.Series(['a', '(b)'])
In : series.str.replace('a', '[a]')
Out:
0    [a]
1    (b)
dtype: object
In : series.str.replace('(b)', '[b]')  # unexpected behavior
Out:
0        a
1    ([b])
dtype: object
In : series.str.replace('\(b\)', '[b]')  # need to escape
Out:
0      a
1    [b]
dtype: object
In : '(b)'.replace('(b)', '[b]')  # Python str.replace is different, uses literal string
Out: '[b]'
The documentation for
However, that's not what is happening - it appears it's interpreting a string as a regex, so you need to escape characters like parentheses.
I would expect that for vanilla strings, it works like regular Python str.replace() - using literal strings instead of regexes.
Alternatively the documentation could be updated, but I think the Python str.replace() behavior is what most users would expect.
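For comparison, the same difference can be reproduced with the standard library alone. This is only a sketch of the two interpretations of the pattern; it uses `re.sub` as a stand-in for the regex path and does not call pandas at all.

```python
import re

text = "(b)"

# Regex interpretation: the parentheses form a capture group,
# so the pattern matches only the "b" between the literal parens.
as_regex = re.sub("(b)", "[b]", text)

# Literal interpretation, which is what Python's str.replace() does.
as_literal = text.replace("(b)", "[b]")

print(as_regex)    # ([b])
print(as_literal)  # [b]
```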
from the doc-string
it is a bit ambiguous, though so is a pattern like
I'm fine with either solution, not sure which is more intuitive or less disruptive to existing practice. I too had considered suggesting an additional "regex" param but had forgotten it when I went to write the issue up; if we go that way I'd argue for defaulting to regex=False.
For what it's worth, a user might not know they are passing a string that "looks like a regex". Maybe you got a result from some other operation and didn't inspect it, but still passed it in to series.str.replace(). I stumbled upon this issue doing something like the following: imagine an item description with a parenthetical remark, e.g. "Men's shirt (blue)", and you try to replace all instances of that with some other text, maybe just "Men's clothing". When some of your replacements fail to take, it's pretty surprising.
In : import pandas as pd
In : series = pd.Series(['No parens', 'Some text (with parens)'])
In : series.str.replace('No parens', 'Replaced')
Out:
0                   Replaced
1    Some text (with parens)
dtype: object
In : series.str.replace('Some text (with parens)', 'Replaced')  # fails to replace
Out:
0                  No parens
1    Some text (with parens)
dtype: object
In : series.str.replace(series.max(), 'Replaced')  # nothing gets replaced!
Out:
0                  No parens
1    Some text (with parens)
dtype: object
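One possible workaround when the pattern comes from the data itself is to escape it before it reaches the regex engine. A minimal sketch, using a plain list and `re.sub` in place of the Series and `str.replace` (the variable names are made up for illustration):

```python
import re

descriptions = ["No parens", "Some text (with parens)"]

# A pattern taken from the data, analogous to series.max() above.
target = max(descriptions)

# Unescaped: "(with parens)" is parsed as a group, so the pattern looks for
# "Some text with parens" and matches nothing.
unescaped = [re.sub(target, "Replaced", d) for d in descriptions]

# Escaped: re.escape() backslash-escapes the metacharacters, so the
# pattern matches the original text literally.
escaped = [re.sub(re.escape(target), "Replaced", d) for d in descriptions]

print(unescaped)  # ['No parens', 'Some text (with parens)']
print(escaped)    # ['No parens', 'Replaced']
```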
We just hit this issue. At this point the problem goes back at least to 0.16 and is still present in the current 0.23 dev. Problem is at
Thus any pattern with len > 1 is interpreted as a regex, whether or not it was meant as one.
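To make that rule concrete, here is a rough sketch of the dispatch as described in this comment. It is a hypothetical reconstruction for a single string, not the pandas source, and `replace_as_described` is an invented name.

```python
import re


def replace_as_described(value, pat, repl):
    """Hypothetical model of the behavior described above (not pandas code)."""
    if len(pat) > 1:
        # Longer patterns go to the regex engine, even if the caller
        # meant them as plain text.
        return re.sub(pat, repl, value)
    # Single-character patterns are treated as literal strings.
    return value.replace(pat, repl)


print(replace_as_described("(b)", "(b)", "[b]"))  # ([b])
print(replace_as_described("a.b", ".", "-"))      # a-b
```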
At this point, any change that made regex matching no longer the default would be breaking, so I think the options are:
Any thoughts? Current behavior really is inconsistent with the doc.
I can do a PR. Question on API, though, before I write. Suppose someone gives us
Should that raise or should we just silently use regex?
maintains back compat, it does mean that
and have it only replace periods and not all characters. But maybe that's such a rare request anyway we can let it go for the sake of back compat?
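The "regex" parameter suggested earlier in the thread might look roughly like the following for a single string. This is only an illustration of the idea: the function name is made up, and the default of `regex=True` is an assumption chosen here to keep existing behavior, not a decision from this discussion.

```python
import re


def replace_with_flag(value, pat, repl, regex=True):
    """Hypothetical sketch of an explicit regex switch (not pandas code)."""
    if regex:
        return re.sub(pat, repl, value)
    return value.replace(pat, repl)


# As a regex, "." matches every character.
print(replace_with_flag("a.b", ".", "-"))               # ---
# As a literal, "." matches only the period.
print(replace_with_flag("a.b", ".", "-", regex=False))  # a-b
```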