BUG: fix str.replace('.','') should replace every character #24935

Closed

Conversation

charlesdong1991
Copy link
Member

@charlesdong1991 charlesdong1991 commented Jan 25, 2019

@pep8speaks
Copy link

pep8speaks commented Jan 25, 2019

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-13 00:02:24 UTC

@charlesdong1991 charlesdong1991 changed the title from "fix str.replace('.','') should replace every character" to "BUG: fix str.replace('.','') should replace every character" Jan 25, 2019
@codecov
Copy link

codecov bot commented Jan 25, 2019

Codecov Report

Merging #24935 into master will decrease coverage by 49.48%.
The diff coverage is 28.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24935       +/-   ##
===========================================
- Coverage   92.38%   42.89%   -49.49%     
===========================================
  Files         166      166               
  Lines       52388    52392        +4     
===========================================
- Hits        48397    22472    -25925     
- Misses       3991    29920    +25929

| Flag | Coverage Δ |
|---|---|
| #multiple | ? |
| #single | 42.89% <28.57%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/strings.py | 32.81% <28.57%> (-65.78%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-99.35%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83eb242...8061fda. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 25, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@d66da60). Click here to learn what that means.
The diff coverage is 20%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #24935   +/-   ##
=========================================
  Coverage          ?   41.12%           
=========================================
  Files             ?      179           
  Lines             ?    50709           
  Branches          ?        0           
=========================================
  Hits              ?    20854           
  Misses            ?    29855           
  Partials          ?        0

| Flag | Coverage Δ |
|---|---|
| #single | 41.12% <20%> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/reshape/melt.py | 12.6% <0%> (ø) |
| pandas/core/strings.py | 37.76% <25%> (ø) |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d66da60...6267aae. Read the comment docs.

@charlesdong1991 charlesdong1991 force-pushed the issue_24804 branch 2 times, most recently from 8b8c210 to 4b9672a Compare January 26, 2019 08:17
@jreback jreback added the Strings String extension data type and string data label Jan 26, 2019
Copy link
Contributor

@jreback jreback left a comment

there are no tests for the deprecation

pandas/core/reshape/melt.py
@@ -114,10 +114,8 @@ Conversion

Strings
^^^^^^^
- Bug in :func:`Series.str.replace` not applying regex in patterns of length 1 (:issue:`24804`)
Copy link
Contributor

you can leave this here, but need a note on the Deprecation warning change.

pandas/core/strings.py (outdated)
@@ -456,9 +456,10 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
flags : int, default 0 (no flags)
- re module flags, e.g. re.IGNORECASE
- Cannot be set if `pat` is a compiled regex
-regex : boolean, default True
+regex : boolean, default None
- If True, assumes the passed-in pattern is a regular expression.
- If False, treats the pattern as a literal string
Copy link
Contributor

this is a disconcerting change. You are essentially having a different default on what is being passed here. I would be ok with forcing the passing of regex always, possibly in preparation of changing the default to regex=False

Copy link
Member Author

Yes @jreback, I understand that it's quite a disruptive change. This change follows the proposal from @TomAugspurger in #24809 (comment). The purpose is to make this argument more explicit.

Copy link
Contributor

I would be ok with forcing the passing of regex always, possibly in preparation of changing the default to regex=False

@jreback can you clarify what you mean by "forcing"? If we raise when regex is not specified, that would be an API breaking change.

My proposal is to preserve the previous behavior, but warn when a length-1 regex is detected. Then we can get to the documented behavior of regex=True.

- If True, assumes the passed-in pattern is a regular expression.
- If False, treats the pattern as a literal string
- If pat is single special character, default regex is False.
Copy link
Contributor

@TomAugspurger TomAugspurger Jan 28, 2019

This should mention something about a warning being issued when regex is not specified. Something like

- If `pat` is a single character and `regex` is not specified, `pat` is interpreted as a
  string literal. If `pat` is also a regular expression symbol, a warning is issued that
  in the future `pat` will be interpreted as a regex, rather than a literal.

@@ -577,6 +578,12 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
if callable(repl):
raise ValueError("Cannot use a callable replacement when "
"regex=False")
# if regex is default None, and a single special character is given
# in pat, still take it as a literal, and raise the Future warning
if regex is None and len(pat) == 1 and re.findall(r"\W", pat):
Copy link
Contributor

I don't understand the '\W' here. I think that's not quite right for non-word characters that are not regex symbols. Something like @

Does python have a list of regex symbols?

Is the heuristic re.escape(pat) == pat good enough?

In [46]: re.escape('@')
Out[46]: '@'

In [47]: re.escape('^')
Out[47]: '\\^'

Copy link
Member Author

@TomAugspurger oh, indeed it's wrong! nice catch!! thanks for the correction!!

Copy link
Contributor

Can you add a test to ensure that something like Series.str.replace('@', 'at') doesn't produce a warning?

Copy link
Member Author

Copy link
Member Author

@charlesdong1991 charlesdong1991 Jan 28, 2019

Hmm, @TomAugspurger, re.escape somehow changed... weird.
I did get the same output the first time I tested in my Jupyter notebook:
re.escape("@") -> "@"
but now I cannot replicate this and get re.escape("@") -> "\\@" ...
so I will go for this option:
pat in list("[\^$.|?*+()]")
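
The discrepancy described in this comment is consistent with a change in Python 3.7, where `re.escape` stopped escaping characters that have no special meaning in a regular expression (such as `@`). That makes `re.escape(pat) == pat` a version-dependent heuristic. A minimal sketch of the explicit-set alternative follows; the names `REGEX_SPECIAL_CHARS` and `is_single_special_char` are hypothetical and not taken from the PR, and the exact set of characters is an assumption based on the list quoted above.

```python
import re

# Hypothetical constant: the characters from the list quoted in the comment.
# A raw string keeps the backslash as a literal character.
REGEX_SPECIAL_CHARS = frozenset(r"[\^$.|?*+()]")


def is_single_special_char(pat):
    """Return True if `pat` is exactly one character with special regex meaning."""
    return isinstance(pat, str) and len(pat) == 1 and pat in REGEX_SPECIAL_CHARS


def escape_changes_pattern(pat):
    """The re.escape heuristic: True if escaping alters `pat`.

    For "@", Python 3.7+ returns the string unchanged, while earlier
    versions return it with a leading backslash, so the result of this
    check for "@" depends on the interpreter version.
    """
    return re.escape(pat) != pat
```

The explicit set gives the same answer on every Python version, at the cost of having to decide by hand which characters count as special.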

-
-
-
- :func:`Series.str.replace`, the default regex for single character might be deprecated.
Copy link
Contributor

This will need to be updated.

@charlesdong1991
Copy link
Member Author

any follow-up code change requests and reviews? @jreback @TomAugspurger

Copy link
Contributor

@jreback jreback left a comment

I would rather simply go whole hog and just change regex=None as a warning for all, meaning you must change to an explicit True/False. It makes it more clear. But this would then be a rather large change.

@@ -42,6 +42,7 @@ Other API Changes
Deprecations
~~~~~~~~~~~~

- :func:`Series.str.replace`, when pat is single special regex character and regex is not defined, regex is by default False for now, but this might be deprecated in the future.
Copy link
Contributor

needs an issue reference; use double backticks on False, use single backticks on keywords.

Copy link
Member Author

thanks changed!

@charlesdong1991
Copy link
Member Author

yes, indeed will be a kind of api change. shall I open another issue to make this change if other maintainers agree with this? Or do you want to do this change in this PR? @jreback

@charlesdong1991 charlesdong1991 force-pushed the issue_24804 branch 2 times, most recently from 7e55526 to 2c9f4fb Compare February 18, 2019 08:41
@WillAyd
Copy link
Member

WillAyd commented Feb 28, 2019

So as an alternative, what if we simply detected the case where the pattern provided is a single special regex character with regex=True and raised a FutureWarning when that happens, noting that in the future we will evaluate it as a regular expression regardless?

@charlesdong1991
Copy link
Member Author

@WillAyd based on your description, i think it's kind of what this PR is doing, right?

@WillAyd
Copy link
Member

WillAyd commented Feb 28, 2019

I'm saying don't even update the signature of the function, simply do something like:

if regex and pat in re_char_blacklist:
    raise FutureWarning("Single character patterns will be evaluated as regexes in the future")

Or something to the effect. Would minimize a lot of the volatility going on here

@TomAugspurger
Copy link
Contributor

@WillAyd the raise in your pseudocode should probably be a warnings.warn?

I think (?) there is then no way to get the desired future behavior without a warning. IIUC, your proposal would warn on

pd.Series(['aa']).str.replace('.', 'b', regex=True)

I think we'll need a period of

  1. regex=None (default): warning and output aa
  2. regex=True: no warning, output bb
  3. regex=False: no warning, output aa
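
The three cases above can be sketched with a sentinel default. This is only an illustration in plain Python over a list of strings, not the pandas implementation: the function name `replace_each`, the constant `_SPECIAL`, and the warning text are all hypothetical.

```python
import re
import warnings

# Hypothetical set of single characters that have special regex meaning.
_SPECIAL = frozenset(r"[\^$.|?*+()]")


def replace_each(values, pat, repl, regex=None):
    """Replace `pat` with `repl` in every string of `values`.

    regex=None is a sentinel meaning "the caller did not choose".
    """
    if regex is None:
        if len(pat) == 1 and pat in _SPECIAL:
            # Case 1: keep the legacy literal behaviour, but warn that
            # the pattern will be treated as a regex in the future.
            warnings.warn(
                "A single special character is currently treated as a "
                "literal; pass regex=True or regex=False explicitly.",
                FutureWarning,
                stacklevel=2,
            )
            regex = False
        else:
            regex = True
    if regex:
        # Case 2: explicit regex=True, no warning.
        return [re.sub(pat, repl, v) for v in values]
    # Case 3: explicit regex=False, no warning.
    return [v.replace(pat, repl) for v in values]
```

Under these assumptions, `replace_each(['aa'], '.', 'b')` warns and leaves the value unchanged, while passing `regex=True` or `regex=False` explicitly selects a behaviour without any warning, which is the property the sentinel is meant to provide.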

@WillAyd
Copy link
Member

WillAyd commented Feb 28, 2019

Yea sorry meant warn instead of raise. Why do we need regex=None at all? I was hoping to avoid that API change

@TomAugspurger
Copy link
Contributor

Does the rest of
#24935 (comment) answer your question? The short version is (IIUC), your proposal wouldn't allow for option 2 (regex=True interprets . as a regex, and does not warn).

@WillAyd
Copy link
Member

WillAyd commented Feb 28, 2019

Hmm unless I'm mistaken my proposal would only affect option two. So something like:

if regex and pat in {'.', '^', '$', '*', '+', '?', '}'}:
    warnings.warn("You've passed a single special character in with regex=True which is evaluated as a single character. This will change in the future to match regex behavior")

Warns the user about this behavior in the future but doesn't impact their current code, allowing for a smoother deprecation in the future.

To your point though the result of option 2 would still stay 'aa' as is today just with the warning

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 28, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Feb 28, 2019

Ah I gotcha. So in my mind if a user is passing a single special character and regex=True (whether explicitly or via the default argument) then it is a current "misuse" of that parameter.

If the warning said something like "explicitly pass regex=False to maintain current behavior" would that be better?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 28, 2019 via email

@@ -43,6 +43,7 @@ Other API Changes
Deprecations
~~~~~~~~~~~~

- :func:`Series.str.replace`, when pat is single special regex character and regex is not defined, regex is by default ``False`` for now, but this might be deprecated in the future. (:issue:`24804`)
Copy link
Contributor

use double back-ticks around pat.

..is single special regex character'.. (such as .|\ etc).

@@ -456,9 +456,13 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
flags : int, default 0 (no flags)
- re module flags, e.g. re.IGNORECASE
- Cannot be set if `pat` is a compiled regex
-regex : bool, default True
+regex : boolean, default None
Copy link
Contributor

default is still True

Copy link
Member Author

So, you mean we now don't raise a warning anymore when pat is a single special character and the user doesn't explicitly specify regex?
If the default is set to True and I use @TomAugspurger's example:

pd.Series(['aa']).str.replace('.', 'b')

we will get:

  1. default or explicit regex=True: output is bb, no warning
  2. regex=False: output is aa, no warning.

Is that correct? @WillAyd @TomAugspurger @jreback

@jreback jreback added Bug Deprecate Functionality to remove in pandas labels Mar 3, 2019
@TomAugspurger
Copy link
Contributor

@WillAyd can you summarize your aversion to the (typical) use of None as a backwards compatible sentinel value for when we should warn?

@WillAyd
Copy link
Member

WillAyd commented Mar 3, 2019 via email

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

can you merge master

@charlesdong1991
Copy link
Member Author

Hi, @jreback I just merged master and fix conflict!

@jreback
Copy link
Contributor

jreback commented Mar 24, 2019

something still failing

@charlesdong1991
Copy link
Member Author

@jreback i just merged the master and fix test error

@WillAyd
Copy link
Member

WillAyd commented Apr 10, 2019

This one has been around for a while. Just so we are all on the same page - is the intention here definitely to force users down the road to explicitly provide regex=True or regex=False? I know that was mentioned in comments above just not sure we agreed to it.

An easy alternative would just be to raise a DeprecationWarning when a single special character gets passed and actually change behavior at a later date, which I think is what @TomAugspurger suggests in #24804 (comment)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 10, 2019 via email

@jreback
Copy link
Contributor

jreback commented Apr 10, 2019

ok, re-reading: #24809 (comment)

that looks like a good idea, not sure where that is in implementation.

@charlesdong1991
Copy link
Member Author

@jreback thanks for your comment! could you explain a little bit more? I think the change I made in this PR was following @TomAugspurger suggestion in 24809

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

@charlesdong1991 can you merge master and we'll see where this is

@charlesdong1991
Copy link
Member Author

@jreback I merged to master and resolved conflicts, pls take a look

@TomAugspurger
Copy link
Contributor

I just reread through the issue, and my thinking has changed a bit.

Do we think that .str.replace('.', '...') replacing every character is actually the most useful thing for users? Or is the change primarily motivated by the inconsistency?

If we think that most users will expect .str.replace(".") to only replace periods, then perhaps we just document that we don't auto-regex single-character pat (., ^, $, |) and tell users to pass re.compile(pat) if they want that?
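
The two interpretations being weighed here can be shown with plain Python strings. This is a sketch using only the standard library, not pandas, and the sample value `"a.b"` is made up for illustration.

```python
import re

text = "a.b"

# Literal interpretation: only the period itself is replaced.
literal_result = text.replace(".", "-")

# Regex interpretation: "." matches any character except a newline,
# so every character in the string is replaced.
regex_result = re.compile(".").sub("-", text)

print(literal_result)  # a-b
print(regex_result)    # ---
```

Passing a compiled pattern makes the regex intent explicit, which is the opt-in route suggested in the comment above.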

@WillAyd
Copy link
Member

WillAyd commented Jun 13, 2019

Or is the change primarily motivated by the inconsistency

Definitely inconsistency IMO. I don't see that particular operation being useful but I don't think we should be special casing either, as our general handling of regexes in str replacement ops is all over the place

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 13, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Jun 13, 2019

I wouldn't be opposed to really doing nothing on this atm. The behavior can definitely be surprising as noted by the OP of the original issue, but to your point I don't see why the strictly standard behavior is useful either.

May be worth just punting this for a more comprehensive alignment of the regex keywords across string ops if it makes sense to tackle then

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 17, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Jun 17, 2019

Agreed - @charlesdong1991 make sense to you?

@charlesdong1991
Copy link
Member Author

ok, it does.

Labels
Bug Deprecate Functionality to remove in pandas Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.replace('.','') should replace every character?
5 participants