ENH: Allow dictionaries to be passed to pandas.Series.str.replace #51748

Closed
1 of 3 tasks
lukefeilberg opened this issue Mar 2, 2023 · 14 comments · Fixed by #56175
Labels
API - Consistency Internal Consistency of API/Behavior API Design Enhancement Strings String extension data type and string data

Comments

@lukefeilberg

lukefeilberg commented Mar 2, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I often want to replace various strings in a Series with an empty string, and it would be nice to be able to do so in one call by passing a dictionary rather than chaining multiple .str.replace calls.

Alternatively I can use the regex flag and use some OR logic there but I think it would be much cleaner and more consistent with other functions to be able to pass a dictionary in my opinion.

How it currently works

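import pandas as pd

messy_series = pd.Series(
    data=['A', 'B_junk', 'C_gunk'],
    name='my_messy_col',
)

# Option 1: chain one .str.replace call per substring
clean_series = messy_series.str.replace('_junk', '').str.replace('_gunk', '')
# Option 2: a single call using regex alternation
clean_series = messy_series.str.replace('_junk|_gunk', '', regex=True)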

How I'd like it to work

messy_series = pd.Series(
    data=['A', 'B_junk', 'C_gunk'],
    name='my_messy_col',
)

clean_series = messy_series.str.replace({'_gunk': '', '_junk': ''})

Curious folks' thoughts, thanks y'all!!

Feature Description

A simple way to solve the problem would be to inspect whether a dictionary is passed to str.replace and, if so, iterate over the items of the dictionary and simply call replace on each key/value pair.

If this is an acceptable way to do it from the maintainers' perspectives then I'd happily submit such a PR.
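
The approach described above could be sketched in user code roughly as follows. This is only an illustration, not the implementation that was eventually merged: `replace_with_dict` is a hypothetical helper name, and the `regex` flag is simply forwarded to the existing `Series.str.replace`.

```python
import pandas as pd


def replace_with_dict(series: pd.Series, mapping: dict, regex: bool = False) -> pd.Series:
    """Hypothetical helper: call .str.replace once per key/value pair."""
    result = series
    for pat, repl in mapping.items():
        # Each pass operates on the output of the previous pass.
        result = result.str.replace(pat, repl, regex=regex)
    return result


messy_series = pd.Series(
    data=['A', 'B_junk', 'C_gunk'],
    name='my_messy_col',
)

clean_series = replace_with_dict(messy_series, {'_gunk': '', '_junk': ''})
print(clean_series.tolist())
```

Because the pairs are applied one after another, a later pattern can match text produced by an earlier replacement, so the order of the dictionary items can matter.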

Alternative Solutions

As mentioned above, both the following indeed currently work:

messy_series = pd.Series(
    data=['A', 'B_junk', 'C_gunk'],
    name='my_messy_col',
)

clean_series = messy_series.str.replace('_junk', '').str.replace('_gunk', '')
clean_series = messy_series.str.replace('_junk|_gunk', '', regex=True)

Additional Context

No response

@lukefeilberg lukefeilberg added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 2, 2023
@rhshadrach rhshadrach added the Strings String extension data type and string data label Mar 4, 2023
@rmhowe425
Contributor

take

@rhshadrach
Member

Alternatively I can use the regex flag and use some OR logic there but I think it would be much cleaner and more consistent with other functions to be able to pass a dictionary in my opinion.

But at the expense of expanding the API. Is using a dict less performant than using regex?

@lukefeilberg
Author

Hey @rhshadrach,

It looks like using a dict is indeed a bit less performant than using regex, at least for my example when I beef it up to 10M rows (shown in the screenshot below). I didn't test @rmhowe425's PR, though I imagine it's similar.

That said, while my original example can be solved with a regex since I'm mapping all "keys" to the same value, I've also had times where that wasn't the case and then I'd be forced to chain the regex (or loop over the items).
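
For the case where keys map to different values, one possible workaround with the existing API is a single regex pass with a callable replacement. The sketch below assumes the keys are literal substrings (hence `re.escape`); the `mapping` contents are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical mapping in which the keys do NOT share one replacement value.
mapping = {'_junk': '', '_gunk': '_clean'}

# Build one alternation pattern; longer keys go first so that a key which is
# a prefix of another key cannot shadow it.
pattern = '|'.join(
    re.escape(key) for key in sorted(mapping, key=len, reverse=True)
)

messy_series = pd.Series(['A', 'B_junk', 'C_gunk'])

# With regex=True, repl may be a callable that receives the match object.
result = messy_series.str.replace(
    pattern, lambda match: mapping[match.group(0)], regex=True
)
print(result.tolist())
```

Unlike a loop over the items, this makes a single pass over each string, so a replacement value is never re-scanned by another pattern.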

It's not a hill I would die on by any means, but I could imagine others would run into this and benefit from it. But if it turns out I'm more alone in these use-cases than I thought then no worries 🙂.


[Screenshot: timing comparison of the dict loop and the regex approach on 10M rows]
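
A comparison of this kind could be set up roughly as follows. This is a sketch, not the original benchmark: the row count is kept small here (the comment above mentions 10M rows), and the function names are made up for illustration.

```python
import timeit

import pandas as pd

n_rows = 30_000  # assumption: scaled down from the 10M rows mentioned above
messy = pd.Series(['A', 'B_junk', 'C_gunk'] * (n_rows // 3))
mapping = {'_junk': '', '_gunk': ''}


def loop_over_dict() -> pd.Series:
    # One literal .str.replace pass per dictionary item.
    out = messy
    for pat, repl in mapping.items():
        out = out.str.replace(pat, repl, regex=False)
    return out


def single_regex() -> pd.Series:
    # One pass using regex alternation.
    return messy.str.replace('_junk|_gunk', '', regex=True)


loop_seconds = timeit.timeit(loop_over_dict, number=3)
regex_seconds = timeit.timeit(single_regex, number=3)
print(f'loop over dict: {loop_seconds:.3f}s, single regex: {regex_seconds:.3f}s')
```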

@rhshadrach
Member

It seems to me that the for loop should live in user code rather than within pandas. To me, reasons why it should live in pandas would be a non-straightforward implementation, performance improvements, or API consistency. I don't think any of those qualify here.

@lukefeilberg
Author

lukefeilberg commented Apr 1, 2023

Yeah that's reasonable. Although I would personally think this qualifies as API consistency since most similar functions in fact do accept dictionaries in this way (such as map) and I always expect a dictionary would work and then get this error -- hence I opened this thinking maybe others expect the same. But I'm cool to close this if I'm alone in that opinion.
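
For context, here is a small illustration of existing pandas methods that already accept a dictionary. The data is made up; note that these operate on whole values rather than substrings.

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'bird'])
lookup = {'cat': 'feline', 'dog': 'canine'}

# Series.map: values missing from the dict become NaN.
mapped = animals.map(lookup)

# Series.replace: values missing from the dict are left unchanged.
replaced = animals.replace(lookup)

print(mapped.tolist())
print(replaced.tolist())
```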

@rhshadrach
Member

cc @mroeschke @phofl @jorisvandenbossche for any thoughts

@rmhowe425
Contributor

@rhshadrach Do we think we can close this issue?

@rhshadrach
Member

Although I would personally think this qualifies as API consistency since most similar functions in fact do accept dictionaries in this way (such as map) and I always expect a dictionary would work

@rmhowe425 - I'm persuaded by this and am now leaning toward being +1. It does seem convenient from a user standpoint to be able to supply a dictionary. Would like to get some others' thoughts. @mroeschke @MarcoGorelli @twoertwein

@phofl
Member

phofl commented Aug 5, 2023

This is kind of like the regular replace, so slight +1 on my side as well

@twoertwein
Member

+/-0: I like the idea because it feels similar to DataFrame.rename(columns={...}) but I think .str.<method> (and the other accessors) typically try to mirror the interface of the underlying datatype, so it would be inconsistent with str.replace.

@MarcoGorelli
Member

to be honest I think looping over a dict is clear enough, and the timing difference in #51748 (comment) is not really significant enough

but, no strong opinion, I wouldn't block this if others wanted it

@rmhowe425
Contributor

@rhshadrach Do we think another PR can be opened for this issue?

@rhshadrach
Member

but I think .str.<method> (and the other accessors) typically try to mirror the interface of the underlying datatype, so it would be inconsistent with str.replace.

For many str accessor methods, this is true. But there are various ones where we already break this. Python's str.replace accepts old, new, and count whereas pandas' is pat, repl, n, case, flags, regex.

I think applying the principle of least surprise here would lead one to conclude that the Python arguments are all accepted by pandas, but not necessarily the other way around. pandas is a specialized library, and so in general I think users won't be surprised that where there is overlap in method names, pandas has more functionality.
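
The signature difference mentioned above can be seen side by side in a short example. The sample strings are made up for illustration.

```python
import pandas as pd

# Built-in str.replace(old, new, count): literal replacement only.
py_result = 'a-b-c'.replace('-', '+', 1)

series = pd.Series(['a-b-c', 'A-B-C'])

# pandas: pat, repl, and n play the roles of old, new, and count.
pd_result = series.str.replace('-', '+', n=1, regex=False)

# pandas-only options with no built-in counterpart: case, flags, regex.
pd_case_insensitive = series.str.replace('b', 'x', case=False, regex=True)

print(py_result)
print(pd_result.tolist())
print(pd_case_insensitive.tolist())
```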

@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 13, 2023
@rhshadrach
Member

@rmhowe425 - I think a PR for this would be great.

7 participants