ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175

Merged on Feb 12, 2024 (22 commits)
Showing changes from 9 commits

Commits
b1ea8c2
Adding implementation, unit tests, and documentation updates.
rmhowe425 Nov 26, 2023
573e6e4
Fixing code check and parameterizing unit tests.
rmhowe425 Nov 26, 2023
f9fe71e
Added additional documentation.
rmhowe425 Nov 26, 2023
77a579e
Updating implementation based on reviewer feedback.
rmhowe425 Nov 30, 2023
5de632a
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Nov 30, 2023
d472b01
Fixing documentation issues.
rmhowe425 Nov 30, 2023
eceb234
Attempting to fix double line break.
rmhowe425 Nov 30, 2023
5702ea9
Removed string casting for value parameter in call to _str_replace.
rmhowe425 Dec 3, 2023
ab28c9e
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Dec 3, 2023
3be3c6b
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Jan 9, 2024
f01728f
Updating whatsnew to fix merge conflict.
rmhowe425 Jan 9, 2024
9ca546b
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Jan 19, 2024
591e380
Updated implementation based on reviewer feedback.
rmhowe425 Jan 20, 2024
bae43ed
Cleaning up implementation.
rmhowe425 Jan 20, 2024
0359039
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Jan 20, 2024
f0dcd55
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Feb 6, 2024
f626cfb
Moving contribution note to 3.0
rmhowe425 Feb 8, 2024
1f50035
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2024
a3014de
Update accessor.py
rmhowe425 Feb 10, 2024
95947a5
Merge branch 'pandas-dev:main' into dev/Series/str-replace
rmhowe425 Feb 10, 2024
470a648
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Feb 10, 2024
c6f2d3a
Merge branch 'main' into dev/Series/str-replace
rmhowe425 Feb 10, 2024
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -223,6 +223,7 @@ Other enhancements
- :func:`tseries.api.guess_datetime_format` is now part of the public API (:issue:`54727`)
- :meth:`ExtensionArray._explode` interface method added to allow extension type implementations of the ``explode`` method (:issue:`54833`)
- :meth:`ExtensionArray.duplicated` added to allow extension type implementations of the ``duplicated`` method (:issue:`55255`)
- Allow dictionaries to be passed to :meth:`pandas.Series.str.replace` via ``pat`` parameter (:issue:`51748`)
- Allow passing ``read_only``, ``data_only`` and ``keep_links`` arguments to openpyxl using ``engine_kwargs`` of :func:`read_excel` (:issue:`55027`)
- DataFrame.apply now allows the usage of numba (via ``engine="numba"``) to JIT compile the passed function, allowing for potential speedups (:issue:`54666`)
- Implement masked algorithms for :meth:`Series.value_counts` (:issue:`54984`)
43 changes: 35 additions & 8 deletions pandas/core/strings/accessor.py
@@ -1395,8 +1395,8 @@ def fullmatch(self, pat, case: bool = True, flags: int = 0, na=None):
@forbid_nonstring_types(["bytes"])
def replace(
self,
-        pat: str | re.Pattern,
-        repl: str | Callable,
+        pat: str | re.Pattern | dict | None = None,
+        repl: str | Callable | None = None,
Member

IIUC if a dict is passed then repl is not needed. When would pat be None and why is it the default?

n: int = -1,
case: bool | None = None,
flags: int = 0,
@@ -1410,11 +1410,14 @@ def replace(

Parameters
----------
-        pat : str or compiled regex
+        pat : str, compiled regex, or a dict
String can be a character sequence or regular expression.
Dictionary contains <key : value> pairs of strings to be replaced
along with the updated value.
repl : str or callable
Replacement string or a callable. The callable is passed the regex
match object and must return a replacement string to be used.
Must have a value of None if `pat` is a dict
See :func:`re.sub`.
n : int, default -1 (all)
Number of replacements to make from start.
@@ -1448,6 +1451,7 @@ def replace(
* if `regex` is False and `repl` is a callable or `pat` is a compiled
regex
* if `pat` is a compiled regex and `case` or `flags` is set
* if `pat` is a dictionary and `repl` is not None.

Notes
-----
@@ -1457,6 +1461,15 @@ def replace(

Examples
--------
When `pat` is a dictionary, every key in `pat` is replaced
with its corresponding value:

>>> pd.Series(['A', 'B', np.nan]).str.replace(pat={'A': 'a', 'B': 'b'})
0 a
1 b
2 NaN
dtype: object

When `pat` is a string and `regex` is True, the given `pat`
is compiled as a regex. When `repl` is a string, it replaces matching
regex patterns as with :meth:`re.sub`. NaN value(s) in the Series are
@@ -1519,8 +1532,11 @@ def replace(
2 NaN
dtype: object
"""
if isinstance(pat, dict) and repl is not None:
raise ValueError("repl cannot be used when pat is a dictionary")

# Check whether repl is valid (GH 13438, GH 15055)
-        if not (isinstance(repl, str) or callable(repl)):
+        if not isinstance(pat, dict) and not (isinstance(repl, str) or callable(repl)):
raise TypeError("repl must be a string or callable")

is_compiled_re = is_re(pat)
@@ -1540,10 +1556,21 @@ def replace(
if case is None:
case = True

-        result = self._data.array._str_replace(
-            pat, repl, n=n, case=case, flags=flags, regex=regex
-        )
-        return self._wrap_result(result)
+        if isinstance(pat, dict):
+            res_output = self._data
+            for key, value in pat.items():
+                result = res_output.array._str_replace(
+                    key, value, n=n, case=case, flags=flags, regex=regex
+                )
+                res_output = self._wrap_result(result)
+
+        else:
+            result = self._data.array._str_replace(
+                pat, repl, n=n, case=case, flags=flags, regex=regex
+            )
+            res_output = self._wrap_result(result)
Member

I believe you can call _wrap_result just once at the end, rather than inside the for loop.

Contributor Author

@rhshadrach wouldn't you need the for loop in the case that pat contained multiple key : value pairs of strings to be replaced?

Member

Correct - I'm not suggesting to remove the for loop entirely. Just to call self._wrap_result once after the for loop is done rather than every iteration. If you think this is incorrect, let me know and I can take a closer look.

Contributor Author

So after doing a bit of debugging it looks like we need to call _wrap_result after each iteration so we can save the output of our string replace and update res_output.

self._data is a Series and _str_replace() returns an NDArray. Since we can't update self._data.array, we need a container to save the output of our string replace, so we're converting it to a Series using _wrap_result() and then updating our container.

Member

I see - thanks for checking!
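
The loop discussed in this thread feeds the output of each replacement into the next one. A minimal pure-Python sketch of that behavior follows; `replace_sequentially` is a hypothetical helper operating on a plain list, not pandas code, and it only illustrates the ordering semantics of the loop in the diff:

```python
import re


def replace_sequentially(values, mapping, regex=False):
    """Apply each key -> value replacement in dict order.

    Hypothetical helper: each pass operates on the output of the
    previous pass, mirroring how the loop in the diff reassigns
    res_output on every iteration. None stands in for missing values.
    """
    result = list(values)
    for pattern, replacement in mapping.items():
        if regex:
            result = [
                v if v is None else re.sub(pattern, replacement, v)
                for v in result
            ]
        else:
            result = [
                v if v is None else v.replace(pattern, replacement)
                for v in result
            ]
    return result


print(replace_sequentially(["A", "B", None], {"A": "a", "B": "b"}))
# ['a', 'b', None]

# Replacements chain: "A" becomes "B", which the second pair turns into "c".
print(replace_sequentially(["A"], {"A": "B", "B": "c"}))
# ['c']
```

One consequence of this design is that the order of the dictionary entries matters whenever a replacement value contains another key.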

Member

Rather than doing the loop here, is there any immediate advantage to passing the dict on to the array's _str_replace method, such as avoiding the _wrap_result call?

IIUC the accessors should only be validating the passed parameters, defining the "pandas string API", providing the documentation and wrapping the array result into a Series.

IMO the implementation should be at array level and then can be overridden if the array types can be optimized or use native methods.

For example, maybe using "._str_map" could be faster for object type and maybe pyarrow.compute.replace_substring_regex could be used for arrow backed strings?

The array level optimizations need not be in this PR though.

Member

> The array level optimizations need not be in this PR though.

I think this is a good idea, but agreed it need not be here (this is perf neutral compared to the status quo). If not tackled here, we can throw up an issue noting the performance improvement. @rmhowe425 - thoughts?

Contributor Author

Agreed! I think this idea is deserving of a separate issue. Happy to work that issue as well!
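
As a rough illustration of what a later optimization could look like, the sketch below replaces all keys in a single pass using one combined regular expression. This is an assumption for illustration only: it is not what this PR implements and not an approach the reviewers specified. `replace_single_pass` is a hypothetical name, it treats keys as literal strings, and its results differ from the sequential loop because replacements do not chain:

```python
import re


def replace_single_pass(values, mapping):
    """Replace every literal key in one scan per string.

    Hypothetical helper, not pandas code. Keys are escaped, so they are
    matched literally rather than as regular expressions.
    """
    if not mapping:
        return list(values)
    # Try longer keys first so a short key cannot shadow a longer one
    # that starts with the same characters.
    keys = sorted(mapping, key=len, reverse=True)
    combined = re.compile("|".join(re.escape(key) for key in keys))
    return [
        v if v is None else combined.sub(lambda m: mapping[m.group(0)], v)
        for v in values
    ]


print(replace_single_pass(["A", "B", None], {"A": "a", "B": "b"}))
# ['a', 'b', None]

# Unlike the sequential loop, the "B" produced from "A" is not replaced again.
print(replace_single_pass(["A"], {"A": "B", "B": "c"}))
# ['B']
```

Because the two strategies can disagree when a replacement value contains another key, any array-level implementation would need to settle which semantics the dictionary form guarantees.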


return res_output

@forbid_nonstring_types(["bytes"])
def repeat(self, repeats):
15 changes: 15 additions & 0 deletions pandas/tests/strings/test_find_replace.py
@@ -355,6 +355,21 @@ def test_endswith_nullable_string_dtype(nullable_string_dtype, na):
# --------------------------------------------------------------------------------------
# str.replace
# --------------------------------------------------------------------------------------
def test_replace_dict_invalid(any_string_dtype):
# GH 51914
series = Series(data=["A", "B_junk", "C_gunk"], name="my_messy_col")
msg = "repl cannot be used when pat is a dictionary"

with pytest.raises(ValueError, match=msg):
series.str.replace(pat={"A": "a", "B": "b"}, repl="A")


def test_replace_dict(any_string_dtype):
# GH 51914
series = Series(data=["A", "B", "C"], name="my_messy_col")
new_series = series.str.replace(pat={"A": "a", "B": "b"})
expected = Series(data=["a", "b", "C"], name="my_messy_col")
tm.assert_series_equal(new_series, expected)
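
The two argument checks added to `replace` in this diff can be exercised in isolation. The sketch below copies the conditions and error messages from the diff into a standalone function; `validate_replace_args` is a hypothetical name used only for illustration and does not exist in pandas:

```python
def validate_replace_args(pat, repl):
    """Standalone copy of the two checks from the diff (hypothetical helper)."""
    # A dict already carries its own replacement values, so repl must be None.
    if isinstance(pat, dict) and repl is not None:
        raise ValueError("repl cannot be used when pat is a dictionary")

    # For any non-dict pat, repl is still required to be a str or a callable.
    if not isinstance(pat, dict) and not (isinstance(repl, str) or callable(repl)):
        raise TypeError("repl must be a string or callable")


# Accepted combinations return without raising.
validate_replace_args({"A": "a", "B": "b"}, None)
validate_replace_args("A", "a")
validate_replace_args("A", lambda match: match.group(0).lower())

# Rejected combinations.
for pat, repl in [({"A": "a"}, "A"), ("A", None)]:
    try:
        validate_replace_args(pat, repl)
    except (ValueError, TypeError) as err:
        print(type(err).__name__, "-", err)
```

Note that with the new `repl=None` default, calling `replace` with a string `pat` and no `repl` reaches the second check and raises `TypeError`.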


def test_replace(any_string_dtype):