BUG: str.extract Method Not Implemented for pd.ArrowDtype(pa.string()) #56268

Closed
2 of 3 tasks
mattharrison opened this issue Nov 30, 2023 · 6 comments · Fixed by #56334
Labels
Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments

@mattharrison

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

data = '''make,city08
Alfa Romeo,19
Ferrari,9
Dodge,23
Dodge,10
Subaru,17'''

df = pd.read_csv(io.StringIO(data), dtype_backend='pyarrow')
df.make.str.extract(r'([^a-z A-Z])')

Issue Description

The str.extract method raises a NotImplementedError when used on a series with the pd.ArrowDtype(pa.string()) data type.

Expected Behavior

str.extract should execute as it does with pandas 1 strings.

Installed Versions

the str.extract method raises a NotImplementedError when used on a series with the pd.ArrowDtype(pa.string()) data type.

mattharrison added the Bug and Needs Triage labels Nov 30, 2023
@phofl
Member

phofl commented Dec 1, 2023

cc @mroeschke what's the status quo here?

phofl added the Strings and Arrow labels and removed the Needs Triage label Dec 1, 2023
@mroeschke
Member

Yeah, at the time I never got around to implementing this one (it was tricky since it can return a Series or DataFrame), but it should be available from the pyarrow side with extract_regex: https://arrow.apache.org/docs/python/generated/pyarrow.compute.extract_regex.html#pyarrow.compute.extract_regex

@mattharrison
Author

In 2.2rc0 I get this error:

>>> print(make.str.extract(r'([^a-z A-Z])'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[114], line 1
----> 1 print(make.str.extract(r'([^a-z A-Z])'))

File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/strings/accessor.py:2758, in StringMethods.extract(self, pat, flags, expand)
   2755     result = DataFrame(columns=columns, dtype=result_dtype)
   2757 else:
-> 2758     result_list = self._data.array._str_extract(
   2759         pat, flags=flags, expand=returns_df
   2760     )
   2762     result_index: Index | None
   2763     if isinstance(obj, ABCSeries):

File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:2399, in ArrowExtensionArray._str_extract(self, pat, flags, expand)
   2397 groups = re.compile(pat).groupindex.keys()
   2398 if len(groups) == 0:
-> 2399     raise ValueError(f"{pat=} must contain a symbolic group name.")
   2400 result = pc.extract_regex(self._pa_array, pat)
   2401 if expand:

ValueError: pat='([^a-z A-Z])' must contain a symbolic group name.

I got around it with this:

>>> print(make.str.extract(r'(?P<non_alpha>[^a-z A-Z])'))

I think this is a regression, as I now have to include a regex group name (which is one of those things I have to look up every time I need to do it).
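
The check shown in the traceback (re.compile(pat).groupindex being empty) can be reproduced with the standard library alone. The sketch below is only an illustration of why the first pattern is rejected and the workaround is accepted; the helper name `has_named_group` is hypothetical and is not part of pandas.

```python
import re

def has_named_group(pat: str) -> bool:
    """Return True if the pattern defines at least one named capture group.

    Hypothetical helper mirroring the groupindex check in the traceback.
    """
    return len(re.compile(pat).groupindex) > 0

unnamed = r'([^a-z A-Z])'             # pattern from the original report
named = r'(?P<non_alpha>[^a-z A-Z])'  # workaround pattern

print(has_named_group(unnamed))            # False
print(has_named_group(named))              # True
print(list(re.compile(named).groupindex))  # ['non_alpha']
```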

@phofl
Member

phofl commented Dec 27, 2023

Not necessarily a regression since it didn't work before at all? Am I missing something?

cc @mroeschke is this expected with your implementation?

@mattharrison
Author

mattharrison commented Dec 27, 2023

@phofl I mean it worked in Pandas 1. You are correct, it didn't work in 2.

@mroeschke
Member

Yeah unfortunately this is a limitation with pyarrow's extract_regex compute method: https://arrow.apache.org/docs/python/generated/pyarrow.compute.extract_regex.html#pyarrow.compute.extract_regex

This probably worked in pandas 1 with the object-backed string type as that iterates through the values using re.compile directly.
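
As a rough illustration of that per-value approach, here is a minimal sketch that loops over Python strings and applies a compiled regex to each one. It is not the pandas implementation; the function name `extract_first_group` and the sample value 'F-150' are assumptions added for the example. Because Python's re module numbers unnamed groups, no group name is needed.

```python
import re

def extract_first_group(values, pat, flags=0):
    """Return the first capture group of the first match per value, else None.

    Hypothetical helper: a simplified stand-in for iterating over
    object-backed strings with re.compile.
    """
    regex = re.compile(pat, flags)
    out = []
    for value in values:
        match = regex.search(value) if value is not None else None
        out.append(match.group(1) if match else None)
    return out

# The "make" values from the reproducible example contain only letters
# and spaces, so the reported pattern finds nothing in them.
makes = ['Alfa Romeo', 'Ferrari', 'Dodge', 'Dodge', 'Subaru']
print(extract_first_group(makes, r'([^a-z A-Z])'))
# [None, None, None, None, None]

# 'F-150' is a made-up value used only to show a successful match.
print(extract_first_group(['F-150', None], r'([^a-z A-Z])'))
# ['-', None]
```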
