Add support for "regex" library #22496

Open
pmav99 opened this issue Aug 24, 2018 · 21 comments
Labels
Enhancement; Strings (String extension data type and string data)

Comments

@pmav99

pmav99 commented Aug 24, 2018

Code Sample, a copy-pastable example if possible

import re
import pandas as pd
import regex

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "1", "2"]})
pattern = r"\d"

df.b.str.match(pattern)
df.b.str.match(re.compile(pattern))
df.b.str.match(regex.compile(pattern))     # raises TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-eec2b9ae9613> in <module>()
      9 df.b.str.match(pattern)
     10 df.b.str.match(re.compile(pattern))
---> 11 df.b.str.match(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in match(self, pat, case, flags, na, as_indexer)
   2421     def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
   2422         result = str_match(self._data, pat, case=case, flags=flags, na=na,
-> 2423                            as_indexer=as_indexer)
   2424         return self._wrap_result(result)
   2425 

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in str_match(arr, pat, case, flags, na, as_indexer)
    736         flags |= re.IGNORECASE
    737 
--> 738     regex = re.compile(pat, flags=flags)
    739 
    740     if (as_indexer is False) and (regex.groups > 0):

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern

A simpler way to demonstrate the problem is:

re.compile(regex.compile(pattern))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-38578ab20aeb> in <module>()
----> 1 re.compile(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern

Problem description

The regex library does not seem to be supported by pandas. I'm not sure whether you want to add support for it, but I had a quick look and it seems relatively straightforward to do (and it would make maintenance easier for projects that have already opted for regex).

How to fix

So, I think that the steps that seem to be required are:

  1. pandas.core.dtypes.inference.is_re should return True for regex compiled patterns too (assuming that regex is installed of course).
  2. Make sure to call is_re before re.compile() (as is already being done e.g. here):
if not is_re(pat):
    pat = re.compile(pat, flags)
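
The two steps above could be sketched roughly as follows. This is a minimal standalone illustration, not pandas' actual implementation: `is_re_like` and `ensure_compiled` are hypothetical names, and the optional import assumes the third-party regex package may or may not be installed.

```python
import re

# Collect the compiled-pattern types we know about.
# type(re.compile("")) is the standard library's pattern class.
_pattern_types = [type(re.compile(""))]

try:
    import regex  # third-party package; optional
except ImportError:
    regex = None
else:
    # Same trick for the regex package's compiled-pattern class.
    _pattern_types.append(type(regex.compile("")))

_PATTERN_TYPES = tuple(_pattern_types)


def is_re_like(obj):
    """Hypothetical variant of is_re: True for re patterns and, if installed, regex patterns."""
    return isinstance(obj, _PATTERN_TYPES)


def ensure_compiled(pat, flags=0):
    """Compile only when `pat` is not already a compiled pattern (step 2 above)."""
    if is_re_like(pat):
        return pat
    return re.compile(pat, flags)
```

With a guard like this, an already compiled pattern is passed through untouched instead of being handed to re.compile(), which is the call that raises the TypeError shown above.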

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.17.5-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.7.1
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@TomAugspurger
Contributor

TomAugspurger commented Aug 24, 2018

Are patterns compiled by regex instances of typing.re.Pattern?

@pmav99
Author

pmav99 commented Aug 24, 2018

No they are not.

import re
import typing
import regex

re_pat = re.compile(r"\d")
regex_pat = regex.compile(r"\d")

re_pat.__class__.mro()                # [_sre.SRE_Pattern, object]
isinstance(re_pat, typing.Pattern)    # True

regex_pat.__class__.mro()                # [_regex.Pattern, object]
isinstance(regex_pat, typing.Pattern)    # False

@madimov

madimov commented Jul 26, 2019

Hi @pmav99, any luck with this? Or did you happen to create a workaround for yourself?

@pmav99
Author

pmav99 commented Jul 26, 2019

@madimov, I think I used vanilla re for pandas and regex for everything else. Not nice, but there was no feedback and I needed to move on.
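
A different possible workaround, not the one described in this comment, is to bypass the .str accessor and apply the compiled pattern element by element. The sketch below is an assumption about how that might look; `make_matcher` is a hypothetical helper, and the standard library's re is used only so the example runs without extra packages (any object with a compatible match method should work).

```python
import re


def make_matcher(compiled, na=None):
    """Return a function that applies `compiled.match` to a single value.

    Non-string values (e.g. missing data) yield `na` instead of raising.
    """
    def _match(value):
        if not isinstance(value, str):
            return na
        return compiled.match(value) is not None
    return _match


# Plain-Python demonstration; a regex.compile(...) pattern could be used instead.
matcher = make_matcher(re.compile(r"\d"))
results = [matcher(v) for v in ["a", "1", "2", None]]
```

With pandas, the same function could be passed to Series.map, for example df.b.map(make_matcher(regex.compile(pattern))), since pandas then never tries to re-compile the pattern itself.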

@TomAugspurger
Contributor

TomAugspurger commented Jul 26, 2019 via email

@madimov

madimov commented Jul 29, 2019

@TomAugspurger that would be great! Might you have the time to give it a go?

@TomAugspurger
Contributor

TomAugspurger commented Jul 29, 2019 via email

@gwerbin

gwerbin commented Sep 23, 2019

@TomAugspurger

Are patterns compiled by regex instances of typing.re.Pattern?

If the answer to this is "no", then that's an upstream bug IMO.

@gwerbin

gwerbin commented Sep 23, 2019

That said, here is my attempt at a fix: master...gwerbin:patch-2

Just made the edits here on Github, so haven't actually run any tests yet.

@jbrockmendel
Member

Might you have the time to give it a go?

Tom didn't have time, but PRs are welcome.

mroeschke added the Strings label Dec 25, 2019
@gwerbin

gwerbin commented Apr 17, 2020

@jbrockmendel did you take a look at my proposed patch? It will probably need a major rebase obviously. Just want to make sure what I did is an acceptable approach before I put more time into it.

@TomAugspurger
Contributor

That looks roughly correct. You'll need to update some of the CI envs in ci/deps to include regex and skip the test if it isn't present.

@jbrockmendel
Member

@gwerbin thanks for pinging on this. Yah, that looks a lot less invasive than I expected, seems reasonable.

@lucazav

lucazav commented Mar 17, 2021

Hi guys, any update on this? Using regex module in Pandas would be really useful for a lot of scenarios.

@jreback
Contributor

jreback commented Mar 17, 2021

@lucazav you or anyone in the community can submit a PR. all folks working in pandas are volunteers

@lucazav

lucazav commented Mar 18, 2021

@jreback I'm not an experienced Pythonista. But I see that someone else has already proposed an easy solution just a few comments above, so I assumed it would be just as easy for you experts to submit a PR containing that code.

@jreback
Contributor

jreback commented Mar 18, 2021

@lucazav and someone needs to make an actual pull request with testing and documentation

core devs can provide code review

@gwerbin

gwerbin commented Mar 18, 2021

I totally forgot about this.

I am willing to take the lead on this, going through the effort to update the docs, run the test suite, etc.

However, I think my patch is a hack around the fact that regex objects are not instances of typing.Pattern. I can think of two solutions that are better than the one I originally proposed:

  1. Use a runtime-checkable typing.Protocol that covers the relevant methods and attributes used within Pandas.
  2. Implement a Pattern type that, unlike the current typing.Pattern, is not an alias to re.Pattern, but is its own class with a __subclasshook__ implementation, much like the classes in collections.abc. I think this is generally an improvement over the existing typing.Pattern that can (and should) be contributed back to the Python community as a PEP.
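
Option (1) might look roughly like the sketch below. `PatternLike` is a hypothetical name, and the set of methods listed is an assumption about what pandas would need, not a survey of its code. Note that a runtime-checkable protocol only verifies that the members exist, not their signatures.

```python
import re
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PatternLike(Protocol):
    """Structural type for compiled patterns from any regex library (hypothetical)."""

    def match(self, string: str, *args: Any, **kwargs: Any) -> Any: ...
    def search(self, string: str, *args: Any, **kwargs: Any) -> Any: ...
    def fullmatch(self, string: str, *args: Any, **kwargs: Any) -> Any: ...
    def findall(self, string: str, *args: Any, **kwargs: Any) -> Any: ...
    def sub(self, repl: Any, string: str, *args: Any, **kwargs: Any) -> Any: ...


def is_pattern_like(obj: object) -> bool:
    # isinstance() against a runtime-checkable Protocol checks for the
    # presence of every method declared above.
    return isinstance(obj, PatternLike)
```

Any object exposing these five methods would pass the check, regardless of which library produced it; a plain str does not, because it has no match or search method.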

The reason I believe a generic solution is better than a regex-specific solution is that there are yet other regex libraries that someone might want to use (e.g. RE2).

I am willing to start work on (1), free time permitting, and possibly even (2). But I'd like some feedback on this idea from the Pandas dev community before I commit a bunch of time for it.
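
For comparison, option (2) could be sketched in the style of collections.abc as follows. `AnyPattern` and its list of required methods are hypothetical and chosen only for illustration; nothing like this exists in the standard library today.

```python
import re
from abc import ABC


class AnyPattern(ABC):
    """Hypothetical ABC that recognises compiled patterns structurally."""

    # Assumed minimal interface; the real list would need discussion.
    _required = ("match", "search", "fullmatch", "findall", "sub")

    @classmethod
    def __subclasshook__(cls, C):
        if cls is AnyPattern:
            # Accept C if every required name is defined somewhere in its MRO.
            if all(
                any(name in B.__dict__ for B in C.__mro__)
                for name in cls._required
            ):
                return True
        return NotImplemented


compiled = re.compile(r"\d")
```

Here isinstance(compiled, AnyPattern) is True without re.Pattern being registered or modified, and the same would hold for any other library's pattern class that defines the required methods.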

@alegend4u

The reason I believe a generic solution is better than a regex-specific solution is that there are yet other regex libraries that someone might want to use (e.g. RE2).

@gwerbin Above is so true. I wish I could use pythonnet's regex engine because I have a few modules in C# and a few in Python and I want to use a single regex engine for both.

Basically, we need to be able to switch the internal regex engine used for pandas' string methods.

@rootsmusic

rootsmusic commented Jan 14, 2024

we need to be able to switch the internal regex engine used for pandas' string methods.

Like pcre2. See a comparison of language features for regular expression engines.

@rootsmusic

rootsmusic commented Jan 14, 2024

Known differences between google-RE2 and re:

various PCRE features (e.g. backreferences, look-around assertions) are not supported. See the canonical reference, but known syntactic "gotchas" relative to Python are:

  • PCRE supports \Z and \z; RE2 supports \z; Python supports \z, but calls it \Z. You must rewrite \Z to \z in pattern strings.
  • The error class does not provide any error information as attributes.
  • The Options class replaces the re module's flags with RE2's options as gettable/settable properties. Please see re2.h for their documentation.
  • The pattern string and the input string do not have to be the same type. Any str will be encoded to UTF-8.
  • The pattern string cannot be str if the options specify Latin-1 encoding.
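
The \Z to \z rewrite mentioned in the first bullet could be automated along these lines. This is a minimal sketch: `rewrite_end_anchor` is a hypothetical helper, it only handles this one syntactic difference, and it assumes the pattern is otherwise valid for both engines.

```python
import re


def rewrite_end_anchor(pattern: str) -> str:
    r"""Rewrite Python's \Z anchor to \z, leaving other escapes untouched.

    Escape sequences are consumed as backslash + one character, so an
    escaped backslash followed by a literal Z (written \\Z) is not rewritten.
    """
    def _fix(m):
        return r"\z" if m.group(1) == "Z" else m.group(0)

    # A function replacement is inserted literally, with no escape processing.
    return re.sub(r"\\(.)", _fix, pattern, flags=re.DOTALL)
```

For example, rewrite_end_anchor(r"\d+\Z") gives the pattern \d+\z, while r"\\Z" (a literal backslash followed by Z) is returned unchanged.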
