Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use regex in addition to re? #590

Closed
julienmalard opened this issue May 14, 2020 · 5 comments
Closed

Use regex in addition to re? #590

julienmalard opened this issue May 14, 2020 · 5 comments

Comments

@julienmalard
Copy link
Contributor

Suggestion
While using Lark I came across an already-documented bug in the builtin re module, whereby abugidas with vowel marks (that is, most of South Asia and Oceania) fail to match the \w directive (e.g., as in re.match("^[\w\s][\w\s]*", "किशोरी", re.UNICODE), as taken from this question here).

It seems that the recommended "solution" is to simply used the regex module instead of re. However, this would mean adding a dependency to Lark, so I thought I would ask whether you would look favourably upon such a suggestion before submitting a pull request. I was thinking of proposing something along the lines of:

try:
    import regex as re
except ImportError:
    import re

so that regex is not required and nothing breaks if the dependency is not installed.

Describe alternatives you've considered

An alternative, which I am currently using as a workaround, is to generate the toxens manually. For instance, for a Python parser, I can use:

ID_START = ''.join([
    c for c in map(chr, range(sys.maxunicode))
    if category(c) in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']]) + '_'

ID_CONTINUE = ID_START + ''.join([
    c for c in map(chr, range(sys.maxunicode))
    if category(c) in ['Mn', 'Mc', 'Nd']]) + '·'

and then define my NAME token as ID_START ID_CONTINUE*. However, I think that it would be much nicer to be able to use \w according to Unicode standard. A side benefit of using regex is that projects using Lark and specifying regex as a dependency could also then explicitly use Unicode categories in their regexes as well.

Please do let me know!

@erezsh
Copy link
Member

erezsh commented May 14, 2020

Hi,

I'm certainly open to supporting the regex library, if it's useful.

Is it fully-compatible with re, or are there differences beyond bugfixes?

How does it compare in terms of performance?

I think it might be nice to support something like

pip install lark-parser[regex]

Not sure about making it the default though.

@julienmalard
Copy link
Contributor Author

Hello,
Thank you! I am not sure about performance; if you have a few recommended benchmarks for common regexes in grammars I could check quite quickly though.
I am quite sure that it is fully compatible. Apart from bugfixes, it also allows for unicode categories (see comments here).

If you like, I could make a pull request and we could then check performance, etc. from there?
Thanks!

@erezsh
Copy link
Member

erezsh commented May 14, 2020

Sure, that sounds like a good idea!

@julienmalard
Copy link
Contributor Author

Thanks! Done in #593.

@julienmalard
Copy link
Contributor Author

Closed by #593.

julienmalard added a commit to julienmalard/poetry-core that referenced this issue Feb 11, 2021
This fixes python-poetry#221 by exchanging the regex package for re (the former having fixed the long-standing bug in re (see [here](lark-parser/lark#590) for details). Not sure if this will work for poetry (i.e., whether adding regex as a dependency is acceptable)...but thought I would propose this change and wait for feedback to take it from there.
Thanks!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants