Use regex in addition to re? #590

julienmalard · 2020-05-14T14:44:59Z

Suggestion
While using Lark I came across an already-documented bug in the builtin re module, whereby abugidas with vowel marks (that is, most of South Asia and Oceania) fail to match the \w directive (e.g., as in re.match("^[\w\s][\w\s]*", "किशोरी", re.UNICODE), as taken from this question here).

It seems that the recommended "solution" is to simply used the regex module instead of re. However, this would mean adding a dependency to Lark, so I thought I would ask whether you would look favourably upon such a suggestion before submitting a pull request. I was thinking of proposing something along the lines of:

try:
    import regex as re
except ImportError:
    import re

so that regex is not required and nothing breaks if the dependency is not installed.

Describe alternatives you've considered

An alternative, which I am currently using as a workaround, is to generate the toxens manually. For instance, for a Python parser, I can use:

ID_START = ''.join([
    c for c in map(chr, range(sys.maxunicode))
    if category(c) in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']]) + '_'

ID_CONTINUE = ID_START + ''.join([
    c for c in map(chr, range(sys.maxunicode))
    if category(c) in ['Mn', 'Mc', 'Nd']]) + '·'

and then define my NAME token as ID_START ID_CONTINUE*. However, I think that it would be much nicer to be able to use \w according to Unicode standard. A side benefit of using regex is that projects using Lark and specifying regex as a dependency could also then explicitly use Unicode categories in their regexes as well.

Please do let me know!

The text was updated successfully, but these errors were encountered:

erezsh · 2020-05-14T15:01:43Z

Hi,

I'm certainly open to supporting the regex library, if it's useful.

Is it fully-compatible with re, or are there differences beyond bugfixes?

How does it compare in terms of performance?

I think it might be nice to support something like

pip install lark-parser[regex]

Not sure about making it the default though.

julienmalard · 2020-05-14T15:41:48Z

Hello,
Thank you! I am not sure about performance; if you have a few recommended benchmarks for common regexes in grammars I could check quite quickly though.
I am quite sure that it is fully compatible. Apart from bugfixes, it also allows for unicode categories (see comments here).

If you like, I could make a pull request and we could then check performance, etc. from there?
Thanks!

erezsh · 2020-05-14T15:55:02Z

Sure, that sounds like a good idea!

julienmalard · 2020-05-15T14:13:22Z

Thanks! Done in #593.

julienmalard · 2020-07-04T19:05:29Z

Closed by #593.

This fixes python-poetry#221 by exchanging the regex package for re (the former having fixed the long-standing bug in re (see [here](lark-parser/lark#590) for details). Not sure if this will work for poetry (i.e., whether adding regex as a dependency is acceptable)...but thought I would propose this change and wait for feedback to take it from there. Thanks!

julienmalard added the enhancement label May 14, 2020

julienmalard mentioned this issue May 15, 2020

Use regex instead of re? erikrose/parsimonious#162

Closed

julienmalard mentioned this issue May 15, 2020

Added regex module option. #593

Merged

julienmalard closed this as completed Jul 4, 2020

julienmalard mentioned this issue Nov 8, 2020

poetry develop failing on non-ASCII characters python-poetry/poetry#221

Closed

julienmalard mentioned this issue Feb 11, 2021

Fix unicode error python-poetry/poetry-core#136

Closed

2 tasks

julienmalard mentioned this issue Aug 22, 2022

Ensure that authors and maintainers are normalized. python-poetry/poetry-core#276

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use regex in addition to re? #590

Use regex in addition to re? #590

julienmalard commented May 14, 2020

erezsh commented May 14, 2020

julienmalard commented May 14, 2020

erezsh commented May 14, 2020

julienmalard commented May 15, 2020

julienmalard commented Jul 4, 2020

Use regex in addition to re? #590

Use regex in addition to re? #590

Comments

julienmalard commented May 14, 2020

erezsh commented May 14, 2020

julienmalard commented May 14, 2020

erezsh commented May 14, 2020

julienmalard commented May 15, 2020

julienmalard commented Jul 4, 2020