Sanitise regular expressions in lexer #78
Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a `DeprecationWarning` is raised from version 3.6 onwards). These are:

- `t_ID`
- `t_transctrl_ID`
- `terminates_para`

The simplest way would be to double-escape (`\\.`) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (`\u2019`), which may or may not work within raw strings (I haven't tested it). In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention for each regex was (and double-check that it actually matches the intended thing correctly now...).
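For illustration, a minimal sketch of the warning and the suggested double-escaping fix; the rule name and pattern below are hypothetical, not the actual lexer rules:

```python
import re

# '\.' in a plain (non-raw) string literal is an invalid *string* escape:
# Python 3.6+ emits a DeprecationWarning when the file is compiled, and a
# future version will turn it into a syntax error. The regex only works
# today because the unknown escape happens to pass through unchanged.
t_EXAMPLE = '\.'          # DeprecationWarning: invalid escape sequence '\.'

# Double-escaping keeps the string literal valid: '\\.' is the two
# characters backslash and dot, which the regex engine reads as \. .
t_EXAMPLE_FIXED = '\\.'

print(re.search(t_EXAMPLE_FIXED, 'a.b'))  # <re.Match object; span=(1, 2), match='.'>
```

One caveat on the raw-string question: under Python 3.3+ the `re` module itself recognises `\uXXXX` inside a pattern, so `r'\u2019'` would still match there; under Python 2's `re` it would not, which may be where the doubt comes from.

Comments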
Aha, I think I was getting confused among the different Python versions. It seems the two modules behave differently:

```
>>> import re, regex
>>> re.search('\u0041', 'ABC')
<no match>
>>> regex.search('\u0041', 'ABC')
<regex.Match object; span=(0, 1), match='A'>
```

I did think there were a suspicious number of backslash escapes in some of the patterns. Glad it's not just me!
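For context, a small sketch of what is presumably going on, assuming the transcript above was run under Python 2 (where the byte-string literal keeps a literal backslash that only `regex` decodes as `\uXXXX`): under Python 3 the literal itself decodes to 'A' before either module sees the pattern, and since 3.3 `re` also understands `\uXXXX` in raw patterns.

```python
import re

# Python 3: the string literal '\u0041' is decoded to 'A' before the
# regex engine ever sees it, so plain re matches here.
print('\u0041' == 'A')               # True
print(re.search('\u0041', 'ABC'))    # <re.Match object; span=(0, 1), match='A'>

# With a raw string the backslash survives into the pattern; since
# Python 3.3, re itself recognises \uXXXX escapes there too.
print(re.search(r'\u0041', 'ABC'))   # <re.Match object; span=(0, 1), match='A'>
```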
Unfortunately, the …
I believe it's possible to hook module loading to substitute a compatible dependency. That's a bit fancy, but it might be acceptable since it could be specific to Python 2, which is near the end of its support window. However, it turns out Python will concatenate adjacent string literals, so …
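Presumably the point being made is adjacent-string-literal concatenation; a minimal sketch (the pattern is made up for illustration, not one of the lexer rules):

```python
import re

# Adjacent string literals are joined at compile time, so one pattern can
# mix a raw part (backslashes kept for the regex engine) with an ordinary
# part (so '\u2019' is decoded by Python itself, sidestepping the
# raw-string question entirely).
APOSTROPHE_WORD = r'\w+' '\u2019' r'\w+'

print(re.search(APOSTROPHE_WORD, 'don\u2019t'))  # matches 'don’t'
```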
I also went through some of the other escape sequences. Marking the strings as raw silenced the warnings, but they're still unnecessary. In particular, only …
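As an illustration of that kind of redundant escape (a made-up example, not one of the actual patterns): escaping a character that is not a regex metacharacter, such as a comma, is a no-op for the engine, and writing it in a non-raw literal is exactly what triggers the invalid-escape warning.

```python
import re

pattern_redundant = r'\,'   # no warning in a raw string, but the escape is a no-op
pattern_plain = ','         # ',' is not special in a regex

print(re.search(pattern_redundant, 'a,b'))  # <re.Match object; span=(1, 2), match=','>
print(re.search(pattern_plain, 'a,b'))      # identical match
```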