Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitise regular expressions in lexer #78

Closed
ageorgou opened this issue May 30, 2019 · 4 comments · Fixed by #81
Closed

Sanitise regular expressions in lexer #78

ageorgou opened this issue May 30, 2019 · 4 comments · Fixed by #81

Comments

@ageorgou
Copy link
Contributor

Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a DeprecationWarning is raised from version 3.6 onwards).

These are:

  • t_ID
  • t_transctrl_ID
  • terminates_para

The simplest way would be to double escape (\\.) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019), which may or may not work within raw strings (I haven't tested it).
In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention for each regex was (and double check that it actually matches the intended thing correctly now...).

@rillian
Copy link
Contributor

rillian commented May 30, 2019

Aha, I think I was getting confused among the different Python versions.

It seems the \u escapes wouldn't actually be a problem in python 3, where the re module will expand them, so including them in a raw string is fine. Unfortunately this won't work for python 2. One possibility is to use the external regex package on python 2:

>>> import re, regex

>>> re.search('\u0041', 'ABC')
<no match>

>>> regex.search('\u0041', 'ABC')
<regex.Match object; span=(0, 1), match='A'>

I did think there were a suspicious number of backslash escapes in some of the patterns. Glad it's not just me!

@ageorgou
Copy link
Contributor Author

Unfortunately, the ply library which pyoracc uses works on top of the standard library re module, and I'm not sure whether we can subsitute regex for that. It's worth keeping in mind though!

@rillian
Copy link
Contributor

rillian commented Jun 10, 2019

I believe it's possible to hook module loading to substitute a compatible dependency. That's a bit fancy though, but might me acceptable since it could be specific to python2, which is near the end of its support window.

However, it turns out Python will concatenate r'' and u'' strings without complaint, so I think a better approach to the invalid escape sequence warnings is to just split the unicode portions of the pattern strings into a separate literal and use raw strings for the re sequences. I have a patch for that.

@rillian
Copy link
Contributor

rillian commented Jun 10, 2019

I also went through some of the other escape sequences. Marking the strings as raw silenced the warnings, but they're still unnecessary. In particular, only ^$.*+? and brackets need escaping in normal patterns; other punctuation is fine. Inside the [] only ] and - need escaping.

rillian added a commit to rillian/pyoracc that referenced this issue Jun 20, 2019
The #note environment can span multiple lines, but is supposed
to be terminated by a new line label. There was a test for
termination by an @object, but not for this.

Check all the primed variants for line numbers to prevent regression
of an issue I introduced addressing issue oracc#78.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants