Sanitise regular expressions in lexer #78

ageorgou · 2019-05-30T14:36:31Z

Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a DeprecationWarning is raised from version 3.6 onwards).

These are:

t_ID
t_transctrl_ID
terminates_para

The simplest way would be to double escape (\\.) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019), which may or may not work within raw strings (I haven't tested it).
In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention for each regex was (and double check that it actually matches the intended thing correctly now...).

The text was updated successfully, but these errors were encountered:

rillian · 2019-05-30T15:25:03Z

Aha, I think I was getting confused among the different Python versions.

It seems the \u escapes wouldn't actually be a problem in python 3, where the re module will expand them, so including them in a raw string is fine. Unfortunately this won't work for python 2. One possibility is to use the external regex package on python 2:

>>> import re, regex

>>> re.search('\u0041', 'ABC')
<no match>

>>> regex.search('\u0041', 'ABC')
<regex.Match object; span=(0, 1), match='A'>

I did think there were a suspicious number of backslash escapes in some of the patterns. Glad it's not just me!

ageorgou · 2019-05-30T15:34:23Z

Unfortunately, the ply library which pyoracc uses works on top of the standard library re module, and I'm not sure whether we can subsitute regex for that. It's worth keeping in mind though!

rillian · 2019-06-10T19:02:09Z

I believe it's possible to hook module loading to substitute a compatible dependency. That's a bit fancy though, but might me acceptable since it could be specific to python2, which is near the end of its support window.

However, it turns out Python will concatenate r'' and u'' strings without complaint, so I think a better approach to the invalid escape sequence warnings is to just split the unicode portions of the pattern strings into a separate literal and use raw strings for the re sequences. I have a patch for that.

rillian · 2019-06-10T19:22:03Z

I also went through some of the other escape sequences. Marking the strings as raw silenced the warnings, but they're still unnecessary. In particular, only ^$.*+? and brackets need escaping in normal patterns; other punctuation is fine. Inside the [] only ] and - need escaping.

@object

The #note environment can span multiple lines, but is supposed to be terminated by a new line label. There was a test for termination by an @object, but not for this. Check all the primed variants for line numbers to prevent regression of an issue I introduced addressing issue oracc#78.

ageorgou mentioned this issue May 30, 2019

More test and warning fixes #77

Merged

rillian mentioned this issue Jun 10, 2019

Fix remaining travis warnings #81

Merged

rillian mentioned this issue Jun 21, 2019

Consolidate line label parsing #83

Open

ageorgou closed this as completed in #81 Jun 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitise regular expressions in lexer #78

Sanitise regular expressions in lexer #78

ageorgou commented May 30, 2019

rillian commented May 30, 2019

ageorgou commented May 30, 2019

rillian commented Jun 10, 2019 •

edited

rillian commented Jun 10, 2019

Sanitise regular expressions in lexer #78

Sanitise regular expressions in lexer #78

Comments

ageorgou commented May 30, 2019

rillian commented May 30, 2019

ageorgou commented May 30, 2019

rillian commented Jun 10, 2019 • edited

rillian commented Jun 10, 2019

rillian commented Jun 10, 2019 •

edited