-
Notifications
You must be signed in to change notification settings - Fork 24
Detect source encoding to properly interpret string literals #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Whoops looks like python3 is busted. Will fix. |
The tests that parse string literals now use UnicodeOnly and BytesOnly values in the expected values to make sure both values and types are identical. Making this work for 2.x and 3.x grammars and run properly under 2.x and 3.x interpreters required some hackery. PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some stylistic changes. Mostly the PR seems good; I'll look further at some of the conversions to see if they can perhaps be made cheaper, after that.
pythonparser/lexer.py
Outdated
@@ -451,24 +465,24 @@ def _replace_escape(self, range, mode, value): | |||
|
|||
# Process the escape | |||
if match.group(1) is not None: # single-char | |||
chr = match.group(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from an older version where I had a chr() function.
pythonparser/lexer.py
Outdated
@@ -418,28 +419,41 @@ def _string_literal(self, options, begin_span, data, data_span, end_span): | |||
"strend")) | |||
|
|||
def _replace_escape(self, range, mode, value): | |||
is_raw = ("r" in mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer having these variables instead of poking into mode
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Now is_unicode and is_byte are the inverse of each other so I've left out is_byte.
By the way...
it's illegal to have any actual code before the future imports, so this is immaterial. |
Thanks for merging! The module docstring would be bytes whereas it should be unicode. Certainly unlikely to cause any problems but it is a material difference. |
And only in Python 2 mode, right? That's fine by me, Python 2 support is only really included for completeness. |
Right, Python 2 only. Also fine by me. Just making sure it's documented. |
This addresses #6
Buffers now always hold unicode source, whereas before they could hold bytes if that's what were passed in to the constructor. This is possible because we determine the encoding and then use that to decode() the bytes. The side effect is that
Buffer.__init__
could raise UnicodeDecodeError if the input is badly encoded.The Buffer encoding is then used by the lexer to produce a strdata token of the correct type for string literals. For unicode literals, escaping happens much as before via _replace_escape(). For bytes, there's a different code path that calls encode() using the Buffer's encoding followed by a special escaping function that ensures the value's not accidentally promoted to unicode.
The parser behavior for multi-string literals (e.g.
"foo" "bar"
) also had to change. When any of the literals are unicode, the result is unicode. When all the literals are bytes the resulting value is also bytes.This PR also implements the unicode_literals future feature in a similar fashion to print_function. The one shortcoming of that approach is that it does not affect all string literals in the file, only the ones after the current lexer position. I think a proper implementation would require doing a lexer pass on the input to activate future flags before parsing. I didn't want to add any more complexity to this PR since it's already getting a bit big and this seems good enough for now.