Skip to content

Conversation

trotterdylan
Copy link
Contributor

This addresses #6

Buffers now always hold unicode source, whereas before they could hold bytes if that's what were passed in to the constructor. This is possible because we determine the encoding and then use that to decode() the bytes. The side effect is that Buffer.__init__ could raise UnicodeDecodeError if the input is badly encoded.

The Buffer encoding is then used by the lexer to produce a strdata token of the correct type for string literals. For unicode literals, escaping happens much as before via _replace_escape(). For bytes, there's a different code path that calls encode() using the Buffer's encoding followed by a special escaping function that ensures the value's not accidentally promoted to unicode.

The parser behavior for multi-string literals (e.g. "foo" "bar") also had to change. When any of the literals are unicode, the result is unicode. When all the literals are bytes the resulting value is also bytes.

This PR also implements the unicode_literals future feature in a similar fashion to print_function. The one shortcoming of that approach is that it does not affect all string literals in the file, only the ones after the current lexer position. I think a proper implementation would require doing a lexer pass on the input to activate future flags before parsing. I didn't want to add any more complexity to this PR since it's already getting a bit big and this seems good enough for now.

@trotterdylan
Copy link
Contributor Author

Whoops looks like python3 is busted. Will fix.

@trotterdylan
Copy link
Contributor Author

The tests that parse string literals now use UnicodeOnly and BytesOnly values in the expected values to make sure both values and types are identical. Making this work for 2.x and 3.x grammars and run properly under 2.x and 3.x interpreters required some hackery.

PTAL

Copy link
Contributor

@whitequark whitequark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some stylistic changes. Mostly the PR seems good; I'll look further at some of the conversions to see if they can perhaps be made cheaper, after that.

@@ -451,24 +465,24 @@ def _replace_escape(self, range, mode, value):

# Process the escape
if match.group(1) is not None: # single-char
chr = match.group(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from an older version where I had a chr() function.

@@ -418,28 +419,41 @@ def _string_literal(self, options, begin_span, data, data_span, end_span):
"strend"))

def _replace_escape(self, range, mode, value):
is_raw = ("r" in mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having these variables instead of poking into mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Now is_unicode and is_byte are the inverse of each other so I've left out is_byte.

@whitequark
Copy link
Contributor

By the way...

This PR also implements the unicode_literals future feature in a similar fashion to print_function. The one shortcoming of that approach is that it does not affect all string literals in the file, only the ones after the current lexer position. I think a proper implementation would require doing a lexer pass on the input to activate future flags before parsing. I didn't want to add any more complexity to this PR since it's already getting a bit big and this seems good enough for now.

it's illegal to have any actual code before the future imports, so this is immaterial.

@trotterdylan
Copy link
Contributor Author

Thanks for merging!

The module docstring would be bytes whereas it should be unicode. Certainly unlikely to cause any problems but it is a material difference.

@whitequark
Copy link
Contributor

The module docstring would be bytes whereas it should be unicode. Certainly unlikely to cause any problems but it is a material difference.

And only in Python 2 mode, right? That's fine by me, Python 2 support is only really included for completeness.

@trotterdylan
Copy link
Contributor Author

Right, Python 2 only. Also fine by me. Just making sure it's documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants