Detect source encoding to properly interpret string literals #17

trotterdylan · 2017-02-10T01:18:04Z

This addresses #6

Buffers now always hold unicode source, whereas before they could hold bytes if that's what was passed to the constructor. This is possible because we determine the encoding and then use it to decode() the bytes. A side effect is that Buffer.__init__ can now raise UnicodeDecodeError if the input is badly encoded.
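For illustration only, here is a rough sketch of the detect-then-decode idea. It is not the code in this PR: `detect_encoding`, `to_unicode`, and the ASCII default are hypothetical names and assumptions, and the pattern follows the PEP 263 convention of a coding comment on one of the first two lines.

```python
import re

# PEP 263 style encoding declaration inside a comment line.
_CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

def detect_encoding(source_bytes, default="ascii"):
    """Return the encoding declared in the first two lines, else `default`."""
    for line in source_bytes.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default

def to_unicode(source, default="ascii"):
    """Return (text, encoding); bytes input is decoded, text passes through."""
    if isinstance(source, bytes):
        encoding = detect_encoding(source, default)
        # decode() raises UnicodeDecodeError for badly encoded input.
        return source.decode(encoding), encoding
    return source, default
```

A caller that cannot guarantee well-encoded input would need to catch UnicodeDecodeError around `to_unicode()`, which mirrors the side effect described above.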

The Buffer encoding is then used by the lexer to produce a strdata token of the correct type for string literals. For unicode literals, escaping happens much as before via _replace_escape(). For bytes, there's a different code path that calls encode() using the Buffer's encoding followed by a special escaping function that ensures the value's not accidentally promoted to unicode.
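As a sketch of the bytes code path (encode first, then process escapes while staying in bytes), something along these lines would work. `bytes_literal_value` is a hypothetical helper, not the function added in this PR, and it only handles a small subset of escapes.

```python
import re

# Only a handful of escapes are handled here; unknown ones are left untouched.
_SIMPLE = {b"n": b"\n", b"t": b"\t", b"\\": b"\\", b"'": b"'", b'"': b'"',
           b"\n": b""}  # backslash-newline is a line continuation
_ESCAPE_RE = re.compile(br"\\(?:x([0-9a-fA-F]{2})|(.))", re.DOTALL)

def bytes_literal_value(body, encoding):
    """Turn the unicode body of a bytes literal into its bytes value."""
    raw = body.encode(encoding)  # back to the bytes that were in the file

    def replace(match):
        if match.group(1) is not None:  # \xNN
            return bytes(bytearray([int(match.group(1), 16)]))
        char = match.group(2)
        return _SIMPLE.get(char, b"\\" + char)

    # Pattern and replacement are both bytes, so nothing is promoted to unicode.
    return _ESCAPE_RE.sub(replace, raw)
```

For example, the same literal body yields different bytes depending on the source encoding: `bytes_literal_value(u"caf\u00e9", "utf-8")` has five bytes, while the latin-1 result has four.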

The parser behavior for multi-string literals (e.g. "foo" "bar") also had to change. When any of the literals are unicode, the result is unicode. When all the literals are bytes the resulting value is also bytes.
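The rule can be sketched as follows. `concat_literals` is a hypothetical name, and the codec used to promote bytes parts to unicode (`encoding`, defaulting to ASCII) is an assumption, not something this PR specifies.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3

def concat_literals(values, encoding="ascii"):
    """Join adjacent literal values: unicode if any part is unicode, else bytes."""
    if any(isinstance(value, text_type) for value in values):
        # Assumption: bytes parts are decoded with `encoding` before joining.
        return u"".join(
            value if isinstance(value, text_type) else value.decode(encoding)
            for value in values)
    return b"".join(values)
```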

This PR also implements the unicode_literals future feature in a similar fashion to print_function. The one shortcoming of that approach is that it does not affect all string literals in the file, only the ones after the current lexer position. I think a proper implementation would require doing a lexer pass on the input to activate future flags before parsing. I didn't want to add any more complexity to this PR since it's already getting a bit big and this seems good enough for now.
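To make the "lexer pass" idea concrete, here is a minimal sketch of such a pre-scan. It uses the standard library's tokenize module under Python 3 rather than this project's lexer, `future_features` is a hypothetical name, and it deliberately ignores corner cases such as semicolon-joined statements.

```python
import io
import tokenize

def future_features(source):
    """Return the names imported from __future__ at the top of `source`."""
    features = set()
    words = []  # (type, string) pairs of the current logical line
    readline = io.StringIO(source).readline
    for tok_type, tok_string, _, _, _ in tokenize.generate_tokens(readline):
        if tok_type in (tokenize.COMMENT, tokenize.NL):
            continue  # comments and blank lines may precede future imports
        if tok_type != tokenize.NEWLINE:
            words.append((tok_type, tok_string))
            continue
        # A logical line just ended; decide what kind of statement it was.
        if [s for _, s in words[:3]] == ["from", "__future__", "import"]:
            after_as = False
            for t, s in words[3:]:
                if t == tokenize.NAME and s == "as":
                    after_as = True  # the next name is only an alias
                elif t == tokenize.NAME:
                    if not after_as:
                        features.add(s)
                    after_as = False
        elif not all(t == tokenize.STRING for t, _ in words):
            break  # first ordinary statement: no future imports can follow
        words = []
    return features
```

With the set of features known up front, the lexer could be configured before the first token is produced, so a leading docstring would be treated the same way as every other string literal.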


trotterdylan · 2017-02-10T01:19:58Z

Whoops, looks like Python 3 is busted. Will fix.

trotterdylan · 2017-02-10T06:36:34Z

The tests that parse string literals now use UnicodeOnly and BytesOnly values in the expected values to make sure both values and types are identical. Making this work for 2.x and 3.x grammars and run properly under 2.x and 3.x interpreters required some hackery.
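The helpers themselves are not shown in this thread, so the following is only a guess at the idea: expected values that compare equal to a parsed value only when both the value and its type match. This matters on Python 2, where u"foo" == b"foo" is true. The `TypedValue` base class is hypothetical.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3

class TypedValue(object):
    """Expected value that only equals values of exactly the right kind."""
    expected_type = None

    def __init__(self, value):
        assert isinstance(value, self.expected_type)
        self.value = value

    def __eq__(self, other):
        # Check the type first so bytes and unicode are never compared directly.
        return isinstance(other, self.expected_type) and other == self.value

    def __ne__(self, other):
        return not self == other

    def __repr__(self):
        return "%s(%r)" % (type(self).__name__, self.value)

class UnicodeOnly(TypedValue):
    expected_type = text_type

class BytesOnly(TypedValue):
    expected_type = bytes
```

A test can then write `UnicodeOnly(u"foo")` in its expected AST and have the comparison fail if the parser produced `b"foo"` instead.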

PTAL

whitequark

Some stylistic changes. Mostly the PR seems good; I'll look further at some of the conversions to see if they can perhaps be made cheaper, after that.

whitequark · 2017-02-10T14:47:38Z

pythonparser/lexer.py

@@ -451,24 +465,24 @@ def _replace_escape(self, range, mode, value):

            # Process the escape
            if match.group(1) is not None: # single-char
-                chr = match.group(1)


What's the point of this change?

Leftover from an older version where I had a chr() function.

whitequark · 2017-02-10T14:48:42Z

pythonparser/lexer.py

@@ -418,28 +419,41 @@ def _string_literal(self, options, begin_span, data, data_span, end_span):
                          "strend"))

    def _replace_escape(self, range, mode, value):
-        is_raw     = ("r" in mode)


I prefer having these variables instead of poking into mode.

Done. Now is_unicode and is_byte are the inverse of each other so I've left out is_byte.

whitequark · 2017-02-11T12:59:29Z

By the way...

> This PR also implements the unicode_literals future feature in a similar fashion to print_function. The one shortcoming of that approach is that it does not affect all string literals in the file, only the ones after the current lexer position. I think a proper implementation would require doing a lexer pass on the input to activate future flags before parsing. I didn't want to add any more complexity to this PR since it's already getting a bit big and this seems good enough for now.

it's illegal to have any actual code before the future imports, so this is immaterial.

trotterdylan · 2017-02-11T16:16:05Z

Thanks for merging!

The module docstring would be bytes whereas it should be unicode. Certainly unlikely to cause any problems but it is a material difference.

whitequark · 2017-02-11T16:57:10Z

> The module docstring would be bytes whereas it should be unicode. Certainly unlikely to cause any problems but it is a material difference.

And only in Python 2 mode, right? That's fine by me, Python 2 support is only really included for completeness.

trotterdylan · 2017-02-11T17:20:28Z

Right, Python 2 only. Also fine by me. Just making sure it's documented.

Detect source encoding and use that to properly interpret string literals.

3e6fa95

Dylan Trotter added 3 commits February 9, 2017 22:03

Fix tests broken for Python 3.

db93555

Disable validation on 3.4 for a test that behaves weirdly.

5524109

Disable validation for real.

e7ba1f6

whitequark suggested changes Feb 10, 2017


Style changes.

f68eeee
