Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py: Implement partial PEP-498 (f-string) support #4998

Open
wants to merge 4 commits into
base: master
from

Conversation

@klardotsh
Copy link

commented Aug 11, 2019

This is a near-identical PR to adafruit#2054, which is where I originally implemented this functionality. Since none of it (except the localization stuff) was CircuitPython-specific, I'm opening this up so upstream can benefit as well. The actual diff between the two PRs is entirely translation-related.

This refs #2415

This implements (most of) the PEP-498 spec for f-strings, with two
exceptions:

  • raw f-strings (fr or rf prefixes) raise NotImplementedError
  • one special corner case does not function as specified in the PEP
    (more on that in a moment)

This is implemented in the core as a syntax translation, brute-forcing
all f-strings to run through String.format. For example, the statement
x='world'; print(f'hello {x}') gets translated at a syntax level
(injected into the lexer) to x='world'; print('hello {}'.format(x)).
While this may lead to weird column results in tracebacks, it seemed
like the fastest, most efficient, and likely most RAM-friendly option,
despite being implemented under the hood with a completely separate
vstr_t.

Since string concatenation of adjacent literals is implemented in the
lexer
,
two side effects emerge:

  • All strings with at least one f-string portion are concatenated into a
    single literal which must be run through String.format() wholesale,
    and:
  • Concatenation of a raw string with interpolation characters with an
    f-string will cause IndexError/KeyError, which is both different
    from CPython and different from the corner case mentioned in the PEP
    (which gave an example of the following:)
x = 10
y = 'hi'
assert ('a' 'b' f'{x}' '{c}' f'str<{y:^4}>' 'd' 'e') == 'ab10{c}str< hi >de'

The above-linked commit detailed a pretty solid case for leaving string
concatenation in the lexer rather than putting it in the parser, and
undoing that decision would likely be disproportionately costly on
resources for the sake of a probably-low-impact corner case. An
alternative to become complaint with this corner case of the PEP would
be to revert to string concatenation in the parser only when an
f-string is part of concatenation
, though I've done no investigation on
the difficulty or costs of doing this.

A decent set of tests is included. I've manually tested this on the
unix port on Linux and
things seem sane.

py: Implement partial PEP-498 (f-string) support
This implements (most of) the PEP-498 spec for f-strings, with two
exceptions:

- raw f-strings (`fr` or `rf` prefixes) raise `NotImplementedError`
- one special corner case does not function as specified in the PEP
(more on that in a moment)

This is implemented in the core as a syntax translation, brute-forcing
all f-strings to run through `String.format`. For example, the statement
`x='world'; print(f'hello {x}')` gets translated *at a syntax level*
(injected into the lexer) to `x='world'; print('hello {}'.format(x))`.
While this may lead to weird column results in tracebacks, it seemed
like the fastest, most efficient, and *likely* most RAM-friendly option,
despite being implemented under the hood with a completely separate
`vstr_t`.

Since [string concatenation of adjacent literals is implemented in the
lexer](534b7c3),
two side effects emerge:

- All strings with at least one f-string portion are concatenated into a
single literal which *must* be run through `String.format()` wholesale,
and:
- Concatenation of a raw string with interpolation characters with an
f-string will cause `IndexError`/`KeyError`, which is both different
from CPython *and* different from the corner case mentioned in the PEP
(which gave an example of the following:)

```python
x = 10
y = 'hi'
assert ('a' 'b' f'{x}' '{c}' f'str<{y:^4}>' 'd' 'e') == 'ab10{c}str< hi >de'
```

The above-linked commit detailed a pretty solid case for leaving string
concatenation in the lexer rather than putting it in the parser, and
undoing that decision would likely be disproportionately costly on
resources for the sake of a probably-low-impact corner case. An
alternative to become complaint with this corner case of the PEP would
be to revert to string concatenation in the parser *only when an
f-string is part of concatenation*, though I've done no investigation on
the difficulty or costs of doing this.

A decent set of tests is included. I've manually tested this on the
`unix` port on Linux and on a Feather M4 Express (`atmel-samd`) and
things seem sane.
@klardotsh

This comment has been minimized.

Copy link
Author

commented Aug 11, 2019

I'm not entirely sure why SyntaxErrors are being thrown here. The unix port in particular is what this was built on, and at that, it looks like some test runs are passing, others didn't seem to build the feature at all.

I'm getting one similar weird case on the CircuitPython PR, so at least this isn't a totally fringe issue. Happy to patch up this PR as needed to get tests passing, but not entirely sure where to start (since I haven't hacked much/at all on uPy core before)

@dpgeorge

This comment has been minimized.

Copy link
Member

commented Aug 13, 2019

Wow, thanks for this, it's great to see such progress on implementing this feature!

The issue with f-strings is that they introduce more coupling between the parser and lexer, but the way it's implemented here (at the token level) is very neat indeed.

This will need a full review, but for now to fix the CI errors:

  • generate locally the tests/basics/string_pep498_fstring.py.exp file by running the test through CPython 3.6+, and commit that .exp file in this PR
  • fix tests/cmdline/cmd_parsetree.py.exp to match what is actually output; tok(4)->tok(10) and tok(14)->tok(20)
@klardotsh

This comment has been minimized.

Copy link
Author

commented Aug 13, 2019

I was able to push a quick fix for that first .exp file thing, thanks for the heads up on how that works! I'll be able to fix up the parsetree one some time this evening probably.

Looks like bare-arm is failing as well because the code size increased ~800 bytes. Not sure how to tell it "that's fine because the language spec increased", or if I should (as ended up discussed in the CircuitPython PR, where CN-localized SAMD21 builds are now oversized) feature flag this - makes an inconsistent UX if different targets do/don't have f-strings, but given the implementation that may be necessary.

klardotsh added some commits Aug 13, 2019


lex->chr0 = lex->vstr_postfix.len > 0 ? lex->vstr_postfix.buf[0] : 0;
lex->chr1 = lex->vstr_postfix.len > 1 ? lex->vstr_postfix.buf[1] : 0;
lex->chr2 = lex->vstr_postfix.len > 2 ? lex->vstr_postfix.buf[2] : 0;

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

It's probably not necessary to check len in any of these expressions, and instead just copy in the data. Because at this point vstr_posfix has at least 3 valid chars (if I understand the code right, at least ( a char and )). That will save code size.

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

This was my own form of defensiveness (the number of times I've segfaulted C by not guarding things like this is... embarrassing at best). In this PR, vstr_postfix will always have at least .format(x) or more. If this postfix is ever used for something else (I'm not sure what else you'd want to syntax transform at a lexer level, but who knows), this defensiveness may be useful, otherwise, happy to scrap it for simplicity.


h0 = lex->chr0;
h1 = lex->chr1;
h2 = lex->chr2;

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

No need for temporary h variables, can just write directly to chr3-5.

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Yep - that's now true. A former implementation here used those holding vars (for what, I'm not sure at this point). Good catch.

lex->chr4 = h1;
lex->chr5 = h2;

lex->vstr_postfix_idx = lex->vstr_postfix.len > 2 ? 3 : lex->vstr_postfix.len;

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

Are there cases where len<3?

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

See above comment about defensiveness.

lex->chr2 = lex->chr5;
lex->chr3 = 0;
lex->chr4 = 0;
lex->chr5 = 0;

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

Is it necessary to reset chr3-5 to 0? Could just leave it out (to save code size).

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Not technically necessary, just something I did to ensure we weren't keeping now-garbage data around. Probably sane to throw this out.

lex->chr5 = 0;

vstr_reset(&lex->vstr_postfix);
lex->vstr_postfix_idx = 0;

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

Does idx need to be reset? Do the vstr need to be reset?

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Given the current implementation: I think since I blindly write .format( and the postfix translations to vstr_postfix at least the string should be reset/cleared. vstr_postfix_idx needs reset somewhere, though there may well be better places to do it.

If you're implying it'd be lighter on code size/CPU cycles/memory usage to just maintain a constantly growing vstr that starts to look roughly like .format(x).format(y).format(z), with an ever-increasing vstr_postfix_idx, I'm totally open to trying that. I'm not sure how expensive the vstr_resets are

break;
case MP_TOKEN_FSTRING_RAW:
exc = mp_obj_new_exception_msg(&mp_type_NotImplementedError,
"raw f-strings are not implemented");

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

These error messages are the things that will really increase code size. I don't know how useful it is to have them so detailed, probably all the SyntaxError ones can be collapsed into a single "malformed f-string" message.

There is actually support for terse/normal/detailed errors, so at the least the above code should use that feature, and in terse/normal mode just have a single message, and for detailed mode can use all the different messages.

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

I did this mostly for compatibility with CPython's error messaging (and admittedly, as a side effect of firing blind at the lexer at first to try to understand how it worked) and I'm in no way attached to it. Happy to cull some strings here (it'll also lighten up the CircuitPython PR which was oversized for one of the translation targets...).

I'll take a look at this error leveling system - sounds neat.

except NotImplementedError:
pass
else:
if sys.implementation.name in {'micropython', 'circuitpython'}:

This comment has been minimized.

Copy link
@dpgeorge

dpgeorge Aug 14, 2019

Member

Best not to rely on this kind of auto-detection. A better place for these not-implemented tests would be tests/misc/non_compliant_lexer.py.

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

This was my hack to ensure I wasn't erroneously throwing RuntimeError when generating the .exp file (since CPython does implement raw f-strings - admittedly I probably should at some point, but I wanted to scope this first PR to the probably most useful bits of the PEP).

Given that, do you still recommend moving these tests?

@klardotsh
Copy link
Author

left a comment

Mostly clarifying responses - I'll submit some code updates this evening / tomorrowish for some/many of these.


h0 = lex->chr0;
h1 = lex->chr1;
h2 = lex->chr2;

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Yep - that's now true. A former implementation here used those holding vars (for what, I'm not sure at this point). Good catch.


lex->chr0 = lex->vstr_postfix.len > 0 ? lex->vstr_postfix.buf[0] : 0;
lex->chr1 = lex->vstr_postfix.len > 1 ? lex->vstr_postfix.buf[1] : 0;
lex->chr2 = lex->vstr_postfix.len > 2 ? lex->vstr_postfix.buf[2] : 0;

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

This was my own form of defensiveness (the number of times I've segfaulted C by not guarding things like this is... embarrassing at best). In this PR, vstr_postfix will always have at least .format(x) or more. If this postfix is ever used for something else (I'm not sure what else you'd want to syntax transform at a lexer level, but who knows), this defensiveness may be useful, otherwise, happy to scrap it for simplicity.

lex->chr4 = h1;
lex->chr5 = h2;

lex->vstr_postfix_idx = lex->vstr_postfix.len > 2 ? 3 : lex->vstr_postfix.len;

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

See above comment about defensiveness.

lex->chr2 = lex->chr5;
lex->chr3 = 0;
lex->chr4 = 0;
lex->chr5 = 0;

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Not technically necessary, just something I did to ensure we weren't keeping now-garbage data around. Probably sane to throw this out.

lex->chr5 = 0;

vstr_reset(&lex->vstr_postfix);
lex->vstr_postfix_idx = 0;

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

Given the current implementation: I think since I blindly write .format( and the postfix translations to vstr_postfix at least the string should be reset/cleared. vstr_postfix_idx needs reset somewhere, though there may well be better places to do it.

If you're implying it'd be lighter on code size/CPU cycles/memory usage to just maintain a constantly growing vstr that starts to look roughly like .format(x).format(y).format(z), with an ever-increasing vstr_postfix_idx, I'm totally open to trying that. I'm not sure how expensive the vstr_resets are

break;
case MP_TOKEN_FSTRING_RAW:
exc = mp_obj_new_exception_msg(&mp_type_NotImplementedError,
"raw f-strings are not implemented");

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

I did this mostly for compatibility with CPython's error messaging (and admittedly, as a side effect of firing blind at the lexer at first to try to understand how it worked) and I'm in no way attached to it. Happy to cull some strings here (it'll also lighten up the CircuitPython PR which was oversized for one of the translation targets...).

I'll take a look at this error leveling system - sounds neat.

except NotImplementedError:
pass
else:
if sys.implementation.name in {'micropython', 'circuitpython'}:

This comment has been minimized.

Copy link
@klardotsh

klardotsh Aug 14, 2019

Author

This was my hack to ensure I wasn't erroneously throwing RuntimeError when generating the .exp file (since CPython does implement raw f-strings - admittedly I probably should at some point, but I wanted to scope this first PR to the probably most useful bits of the PEP).

Given that, do you still recommend moving these tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.