New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexical Feedback - B #1221

Merged
merged 5 commits into from May 25, 2015

Conversation

6 participants
@marcioAlmada
Contributor

marcioAlmada commented Apr 5, 2015

This is related to the Context Sensitive Language RFC:

About this implementation:

  • Has no regression or forward compatibility risks and is highly predictable
  • Has a very small footprint compared to the previous attempt involving a pure lexical approach
  • It's somewhat configurable, semi-reserved words are on an inclusive parser rule
  • Doesn't require compile time checks

About the tokenizer extension port:

We basically added a mechanism to store the token stream during parsing and exposed the entire parser stack on the tokenizer extension through an opt in flag: token_get_all($src, TOKEN_PARSE). This has a few advantages:

  • Simple implementation
  • Near zero maintenance (no need to manually maintain the tokenizer extension like on #1158)
  • No BC breaks on token_get_all as the TOKEN_PARSE flag is opt in
  • Inconsistencies related to how __halt_compiler() is handled are gone - currently the tokenizer extension tries to "mimic" the parser and doesn't do an accurate job __halt_compiler is buggy without real parsing :(
  • People implementing static analyzers will be able to benefit from the TOKEN_PARSE flag as the following code is now possible:
try {
    token_get_all('<?php invalid code;', TOKEN_PARSE);
} catch (ParseException $e) {
    echo $e->getMessage(); // ParseException: syntax error, unexpected 'code' (T_STRING) on line 1
}

NOTE: Nothing being tokenized through token_get_all will be compiled or executed. The generated AST is simply destroyed and only the collected token stream is kept.

TO DO

  • Make PHP context sensitive
  • Port|Improve tokenizer extension
    • Implement TOKEN_PARSE flag and collect token stream concurrently to parsing so token_get_all() can be optionally context aware
    • Make token_get_all($source, TOKEN_PARSE) emit a catchable ParseException
    • Handle halt_compiler(); in a backwards compatible way
  • Clean up commit history
  • Review

This pull request supersedes #1158.

@marcioAlmada marcioAlmada referenced this pull request Apr 5, 2015

Closed

Lexical Feedback #1158

8 of 8 tasks complete
@marcioAlmada

This comment has been minimized.

Contributor

marcioAlmada commented Apr 6, 2015

This approach ended up being much more simpler than #1158, with the advantage of a near zero maintenance cost so it's worth moving forward with it.

@marcioAlmada

This comment has been minimized.

Contributor

marcioAlmada commented Apr 11, 2015

I intend to open a mailing list thread to evaluate this patch next week, so we can analyze if it's good enough to fulfill the RFC without problems.

Thanks.

@smalyshev smalyshev added the RFC label Apr 13, 2015

@acasademont

This comment has been minimized.

acasademont commented Apr 20, 2015

Hi Marcio, thanks for the awesome work! Just a little suggestion: once the patch is considered ready to be merged I'd squash all commits into a single one to have a clearer git history in master.

@marcioAlmada

This comment has been minimized.

Contributor

marcioAlmada commented Apr 27, 2015

I posted this on the mailing list and apparently there is no objections there. Does this means the PR can be merged or we should wait more?

travis failure is unrelated: ext/date/tests/timezone_location_get.phpt.

marcioAlmada added some commits Mar 7, 2015

implement context sensitive language with lexical feedback
The implementation has no regression risks, has an even smaller footprint
compared to the previous attempt involving a pure lexical approach, is higly
predictable and higly configurable.

To turn a word semi-reserved you only need to edit the "SEMI_RESERVED" parser rule,
it's an inclusive list of all the words that should be matched as T_STRING on specific contexts.
Example:

```
method_modifiers function returns_ref indentifier '(' parameter_list ')' ...
```

instead of:

```
method_modifiers function returns_ref T_STRING '(' parameter_list ')' ...
```

TODO: port ext tokenizer
ext tokenizer port + cleanup unused lexer states
we basically added a mechanism to store the token stream during parsing
and exposed the entire parser stack on the tokenizer extension through
an opt in flag: token_get_all($src, TOKEN_PARSE).

this change allows easy future language enhancements regarding context
aware parsing & scanning without further maintance on the tokenizer
extension while solves known inconsistencies "parseless" tokenizer
extension has when it handles `__halt_compiler()` presence.

@nikic nikic added the PHP7 label May 1, 2015

@acasademont

This comment has been minimized.

acasademont commented May 21, 2015

Hi @marcioAlmada ! AFAIK a code like this

class True {
}

would still not be possible with your patch right? There are some BC issues in the Symfony Validator component and I wanted to make sure if this patch will solve them or not.

Thanks!

@marcioAlmada

This comment has been minimized.

Contributor

marcioAlmada commented May 21, 2015

@acasademont It won't. Even if we manage to allow reserved words as class names and namespace names in the future, the words representing types would still collide because types and classes dispute the same keyspace.

The only solution for that is syntactic stropping but I don't want to speculate about that now. One step at a time :)

@HallofFamer

This comment has been minimized.

HallofFamer commented May 21, 2015

I think the only way it would work is to make PHP core case-sensitive, so that True and true are different. If PHP reserved words only covers lower case, it will work perfectly. For instance, string is the primitive type, but String can be used as class name. Even String and STRing are considered different.

But I think such a breaking change will not occur at least until PHP 8, theres also no way to know whether reserved words will only cover lower case even if PHP becomes case sensitive.

@php-pulls php-pulls merged commit c2f3091 into php:master May 25, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@marcioAlmada

This comment has been minimized.

Contributor

marcioAlmada commented May 26, 2015

Thanks!
👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment