Skip to content

Lexical Feedback - B #1221

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 25, 2015
Merged

Lexical Feedback - B #1221

merged 5 commits into from
May 25, 2015

Conversation

marcioAlmada
Copy link
Contributor

This is related to the Context Sensitive Language RFC:

About this implementation:

  • Has no regression or forward compatibility risks and is highly predictable
  • Has a very small footprint compared to the previous attempt involving a pure lexical approach
  • It's somewhat configurable, semi-reserved words are on an inclusive parser rule
  • Doesn't require compile time checks

About the tokenizer extension port:

We basically added a mechanism to store the token stream during parsing and exposed the entire parser stack on the tokenizer extension through an opt in flag: token_get_all($src, TOKEN_PARSE). This has a few advantages:

  • Simple implementation
  • Near zero maintenance (no need to manually maintain the tokenizer extension like on Lexical Feedback #1158)
  • No BC breaks on token_get_all as the TOKEN_PARSE flag is opt in
  • Inconsistencies related to how __halt_compiler() is handled are gone - currently the tokenizer extension tries to "mimic" the parser and doesn't do an accurate job __halt_compiler is buggy without real parsing :(
  • People implementing static analyzers will be able to benefit from the TOKEN_PARSE flag as the following code is now possible:
try {
    token_get_all('<?php invalid code;', TOKEN_PARSE);
} catch (ParseException $e) {
    echo $e->getMessage(); // ParseException: syntax error, unexpected 'code' (T_STRING) on line 1
}

NOTE: Nothing being tokenized through token_get_all will be compiled or executed. The generated AST is simply destroyed and only the collected token stream is kept.

TO DO

  • Make PHP context sensitive
  • Port|Improve tokenizer extension
    • Implement TOKEN_PARSE flag and collect token stream concurrently to parsing so token_get_all() can be optionally context aware
    • Make token_get_all($source, TOKEN_PARSE) emit a catchable ParseException
    • Handle halt_compiler(); in a backwards compatible way
  • Clean up commit history
  • Review

This pull request supersedes #1158.

@marcioAlmada marcioAlmada mentioned this pull request Apr 5, 2015
8 tasks
@marcioAlmada
Copy link
Contributor Author

This approach ended up being much more simpler than #1158, with the advantage of a near zero maintenance cost so it's worth moving forward with it.

@marcioAlmada marcioAlmada force-pushed the lex-feedback-b branch 3 times, most recently from de191c7 to a7a7cbd Compare April 10, 2015 16:26
@marcioAlmada
Copy link
Contributor Author

I intend to open a mailing list thread to evaluate this patch next week, so we can analyze if it's good enough to fulfill the RFC without problems.

Thanks.

@smalyshev smalyshev added the RFC label Apr 13, 2015
@marcioAlmada marcioAlmada force-pushed the lex-feedback-b branch 2 times, most recently from 3c255b6 to d908055 Compare April 18, 2015 15:41
@acasademont
Copy link
Contributor

Hi Marcio, thanks for the awesome work! Just a little suggestion: once the patch is considered ready to be merged I'd squash all commits into a single one to have a clearer git history in master.

@marcioAlmada marcioAlmada force-pushed the lex-feedback-b branch 2 times, most recently from c9518f2 to 3c17b00 Compare April 25, 2015 18:15
@marcioAlmada
Copy link
Contributor Author

I posted this on the mailing list and apparently there is no objections there. Does this means the PR can be merged or we should wait more?

travis failure is unrelated: ext/date/tests/timezone_location_get.phpt.

@marcioAlmada marcioAlmada force-pushed the lex-feedback-b branch 2 times, most recently from 451fdc8 to 95d517b Compare April 27, 2015 21:25
The implementation has no regression risks, has an even smaller footprint
compared to the previous attempt involving a pure lexical approach, is higly
predictable and higly configurable.

To turn a word semi-reserved you only need to edit the "SEMI_RESERVED" parser rule,
it's an inclusive list of all the words that should be matched as T_STRING on specific contexts.
Example:

```
method_modifiers function returns_ref indentifier '(' parameter_list ')' ...
```

instead of:

```
method_modifiers function returns_ref T_STRING '(' parameter_list ')' ...
```

TODO: port ext tokenizer
we basically added a mechanism to store the token stream during parsing
and exposed the entire parser stack on the tokenizer extension through
an opt in flag: token_get_all($src, TOKEN_PARSE).

this change allows easy future language enhancements regarding context
aware parsing & scanning without further maintance on the tokenizer
extension while solves known inconsistencies "parseless" tokenizer
extension has when it handles `__halt_compiler()` presence.
@acasademont
Copy link
Contributor

Hi @marcioAlmada ! AFAIK a code like this

class True {
}

would still not be possible with your patch right? There are some BC issues in the Symfony Validator component and I wanted to make sure if this patch will solve them or not.

Thanks!

@marcioAlmada
Copy link
Contributor Author

@acasademont It won't. Even if we manage to allow reserved words as class names and namespace names in the future, the words representing types would still collide because types and classes dispute the same keyspace.

The only solution for that is syntactic stropping but I don't want to speculate about that now. One step at a time :)

@HallofFamer
Copy link

I think the only way it would work is to make PHP core case-sensitive, so that True and true are different. If PHP reserved words only covers lower case, it will work perfectly. For instance, string is the primitive type, but String can be used as class name. Even String and STRing are considered different.

But I think such a breaking change will not occur at least until PHP 8, theres also no way to know whether reserved words will only cover lower case even if PHP becomes case sensitive.

@php-pulls php-pulls merged commit c2f3091 into php:master May 25, 2015
@marcioAlmada
Copy link
Contributor Author

Thanks!
👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants