Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context Sensitive Lexer #1054

Closed
wants to merge 3 commits into from

Conversation

@marcioAlmada
Copy link
Contributor

commented Feb 5, 2015

This aims to be a consistent implementation of semi-reserved words grammar on OO context and OO access context.

RFC: https://wiki.php.net/rfc/context_sensitive_lexer

marcioAlmada added a commit to marcioAlmada/php-src that referenced this pull request Feb 5, 2015

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch 6 times, most recently from e3c8519 to 94a4100 Feb 9, 2015

static unsigned int in_class = 0;
static zend_bool in_const_list = 0;

ZEND_ASSERT(in_class >= 0);

This comment has been minimized.

Copy link
@nikic

nikic Feb 11, 2015

Member

in_class is an unsigned int ... I would be very surprised if it became smaller than zero ;)

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Feb 11, 2015

Author Contributor

I made a test with static unsigned int in_class = -1; and it caused strange bugs without errors here :P

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Feb 11, 2015

Author Contributor

I removed the unsigned and kept the assertion because it's currently not very obvious to detect bugs if somebody decrements in_class in wrong conditions.

@@ -2,7 +2,7 @@
+----------------------------------------------------------------------+
| PHP Version 7 |
+----------------------------------------------------------------------+
| Copyright (c) 1997-2015 The PHP Group |
| Copyright (c) 1997-2014 The PHP Group |

This comment has been minimized.

Copy link
@reeze

reeze Feb 11, 2015

Contributor

Oh, we go back to last year ;)

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Feb 11, 2015

Author Contributor

Looks like tokenizer_data_gen.sh does time travel :)

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Feb 11, 2015

Author Contributor

fixed

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch 2 times, most recently from 8f862e7 to ddea7e8 Feb 11, 2015

@laruence

This comment has been minimized.

Copy link
Member

commented Feb 16, 2015

@jpauli jpauli added the RFC label Feb 16, 2015

@marcioAlmada marcioAlmada changed the title Allow context sensitive grammar for object context Context sensitive lexer for OO context Feb 16, 2015

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch 3 times, most recently from 4c384aa to 636e210 Feb 16, 2015

@POPSuL

This comment has been minimized.

Copy link

commented Feb 24, 2015

It's great idea!

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch 5 times, most recently from a8effa2 to 1fe57cd Feb 25, 2015

@HallofFamer

This comment has been minimized.

Copy link

commented Feb 27, 2015

I have a question though, does this also apply to classes, interfaces or traits, or only to properties, methods and constants? Lets say if scalar type hinting is passed and the scalar names become reserved words, while I have an Int class and a String class. Will this RFC solve the problem?

@marcioAlmada

This comment has been minimized.

Copy link
Contributor Author

commented Feb 27, 2015

Hi,

Will this RFC solve the problem?

No.

I tried a more ambitious RFC before that targeted namespaces, classes,
traits and interfaces. The other patch was fully functional but it had some
drawbacks and conflicted with other possible future RFCs. So the RFC was
reverted to this version, targeting only OO declaration context and OO
access context.

A pure lexical approach, without significant changes on the lexer
structure, will probably not solve this problem. I have some ideas for
future work targeting the wider issue though, involving multiple lexers or
maybe find a way to do lexing feedback without break tokenizer extension.
But it wont't/can't be finished before the feature freeze.

Anyway, the current solution is a great start and already mitigates a lot
of the BC breaks from other RFCs that will soon be put to vote, like:

Hope that's useful information.

Thanks,
Márcio.

@HallofFamer

This comment has been minimized.

Copy link

commented Feb 27, 2015

I see, thanks for your response, I will be looking forward to the more ambitious case-sensitive patch too. It's unfortunate that scalar names will become reserved words, I have mixed feelings about scalar type hinting. Part of me like it since its a step towards strongly typed PHP I was hoping for, but reserving scalar types makes it inconvenient for me. I know Levi Morrison proposed reserve some types in PHP 7, I am okay with his proposal though as I fully understand why this may be necessary.

@marcioAlmada marcioAlmada changed the title Context sensitive lexer for OO context Context Sensitive Lexer Mar 1, 2015

#define ZEND_RESERVED_PRIVATE 4
#define ZEND_RESERVED_PROTECTED 5
#define ZEND_RESERVED_PUBLIC 6
#define ZEND_RESERVED_CLASS 7

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

Is the distinction between the individual reserved keywords really relevant here? It look to me like it would be good enough to just return a boolean from the check.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 2, 2015

Author Contributor

It could be useful in case one wants to check for a specific identifier to give more specialized err messages, like: https://github.com/php/php-src/pull/1054/files#diff-3a8139128d4026ce0cb0c86beba4e6b9R4511

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

I've seen the error message check, but the distinction seems really weird there ... it's pretty much the same message with slightly different wording. That does not seem worthwhile.

goto restart;
}

<ST_IN_SCRIPTING>{LABEL}/"\\"|{OPTIONAL_WHITESPACE}"::" {

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

Is this rule still necessary? Looks like a leftover from class name handling to me.

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

Actually there seem to be many rules like this. Is this alredy the new implementation just for method/consts or is this the previous one with class support?

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 2, 2015

Author Contributor

I'll have to double check that and see if some test will fail, not 100% sure just by reading it here.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 2, 2015

Author Contributor

Nope, the rule wasn't dead. It was there for a very specific reason. But I just noticed it was causing a syntax regression. The following code is "valid" on PHP 5.6:

interface Thingable{}
class Thing implements\Thingable {}
// notice there is no space between implements and \

http://3v4l.org/lKM1s

PHP 5.6 doesn't complain about it but the patched lexer was requiring white space between implements and \. I would consider it an accidental bug fix. But since bug fixes are BC breaks, the rule was removed.

PS: Some rules were simplified without regression by the last commit.

yy_push_state(ST_LOOKING_FOR_SR_NAME);
}
return ',';
}

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

What about const FOO = [1, isset(BAR[0])]; will the isset be lexed as T_ISSET or as T_STRING? It should be a T_ISSET. We currently don't support any language constructors in constants I think, but that's just because they don't happen to be implemented, we may want to add them in the future.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 2, 2015

Author Contributor

T_ISSET because the ST_LOOKING_FOR_SR_NAME pops itself after first match so in const FOO = [1, isset(BAR[0])]; only FOO will be the T_STRING.

This comment has been minimized.

Copy link
@nikic

nikic Mar 2, 2015

Member

Not sure I understand your response. In that case in_class and in_const_list should be 1, so that would push LOOKING_FOR_SR_NAME after the , - or not? How does it distinguish between A = 1, B = 2 and A = [1, B] for example?

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 2, 2015

Author Contributor

OK. Fixed with marcioAlmada@a8b1f21. Thanks ^^

in_trait_conflict = 1;
}
in_class++;
}
yy_push_state(ST_IN_SCRIPTING);
return '{';
}

This comment has been minimized.

Copy link
@nikic

nikic Mar 3, 2015

Member

What about this case:

class A {
    use B { list as foo; }
}

It doesn't look like non-absolute trait method references are currently handled.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

Sigh. So this is valid syntax too... not surprised there isn't a single test for that on the test suite AFAIK. After #1054 (comment) nothing surprises me.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

Thanks, this one is a quick fix. I can't push it immediately, but I'll do it in a few ours.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

return T_OBJECT_OPERATOR;
}

<ST_IN_SCRIPTING,ST_LOOKING_FOR_SR_NAME>"namespace"/"\\"|{OPTIONAL_WHITESPACE}"::" {

This comment has been minimized.

Copy link
@nikic

nikic Mar 3, 2015

Member

What's the second part (optional ws with ::) for? Also shouldn't there be optional ws before the backslash as well?

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

Yes, it should be:

<ST_IN_SCRIPTING,ST_LOOKING_FOR_SR_NAME>"namespace"/{OPTIONAL_WHITESPACE}("\\"|"::") {

About ::, there is a single test on the test suite involving namespace:: access but I'll double check it again anyway.

in_use_list = 0;
in_trait_conflict = 0;
}
}

This comment has been minimized.

Copy link
@nikic

nikic Mar 3, 2015

Member

What about code like class A { function foo() { "${$a}"; } function list() {} }, is this handled correclty? The opening { for the string interpolation won't be lexed as a plain curly open brace in this case.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

This only fails if "${$a}" (string interpolation) is part of ST_IN_SCRIPTING state. I can't check now as I'm far from working computer.

But if your question is just rhetorical and you already know the answer is yes, the solution would be to conditionally in_class++ here too + some other things I can't envision right now.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

Ok thinking a bit more, there is other possibility:

If somebody used BEGIN() to go back to ST_IN_SCRIPTING and discard ST_DOUBLE_QUOTES, ST_BACKQUOTE or ST_HEREDOC states instead of yy_pop_state() this would break the state trail and the previously pushed states will get lost. It should be fixable. But if it's not fixable... game over.

This comment has been minimized.

Copy link
@marcioAlmada

marcioAlmada Mar 3, 2015

Author Contributor

Phew... ok, cases for ${ and {$, encapsed vars, on class declaration context is closed. Indeed it was a easy fix.

Now moving to the "non qualified trait method name" case...

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch from 7c5523e to 3592484 Mar 4, 2015

@marcioAlmada marcioAlmada force-pushed the marcioAlmada:context branch from 3592484 to a098f87 Mar 4, 2015

@marcioAlmada marcioAlmada referenced this pull request Mar 7, 2015
8 of 8 tasks complete
@marcioAlmada

This comment has been minimized.

Copy link
Contributor Author

commented Mar 24, 2015

superseded by #1158

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
7 participants
You can’t perform that action at this time.