Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP 8.0+ does no longer accept a comment in the whitespace between ampersand and argument in function signature #10083

Closed
olleharstedt opened this issue Dec 11, 2022 · 17 comments
Labels

Comments

@olleharstedt
Copy link

olleharstedt commented Dec 11, 2022

Description

The following code:

<?php

function foo(
array
#if __PHP__
&
#endif
$bodies
)
{
}

Resulted in this output:

    Parse error: syntax error, unexpected variable "$bodies" in /in/aq2rT on line 8

    Process exited with code 255.

But I expected this output instead:

<Empty, should compile without syntax error as in PHP 7.4

PHP Version

PHP 8.1 and above, works fine in PHP 7.4

Operating System

No response

@iluuu1994
Copy link
Member

This is expected and caused by some parser/tokenizer hackery for intersection types that was necessary to avoid ambiguity. See 069a9fa.

@olleharstedt
Copy link
Author

Won't fix = impossible to fix?

@iluuu1994
Copy link
Member

Yes 🙂

@olleharstedt
Copy link
Author

😢

@iluuu1994
Copy link
Member

iluuu1994 commented Dec 11, 2022

I think in theory we could adjust the tokens so that variables are assumed after & /* comment */ instead of intersection types, but one case will always not work.

@olleharstedt
Copy link
Author

olleharstedt commented Dec 11, 2022

Are comments part of the token stream? That's the problem? Because whitespace and newline works fine.

@iluuu1994
Copy link
Member

iluuu1994 commented Dec 11, 2022

No. The problem is that the grammar is ambiguous. Our LR(1) parser only gets one token of lookahead, but code like function foo(Foo |& Bar $param) { (| denoting the current position) requires multiple tokens of lookahead to understand if this is a param by reference (Foo &$param) or an intersection type. We avoid this ambiguity by creating different tokens for & depending on what follows, but this has limitations in that comments will influence tokenization.

<ST_IN_SCRIPTING>"&"[ \t\r\n]*("$"|"...") {
yyless(1);
RETURN_TOKEN(T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG);
}
<ST_IN_SCRIPTING>"&" {
RETURN_TOKEN(T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG);
}

@olleharstedt
Copy link
Author

olleharstedt commented Dec 11, 2022

Ah ok. But expanding <ST_IN_SCRIPTING>"&"[ \t\r\n]*("$"|"...") { to also eat comments could fix it? If that's possible during lexing. But maybe not good for lexing performance. Or possible only eat the single-line comments # and //.

@iluuu1994
Copy link
Member

I suppose, yes. Currently lexing of comments is more complex although fully expressable via regex. I guess if we can embed those as lookahead it would work fine. /\*.*?\*/ should work. the other are straight forward. Feel free to give it a try if you like.

@iluuu1994
Copy link
Member

Let's reopen until we verify that this approach will or won't work.

@olleharstedt
Copy link
Author

Yep, can do, thanks!

@Danack
Copy link
Contributor

Danack commented Dec 11, 2022

To be clear, this appears to have been found during someone doing code-golf. It doesn't appear to affect an in production codebase.

@olleharstedt
Copy link
Author

olleharstedt commented Dec 11, 2022

<ST_IN_SCRIPTING>"&"([ \t\r\n]|("#".*"\n")|("//".*"\n")|("/*".*"*/"))*("$"|"...") {

Possibly this. But does not match newlines inside /* */ comments. :/

@iluuu1994
Copy link
Member

Attempted fix #10084 showed this doesn't seem feasible for now.

@iluuu1994 iluuu1994 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 12, 2022
@cmb69
Copy link
Member

cmb69 commented Dec 12, 2022

I agree that we shouldn't try to fix this with more regexps; in the long run, we might consider using a GLP parser. Still, it may make sense to document the current restriction.

@iluuu1994
Copy link
Member

Reopening as it looks like there's a viable solution after all. See GH-10125.

@iluuu1994 iluuu1994 reopened this Dec 18, 2022
@derickr
Copy link
Member

derickr commented Feb 8, 2023

This seems to break installs with re2c 0.15.3, which the date/time parser still uses. The source code also defines the minimum as 0.13.4 which is also broken. It looks like re2c 1.0.3 (which is what my release VM insisted on) does work.

I can't find in the GIT repo or mailing list history why the date/time parser (which we commit) couldn't deal with a later version. I think we should change the minimum in PHP to 1.0.3 (like the release images do) and I will also start using that for timelib's parser.

I've made a PR to add that check: #10542

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants