Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r|!r does not always match everything #567

Closed
lsf37 opened this issue Oct 22, 2019 · 12 comments · Fixed by #581
Closed

r|!r does not always match everything #567

lsf37 opened this issue Oct 22, 2019 · 12 comments · Fixed by #581
Assignees
Labels
bug Not working as intended
Milestone

Comments

@lsf37
Copy link
Member

lsf37 commented Oct 22, 2019

Original bug report:

hello,
I found an issue with the negation operator "!"
With the following specification, string "baba" is not matched
by either EXP ou !EXP .

Pascal Hennequin

%%
%standalone
%{
void ECHO(String cat) { System.out.print("["+cat+":"+yytext()+"]"); }
%}

EXP = ( [^a] [^]* [^a] )
ALL = {EXP} | ! {EXP}  

%%
{ALL}  { ECHO("1"); }
baba   { ECHO("2"); }
@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

Slightly reduced test case:

%%
%standalone
%debug
%%
([^]* [^a]) | !([^]* [^a])  { /* all */ }
aba   { /* should not match */ }

Input aba produces:

match: --aba--
action [9] { /* should not match */ }

(should match first action instead)

Input baba produces:

match: --bab--
action [8] { /* all */ }
match: --a--
action [8] { /* all */ }

(should match all of the input in one match)

Variations of the expression (e.g. .|!.) correctly produce a warning that the second action cannot be matched and produce a DFA with one state, but the test spec produces 8 states (and no warning).

@lsf37 lsf37 self-assigned this Oct 22, 2019
@lsf37 lsf37 added the bug Not working as intended label Oct 22, 2019
@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

This means the issue is in the automaton generation, not the engine. Most likely in the NFA/negation part.

@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

The NFA looks at least strange: nfa.pdf

aba is represented correctly, as is [^]* [^a] in the final automaton as well as the source of the negation, but the negation doesn't look right. The single character match comes from %standalone.

@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

Further reduced:

%%
%int
%debug
%%
!([^]*[^a]) { }

produces on aba

match: --a--
action [5] {  }
match: --ba--
action [5] {  }
-1

but should match the entire input, because aba ends in a and therefore does not match [^]*[^a]

@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

NFA for that: nfa.pdf

@lsf37
Copy link
Member Author

lsf37 commented Oct 22, 2019

Confusingly !([^]*a) produces the correct NFA: nfa3.pdf

@lsf37
Copy link
Member Author

lsf37 commented Oct 23, 2019

And the spec that should be equivalent to !([^]*a) produces the error:

%%
%int
%debug
%%
!([^]*[^\u{0}-`b-\u{10FFFF}]) { }

With input bab we get:

match: --b--
action [5] {  }
match: --ab--
action [5] {  }
-1

i.e. there is something funny going on with (negated?) char classes and negation, even though the input NFA to the negation looks identical (nfa4.pdf), and the big negated class does come out to just a.

The expression [^\u{0}-`b-\u{10FFFF}] alone gives the expected correct nfa that just matches a.

@lsf37
Copy link
Member Author

lsf37 commented Oct 23, 2019

(reproduced back to jflex-1.3, i.e. not a recent regression)

@lsf37
Copy link
Member Author

lsf37 commented Nov 15, 2019

The internal order of character class partitions seems to play a role. The error-example

%%
%int
%debug
%%
!([^]*[^a]) { }

produces { [^a], [a] }. If we force [a] to be created first, e.g.

%%
%int
%debug
DUMMY=[\u{0}-`b-\u{10FFFF}]
%%
!([^]*[^a]) { }

and get the partitions { [a], [^a] }, the result is correct.

@lsf37
Copy link
Member Author

lsf37 commented Nov 15, 2019

The bug appears to be in the removeDead method, which traverses the NFA in different order when the input classes are in different order. Probable cause: modifies NFA table while traversing it.

@lsf37 lsf37 mentioned this issue Nov 16, 2019
@lsf37
Copy link
Member Author

lsf37 commented Nov 16, 2019

And indeed the removeDead method was a bit of a mess, now cleaned up.

@lsf37 lsf37 added this to the 1.8.0 milestone Nov 16, 2019
lsf37 added a commit that referenced this issue Nov 16, 2019
lsf37 added a commit that referenced this issue Nov 16, 2019
@lsf37
Copy link
Member Author

lsf37 commented Nov 17, 2019

For posterity: the implementation of the negation operator uses a pass to remove states that can never reach a final state from the negated sub-NFA (needed to conform with assumptions of the scanning engine). This function removeDead would in specific circumstances not correctly identify a state as live (able to reach a final state), and therefore incorrectly remove it.

This was triggered rarely (e.g. in none of the examples or regression test or other specs we test JFlex on). You can determine wether a lexer spec was affected by looking at the number of DFA states before minimisation in JFlex 1.7.0 and (the upcoming) JFLex 1.8.0. If the number of states differ, it may have been affected by the bug, if the number of states is equal, it was not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant