Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested Bounded Repetitions causes "Illegal Argument Exception" #952

Closed
hellotommmy opened this issue Mar 3, 2022 · 3 comments · Fixed by #1003
Closed

Nested Bounded Repetitions causes "Illegal Argument Exception" #952

hellotommmy opened this issue Mar 3, 2022 · 3 comments · Fixed by #1003
Labels
bug Not working as intended
Milestone

Comments

@hellotommmy
Copy link

Running jflex on the following code snippet,

%%


%class search
%standalone


%% 
((a|b )*b.{10}){3} {System.out.printf("*** found match");}

And I get the error:

147972 states before minimization, 79107 states in minimized DFA
Old file "search.java" saved as "search.java~"
Writing code to "search.java"

Unexpected exception encountered. This indicates a bug in JFlex.
Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new


character value expected
java.lang.IllegalArgumentException: character value expected
        at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
        at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
        at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
        at jflex.generator.Emitter.emit(Emitter.java:1369)
        at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
        at jflex.Main.generate(Main.java:320)
        at jflex.Main.main(Main.java:336)

As suggested, I report this error.

Can I also ask a question together with this issue report:
I understand that jflex uses the Thompson construction to create an NFA first,
and then determinises it, and minimises the DFA.
I am wondering if these are the main methods for making jflex fast, or are there any
other optimizations involved? Are these methods documented somewhere other
than the user manual? Can someone give a pointer if possible?

Thanks a lot!

@lsf37
Copy link
Member

lsf37 commented Mar 3, 2022

Thanks for the report, will look into it.

That looks like a very high number of states for this small expression. Is this the output for the reduced spec or for a larger spec?

The speed is based in a large part on the execution engine being based on a DFA. It doesn't matter too much by which way we arrive at the DFA (it does matter for generator speed, but not runtime speed).

The other part of the speed is that the DFA implementation in the runtime engine is reasonably highly optimised. Those latter kinds of optimisations did make a difference (e.g. compared to the earlier tool JLex), and are not really documented much. They are mostly based on things like trying to make it easy for the JVM to keep values in registers and in the cache, which are all things that can easily go out of date as JVMs develop, so it is quite possible that there is additional optimisation possible, because I haven't done much to the runtime engine in the last few years.

@lsf37 lsf37 added the bug Not working as intended label Nov 10, 2022
@lsf37
Copy link
Member

lsf37 commented Dec 29, 2022

To answer my own question from above: this is the minimal spec, it does indeed generate a DFA with 79k states. This is likely the cause of the exception, i.e. there is something going wrong in the table output.

@lsf37 lsf37 added this to the 1.9.0 milestone Dec 30, 2022
lsf37 added a commit that referenced this issue Jan 1, 2023
@lsf37 lsf37 mentioned this issue Jan 1, 2023
2 tasks
@lsf37 lsf37 linked a pull request Jan 2, 2023 that will close this issue
2 tasks
lsf37 added a commit that referenced this issue Jan 2, 2023
@lsf37
Copy link
Member

lsf37 commented Jan 2, 2023

What is wrong in the table output is that the table is encoded as a Java string to circumvent java class size restrictions (other methods would have failed earlier). Java strings have 16 bit characters, which means that state values > 0xFFFF are outside the permitted value range. This implicitly limits the max scanner size to 64k states.

Usually 64k states is very hard to hit, but DFA size is known to be potentially exponential in the regexp size, and this is such a (rare) case (the mix of * and longish fixed-length repetition blows up the size).

It's probably better to try to rewrite the expression to lead to a smaller DFA, but it'd be reasonably easy to extend the encoding such that larger sizes are possible without major performance impact.

lsf37 added a commit that referenced this issue Jan 2, 2023
lsf37 added a commit that referenced this issue Jan 3, 2023
lsf37 added a commit that referenced this issue Jan 3, 2023
The previous Emitter instances could only emit DFAs up to 2^16 states,
which is large, but can be hit with small regexps that produce
exponentially large DFAs.

Now, when the DFA has more than 0xFFFE states, a different emitter is
chosen that can represent values up to 2^32.

Fixes #952
lsf37 added a commit that referenced this issue Jan 3, 2023
The previous Emitter instances could only emit DFAs up to 2^16 states,
which is large, but can be hit with small regexps that produce
exponentially large DFAs.

Now, when the DFA has more than 0xFFFE states, a different emitter is
chosen that can represent values up to 2^32.

Fixes #952
@lsf37 lsf37 closed this as completed in c9d79f6 Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants