-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nested Bounded Repetitions causes "Illegal Argument Exception" #952
Comments
Thanks for the report, will look into it. That looks like a very high number of states for this small expression. Is this the output for the reduced spec or for a larger spec? The speed is based in a large part on the execution engine being based on a DFA. It doesn't matter too much by which way we arrive at the DFA (it does matter for generator speed, but not runtime speed). The other part of the speed is that the DFA implementation in the runtime engine is reasonably highly optimised. Those latter kinds of optimisations did make a difference (e.g. compared to the earlier tool JLex), and are not really documented much. They are mostly based on things like trying to make it easy for the JVM to keep values in registers and in the cache, which are all things that can easily go out of date as JVMs develop, so it is quite possible that there is additional optimisation possible, because I haven't done much to the runtime engine in the last few years. |
To answer my own question from above: this is the minimal spec, it does indeed generate a DFA with 79k states. This is likely the cause of the exception, i.e. there is something going wrong in the table output. |
What is wrong in the table output is that the table is encoded as a Java string to circumvent java class size restrictions (other methods would have failed earlier). Java strings have 16 bit characters, which means that state values > 0xFFFF are outside the permitted value range. This implicitly limits the max scanner size to 64k states. Usually 64k states is very hard to hit, but DFA size is known to be potentially exponential in the regexp size, and this is such a (rare) case (the mix of * and longish fixed-length repetition blows up the size). It's probably better to try to rewrite the expression to lead to a smaller DFA, but it'd be reasonably easy to extend the encoding such that larger sizes are possible without major performance impact. |
The previous Emitter instances could only emit DFAs up to 2^16 states, which is large, but can be hit with small regexps that produce exponentially large DFAs. Now, when the DFA has more than 0xFFFE states, a different emitter is chosen that can represent values up to 2^32. Fixes #952
The previous Emitter instances could only emit DFAs up to 2^16 states, which is large, but can be hit with small regexps that produce exponentially large DFAs. Now, when the DFA has more than 0xFFFE states, a different emitter is chosen that can represent values up to 2^32. Fixes #952
Running jflex on the following code snippet,
And I get the error:
As suggested, I report this error.
Can I also ask a question together with this issue report:
I understand that jflex uses the Thompson construction to create an NFA first,
and then determinises it, and minimises the DFA.
I am wondering if these are the main methods for making jflex fast, or are there any
other optimizations involved? Are these methods documented somewhere other
than the user manual? Can someone give a pointer if possible?
Thanks a lot!
The text was updated successfully, but these errors were encountered: