Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make yychar long #605

Merged
merged 19 commits into from
Nov 26, 2019
Merged

Make yychar long #605

merged 19 commits into from
Nov 26, 2019

Conversation

regisd
Copy link
Member

@regisd regisd commented Nov 24, 2019

Fix #536.

When the input is larger than 2GB, the scanner internal state for the number of read characters (aka zzchar) is negative. This obviously doesn't make sense.

The test added in #603 reproduces the problem and throws an exception, when the zzchar is negative.

The problem is caused by yychar being an int, hence subject to overflow.

Fix #558 by making yychar long. This is similar to #558 by @sarowe with a lighter approach for testing.

Note that some problems remain:

  • yyline and yycolumn are still integers.
  • if the match section is larger than 2GB, then
    • you will likely have an OutOfMemory error
    • even if you have enough heap size, zzRefill() will throw a NegativeArraySizeException like this

      java.lang.NegativeArraySizeException
      at jflex.testcase.large_input.LargeInputScanner.zzRefill(LargeInputScanner.java:288)
      at jflex.testcase.large_input.LargeInputScanner.yylex(LargeInputScanner.java:601)
      at jflex.testcase.large_input.LargeInputTest.consumeLargeInput(LargeInputTest.java:22)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

@regisd regisd requested a review from lsf37 as a code owner November 24, 2019 11:30
@regisd regisd added the bug Not working as intended label Nov 24, 2019
@regisd regisd self-assigned this Nov 24, 2019
Don't prepare a large buffer if we read only a few characters.
@regisd regisd requested a review from sarowe as a code owner November 24, 2019 13:51
@regisd regisd added this to the 1.8.0 milestone Nov 25, 2019
@regisd regisd mentioned this pull request Nov 25, 2019
Copy link
Member

@lsf37 lsf37 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, esp generating the large input is nice.

We should announce this change fairly prominently, because it is API-breaking: users will have to make the same change we had to make in the example.

I wonder if we should make the type configurable (in a separate PR). The ability to handle input files >2GB is not necessarily what everyone needs, some might prefer not having to change their code.

@regisd
Copy link
Member Author

regisd commented Nov 26, 2019

Thanks for the positive feedback, Gerwin.

My opinion is that:

  • It's OK to do API breaking changes, particularly because there was none in the last year (with a slow release cycle, changes can/will be more brutal), and very few in the last decade.
  • There is no need to announce much in advance. Users have to read the changelog and update their code when they upgrade the dep.

That being said, I think 1.8 has accumulated enough changes to be released.

@regisd regisd merged commit 8c9c006 into jflex-de:master Nov 26, 2019
@regisd regisd deleted the long-yychar branch November 26, 2019 23:02
regisd pushed a commit that referenced this pull request Nov 26, 2019
Author: Régis Décamps <regisd@google.com>
Date:   Wed Nov 27 00:02:51 2019 +0100

    Make `yychar` long (#605)

    * Replace `yychar` by long in skeletons (default & nested)

    * Update example for long yyChar
      * Also, add some positivity assertion in `example/simple/Yytoken` to give some defensive programming as an example.

    * Update manual and READMe

    * Update test added in #603 .
       * Improve doc of RepeatContentReader
       * Improve RepeatContentReader
          Don't prepare a large buffer if we read only a few characters.

    * Also update testcase/example/simple for long `yychar`.

    * Delete testcase/simple which is just redundant with the example.

Updated from target/jflex-parent-1.8.0-SNAPSHOT-sources.jar
@lsf37
Copy link
Member

lsf37 commented Nov 26, 2019

Yes, I meant making it prominent in the release announcement (just because API breaking changes are rare for JFlex), I agree that we don't want to announce it before.

@lsf37
Copy link
Member

lsf37 commented Nov 27, 2019

Forgot to say: I agree that 1.8 is getting there. Still labouring under the illusion that I'll manage to pull in the char class macros (#216) into 1.8. Recently made some progress on that, but there is still some way to go.

regisd added a commit to regisd/jflex that referenced this pull request Dec 3, 2019
Introduce a fakeRead() that changes `yychar` so that we can fakely jump to yychar just before Integer.MAX_VALUE and make the test run in a snap.

Follow-up of jflex-de#603 and jflex-de#605
regisd added a commit that referenced this pull request Dec 3, 2019
)

* Make LargeInputTest run much faster by faking the scanner position

Introduce a fakeRead() that changes `yychar` so that we can fakely jump to yychar just before Integer.MAX_VALUE and make the test run in a snap.

Follow-up of #603 and #605
@regisd
Copy link
Member Author

regisd commented Mar 23, 2020

Follow-up comment for those who hit this breaking change. There are two options to fix it:

  • either adapt the consuming code and use long as well.
  • or there is a guarantee that yychar is neither larger or equal to 2^31 in which case it's possible to wrap yychar with Math.toExactInt()

regisd added a commit to jflex-de/smali that referenced this pull request Mar 25, 2020
Starting from JFlex 1.8.0, `yychar` is now a long.

However antlr `CommonToken` only takes an int, hence explicit casting is needed.

I propose to use `java.lang.Math.toIntExact()` which throws ArithmeticException in case of overflow. This is acceptable, because the application will have an illegal state today (negative character position caused by the int overflow).

See jflex-de/jflex#605 for more details.
regisd added a commit to jflex-de/smali that referenced this pull request Mar 25, 2020
Since JFlex 1.8.0, the `yychar` is a long.
However, it is used to build an antlr `CommonToken` which takes an int.

I propose to cast the `yychar` with `java.lang.Math.toIntExact()` which throw an ArithmeticException in case of overflow.
This is acceptable because the previous code would have returned an overflowed (negative) position.

See jflex-de/jflex#605 for details.
regisd added a commit to jflex-de/smali that referenced this pull request Mar 25, 2020
Starting from JFlex 1.8.0, `yychar` is a long.

However, it is used to build an antlr `CommonToken` which takes an int.
Hence, explicit casting is needed.

I propose to cast the `yychar` with `java.lang.Math.toIntExact()` which throw an ArithmeticException in case of overflow.
This is acceptable because the previous code would have returned an overflowed (negative) position.

See jflex-de/jflex#605 for details.
JesusFreke pushed a commit to JesusFreke/smali that referenced this pull request Mar 30, 2020
Starting from JFlex 1.8.0, `yychar` is a long.

However, it is used to build an antlr `CommonToken` which takes an int.
Hence, explicit casting is needed.

I propose to cast the `yychar` with `java.lang.Math.toIntExact()` which throw an ArithmeticException in case of overflow.
This is acceptable because the previous code would have returned an overflowed (negative) position.

See jflex-de/jflex#605 for details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not working as intended
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lexer has negative yychar when processing files larger than 2 GiB
2 participants