8224225: Tokenizer improvements #435

wants to merge 2 commits into from



@JimLaskey JimLaskey commented Sep 30, 2020

Please review these changes to the javac scanner.

I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.

Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.

  • The UnicodeReader shouldn't really be accumulating characters for literals.
  • A tokenizer shouldn't need to be aware of the unicode translations.
  • There is no need for peek ahead.
  • There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
  • Names of existing methods were often confusing.

To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.

Some details:

  • UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:

    • characters are read from the buffer.
    • if the character is a '\' then check to see if it is the beginning of a unicode escape sequence. If so, then translate.
    • if the character is a high surrogate then check to see if the next character is the low surrogate. If so, then combine.
      • A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
        The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode logic.
  • The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.

    • This meant the buffer had to be large enough (grown) to accommodate.
    • There really was no need since we can simply return an EOI when trying to read past the end of buffer.
  • The only buffer mutability left behind is when reading digits.

    • Unicode digits are still replaced with ASCII digits.
      • This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
  • The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.

    • Thus, tokenizers don't have to special case '\\' (happened frequently in the JavadocTokenizer.)
  • JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.

    • This simplified/clarified the code significantly.
  • Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.

    • Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
  • Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.

    • Actually, I tightened up a lot of these patterns, as you will see in the code.
  • There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.

  • The new scanner is slower to warm up due to new layers of method calls (e.g. HelloWorld is 5% slower). However, once warmed up, this new scanner is faster than the existing code. The JDK Java code compiles 5-10% faster.
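
The reader behaviour described in the list above (escape translation, surrogate combining, EOI on reading past the end, the '\\' special case, accept, and StringBuilder accumulation) can be illustrated with a minimal sketch. This is not the code in this PR: the class name `SketchReader`, the `decode` helper, the `oddBackslash` flag, and the `EOI` value are all assumptions made for illustration, and the digit replacement mentioned above is not modelled.

```java
/** Hypothetical sketch of a code-point reader; not the javac UnicodeReader. */
final class SketchReader {
    /** Marker returned when reading past the end of the buffer (assumed value). */
    static final int EOI = 0x1A;

    private final char[] buf;          // source characters, never modified
    private int pos = 0;               // index of the next unread char
    private int codePoint;             // current, fully translated code point
    private boolean oddBackslash;      // previous unit was an unpaired backslash

    SketchReader(String source) {
        buf = source.toCharArray();
        next();                        // position on the first code point
    }

    /** Current code point, or EOI at the end of input. */
    int get() { return codePoint; }

    /** Advance to the next code point and return it. */
    int next() {
        if (pos >= buf.length) {
            return codePoint = EOI;    // no sentinel stored in the buffer
        }
        char first = nextUnit();
        codePoint = first;
        if (Character.isHighSurrogate(first) && pos < buf.length) {
            int savedPos = pos;
            boolean savedFlag = oddBackslash;
            char second = nextUnit();
            if (Character.isLowSurrogate(second)) {
                codePoint = Character.toCodePoint(first, second);
            } else {                   // unpaired surrogate: undo the look at the next unit
                pos = savedPos;
                oddBackslash = savedFlag;
            }
        }
        return codePoint;
    }

    /** Consume the current code point if it matches the expected one. */
    boolean accept(int expected) {
        if (codePoint != expected) return false;
        next();
        return true;
    }

    /** Read one UTF-16 unit, translating a backslash-u escape when one is eligible. */
    private char nextUnit() {
        char ch = buf[pos++];
        if (ch == '\\' && !oddBackslash && pos < buf.length && buf[pos] == 'u') {
            int p = pos;
            while (p < buf.length && buf[p] == 'u') p++;   // several 'u's are allowed
            int value = 0;
            boolean valid = p + 4 <= buf.length;
            for (int i = 0; valid && i < 4; i++) {
                int digit = Character.digit(buf[p + i], 16);
                valid = digit >= 0;
                value = value * 16 + digit;
            }
            if (valid) {
                pos = p + 4;
                oddBackslash = false;
                return (char) value;
            }
            // Malformed escape: this sketch returns the backslash; a compiler would report an error.
        }
        oddBackslash = ch == '\\' && !oddBackslash;
        return ch;
    }

    /** Accumulate every code point into a StringBuilder, as a tokenizer might for a literal. */
    static String decode(String source) {
        SketchReader reader = new SketchReader(source);
        StringBuilder sb = new StringBuilder();
        while (reader.get() != EOI) {
            sb.appendCodePoint(reader.get());
            reader.next();
        }
        return sb.toString();
    }
}
```

In this sketch a caller only ever sees translated code points: `decode` never inspects a backslash itself, and a check such as `reader.accept('=')` both tests and advances in one call. A backslash that follows an unpaired backslash is not eligible to start an escape, which is why the text `\\u0041` passes through unchanged.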

Previous review:


  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed




$ git fetch pull/435/head:pull/435
$ git checkout pull/435


bridgekeeper bot commented Sep 30, 2020

👋 Welcome back jlaskey! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Sep 30, 2020

@JimLaskey The following label will be automatically applied to this pull request:

  • compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the compiler label Sep 30, 2020
@JimLaskey JimLaskey marked this pull request as ready for review September 30, 2020 14:23
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 30, 2020
Member Author

@lahodaj @vicente-romero-oracle @mcimadamore Please review. No changes since last time.


openjdk bot commented Sep 30, 2020

Could not create test job


openjdk bot commented Oct 1, 2020

Could not create test job


@mcimadamore mcimadamore left a comment


openjdk bot commented Oct 1, 2020

@JimLaskey This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file for more details.

After integration, the commit message for the final commit will be:

8224225: Tokenizer improvements

Reviewed-by: mcimadamore

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 1, 2020
@openjdk openjdk bot closed this Oct 1, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 1, 2020

openjdk bot commented Oct 1, 2020

@JimLaskey Since your change was applied there have been 2 commits pushed to the master branch:

  • 9670425: 8253822: Remove unused exception_address_is_unpack_entry
  • 8440279: 8180514: test fails with -XX:-TieredCompilation

Your commit was automatically rebased without conflicts.

Pushed as commit 90c131f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
