8224225: Tokenizer improvements #435

JimLaskey · 2020-09-30T14:19:21Z

Please review these changes to the javac scanner.

I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.

Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.

The UnicodeReader shouldn't really be accumulating characters for literals.
A tokenizer shouldn't need to be aware of the unicode translations.
There is no need for peek ahead.
There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
Names of existing methods were often confusing.

To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.

Some details:

UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
- characters are read from the buffer.
- if the character is a '\' then check to see if the character is the beginning of a unicode escape sequence. If so, then translate.
- if the character is a high surrogate then check to see if next character is the low surrogate. If so then combine.
  - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
    Putting this logic on UnicodeReader's shoulders means that a tokenizer does not need to have any unicode logic of its own.
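
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual javac UnicodeReader: the class name `CodePointReaderSketch`, its methods, and the choice of 0x1A as the end-of-input value are all assumptions made for the example. Malformed escapes are simply left untranslated here, and the rule about a backslash preceded by another backslash is deliberately left out to keep the sketch short.

```java
/** Hypothetical sketch of a code point reader; not javac's UnicodeReader. */
final class CodePointReaderSketch {
    /** End-of-input marker (0x1A is an assumption of this sketch). */
    static final int EOI = 0x1A;

    private final char[] buffer;
    private int position;

    CodePointReaderSketch(String source) {
        this.buffer = source.toCharArray();
    }

    /** Returns the next code point, or EOI once the buffer is exhausted. */
    int next() {
        if (position >= buffer.length) {
            return EOI; // nothing is written into the buffer to mark the end
        }
        char ch = nextChar();
        if (Character.isHighSurrogate(ch) && position < buffer.length) {
            int saved = position;
            char low = nextChar();
            if (Character.isLowSurrogate(low)) {
                return Character.toCodePoint(ch, low);
            }
            position = saved; // not a pair, so re-read that character later
        }
        return ch;
    }

    /** Reads one UTF-16 unit, translating a backslash-u escape if present. */
    private char nextChar() {
        char ch = buffer[position++];
        if (ch == '\\' && position < buffer.length && buffer[position] == 'u') {
            int index = position;
            while (index < buffer.length && buffer[index] == 'u') {
                index++; // any number of 'u' characters may follow the backslash
            }
            if (index + 4 <= buffer.length) {
                int value = 0;
                boolean valid = true;
                for (int i = 0; i < 4 && valid; i++) {
                    int digit = Character.digit(buffer[index + i], 16);
                    valid = digit >= 0;
                    value = (value << 4) | (digit & 0xF);
                }
                if (valid) {
                    position = index + 4;
                    return (char) value;
                }
            }
        }
        return ch; // plain character, or a malformed escape left as-is
    }

    public static void main(String[] args) {
        CodePointReaderSketch reader = new CodePointReaderSketch("\\uD83D\\uDE00!");
        for (int cp = reader.next(); cp != EOI; cp = reader.next()) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

With a reader shaped like this, a caller only ever sees whole code points, so it can use ordinary comparisons without knowing whether a character came from an escape or a surrogate pair.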
The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
- This meant the buffer had to be large enough (grown) to accommodate.
- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
The only buffer mutability left behind is when reading digits.
- Unicode digits are still replaced with ASCII digits.
  - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
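
For readers unfamiliar with what "replacing Unicode digits with ASCII digits" amounts to, here is a small sketch using `Character.digit`. It is only an illustration: the class and method names are hypothetical, and unlike the scanner described above it builds a new string instead of modifying a buffer in place.

```java
/** Hypothetical sketch; shows the digit mapping only, not javac's code. */
final class DigitSketch {
    /** Returns text with every Unicode decimal digit replaced by its ASCII form. */
    static String toAsciiDigits(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isDigit(cp)) {
                // Character.digit gives the numeric value 0..9 of a decimal digit
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0661..U+0663 are the Arabic-Indic digits one, two, three
        System.out.println(toAsciiDigits("\u0661\u0662\u0663"));
    }
}
```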
The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
- Thus, tokenizers don't have to special case '\\' (happened frequently in the JavadocTokenizer.)
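
The backslash special case follows from the Java Language Specification rule that a backslash can begin a unicode escape only when it is preceded by an even number of contiguous backslashes. A minimal sketch of that check is below; the class and method names are hypothetical and this is not how the UnicodeReader itself is written.

```java
/** Hypothetical sketch of the escape eligibility rule. */
final class EscapeEligibilitySketch {
    /**
     * Returns true if the character at index is a backslash that may begin
     * a unicode escape: it is followed by 'u' and preceded by an even
     * number of contiguous backslashes.
     */
    static boolean beginsUnicodeEscape(CharSequence source, int index) {
        if (index < 0 || index + 1 >= source.length()
                || source.charAt(index) != '\\'
                || source.charAt(index + 1) != 'u') {
            return false;
        }
        int preceding = 0;
        for (int i = index - 1; i >= 0 && source.charAt(i) == '\\'; i--) {
            preceding++;
        }
        return (preceding & 1) == 0;
    }

    public static void main(String[] args) {
        // One backslash before 'u': eligible
        System.out.println(beginsUnicodeEscape("\\u0041", 0));
        // Two backslashes before 'u': the second one is not eligible
        System.out.println(beginsUnicodeEscape("\\\\u0041", 1));
    }
}
```

Doing this check once in the reader means a tokenizer never has to count backslashes itself.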
JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
- This simplified/clarified the code significantly.
Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
- Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.
- Actually, I tightened up a lot of these patterns, as you will see in the code.
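
To illustrate the `accept` pattern and the StringBuilder accumulation mentioned above, here is a small standalone sketch. Everything in it (`AcceptSketch`, `get`, `scanGreaterOperator`, `scanDigits`, the 0x1A end marker) is hypothetical and only mirrors the shape of the idea; the real JavaTokenizer methods may look quite different.

```java
/** Hypothetical sketch of the accept pattern; not javac's JavaTokenizer. */
final class AcceptSketch {
    /** End-of-input marker (an assumption of this sketch). */
    static final char EOI = 0x1A;

    private final String source;
    private int position;

    AcceptSketch(String source) {
        this.source = source;
    }

    /** Returns the current character, or EOI past the end of input. */
    char get() {
        return position < source.length() ? source.charAt(position) : EOI;
    }

    /** Consumes the current character only if it matches expected. */
    boolean accept(char expected) {
        if (get() == expected) {
            position++;
            return true;
        }
        return false;
    }

    /** Scans ">", ">=", ">>" or ">>="; returns null if input does not start with '>'. */
    String scanGreaterOperator() {
        if (!accept('>')) {
            return null;
        }
        if (accept('=')) {
            return ">=";
        }
        if (accept('>')) {
            return accept('=') ? ">>=" : ">>";
        }
        return ">";
    }

    /** Accumulates a run of ASCII digits in a StringBuilder. */
    String scanDigits() {
        StringBuilder literal = new StringBuilder();
        while (get() >= '0' && get() <= '9') {
            literal.append(get());
            position++;
        }
        return literal.toString();
    }

    public static void main(String[] args) {
        System.out.println(new AcceptSketch(">>=x").scanGreaterOperator());
        System.out.println(new AcceptSketch("123abc").scanDigits());
    }
}
```

Compared with an explicit "compare, then bump the position" at each site, the test and the advance live in one call, which is what lets nested operator checks collapse into a few lines.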
There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld is 5% slower). However, once warmed up, this new scanner is faster than the existing code. The JDK Java code compiles 5-10% faster.

Previous review: https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014806.html

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8224225: Tokenizer improvements

Reviewers

Maurizio Cimadamore (@mcimadamore - Reviewer)

Download

$ git fetch https://git.openjdk.java.net/jdk pull/435/head:pull/435
$ git checkout pull/435

bridgekeeper · 2020-09-30T14:21:25Z

👋 Welcome back jlaskey! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2020-09-30T14:22:44Z

@JimLaskey The following label will be automatically applied to this pull request:

compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

JimLaskey · 2020-09-30T14:25:40Z

@lahodaj @vicente-romero-oracle @mcimadamore Please review. No changes since last time.

mlbridge · 2020-09-30T14:28:53Z

Webrevs

01: Full - Incremental (0ee36a3)
00: Full (c453147)

JimLaskey · 2020-09-30T19:17:36Z

/test

openjdk · 2020-09-30T19:48:39Z

Could not create test job

JimLaskey · 2020-10-01T10:48:25Z

/test

openjdk · 2020-10-01T10:48:52Z

Could not create test job

mcimadamore

Approved as per:
https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014810.html

openjdk · 2020-10-01T13:39:01Z

@JimLaskey This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for more details.

After integration, the commit message for the final commit will be:

8224225: Tokenizer improvements

Reviewed-by: mcimadamore

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

JimLaskey · 2020-10-01T15:38:23Z

/integrate

openjdk · 2020-10-01T15:39:14Z

@JimLaskey Since your change was applied there have been 2 commits pushed to the master branch:

9670425: 8253822: Remove unused exception_address_is_unpack_entry
8440279: 8180514: TestPrintMdo.java test fails with -XX:-TieredCompilation

Your commit was automatically rebased without conflicts.

Pushed as commit 90c131f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

8224225: Tokenizer improvements

c453147

openjdk bot added the compiler compiler-dev@openjdk.org label Sep 30, 2020

JimLaskey marked this pull request as ready for review September 30, 2020 14:23

openjdk bot added the rfr Pull request is ready for review label Sep 30, 2020

Merge branch 'master' into 8224225

0ee36a3

mcimadamore approved these changes Oct 1, 2020

openjdk bot added the ready Pull request is ready to be integrated label Oct 1, 2020

openjdk bot closed this Oct 1, 2020

openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 1, 2020

JimLaskey mentioned this pull request Oct 6, 2020

8254073: Tokenizer improvements (revised) #525

Closed

3 tasks
