A JLS-compliant yet simplistic lexer for the Java language. The focus is on ease of implementation and understanding.
What is provided:
- Classes for the different types of input elements and tokens.
- A Unicode expander, which expands Unicode escapes (e.g
- A lexer that emits a list of input elements or tokens for a given input string.
- Upon lexing failure, a choice between generating "garbage tokens" and throwing an exception.
- Regexes for all of Java's input elements, plus a few extras (lists of keywords, operators, etc.).
- Various utilities dealing with escape handling and number parsing.
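
To illustrate what Unicode escape expansion involves, here is a minimal sketch of the rules in JLS §3.3. It is not this library's implementation: the class name `UnicodeEscapeSketch` and the method `expand` are hypothetical, and throwing `IllegalArgumentException` on a malformed escape is an assumption made for the example.

```java
// Hypothetical sketch of JLS 3.3 Unicode escape expansion (not this library's API).
final class UnicodeEscapeSketch {

    /** Replaces each eligible backslash-u escape with the UTF-16 code unit it denotes. */
    static String expand(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '\\' || i + 1 == input.length()) {
                // Ordinary character, or a lone backslash at the end of the input.
                out.append(c);
                i++;
            } else if (input.charAt(i + 1) == '\\') {
                // Two backslashes in a row: copy both, so the second one
                // can never begin an escape (it follows an odd number of backslashes).
                out.append("\\\\");
                i += 2;
            } else if (input.charAt(i + 1) == 'u') {
                // Skip the run of 'u' markers, then read exactly four hex digits.
                int j = i + 1;
                while (j < input.length() && input.charAt(j) == 'u') {
                    j++;
                }
                if (j + 4 > input.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int code = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(input.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("bad hex digit at index " + k);
                    }
                    code = code * 16 + digit;
                }
                // The produced character is appended as-is and takes no part in further escapes.
                out.append((char) code);
                i = j + 4;
            } else {
                // Backslash followed by anything else is left for the later lexing stages.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Returns the value of an ASCII hex digit, or -1 if the character is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

With this sketch, the input `int \u0061 = 1;` expands to `int a = 1;`, while `\\u0041` is left untouched because its second backslash is preceded by an odd number of backslashes.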
This has only been tested very lightly. It is very likely that bugs or outright breakage are still lurking. Use at your own risk!
If you are using Maven (or another popular JVM build tool), see here.
A self-contained JAR file is also available here.
The lexer follows the specification laid out in Chapter 3 of the Java Language Specification for Java 9. This is largely compatible with all previous versions, with the following pitfalls:
- `_` (underscore) is a keyword since Java 9.
- Java 9 adds ten [restricted keywords] that are sometimes parsed as identifiers, sometimes as keywords (under some conditions, in module declarations). For simplicity's sake, we always parse them as identifiers. This shouldn't be an issue in practice.
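
The two pitfalls above can be sketched as a small word classifier. This is only an illustration under stated assumptions: the class `WordClassifierSketch`, the `Kind` enum, and the `javaVersion` parameter are hypothetical and not part of this library (which targets Java 9 only), and the keyword set is deliberately abridged.

```java
// Hypothetical sketch (not this library's API): classifying a word as keyword or identifier.
final class WordClassifierSketch {

    enum Kind { KEYWORD, IDENTIFIER }

    // Abridged for illustration; the full keyword list is in JLS 3.9.
    private static final java.util.Set<String> KEYWORDS =
        java.util.Set.of("class", "if", "int", "return", "while");

    // The ten restricted keywords added by Java 9. They are listed only to
    // show that they receive no special treatment below.
    static final java.util.Set<String> RESTRICTED_KEYWORDS = java.util.Set.of(
        "open", "module", "requires", "transitive", "exports",
        "opens", "to", "uses", "provides", "with");

    /** Classifies a word; javaVersion is an assumed knob for showing the Java 9 change. */
    static Kind classify(String word, int javaVersion) {
        if (word.equals("_")) {
            // A lone underscore became a keyword in Java 9.
            return javaVersion >= 9 ? Kind.KEYWORD : Kind.IDENTIFIER;
        }
        // Restricted keywords fall through and are always reported as identifiers.
        return KEYWORDS.contains(word) ? Kind.KEYWORD : Kind.IDENTIFIER;
    }
}
```

Here `classify("module", 9)` yields `IDENTIFIER`, leaving it to a parser to decide whether the word acts as a keyword inside a module declaration.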
Looking as far back as Java 6, the following (backward-compatible) changes were made:
- `->` is an operator since Java 8.
- `::` is a separator since Java 8.
- Underscores may appear inside integer and floating-point literals since Java 7.
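
As an illustration of the underscore rule, here is a minimal sketch restricted to decimal integer literals without a type suffix. The class `UnderscoreLiteralSketch` and its methods are hypothetical, and the regex is a simplified rendering of the `DecimalNumeral` production in JLS §3.10.1, not one of the regexes shipped with this library.

```java
// Hypothetical sketch (not this library's API): underscores in decimal integer literals.
final class UnderscoreLiteralSketch {

    // Either a lone 0, or a non-zero digit optionally followed by digits and
    // underscores that end in a digit. Underscores may only sit between digits.
    private static final String DECIMAL_INT = "0|[1-9](?:[0-9_]*[0-9])?";

    /** Tells whether the text is a decimal integer literal (no suffix handled here). */
    static boolean isDecimalIntLiteral(String text) {
        return text.matches(DECIMAL_INT);
    }

    /** Parses a decimal integer literal, ignoring its underscores. */
    static int parseDecimalInt(String text) {
        if (!isDecimalIntLiteral(text)) {
            throw new IllegalArgumentException("not a decimal integer literal: " + text);
        }
        // Underscores carry no value, so they are dropped before conversion.
        return Integer.parseInt(text.replace("_", ""));
    }
}
```

Under this sketch `1_000_000` and `1__0` are accepted, whereas `_1000` and `1000_` are rejected because an underscore may not start or end the digit sequence.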