A Regex-Based Tokenizer for Parsers
This package provides a flexible and efficient tokenizer that utilizes regular expressions to break down text into tokens for parsing. It's primarily designed for use within parsers developed by Porifa, but can be adapted for other parsing tasks as well.
- Regex-based tokenization: Define token patterns using regular expressions for precise control.
- Token types: Assign a unique type to each token for identification and processing.
- Trivia handling: Attach additional data to tokens, such as comments or whitespace, for custom handling.
- Skippable tokens: Specify token types to be ignored during parsing.
- Customizable tokenization logic: Inject custom tokenization logic through a trivia function.
- EOF handling: Detects the end of the input stream.
- Peeking: Preview upcoming tokens without consuming them.
- Utility functions: Build regular expressions from special characters for convenience.
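
As a taste of the last bullet, building a regex that matches a literal operator or punctuator amounts to escaping regex metacharacters. The helper below is a generic sketch of that idea, not the package's actual utility (whose name and signature may differ):

```typescript
// Generic sketch: build a regex that matches a literal token such as ";" or "+=".
// Metacharacters are escaped so the literal matches itself rather than acting
// as regex syntax.
function regexFromLiteral(literal: string): RegExp {
    return new RegExp(literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
}

console.log(regexFromLiteral('+=').test('x += 1')); // the literal operator matches
console.log(regexFromLiteral('.').test('ab'));      // escaped "." no longer matches anything
```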
```sh
npm install @porifa/tokenizer
```
- Import the Tokenizer class:
```typescript
import { Tokenizer, TokenDefinition } from '@porifa/tokenizer';
```
- Define token definitions:
```typescript
enum TokenKind {
    EOF = 'eof',
    UNRECOGNIZED = 'unrecognized',
    WHITESPACE = 'whitespace',
    IDENTIFIER = 'identifier',
    IF = 'if',
    ELSE = 'else',
    WHILE = 'while',
    FOR = 'for',
    SEMICOLON = ';',
}

const keywordMap: Record<string, TokenKind> = {
    if: TokenKind.IF,
    else: TokenKind.ELSE,
    while: TokenKind.WHILE,
    for: TokenKind.FOR,
};

const tokenDefinitions: TokenDefinition<TokenKind>[] = [
    // The trailing word boundary keeps keywords from matching the prefix of an
    // identifier such as "iffy" or "format".
    { regex: /(?:if|else|while|for)\b/, tokenMap: keywordMap },
    { regex: /[a-zA-Z_][a-zA-Z0-9_]*/, tokenMap: {}, kind: TokenKind.IDENTIFIER },
    { regex: /\s+/, tokenMap: {}, kind: TokenKind.WHITESPACE },
    // ... more token definitions
];
```
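
A definition's `tokenMap` lets one regex yield several token kinds: judging from the example above, the matched text is looked up in the map, with the definition-level `kind` as the fallback. Here is a minimal standalone sketch of that resolution in plain TypeScript (the package's internals may differ):

```typescript
// Conceptual sketch of tokenMap resolution, independent of the library:
// a matched lexeme is first looked up in the definition's tokenMap;
// if it is absent, the definition's default kind applies.
type Kind = 'if' | 'identifier';

const keywordLookup: Record<string, Kind> = { if: 'if' };

function resolveKind(lexeme: string, fallback: Kind): Kind {
    return keywordLookup[lexeme] ?? fallback;
}

console.log(resolveKind('if', 'identifier'));    // a keyword hit
console.log(resolveKind('count', 'identifier')); // falls back to the default kind
```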
- Create a tokenizer instance:
```typescript
const tokenizer = new Tokenizer<TokenKind, { text: string }>(
    tokenDefinitions,
    [TokenKind.WHITESPACE],
    TokenKind.EOF,
    TokenKind.UNRECOGNIZED,
    (token, tokenizer) => ({
        text: tokenizer.code.substring(token.start, token.start + token.length),
    })
);
```
- Provide input code:
```typescript
tokenizer.setInput(code);
```
- Iterate through tokens:
```typescript
while (!tokenizer.isEndOfFile()) {
    const token = tokenizer.nextToken();
    console.log(token.kind, token.triviaData?.text);
}
```
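
Conceptually, the loop above boils down to a regex scan: try each definition at the current offset, drop skippable kinds, fall back to an unrecognized token, and emit an EOF marker when the input runs out. The following self-contained sketch illustrates that idea only; it is not the library's implementation:

```typescript
// Standalone sketch of regex-based tokenization (illustrative, not the library):
// sticky regexes are anchored at the current offset, skippable kinds are
// dropped, and a final EOF token marks the end of the input.
type Tok = { kind: string; text: string };

function scan(code: string): Tok[] {
    const defs: { regex: RegExp; kind: string }[] = [
        { regex: /(?:if|else|while|for)\b/y, kind: 'keyword' },
        { regex: /[a-zA-Z_][a-zA-Z0-9_]*/y, kind: 'identifier' },
        { regex: /\s+/y, kind: 'whitespace' },
    ];
    const skip = new Set(['whitespace']);
    const out: Tok[] = [];
    let pos = 0;
    while (pos < code.length) {
        let matched = false;
        for (const def of defs) {
            def.regex.lastIndex = pos; // anchor the sticky regex at the offset
            const m = def.regex.exec(code);
            if (m) {
                if (!skip.has(def.kind)) out.push({ kind: def.kind, text: m[0] });
                pos += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // No definition matched: consume one character as unrecognized.
            out.push({ kind: 'unrecognized', text: code[pos] });
            pos += 1;
        }
    }
    out.push({ kind: 'eof', text: '' });
    return out;
}

console.log(scan('if foo')); // keyword "if", identifier "foo", then eof
```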
- Explore the Token and Tokenizer classes for detailed properties and methods.
- Refer to the examples directory for usage in different parsing scenarios.
- Consider contributing to the package for enhancements and bug fixes.
We welcome contributions to this package! Please follow our contribution guidelines.
This package is licensed under the MIT License.