A Regex-Based Tokenizer for Parsers
This package provides a flexible and efficient tokenizer that utilizes regular expressions to break down text into tokens for parsing. It's primarily designed for use within parsers developed by Porifa, but can be adapted for other parsing tasks as well.
- Regex-based tokenization: Define token patterns using regular expressions for precise control.
- Token types: Assign a unique type to each token for identification and processing.
- Trivia handling: Attach additional data to tokens, such as comments or whitespace, for custom handling.
- Skippable tokens: Specify token types to be ignored during parsing.
- Customizable tokenization logic: Inject custom tokenization logic through a trivia function.
- EOF handling: Detects the end of the input stream.
- Peeking: Preview upcoming tokens without consuming them.
- Utility functions: Build regular expressions from special characters for convenience.
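
As a taste of the last bullet, building a regex that matches a literal operator or punctuator amounts to escaping regex metacharacters. The helper below is a generic sketch of that idea, not the package's actual utility (whose name and signature may differ):

```typescript
// Generic sketch: build a regex that matches a literal token such as ";" or "+=".
// Metacharacters are escaped so the literal matches itself rather than acting
// as regex syntax.
function regexFromLiteral(literal: string): RegExp {
    return new RegExp(literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
}

console.log(regexFromLiteral('+=').test('x += 1')); // the literal operator matches
console.log(regexFromLiteral('.').test('ab'));      // escaped "." no longer matches anything
```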
```sh
npm install @porifa/tokenizer
```
- Import the Tokenizer class:
```typescript
import { Tokenizer, TokenDefinition } from '@porifa/tokenizer';
```
- Define token definitions:
```typescript
enum TokenKind {
    EOF = 'eof',
    UNRECOGNIZED = 'unrecognized',
    WHITESPACE = 'whitespace',
    IDENTIFIER = 'identifier',
    IF = 'if',
    ELSE = 'else',
    WHILE = 'while',
    FOR = 'for',
    SEMICOLON = ';',
}

const keywordMap: Record<string, TokenKind> = {
    if: TokenKind.IF,
    else: TokenKind.ELSE,
    while: TokenKind.WHILE,
    for: TokenKind.FOR,
};

const tokenDefinitions: TokenDefinition<TokenKind>[] = [
    // The trailing word boundary keeps keywords from matching the prefix of an
    // identifier such as "iffy" or "format".
    { regex: /(?:if|else|while|for)\b/, tokenMap: keywordMap },
    { regex: /[a-zA-Z_][a-zA-Z0-9_]*/, tokenMap: {}, kind: TokenKind.IDENTIFIER },
    { regex: /\s+/, tokenMap: {}, kind: TokenKind.WHITESPACE },
    // ... more token definitions
];
```
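
A definition's `tokenMap` lets one regex yield several token kinds: judging from the example above, the matched text is looked up in the map, with the definition-level `kind` as the fallback. Here is a minimal standalone sketch of that resolution in plain TypeScript (the package's internals may differ):

```typescript
// Conceptual sketch of tokenMap resolution, independent of the library:
// a matched lexeme is first looked up in the definition's tokenMap;
// if it is absent, the definition's default kind applies.
type Kind = 'if' | 'identifier';

const keywordLookup: Record<string, Kind> = { if: 'if' };

function resolveKind(lexeme: string, fallback: Kind): Kind {
    return keywordLookup[lexeme] ?? fallback;
}

console.log(resolveKind('if', 'identifier'));    // a keyword hit
console.log(resolveKind('count', 'identifier')); // falls back to the default kind
```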
- Create a tokenizer instance:
```typescript
const tokenizer = new Tokenizer<TokenKind, { text: string }>(
    tokenDefinitions,
    [TokenKind.WHITESPACE],
    TokenKind.EOF,
    TokenKind.UNRECOGNIZED,
    (token, tokenizer) => ({
        text: tokenizer.code.substring(token.start, token.start + token.length),
    })
);
```
- Provide input code:
```typescript
tokenizer.setInput(code);
```
- Iterate through tokens:
```typescript
while (!tokenizer.isEndOfFile()) {
    const token = tokenizer.nextToken();
    console.log(token.kind, token.triviaData?.text);
}
```
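
Conceptually, the loop above boils down to a regex scan: try each definition at the current offset, drop skippable kinds, fall back to an unrecognized token, and emit an EOF marker when the input runs out. The following self-contained sketch illustrates that idea only; it is not the library's implementation:

```typescript
// Standalone sketch of regex-based tokenization (illustrative, not the library):
// sticky regexes are anchored at the current offset, skippable kinds are
// dropped, and a final EOF token marks the end of the input.
type Tok = { kind: string; text: string };

function scan(code: string): Tok[] {
    const defs: { regex: RegExp; kind: string }[] = [
        { regex: /(?:if|else|while|for)\b/y, kind: 'keyword' },
        { regex: /[a-zA-Z_][a-zA-Z0-9_]*/y, kind: 'identifier' },
        { regex: /\s+/y, kind: 'whitespace' },
    ];
    const skip = new Set(['whitespace']);
    const out: Tok[] = [];
    let pos = 0;
    while (pos < code.length) {
        let matched = false;
        for (const def of defs) {
            def.regex.lastIndex = pos; // anchor the sticky regex at the offset
            const m = def.regex.exec(code);
            if (m) {
                if (!skip.has(def.kind)) out.push({ kind: def.kind, text: m[0] });
                pos += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // No definition matched: consume one character as unrecognized.
            out.push({ kind: 'unrecognized', text: code[pos] });
            pos += 1;
        }
    }
    out.push({ kind: 'eof', text: '' });
    return out;
}

console.log(scan('if foo')); // keyword "if", identifier "foo", then eof
```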
- Explore the Token and Tokenizer classes for detailed properties and methods.
- Refer to the examples directory for usage in different parsing scenarios.
- Consider contributing to the package for enhancements and bug fixes.
We welcome contributions to this package! Please follow our contribution guidelines.
This package is licensed under the MIT License.