
Conversation

@kaidokert (Owner) commented Jul 5, 2025

Summary by Sourcery

Constify character handling in the JSON tokenizer by introducing compile-time lookup tables and replacing dynamic match logic with table lookups

Enhancements:

  • Add const lookup tables for escape sequences and character classes (digits, whitespace, hex digits)
  • Replace runtime match statements in the tokenizer with table-driven lookups to simplify parsing logic and boost performance (a sketch of the pattern follows below)
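For readers new to the pattern, here is a minimal sketch of what such a table-driven check looks like in Rust. The IS_WHITESPACE name follows the PR; the construction and the helper function are illustrative assumptions, not the repository's actual code.

```rust
// Illustrative sketch only: a 256-entry classification table built at compile
// time and indexed by the raw byte value. The real table in
// picojson/src/ujson/tokenizer/mod.rs may be constructed differently.
const IS_WHITESPACE: [bool; 256] = {
    let mut table = [false; 256];
    table[b' ' as usize] = true;
    table[b'\t' as usize] = true;
    table[b'\n' as usize] = true;
    table[b'\r' as usize] = true;
    table
};

// Hypothetical helper: one array index replaces a four-arm pattern match.
fn is_json_whitespace(byte: u8) -> bool {
    IS_WHITESPACE[byte as usize]
}

fn main() {
    assert!(is_json_whitespace(b' '));
    assert!(!is_json_whitespace(b'x'));
}
```

Each such table costs 256 bytes of read-only data, the usual trade of a little flash for fewer branches on the hot path.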

Summary by CodeRabbit

  • Refactor
    • Improved performance and efficiency of JSON parsing by optimizing character classification and escape sequence handling.
    • Replaced multiple runtime checks with faster constant-time lookups for whitespace, digits, and escape sequences.


sourcery-ai bot commented Jul 5, 2025

Reviewer's Guide

Introduces compile-time character classification and escape-token lookup tables to replace runtime matching, streamlining the tokenizer’s escape, whitespace, digit, and hexadecimal handling for improved performance and maintainability.

Class diagram for updated Tokenizer character handling

classDiagram
    class Tokenizer {
        +const ESCAPE_TOKENS: [Option<EventToken>; 256]
        +const IS_DIGIT: [bool; 256]
        +const IS_WHITESPACE: [bool; 256]
        +const IS_HEX_DIGIT: [bool; 256]
        +const NUM_TRANSITIONS: [[Option<Num>; 256]; 8]
        ...
    }
    Tokenizer --|> BitBucket : uses
    Tokenizer --|> DepthCounter : uses
    Tokenizer ..> EventToken : escape/char tables
    Tokenizer ..> Num : number transitions

Flow diagram for character classification in Tokenizer

flowchart TD
    A[Input character] --> B{Is whitespace?}
    B -- Yes --> C[Use IS_WHITESPACE lookup]
    B -- No --> D{Is digit?}
    D -- Yes --> E[Use IS_DIGIT lookup]
    D -- No --> F{Is hex digit?}
    F -- Yes --> G[Use IS_HEX_DIGIT lookup]
    F -- No --> H{Is escape char?}
    H -- Yes --> I[Use ESCAPE_TOKENS lookup]
    H -- No --> J[Other handling]

File-Level Changes

Each change below lists its details, followed by the affected files.
Introduced const lookup tables for character classification and escape tokens
  • Define ESCAPE_TOKENS table mapping ASCII codes to EventToken variants
  • Define IS_DIGIT table for numeric character detection
  • Define IS_WHITESPACE table for whitespace detection
  • Define IS_HEX_DIGIT table for hexadecimal digit detection
picojson/src/ujson/tokenizer/mod.rs
Replaced runtime matches with table lookups (a before/after sketch follows at the end of this change list)
  • Swap runtime match on escape characters for ESCAPE_TOKENS lookup
  • Replace whitespace pattern matches with IS_WHITESPACE checks
  • Replace digit range matches with IS_DIGIT checks
  • Replace hex-digit range matches with IS_HEX_DIGIT checks
picojson/src/ujson/tokenizer/mod.rs
Refactored string escape and unicode handling
  • Use ESCAPE_TOKENS to handle simple escapes and branch to unicode on 'u'
  • Remove duplicated unicode entry match arms in state transitions
  • Consolidate unicode hex-digit validation under IS_HEX_DIGIT
picojson/src/ujson/tokenizer/mod.rs
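As a before/after illustration of the swap described in this change list, shown here for the digit check (the whitespace and hex-digit checks change the same way), consider the sketch below. Only the IS_DIGIT name comes from the PR; the helper functions are assumptions made for illustration.

```rust
// Sketch of a digit table built with a const-compatible while loop
// (for loops are not allowed in const initializers).
const IS_DIGIT: [bool; 256] = {
    let mut table = [false; 256];
    let mut b = b'0';
    while b <= b'9' {
        table[b as usize] = true;
        b += 1;
    }
    table
};

// Before: a runtime range pattern evaluated for every byte.
fn is_digit_match(byte: u8) -> bool {
    matches!(byte, b'0'..=b'9')
}

// After: a single constant-time array lookup.
fn is_digit_lookup(byte: u8) -> bool {
    IS_DIGIT[byte as usize]
}

fn main() {
    // Both forms classify every byte identically.
    for b in 0..=255u8 {
        assert_eq!(is_digit_match(b), is_digit_lookup(b));
    }
}
```

For a single contiguous range such as the digits, an optimizing compiler may emit comparable code for both forms; the lookup pays off most for irregular sets such as the escape characters, and it keeps all classification in one uniform shape.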



coderabbitai bot commented Jul 5, 2025

Walkthrough

The changes introduce compile-time constant lookup tables within the Tokenizer implementation to replace various runtime pattern matches for character classification and escape sequence handling in JSON parsing. These tables provide constant-time array lookups for whitespace, digits, hex digits, and escape tokens, and several internal logic branches were refactored to use them.

Changes

picojson/src/ujson/tokenizer/mod.rs: Added compile-time constant lookup tables for escape tokens, whitespace, non-zero digits, and hex digits. Refactored character classification and escape sequence handling to use these tables instead of match/if statements. Merged and adjusted logic for string escape handling and removed a redundant error case. (A sketch of one possible use of the non-zero-digit table follows below.)
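The walkthrough mentions a table for non-zero digits specifically. One plausible reason, stated as an assumption about intent rather than a fact about this code, is JSON's number grammar: the integer part is either a lone 0 or a 1-9 digit followed by more digits, so the start-of-number state has to tell 0 apart from 1-9. A sketch of such a table and one hypothetical use:

```rust
// Illustrative table for the digits 1-9 only. The IS_NON_ZERO_DIGIT name
// follows the walkthrough; the construction and the helper are sketches.
const IS_NON_ZERO_DIGIT: [bool; 256] = {
    let mut table = [false; 256];
    let mut b = b'1';
    while b <= b'9' {
        table[b as usize] = true;
        b += 1;
    }
    table
};

// Hypothetical start-of-number classification: JSON forbids leading zeros.
fn describe_number_start(byte: u8) -> &'static str {
    if byte == b'0' {
        "zero: the next character must not be another digit"
    } else if IS_NON_ZERO_DIGIT[byte as usize] {
        "integer part: more digits may follow"
    } else {
        "not the start of a number"
    }
}

fn main() {
    assert_eq!(describe_number_start(b'0'), "zero: the next character must not be another digit");
    assert_eq!(describe_number_start(b'7'), "integer part: more digits may follow");
}
```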

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Byte
    participant Tokenizer
    participant LookupTables
    participant Parser

    Input->>Tokenizer: Provide next byte
    Tokenizer->>LookupTables: Check IS_WHITESPACE / IS_NON_ZERO_DIGIT / IS_HEX_DIGIT / ESCAPE_TOKENS
    LookupTables-->>Tokenizer: Return lookup result
    Tokenizer->>Parser: Emit EventToken or error based on lookup and state

Possibly related PRs

  • kaidokert/picojson-rs#33: Refactors Tokenizer to use constant lookup tables for number parsing state transitions, similar in approach to this PR's refactoring for character classification and escape sequences.

Poem

In the warren where tokens hop and leap,
Lookup tables now guard the JSON sheep.
No more matching in the runtime night—
Arrays bring answers, swift and bright!
With each byte checked in constant time,
Parsing JSON feels sublime.
🥕

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Clippy (1.86.0)
Updating crates.io index

warning: failed to write cache, path: /usr/local/registry/index/index.crates.io-1949cf8c6b5b557f/.cache/de/fm/defmt, error: Permission denied (os error 13)
Locking 68 packages to latest compatible versions
Downloading crates ...
Downloaded log v0.4.27
error: failed to create directory /usr/local/registry/cache/index.crates.io-1949cf8c6b5b557f

Caused by:
Permission denied (os error 13)


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a2064b and fe6ba02.

📒 Files selected for processing (1)
  • picojson/src/ujson/tokenizer/mod.rs (9 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.188Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.754Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:29.968Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.
picojson/src/ujson/tokenizer/mod.rs (3)
Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.188Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:29.968Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.754Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Tests (int32 + float-truncate, --no-default-features --features "int32,float-truncate")
  • GitHub Check: Tests (int32 + float-skip, --no-default-features --features "int32,float-skip")
  • GitHub Check: Tests (int64 + float-skip, --no-default-features --features "int64,float-skip")
  • GitHub Check: Tests (int64 + float-truncate, --no-default-features --features "int64,float-truncate")
  • GitHub Check: Tests (int8 + float-truncate, --no-default-features --features "int8,float-truncate")
  • GitHub Check: Tests (int64 + float-error, --no-default-features --features "int64,float-error")
  • GitHub Check: Tests (int32 + float-error, --no-default-features --features "int32,float-error")
  • GitHub Check: Tests (int32 + float, --no-default-features --features "int32,float")
  • GitHub Check: Tests (int8 + float-skip, --no-default-features --features "int8,float-skip")
  • GitHub Check: Tests (Default (int64 + float))
  • GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
  • GitHub Check: AVR Panic Prevention (int32, pico-tiny, release)
  • GitHub Check: AVR Panic Prevention (int32, pico-tiny, dev)
  • GitHub Check: AVR Panic Prevention (int32, pico-huge, release)
  • GitHub Check: AVR Panic Prevention (int8, pico-tiny, dev)
  • GitHub Check: AVR Panic Prevention (int8, pico-huge, dev)
  • GitHub Check: coverage
  • GitHub Check: AVR Panic Prevention (int8, pico-tiny, release)
  • GitHub Check: AVR Panic Prevention (int8, pico-huge, release)
  • GitHub Check: Run AVR Test Suites
  • GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
  • GitHub Check: AVR Panic Prevention (int32, pico-tiny, dev)
  • GitHub Check: AVR Panic Prevention (int32, pico-huge, release)
  • GitHub Check: AVR Panic Prevention (int32, pico-tiny, release)
  • GitHub Check: AVR Panic Prevention (int8, pico-huge, dev)
  • GitHub Check: AVR Panic Prevention (int8, pico-tiny, dev)
  • GitHub Check: AVR Panic Prevention (int8, pico-huge, release)
  • GitHub Check: AVR Panic Prevention (int8, pico-tiny, release)
  • GitHub Check: Run AVR Test Suites
🔇 Additional comments (5)
picojson/src/ujson/tokenizer/mod.rs (5)

285-297: Well-structured compile-time lookup table for escape sequences.

The ESCAPE_TOKENS table correctly maps all standard JSON escape characters to their corresponding tokens. The exclusion of 'u' is appropriate since unicode escapes require special handling to read the subsequent hex digits.
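If one wanted to pin that property down with a test, a generic helper along these lines would do it. This is a hypothetical addition written for illustration, not code from the PR, and it assumes the table (or one shaped like it) is reachable from a test module.

```rust
// Hypothetical check: returns true when an escape table maps exactly the eight
// simple JSON escapes to Some(_) and every other byte, including b'u', to None.
fn covers_exactly_the_simple_escapes<T>(table: &[Option<T>; 256]) -> bool {
    (0..=255u8).all(|b| {
        let expected = matches!(b, b'"' | b'\\' | b'/' | b'b' | b'f' | b'n' | b'r' | b't');
        table[b as usize].is_some() == expected
    })
}
```

A test could then assert covers_exactly_the_simple_escapes(&ESCAPE_TOKENS), assuming the table is visible to that test.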


310-337: Accurate character classification tables.

Both IS_WHITESPACE and IS_HEX_DIGIT correctly implement their respective character sets:

  • Whitespace includes all four JSON whitespace characters (space, tab, newline, carriage return)
  • Hex digits properly include both uppercase and lowercase variants

The loop-based construction is efficient and maintainable.
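For reference, a loop-based construction of this kind typically looks like the sketch below. This is an illustrative reconstruction, not the repository's code; note that a review comment later in this thread suggests avoiding the variable shadowing used here for brevity.

```rust
// Illustrative loop-based construction of a hex-digit table. Const initializers
// cannot use for loops, so while loops do the iteration.
const IS_HEX_DIGIT: [bool; 256] = {
    let mut table = [false; 256];

    let mut b = b'0';
    while b <= b'9' {
        table[b as usize] = true;
        b += 1;
    }

    // `b` is shadowed here purely to keep the sketch short.
    let mut b = b'a';
    while b <= b'f' {
        table[b as usize] = true;                 // lowercase a-f
        table[(b - b'a' + b'A') as usize] = true; // matching uppercase A-F
        b += 1;
    }

    table
};

fn main() {
    assert!(IS_HEX_DIGIT[b'7' as usize]);
    assert!(IS_HEX_DIGIT[b'c' as usize] && IS_HEX_DIGIT[b'C' as usize]);
    assert!(!IS_HEX_DIGIT[b'g' as usize]);
}
```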


690-712: Clean refactoring of escape sequence handling.

The two-step approach correctly handles both simple escapes (via lookup table) and unicode escapes (special case for 'u'). The error handling for invalid escape sequences is properly preserved.
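A minimal sketch of that two-step dispatch follows, under stated assumptions: the EventToken variants, the EscapeOutcome enum, and the classify_escape function are placeholders invented for illustration; only the ESCAPE_TOKENS name and the special-casing of b'u' reflect the PR.

```rust
// Placeholder subset of escape tokens; the crate's real EventToken differs.
#[derive(Clone, Copy)]
enum EventToken {
    EscapeQuote,
    EscapeBackslash,
    EscapeNewline,
    // ... the remaining simple escapes (/, b, f, r, t) would follow
}

// Step 1 data: simple escapes resolved purely by lookup. b'u' stays None on
// purpose, because \uXXXX needs four more hex digits before a token exists.
const ESCAPE_TOKENS: [Option<EventToken>; 256] = {
    let mut table = [None; 256];
    table[b'"' as usize] = Some(EventToken::EscapeQuote);
    table[b'\\' as usize] = Some(EventToken::EscapeBackslash);
    table[b'n' as usize] = Some(EventToken::EscapeNewline);
    table
};

// Hypothetical outcome for the character that follows a backslash.
enum EscapeOutcome {
    Simple(EventToken), // resolved by the table
    StartUnicode,       // branch into the \uXXXX hex-digit states
    Invalid,            // anything else is an error
}

// Step 2 logic: consult the table first, then special-case b'u'.
fn classify_escape(byte: u8) -> EscapeOutcome {
    if let Some(token) = ESCAPE_TOKENS[byte as usize] {
        EscapeOutcome::Simple(token)
    } else if byte == b'u' {
        EscapeOutcome::StartUnicode
    } else {
        EscapeOutcome::Invalid
    }
}

fn main() {
    assert!(matches!(classify_escape(b'n'), EscapeOutcome::Simple(EventToken::EscapeNewline)));
    assert!(matches!(classify_escape(b'u'), EscapeOutcome::StartUnicode));
    assert!(matches!(classify_escape(b'x'), EscapeOutcome::Invalid));
}
```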


622-622: Consistent whitespace detection using lookup table.

The replacement of runtime pattern matching with IS_WHITESPACE lookups improves performance while maintaining identical behavior.

Also applies to: 786-787


719-720: Efficient character classification using lookup tables.

All character classification checks have been consistently updated to use the compile-time lookup tables, replacing chains of comparisons with single constant-time table lookups while preserving the exact parsing logic.

Also applies to: 732-733, 742-743, 752-753, 882-883



@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @kaidokert, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes the ujson tokenizer by shifting character classification and escape sequence mapping from runtime computation to compile-time constant lookup tables. The change aims to cut the CPU cycles spent on character analysis during JSON parsing and make the tokenizer more efficient.

Highlights

  • Performance Optimization via Const Tables: Introduced const lookup tables (ESCAPE_TOKENS, IS_DIGIT, IS_WHITESPACE, IS_HEX_DIGIT) within picojson/src/ujson/tokenizer/mod.rs. These tables pre-compute character classifications and escape sequence mappings at compile time, reducing runtime overhead.
  • Runtime Logic Replacement: Replaced various runtime match statements, matches! macros, and byte range checks (e.g., b' ' | b'\t' | b'\n' | b'\r' or b'0'..=b'9') with efficient array lookups using the newly introduced const tables across the tokenizer's state machine.
  • Streamlined String and Unicode Parsing: Refactored the string escaping and Unicode escape sequence parsing logic to leverage the ESCAPE_TOKENS and IS_HEX_DIGIT tables, simplifying the code and improving character handling efficiency.


@sourcery-ai sourcery-ai bot left a comment


Hey @kaidokert - I've reviewed your changes and they look great!




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively replaces runtime character matching with compile-time lookup tables for escape sequences and character classes. This is a great improvement for both performance and code clarity. The changes are well-implemented. I have one suggestion to improve the maintainability of one of the new checks.

@kaidokert
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves performance by using compile-time lookup tables for character classification and simplifies escape sequence handling. A suggestion is provided to improve the readability of the hex digit lookup table generation by avoiding variable shadowing.

@kaidokert kaidokert merged commit da9039e into main Jul 5, 2025
54 checks passed
@kaidokert kaidokert deleted the moreconst branch July 5, 2025 04:06