Implement surrogate pairs #49

kaidokert · 2025-07-11T04:20:57Z

Summary by Sourcery

Implement full support for UTF-16 surrogate pairs in JSON Unicode escape processing across both pull and stream parsers.

New Features:

Add high/low surrogate detection and combination utilities for UTF-16 surrogate pairs
Support decoding of surrogate pairs in Unicode escape sequences with pending state tracking

Bug Fixes:

Reject incomplete or interrupted surrogate pairs and clear surrogate state on simple escapes

Enhancements:

Update EscapeProcessor and UnicodeEscapeCollector to return optional UTF-8 output and pending surrogate state
Adjust process_unicode_escape_sequence to handle surrogate pairs and correct escape start positions

Tests:

Introduce comprehensive unit tests for surrogate detection, combination, and escape processing
Add integration tests and examples covering valid surrogate pairs, error cases, and buffer boundary scenarios

Summary by CodeRabbit

New Features
- Improved Unicode handling with full support for UTF-16 surrogate pairs in JSON string parsing.
- Added comprehensive tests to ensure correct parsing of surrogate pairs, including edge cases and buffer boundary scenarios.
- Introduced a new example demonstrating surrogate pair decoding.
Bug Fixes
- Enhanced validation to detect and reject malformed or incomplete surrogate pairs during parsing.
Tests
- Expanded test coverage for valid and invalid surrogate pair sequences across different parser configurations.

sourcery-ai · 2025-07-11T04:21:01Z

Reviewer's Guide

This PR augments JSON escape processing to fully support UTF-16 surrogate pairs. It introduces utilities for detecting and combining high/low surrogates, refactors the core escape routines and collector to track pending high surrogates and emit optional UTF-8 output, adjusts both stream and slice parsers to reset or validate surrogate state around simple escapes and at string/key boundaries, and adds extensive unit and example tests to cover correct decoding and error cases across buffer boundaries.

Sequence diagram for surrogate pair handling in Unicode escape processing

sequenceDiagram
    participant Parser
    participant UnicodeEscapeCollector
    participant EscapeProcessor
    participant Buffer

    Parser->>UnicodeEscapeCollector: add_hex_digit(digit)
    Parser->>UnicodeEscapeCollector: process_to_utf8(&mut buffer)
    UnicodeEscapeCollector->>EscapeProcessor: process_unicode_escape(hex_slice, utf8_buffer, pending_high_surrogate)
    alt No pending high surrogate
        EscapeProcessor->>EscapeProcessor: is_high_surrogate(codepoint)
        alt High surrogate
            EscapeProcessor-->>UnicodeEscapeCollector: (None, Some(high_surrogate))
        else Not a surrogate
            EscapeProcessor-->>UnicodeEscapeCollector: (Some(utf8_bytes), None)
        end
    else Pending high surrogate
        EscapeProcessor->>EscapeProcessor: is_low_surrogate(codepoint)
        alt Low surrogate
            EscapeProcessor->>EscapeProcessor: combine_surrogate_pair(high, low)
            EscapeProcessor-->>UnicodeEscapeCollector: (Some(utf8_bytes), None)
        else Not a low surrogate
            EscapeProcessor-->>UnicodeEscapeCollector: Error
        end
    end
    UnicodeEscapeCollector-->>Parser: (Option<&[u8]>, bool)
    alt UTF-8 bytes present
        Parser->>Buffer: Write utf8_bytes
    else High surrogate pending
        Parser->>Buffer: Wait for next escape
    end

Class diagram for updated Unicode escape processing and surrogate pair support

classDiagram
    class EscapeProcessor {
        +is_high_surrogate(codepoint: u32) bool
        +is_low_surrogate(codepoint: u32) bool
        +combine_surrogate_pair(high: u32, low: u32) Result<u32, ParseError>
        +process_unicode_escape<'a>(hex_slice: &[u8], utf8_buffer: &'a mut [u8], pending_high_surrogate: Option<u32>) Result<(Option<&'a [u8]>, Option<u32>), ParseError>
    }

    class UnicodeEscapeCollector {
        -hex_buffer: [u8; 4]
        -hex_pos: usize
        -pending_high_surrogate: Option<u32>
        +reset()
        +reset_all()
        +add_hex_digit(digit: u8) Result<bool, ParseError>
        +process_to_utf8<'a>(utf8_buffer: &'a mut [u8]) Result<(Option<&'a [u8]>, bool), ParseError>
        +has_pending_high_surrogate() bool
    }

    EscapeProcessor <.. UnicodeEscapeCollector : uses

Class diagram for parser changes related to surrogate pair validation

classDiagram
    class StreamParser {
        -unicode_escape_collector: UnicodeEscapeCollector
        +handle_simple_escape(escape_token)
        +next_event()
    }
    class SliceParser {
        -unicode_escape_collector: UnicodeEscapeCollector
        +handle_simple_escape(escape_char)
        +handle_unicode_escape()
    }
    StreamParser o-- UnicodeEscapeCollector
    SliceParser o-- UnicodeEscapeCollector

File-Level Changes

Change	Details	Files
Introduce surrogate detection and combination utilities in EscapeProcessor	Add is_high_surrogate and is_low_surrogate methods Add combine_surrogate_pair method with validation	`picojson/src/escape_processor.rs`
Enhance EscapeProcessor::process_unicode_escape to support surrogate pairs	Change signature to accept pending_high_surrogate Branch logic for pending, high, low, and regular codepoints Return optional UTF-8 bytes and new pending surrogate	`picojson/src/escape_processor.rs`
Extend UnicodeEscapeCollector to maintain pending surrogate state	Add pending_high_surrogate field and has_pending_high_surrogate method Update reset, add reset_all to clear/retain state appropriately Modify process_to_utf8 to propagate surrogate state changes	`picojson/src/escape_processor.rs`
Update shared escape sequence handler to adjust start position and return optional bytes	Change return type to include Option<&[u8]> Adjust escape_start_pos when completing surrogate pairs Incorporate surrogate state when processing sequence	`picojson/src/escape_processor.rs`
Integrate surrogate pair logic in StreamParser	Clear pending state on simple escapes Skip or copy bytes based on pending surrogate flag Check for unclosed high surrogate at string/key termination	`picojson/src/stream_parser.rs`
Integrate surrogate pair logic in SliceParser	Reset surrogate state on simple escapes Validate no pending high surrogate before ending strings/keys Handle optional UTF-8 output for surrogate pairs	`picojson/src/slice_parser.rs`
Add comprehensive surrogate pair tests and example	Add tests/surrogate_pairs.rs covering detection, combination, parser behavior, errors and buffer boundaries Add examples/test_surrogate_pairs.rs demonstrating correct decoding	`picojson/tests/surrogate_pairs.rs` `picojson/examples/test_surrogate_pairs.rs`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

coderabbitai · 2025-07-11T04:21:03Z

Walkthrough

This update adds comprehensive support for UTF-16 surrogate pairs in JSON string parsing. It introduces new logic for detecting, combining, and validating surrogate pairs, updates escape processing and parser internals, and adds extensive tests to ensure correct handling—including edge cases, error conditions, and buffer boundary scenarios—across both slice-based and streaming parsers. Additionally, a new example demonstrates surrogate pair parsing, and minor internal code simplifications are applied.

Changes

File(s)	Change Summary
picojson/examples/test_surrogate_pairs.rs	Added an executable example to test parsing and decoding of Unicode surrogate pairs in JSON strings.
picojson/src/escape_processor.rs	Enhanced Unicode escape processing: added surrogate pair detection/combination, state tracking, and new methods.
picojson/src/slice_parser.rs	Updated escape handling to track surrogate pair state and validate sequences during parsing.
picojson/src/stream_parser.rs	Added checks for incomplete surrogate pairs, improved buffer handling, and updated error reporting.
picojson/tests/surrogate_pairs.rs	Introduced comprehensive tests for surrogate pair handling, including edge cases and buffer boundary scenarios.
picojson/src/stream_buffer.rs	Simplified internal buffer copy logic by removing redundant pattern matching and conditions.

Sequence Diagram(s)

sequenceDiagram
    participant Input as JSON Input
    participant Parser as Slice/Stream Parser
    participant EscapeProc as EscapeProcessor
    participant Collector as UnicodeEscapeCollector

    Input->>Parser: Provide JSON string with \\uXXXX escapes
    Parser->>EscapeProc: Detect \\uXXXX, check for surrogate
    EscapeProc->>Collector: Track pending high surrogate (if any)
    alt High surrogate found
        Collector->>Parser: Wait for next \\uXXXX (low surrogate)
        Parser->>EscapeProc: Provide next \\uXXXX
        EscapeProc->>Collector: Combine surrogates, output UTF-8
    else No surrogate or completed pair
        EscapeProc->>Collector: Output UTF-8 bytes
    end
    Collector->>Parser: Return decoded UTF-8
    Parser->>Input: Emit string event or error (if invalid sequence)

Poem

In the warren of bytes where the surrogates play,
We’ve taught our parser to hop the right way.
High and low, they pair in delight,
UTF-8 carrots gleam in the night.
No more lone bunnies lost in the code—
Every emoji finds its Unicode abode! 🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b584d3 and 1edc230.

📒 Files selected for processing (4)

picojson/src/escape_processor.rs (13 hunks)
picojson/src/stream_buffer.rs (2 hunks)
picojson/src/stream_parser.rs (8 hunks)
picojson/tests/surrogate_pairs.rs (1 hunks)

✅ Files skipped from review due to trivial changes (1)

picojson/src/stream_buffer.rs

🚧 Files skipped from review as they are similar to previous changes (1)

picojson/src/stream_parser.rs

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

picojson/tests/surrogate_pairs.rs (4)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/lib.rs:19-21
Timestamp: 2025-06-28T18:14:22.871Z
Learning: In Rust crate organization, functions can be imported into lib.rs via `use` statements not for direct usage within lib.rs itself, but to make them available to other modules within the same crate via the `crate::` path. This is a valid pattern and such imports should not be flagged as unused even if lib.rs doesn't directly call them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).

picojson/src/escape_processor.rs (2)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Run AVR Test Suites
GitHub Check: Run AVR Test Suites

🔇 Additional comments (9)

picojson/tests/surrogate_pairs.rs (4)

12-50: Well-structured test fixture with proper event comparison logic.

The test fixture correctly handles the nuanced comparison needed for different Event types, particularly distinguishing between string content vs. event type matching. The use of std::mem::discriminant for type comparison is appropriate.

52-67: Robust error testing pattern.

The error fixture properly continues parsing until an error is encountered, ensuring it catches the specific error condition rather than stopping at the first successful event. This is important for testing scenarios where errors occur mid-parsing.

69-94: Proper buffer management without memory leaks.

The helper functions correctly manage buffer ownership by accepting mutable references from the caller and clearing buffers before use. This avoids the memory leak issues that were previously identified in past reviews.

96-442: Comprehensive test coverage for surrogate pair handling.

The test suite provides excellent coverage across multiple dimensions:

Valid surrogate pairs (basic, musical symbols, multiple pairs)

Error cases (lone surrogates, interrupted pairs, invalid combinations)

Edge cases (boundary values, pathological cases)

Buffer boundary conditions (small chunks, split across boundaries)

Integration scenarios (object keys, nested structures)

All tests appropriately verify behavior across different parser configurations (SliceParser, StreamParser with various chunk sizes), ensuring consistent behavior regardless of implementation.

picojson/src/escape_processor.rs (5)

105-124: Accurate surrogate pair detection and combination logic.

The implementation correctly identifies high surrogates (0xD800-0xDBFF) and low surrogates (0xDC00-0xDFFF) per the UTF-16 specification. The combination formula 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF) properly implements the Unicode surrogate pair decoding algorithm.

126-182: Well-designed Unicode escape processing with surrogate support.

The modified process_unicode_escape method elegantly handles the complexity of surrogate pairs by:

Accepting optional pending high surrogate state

Returning a tuple of optional UTF-8 bytes and optional pending high surrogate

Properly deferring output when a high surrogate is encountered

Validating surrogate pair sequences and rejecting invalid combinations

The logic correctly handles all cases: regular Unicode characters, high surrogates (deferred), low surrogates (combined or errored), and validation of surrogate sequences.

187-268: Robust state management for surrogate pair processing.

The UnicodeEscapeCollector enhancements properly manage surrogate state:

pending_high_surrogate field tracks state across escape sequences

reset() preserves surrogate state (as documented) while reset_all() clears it

process_to_utf8() returns both UTF-8 bytes and state change information

has_pending_high_surrogate() provides state inspection capability

The distinction between partial and full reset is crucial for proper surrogate pair handling across multiple escape sequences.

642-700: Intelligent escape position tracking for surrogate pairs.

The modified process_unicode_escape_sequence function cleverly adjusts the escape start position when completing a surrogate pair, ensuring error reporting points to the high surrogate rather than the low surrogate. This provides better user experience by highlighting the beginning of the surrogate pair sequence.

507-639: Comprehensive test coverage for surrogate pair functionality.

The test suite thoroughly validates:

Basic surrogate pair detection and combination

State management across multiple escape sequences

Error cases (lone surrogates, invalid combinations)

Integration with the collector's processing methods

UTF-8 encoding verification for combined surrogate pairs

The tests verify both the low-level utility functions and the higher-level processing logic, ensuring robust validation of the surrogate pair implementation.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

gemini-code-assist

Summary of Changes

Hello @kaidokert, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for JSON Unicode surrogate pairs, allowing the parser to correctly interpret and decode characters outside the Basic Multilingual Plane (BMP) that are represented by two UTF-16 escape sequences. The changes involve fundamental updates to the Unicode escape processing logic, robust state management for multi-part sequences, and comprehensive error handling for malformed or interrupted pairs. This significantly enhances the parser's compliance with the JSON specification for string handling.

Highlights

Core Logic Enhancement: Introduced new functions (is_high_surrogate, is_low_surrogate, combine_surrogate_pair) and significantly refactored process_unicode_escape in escape_processor.rs. This enables the parser to correctly identify, combine, and UTF-8 encode two-part UTF-16 surrogate pairs (e.g., \uD801\uDC37) into single Unicode codepoints.
Robust State Management: Enhanced the UnicodeEscapeCollector to track a pending_high_surrogate state. This allows the parser to correctly handle multi-part Unicode escape sequences, ensuring that a high surrogate is properly followed by a low surrogate to form a valid character.
Improved Error Handling: Implemented comprehensive error detection for invalid surrogate sequences. This includes cases like lone low surrogates, high surrogates not followed by a valid low surrogate, or surrogate pairs that are interrupted by other escape sequences (e.g., \n, \t) or string terminations. The reset_all() method was added to UnicodeEscapeCollector to facilitate clearing state upon such interruptions.
Unified Parser Integration: Updated both SliceParser and StreamParser to integrate the new surrogate pair processing logic. This ensures consistent and correct behavior for parsing strings and object keys containing surrogate pairs, including careful management of internal buffers and byte consumption, especially for stream-based parsing across chunk boundaries.
Extensive Test Coverage: Added a new, comprehensive test suite (picojson/tests/surrogate_pairs.rs) to validate the surrogate pair implementation. These tests cover various valid scenarios, edge cases, and critical error conditions, ensuring the fix works reliably across different parser types and input configurations, including very small buffer sizes and chunking strategies.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

sourcery-ai

Hey @kaidokert - I've reviewed your changes - here's some feedback:

Consider extracting the surrogate‐pair handling out of the main process_unicode_escape hot path into a smaller helper or state machine to reduce branching overhead for the common single‐codepoint case.
The UnicodeEscapeCollector::reset method doesn’t clear pending surrogate state (you need reset_all), so renaming or clarifying those methods could help avoid accidental state retention.
This PR touches a lot of surface area—splitting the core surrogate‐pair implementation, parser integrations, and the large test suite into separate PRs would make review and future maintenance much easier.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Consider extracting the surrogate‐pair handling out of the main `process_unicode_escape` hot path into a smaller helper or state machine to reduce branching overhead for the common single‐codepoint case.
- The `UnicodeEscapeCollector::reset` method doesn’t clear pending surrogate state (you need `reset_all`), so renaming or clarifying those methods could help avoid accidental state retention.
- This PR touches a lot of surface area—splitting the core surrogate‐pair implementation, parser integrations, and the large test suite into separate PRs would make review and future maintenance much easier.

## Individual Comments

### Comment 1
<location> `picojson/src/stream_parser.rs:622` </location>
<code_context>
-            if !in_escape && self.stream_buffer.has_unescaped_content() {
+            // Skip writing bytes to escape buffer when we have a pending high surrogate
+            // (prevents literal \uD801 text from being included in final string)
+            if !in_escape
+                && self.stream_buffer.has_unescaped_content()
+                && !self.unicode_escape_collector.has_pending_high_surrogate()
+            {
                 self.append_byte_to_escape_buffer(byte)?;
+            } else if !in_escape && self.stream_buffer.has_unescaped_content() {
             }
         }
</code_context>

<issue_to_address>
The else branch after the main if block is empty and can be removed.

Removing the empty else if branch will make the code clearer and less confusing.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
+            } else if !in_escape && self.stream_buffer.has_unescaped_content() {
+            }
=======
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

picojson/src/stream_parser.rs

gemini-code-assist

Code Review

This pull request introduces comprehensive support for UTF-16 surrogate pairs, enhancing the library's ability to parse a wider range of JSON data. The changes include new functions for surrogate pair detection and combination, as well as updates to the escape processing logic. The addition of new unit and integration tests ensures the correctness and robustness of the implementation.

I've provided feedback on improving code clarity and addressing memory leaks in the test suite.

picojson/src/stream_parser.rs

picojson/tests/surrogate_pairs.rs

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

picojson/src/slice_parser.rs (1)
239-256: Consider improving clarity of the two-phase surrogate pair consumption.

The logic correctly handles surrogate pair completion by consuming both unicode escapes separately, but the complexity could be reduced with clearer documentation or helper methods. The magic number 6 represents the length of \uXXXX and could be defined as a constant for better maintainability.

Consider extracting the magic number to a constant:
+    const UNICODE_ESCAPE_LENGTH: usize = 6; // Length of \uXXXX
+
     fn process_unicode_escape_with_collector(&mut self) -> Result<(), ParseError> {
Then use it in line 248:
-                    .handle_unicode_escape(escape_start_pos + 6, utf8_bytes)?;
+                    .handle_unicode_escape(escape_start_pos + UNICODE_ESCAPE_LENGTH, utf8_bytes)?;
picojson/tests/surrogate_pairs.rs (1)
70-90: Note: Test helpers intentionally leak memory for lifetime management.

The use of leak() in test helpers creates static references, which is acceptable for tests but results in memory that's never freed. This is a common pattern in Rust tests where 'static lifetimes are needed.

Consider adding a comment to clarify the intentional memory leak:
 fn create_slice_parser(input: &str) -> SliceParser {
     let scratch = vec![0u8; 1024];
+    // Intentionally leak memory to get 'static lifetime for tests
     SliceParser::with_buffer(input, scratch.leak())
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02c2b7a and 7b584d3.

📒 Files selected for processing (5)

picojson/examples/test_surrogate_pairs.rs (1 hunks)
picojson/src/escape_processor.rs (12 hunks)
picojson/src/slice_parser.rs (5 hunks)
picojson/src/stream_parser.rs (7 hunks)
picojson/tests/surrogate_pairs.rs (1 hunks)

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

picojson/src/slice_parser.rs (4)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.783Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).

picojson/tests/surrogate_pairs.rs (3)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).

picojson/examples/test_surrogate_pairs.rs (4)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.783Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).

picojson/src/stream_parser.rs (3)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.783Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.

picojson/src/escape_processor.rs (1)

Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.

🧬 Code Graph Analysis (1)

picojson/src/slice_parser.rs (2)

picojson/src/escape_processor.rs (1)

process_unicode_escape_sequence (659-700)

picojson/src/slice_input_buffer.rs (1)

current_pos (45-47)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
GitHub Check: AVR Panic Prevention (int8, pico-tiny, dev)
GitHub Check: AVR Panic Prevention (int32, pico-tiny, dev)
GitHub Check: AVR Panic Prevention (int8, pico-tiny, release)
GitHub Check: AVR Panic Prevention (int32, pico-tiny, release)
GitHub Check: AVR Panic Prevention (int32, pico-huge, release)
GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
GitHub Check: AVR Panic Prevention (int8, pico-huge, dev)
GitHub Check: Run AVR Test Suites
GitHub Check: Run AVR Test Suites

🔇 Additional comments (16)

picojson/src/escape_processor.rs (5)

105-124: LGTM! Correct UTF-16 surrogate pair implementation.

The surrogate detection functions and combination logic correctly follow the UTF-16 specification.

126-182: Well-structured surrogate pair state machine.

The function correctly handles all surrogate pair combinations and error cases. The state transitions are clear and follow the UTF-16 specification.

194-268: Clean API design for surrogate state tracking.

The split between reset() and reset_all() correctly handles the need to preserve surrogate state across escape sequences while allowing full reset on errors.

507-639: Excellent test coverage for surrogate pair functionality.

The tests thoroughly validate surrogate detection, combination, state management, and error handling.

642-700: Smart position tracking for surrogate pairs.

The adjustment of escape_start_pos when completing a surrogate pair (line 694) is a clever solution that ensures error messages and position tracking correctly reference the start of the entire surrogate pair sequence.

picojson/examples/test_surrogate_pairs.rs (1)

1-69: Well-structured example demonstrating surrogate pair support.

The example effectively shows practical usage and validates correct UTF-8 encoding of surrogate pairs. Good choice of test cases covering both common emojis and less common musical symbols.

picojson/src/stream_parser.rs (4)

332-336: Proper validation of surrogate pair completeness.

The checks correctly ensure that strings and keys cannot end with an incomplete surrogate pair, maintaining UTF-16 validity.

Also applies to: 376-380

620-628: Correct buffer management for surrogate pairs.

The logic properly prevents intermediate surrogate text from appearing in the final output, ensuring clean UTF-8 encoding.

Also applies to: 655-661

685-687: Proper handling of escape sequence interruptions.

Resetting surrogate state on simple escapes correctly rejects interrupted surrogate pairs, and the code properly handles the case when no UTF-8 bytes are emitted while waiting for a low surrogate.

Also applies to: 725-743

1813-1854: Excellent edge case testing for buffer boundaries.

The tests thoroughly validate that surrogate pairs are correctly handled even when split across very small buffer chunks, ensuring the streaming implementation is robust.

picojson/src/slice_parser.rs (4)

179-179: Comment simplification looks good.

The simplified comment is clearer and the delimiter logic is self-explanatory from the code.

211-213: Critical fix for surrogate pair interruption handling.

Resetting the Unicode escape collector when encountering simple escapes ensures that interrupted surrogate pairs (like \uD801\n\uDC37) are properly rejected. This is essential for spec-compliant Unicode handling.

433-435: Proper validation of surrogate pairs in JSON keys.

The check for incomplete surrogate pairs before finalizing keys ensures that malformed Unicode sequences are rejected, maintaining JSON spec compliance.

450-452: Consistent surrogate pair validation for string values.

The validation mirrors the key handling, ensuring complete Unicode compliance across all string contexts.

picojson/tests/surrogate_pairs.rs (2)

92-267: Excellent comprehensive test coverage for surrogate pairs.

The tests thoroughly cover:

Valid surrogate pair scenarios

Edge cases near surrogate boundaries

Error conditions including interrupted pairs

Various escape sequences that should reset surrogate state

This provides strong confidence in the implementation's correctness.

269-341: Thorough buffer boundary and edge case testing.

The tests effectively stress the buffer management by:

Splitting surrogate pairs across chunk boundaries

Using extremely small chunk sizes

Testing pathological cases

Verifying complex nested structures

This ensures robust handling across different parsing configurations.

kaidokert · 2025-07-11T04:59:43Z

/gemini review

gemini-code-assist

Code Review

The pull request introduces robust UTF-16 surrogate pair support with well-structured code and extensive tests. Addressing the high-severity issues related to non-adjacent surrogate pairs is crucial for ensuring correct parsing and preventing data loss.

picojson/src/slice_parser.rs

picojson/src/stream_parser.rs

kaidokert · 2025-07-11T05:01:45Z

@coderabbitai review

coderabbitai · 2025-07-11T05:01:51Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Implement surrogate pairs

7b584d3

gemini-code-assist bot reviewed Jul 11, 2025

View reviewed changes

sourcery-ai bot reviewed Jul 11, 2025

View reviewed changes

picojson/src/stream_parser.rs Outdated Show resolved Hide resolved

gemini-code-assist bot reviewed Jul 11, 2025

View reviewed changes

picojson/src/stream_parser.rs Outdated Show resolved Hide resolved

picojson/tests/surrogate_pairs.rs Outdated Show resolved Hide resolved

picojson/tests/surrogate_pairs.rs Outdated Show resolved Hide resolved

picojson/tests/surrogate_pairs.rs Outdated Show resolved Hide resolved

coderabbitai bot reviewed Jul 11, 2025

View reviewed changes

Comment feedback

1edc230

gemini-code-assist bot reviewed Jul 11, 2025

View reviewed changes

picojson/src/slice_parser.rs Show resolved Hide resolved

picojson/src/stream_parser.rs Show resolved Hide resolved

kaidokert merged commit 4c98a51 into main Jul 11, 2025
70 checks passed

kaidokert deleted the surrogates branch July 11, 2025 05:19

coderabbitai bot mentioned this pull request Jul 14, 2025

Another big refactor #56

Closed

Implement surrogate pairs #49

Implement surrogate pairs #49

Uh oh!

Conversation

kaidokert commented Jul 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Summary by CodeRabbit

Uh oh!

sourcery-ai bot commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for surrogate pair handling in Unicode escape processing

Class diagram for updated Unicode escape processing and surrogate pair support

Class diagram for parser changes related to surrogate pair validation

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

coderabbitai bot commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

kaidokert commented Jul 11, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

kaidokert commented Jul 11, 2025

Uh oh!

coderabbitai bot commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

kaidokert commented Jul 11, 2025 •

edited by coderabbitai bot

Loading

sourcery-ai bot commented Jul 11, 2025 •

edited

Loading

coderabbitai bot commented Jul 11, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)