Implement surrogate pairs #49
Reviewer's Guide
This PR augments JSON escape processing to fully support UTF-16 surrogate pairs. It introduces utilities for detecting and combining high/low surrogates, refactors the core escape routines and collector to track pending high surrogates and emit optional UTF-8 output, adjusts both stream and slice parsers to reset or validate surrogate state around simple escapes and at string/key boundaries, and adds extensive unit and example tests to cover correct decoding and error cases across buffer boundaries.
Sequence diagram for surrogate pair handling in Unicode escape processing
sequenceDiagram
participant Parser
participant UnicodeEscapeCollector
participant EscapeProcessor
participant Buffer
Parser->>UnicodeEscapeCollector: add_hex_digit(digit)
Parser->>UnicodeEscapeCollector: process_to_utf8(&mut buffer)
UnicodeEscapeCollector->>EscapeProcessor: process_unicode_escape(hex_slice, utf8_buffer, pending_high_surrogate)
alt No pending high surrogate
EscapeProcessor->>EscapeProcessor: is_high_surrogate(codepoint)
alt High surrogate
EscapeProcessor-->>UnicodeEscapeCollector: (None, Some(high_surrogate))
else Not a surrogate
EscapeProcessor-->>UnicodeEscapeCollector: (Some(utf8_bytes), None)
end
else Pending high surrogate
EscapeProcessor->>EscapeProcessor: is_low_surrogate(codepoint)
alt Low surrogate
EscapeProcessor->>EscapeProcessor: combine_surrogate_pair(high, low)
EscapeProcessor-->>UnicodeEscapeCollector: (Some(utf8_bytes), None)
else Not a low surrogate
EscapeProcessor-->>UnicodeEscapeCollector: Error
end
end
UnicodeEscapeCollector-->>Parser: (Option<&[u8]>, bool)
alt UTF-8 bytes present
Parser->>Buffer: Write utf8_bytes
else High surrogate pending
Parser->>Buffer: Wait for next escape
end
Class diagram for updated Unicode escape processing and surrogate pair support
classDiagram
class EscapeProcessor {
+is_high_surrogate(codepoint: u32) bool
+is_low_surrogate(codepoint: u32) bool
+combine_surrogate_pair(high: u32, low: u32) Result<u32, ParseError>
+process_unicode_escape<'a>(hex_slice: &[u8], utf8_buffer: &'a mut [u8], pending_high_surrogate: Option<u32>) Result<(Option<&'a [u8]>, Option<u32>), ParseError>
}
class UnicodeEscapeCollector {
-hex_buffer: [u8; 4]
-hex_pos: usize
-pending_high_surrogate: Option<u32>
+reset()
+reset_all()
+add_hex_digit(digit: u8) Result<bool, ParseError>
+process_to_utf8<'a>(utf8_buffer: &'a mut [u8]) Result<(Option<&'a [u8]>, bool), ParseError>
+has_pending_high_surrogate() bool
}
EscapeProcessor <.. UnicodeEscapeCollector : uses
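The state machine in the sequence diagram and the signatures in the class diagram above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the `ParseError` variants, the `parse_hex4` helper, and the choice to reject a lone low surrogate in this function are assumptions made for the sketch.

```rust
/// Hypothetical error type for this sketch; the real `ParseError` is not shown on this page.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    InvalidHex,
    InvalidSurrogate,
    BufferTooSmall,
}

pub fn is_high_surrogate(codepoint: u32) -> bool {
    (0xD800..=0xDBFF).contains(&codepoint)
}

pub fn is_low_surrogate(codepoint: u32) -> bool {
    (0xDC00..=0xDFFF).contains(&codepoint)
}

/// Combine a high/low surrogate pair into a single Unicode scalar value.
pub fn combine_surrogate_pair(high: u32, low: u32) -> Result<u32, ParseError> {
    if !is_high_surrogate(high) || !is_low_surrogate(low) {
        return Err(ParseError::InvalidSurrogate);
    }
    Ok(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
}

/// Hypothetical helper: parse exactly four ASCII hex digits.
fn parse_hex4(hex: &[u8]) -> Result<u32, ParseError> {
    if hex.len() != 4 {
        return Err(ParseError::InvalidHex);
    }
    let mut value = 0u32;
    for &b in hex {
        let digit = (b as char).to_digit(16).ok_or(ParseError::InvalidHex)?;
        value = (value << 4) | digit;
    }
    Ok(value)
}

/// Returns `(utf8_bytes, pending_high_surrogate)`:
/// - `(None, Some(high))` when a high surrogate was seen and a low one must follow,
/// - `(Some(bytes), None)` when a complete character was encoded.
pub fn process_unicode_escape<'a>(
    hex_slice: &[u8],
    utf8_buffer: &'a mut [u8],
    pending_high_surrogate: Option<u32>,
) -> Result<(Option<&'a [u8]>, Option<u32>), ParseError> {
    let codepoint = parse_hex4(hex_slice)?;
    let scalar = match pending_high_surrogate {
        // A pending high surrogate must be completed by a low surrogate.
        Some(high) => combine_surrogate_pair(high, codepoint)?,
        None if is_high_surrogate(codepoint) => return Ok((None, Some(codepoint))),
        // Assumption: a low surrogate with nothing pending is rejected here.
        None if is_low_surrogate(codepoint) => return Err(ParseError::InvalidSurrogate),
        None => codepoint,
    };
    let ch = char::from_u32(scalar).ok_or(ParseError::InvalidSurrogate)?;
    if utf8_buffer.len() < ch.len_utf8() {
        return Err(ParseError::BufferTooSmall);
    }
    let encoded = ch.encode_utf8(utf8_buffer);
    Ok((Some(encoded.as_bytes()), None))
}
```

Returning the pending surrogate to the caller, rather than storing it inside the function, keeps the function stateless; in the class diagram that state lives in `UnicodeEscapeCollector`.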
Class diagram for parser changes related to surrogate pair validation
classDiagram
class StreamParser {
-unicode_escape_collector: UnicodeEscapeCollector
+handle_simple_escape(escape_token)
+next_event()
}
class SliceParser {
-unicode_escape_collector: UnicodeEscapeCollector
+handle_simple_escape(escape_char)
+handle_unicode_escape()
}
StreamParser o-- UnicodeEscapeCollector
SliceParser o-- UnicodeEscapeCollector
Walkthrough
This update adds comprehensive support for UTF-16 surrogate pairs in JSON string parsing. It introduces new logic for detecting, combining, and validating surrogate pairs, updates escape processing and parser internals, and adds extensive tests to ensure correct handling (including edge cases, error conditions, and buffer boundary scenarios) across both slice-based and streaming parsers. Additionally, a new example demonstrates surrogate pair parsing, and minor internal code simplifications are applied.
Sequence Diagram(s)
sequenceDiagram
participant Input as JSON Input
participant Parser as Slice/Stream Parser
participant EscapeProc as EscapeProcessor
participant Collector as UnicodeEscapeCollector
Input->>Parser: Provide JSON string with \\uXXXX escapes
Parser->>EscapeProc: Detect \\uXXXX, check for surrogate
EscapeProc->>Collector: Track pending high surrogate (if any)
alt High surrogate found
Collector->>Parser: Wait for next \\uXXXX (low surrogate)
Parser->>EscapeProc: Provide next \\uXXXX
EscapeProc->>Collector: Combine surrogates, output UTF-8
else No surrogate or completed pair
EscapeProc->>Collector: Output UTF-8 bytes
end
Collector->>Parser: Return decoded UTF-8
Parser->>Input: Emit string event or error (if invalid sequence)
Summary of Changes
Hello @kaidokert, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces full support for JSON Unicode surrogate pairs, allowing the parser to correctly interpret and decode characters outside the Basic Multilingual Plane (BMP) that are represented by two UTF-16 escape sequences. The changes involve fundamental updates to the Unicode escape processing logic, robust state management for multi-part sequences, and comprehensive error handling for malformed or interrupted pairs. This significantly enhances the parser's compliance with the JSON specification for string handling.
Highlights
- Core Logic Enhancement: Introduced new functions (`is_high_surrogate`, `is_low_surrogate`, `combine_surrogate_pair`) and significantly refactored `process_unicode_escape` in `escape_processor.rs`. This enables the parser to correctly identify, combine, and UTF-8 encode two-part UTF-16 surrogate pairs (e.g., `\uD801\uDC37`) into single Unicode codepoints.
- Robust State Management: Enhanced the `UnicodeEscapeCollector` to track a `pending_high_surrogate` state. This allows the parser to correctly handle multi-part Unicode escape sequences, ensuring that a high surrogate is properly followed by a low surrogate to form a valid character.
- Improved Error Handling: Implemented comprehensive error detection for invalid surrogate sequences. This includes cases like lone low surrogates, high surrogates not followed by a valid low surrogate, or surrogate pairs that are interrupted by other escape sequences (e.g., `\n`, `\t`) or string terminations. The `reset_all()` method was added to `UnicodeEscapeCollector` to facilitate clearing state upon such interruptions.
- Unified Parser Integration: Updated both `SliceParser` and `StreamParser` to integrate the new surrogate pair processing logic. This ensures consistent and correct behavior for parsing strings and object keys containing surrogate pairs, including careful management of internal buffers and byte consumption, especially for stream-based parsing across chunk boundaries.
- Extensive Test Coverage: Added a new, comprehensive test suite (`picojson/tests/surrogate_pairs.rs`) to validate the surrogate pair implementation. These tests cover various valid scenarios, edge cases, and critical error conditions, ensuring the fix works reliably across different parser types and input configurations, including very small buffer sizes and chunking strategies.
Hey @kaidokert - I've reviewed your changes - here's some feedback:
- Consider extracting the surrogate-pair handling out of the main `process_unicode_escape` hot path into a smaller helper or state machine to reduce branching overhead for the common single-codepoint case.
- The `UnicodeEscapeCollector::reset` method doesn’t clear pending surrogate state (you need `reset_all`), so renaming or clarifying those methods could help avoid accidental state retention.
- This PR touches a lot of surface area; splitting the core surrogate-pair implementation, parser integrations, and the large test suite into separate PRs would make review and future maintenance much easier.
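The difference between `reset` and `reset_all` raised in the second point can be illustrated with a stripped-down collector. This is a sketch under assumptions: the field names follow the class diagram, but the method bodies, the `ParseError` variant, and the `set_pending_high_surrogate` and `hex_digits` helpers are hypothetical and exist only to make the example self-contained.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    InvalidUnicodeHex,
}

#[derive(Default)]
pub struct UnicodeEscapeCollector {
    hex_buffer: [u8; 4],
    hex_pos: usize,
    pending_high_surrogate: Option<u32>,
}

impl UnicodeEscapeCollector {
    /// Clears only the hex digit state, so the collector is ready for the
    /// next `\uXXXX`. A pending high surrogate is deliberately kept.
    pub fn reset(&mut self) {
        self.hex_pos = 0;
    }

    /// Clears the hex digit state and drops any pending high surrogate.
    /// Intended for interruptions such as a simple escape or an error.
    pub fn reset_all(&mut self) {
        self.hex_pos = 0;
        self.pending_high_surrogate = None;
    }

    /// Stores one hex digit; returns `Ok(true)` once four digits are collected.
    pub fn add_hex_digit(&mut self, digit: u8) -> Result<bool, ParseError> {
        if !digit.is_ascii_hexdigit() || self.hex_pos >= self.hex_buffer.len() {
            return Err(ParseError::InvalidUnicodeHex);
        }
        self.hex_buffer[self.hex_pos] = digit;
        self.hex_pos += 1;
        Ok(self.hex_pos == self.hex_buffer.len())
    }

    /// The digits collected so far (hypothetical accessor for the sketch).
    pub fn hex_digits(&self) -> &[u8] {
        &self.hex_buffer[..self.hex_pos]
    }

    pub fn has_pending_high_surrogate(&self) -> bool {
        self.pending_high_surrogate.is_some()
    }

    /// Hypothetical setter standing in for the escape-processing step.
    pub fn set_pending_high_surrogate(&mut self, high: u32) {
        self.pending_high_surrogate = Some(high);
    }
}
```

Doc comments like these are one way to address the naming concern without a rename; whether the PR documents or renames the methods is not shown here.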
## Individual Comments
### Comment 1
<location> `picojson/src/stream_parser.rs:622` </location>
<code_context>
- if !in_escape && self.stream_buffer.has_unescaped_content() {
+ // Skip writing bytes to escape buffer when we have a pending high surrogate
+ // (prevents literal \uD801 text from being included in final string)
+ if !in_escape
+ && self.stream_buffer.has_unescaped_content()
+ && !self.unicode_escape_collector.has_pending_high_surrogate()
+ {
self.append_byte_to_escape_buffer(byte)?;
+ } else if !in_escape && self.stream_buffer.has_unescaped_content() {
}
}
</code_context>
<issue_to_address>
The else branch after the main if block is empty and can be removed.
Removing the empty else if branch will make the code clearer and less confusing.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
+ } else if !in_escape && self.stream_buffer.has_unescaped_content() {
+ }
=======
>>>>>>> REPLACE
</suggested_fix>
Code Review
This pull request introduces comprehensive support for UTF-16 surrogate pairs, enhancing the library's ability to parse a wider range of JSON data. The changes include new functions for surrogate pair detection and combination, as well as updates to the escape processing logic. The addition of new unit and integration tests ensures the correctness and robustness of the implementation.
I've provided feedback on improving code clarity and addressing memory leaks in the test suite.
Actionable comments posted: 0
🧹 Nitpick comments (2)
picojson/src/slice_parser.rs (1)
239-256: Consider improving clarity of the two-phase surrogate pair consumption.
The logic correctly handles surrogate pair completion by consuming both unicode escapes separately, but the complexity could be reduced with clearer documentation or helper methods. The magic number `6` represents the length of `\uXXXX` and could be defined as a constant for better maintainability.
Consider extracting the magic number to a constant:
+ const UNICODE_ESCAPE_LENGTH: usize = 6; // Length of \uXXXX
+
  fn process_unicode_escape_with_collector(&mut self) -> Result<(), ParseError> {
Then use it in line 248:
- .handle_unicode_escape(escape_start_pos + 6, utf8_bytes)?;
+ .handle_unicode_escape(escape_start_pos + UNICODE_ESCAPE_LENGTH, utf8_bytes)?;
picojson/tests/surrogate_pairs.rs (1)
70-90: Note: Test helpers intentionally leak memory for lifetime management.
The use of `leak()` in test helpers creates static references, which is acceptable for tests but results in memory that's never freed. This is a common pattern in Rust tests where `'static` lifetimes are needed.
Consider adding a comment to clarify the intentional memory leak:
  fn create_slice_parser(input: &str) -> SliceParser {
      let scratch = vec![0u8; 1024];
+     // Intentionally leak memory to get 'static lifetime for tests
      SliceParser::with_buffer(input, scratch.leak())
  }
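The leak pattern discussed in the second nitpick relies on `Vec::leak`, which consumes the vector and returns a mutable slice that is never freed. The sketch below shows that pattern next to a non-leaking alternative; `leaked_scratch` and `with_scratch` are hypothetical helper names, and neither uses picojson's `SliceParser`.

```rust
/// Leak a zeroed scratch buffer to obtain a `'static` slice.
/// The allocation is never freed, which is usually acceptable in short-lived tests.
pub fn leaked_scratch(len: usize) -> &'static mut [u8] {
    // Intentionally leak memory to get a 'static lifetime for tests.
    vec![0u8; len].leak()
}

/// Non-leaking alternative: keep the buffer owned by this function and lend it
/// to a closure, so it is freed when the closure returns.
pub fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    let mut scratch = vec![0u8; len];
    f(scratch.as_mut_slice())
}
```

The closure form only works when the value built from the buffer does not need to outlive the call, which is presumably why the test helpers that return a parser use the leaking form instead.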
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- picojson/examples/test_surrogate_pairs.rs (1 hunks)
- picojson/src/escape_processor.rs (12 hunks)
- picojson/src/slice_parser.rs (5 hunks)
- picojson/src/stream_parser.rs (7 hunks)
- picojson/tests/surrogate_pairs.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: stax/src/flex_parser.rs:0-0
Timestamp: 2025-06-28T18:12:30.015Z
Learning: In the stax JSON parser codebase, EscapeSequence event handlers exist in flex_parser.rs not because they're needed by that parser variant, but to avoid catch-all patterns in match statements. The flex parser doesn't need to process EscapeSequence events, but the other parser variant (direct parser) does need them.
Listed under: 📓 Common learnings, picojson/src/slice_parser.rs, picojson/tests/surrogate_pairs.rs, picojson/examples/test_surrogate_pairs.rs, picojson/src/stream_parser.rs, picojson/src/escape_processor.rs
Learnt from: kaidokert
PR: kaidokert/picojson-rs#5
File: picojson/src/lib.rs:0-0
Timestamp: 2025-06-29T17:48:18.198Z
Learning: In the picojson-rs project, the `use tokenizer as ujson;` alias in lib.rs is a transitionary and fully internal private alias used during crate reorganization. Examples and external code no longer depend on this alias, making the private visibility appropriate.
Listed under: picojson/src/slice_parser.rs, picojson/tests/surrogate_pairs.rs, picojson/examples/test_surrogate_pairs.rs, picojson/src/stream_parser.rs
Learnt from: kaidokert
PR: kaidokert/picojson-rs#1
File: tokenizer/src/bitstack/mod.rs:0-0
Timestamp: 2025-06-28T23:43:22.783Z
Learning: In the picojson-rs project, the BitStack trait was redesigned to return bool instead of Option<bool> for pop() and top() methods. Empty stacks return false rather than None, which simplifies the API and avoids Option handling.
Listed under: picojson/src/slice_parser.rs, picojson/examples/test_surrogate_pairs.rs, picojson/src/stream_parser.rs
Learnt from: kaidokert
PR: kaidokert/picojson-rs#44
File: picojson/src/chunk_reader.rs:28-33
Timestamp: 2025-07-07T01:39:55.177Z
Learning: The PullParser trait in picojson-rs provides both next() and next_event() methods. The next() method is an iterator-like convenience method that returns Option<Result<Event, ParseError>>, returning None when EndDocument is reached. The next_event() method returns Result<Event, ParseError> directly. Both methods are valid and the choice depends on whether you want iterator-style usage (next) or direct result handling (next_event).
Listed under: picojson/src/slice_parser.rs, picojson/tests/surrogate_pairs.rs, picojson/examples/test_surrogate_pairs.rs
🧬 Code Graph Analysis (1)
picojson/src/slice_parser.rs (2)
- picojson/src/escape_processor.rs (1): process_unicode_escape_sequence (659-700)
- picojson/src/slice_input_buffer.rs (1): current_pos (45-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
- GitHub Check: AVR Panic Prevention (int8, pico-tiny, dev)
- GitHub Check: AVR Panic Prevention (int32, pico-tiny, dev)
- GitHub Check: AVR Panic Prevention (int8, pico-tiny, release)
- GitHub Check: AVR Panic Prevention (int32, pico-tiny, release)
- GitHub Check: AVR Panic Prevention (int32, pico-huge, release)
- GitHub Check: AVR Panic Prevention (int32, pico-huge, dev)
- GitHub Check: AVR Panic Prevention (int8, pico-huge, dev)
- GitHub Check: Run AVR Test Suites
- GitHub Check: Run AVR Test Suites
🔇 Additional comments (16)
picojson/src/escape_processor.rs (5)
105-124: LGTM! Correct UTF-16 surrogate pair implementation.
The surrogate detection functions and combination logic correctly follow the UTF-16 specification.
126-182: Well-structured surrogate pair state machine.
The function correctly handles all surrogate pair combinations and error cases. The state transitions are clear and follow the UTF-16 specification.
194-268: Clean API design for surrogate state tracking.
The split between `reset()` and `reset_all()` correctly handles the need to preserve surrogate state across escape sequences while allowing full reset on errors.
507-639: Excellent test coverage for surrogate pair functionality.
The tests thoroughly validate surrogate detection, combination, state management, and error handling.
642-700: Smart position tracking for surrogate pairs.
The adjustment of `escape_start_pos` when completing a surrogate pair (line 694) is a clever solution that ensures error messages and position tracking correctly reference the start of the entire surrogate pair sequence.
picojson/examples/test_surrogate_pairs.rs (1)
1-69: Well-structured example demonstrating surrogate pair support.
The example effectively shows practical usage and validates correct UTF-8 encoding of surrogate pairs. Good choice of test cases covering both common emojis and less common musical symbols.
picojson/src/stream_parser.rs (4)
332-336: Proper validation of surrogate pair completeness.
The checks correctly ensure that strings and keys cannot end with an incomplete surrogate pair, maintaining UTF-16 validity.
Also applies to: 376-380
620-628: Correct buffer management for surrogate pairs.
The logic properly prevents intermediate surrogate text from appearing in the final output, ensuring clean UTF-8 encoding.
Also applies to: 655-661
685-687: Proper handling of escape sequence interruptions.
Resetting surrogate state on simple escapes correctly rejects interrupted surrogate pairs, and the code properly handles the case when no UTF-8 bytes are emitted while waiting for a low surrogate.
Also applies to: 725-743
1813-1854: Excellent edge case testing for buffer boundaries.
The tests thoroughly validate that surrogate pairs are correctly handled even when split across very small buffer chunks, ensuring the streaming implementation is robust.
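The three behaviors noted for the stream parser (rejecting a string that ends mid-pair, rejecting a pair interrupted by a simple escape, and surviving arbitrary chunk boundaries) can be illustrated with a small stand-alone unescaper. This is not the `StreamParser` from the PR: `StreamingUnescaper`, `DecodeError`, and `unescape_in_chunks` are hypothetical names, only `\n`, `\t`, and `\uXXXX` escapes are handled, and a `std` environment is assumed.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    InvalidHex,
    UnsupportedEscape,
    LoneLowSurrogate,
    IncompleteSurrogatePair,
    TruncatedEscape,
    InvalidUtf8,
}

#[derive(Clone, Copy)]
enum State {
    Plain,
    Backslash,
    Hex { value: u32, digits: u8 },
}

pub struct StreamingUnescaper {
    state: State,
    pending_high: Option<u32>,
    out: Vec<u8>,
}

impl StreamingUnescaper {
    pub fn new() -> Self {
        Self { state: State::Plain, pending_high: None, out: Vec::new() }
    }

    /// Feed one chunk; all state persists across calls, so a chunk boundary
    /// may fall anywhere, including inside `\uXXXX` or between the two halves.
    pub fn feed(&mut self, chunk: &[u8]) -> Result<(), DecodeError> {
        for &b in chunk {
            let state = self.state;
            self.state = match state {
                State::Plain if b == b'\\' => State::Backslash,
                State::Plain => {
                    self.reject_pending()?; // a plain byte interrupts a pair
                    self.out.push(b);
                    State::Plain
                }
                State::Backslash => match b {
                    b'u' => State::Hex { value: 0, digits: 0 },
                    b'n' | b't' => {
                        self.reject_pending()?; // a simple escape interrupts a pair
                        self.out.push(if b == b'n' { b'\n' } else { b'\t' });
                        State::Plain
                    }
                    _ => return Err(DecodeError::UnsupportedEscape),
                },
                State::Hex { value, digits } => {
                    let d = (b as char).to_digit(16).ok_or(DecodeError::InvalidHex)?;
                    let value = (value << 4) | d;
                    if digits == 3 {
                        self.finish_code_unit(value)?;
                        State::Plain
                    } else {
                        State::Hex { value, digits: digits + 1 }
                    }
                }
            };
        }
        Ok(())
    }

    fn reject_pending(&self) -> Result<(), DecodeError> {
        if self.pending_high.is_some() { Err(DecodeError::IncompleteSurrogatePair) } else { Ok(()) }
    }

    fn finish_code_unit(&mut self, unit: u32) -> Result<(), DecodeError> {
        let scalar = match (self.pending_high.take(), unit) {
            (Some(high), 0xDC00..=0xDFFF) => 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00),
            (Some(_), _) => return Err(DecodeError::IncompleteSurrogatePair),
            (None, 0xD800..=0xDBFF) => {
                self.pending_high = Some(unit); // emit nothing until the low half arrives
                return Ok(());
            }
            (None, 0xDC00..=0xDFFF) => return Err(DecodeError::LoneLowSurrogate),
            (None, _) => unit,
        };
        // `scalar` is never a surrogate here, so the conversion is not expected to fail.
        let ch = char::from_u32(scalar).ok_or(DecodeError::InvalidHex)?;
        let mut buf = [0u8; 4];
        self.out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
        Ok(())
    }

    /// End of string: a pending high surrogate or half-read escape is an error.
    pub fn finish(self) -> Result<String, DecodeError> {
        self.reject_pending()?;
        match self.state {
            State::Plain => String::from_utf8(self.out).map_err(|_| DecodeError::InvalidUtf8),
            _ => Err(DecodeError::TruncatedEscape),
        }
    }
}

/// Run the unescaper over `input` split into chunks of `chunk_size` bytes.
pub fn unescape_in_chunks(input: &[u8], chunk_size: usize) -> Result<String, DecodeError> {
    let mut decoder = StreamingUnescaper::new();
    for chunk in input.chunks(chunk_size.max(1)) {
        decoder.feed(chunk)?;
    }
    decoder.finish()
}
```

Because nothing is written to the output while a high surrogate is pending, the literal `\uD83D` text never reaches the decoded string, which is the same property the buffer-management comment describes.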
picojson/src/slice_parser.rs (4)
179-179: Comment simplification looks good.
The simplified comment is clearer and the delimiter logic is self-explanatory from the code.
211-213: Critical fix for surrogate pair interruption handling.
Resetting the Unicode escape collector when encountering simple escapes ensures that interrupted surrogate pairs (like `\uD801\n\uDC37`) are properly rejected. This is essential for spec-compliant Unicode handling.
433-435: Proper validation of surrogate pairs in JSON keys.
The check for incomplete surrogate pairs before finalizing keys ensures that malformed Unicode sequences are rejected, maintaining JSON spec compliance.
450-452: Consistent surrogate pair validation for string values.
The validation mirrors the key handling, ensuring complete Unicode compliance across all string contexts.
picojson/tests/surrogate_pairs.rs (2)
92-267: Excellent comprehensive test coverage for surrogate pairs.
The tests thoroughly cover:
- Valid surrogate pair scenarios
- Edge cases near surrogate boundaries
- Error conditions including interrupted pairs
- Various escape sequences that should reset surrogate state
This provides strong confidence in the implementation's correctness.
269-341: Thorough buffer boundary and edge case testing.
The tests effectively stress the buffer management by:
- Splitting surrogate pairs across chunk boundaries
- Using extremely small chunk sizes
- Testing pathological cases
- Verifying complex nested structures
This ensures robust handling across different parsing configurations.
/gemini review
Code Review
The pull request introduces robust UTF-16 surrogate pair support with well-structured code and extensive tests. Addressing the high-severity issues related to non-adjacent surrogate pairs is crucial for ensuring correct parsing and preventing data loss.
@coderabbitai review
✅ Actions performed
Review triggered.
Summary by Sourcery
Implement full support for UTF-16 surrogate pairs in JSON Unicode escape processing across both pull and stream parsers.