perf: eliminate hot-path allocations in chunker core#11
Merged
Conversation
- Replace regex-based `default_length_counter` with a single-pass char walk that tracks in_ws/started state: zero allocations, one pass instead of regex replace + char scan. - Eliminate `format!` in `set_length` by counting breadcrumb and text independently; the \n\n separator contributes exactly 1 char under whitespace-normalization. - Phase 3 hierarchical merge now consumes `result` by value via `mem::take + into_iter().peekable()`, eliminating 6×N deep Chunk clones that occurred even when no merging happened. - Replace `format!` merge bodies in Phase 2 and Phase 3 with `push_str` against existing String capacity; drop `"#".repeat()` in favour of a const-sliced `HASHES` string. - Drop vestigial `parent_headers.clone()` in Phase 3 (borrow directly). - Collect only `(start, end, &str)` triples from HEADER_REGEX instead of full Captures objects; use `bytes()` for the `#` level count. - Drop intermediate `Vec<&str>` from paragraph-split loops. - Replace `assert!` in N-API binding with `Err(...)` so an oversized chunk surfaces as a Promise rejection instead of aborting the process. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
package/src/lib.rs (1)
50-54: ⚡ Quick winConsider including the actual length value in the error message for better debugging.
Including
c.lengthwould help users understand how far they exceeded the limit, consistent with the error message pattern on line 84.📊 Proposed enhancement to include actual length
.map(|c| { if c.length > u32::MAX as usize { return Err(Error::from_reason( - "chunk length exceeds u32::MAX; docs >4 GiB unsupported on Node binding", + format!( + "chunk length {} exceeds u32::MAX; docs >4 GiB unsupported on Node binding", + c.length + ), )); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@package/src/lib.rs` around lines 50 - 54, The error message returned from Error::from_reason when checking if c.length > u32::MAX should include the actual c.length value to aid debugging; update the check in package/src/lib.rs (the branch using c.length and Error::from_reason) to format the message with both c.length and the u32::MAX limit (or 4 GiB) so callers see how large the chunk was in the error text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crate/src/utils.rs`:
- Around line 11-13: The length calculation overcounts when chunk.text
normalizes to empty: instead of always adding a separator with b + 1 + t, only
add the separator when both breadcrumb and text are non-empty. Update the
assignment that uses default_length_counter(&chunk.breadcrumb) and
default_length_counter(&chunk.text) so chunk.length = if b == 0 { t } else if t
== 0 { b } else { b + 1 + t }, ensuring the separator is only counted when t >
0.
---
Nitpick comments:
In `@package/src/lib.rs`:
- Around line 50-54: The error message returned from Error::from_reason when
checking if c.length > u32::MAX should include the actual c.length value to aid
debugging; update the check in package/src/lib.rs (the branch using c.length and
Error::from_reason) to format the message with both c.length and the u32::MAX
limit (or 4 GiB) so callers see how large the chunk was in the error text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b7f25ee6-258c-486a-a47b-ef12a244104b
📒 Files selected for processing (5)
crate/src/merge.rscrate/src/split.rscrate/src/tokens.rscrate/src/utils.rspackage/src/lib.rs
Gate the +1 separator on both sides being non-zero so the result matches whitespace-normalized concatenation for all inputs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fix already applied in commit 089c4ff — CodeRabbit approved in subsequent review.
Co-authored-by: gemini-code-assist <gemini-code-assist@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The three-branch else-if form introduced an unreachable line that dropped coverage below 99%. The two-branch || form is semantically equivalent, stays on one line (rustfmt-clean), and both arms are exercised. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
default_length_counterrewritten (tokens.rs): single-pass char walk within_ws/startedstate flags — zero allocations, no regex, one scan instead of regex-replace + char-count.set_lengthallocation eliminated (utils.rs): counts breadcrumb and text independently; the\n\nseparator contributes exactly 1 under whitespace-normalization, so noformat!needed.merge.rs):resultis now consumed by value viamem::take + into_iter().peekable()— removes the 6×N fullChunkclones that occurred even on no-op passes.format!→push_strin merge bodies (merge.rs): reuses existingStringcapacity instead of allocating a fresh one per merge;"#".repeat(level)replaced with a const-sliced"######".split.rshot-path cleanup: collect only(start, end, &str)triples fromHEADER_REGEXinstead of fullCaptures; usebytes()for the#level count; drop intermediateVec<&str>from paragraph-split loops.package/src/lib.rs): replaceassert!(which aborts the Node process) withErr(...)so an oversized chunk surfaces as a Promise rejection.None of these change the public API or algorithm.
Test plan
cargo test— 101 tests passcargo llvm-cov --package breadchunks --show-missing-lines --fail-under-lines 99— 99.14% line coverage, exit 0cargo checkinpackage/— compiles clean🤖 Generated with Claude Code