fix(node-adapter): handle utf-8 characters split across stream chunks #1496
dinwwwh merged 3 commits into middleapi:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves how the server processes incoming request bodies by addressing an issue where multi-byte UTF-8 characters could be corrupted if they were split across data chunks during streaming. By adopting an incremental decoding mechanism, the change ensures the integrity of text-based data, particularly internationalized content, improving the reliability and correctness of data handling.
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Replaces naive chunk-to-string concatenation with incremental UTF-8 decoding via TextDecoder at runtime, and adds tests that verify multi-byte UTF-8 characters split across chunk boundaries, including a case with an intentionally incomplete final byte sequence yielding the replacement character.
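The incremental decoding described above can be sketched as follows (an illustrative standalone example, not the actual `body.ts` implementation; `decodeChunks` is a hypothetical name):

```typescript
// Decode a stream of byte chunks incrementally so multi-byte UTF-8
// sequences split across chunk boundaries are reassembled correctly.
function decodeChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder('utf-8')
  let text = ''
  for (const chunk of chunks) {
    // { stream: true } buffers any trailing incomplete byte sequence
    // instead of emitting a replacement character for it
    text += decoder.decode(chunk, { stream: true })
  }
  // the final call flushes the buffer; an incomplete tail becomes U+FFFD
  text += decoder.decode()
  return text
}

// each character of '长江' is 3 bytes in UTF-8; cut inside the second one
const bytes = new TextEncoder().encode('长江')
const parts = [bytes.slice(0, 4), bytes.slice(4)]
console.log(decodeChunks(parts)) // → '长江'
```

`TextDecoder` is available as a global in Node.js as well as other runtimes, which is what makes this variant runtime-agnostic.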
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Code Review
This pull request correctly addresses an issue with decoding UTF-8 characters that are split across stream chunks by using StringDecoder. The added tests are a great way to verify the fix and prevent future regressions. I found a minor issue in one of the new test cases that will cause it to fail, and I've provided a suggestion to fix it. Otherwise, the changes look good.
🧹 Nitpick comments (2)
packages/standard-server-node/src/body.test.ts (2)
77-91: Optional: tighten JSON expectation to strict structural equality.

Because JSON serialization drops undefined fields, this test can be made stricter and less matcher-dependent.

Suggested tightening:
```diff
 it('json with utf-8 characters split across chunk boundaries', async () => {
   const original = {
     json: {
       text: '滚滚长江东逝水',
     },
-    meta: undefined,
   }
@@
-  expect(standardBody).toEqual(original)
+  expect(standardBody).toStrictEqual(original)
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/standard-server-node/src/body.test.ts` around lines 77 - 91, The test expects the original object to equal the parsed JSON but includes a field with value undefined (meta) which JSON.stringify drops, so update the expectation to match the actual parsed structure; for example, remove the meta: undefined from the original test data or replace the expect(standardBody).toEqual(original) with expect(standardBody).toEqual(JSON.parse(JSON.stringify(original))). Locate the test using splitBufferInsideCharacter, createChunkedRequest and toStandardBody and make the expected value strictly reflect JSON.parse(JSON.stringify(original)) so the assertion uses structural equality with no undefined fields.
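The serialization behavior behind this nitpick can be demonstrated in isolation (a standalone sketch; `original` and `roundTripped` are illustrative names mirroring the test fixture):

```typescript
// JSON.stringify drops properties whose value is undefined, so the
// parsed result is no longer strictly equal to the original object.
const original = { json: { text: '滚滚长江东逝水' }, meta: undefined }
const roundTripped = JSON.parse(JSON.stringify(original))

console.log('meta' in original)     // → true
console.log('meta' in roundTripped) // → false
```

This is why `toStrictEqual` (which distinguishes `{ meta: undefined }` from `{}`) fails against the parsed body while `toEqual` passes.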
27-40: Optional: enforce multi-byte split intent in helper.

splitBufferInsideCharacter only truly splits inside a character when splitCharacter is multi-byte. A guard would prevent accidentally weak test setups.

Suggested hardening:
```diff
 function splitBufferInsideCharacter(text: string, splitCharacter: string): Buffer[] {
   const buffer = Buffer.from(text)
   const splitBytes = Buffer.from(splitCharacter)
+  if (splitBytes.length < 2) {
+    throw new Error(`split character must be multi-byte UTF-8: ${splitCharacter}`)
+  }
   const splitIndex = buffer.indexOf(splitBytes)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/standard-server-node/src/body.test.ts` around lines 27 - 40, The helper splitBufferInsideCharacter currently allows single-byte splitCharacter which means it doesn't actually exercise "splitting inside a character"; update splitBufferInsideCharacter to assert or throw if the provided splitCharacter encodes to a single byte (e.g. check Buffer.from(splitCharacter).length <= 1) so callers must pass a multi-byte split token, and adjust tests to supply a multi-byte splitCharacter where needed; keep the same return behavior and error message for split not found but add the new guard early in the function.
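Put together, the hardened helper could look like this (a sketch of the suggested hardening; the function name matches the test file, but the body here is an assumption, not the actual source):

```typescript
// Split a string's UTF-8 bytes so the boundary falls inside the given
// character, forcing the decoder to handle a partial multi-byte sequence.
function splitBufferInsideCharacter(text: string, splitCharacter: string): Buffer[] {
  const buffer = Buffer.from(text)
  const splitBytes = Buffer.from(splitCharacter)

  // guard: a single-byte character can never be split "inside"
  if (splitBytes.length < 2) {
    throw new Error(`split character must be multi-byte UTF-8: ${splitCharacter}`)
  }

  const splitIndex = buffer.indexOf(splitBytes)
  if (splitIndex === -1) {
    throw new Error(`split character not found: ${splitCharacter}`)
  }

  // cut one byte into the split character so it straddles the boundary
  const cut = splitIndex + 1
  return [buffer.subarray(0, cut), buffer.subarray(cut)]
}

const [head, tail] = splitBufferInsideCharacter('滚滚长江东逝水', '江')
console.log(Buffer.concat([head, tail]).toString('utf8')) // → '滚滚长江东逝水'
```

Decoding `head` and `tail` independently would produce replacement characters, which is exactly the condition the tests need to exercise.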
ℹ️ Review info

⚙️ Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID: 175ec3a2-8ea4-4c4e-a1a0-2396ee0a697f

📒 Files selected for processing (2):

- packages/standard-server-node/src/body.test.ts
- packages/standard-server-node/src/body.ts
The issue was caused by decoding each incoming stream chunk independently with chunk.toString(). For UTF-8 text, a single character can span multiple chunks, so a multi-byte sequence cut at a chunk boundary was decoded into replacement characters. This change fixes that by using a streaming UTF-8 decoder (StringDecoder) so incomplete multi-byte sequences are buffered until the remaining bytes arrive.
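A minimal illustration of the failure mode and the buffering fix (a standalone sketch using Node's `string_decoder`, not the actual `body.ts` code):

```typescript
import { StringDecoder } from 'node:string_decoder'

// '水' is 3 bytes in UTF-8 (0xE6 0xB0 0xB4); split it mid-character
const bytes = Buffer.from('水')
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)]

// naive per-chunk decoding emits replacement characters
const naive = chunks.map(c => c.toString('utf8')).join('')
console.log(naive === '水') // → false

// StringDecoder buffers the incomplete tail until the rest arrives
const decoder = new StringDecoder('utf8')
const safe = chunks.map(c => decoder.write(c)).join('') + decoder.end()
console.log(safe === '水') // → true
```

`decoder.write()` returns only complete characters and holds back trailing partial bytes; `decoder.end()` flushes whatever remains, substituting U+FFFD for any incomplete sequence.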
Thanks @zxhyc131 for the awesome PR! I made a small change: instead of using StringDecoder, the merged version decodes incrementally with TextDecoder.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary

- Use StringDecoder to decode stream content safely across chunk boundaries

Problem

Direct chunk.toString() decoding can corrupt UTF-8 characters when multi-byte sequences are split across stream chunks.

Testing
Summary by CodeRabbit

Bug Fixes

- Fixed corruption of multi-byte UTF-8 characters split across stream chunks by decoding incrementally instead of converting each chunk independently.

Tests

- Added tests verifying multi-byte UTF-8 characters split across chunk boundaries, including an intentionally incomplete final byte sequence yielding the replacement character.