
fix(node-adapter): handle utf-8 characters split across stream chunks#1496

Merged
dinwwwh merged 3 commits into middleapi:main from zxhyc131:fix/utf8-chunk-decoding
Mar 24, 2026

Conversation


@zxhyc131 zxhyc131 commented Mar 24, 2026

Summary

  • use StringDecoder to decode stream content safely across chunk boundaries
  • add tests for JSON and text bodies with UTF-8 characters split between chunks

Problem

Direct chunk.toString() decoding can corrupt UTF-8 characters when multi-byte sequences are split across stream chunks.
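The corruption is easy to reproduce. The sketch below is a minimal demonstration (not the adapter's actual code): a three-byte UTF-8 character is split across two chunks, decoded first naively and then with a streaming decoder.

```typescript
import { Buffer } from 'node:buffer'

// '滚' encodes to three UTF-8 bytes <e6 bb 9a>; split them across two "chunks".
const bytes = Buffer.from('滚')
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)]

// Per-chunk toString() decodes each fragment independently and corrupts it:
const naive = chunks.map(c => c.toString('utf8')).join('') // '\uFFFD\uFFFD'

// A streaming decoder buffers the incomplete sequence until the rest arrives:
const decoder = new TextDecoder('utf-8')
let text = ''
for (const c of chunks) {
  text += decoder.decode(c, { stream: true })
}
text += decoder.decode() // flush any remaining buffered bytes
// text === '滚'
```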

Testing

  • added unit tests for chunk-boundary UTF-8 decoding in JSON body
  • added unit tests for chunk-boundary UTF-8 decoding in text body

Summary by CodeRabbit

  • Bug Fixes

    • Improved UTF‑8 decoding for streamed request bodies so multi‑byte characters split across chunks are decoded correctly and incomplete final bytes are handled gracefully (replacement character used).
  • Tests

    • Added tests validating JSON and plain‑text payloads with multi‑byte UTF‑8 characters split across stream boundaries, including an incomplete final byte scenario.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 24, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the server's ability to process incoming request bodies by addressing a critical issue where multi-byte UTF-8 characters could be corrupted if they were split across different data chunks during streaming. By adopting a more sophisticated decoding mechanism, the change ensures the integrity of all text-based data, particularly for internationalized content, thereby improving the reliability and correctness of data handling.

Highlights

  • UTF-8 Character Handling: Implemented the use of StringDecoder to correctly handle UTF-8 characters that may be split across stream chunk boundaries, preventing data corruption.
  • Improved Test Coverage: Added new unit tests specifically designed to validate the robust decoding of UTF-8 characters in both JSON and plain text bodies when they are fragmented across stream chunks.


coderabbitai bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between f03d49a and c7b1e64.

📒 Files selected for processing (2)
  • packages/standard-server-node/src/body.test.ts
  • packages/standard-server-node/src/body.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/standard-server-node/src/body.test.ts

📝 Walkthrough

Walkthrough

Replaces naive chunk-to-string concatenation with incremental UTF-8 decoding via TextDecoder at runtime and adds tests that verify multi-byte UTF-8 characters split across chunk boundaries, including a case with an intentionally incomplete final byte sequence yielding the replacement character.

Changes

Cohort / File(s) Summary
Tests — chunked UTF-8 cases
packages/standard-server-node/src/body.test.ts
Added helpers to construct an IncomingMessage from Buffer chunks and to split a UTF‑8 string into two Buffer parts at a split character; added tests for application/json and text/plain where multi‑byte characters span chunks, including a case with a final incomplete UTF‑8 sequence expecting the replacement character (�).
Runtime — UTF‑8 incremental decoding
packages/standard-server-node/src/body.ts
Rewrote _streamToString to use TextDecoder with { stream: true } to incrementally decode normalized Buffer chunks, accumulate decoded text, and finalize with decoder.decode(); changed Buffer import to a value import (import { Buffer } from 'node:buffer') for runtime usage.
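The rewrite described above can be sketched roughly as follows. This is an assumed shape, not the repository's actual source: the real `_streamToString` normalizes chunks through the adapter's own helpers, which are simplified away here.

```typescript
import { Buffer } from 'node:buffer'

// Sketch: incrementally decode UTF-8 across chunk boundaries with TextDecoder.
async function streamToString(stream: AsyncIterable<Buffer | string>): Promise<string> {
  const decoder = new TextDecoder('utf-8')
  let text = ''
  for await (const chunk of stream) {
    const buffer = typeof chunk === 'string' ? Buffer.from(chunk) : chunk
    // { stream: true } buffers a trailing incomplete byte sequence between calls
    text += decoder.decode(buffer, { stream: true })
  }
  // Final flush: a dangling incomplete sequence becomes the replacement character
  text += decoder.decode()
  return text
}
```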

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibble bytes split in two,
TextDecoder stitches them through,
When a rune ends half-way, I sigh,
It blooms as � beneath my eye,
Little rabbit hops, tests say woo-hoo. 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Description Check — ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
Title Check — ✅ Passed — The title 'fix: handle utf-8 characters split across stream chunks' directly and clearly summarizes the main change: fixing a UTF-8 decoding issue where multi-byte characters split across stream chunks were corrupted.


@dosubot dosubot bot added bug Something isn't working javascript Pull requests that update javascript code labels Mar 24, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses an issue with decoding UTF-8 characters that are split across stream chunks by using StringDecoder. The added tests are a great way to verify the fix and prevent future regressions. I've found a minor issue in one of the new test cases that will cause it to fail, and I've provided a suggestion to fix it. Otherwise, the changes look good.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
packages/standard-server-node/src/body.test.ts (2)

77-91: Optional: tighten JSON expectation to strict structural equality.

Because JSON serialization drops undefined fields, this test can be made stricter and less matcher-dependent.

Suggested tightening
   it('json with utf-8 characters split across chunk boundaries', async () => {
     const original = {
       json: {
         text: '滚滚长江东逝水',
       },
-      meta: undefined,
     }
@@
-    expect(standardBody).toEqual(original)
+    expect(standardBody).toStrictEqual(original)
   })

27-40: Optional: enforce multi-byte split intent in helper.

splitBufferInsideCharacter only truly splits inside a character when splitCharacter is multi-byte. A guard would prevent accidental weak test setups.

Suggested hardening
 function splitBufferInsideCharacter(text: string, splitCharacter: string): Buffer[] {
   const buffer = Buffer.from(text)
   const splitBytes = Buffer.from(splitCharacter)
+  if (splitBytes.length < 2) {
+    throw new Error(`split character must be multi-byte UTF-8: ${splitCharacter}`)
+  }
   const splitIndex = buffer.indexOf(splitBytes)
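For context, a hypothetical reconstruction of the helper under discussion, with the suggested multi-byte guard folded in. The name matches the review comment, but the body is an assumption, not the repository's actual test code.

```typescript
import { Buffer } from 'node:buffer'

// Split a UTF-8 string into two Buffers, cutting inside splitCharacter's bytes
// so that its multi-byte sequence spans both parts.
function splitBufferInsideCharacter(text: string, splitCharacter: string): Buffer[] {
  const buffer = Buffer.from(text)
  const splitBytes = Buffer.from(splitCharacter)
  if (splitBytes.length < 2) {
    throw new Error(`split character must be multi-byte UTF-8: ${splitCharacter}`)
  }
  const splitIndex = buffer.indexOf(splitBytes)
  if (splitIndex === -1) {
    throw new Error(`split character not found: ${splitCharacter}`)
  }
  // Cut one byte into the character so its byte sequence straddles the boundary.
  const cut = splitIndex + 1
  return [buffer.subarray(0, cut), buffer.subarray(cut)]
}
```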

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between f30210a and 4d9a7ef.

📒 Files selected for processing (2)
  • packages/standard-server-node/src/body.test.ts
  • packages/standard-server-node/src/body.ts

@zxhyc131
Contributor Author

The issue was caused by decoding each incoming stream chunk independently with chunk.toString(). In UTF-8, a single character can span multiple bytes, and those bytes may be split across chunk boundaries. When that happens, decoding each chunk separately produces broken text or replacement characters, which can corrupt plain-text bodies and make JSON parsing fail.

This change fixes that by using a streaming UTF-8 decoder (StringDecoder), so incomplete multi-byte sequences are buffered until the remaining bytes arrive in the next chunk. That preserves the original text across chunk boundaries.
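A minimal sketch of that buffering behavior, using Node's StringDecoder as the PR originally did (illustrative only, not the adapter's code):

```typescript
import { Buffer } from 'node:buffer'
import { StringDecoder } from 'node:string_decoder'

// '江' encodes to three UTF-8 bytes; feed them across two write() calls.
const bytes = Buffer.from('江')
const decoder = new StringDecoder('utf8')

let text = ''
text += decoder.write(bytes.subarray(0, 1)) // incomplete sequence → returns ''
text += decoder.write(bytes.subarray(1))    // sequence completes → returns '江'
text += decoder.end() // flushes any remaining buffered bytes
// text === '江'
```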

@dinwwwh
Member

dinwwwh commented Mar 24, 2026

Thanks @zxhyc131 for the awesome PR! I made a small change: instead of StringDecoder, I opted for TextDecoder, since it's more modern and widely supported. I also added tests to ensure the final text += decoder.decode() flush runs. What do you think about my changes?
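The final flush matters for the incomplete-final-byte case the tests cover. A small sketch of the assumed scenario, not the repository's actual test:

```typescript
import { Buffer } from 'node:buffer'

// A body ending mid-character: first 2 of the 3 bytes of '水'.
const decoder = new TextDecoder('utf-8')
const truncated = Buffer.from('水').subarray(0, 2)

let text = decoder.decode(truncated, { stream: true }) // '' — bytes stay buffered
text += decoder.decode() // final flush surfaces them as the replacement character
// text === '\uFFFD'
```

Without the trailing decode() call, the buffered bytes would be silently dropped instead of yielding �.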

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 24, 2026
@pkg-pr-new

pkg-pr-new bot commented Mar 24, 2026


@orpc/ai-sdk

npm i https://pkg.pr.new/@orpc/ai-sdk@1496

@orpc/arktype

npm i https://pkg.pr.new/@orpc/arktype@1496

@orpc/client

npm i https://pkg.pr.new/@orpc/client@1496

@orpc/contract

npm i https://pkg.pr.new/@orpc/contract@1496

@orpc/experimental-durable-iterator

npm i https://pkg.pr.new/@orpc/experimental-durable-iterator@1496

@orpc/hey-api

npm i https://pkg.pr.new/@orpc/hey-api@1496

@orpc/interop

npm i https://pkg.pr.new/@orpc/interop@1496

@orpc/json-schema

npm i https://pkg.pr.new/@orpc/json-schema@1496

@orpc/nest

npm i https://pkg.pr.new/@orpc/nest@1496

@orpc/openapi

npm i https://pkg.pr.new/@orpc/openapi@1496

@orpc/openapi-client

npm i https://pkg.pr.new/@orpc/openapi-client@1496

@orpc/otel

npm i https://pkg.pr.new/@orpc/otel@1496

@orpc/experimental-pino

npm i https://pkg.pr.new/@orpc/experimental-pino@1496

@orpc/experimental-publisher

npm i https://pkg.pr.new/@orpc/experimental-publisher@1496

@orpc/experimental-publisher-durable-object

npm i https://pkg.pr.new/@orpc/experimental-publisher-durable-object@1496

@orpc/experimental-ratelimit

npm i https://pkg.pr.new/@orpc/experimental-ratelimit@1496

@orpc/react

npm i https://pkg.pr.new/@orpc/react@1496

@orpc/react-query

npm i https://pkg.pr.new/@orpc/react-query@1496

@orpc/experimental-react-swr

npm i https://pkg.pr.new/@orpc/experimental-react-swr@1496

@orpc/server

npm i https://pkg.pr.new/@orpc/server@1496

@orpc/shared

npm i https://pkg.pr.new/@orpc/shared@1496

@orpc/solid-query

npm i https://pkg.pr.new/@orpc/solid-query@1496

@orpc/standard-server

npm i https://pkg.pr.new/@orpc/standard-server@1496

@orpc/standard-server-aws-lambda

npm i https://pkg.pr.new/@orpc/standard-server-aws-lambda@1496

@orpc/standard-server-fastify

npm i https://pkg.pr.new/@orpc/standard-server-fastify@1496

@orpc/standard-server-fetch

npm i https://pkg.pr.new/@orpc/standard-server-fetch@1496

@orpc/standard-server-node

npm i https://pkg.pr.new/@orpc/standard-server-node@1496

@orpc/standard-server-peer

npm i https://pkg.pr.new/@orpc/standard-server-peer@1496

@orpc/svelte-query

npm i https://pkg.pr.new/@orpc/svelte-query@1496

@orpc/tanstack-query

npm i https://pkg.pr.new/@orpc/tanstack-query@1496

@orpc/trpc

npm i https://pkg.pr.new/@orpc/trpc@1496

@orpc/valibot

npm i https://pkg.pr.new/@orpc/valibot@1496

@orpc/vue-colada

npm i https://pkg.pr.new/@orpc/vue-colada@1496

@orpc/vue-query

npm i https://pkg.pr.new/@orpc/vue-query@1496

@orpc/zod

npm i https://pkg.pr.new/@orpc/zod@1496

commit: e1e5222

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@dinwwwh dinwwwh merged commit 6bc474e into middleapi:main Mar 24, 2026
5 checks passed
@dinwwwh dinwwwh changed the title fix: handle utf-8 characters split across stream chunks fix(node-adapter): handle utf-8 characters split across stream chunks Mar 24, 2026
