
UTF-8 character corruption in fast-utf8-stream.js via releaseWritingBuf() leads to silent data loss on partial writes #61744

@pitiflautico


Version

No response

Platform

**Summary:** releaseWritingBuf() in lib/internal/streams/fast-utf8-stream.js incorrectly calculates string slice positions when fs.write returns a byte count that splits a multi-byte UTF-8 character, causing silent data corruption (lost characters, lone surrogates in output).

**Description:** 

The releaseWritingBuf function (line 896) converts bytes-written to character count using:


n = Buffer.from(writingBuf).subarray(0, n).toString().length;


When the first n bytes end partway through a multi-byte character, .toString() decodes the incomplete UTF-8 sequence as U+FFFD (the replacement character). U+FFFD always counts as one UTF-16 code unit, regardless of how many code units the original character occupies, so the recomputed n is not the number of fully written characters and .slice(n) cuts at the wrong position:

- 3-byte characters (CJK, most non-Latin): character silently dropped from output
- 4-byte characters (emoji, supplementary CJK): lone low surrogate left in remaining buffer, producing invalid UTF-8 on next write
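
The conversion step can be isolated from the rest of the function. The following is a minimal sketch, assuming only the expression quoted above; `charsClaimedWritten` is a hypothetical helper name, not something in fast-utf8-stream.js:

```javascript
// Hypothetical helper: applies the same bytes-to-characters conversion as
// the quoted line and returns the value that is later passed to .slice().
function charsClaimedWritten(str, n) {
  return Buffer.from(str).subarray(0, n).toString().length;
}

const cjk = 'abc中def';       // '中' is 3 bytes in UTF-8, 1 UTF-16 code unit
const emoji = 'hello🌍world'; // '🌍' is 4 bytes in UTF-8, 2 UTF-16 code units

// 4 bytes = 'abc' plus the first byte of '中', decoded as 'abc\ufffd'.
console.log(charsClaimedWritten(cjk, 4)); // 4
// 7 bytes = 'hello' plus the first two bytes of '🌍', decoded as 'hello\ufffd'.
console.log(charsClaimedWritten(emoji, 7)); // 6

// Slicing at those positions skips '中' entirely and cuts '🌍' in half.
console.log(JSON.stringify(cjk.slice(4)));   // "def"
console.log(JSON.stringify(emoji.slice(6))); // "\udf0dworld"
```

In both cases the partially written character is counted as one consumed code unit, which is what produces the dropped character and the lone surrogate described in the list above.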

The file was recently added (January 2026), ported from SonicBoom. It is used as the fast path for streaming UTF-8 output.

## Steps To Reproduce:

1. Save this as poc.js and run with node poc.js:


// Reproduces the releaseWritingBuf logic from lib/internal/streams/fast-utf8-stream.js lines 896-906
function releaseWritingBuf(writingBuf, len, n) {
  if (typeof writingBuf === 'string' && Buffer.byteLength(writingBuf) !== n) {
    n = Buffer.from(writingBuf).subarray(0, n).toString().length;
  }
  len = Math.max(len - n, 0);
  writingBuf = writingBuf.slice(n);
  return { writingBuf, len };
}

// Case 1: 4-byte emoji split at byte 7 — lone surrogate
const r1 = releaseWritingBuf("hello🌍world", 14, 7);
console.log("Case 1 - Emoji split:");
console.log("  Result:", JSON.stringify(r1.writingBuf));
console.log("  Expected:", JSON.stringify("🌍world"));
console.log("  First char code: 0x" + r1.writingBuf.charCodeAt(0).toString(16));
console.log("  Is lone surrogate:",
  r1.writingBuf.charCodeAt(0) >= 0xDC00 && r1.writingBuf.charCodeAt(0) <= 0xDFFF);

// Case 2: 3-byte CJK char split at byte 4 — character lost
const r2 = releaseWritingBuf("abc中def", 9, 4);
console.log("\nCase 2 - CJK split:");
console.log("  Result:", JSON.stringify(r2.writingBuf));
console.log("  Expected:", JSON.stringify("中def"));
console.log("  Character 中 lost:", !r2.writingBuf.includes("中"));


2. Output shows:

Case 1 - Emoji split:
  Result: "\udf0dworld"        ← CORRUPTED (lone surrogate)
  Expected: "🌍world"
  First char code: 0xdf0d
  Is lone surrogate: true

Case 2 - CJK split:
  Result: "def"                ← CHARACTER LOST
  Expected: "中def"
  Character 中 lost: true


3. The vulnerable code is at:
https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906

Partial fs.write returns are possible when writing to pipes near capacity, under disk I/O pressure, or to Docker log pipes (the exact use case mentioned in the file's comments on lines 69-70).

Additional finding: Line 240 has a typo from the SonicBoom port: this._asyncDrainScheduled should be this.#asyncDrainScheduled. All 5 other references use the private field correctly. Because the underscore property is never assigned, the newListener handler is effectively dead code.
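
To illustrate why the typo matters: in JavaScript, `this._asyncDrainScheduled` and `this.#asyncDrainScheduled` are unrelated properties, so a read through the underscore name never sees a value stored in the private field. The class below is a hypothetical stand-in and not the actual Utf8Stream code; only the two field names are taken from the report.

```javascript
// Hypothetical stand-in class; only the field names mirror the report.
class DrainFlagDemo {
  #asyncDrainScheduled = false;

  schedule() {
    // Writes to the private field, as the correct references do.
    this.#asyncDrainScheduled = true;
  }

  readViaPrivateField() {
    return this.#asyncDrainScheduled;
  }

  readViaUnderscore() {
    // Reads an ordinary public property that nothing ever assigns.
    return this._asyncDrainScheduled;
  }
}

const demo = new DrainFlagDemo();
demo.schedule();
console.log(demo.readViaPrivateField()); // true
console.log(demo.readViaUnderscore());   // undefined
```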

## Impact:

Silent data corruption in output files. Applications using Utf8Stream for logging with international characters (CJK, emoji, Cyrillic) can produce corrupted output when partial writes occur. 3-byte characters are silently lost (no error emitted). 4-byte characters produce invalid UTF-8 (lone surrogates). This is especially relevant for the Docker container logging use case the file was designed for.

## Supporting Material/References:

- Vulnerable function: releaseWritingBuf() at https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906
- Typo (secondary): line 240, _asyncDrainScheduled vs #asyncDrainScheduled
- File derived from SonicBoom (https://github.com/pinojs/sonic-boom) — the original has a similar issue but uses _ prefix consistently
- The PoC script above is standalone and runs on any Node.js version

Subsystem

No response

What steps will reproduce the bug?

  1. Save the PoC script from the "Steps To Reproduce" section above as
    poc.js and run it with node poc.js.

  2. The script reproduces the exact logic from
    lib/internal/streams/fast-utf8-stream.js (lines 896–906),
    specifically the releaseWritingBuf() function.

  3. The script simulates partial fs.write() behavior where the number
    of bytes written splits a multi-byte UTF-8 character.

  4. Observe the output:

    • When a 4-byte UTF-8 character (emoji) is split, a lone surrogate
      remains in the output.
    • When a 3-byte UTF-8 character (CJK) is split, the character is
      silently dropped.

  5. This demonstrates incorrect string slicing caused by converting
    byte counts to character counts via .toString().length.

How often does it reproduce? Is there a required condition?

It reproduces deterministically whenever fs.write() (or an equivalent
internal write) returns a byte count that splits a multi-byte UTF-8
character.

The issue is not timing-dependent or race-based. The required condition
is a partial write that ends in the middle of a UTF-8 sequence.

This can occur when writing to pipes, sockets, or log streams under
backpressure (e.g. near-capacity pipes, Docker container logs, or heavy
I/O), which is a documented and expected behavior of fs.write().
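
Whether a given byte count triggers the bug can be checked directly: the write ends mid-character exactly when the next unwritten byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). A minimal sketch, where `splitsCharacter` is a hypothetical helper name used only for illustration:

```javascript
// Hypothetical helper: returns true when byte offset n falls inside a
// multi-byte UTF-8 sequence of str, i.e. the byte at offset n is a
// continuation byte (10xxxxxx).
function splitsCharacter(str, n) {
  const bytes = Buffer.from(str, 'utf8');
  if (n <= 0 || n >= bytes.length) return false; // nothing or everything written
  return (bytes[n] & 0xC0) === 0x80;
}

console.log(splitsCharacter('abc中def', 3));     // false: ends right before '中'
console.log(splitsCharacter('abc中def', 4));     // true: ends inside '中'
console.log(splitsCharacter('hello🌍world', 7)); // true: ends inside '🌍'
console.log(splitsCharacter('hello🌍world', 9)); // false: ends right after '🌍'
```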

What is the expected behavior? Why is that the expected behavior?

The output must always preserve valid UTF-8 and must not silently
corrupt data.

When a partial write ends in the middle of a multi-byte UTF-8 character,
the remaining bytes for that character should be preserved and written
in a subsequent write, rather than being dropped or converted into
replacement characters.

This is the expected behavior because:

  • fs.write() is documented to return partial byte counts.
  • UTF-8 stream handling must be byte-safe across writes.
  • Producing lone surrogates or dropping characters violates UTF-8
    correctness and results in silent data corruption.

The current behavior breaks UTF-8 invariants and can corrupt log output
in real-world streaming scenarios, such as container logging and
pipe-based streams, which this module explicitly targets.
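
One way to get the byte-safe behavior described above is to keep the unwritten remainder as bytes rather than as a string. The following is a sketch under that assumption; it is not the actual Node.js or SonicBoom fix, and `releaseWritingBufBytes` and `simulateWrites` are hypothetical names:

```javascript
// Hypothetical byte-oriented variant: returns the unwritten remainder as a
// Buffer, so a character split by a partial write keeps its trailing bytes.
function releaseWritingBufBytes(writingBuf, n) {
  const bytes = typeof writingBuf === 'string'
    ? Buffer.from(writingBuf, 'utf8')
    : writingBuf;
  return bytes.subarray(n);
}

// Simulates a partial write of `firstWrite` bytes followed by a second
// write of the remainder, then decodes everything that was "written".
function simulateWrites(str, firstWrite) {
  const written = Buffer.from(str, 'utf8').subarray(0, firstWrite);
  const rest = releaseWritingBufBytes(str, firstWrite);
  return Buffer.concat([written, rest]).toString('utf8');
}

console.log(simulateWrites('hello🌍world', 7) === 'hello🌍world'); // true
console.log(simulateWrites('abc中def', 4) === 'abc中def');         // true
```

The trade-off in this sketch is that the pending data is re-encoded into a Buffer after a partial write, so the caller has to handle a Buffer remainder instead of a string.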

What do you see instead?

Instead of preserving valid UTF-8 output, the stream produces corrupted
results when a partial write splits a multi-byte character.

Specifically:

  • For 3-byte UTF-8 characters (e.g. CJK), the character is silently
    dropped from the output with no error.
  • For 4-byte UTF-8 characters (e.g. emoji), the remaining buffer starts
    with a lone UTF-16 surrogate, producing invalid UTF-8 on subsequent
    writes.

No error or warning is emitted, resulting in silent data corruption in
the output stream.

Additional information

No response


Labels

confirmed-bug (Issues with confirmed bugs.), fs (Issues and PRs related to the fs subsystem / file system.)
