UTF-8 character corruption in fast-utf8-stream.js via releaseWritingBuf() leads to silent data loss on partial writes

### Version

_No response_

### Platform

```text
**Summary:** releaseWritingBuf() in lib/internal/streams/fast-utf8-stream.js incorrectly calculates string slice positions when fs.write returns a byte count that splits a multi-byte UTF-8 character, causing silent data corruption (lost characters, lone surrogates in output).

**Description:** 

The releaseWritingBuf function (line 896) converts bytes-written to character count using:


n = Buffer.from(writingBuf).subarray(0, n).toString().length;


When the first n bytes end partway through a multi-byte character, .toString() decodes the incomplete UTF-8 sequence as U+FFFD (the replacement character). U+FFFD is a single UTF-16 code unit, so the resulting count includes one code unit for a character that was not fully written, and .slice(n) cuts at the wrong position:

- 3-byte characters (CJK, most non-Latin): character silently dropped from output
- 4-byte characters (emoji, supplementary CJK): lone low surrogate left in remaining buffer, producing invalid UTF-8 on next write

The file was recently added (January 2026), ported from SonicBoom. It is used as the fast path for streaming UTF-8 output.

## Steps To Reproduce:

1. Save this as poc.js and run with node poc.js:


// Reproduces the releaseWritingBuf logic from lib/internal/streams/fast-utf8-stream.js lines 896-906
function releaseWritingBuf(writingBuf, len, n) {
  if (typeof writingBuf === 'string' && Buffer.byteLength(writingBuf) !== n) {
    n = Buffer.from(writingBuf).subarray(0, n).toString().length;
  }
  len = Math.max(len - n, 0);
  writingBuf = writingBuf.slice(n);
  return { writingBuf, len };
}

// Case 1: 4-byte emoji split at byte 7 — lone surrogate
const r1 = releaseWritingBuf("hello🌍world", 14, 7);
console.log("Case 1 - Emoji split:");
console.log("  Result:", JSON.stringify(r1.writingBuf));
console.log("  Expected:", JSON.stringify("🌍world"));
console.log("  First char code: 0x" + r1.writingBuf.charCodeAt(0).toString(16));
console.log("  Is lone surrogate:",
  r1.writingBuf.charCodeAt(0) >= 0xDC00 && r1.writingBuf.charCodeAt(0) <= 0xDFFF);

// Case 2: 3-byte CJK char split at byte 4 — character lost
const r2 = releaseWritingBuf("abc中def", 9, 4);
console.log("\nCase 2 - CJK split:");
console.log("  Result:", JSON.stringify(r2.writingBuf));
console.log("  Expected:", JSON.stringify("中def"));
console.log("  Character 中 lost:", !r2.writingBuf.includes("中"));


2. Output shows:

Case 1 - Emoji split:
  Result: "\udf0dworld"        ← CORRUPTED (lone surrogate)
  Expected: "🌍world"
  First char code: 0xdf0d
  Is lone surrogate: true

Case 2 - CJK split:
  Result: "def"                ← CHARACTER LOST
  Expected: "中def"
  Character 中 lost: true


3. The vulnerable code is at:
https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906

Partial writes from fs.write are possible when writing to pipes near capacity, under disk I/O pressure, or to Docker log pipes (the exact use case mentioned in the file's comments on lines 69-70).

Additional finding: line 240 has a typo from the SonicBoom port: this._asyncDrainScheduled should be this.#asyncDrainScheduled. The other five references use the private field correctly. The newListener handler is effectively dead code.

## Impact:

Silent data corruption in output files. Applications that use Utf8Stream to log text containing multi-byte characters (CJK, emoji, Cyrillic) can produce corrupted output when a partial write occurs. 3-byte characters are silently lost with no error emitted, and 4-byte characters leave lone surrogates that produce invalid UTF-8. This is especially relevant to the Docker container logging use case the file was designed for.

## Supporting Material/References:

- Vulnerable function: releaseWritingBuf() at https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906
- Typo (secondary): line 240, _asyncDrainScheduled vs #asyncDrainScheduled
- File derived from SonicBoom (https://github.com/pinojs/sonic-boom) — the original has a similar issue but uses _ prefix consistently
- The PoC script above is standalone and runs on any Node.js version
```
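
The secondary finding in the report rests on the fact that `this._name` and `this.#name` refer to unrelated slots in JavaScript. The sketch below uses a hypothetical `DrainTracker` class, not the actual `Utf8Stream` source, and assumes the typo is a write (the report does not say whether line 240 reads or writes the property). It shows that assigning to the underscore-prefixed name leaves the private field untouched:

```javascript
// Hypothetical class for illustration only; it is not the Utf8Stream source.
class DrainTracker {
  #asyncDrainScheduled = false;

  // Mimics the reported typo: creates a public property, not the private field.
  markWithTypo() {
    this._asyncDrainScheduled = true;
  }

  // Intended form: updates the private field.
  markCorrectly() {
    this.#asyncDrainScheduled = true;
  }

  get scheduled() {
    return this.#asyncDrainScheduled;
  }
}

const tracker = new DrainTracker();
tracker.markWithTypo();
const afterTypo = tracker.scheduled;   // private field is still false
tracker.markCorrectly();
const afterFix = tracker.scheduled;    // private field is now true
console.log({ afterTypo, afterFix });
```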

### Subsystem

_No response_

### What steps will reproduce the bug?

1. Save the PoC script from the report above as `poc.js` and run it with `node poc.js`.

2. The script reproduces the exact logic from
   `lib/internal/streams/fast-utf8-stream.js` (lines 896–906),
   specifically the `releaseWritingBuf()` function.

3. The script simulates partial `fs.write()` behavior where the number
   of bytes written splits a multi-byte UTF-8 character.

4. Observe the output:
   - When a 4-byte UTF-8 character (emoji) is split, a lone surrogate
     remains in the output.
   - When a 3-byte UTF-8 character (CJK) is split, the character is
     silently dropped.

5. This demonstrates incorrect string slicing caused by converting
   byte counts to character counts via `.toString().length`.
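
To make step 5 concrete, the following sketch decodes a truncated byte prefix the same way the PoC does and shows the replacement character inflating the count. The sample string comes from the PoC; the variable names are illustrative only.

```javascript
// Decode the first 4 bytes of a string whose 4th byte starts a 3-byte character.
const sample = 'abc中def';
const bytes = Buffer.from(sample, 'utf8');      // 61 62 63 e4 b8 ad 64 65 66
const decodedPrefix = bytes.subarray(0, 4).toString('utf8');

// The dangling lead byte 0xE4 decodes to U+FFFD, so the prefix has 4 code
// units even though only 3 characters were fully written.
const miscounted = decodedPrefix.length;
const remainder = sample.slice(miscounted);

console.log(miscounted, JSON.stringify(remainder)); // 4 "def"
```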

### How often does it reproduce? Is there a required condition?

It reproduces deterministically whenever `fs.write()` (or an equivalent
internal write) returns a byte count that splits a multi-byte UTF-8
character.

The issue is not timing-dependent or race-based. The required condition
is a partial write that ends in the middle of a UTF-8 sequence.

Partial writes can occur when writing to pipes, sockets, or log streams
under backpressure (e.g. near-capacity pipes, Docker container logs, or
heavy I/O). They are documented, expected behavior for `fs.write()`.
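
The required condition can be checked directly on the encoded bytes. In well-formed UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so a write that stops immediately before a continuation byte has split a character. The helper name `splitsCharacter` is hypothetical and exists only for this sketch.

```javascript
// Hypothetical helper: true when stopping after `bytesWritten` bytes would
// leave a multi-byte UTF-8 character half-written.
function splitsCharacter(str, bytesWritten) {
  const bytes = Buffer.from(str, 'utf8');
  if (bytesWritten <= 0 || bytesWritten >= bytes.length) return false;
  // Continuation bytes match 10xxxxxx; a character boundary never precedes one.
  return (bytes[bytesWritten] & 0xc0) === 0x80;
}

console.log(splitsCharacter('hello🌍world', 7)); // true
console.log(splitsCharacter('hello🌍world', 5)); // false
console.log(splitsCharacter('abc中def', 4));      // true
console.log(splitsCharacter('abc中def', 6));      // false
```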

### What is the expected behavior? Why is that the expected behavior?

The output must always preserve valid UTF-8 and must not silently
corrupt data.

When a partial write ends in the middle of a multi-byte UTF-8 character,
the remaining bytes for that character should be preserved and written
in a subsequent write, rather than being dropped or converted into
replacement characters.

This is the expected behavior because:
- `fs.write()` is documented to return partial byte counts.
- UTF-8 stream handling must be byte-safe across writes.
- Producing lone surrogates or dropping characters violates UTF-8
  correctness and results in silent data corruption.

The current behavior breaks UTF-8 invariants and can corrupt log output
in real-world streaming scenarios, such as container logging and
pipe-based streams, which this module explicitly targets.
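
One way to meet this expectation, sketched here as an assumption rather than as the project's chosen fix, is to keep the unwritten tail as bytes instead of converting the byte count back into a string index. The helper name `unwrittenBytes` is hypothetical.

```javascript
// Hypothetical helper: returns the bytes of `str` that a partial write of
// `bytesWritten` bytes has not yet flushed, without decoding them.
function unwrittenBytes(str, bytesWritten) {
  return Buffer.from(str, 'utf8').subarray(bytesWritten);
}

// Simulate a partial write that stops inside the emoji (byte 7 of 14).
const original = 'hello🌍world';
const firstWrite = Buffer.from(original, 'utf8').subarray(0, 7);
const secondWrite = unwrittenBytes(original, 7);

// Concatenating both writes reproduces the original text exactly.
const roundTrip = Buffer.concat([firstWrite, secondWrite]).toString('utf8');
console.log(roundTrip === original); // true
```

The trade-off is that the pending data becomes a `Buffer` rather than a string, so the surrounding code would need to handle both types.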

### What do you see instead?

Instead of preserving valid UTF-8 output, the stream produces corrupted
results when a partial write splits a multi-byte character.

Specifically:
- For 3-byte UTF-8 characters (e.g. CJK), the character is silently
  dropped from the output with no error.
- For 4-byte UTF-8 characters (e.g. emoji), the remaining buffer starts
  with a lone UTF-16 surrogate, producing invalid UTF-8 on subsequent
  writes.

No error or warning is emitted, resulting in silent data corruption in
the output stream.
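
Both symptoms come from slicing at a miscounted index. A string-only alternative, again a sketch and not the project's fix, is to count only the characters whose UTF-8 bytes fit entirely within the bytes written. The helper `charsFullyWritten` is hypothetical; with it, the PoC inputs yield the `Expected` values from the report.

```javascript
// Hypothetical helper: number of UTF-16 code units of `str` whose UTF-8
// encoding fits entirely within the first `bytesWritten` bytes.
function charsFullyWritten(str, bytesWritten) {
  let bytes = 0;
  let units = 0;
  for (const ch of str) {                    // iterates by code point
    const size = Buffer.byteLength(ch, 'utf8');
    if (bytes + size > bytesWritten) break;  // this character is incomplete
    bytes += size;
    units += ch.length;                      // 1 or 2 code units
  }
  return units;
}

const emoji = 'hello🌍world';
const cjk = 'abc中def';
console.log(JSON.stringify(emoji.slice(charsFullyWritten(emoji, 7)))); // "🌍world"
console.log(JSON.stringify(cjk.slice(charsFullyWritten(cjk, 4))));     // "中def"
```

This variant re-sends the split character in full, so any partial bytes of that character that were already written remain in the output. It also walks the string one code point at a time, which may matter on a fast path.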

### Additional information

_No response_

UTF-8 character corruption in fast-utf8-stream.js via releaseWritingBuf() leads to silent data loss on partial writes #61744
