Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoder does not error incorrectly for legacy byte sequences #40091

Open
domenic opened this issue Sep 12, 2021 · 2 comments
Open

TextDecoder does not error incorrectly for legacy byte sequences #40091

domenic opened this issue Sep 12, 2021 · 2 comments
Labels
encoding Issues and PRs related to the TextEncoder and TextDecoder APIs.

Comments

@domenic
Copy link
Contributor

domenic commented Sep 12, 2021

Version

v16.9.1

Platform

Microsoft Windows NT 10.0.19043.0 x64

Subsystem

encoding

What steps will reproduce the bug?

Enter the following in the REPL:

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(0).toString(16)

as well as

new TextDecoder("Big5").decode(new Uint8Array([0x83, 0x5C])).charCodeAt(1).toString(16)

How often does it reproduce? Is there a required condition?

Every time

What is the expected behavior?

fffd for the first, and 5c for the second (as in Firefox and Chrome, and per the WHATWG Encoding Standard)

What do you see instead?

f00e and NaN

Additional information

I suspect this has to do with you using ICU as-is, instead of properly patching it to match the Encoding Standard. There are probably more bugs like this.

@inexorabletash may be able to point to where in the Chromium source tree we keep our ICU encoding patches.

@VoltrexKeyva VoltrexKeyva added the encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. label Sep 12, 2021
@inexorabletash
Copy link

In ICU, it's important to use the "HTML" tag when selecting encodings, which should select the web-compatible encodngs based on Encoding. That started off as Chromium-specific but was upstreamed.

For Chromium's remaining customization, https://source.chromium.org/chromium/chromium/src/+/main:third_party/icu/README.chromium is the right place to start, and it references a patch for gb18030/windows-936

I should note that Chrome differs from the Encoding standard for a few encodings (e.g. GBK / GB18030 have differences)

@domenic
Copy link
Contributor Author

domenic commented Nov 7, 2023

This problem remains on Node.js v21.1.0.

Deno 1.38.0 does not have this problem, at least for the minimal test case in the OP. I suspect Bun does not either since it uses WebKit's TextEncoder/TextDecoder implementation.

Relevant web platform tests are https://wpt.fyi/results/encoding?label=experimental&label=master&aligned .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoding Issues and PRs related to the TextEncoder and TextDecoder APIs.
Projects
None yet
Development

No branches or pull requests

3 participants