Replacement character (`\uFFFD`) in JavaScript/TypeScript causes breakage #57930

wooorm · 2024-03-25T17:17:45Z

🔎 Search Terms

TS1490: File appears to be binary
replacement character

🕗 Version & Regression Information

#57008 is incorrect on some real files.

⏯ Playground Link

micromark/micromark#166

💻 Code

Note the � in the first 256 characters.

import assert from 'node:assert/strict'
import test from 'node:test'
import {micromark} from 'micromark'

test('nul', function () {
  assert.equal(
    micromark('asd\0asd'),
    '<p>asd�asd</p>',
    'should replace `\\0` w/ a replacement characters (`�`)'
  )
})

🙁 Actual behavior

error TS1490: File appears to be binary

🙂 Expected behavior

0xFFFD should be fine among other regular characters.

Additional information about the issue

Related-to: #21136.
Related-to: #56516.
Related-to: #57008.

RyanCavanaugh · 2024-03-25T17:31:28Z

Just write '\ufffd' instead? Putting the signifier that says "this file was corrupted" into a file, causing a tool for humans to say "this file looks corrupted", seems like correct behavior.
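A minimal sketch of the escape-sequence workaround suggested here. It only shows that the escape evaluates to the same string value as the literal character, so the expected value in the test above could be spelled without a raw U+FFFD in the source file. The variable names are illustrative and not taken from the micromark test suite.

```typescript
// The escape sequence yields the same string value as the literal character,
// but the source file itself no longer contains a raw U+FFFD.
const expected: string = "<p>asd\ufffdasd</p>";

// Index 6 is the character right after "<p>asd".
const codePoint: number | undefined = expected.codePointAt(6);

console.log(codePoint === 0xfffd); // true
```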

fatcerberus · 2024-03-25T17:45:02Z

I always thought the entire point of the 0xFFFD codepoint was that it's only supposed to be generated by a text decoding failure and has no reason to ever appear literally in the source text. Like Ryan says it's specifically designated as an indicator of corrupted text.

RyanCavanaugh · 2024-03-25T17:55:19Z

Situation: \ufffd exists because it shouldn't ever appear in a correctly-encoded file on purpose

"We need to put a raw \ufffd in a file so we can test it"

Situation: \ufffd exists in a correctly-encoded file on purpose

https://xkcd.com/927/

RyanCavanaugh · 2024-03-25T18:11:28Z

Really the thing we need is for Buffer.toString to say when it inserted a replacement character instead of directly reading one. Without that information there's not a lot we can do that isn't expensive.
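For illustration, one way to learn whether decoding would have inserted a replacement character is a strict decoder. The sketch below uses `TextDecoder` with `fatal: true`, which throws on malformed input instead of substituting U+FFFD. It assumes a Node build with ICU support, and `decodeStrict` is a hypothetical helper name, not something in the TypeScript code base. It says nothing about the cost concern raised here.

```typescript
import { TextDecoder } from "node:util";

// Hypothetical helper: returns the decoded text, or undefined when the
// bytes are not valid UTF-8 (a lenient decoder would have inserted U+FFFD).
function decodeStrict(bytes: Uint8Array): string | undefined {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return undefined;
  }
}

// "a", then U+FFFD correctly encoded as EF BF BD, then "b".
const literalReplacement = new Uint8Array([0x61, 0xef, 0xbf, 0xbd, 0x62]);
// 0xFF can never appear in well-formed UTF-8.
const malformed = new Uint8Array([0x61, 0xff, 0x62]);

console.log(decodeStrict(literalReplacement) === "a\ufffdb"); // true
console.log(decodeStrict(malformed) === undefined); // true
```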

jakebailey · 2024-03-25T18:34:23Z

Since we have a Buffer in sys.readFile (thanks to us dealing with BOMs and so on), we can technically write:

const buf = fs.readFileSync(p);
const asText = buf.toString("utf8");
const badDecode = Buffer.compare(buf, Buffer.from(asText, "utf8")) !== 0; // a non-zero result means the round trip changed the bytes

We could do that conditionally based on the presence of U+FFFD, but then we don't have any way to bubble that information up (because readFile returns just a string, and the point would be to not error on parse because the lower-level thing said it was okay).
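A rough sketch of the conditional variant described above, under the assumption that the caller can accept something richer than a plain string. `DecodeResult` and `decodeWithLossFlag` are hypothetical names; the real `readFile` returns only a string, which is exactly the limitation mentioned in this comment.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical result shape: the text plus a flag saying whether the
// decoder had to substitute U+FFFD for malformed bytes.
interface DecodeResult {
  text: string;
  lossy: boolean;
}

function decodeWithLossFlag(buf: Buffer): DecodeResult {
  const text = buf.toString("utf8");
  // Fast path: without any U+FFFD in the output, nothing was replaced.
  if (!text.includes("\ufffd")) {
    return { text, lossy: false };
  }
  // Slow path: re-encode and compare. A literal U+FFFD round-trips to the
  // same bytes; a substituted one does not.
  const lossy = Buffer.compare(buf, Buffer.from(text, "utf8")) !== 0;
  return { text, lossy };
}

const onPurpose = decodeWithLossFlag(Buffer.from([0x61, 0xef, 0xbf, 0xbd, 0x62]));
const corrupted = decodeWithLossFlag(Buffer.from([0x61, 0xff, 0x62]));

console.log(onPurpose.lossy); // false
console.log(corrupted.lossy); // true
```

The re-encode only runs for files that actually contain U+FFFD, so ordinary source files pay for one `includes` scan and nothing more.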

wooorm · 2024-03-25T18:43:10Z

"exists because it shouldn't ever appear in a correctly-encoded file on purpose"

I don't think this is really the case. There are many algorithms in the html spec, and markdown spec, that do things that produce the replacement character.

It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.

Also, no other tool does what TS just started doing.

I'd think it's better to check for control characters instead of a replacement character. Perhaps before that toString
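As a sketch of what a control-character check on the raw bytes might look like: the function below scans only a short prefix and treats any C0 control byte other than tab, line feed, form feed, and carriage return as a sign of binary content. The name `hasControlBytes`, the 256-byte limit, and the allow-list are assumptions for illustration. It also assumes the bytes are UTF-8, since UTF-16 text would contain NUL bytes.

```typescript
// Hypothetical heuristic: true when the first `limit` bytes contain a C0
// control byte that ordinary source text would not normally contain.
function hasControlBytes(bytes: Uint8Array, limit = 256): boolean {
  return bytes.subarray(0, limit).some(
    (b) => b < 0x20 && b !== 0x09 && b !== 0x0a && b !== 0x0c && b !== 0x0d,
  );
}

// UTF-8 bytes for "asd", U+FFFD, "asd": no control bytes.
const sourceLike = new Uint8Array([0x61, 0x73, 0x64, 0xef, 0xbf, 0xbd, 0x61, 0x73, 0x64]);
// A made-up binary-looking prefix that contains NUL bytes.
const binaryLike = new Uint8Array([0x47, 0x40, 0x00, 0x10, 0x00]);

console.log(hasControlBytes(sourceLike)); // false
console.log(hasControlBytes(binaryLike)); // true
```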

fatcerberus · 2024-03-25T21:20:33Z

"It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character."

NaN is actually a really good analogy, but maybe not for the reason you think: there's a big difference between testing for a computation that produces NaNs, as a way to detect bugs, vs. literally writing

let x = NaN;

The minute you assume the latter is allowed to happen in the wild, for any reason, then the former test becomes useless as an error-detection mechanism (this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason). Yes, x is equal to NaN, but it was done on purpose in this case and doesn't indicate an invalid computation. Which is pretty much exactly like the problem we're having here.

Now, that having been said, I will admit that testing for binary files by looking for a Unicode decode failure is kind of hacky. But alas, design constraints (specifically, the test is done at a time when all the compiler has access to is the UTF-8 decoded text from the source file).

wooorm · 2024-03-26T11:36:40Z

Buffer#toString resulting in � does not mean that Buffer#toString is the only function that is ever allowed to result in � per Unicode.
Markdown for example, which has to deal with potentially malicious authors, has to make documents safe. So, the function markdown(input) also produces �.

Here are some more practical examples:

15 hits in WHATWG, https://github.com/search?q=org%3Awhatwg+%22%EF%BF%BD%22&type=code
16 hits in my personal code: https://github.com/search?q=%22%EF%BF%BD%22+user%3Awooorm&type=code

Even if there were never �s in input, I also argue that Buffer#toString (implying Buffer#toString('utf8')) resulting in � does not mean that a file is binary. Whether a buffer is valid UTF-8 is not the same as whether a buffer is binary or not.

Looking on npm for “is binary” and checking out the code, yields:

In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex? https://github.com/wayfind/is-utf8/blob/master/is-utf8.js

snarbies · 2024-03-26T12:32:21Z

"(this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason)."

Off-topic, but NaN compares false to NaN because there are multiple bit patterns that can represent it and never equal is much better than sometimes equal.
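A small illustration of the point about bit patterns, written as a sketch rather than a statement about why IEEE 754 was designed this way. It builds two doubles from different 64-bit patterns, both of which read back as NaN and neither of which compares equal to the other with `===`.

```typescript
// Write two different 64-bit patterns (big-endian, high word first).
const view = new DataView(new ArrayBuffer(16));
view.setUint32(0, 0x7ff80000); // exponent all ones, quiet bit set
view.setUint32(4, 0x00000000); // payload 0
view.setUint32(8, 0x7ff80000); // same high word
view.setUint32(12, 0x00000001); // payload 1

const first = view.getFloat64(0);
const second = view.getFloat64(8);

console.log(Number.isNaN(first), Number.isNaN(second)); // true true
console.log(first === second); // false
console.log(Object.is(first, second)); // true
```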

fatcerberus · 2024-03-26T14:36:36Z

Yes, I'm aware that's the theory - but I tend to suspect there's a more pragmatic rationale behind it too 😅 I don't think many people are pulling NaNs apart to inspect their low-level bit patterns, and if they are that's well outside the purview of the IEEE float spec, so from the perspective of normal FP operations the existence of multiple representations isn't even observable.

snarbies · 2024-03-26T16:23:30Z

¯\_(ツ)_/¯ I guess it can be two things

RyanCavanaugh · 2024-03-26T16:40:45Z

"In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex?"

Everything can be fixed; the question is how much slower everyone's tsc should be in order for a handful of test files to not need to use escape sequences.

wooorm · 2024-03-26T16:48:30Z

Indeed! But then I’d move one more step back: why are people passing 700mb video files through typescript? How many folks are doing that 😅

The patch for that already only looks at the first 256 characters. For every file.
There is apparently code for BOMs too.
I’d wager that looking at the first 256 bytes with something like https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts, particularly when modified to actually do what the goal is (bail quickly when not UTF-8), will not be so slow.

RyanCavanaugh · 2024-03-26T16:55:45Z

0x47 is "G". I don't know why someone seeing an error when the letter G appears at two random offsets would be less surprised to see "file is binary" than you are when looking at �.

wooorm · 2024-03-26T16:56:29Z

Right. You still can use those uncommon 2 bytes to go into a slightly slower path tho?

RyanCavanaugh · 2024-03-26T17:00:27Z

We're also not just trying to detect MPEG transport streams. There are other tools, e.g. Expo, which are/were emitting arbitrary binary files into .js file extensions and causing slowdowns because we tried to parse a giant garbage file.

wooorm · 2024-03-26T17:13:25Z

Interesting! I dunno, detecting “arbitrary binary files” made by various tools is just going to be complex I think.

Right now TS throws. I assume that on giant binary files TS was also throwing.
Why not improve that crash with a better message, check whether something appeared binary, and suggest adding an ignore pattern?
Or also when files are like 1mb+, do a small byte check then?
Or when � appears with this recent patch.
And also, as @jakebailey mentions, comparing the buffer: #57930 (comment).

I feel like there are a couple of cases where a slightly more thorough check can be done. And I’m not 100% sure that check is that slow.

RyanCavanaugh · 2024-03-26T17:15:38Z

PRs accepted

RyanCavanaugh added the Suggestion and Awaiting More Feedback labels Mar 25, 2024

RyanCavanaugh added the Help Wanted label and removed the Awaiting More Feedback label Mar 25, 2024

RyanCavanaugh added this to the Backlog milestone Mar 25, 2024

dsherret mentioned this issue Mar 26, 2024

feat: TypeScript 5.4 denoland/deno#23086

Merged

jakebailey mentioned this issue Apr 17, 2024

Error on replacement character only in top-level scanning #58227

Merged

jakebailey closed this as completed in #58227 Apr 17, 2024
