Replacement character (\uFFFD) in JavaScript/TypeScript causes to breaks #57930

Closed
wooorm opened this issue Mar 25, 2024 · 19 comments · Fixed by #58227
Labels
Help Wanted You can do this Suggestion An idea for TypeScript
@wooorm

wooorm commented Mar 25, 2024

🔎 Search Terms

  • TS1490: File appears to be binary
  • replacement character

🕗 Version & Regression Information

#57008 is incorrect on some real files.

⏯ Playground Link

micromark/micromark#166

💻 Code

Note the � in the first 256 characters.

import assert from 'node:assert/strict'
import test from 'node:test'
import {micromark} from 'micromark'

test('nul', function () {
  assert.equal(
    micromark('asd\0asd'),
    '<p>asd�asd</p>',
    'should replace `\\0` w/ a replacement characters (`�`)'
  )
})

🙁 Actual behavior

error TS1490: File appears to be binary

🙂 Expected behavior

0xFFFD should be fine among other regular characters.

Additional information about the issue

Related-to: #21136.
Related-to: #56516.
Related-to: #57008.

@RyanCavanaugh
Member

Just write '\ufffd' instead? Putting the signifier that says "this file was corrupted" into a file, causing a tool for humans to say "this file looks corrupted", seems like correct behavior.

RyanCavanaugh added the Suggestion and Awaiting More Feedback labels Mar 25, 2024
@fatcerberus

I always thought the entire point of the 0xFFFD codepoint was that it's only supposed to be generated by a text decoding failure and has no reason to ever appear literally in the source text. Like Ryan says it's specifically designated as an indicator of corrupted text.

@RyanCavanaugh
Member

Situation: \ufffd exists because it shouldn't ever appear in a correctly-encoded file on purpose

"We need to put a raw \ufffd in a file so we can test it"

Situation: \ufffd exists in a correctly-encoded file on purpose

https://xkcd.com/927/

@RyanCavanaugh
Member

RyanCavanaugh commented Mar 25, 2024

Really the thing we need is for Buffer.toString to say when it inserted a replacement character instead of directly reading one. Without that information there's not a lot we can do that isn't expensive.

@jakebailey
Member

jakebailey commented Mar 25, 2024

Since we have a Buffer in sys.readFile (thanks to us dealing with BOMs and so on), we can technically write:

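import fs from "node:fs";

// Decode as UTF-8, then re-encode: a lossy decode will not round-trip to the same bytes.
const buf = fs.readFileSync(p);
const asText = buf.toString("utf8");
const badDecode = Buffer.compare(buf, Buffer.from(asText, "utf8")) !== 0;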

We could do that conditionally based on the presence of U+FFFD, but then we don't have any way to bubble that information up (because readFile returns just a string, and the point would be to not error on parse because the lower-level thing said it was okay).
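
A minimal sketch of the conditional round-trip described above, under some assumptions: `decodeUtf8` and `DecodeResult` are hypothetical names (not TypeScript's actual `sys.readFile`), and `TextDecoder`/`TextEncoder` stand in for `Buffer` so the snippet is self-contained. Returning an object instead of a bare string is one possible way to bubble the "lossy" flag up to the caller.

```typescript
// Hypothetical helper for illustration; not part of the TypeScript code base.
interface DecodeResult {
  text: string;
  // True only when the decoder substituted U+FFFD for malformed bytes.
  lossy: boolean;
}

function decodeUtf8(bytes: Uint8Array): DecodeResult {
  // ignoreBOM: true keeps a leading BOM in the output so the round-trip
  // comparison below sees the same bytes that were read.
  const text = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

  // Fast path: without U+FFFD in the text, nothing can have been replaced.
  if (!text.includes("\uFFFD")) {
    return { text, lossy: false };
  }

  // Slow path: re-encode and compare with the original bytes. A U+FFFD that
  // was literally present in the file round-trips unchanged.
  const roundTrip = new TextEncoder().encode(text);
  const lossy =
    roundTrip.length !== bytes.length ||
    roundTrip.some((b, i) => b !== bytes[i]);
  return { text, lossy };
}
```

With this shape, only files that actually contain U+FFFD pay for the re-encode, and a literal replacement character in a test file would not be reported as a decode failure.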

@wooorm
Author

wooorm commented Mar 25, 2024

"exists because it shouldn't ever appear in a correctly-encoded file on purpose"

I don't think this is really the case. There are many algorithms in the html spec, and markdown spec, that do things that produce the replacement character.

It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.

Also, no other tool does what TS just started doing.

I'd think it's better to check for control characters instead of a replacement character. Perhaps before that toString
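
As a rough illustration of that suggestion, here is a sketch of a control-character scan over the raw bytes before any decoding happens. The function name `hasControlBytes`, the choice of which bytes count as control characters, and the 256-byte window are all assumptions made for this example; it is not what TypeScript does.

```typescript
// Hypothetical heuristic for illustration only.
// Returns true when the first `limit` bytes contain a C0 control byte
// (other than common text whitespace) or DEL.
function hasControlBytes(bytes: Uint8Array, limit = 256): boolean {
  return bytes.subarray(0, limit).some((b) => {
    // 0x09..0x0D covers tab, line feed, vertical tab, form feed, carriage return.
    const isTextWhitespace = b >= 0x09 && b <= 0x0d;
    return (b < 0x20 && !isTextWhitespace) || b === 0x7f;
  });
}
```

A source file that contains a literal U+FFFD (bytes EF BF BD) has no control bytes, so a check like this would not flag it, while a raw NUL byte near the start of the file would.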

RyanCavanaugh added the Help Wanted label and removed the Awaiting More Feedback label Mar 25, 2024
RyanCavanaugh added this to the Backlog milestone Mar 25, 2024
@fatcerberus

fatcerberus commented Mar 25, 2024

It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.

NaN is actually a really good analogy, but maybe not for the reason you think: there's a big difference between testing for a computation that produces NaNs, as a way to detect bugs, vs. literally writing

let x = NaN;

The minute you assume the latter is allowed to happen in the wild, for any reason, then the former test becomes useless as an error-detection mechanism (this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason). Yes, x is equal to NaN, but it was done on purpose in this case and doesn't indicate an invalid computation. Which is pretty much exactly like the problem we're having here.

Now, that having been said, I will admit that testing for binary files by looking for a Unicode decode failure is kind of hacky. But alas, design constraints (specifically, the test is done at a time when all the compiler has access to is the UTF-8 decoded text from the source file).

@wooorm
Author

wooorm commented Mar 26, 2024

Buffer#toString resulting in � does not mean that Buffer#toString is the only function that is ever allowed to result in � per Unicode.
Markdown for example, which has to deal with potentially malicious authors, has to make documents safe. So, the function markdown(input) also produces �.

Here are some more practical examples:

Even if there were never �s in input, I also argue that Buffer#toString (implying Buffer#toString('utf8')) resulting in � does not mean that a file is binary. Whether a buffer is valid UTF-8 is not the same as whether a buffer is binary or not.

Looking on npm for “is binary” and checking out the code, yields:

In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex? https://github.com/wayfind/is-utf8/blob/master/is-utf8.js
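
One way such a byte-level check could look, as a sketch rather than a proposal for the actual implementation: the standard `TextDecoder` accepts a `fatal` option that makes it throw on malformed input instead of inserting U+FFFD. The helper name `isValidUtf8` is an assumption for this example.

```typescript
// Hypothetical helper for illustration; the name is an assumption.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    // With fatal: true, decode() throws a TypeError on malformed UTF-8
    // rather than silently substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}
```

Note that this only answers "is this well-formed UTF-8?", which, as argued above, is a different question from "is this file binary?".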

@snarbies

(this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason).

Off-topic, but NaN compares false to NaN because there are multiple bit patterns that can represent it and never equal is much better than sometimes equal.

@fatcerberus

Yes, I'm aware that's the theory - but I tend to suspect there's a more pragmatic rationale behind it too 😅 I don't think many people are pulling NaNs apart to inspect their low-level bit patterns, and if they are that's well outside the purview of the IEEE float spec, so from the perspective of normal FP operations the existence of multiple representations isn't even observable.

@snarbies

snarbies commented Mar 26, 2024

¯\_(ツ)_/¯ I guess it can be two things

@RyanCavanaugh
Member

In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex?

Everything can be fixed; the question is how much slower everyone's tsc should be in order for a handful of test files to not need to use escape sequences.

@wooorm
Author

wooorm commented Mar 26, 2024

Indeed! But then I’d move one more step back: why are people passing 700mb video files through typescript? How many folks are doing that 😅

The patch for that already only looks at the first 256 characters. For every file.
There is apparently code for BOMs too.
I’d wager that looking at the first 256 bytes with something like https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts, particularly when modified to actually do what the goal is (bail quickly when not UTF-8), will not be so slow.

@wooorm

This comment was marked as resolved.

@RyanCavanaugh
Member

0x47 is "G". I don't know why someone seeing an error when the letter G appears at two random offsets would be less surprised to see "file is binary" than you are when looking at �.

@wooorm
Author

wooorm commented Mar 26, 2024

Right. You still can use those uncommon 2 bytes to go into a slightly slower path tho?

@RyanCavanaugh
Member

We're also not just trying to detect MPEG transport streams. There are other tools, e.g. Expo, which are/were emitting arbitrary binary files into .js file extensions and causing slowdowns because we tried to parse a giant garbage file.

@wooorm
Author

wooorm commented Mar 26, 2024

Interesting! I dunno, detecting “arbitrary binary files” made by various tools is just going to be complex I think.

Right now TS throws. I assume that on giant binary files TS was also throwing.
Why not improve that crash with a better message, check whether something appeared binary, and suggest adding an ignore pattern?
Or also when files are like 1mb+, do a small byte check then?
Or when � appears with this recent patch.
And also, as @jakebailey mentions, comparing the buffer: #57930 (comment).

I feel like there are a couple of cases where a slightly more thorough check can be done. And I’m not 100% sure that check is that slow.

@RyanCavanaugh
Member

PRs accepted

5 participants