Replacement character (\uFFFD) in JavaScript/TypeScript causes breaks
#57930
Comments
Just write |
I always thought the entire point of the |
Situation: "We need to put a raw Situation: |
Really the thing we need is for |
Since we have a Buffer in
const buf = fs.readFileSync(p);
// Invalid UTF-8 sequences are replaced with U+FFFD during decoding.
const asText = buf.toString("utf8");
// A clean decode re-encodes to the same bytes; any difference means the decode was lossy.
const badDecode = Buffer.compare(buf, Buffer.from(asText, "utf8")) !== 0;
We could do that conditionally based on the presence of U+FFFD, but then we don't have any way to bubble that information up (because readFile returns just a string, and the point would be to not error on parse because the lower-level thing said it was okay). |
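The round-trip idea in the comment above can be sketched without Node-specific APIs. This is a minimal sketch, not code from the TypeScript repository: `isLossyUtf8Decode` is a hypothetical name, and it assumes the file contents are available as a `Uint8Array` (a Node `Buffer` qualifies).

```typescript
// Hypothetical helper: returns true when decoding `bytes` as UTF-8 had to
// substitute U+FFFD for at least one invalid sequence.
function isLossyUtf8Decode(bytes: Uint8Array): boolean {
  // ignoreBOM: true keeps a leading BOM in the decoded text, so a valid file
  // that starts with a BOM still round-trips byte for byte.
  const text = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
  const roundTripped = new TextEncoder().encode(text);

  if (roundTripped.length !== bytes.length) {
    return true;
  }
  for (let i = 0; i < bytes.length; i++) {
    if (roundTripped[i] !== bytes[i]) {
      return true;
    }
  }
  return false;
}

// A correctly encoded U+FFFD (EF BF BD) between "a" and "b".
const literalReplacement = new Uint8Array([0x61, 0xef, 0xbf, 0xbd, 0x62]);
// 0xFF can never appear in valid UTF-8.
const invalidByte = new Uint8Array([0x61, 0xff, 0x62]);

console.log(isLossyUtf8Decode(literalReplacement));
console.log(isLossyUtf8Decode(invalidByte));
```

Under these assumptions, a file that contains a literal, correctly encoded U+FFFD is not reported as lossy, while a file with undecodable bytes is. The cost is a second pass over the bytes, which is the performance concern raised later in the thread.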
"exists because it shouldn't ever appear in a correctly-encoded file on purpose"
I don't think this is really the case. There are many algorithms in the HTML spec and the Markdown spec that produce the replacement character. It's like saying NaN shouldn't exist. Of course you don't normally want NaN, but it has to exist as a concept, and that also means you can check for it in code and in tests. You have to be able to talk about it. The Unicode spec, Wikipedia, HTML, and Markdown all need to be able to talk about the character.
Also, no other tool does what TS just started doing. I'd think it's better to check for control characters instead of a replacement character. Perhaps before that toString |
NaN is actually a really good analogy, but maybe not for the reason you think: there's a big difference between testing for a computation that produces NaNs, as a way to detect bugs, vs. literally writing let x = NaN;
The minute you assume the latter is allowed to happen in the wild, for any reason, then the former test becomes useless as an error-detection mechanism (this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason).
Yes, Now, that having been said, I will admit that testing for binary files by looking for a Unicode decode failure is kind of hacky. But alas, design constraints (specifically, the test is done at a time when all the compiler has access to is the UTF-8 decoded text from the source file). |
Here are some more practical examples:
Even if there were never
Looking on npm for “is binary” and checking out the code yields:
In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex? https://github.com/wayfind/is-utf8/blob/master/is-utf8.js |
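For reference, checking the bytes directly does not require a hand-written validator. The sketch below uses the standard `TextDecoder` in fatal mode; `isValidUtf8` is a hypothetical helper name, and nothing here is taken from the TypeScript code base or from the linked is-utf8 package.

```typescript
// Hypothetical helper: true when `bytes` is well-formed UTF-8.
// With { fatal: true } the decoder throws a TypeError on the first invalid
// sequence instead of substituting U+FFFD.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// "x" followed by a correctly encoded U+FFFD.
console.log(isValidUtf8(new Uint8Array([0x78, 0xef, 0xbf, 0xbd])));
// 0xC3 must be followed by a continuation byte; 0x28 is not one.
console.log(isValidUtf8(new Uint8Array([0xc3, 0x28])));
```

Note that validity alone is not a binary detector: a buffer full of NUL bytes is valid UTF-8, so a check like this would only separate "real U+FFFD in the source" from "decoder inserted U+FFFD".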
Off-topic, but NaN compares false to NaN because there are multiple bit patterns that can represent it and never equal is much better than sometimes equal. |
Yes, I'm aware that's the theory - but I tend to suspect there's a more pragmatic rationale behind it too 😅 I don't think many people are pulling NaNs apart to inspect their low-level bit patterns, and if they are that's well outside the purview of the IEEE float spec, so from the perspective of normal FP operations the existence of multiple representations isn't even observable. |
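To make the NaN tangent concrete, here is a small sketch of both points: several bit patterns decode to NaN, and ordinary comparisons cannot tell them apart or even match them. `nanFromLowBits` is a hypothetical helper written for this illustration.

```typescript
// Hypothetical helper: builds a double whose exponent bits are all ones and
// whose quiet bit is set, with `lowBits` as the low 32 bits of the mantissa.
// Every such pattern is a NaN.
function nanFromLowBits(lowBits: number): number {
  const view = new DataView(new ArrayBuffer(8));
  view.setUint32(0, 0x7ff80000); // high word, big-endian by default
  view.setUint32(4, lowBits); // low word
  return view.getFloat64(0);
}

const a = nanFromLowBits(0);
const b = nanFromLowBits(1);

console.log(Number.isNaN(a), Number.isNaN(b)); // both are NaN
console.log(a === b); // strict equality never matches a NaN
console.log(a === a); // not even against itself
console.log(Object.is(a, b)); // SameValue treats all NaNs as one value
```

So a "did this computation go wrong" test has to use `Number.isNaN` (or `Object.is`), which is the kind of explicit check being compared to the U+FFFD test above.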
¯\_(ツ)_/¯ I guess it can be two things |
Everything can be fixed; the question is how much slower everyone's tsc should be in order for a handful of test files to not need to use escape sequences. |
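The escape-sequence workaround mentioned here looks roughly like this. It is a sketch that assumes the binary check only inspects the decoded source text, so an escape keeps the character U+FFFD out of the file while producing the same runtime string; `isReplacementCharacter` is a hypothetical name.

```typescript
// The source text of these lines is plain ASCII; the strings they produce
// contain U+FFFD at runtime.
const viaEscape = "\uFFFD";
const viaCodePoint = String.fromCodePoint(0xfffd);

// Hypothetical predicate a test file might use instead of a literal U+FFFD.
function isReplacementCharacter(ch: string): boolean {
  return ch === "\uFFFD";
}

console.log(viaEscape === viaCodePoint);
console.log(isReplacementCharacter(viaEscape));
console.log(viaEscape.charCodeAt(0).toString(16));
```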
Indeed! But then I’d move one more step back: why are people passing 700 MB video files through TypeScript? How many folks are doing that 😅 The patch for that already only looks at the first 256 characters. For every file. |
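As described in this thread, the check scans only the first 256 characters of the decoded text for U+FFFD. The sketch below models that description; it is an assumption about the behavior, not the actual compiler code, and `sampleContainsReplacement` is a hypothetical name.

```typescript
const REPLACEMENT_CHARACTER = "\uFFFD";

// Hypothetical model of the heuristic: true when U+FFFD occurs within the
// first `sampleLength` UTF-16 code units of the decoded source text.
function sampleContainsReplacement(text: string, sampleLength = 256): boolean {
  return text.slice(0, sampleLength).indexOf(REPLACEMENT_CHARACTER) !== -1;
}

// A legitimate source file that mentions the character near the top.
const earlyMention = "const replacement = '" + REPLACEMENT_CHARACTER + "';\n";
// The same mention pushed past the sampled prefix.
const lateMention = "/" + "*".repeat(300) + "/\n" + earlyMention;

console.log(sampleContainsReplacement(earlyMention));
console.log(sampleContainsReplacement(lateMention));
```

Under this model the report in the issue is a false positive of the first kind: a valid file is flagged only because the character appears early.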
0x47 is "G". I don't know why someone seeing an error when the letter G appears at two random offsets would be less surprised to see "file is binary" than you are when looking at �. |
Right. You still can use those uncommon 2 bytes to go into a slightly slower path tho? |
We're also not just trying to detect MPEG transport streams. There are other tools, e.g. Expo, which are/were emitting arbitrary binary files into .js file extensions and causing slowdowns because we tried to parse a giant garbage file. |
Interesting! I dunno, detecting “arbitrary binary files” made by various tools is just going to be complex I think. Right now TS throws. I assume that on giant binary files TS was also throwing. I feel like there are a couple cases where a slightly more thorough check can be done. And I’m not 100% sure that that check is that slow. |
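One shape a "slightly more thorough check" could take is the control-character idea suggested earlier in the thread: sample the raw bytes and flag C0 control bytes that do not normally occur in source text. This is a sketch under assumptions; the function name, the 256-byte sample, and the set of allowed whitespace bytes are illustrative choices, not anything TypeScript does.

```typescript
// Hypothetical heuristic: true when the sampled prefix contains a C0 control
// byte other than tab, line feed, vertical tab, form feed, or carriage return.
function looksBinaryByControlBytes(bytes: Uint8Array, sampleSize = 256): boolean {
  const limit = Math.min(bytes.length, sampleSize);
  for (let i = 0; i < limit; i++) {
    const b = bytes[i];
    const isAllowedWhitespace = b >= 0x09 && b <= 0x0d;
    if (b < 0x20 && !isAllowedWhitespace) {
      return true;
    }
  }
  return false;
}

const sourceBytes = new TextEncoder().encode("const s = '\uFFFD';\r\n");
const binaryBytes = new Uint8Array([0x47, 0x40, 0x00, 0x10, 0x00]);

console.log(looksBinaryByControlBytes(sourceBytes));
console.log(looksBinaryByControlBytes(binaryBytes));
```

The loop touches at most `sampleSize` bytes per file, so its cost is bounded regardless of file size; whether it catches the binary outputs mentioned above would need testing against real samples.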
PRs accepted |
🔎 Search Terms
🕗 Version & Regression Information
#57008 is incorrect on some real files.
⏯ Playground Link
micromark/micromark#166
💻 Code
Note the � in the first 256 characters.
🙁 Actual behavior
error TS1490: File appears to be binary
🙂 Expected behavior
0xFFFD should be fine among other regular characters.
Additional information about the issue
Related-to: #21136.
Related-to: #56516.
Related-to: #57008.