Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 BOM stripping Redux #83

Closed
jbenet opened this issue Aug 12, 2015 · 28 comments
Closed

UTF-8 BOM stripping Redux #83

jbenet opened this issue Aug 12, 2015 · 28 comments

Comments

@jbenet
Copy link

jbenet commented Aug 12, 2015

@sindresorhus @phated @contra @IsaacSanch

#72 this is awful. I understand your interest in making things generally better, and I like making tooling that fixes legacy problems as much as everyone else.

But the contents of files are the user's domain. Not the tool's.

We wasted hours trying to find what in our big pipeline was changing our files. Changing content of files is not something that filesystem tools should do. To a filesystem, the contents of a file are sacred. Imagine if S3 or Dropbox or HTTP or even TCP subtly "tried to fix files for you". It would be a nightmare. All cryptographically secure protocols break!

I urge you to reconsider this. We were using vinyl and vinyl-fs for a bunch of ipfs pipelines, and would like to continue using it. We very much like the viewpoints described here: https://medium.com/@contrahacks/gulp-3828e8126466 But if the vinyl tooling does not hold it as a strong guarantee that contents of files can move from A to B without magically changing underneath, we'll have to abandon it for something else. :'(

I think @contra said it best:

Not adding it just because it was asked for, adding it because it makes sense. We shouldn't be fucking with contents without a way to disable it or we are a really shitty general purpose fs access tool (vinyl-fs)

And I agree. It was disappointing to find this design bug.

@yocontra
Copy link
Member

Pinging @mathiasbynens who wrote the PR for this code originally

@phated
Copy link
Member

phated commented Aug 12, 2015

Still 👎 on not stripping this. PR for reference #21 and we've created https://github.com/sindresorhus/gulp-bom for the few that want BOM.

All of the gulp libraries are written for the many, not the few, but the few have ways to work around it (hence the plugin).

@yocontra
Copy link
Member

@phated vinyl and vinyl-fs are outside of gulp. gulp can wrap vinyl-fs .src to default to BOM stripping while vinyl-fs defaults to not stripping BOM

@phated
Copy link
Member

phated commented Aug 12, 2015

@contra I understand what you are saying, but any library (not just gulp) should be written for the 90% case. The 90% (hell, even 99% case here) is to strip the BOM. The 1% can just use a plugin

@phated
Copy link
Member

phated commented Aug 12, 2015

@contra it's not like this is an undocumented feature of vinyl-fs (I just searched BOM on the readme)

@yocontra
Copy link
Member

@phated Yeah, it was documented after #72

@jbenet
Copy link
Author

jbenet commented Aug 12, 2015

@phated

any library (not just gulp) should be written for the 90% case. The 90% (hell, even 99% case here) is to strip the BOM

do you thinkit would be acceptable, then, for the linux kernel to strip BOM characters? what if your filesystem was trying to be super nice for the 99% of users who, you know, don't care about BOM?

> curl http://myfile.com/file | shasum 
d1059c2798bc5efbf0f824552da5e4f8d2f81741  -
> curl http://myfile.com/file >file
> cat file | shasum
526d626bc1cb512f8ae973f694b011ca324ad789  -

Or what if git did this?

> cat file | shasum
d1059c2798bc5efbf0f824552da5e4f8d2f81741  -
> git add file
> git commit -m "my file"
> git checkout file
> cat file | shasum
526d626bc1cb512f8ae973f694b011ca324ad789  -

Or netcat?

> cat file | shasum
d1059c2798bc5efbf0f824552da5e4f8d2f81741  -
> cat file | nc -l 1234
> nc 1234 | shasum
526d626bc1cb512f8ae973f694b011ca324ad789  -

Libraries are about abstractions of use. filesystem tools do not mutate files based on some prejudice of byte distributions. They honor the fundamental abstraction of a filesystem: the user should be able to put a bitstring in, and get the same bitstring out.

@phated
Copy link
Member

phated commented Aug 12, 2015

@jbenet your analogies are egregious, this is a build tool for the web, not curl, not netcat, not the linux kernel. This is a documented feature that takes opinionated stance that takes care of the 99%.

@mathiasbynens
Copy link
Contributor

@jbenet Your comparison is not very fair, IMHO.

I don’t think it’s very common for devs to use gulp tasks that mutate source files like that. In all of the Gulpfiles I’ve seen, files are concatenated/minified/etc. after which the result is saved to a new file.

(That said, stripping BOM from existing source files could be useful as a Git precommit hook, for example, and sure, gulp could be used for that. But this just adds to the point.)

IMHO, by default, concatenating two files together through gulp or any other build tool should not result in a file with a BOM somewhere in the middle.

This is anecdotal evidence, but every time I’ve seen a file with a BOM in it the developer was unaware their editor was configured to add it. It was never an intentional choice. BOM is a bug.

@mathiasbynens
Copy link
Contributor

I’d love to hear more details about @jbenet’s use case. What kind of files are you dealing with that make the BOM so important?

vinyl and vinyl-fs are outside of gulp. gulp can wrap vinyl-fs .src to default to BOM stripping while vinyl-fs defaults to not stripping BOM

I don’t have a strong opinion on this.

@jbenet
Copy link
Author

jbenet commented Aug 12, 2015

@phated

your analogies are egregious

No. vinyl-fs is used by many tools that transfer arbitrary binary files. including ours! (see the end of this post for an example much like nc and git that today uses vinyl-fs.

this is a build tool for the web

I think vinyl-fs is a very useful module, well outside of the confines of gulp. https://www.npmjs.com/browse/depended/vinyl-fs shows lots of other uses.

@mathiasbynens

This is https://github.com/wearefractal/vinyl-fs, an adapter for https://github.com/wearefractal/vinyl and the filesystem. It is inspired by -- and used for -- gulp, but is a separate module, as far as github, npm, and the abstractions presented by the module readme are concerned.

If you intend it not to be used by non-gulp things, i understand -- it's your module, not mine -- i'll go away. But my impression was you wanted this to be like any other module in npm: a general tool with utility outside of the scope of its original motivation (a good thing).

This is anecdotal evidence, but every time I’ve seen a file with a BOM in it the developer was unaware their editor was configured to add it. It was never an intentional choice. BOM is a bug.

Not all files piped through with vinyl and vinyl-fs are text (ASCII/Unicode) files. Take a look at https://www.npmjs.com/browse/depended/vinyl-fs -- I wonder what % of the module authors actually want some bytes to be stripped, or are even aware. Examples include a tool to download files from the internet and save them to the local file system. I.e. it uses vinyl-fs to save arbitrary binary files. (much like the examples I showed, @phated). I don't think they would want the particular sequence of bytes that happens to coincide with BOM to be stripped in a jpg, video, or a tar file... (that would wreak havoc to alignment of the format frames). Luckily, it seems to only use dest though, which AFAICT, does not do any BOM stripping.

Btw, yes, I am aware it is trying to do some detection with https://www.npmjs.com/package/is-utf8 -- but that looks not safe for arbitrary byte sequences (e.g. archives with utf files within them.)

I’d love to hear more details about @jbenet’s use case.

Thanks for asking! We are using vinyl-fs to move arbitrary binary files (not unicode!), where the infamous BOM just happens to be a particular sequence of bytes. Moreover, we follow string filesystem semantics-- and files have to be --bit for bit-- exactly the same in and out.

We're using vinyl-fs in a tool that effectively does this:

# add files in one computer
> hash=$(ipfs add <path>)

# get them in another computer
> ipfs get $hash

And it works both in the cli and in the web browser. If the bitstrings change, both fs guarantees and cryptographic checks fail.


to remind everyone, what the original PR asked for is this:

var vfs = require('vinyl-fs')
var files = vfs.srcs('./**/*', {stripBOM: false})

though i do think @contra is right about this:

gulp can wrap vinyl-fs .src to default to BOM stripping while vinyl-fs defaults to not stripping BOM


I'm done here. I've spent way more than the energy required to solve my own problem. Do what you will.

@mathiasbynens
Copy link
Contributor

Examples include a tool to download files from the internet and save them to the local file system. I.e. it uses vinyl-fs to save arbitrary binary files. (much like the examples I showed, @phated). I don't think they would want the particular sequence of bytes that happens to coincide with BOM to be stripped in a jpg, video, or a tar file... (that would wreak havoc to alignment of the format frames).

A file that starts with a BOM cannot possibly be a valid JPG, video (in any format), or tar file, so this example is not relevant here. strip-bom, and thus vinyl-fs, only strips BOM at the beginning of files that are detected to be UTF-8.

Like I said, I wouldn’t mind if vinyl-fs exposed a stripBOM option so @jbenet can set it to false. I just agree with @phated that the most sensical default for it is true, especially for dependents like gulp. I don’t really care what the default value for vinyl-fs itself is.

@phated phated mentioned this issue Sep 10, 2015
6 tasks
@tunnckoCore
Copy link
Contributor

All that said, I don't see problem to be option which by default to be true.

@yocontra
Copy link
Member

@jbenet Can you provide a more concrete use case (or sample repo) of where our BOM stripping is incorrectly stripping on a non-UTF8 file?

@kingmotley
Copy link

Just noob here, but why doesn't vinyl-fs strip the BOM from the content, and have a flag that indicates that it was present? dest can then write the BOM out if the file has this BOM flag was set (or not)? Most plug-ins, etc could then just worry about what is in content, blissfully ignoring the fact if the BOM was present or not. I'm probably missing something basic as I'm just picking all this up. You could always make an option then to treat BOMs as raw data for any edge cases in which you want to treat the BOM as content.

@tunnckoCore
Copy link
Contributor

The best, the only logical, non-breaking and reasonable thing would be to just expose option that can control that behavior, which would be enabled by default to support backward compatibility.

There's absolutely no problem to have such option, and there's no need to have any so stable reason behind to introduce it.

I don't see problem if some user want it and it not breaks the things and don't add complexity - it will only help and allow more control.

@yocontra
Copy link
Member

Again: I'm not against this, I'm just trying to understand it entirely before making a judgement

@kingmotley
Copy link

After looking further into this, and actually looking at how it's currently implemented, I'm going to have to side with jbenet on this one. Vinyl-fs likely shouldn't be BOM stripping, at least by default. It's quite possible to create files that fail all the safety tests, and it adds a non-trivial amount of processing to try and detect utf8 by scanning up to the entire file's contents making sure each byte conforms to uft8 encoding. I'd recommend either moving this logic up into gulp's methods, or having gulp pass a flag it's calls to vinyl-fs. Similar issues have arisen in most languages (C#, python, pascal, VB), and they solve it via a flag during file open or having two separate types of opens (binary vs text). You could even wrap vinyl-fs with a more complete file abstraction for text files, converting line endings in and out as well as stripping BOM and have gulp depend on this instead of a raw vinyl-fs.

Last thing you want is a situation where vinyl-fs is accidentally corrupting files. Tracking that down (if you detect it) would be very time consuming.

isUtf8 is only a safe check IF you know that the file is text to begin with. It's definitely not safe on arbitrary binary files. For example, imaging a program that has a single option, if written to disk as binary, and that option just happens to be a 16-bit signed int, and the value is -2, then you use vinyl-fs to copy the file, you've just corrupted it. Or a writing a C structure like: struct { short option1, char[2] state } to disk, if option1 is -2, then it'll also get corrupted.

I can come up with more examples, but the likelyhood of a user running across this issue and getting bitten by it is non-trivial, and tracking it down would be non-trivial. Silently corrupting files is about the worst thing for a tool chain can do.

@yocontra
Copy link
Member

I'm leaning towards this solution:

  • vinyl-fs .src has an option to strip BOM on read, which defaults to false
  • gulp defaults to stripping BOM on read, and overrides this option in vinyl-fs

Does this satisfy all parties?

@phated
Copy link
Member

phated commented Sep 14, 2015

👎 on gulp having to override defaults. Just default it to true in vinyl-fs and document it 2 times (once in options, once in the note that is already there)

@kingmotley
Copy link

I think that is a fair an equitable solution contra. That would allow vinyl-fs to remain by default as seeing content as "sacred", while allowing gulp which often works on text files to default it sane values for it as well.

@phated
Copy link
Member

phated commented Sep 14, 2015

@contra a guiding ethos I have taken with gulp 4 is not to override defaults of the underlying libraries because it causes complex hunting through the dependency tree to find where it is being overridden

@phated
Copy link
Member

phated commented Sep 14, 2015

@contra also, please re-read #83 (comment) which I think has the most well-informed information

@yocontra
Copy link
Member

What's going to happen:

  • vinyl-fs .src gets an option to strip BOM on read, which defaults to true.

Reasoning: Multiple people made conjecture that "this could be triggered on non-utf8 files" but not one real example was provided. This seems totally fine as default behavior and no real drawbacks other than philosophical differences over what a true fs library should do were provided.

@mathiasbynens
Copy link
Contributor

@contra 👍

@jbenet
Copy link
Author

jbenet commented Sep 26, 2015

@contra i'm 👍 for this resolution, as it breaks no code (important!) even if i think the default is incorrect.


btw, i did not spend time looking for a non-utf8 file because:

  • IMO i have found a test case: any file which is moved with vinyl and whose cryptographic checksum changes is (afaic) incorrect.
  • but if you want another answer, there are odd formats out there that simply just violate the expectations above. for example it is actually quite common in scientific datasets to have to have files that are just a single array or matrix of 4 or 8 byte integers. In such cases, a file with a number whose hex value starts with 0xefbbbf might match. one is provided in the repo below, with a single 4 byte number 0xefbbbf0a.
  • sure "no one will ever do this" -- well i found the problem by running into it :)

You can test both of these at https://github.com/jbenet/bom-diff-for-vinyl

> make test
./vmv.js
moved src/ to dst/
shasum src/test.num
34e399cb21690bfb4dbcc6308f96cd984461160b  src/test.num
shasum dst/test.num
adc83b19e793491b1c6ea0fd8b46cd9f32e592fc  dst/test.num
diff -r src dst
diff -r src/test.num dst/test.num
1c1
<
---
>
make: *** [test] Error 1

@kingmotley
Copy link

After scanning my entire drive, I found no files that would have their BOM stripped that weren't actual text files (or XML/HTML, etc). The only files I found that could cause problems would be the DRM type files that no doubt have signatures that would now fail since the contents of the file changed.

@tunnckoCore
Copy link
Contributor

Actually, in one or in another way, there would have breaking changes cuz it is new major release (if this change is intended to be for v4). So one more breaking change is not an issue, so it can defaults to false for v4+ and users will be warned. But im for true by default.

@phated phated closed this as completed in 870551c Sep 27, 2015
yocontra pushed a commit that referenced this issue Sep 27, 2015
add option to strip BOM, default to true - closes #83
phated added a commit that referenced this issue Nov 28, 2017
phated pushed a commit that referenced this issue Dec 5, 2017
add option to strip BOM, default to true - closes #83
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants