
Proposal: add option to automatically add BOM to write methods #11767

Closed
mcdado opened this Issue Mar 9, 2017 · 25 comments

mcdado commented Mar 9, 2017

Today I found out that you need to manually add a Unicode representation of the byte order mark to Unicode files/streams.

Having to prepend it by hand leads to confusion, IMHO. I think it would be better to add an addBom option (or something like that) to the different write methods, removing the manual step from the process.

@vsemozhetbyt (Member) commented Mar 9, 2017

This matter emerges from time to time. See, for example, these issues with comments:

#3040
#6924

@sam-github (Member) commented Mar 9, 2017

The BOM is part of the file. Removing or adding it would modify the data going to or from disk, and would trigger another set of bug reports along the lines of "why is the data I read in Node smaller than the file on disk?" and "why does this file I wrote have these strange bytes at the front?". It's a bit awkward, but I think explicit is better than guessing the user's intentions here. It might, though, be possible to introduce new encoding names with a +bom suffix or something, which would add/remove the BOM implicitly (but under the explicit control of the user).
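A minimal sketch of what such +bom encoding names could look like in userland. Nothing like this exists in Node.js core; the names resolveEncoding and encodeWithBom (and the "utf16le+bom" label itself) are invented for illustration:

```js
// Hypothetical: parse a "+bom" suffix off an encoding name.
function resolveEncoding(name) {
  const addBom = name.endsWith('+bom');
  const encoding = addBom ? name.slice(0, -'+bom'.length) : name;
  return { encoding, addBom };
}

// Serialize text, prepending U+FEFF when a "+bom" name was requested.
// The encoder turns U+FEFF into the correct BOM bytes for the encoding.
function encodeWithBom(text, name) {
  const { encoding, addBom } = resolveEncoding(name);
  return Buffer.from((addBom ? '\ufeff' : '') + text, encoding);
}
```

The point is that the BOM is added only under the explicit control of the caller, via the name they chose.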

@sam-github (Member) commented Mar 9, 2017

What is your specific suggestion? What "different write methods" would you like addBom: to be an option for?

/cc @srl295

@bnoordhuis (Member) commented Mar 10, 2017

> With "auto" it would be on for encodings like utf16le but off for UTF8

The UTF-8 byte order mark (EF BB BF), while not common and discouraged, is in use.

It's trivial to strip from incoming data but I predict endless discussions on whether it should be added or not to outgoing data.
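The stripping half really is trivial. A minimal sketch, where stripBom is a made-up helper (not a Node.js API):

```js
// After decoding, a BOM surfaces as a single leading U+FEFF code point,
// regardless of which byte encoding the data arrived in.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Raw UTF-8 bytes EF BB BF decode to that one-character prefix.
const withBom = Buffer.from([0xEF, 0xBB, 0xBF, 0x68, 0x69]).toString('utf8');
const cleaned = stripBom(withBom); // 'hi'
```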

@mcdado (Author) commented Mar 10, 2017

> With "auto" it would be on for encodings like utf16le but off for UTF8

> The UTF-8 byte order mark (EF BB BF), while not common and discouraged, is in use.

Exactly, at that point we're just talking about defaults. To me, auto in the write methods would not add a BOM if the encoding is utf8, but would add it if the encoding is utf16le. See here:

> The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Many applications ignore the BOM code at the start of any Unicode encoding. Web browsers often use a BOM as a hint in determining the character encoding.

In my experience, if you encode in utf16le and don't include a BOM, many readers can't interpret the file. On a Mac, BBEdit 11.6 misinterprets it and VS Code 1.10.1 can't open it at all.

@bnoordhuis (Member) commented Mar 10, 2017

I think you're missing my point. You should file a pull request if you feel it's a worthwhile addition, but be prepared for lots of discussion when it's a convenience feature like an 'auto' mode (or even an on/off mode; stripping and inserting BOMs is, after all, just a convenience).

@zcorpan commented Mar 14, 2017

What is the use case for writing UTF-16 at all?

@silverwind (Contributor) commented Mar 14, 2017

I think you might be better off wrapping or monkey-patching fs to suit your needs. BTW, are there any other programming languages with such a "feature"?

@mcdado (Author) commented Mar 14, 2017

Personally, as a user of the language, I expect the runtime to know that if I choose the utf16le encoding, producing a legal file in that encoding requires adding the BOM.
If it doesn't take care of it automatically, I would expect it to at least show clearly how to add that character. Doing it by hand is confusing: U+FEFF is the Unicode character, but you have to know that you always write it as-is rather than swapping its bytes for the encoding you're using. My point is that the current situation is opaque, and it would be better if it were clearer and more transparent. The new affordance should probably be off by default, but adding an option like useBom: true wouldn't cause any drawbacks IMHO.
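The "always write U+FEFF as-is" point can be demonstrated directly: the same character prefix produces the right BOM bytes for each encoding, because the serializer (not the programmer) handles byte order.

```js
// Same '\ufeff' prefix, two encodings, two different (correct) BOMs.
const utf8 = Buffer.from('\ufeffA', 'utf8');       // EF BB BF 41
const utf16le = Buffer.from('\ufeffA', 'utf16le'); // FF FE 41 00
```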

@zcorpan commented Mar 15, 2017

> Unfortunately there is business software (HP SmartStream Designer) that doesn't distinguish between UTF8 and Ansi because the former is backwards-compatible,

OK, so you need the BOM for utf-8. It seems reasonable to me to have a convenient way to do that.

> and their understanding of Unicode is UCS-2 (utf16le).

But you don't need to use utf-16, correct?

A bit of trivia about the BOM for utf-16:
In the era before https://encoding.spec.whatwg.org/, if the encoding label is "utf-16", the BOM is mandatory; if the encoding label is "utf-16le" or "utf-16be", the BOM is forbidden. (Encoding label is what you'd put in the charset parameter for the Content-Type response header in HTTP.)

Today per the Encoding Standard, the utf-16 decoder is more robust wrt the BOM and encoding label, but it does not specify an encoder because browsers do not need an encoder and everyone should be using only utf-8.
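That decoder behaviour can be observed with the WHATWG TextDecoder that ships in modern Node.js: with an explicit 'utf-16le' label, a leading BOM is consumed by default and only preserved when ignoreBOM is set. A small sketch of what I understand the spec to require:

```js
const bytes = new Uint8Array([0xFF, 0xFE, 0x61, 0x00]); // BOM + 'a' in UTF-16LE

// Default: a leading BOM matching the labelled encoding is stripped.
const stripped = new TextDecoder('utf-16le').decode(bytes);

// ignoreBOM: true keeps the U+FEFF in the decoded output.
const kept = new TextDecoder('utf-16le', { ignoreBOM: true }).decode(bytes);
```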

@mcdado (Author) commented Mar 15, 2017

I admit that I'm not an expert in encodings or the standards around them. I can only speak from experience, and this is how I ran into the issue:

```js
var fs = require('fs');
fs.writeFileSync('/Users/David/Desktop/test.txt', 'aåäeèéëiïœoøöuü', {encoding: 'utf16le'});
```

I'm on a Mac, I have several text editors to try to open the test file. In order: TextEdit, BBEdit, Visual Studio Code, Safari, Hex Fiend.

[Screenshots: how TextEdit, BBEdit, Visual Studio Code, Safari, and Hex Fiend each render the BOM-less file]

Then I did the following:

```js
fs.writeFileSync('/Users/David/Desktop/test-bom.txt', '\ufeffaåäeèéëiïœoøöuü', {encoding: 'utf16le'});
```

And without making other screenshots, just trust me that the same apps interpreted the file just fine.

[Screenshot: Hex Fiend showing the file with the BOM]

@mcdado (Author) commented Mar 15, 2017

> But you don't need to use utf-16, correct?

No. Let me try to explain better: I need to feed this software text files, which it uses as records, and they usually contain characters like the ones in my last comment. If I save the output as utf8, the software does not interpret the file correctly; it assumes it is Ansi. Sure, it's a buggy interpreter, but that's not my problem to solve. This software has support for what it calls Unicode, which turns out to mean either UCS-2 or UTF-16. Through experimentation I found that converting the file with Notepad++ on Windows to UCS-2 LE (which adds a BOM) makes its interpreter work correctly. That's why I started using utf16le as the encoding, but I was surprised that I have to include the BOM by hand when, again in my experience, files without it simply don't work anywhere!

This is just anecdotal for the rest of the world, I guess, but I thought it showed a "hole" in the assumptions Node makes when writing utf16le streams. If I'm wrong, then we should just keep adding the BOM by hand, like, all the time.

@zcorpan commented Mar 15, 2017

I think you would probably be better off using utf-8 with a BOM than using utf-16 (any variant).

Alternatively use utf-8 without a BOM and configure your editors to default to utf-8 instead of "Ansi".

@mcdado (Author) commented Mar 15, 2017

The editors are not the problem… it's the specific software that expects either UCS-2 or UTF-16.

My example with the various editors was to show that they can neither read nor detect UTF-16 without a BOM, so it's not just me 😄

@jasnell (Member) commented Mar 15, 2017

The key challenge to writing the BOM automatically is that the stream interface is quite agnostic to the encoding right now. It would be fairly straightforward, however, to create a lightweight wrapper interface in userland that does this... something like...

```js
const BomStream = require('...');
const fs = require('fs');
const out = fs.createWriteStream('data');
const bomout = new BomStream.Utf16LeStream(out);
bomout.write('some data');
```

While I am quite sympathetic to the problem, I don't believe we should be adding support for this in core.

@mcdado (Author) commented Mar 15, 2017

Hmm, maybe I misnamed this issue and created confusion along the way.

What I'm proposing is a way to say addBom: true, which to me seems a fairly minor challenge. I think it solves the discoverability issue, while changing nothing by default. When used, you wouldn't need to figure out in userland how to specify the BOM Unicode character.

I guess it should also be smart enough to do nothing if the encoding being passed is one that doesn't use BOMs.

@hsivonen commented Mar 16, 2017

What's the use case for writing UTF-16 without a BOM? Shouldn't a BOM automatically be added when UTF-16 output is requested?

@mcdado (Author) commented Mar 16, 2017

> What's the use case for writing UTF-16 without a BOM? Shouldn't a BOM automatically be added when UTF-16 output is requested?

That's exactly what brought me here to discuss this.

@Trott (Member) commented Jul 30, 2017

This seems stalled (and seems to me like something that should be solved as a published module before being considered for addition to core, but reasonable people can disagree on that). I'm going to close this, but if that's misguided because there's active work going on, or for some other reason, by all means comment to that effect (or re-open if GitHub allows you to).

@Trott closed this Jul 30, 2017

@mikaelfs commented Feb 15, 2019

Considering @mcdado's proposal, I looked again at the BOM definition in RFC 2781. An excerpt from Section 3.2, Byte Order Mark (BOM):

> It is important to understand that the character 0xFEFF appearing at
> any position other than the beginning of a stream MUST be interpreted
> with the semantics for the zero-width non-breaking space, and MUST
> NOT be interpreted as a byte-order mark. The contrapositive of that
> statement is not always true: the character 0xFEFF in the first
> position of a stream MAY be interpreted as a zero-width non-breaking
> space, and is not always a byte-order mark. For example, if a process
> splits a UTF-16 string into many parts, a part might begin with
> 0xFEFF because there was a zero-width non-breaking space at the
> beginning of that substring.

According to the Node.js documentation for the fs.writeFile method, data can be a string, Buffer, TypedArray, or DataView. When the user passes a string, the data length is fixed, so it can be treated as a single stream of known length. Since a BOM belongs at the first position of such a stream, it would be handy to have a built-in option to add it (e.g. an addBom option) that activates when the user supplies a string.

I stumbled into this same issue when trying to get East Asian characters in a CSV file generated with Node.js to display properly in Excel on macOS. Having to add the BOM to the data manually is not very intuitive. With the option there, users would have better insight into how to write multibyte characters into a file properly.

I wrote an article about this issue, including trial and error to properly write multibyte characters into a CSV file that can be later read across different applications.

@mcdado (Author) commented Feb 15, 2019

@mikaelfs, thank you for your input. I hope that at least this issue has the SEO juice to bubble up in search results and help confused people looking for a solution. 🙂
