fs.readFileSync(filename, 'utf8') doesn't strip BOM markers #1918

Closed
dobesv opened this Issue Oct 21, 2011 · 11 comments

dobesv commented Oct 21, 2011

Environment: cloud9ide.com, node version 0.4.5

If I read a file using fs.readFileSync(filename, 'utf8') that is encoded using UTF8 with BOM, the BOM is included in the resulting string.

I think the routine to decode UTF8 is supposed to automatically strip the BOM from the start of the stream before returning the string.

dobesv commented Oct 21, 2011

Workaround:

body = body.replace(/^\uFEFF/, '');

Apply this after reading a UTF-8 file when you are not sure whether it starts with a BOM.
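To make the symptom and the workaround concrete, here is a minimal sketch. It simulates the string that fs.readFileSync(filename, 'utf8') would return for a BOM-prefixed JSON file by decoding the bytes directly, so no file is needed; the variable names are only illustrative.

```javascript
// Bytes of a tiny UTF-8 file: BOM (EF BB BF) followed by "{}".
// Decoding a Buffer with 'utf8' keeps the BOM as U+FEFF, just as the report describes.
const body = Buffer.from([0xEF, 0xBB, 0xBF, 0x7B, 0x7D]).toString('utf8');

// With the BOM still in place, JSON.parse rejects the text,
// because U+FEFF is not valid JSON whitespace.
let parsedWithBom = true;
try {
  JSON.parse(body);
} catch (err) {
  parsedWithBom = false;
}

// The workaround: drop one leading U+FEFF, if there is one.
const cleaned = body.replace(/^\uFEFF/, '');
const parsed = JSON.parse(cleaned);

console.log(body.length, cleaned.length, parsedWithBom); // prints: 3 2 false
```

The regular expression is anchored with ^, so only a BOM at the very start of the string is removed; a U+FEFF appearing later in the text is left alone.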

koichik commented Oct 21, 2011

If fs.readFileSync() stripped the BOM automatically, then after running

var text = fs.readFileSync('foo.txt', 'utf8');
fs.writeFileSync('foo.txt', text, 'utf8');

the BOM would be lost from the file.

dobesv commented Oct 24, 2011

Hmm maybe it is something that was fixed in a more recent version of node.js?

koichik commented Oct 24, 2011

No, I mean the BOM was lost from a file ('foo.txt') after fs.writeFileSync().
fs.writeFileSync() cannot add the BOM automatically because it depends on the application whether the BOM is necessary.
Therefore, I think that the BOM should not be removed automatically.
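The round-trip concern can be sketched without touching the file system, assuming a file that contains a UTF-8 BOM followed by the text "hi". The "stripped" branch below is hypothetical: it imitates what an automatically stripping read would hand to the application, which is not what Node does.

```javascript
// Original file contents: UTF-8 BOM followed by "hi".
const original = Buffer.from([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Current behaviour: the BOM survives decoding as U+FEFF,
// so encoding the string again reproduces the original bytes.
const kept = original.toString('utf8');
const roundTripKept = Buffer.from(kept, 'utf8');

// Hypothetical behaviour: the read strips the BOM,
// and the write has no way of knowing it should put it back.
const stripped = kept.replace(/^\uFEFF/, '');
const roundTripStripped = Buffer.from(stripped, 'utf8');

console.log(roundTripKept.equals(original));     // prints: true
console.log(roundTripStripped.equals(original)); // prints: false
console.log(roundTripStripped.length);           // prints: 2
```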

koichik closed this Oct 24, 2011

dobesv commented Nov 4, 2011

@koichik - can you clarify why you closed this issue? If I read a UTF-8 file into a string, the string should not have a BOM in it. As I understand UTF-8 decoding, the BOM is not included in the decoded string.

Applications that expect the BOM to be present can add it back on when they write out the file, or to preserve the BOM they can read/write the file as binary.
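One way an application could do this is to read the file as binary, remember whether a BOM was present, and add it back when writing. The following is only a sketch of that idea; decodeUtf8 and encodeUtf8 are hypothetical helper names, not part of Node.

```javascript
// The three bytes of the UTF-8 BOM.
const BOM = Buffer.from([0xEF, 0xBB, 0xBF]);

// Hypothetical helper: split raw bytes into the decoded text
// plus a flag recording whether the bytes started with a BOM.
function decodeUtf8(bytes) {
  const hasBom = bytes.length >= 3 && bytes.subarray(0, 3).equals(BOM);
  const text = (hasBom ? bytes.subarray(3) : bytes).toString('utf8');
  return { text, hasBom };
}

// Hypothetical helper: encode text and prepend the BOM only if asked to.
function encodeUtf8(text, hasBom) {
  const body = Buffer.from(text, 'utf8');
  return hasBom ? Buffer.concat([BOM, body]) : body;
}

// Usage with raw bytes, e.g. the result of fs.readFileSync(filename) with no encoding.
const input = Buffer.from([0xEF, 0xBB, 0xBF, 0x68, 0x69]);
const decoded = decodeUtf8(input);
const output = encodeUtf8(decoded.text.toUpperCase(), decoded.hasBom);
console.log(decoded.text, decoded.hasBom, output.length); // prints: hi true 5
```

The application works on a BOM-free string, and the file keeps its BOM only if it had one to begin with.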

dobesv commented Nov 4, 2011

OK I read a huge argument about this subject on the python mailing list and a bug report on the JVM systems and I see that it is more controversial than I had originally thought.

So, never mind ... looks like it's up to programmers to remove the BOM from UTF-8 files themselves.

What they did in Python was interesting - they added a new codec called 'utf-8-sig' which strips the BOM if present when decoding and emits a BOM when encoding to bytes. This allows the programmer to decide whether to use a BOM or not.

See http://docs.python.org/library/codecs.html:

"On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file."

Do you think that approach would be acceptable for use in node?

koichik commented Nov 4, 2011

You can easily write such a utility (e.g. myfs.readUtf8FileSync()) in userland.
So... I do not think that it is necessary to include 'utf8-sig' in Node core.
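A userland utility along those lines might look like the sketch below. readUtf8FileSync is the name suggested above; writeUtf8SigFileSync is a hypothetical companion added for illustration. Both loosely imitate the documented behaviour of Python's utf-8-sig codec and are assumptions, not an existing Node API.

```javascript
const fs = require('fs');

// Hypothetical userland helpers, e.g. the contents of a myfs.js module.

// Decode like utf-8-sig: skip one leading BOM if the file has it.
function readUtf8FileSync(filename) {
  const text = fs.readFileSync(filename, 'utf8');
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Encode like utf-8-sig: always write the BOM before the text.
function writeUtf8SigFileSync(filename, text) {
  fs.writeFileSync(filename, '\uFEFF' + text, 'utf8');
}

module.exports = { readUtf8FileSync, writeUtf8SigFileSync };
```

An application that wants BOM-free output keeps calling fs.writeFileSync() directly, so the choice stays with the caller, as in the Python design.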

MylesPenlington commented

If the utf8 file has a BOM, then the latest node (0.6.18) leaves the first character of the string as Unicode 65279, which is 0xFE 0xFF. That is not what was read (isn't that the UTF-16 BOM?), since the UTF-8 signature on a utf8 file is 0xef, 0xbb, 0xbf. So the current file reading does not really make sense at all.

dobesv commented May 24, 2012

Hi Myles,

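It is confusing, but it makes sense in a way: when you decode those three
bytes (0xEF 0xBB 0xBF) using the UTF-8 decoding algorithm, you get the single
character U+FEFF (decimal 65279), which is the BOM code point. The same code
point is written as the bytes 0xFE 0xFF in big-endian UTF-16.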
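The arithmetic can be checked by hand. A three-byte UTF-8 sequence has the bit layout 1110xxxx 10yyyyyy 10zzzzzz, and the code point is the concatenation of the x, y and z bits. The function name below is only illustrative.

```javascript
// Decode one three-byte UTF-8 sequence (1110xxxx 10yyyyyy 10zzzzzz)
// by masking off the marker bits and joining the payload bits.
function decodeThreeByteSequence(b0, b1, b2) {
  return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
}

// 0xEF -> 0xF  -> 0xF000
// 0xBB -> 0x3B -> 0x0EC0
// 0xBF -> 0x3F -> 0x003F
const codePoint = decodeThreeByteSequence(0xEF, 0xBB, 0xBF);
console.log(codePoint, codePoint.toString(16)); // prints: 65279 feff

// Node's own decoder agrees: the three bytes become one character, U+FEFF.
const decoded = Buffer.from([0xEF, 0xBB, 0xBF]).toString('utf8');
console.log(decoded.length, decoded.charCodeAt(0)); // prints: 1 65279
```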

On Wed, May 23, 2012 at 7:35 PM, MylesPenlington wrote:

If the utf8 file has a BOM, then in the latest node (0.6.18) it leaves the
first character of the string as unicode 65279 - which is 0xFE 0xFF - which
is not what was read (that is the utf16 BOM?) - as the utf8 signature on a
utf8 file is 0xef, 0xbb, 0xbf - so the current file reading does not really
make sense at all.



@kvz kvz pushed a commit to kvz/deprecated that referenced this issue Feb 21, 2014

Artur Dorochowicz Allow for Unicode Byte Order Mark in analyzed files. node-jshint reads with fs.readFileSync with utf-8 encoding, but node.js keeps BOM in the returned string (see nodejs/node-v0.x-archive#1918) which is detected by jshint as unsafe characters.
b36ac70

@paulfitz paulfitz added a commit to paulfitz/daff that referenced this issue Oct 23, 2015

@paulfitz paulfitz strip BOM marker when loading utf8 text
See nodejs/node-v0.x-archive#1918

Thanks @paulsaurels for reporting a case of this.
7261b81

@rajkumar42 rajkumar42 added a commit to rajkumar42/omnisharp-vscode that referenced this issue Jul 18, 2016

@rajkumar42 rajkumar42 fs.readFileSync(filename, 'utf8') doesn't strip BOM markers
Taking the workaround specified here nodejs/node-v0.x-archive#1918
2dff400

@rajkumar42 rajkumar42 added a commit to OmniSharp/omnisharp-vscode that referenced this issue Jul 18, 2016

@rajkumar42 rajkumar42 fs.readFileSync(filename, 'utf8') doesn't strip BOM markers (#580)
Taking the workaround specified here nodejs/node-v0.x-archive#1918
2801e0f