Skip to content
This repository

Buffer Incorrectly Encodes "\0" as 32 for ASCII Encoding #297

Closed
bigeasy opened this Issue · 11 comments

4 participants

Alan Gutierrez ry Ben Noordhuis Koichi Kobayashi
Alan Gutierrez

To reproduce:

(new Buffer("\0", "ascii"))[0] == 32

Correct for UTF-8:

(new Buffer("\0", "utf8"))[0] == 0

For reference, here is an ASCII table.

ry

The reason for it, is this line:

http://github.com/ry/node/blob/9922e4e433996722a76edb46d14f1729f33b4bed/deps/v8/src/api.cc#L3005

which I am not sure the purpose of... I'll ask in v8-users.

Ben Noordhuis

Side note: the behaviour for UTF-8 has changed since this issue was raised.

> typeof (new Buffer("\0", "utf8"))[0]
'undefined'

Perhaps we should add this as a caveat emptor to the docs and tell people to pass non-printable characters as an array instead of a string, e.g. [0] instead of "\0" or "\u0000"?

Koichi Kobayashi koichik referenced this issue from a commit in koichik/node
Koichi Kobayashi koichik Doc improvements
Fixes #297.
d39f23a
Koichi Kobayashi
Collaborator

How about d39f23a?

Alan Gutierrez

I am so confused. Don't we have two bugs now? The "ascii" encoding converts '\0' to 32, which is a space, ' ', and 'utf8' converts to undefined? So, the documentation patch not only excuses a bug, it gives you instructions that do not solve the problem, but instead triggers a different bug.

Why does the character with the code 0 not translate to a number 0?

The behavior is inconsistent, seemingly arbitrary:

[alan@postojna ~]$ node
> "\0".charAt(0)
'\u0000'
> "\0".charCodeAt(0)
0
> typeof (new Buffer("\0", "utf8"))[0]
'undefined'
> new Buffer([0]).toString("utf8")
'\u0000'
> typeof (new Buffer("\0", "ascii"))[0]
'number'
>  (new Buffer("\0", "ascii"))[0]
32

What is the reasoning behind any of it?

Koichi Kobayashi
Collaborator

Yes, we have 2 problems.

The 'utf8' encoding problem is same as #394, we can fix it (#1210).
Therefore I did not write it for the document.

The 'ascii' encoding problem (this issue) is V8's matter.
If we fix it, Node will copy bytes oneself, but it may be slower.
Then, I think that we won't fix it.

Ben Noordhuis

@bigeasy

The bug with UTF-8 is that if the last byte is a nul byte, it gets stripped. Consider:

> Buffer('\0')
<Buffer >
> Buffer('a\0')
<Buffer 61>
> Buffer('a\0b')
<Buffer 61 00 62>
> Buffer('a\0b\0')
<Buffer 61 00 62>

koichik posted a patch for that in #394.

Alan Gutierrez

Wow. That is really, really bad. Who is implementing this upstream?

Can we implement a correct encoding in Node.js and skip the capricious V8 implementations? They are not right and there is no justification for why they are the way they are.

Why does Node.js show deference to them? Maybe in Chrome these encodings don't matter, but Node.js is for network programming, and for network programming, correct and fast encoding implementations matter.

Alan Gutierrez

I just noticed:

@koichik ~ The ascii encoding problem (this issue) is V8's matter. If we fix it, Node will copy bytes oneself, but it may be slower.

If it is written in C it won't be slower.

Koichi Kobayashi
Collaborator

@bigeasy - I think that Node can not access string's data directly, Node needs copy twice.
first, copies of Unicode data to tmp buf using v8::String::Write.
second, copies with converting from tmp buf to target Buffer.

Alan Gutierrez

@koichik - Icky. I did post an inquiry a while back on the V8 mailing list. Maybe we can patch V8 and lobby for the patch inclusion? My guess is that the 32 is a copy-and-paste from somewhere else, that it doesn't solve a problem for them. The omitted UTF-8 \0 is some string conversion logic that got pushed into the encoder where it does not belong.

It will probably take some time to figure out how to argue the point. It seems obvious to me that decoding should be agnostic about the string implementation. It is probably very logical to them that 0 is stripped and adulterated. In the last exchange there, I was being told that 0 is not a valid ASCII character.

Koichi Kobayashi koichik closed this issue from a commit
Koichi Kobayashi koichik Doc improvements
Fixes #297.
d05afa5
Koichi Kobayashi koichik closed this in d05afa5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.