Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer.indexOf is incorrect in utf16le encoding for odd byteOffset #26448

Open
hkjackey opened this issue Mar 5, 2019 · 27 comments
Open

buffer.indexOf is incorrect in utf16le encoding for odd byteOffset #26448

hkjackey opened this issue Mar 5, 2019 · 27 comments
Assignees
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs.

Comments

@hkjackey
Copy link

hkjackey commented Mar 5, 2019

Version: v11.10.1
Platform: Windows 10 (64-bits)
Subsystem: buffer
File Encoding: UTF8

Please consider the following 2 lines of code:
let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));

In this example, the expected correct output should be 4.
However, the result is 0.

@addaleax addaleax added the buffer Issues and PRs related to the buffer subsystem. label Mar 5, 2019
@abhinavsagar
Copy link

It would be much better if you showed exactly what the error log displays.

@mkopa
Copy link

mkopa commented Mar 6, 2019

@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2.
Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2.
If you put offset equal 1 you have 1 div 2 mul 2 == 0.

@addaleax
Copy link
Member

addaleax commented Mar 7, 2019

Currently, indexOf with UTF16-LE is odd anyway in another respect as well, namely that we only search for the needle at even indices:

> Buffer.from('00aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
-1
> Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
2

I’m not sure if this is “correct”, because some people might expect this?

/cc @nodejs/buffer

@RReverser
Copy link
Member

I think it's correct as we wouldn't really want to break codepoint boundaries (although need to recheck if same holds for UTF-8 or any other encoding).

@seishun
Copy link
Contributor

seishun commented Mar 7, 2019

You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

I suggest closing as expected behavior.

@RReverser
Copy link
Member

You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

No, I do mean a codepoint (1-4 bytes in case of UTF-8).

@seishun
Copy link
Contributor

seishun commented Mar 7, 2019

In UTF-16 a codepoint can be represented by one or two 16-bit code units. I don't think Buffer.indexOf has any checks for that.

@RReverser
Copy link
Member

I don't think Buffer.indexOf has any checks for that.

Hmm that's another good question (although personally I'd prefer it to check if it doesn't).

@hkjackey
Copy link
Author

hkjackey commented Mar 8, 2019

@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2.
Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2.
If you put offset equal 1 you have 1 div 2 mul 2 == 0.

@mkopa Thanks for the reply.

I agree that utf16le encoding is always two bytes, but buffer is not.
"buffer" can be intended to store a mix of many things.
For example, the beginning 7 bytes of a buffer is used for some other things,
then starting from the 7th byte is used to store utf16le string.
In this case, an odd byteOffset is needed.

I regarded the current behavior as a bug because according to the NodeJS specification:
https://nodejs.org/api/buffer.html#buffer_buf_indexof_value_byteoffset_encoding
it does not mention that odd byteOffset is not allowed.

I would accept the current behavior not a bug if the NodeJS specification mentioned that.

@hkjackey
Copy link
Author

hkjackey commented Mar 8, 2019

Hi everybody!
Below I want to persuade that returning -1 for odd byteOffset in utf16le can be really weird.

Please consider the following 5 lines of code:
let buf = Buffer.alloc(7);
let str = '\u6881\u6882\u6881';
console.log(buf.write(str, 1, 'utf16le')); //6 (successfully written)
console.log(buf); //<Buffer 00 81 68 82 68 81 68>
console.log(buf.indexOf(str, 1, 'utf16le')); //-1 (NOT found)

I write a string at odd byteOffset 1 but finding it using indexOf at the same position
give me -1 which means "NOT found".
If anyone still regard the above behavior valid, then I have nothing to say.

@BridgeAR BridgeAR added the confirmed-bug Issues with confirmed bugs. label Mar 8, 2019
@seishun seishun self-assigned this Mar 8, 2019
@seishun
Copy link
Contributor

seishun commented Mar 8, 2019

That might be a breaking change - if the UTF-16 data starts at an even offset, indexOf might find a code unit composed of an odd byte and an even byte that happen to match the input value. For instance, this test breaks.

However, the behavior in the issue description is clearly a bug - the offset should be rounded up to an even value, which is a simple fix. Thoughts?

@addaleax
Copy link
Member

addaleax commented Mar 8, 2019

the offset should be rounded up to an even value, which is a simple fix. Thoughts?

I would disagree – I think buffer.indexOf(str, offset, enc) should work the same way that buffer.slice(offset).indexOf(str, enc) works (with an adjusted return value).

@addaleax
Copy link
Member

addaleax commented Mar 8, 2019

@seishun Would that work, just using .slice() if the offset is odd and then adjusting the return value? That way you don’t run into that issue you described with an odd byte and an even byte that happen to match, right?

@seishun
Copy link
Contributor

seishun commented Mar 8, 2019

If the offset is odd, buffer.slice(offset).indexOf(str, enc) will only consider byte pairs that are odd and even in the original buffer. For example, Buffer.from('0000aaaa', 'hex').slice(1).indexOf('\uaaaa', 'utf16le') currently fails, but Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 1, 'utf16le') succeeds.

@addaleax
Copy link
Member

addaleax commented Mar 8, 2019

@seishun Yeah, I think that’s the behaviour that I would expect – by specifying an odd offset to begin with rather than an even one, I’m saying that I want to look at odd + even pairs rather than even + odd ones.

@seishun
Copy link
Contributor

seishun commented Mar 8, 2019

That seems unintuitive. Plus it wouldn't fix the original issue - indexOf would return -1 rather than the desired 4 or the current 0.

@BridgeAR
Copy link
Member

BridgeAR commented Mar 8, 2019

@seishun it would AFAIK return 2, not -1? I agree with @addaleax and would also expect slice(1) to be used in this case.

@hkjackey
Copy link
Author

hkjackey commented Mar 9, 2019

Thank everybody for handling this issue!

@seishun
Copy link
Contributor

seishun commented Mar 9, 2019

@BridgeAR Have you tried it?

let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));

@addaleax
Copy link
Member

addaleax commented Mar 9, 2019

@seishun Yeah, I think that’s (another) bug then. :/

@seishun
Copy link
Contributor

seishun commented Mar 9, 2019

@addaleax So what would be the expected behaviour?

@addaleax
Copy link
Member

addaleax commented Mar 9, 2019

@seishun I would say in your example both methods should return -1. I think it shouldn’t matter whether a Buffer was created through .slice(), or allocated directly.

i.e., if buf is a Buffer, then buf.indexOf(...) and Buffer.from(buf).indexOf(...) should return the same results.

@hkjackey
Copy link
Author

@BridgeAR Have you tried it?

let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));

@addaleax So what would be the expected behaviour?

I suggest returning
4 for the 1st case (without slice) and
3 for the 2nd case (with slice) after the fix.

@seishun
Copy link
Contributor

seishun commented Mar 10, 2019

So we have a conflict of opinions.

@addaleax (and @BridgeAR?) thinks indexOf with 'utf16le' should look only at odd-even byte pairs if the offset is odd, and only at even-odd pairs otherwise.
@hkjackey thinks it should look at both odd-even and even-odd byte pairs regardless of the offset.

How should we proceed?

@addaleax
Copy link
Member

I’m personally also okay with what @hkjackey’s suggestion – that’s most consistent with what .indexOf() does for other kinds of arguments.

@seishun
Copy link
Contributor

seishun commented Mar 10, 2019

That brings us back to my comment. If you think it's fine then I can just delete that test and my PR is practically ready.

@hkjackey
Copy link
Author

@addaleax Thanks for your support!
Also thank @seishun for fixing this bug!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants