buffer.indexOf is incorrect in utf16le encoding for odd byteOffset #26448

hkjackey · 2019-03-05T08:00:07Z

Version: v11.10.1
Platform: Windows 10 (64-bits)
Subsystem: buffer
File Encoding: UTF8

Please consider the following 2 lines of code:
let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));

In this example, the expected output is 4.
However, the actual result is 0.
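
For reference, here is a minimal sketch of the byte layout behind this example (variable names are illustrative only). Each of the three characters occupies one 2-byte little-endian code unit, so the second '\u6881' begins at byte offset 4:

```javascript
// Encode three BMP characters as UTF-16LE: one 2-byte code unit each.
const layoutBuf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
const layoutHex = layoutBuf.toString('hex');
console.log(layoutBuf.length); // 6
console.log(layoutHex);        // 816882688168

// Read the code units back at the even byte offsets 0, 2 and 4.
const codeUnits = [0, 2, 4].map((i) => layoutBuf.readUInt16LE(i).toString(16));
console.log(codeUnits);        // [ '6881', '6882', '6881' ]
```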

abhinavsagar · 2019-03-06T07:04:58Z

It would be much better if you showed exactly what the error log displays.

mkopa · 2019-03-06T18:55:28Z

@hkjackey ucs2 (utf16le) encoding always uses two bytes, so byteOffset must be a multiple of 2.
Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945. You can see that the offset is divided by 2 and the result is multiplied by 2.
If you pass an offset of 1, you get 1 div 2 mul 2 == 0.
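
The arithmetic described above can be sketched as follows. This only models the "divide by 2, then multiply by 2" step with integer division; `roundDownToEven` is a made-up name and this is not the actual C++ source:

```javascript
// Illustration only: integer-divide the byte offset by 2, then multiply by 2.
// Any odd offset is silently rounded down to the previous even offset.
function roundDownToEven(byteOffset) {
  return Math.floor(byteOffset / 2) * 2;
}

console.log(roundDownToEven(1)); // 0
console.log(roundDownToEven(4)); // 4
console.log(roundDownToEven(7)); // 6
```

This is why, in the report above, a search requested from offset 1 effectively starts at offset 0.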

addaleax · 2019-03-07T09:43:57Z

Currently, indexOf with UTF16-LE is odd anyway in another respect as well, namely that we only search for the needle at even indices:

> Buffer.from('00aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
-1
> Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
2

I’m not sure if this is “correct”, because some people might expect this?

/cc @nodejs/buffer

RReverser · 2019-03-07T10:08:54Z

I think it's correct as we wouldn't really want to break codepoint boundaries (although need to recheck if same holds for UTF-8 or any other encoding).

seishun · 2019-03-07T10:13:25Z

You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

I suggest closing as expected behavior.

RReverser · 2019-03-07T10:42:13Z

> You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte.

No, I do mean a codepoint (1-4 bytes in case of UTF-8).

seishun · 2019-03-07T13:03:19Z

In UTF-16 a codepoint can be represented by one or two 16-bit code units. I don't think Buffer.indexOf has any checks for that.

RReverser · 2019-03-07T13:12:06Z

> I don't think Buffer.indexOf has any checks for that.

Hmm that's another good question (although personally I'd prefer it to check if it doesn't).

hkjackey · 2019-03-08T08:37:29Z

> @hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2.
> Look at the code: https://github.com/nodejs/node/blob/master/src/node_buffer.cc#L945 you see offset has been divided by 2 and result multiplied by 2.
> If you put offset equal 1 you have 1 div 2 mul 2 == 0.

@mkopa Thanks for the reply.

I agree that utf16le encoding always uses two bytes, but a buffer does not have to.
A buffer can be intended to store a mix of many things.
For example, the first 7 bytes of a buffer may be used for other data,
with a utf16le string stored starting at byte offset 7.
In this case, an odd byteOffset is needed.
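
A minimal sketch of such a mixed layout, assuming a hypothetical 7-byte header filled with placeholder bytes (`HEADER_SIZE` and the other names are illustrative only):

```javascript
// Hypothetical layout: 7 bytes of unrelated header data, then UTF-16LE text.
const HEADER_SIZE = 7; // assumption for illustration
const text = '\u6881\u6882';

const mixed = Buffer.alloc(HEADER_SIZE + Buffer.byteLength(text, 'utf16le'));
mixed.fill(0xff, 0, HEADER_SIZE); // placeholder header bytes

// Write the string at the odd byte offset 7.
const written = mixed.write(text, HEADER_SIZE, 'utf16le');
console.log(written); // 4

// Decoding from the same odd offset recovers the original string.
const decoded = mixed.toString('utf16le', HEADER_SIZE);
console.log(decoded === text); // true
```

Writing and decoding both accept the odd offset here, which is what makes an odd byteOffset for searching a natural thing to want.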

I regard the current behavior as a bug because the Node.js documentation:
https://nodejs.org/api/buffer.html#buffer_buf_indexof_value_byteoffset_encoding
does not mention that an odd byteOffset is not allowed.

I would accept that the current behavior is not a bug if the documentation mentioned that.

hkjackey · 2019-03-08T11:00:29Z

Hi everybody!
Below I want to show that returning -1 for an odd byteOffset in utf16le can be really weird.

Please consider the following 5 lines of code:
let buf = Buffer.alloc(7);
let str = '\u6881\u6882\u6881';
console.log(buf.write(str, 1, 'utf16le')); //6 (successfully written)
console.log(buf); //<Buffer 00 81 68 82 68 81 68>
console.log(buf.indexOf(str, 1, 'utf16le')); //-1 (NOT found)

I wrote a string at the odd byteOffset 1, but searching for it with indexOf from the same position
returns -1, which means "NOT found".
If anyone still regards the above behavior as valid, then I have nothing more to say.
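
One possible way to sidestep the alignment question, sketched under the assumption that a Buffer needle passed without an encoding argument is compared byte by byte (the variable names are illustrative):

```javascript
const str = '\u6881\u6882\u6881';
const target = Buffer.alloc(7);
target.write(str, 1, 'utf16le'); // bytes: 00 81 68 82 68 81 68

// Encode the needle once, then search for its raw bytes.
// No encoding argument is passed, so this is a plain byte search.
const needle = Buffer.from(str, 'utf16le');
const pos = target.indexOf(needle, 1);
console.log(pos); // 1
```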

seishun · 2019-03-08T20:51:05Z

That might be a breaking change - if the UTF-16 data starts at an even offset, indexOf might find a code unit composed of an odd byte and an even byte that happen to match the input value. For instance, this test breaks.

However, the behavior in the issue description is clearly a bug - the offset should be rounded up to an even value, which is a simple fix. Thoughts?

addaleax · 2019-03-08T21:02:17Z

> the offset should be rounded up to an even value, which is a simple fix. Thoughts?

I would disagree – I think buffer.indexOf(str, offset, enc) should work the same way that buffer.slice(offset).indexOf(str, enc) works (with an adjusted return value).
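
A rough sketch of that equivalence, using a hypothetical helper `indexOfViaSlice` (not part of Node.js) and assuming a non-negative offset within the buffer. It is shown with UTF-8 input, where alignment does not matter, purely to illustrate the "adjusted return value":

```javascript
// Hypothetical helper: search a view that starts at `offset`, then shift a
// successful result back into the coordinates of the original buffer.
function indexOfViaSlice(buffer, value, offset, encoding) {
  const view = buffer.subarray(offset); // shares memory, like slice() above
  const found = view.indexOf(value, 0, encoding);
  return found === -1 ? -1 : found + offset;
}

const ascii = Buffer.from('abcabc', 'utf8');
console.log(indexOfViaSlice(ascii, 'a', 1, 'utf8')); // 3
console.log(ascii.indexOf('a', 1, 'utf8'));          // 3
console.log(indexOfViaSlice(ascii, 'z', 1, 'utf8')); // -1
```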

addaleax · 2019-03-08T21:04:16Z

@seishun Would that work, just using .slice() if the offset is odd and then adjusting the return value? That way you don’t run into that issue you described with an odd byte and an even byte that happen to match, right?

seishun · 2019-03-08T21:13:09Z

If the offset is odd, buffer.slice(offset).indexOf(str, enc) will only consider byte pairs that are odd and even in the original buffer. For example, Buffer.from('0000aaaa', 'hex').slice(1).indexOf('\uaaaa', 'utf16le') currently fails, but Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 1, 'utf16le') succeeds.

addaleax · 2019-03-08T21:25:02Z

@seishun Yeah, I think that’s the behaviour that I would expect – by specifying an odd offset to begin with rather than an even one, I’m saying that I want to look at odd + even pairs rather than even + odd ones.

seishun · 2019-03-08T21:30:44Z

That seems unintuitive. Plus it wouldn't fix the original issue - indexOf would return -1 rather than the desired 4 or the current 0.

BridgeAR · 2019-03-08T22:35:42Z

@seishun it would AFAIK return 2, not -1? I agree with @addaleax and would also expect slice(1) to be used in this case.

hkjackey · 2019-03-09T04:05:52Z

Thanks, everybody, for handling this issue!

seishun · 2019-03-09T08:34:14Z

@BridgeAR Have you tried it?

let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));

addaleax · 2019-03-09T10:38:56Z

@seishun Yeah, I think that’s (another) bug then. :/

seishun · 2019-03-09T10:53:58Z

@addaleax So what would be the expected behaviour?

addaleax · 2019-03-09T10:59:02Z

@seishun I would say in your example both methods should return -1. I think it shouldn’t matter whether a Buffer was created through .slice(), or allocated directly.

i.e., if buf is a Buffer, then buf.indexOf(...) and Buffer.from(buf).indexOf(...) should return the same results.

hkjackey · 2019-03-10T04:48:17Z

> @BridgeAR Have you tried it?
>
> let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
> console.log(buf.indexOf('\u6881', 1, 'utf16le'));
> console.log(buf.slice(1).indexOf('\u6881', 'utf16le'));

> @addaleax So what would be the expected behaviour?

I suggest returning
4 for the 1st case (without slice) and
3 for the 2nd case (with slice) after the fix.

seishun · 2019-03-10T10:59:45Z

So we have a conflict of opinions.

@addaleax (and @BridgeAR?) thinks indexOf with 'utf16le' should look only at odd-even byte pairs if the offset is odd, and only at even-odd pairs otherwise.
@hkjackey thinks it should look at both odd-even and even-odd byte pairs regardless of the offset.

How should we proceed?
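
To make the two positions concrete, here is a byte-level model of each. These are illustrative loops with made-up names (`matchesAt`, `search`), not Node.js internals, and they assume a non-empty needle:

```javascript
// True if `needle` occurs in `haystack` starting exactly at byte `pos`.
function matchesAt(haystack, needle, pos) {
  if (pos + needle.length > haystack.length) return false;
  for (let i = 0; i < needle.length; i++) {
    if (haystack[pos + i] !== needle[i]) return false;
  }
  return true;
}

// step = 2 keeps the parity of `offset`; step = 1 tries every byte position.
function search(haystack, needle, offset, step) {
  for (let pos = offset; pos < haystack.length; pos += step) {
    if (matchesAt(haystack, needle, pos)) return pos;
  }
  return -1;
}

const hay = Buffer.from('\u6881\u6882\u6881', 'utf16le'); // 81 68 82 68 81 68
const ndl = Buffer.from('\u6881', 'utf16le');             // 81 68

console.log(search(hay, ndl, 1, 2));             // -1 (same-parity pairs only)
console.log(search(hay, ndl, 1, 1));             // 4  (every byte position)
console.log(search(hay.subarray(1), ndl, 0, 1)); // 3  (after slicing off one byte)
```

The last two results correspond to the values 4 and 3 suggested earlier in the thread for the unsliced and sliced cases.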

addaleax · 2019-03-10T11:15:46Z

I’m personally also okay with @hkjackey’s suggestion – that’s most consistent with what .indexOf() does for other kinds of arguments.

seishun · 2019-03-10T11:22:13Z

That brings us back to my comment. If you think it's fine then I can just delete that test and my PR is practically ready.

hkjackey · 2019-03-11T09:39:23Z

@addaleax Thanks for your support!
Thanks also to @seishun for fixing this bug!

addaleax added the buffer Issues and PRs related to the buffer subsystem. label Mar 5, 2019

BridgeAR added the confirmed-bug Issues with confirmed bugs. label Mar 8, 2019

seishun self-assigned this Mar 8, 2019

seishun mentioned this issue Mar 11, 2019

buffer: always let indexOf look at odd indexes #26594

Closed

3 tasks
