The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

guy038 · 2020-11-04T21:20:17Z

Remark : This issue has been first noticed by Alan Kilborn, revisited by Peter Jones and discussed in that topic :

https://community.notepad-plus-plus.org/post/59397

Description of the Issue

When using the Search > Go to... feature, with the Offset option ticked, the different offsets corresponding to each byte of a multi-bytes encoding, after the first one, should be inaccessible !

Steps to Reproduce the Issue

Open a new tab in N++
if, necessary, use the Encoding > Convert to UTF-8 to get an empty UTF-8 encoded file
Just type in the text A👨Z on the first line

Note that, as the emoji MAN 👨 is the Unicode character of code-point U+1F468, we can describe this line, in an UTF-8 encoded file, as :

Characters :   A         👨        Z
Bytes      :   41  F0  9F  91  A8  5A
Offset     :   0   1   2   3   4   5

If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0
If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1
If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 5

All these offsets are correct. But these values should be the only possible offsets to type in in the You want to go to zone !

Actual Behavior

Now, let's force a move to offset 3, exactly in the middle of the multi-bytes sequence of the emoji char ( byte 91 ) and then click on the Go button

Seemingly, the caret seems right before the Z letter. In fact :
- If you hit the Backspace key, you get the text Ax91xA8Z, so the first two bytes of the encoding xF0x9F, before the offset, are deleted
- If you hit the Delete key, you get the text AxF0x9FxA8Z, so the next x91 byte, after the offset, is deleted
In addition, as you can see, the action of the two keys Backspace and Delete are not symmetrical as the former deletes two bytes ( the beginning of the multi-bytes sequence ) whereas the latter just deletes one byte ( x91 )

Expected Behavior

The offsets values, relative to the individual bytes of a multi-bytes sequence, after the 1 byte, in a Unicode encoded file, should not be allowed ! For instance, in the example above, the allowed values should be, exclusively, 0, 1 and 5

Then, the Backspace and Delete would just act on one character, only, as expected !

Best Regards,

guy038

The text was updated successfully, but these errors were encountered:

sasumner · 2020-11-04T21:45:54Z

@guy038

All these offsets are correct

But...are they??
As a user, I would expect:

If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0
If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1
If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 2 (DIFFERENT)

Or maybe I would expect "position" as the label (not "offset"), and I might then expect 1, 2 and 3, respectively.

As a user, that doesn't know or doesn't care about UTF-8 encoding details, I might only care about characters in my document, not the bytes that encode them.

I filed a related issue recently, see #9095

BUT of course, your core issue, that you shouldn't be allowed to jump right into the middle of a multi-byte encoded UTF-8 character, is absolutely valid.

Yaron10 · 2020-11-09T19:35:47Z

I've just fully read this thread. Very interesting.

Continued from #9125 (comment).

In developing a solution for multibyte UTF-8 characters (for 9101), it made sense that if a user tries to offset into the middle of a multibyte character, the caret would be placed after that multibyte character.
The same solution works for this issue, but it places the caret after the LF when offsetting into the middle of a CRLF pair. And that means the caret will be placed at the start of the next line. I hope that isn't a critical problem for you, but I think I will move ahead with that solution.

And here:

Or maybe I would expect "position" as the label (not "offset"), and I might then expect 1, 2 and 3, respectively.

If "Pos" and "Offset" are implemented as character-based, the question whether to place the caret before or after can apparently be irrelevant in the "Go to" dialog: accept only numbers in the 1-3 range.
It wouldn't work with "-p" from the command-line.

I'm not sure why it would make more sense to place the caret after: it's in the middle and you can move either way.
So why not implement before as in manually placing the caret between CR and LF?

sasumner · 2020-11-09T20:02:15Z

I think it is worth closing the deficiency of this issue immediately, because data corruption can occur if one jumps into the middle of a multibyte character and then inserts/deletes.

The "should everything be character(position)-based or should everything be byte(offset)-based" is a larger issue, and can wait (and be debated perhaps under 9095 as time goes on).
In the purest sense of how things are currently, probably a offset that is specified that is inside a character maybe should result in an error popup, but with the just-mentioned debate still open I don't want to go too far with it, just close the data-corruption possibility.

I think the before-versus-after question could go either way, but it is slightly easier to code in an "after" sense.

Yaron10 · 2020-11-09T20:19:55Z

As you understand.
Thank you for working on it. 👍

sasumner mentioned this issue Nov 8, 2020

Caret not visible if "Go to -> Offset" is EOL #9125

Closed

sasumner mentioned this issue Nov 9, 2020

Disallow Goto dlg offset option from moving to position inside multibyte char or between CR and LF #9129

Closed

donho closed this as completed in f027e92 Nov 12, 2020

sasumner mentioned this issue Nov 12, 2020

Specifying -p on command line can lead to data corruption #9142

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

guy038 commented Nov 4, 2020

sasumner commented Nov 4, 2020

Yaron10 commented Nov 9, 2020

sasumner commented Nov 9, 2020

Yaron10 commented Nov 9, 2020

The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

Comments

guy038 commented Nov 4, 2020

Description of the Issue

Steps to Reproduce the Issue

Actual Behavior

Expected Behavior

sasumner commented Nov 4, 2020

Yaron10 commented Nov 9, 2020

sasumner commented Nov 9, 2020

Yaron10 commented Nov 9, 2020