The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101
Comments
But...are they??
Or maybe I would expect "position" as the label (not "offset"), and I might then expect 1, 2 and 3, respectively. As a user who doesn't know or doesn't care about UTF-8 encoding details, I might only care about the characters in my document, not the bytes that encode them. I filed a related issue recently; see #9095. But of course, your core issue, that you shouldn't be allowed to jump right into the middle of a multi-byte encoded UTF-8 character, is absolutely valid.
I've just fully read this thread. Very interesting. Continued from #9125 (comment).
And here:
If "Pos" and "Offset" are implemented as character-based, the question of whether to place the caret before or after apparently becomes irrelevant in the "Go to" dialog: accept only numbers in the 1-3 range. I'm not sure why it would make more sense to place the caret after: it's in the middle and you can move either way.
I think the deficiency described in this issue is worth fixing immediately, because data corruption can occur if one jumps into the middle of a multi-byte character and then inserts or deletes. The question of whether everything should be character (position) based or byte (offset) based is a larger issue, and can wait (and perhaps be debated under #9095 as time goes on). I think the before-versus-after question could go either way, but it is slightly easier to code in an "after" sense.
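To make the corruption risk concrete, here is a minimal Python sketch using the `A👨Z` line from the issue description. It only simulates the byte removals that the issue reports for offset 3 (one byte for Delete, two bytes for Backspace); it is not Notepad++ or Scintilla code, and the helper name `is_valid_utf8` is made up for illustration.

```python
# Byte layout of "A👨Z" in UTF-8: 41 | f0 9f 91 a8 | 5a
data = "A\U0001F468Z".encode("utf-8")
offset = 3  # points at byte 0x91, inside the emoji's 4-byte sequence


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Delete, as reported in the issue: the single byte at the offset is removed.
after_delete = data[:offset] + data[offset + 1:]

# Backspace, as reported in the issue: the two bytes before the offset are removed.
after_backspace = data[:offset - 2] + data[offset:]

print(after_delete.hex(" "), is_valid_utf8(after_delete))        # 41 f0 9f a8 5a False
print(after_backspace.hex(" "), is_valid_utf8(after_backspace))  # 41 91 a8 5a False
```

Either way the remaining bytes no longer form valid UTF-8, which is why this part seems worth fixing regardless of how the position-versus-offset debate ends.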
As you understand.
Remark: This issue was first noticed by Alan Kilborn, revisited by Peter Jones, and discussed in this topic: https://community.notepad-plus-plus.org/post/59397
Description of the Issue
When using the Search > Go to... feature with the Offset option ticked, the offsets that correspond to the bytes of a multi-byte encoding after the first one should be inaccessible!
Steps to Reproduce the Issue
Open a new tab in N++
If necessary, use Encoding > Convert to UTF-8 to get an empty UTF-8 encoded file
Just type in the text A👨Z on the first line
Note that, as the emoji MAN 👨 is the Unicode character of code point U+1F468, this line can be described, in a UTF-8 encoded file, as six bytes: 41 for A (offset 0), F0 9F 91 A8 for 👨 (offsets 1 to 4) and 5A for Z (offset 5)
If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0
If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1
If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 5
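The three offsets above can be double-checked with a short Python sketch. It only relies on the standard UTF-8 encoder; the function name `char_boundaries` is hypothetical and has nothing to do with how Notepad++ computes offsets internally.

```python
def char_boundaries(text: str) -> list:
    """Return the UTF-8 byte offset at which each character of `text` starts."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        # Advance by the number of bytes this character takes in UTF-8.
        position += len(ch.encode("utf-8"))
    return offsets


line = "A\U0001F468Z"  # A, MAN emoji (U+1F468), Z

print(line.encode("utf-8").hex(" "))  # 41 f0 9f 91 a8 5a
print(char_boundaries(line))          # [0, 1, 5]
```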
All these offsets are correct. But these values should be the only offsets that can be typed into the You want to go to zone!
Actual Behavior
Now, let's force a move to offset 3, exactly in the middle of the multi-byte sequence of the emoji char (byte 91), and then click on the Go button
The caret then appears to be right before the Z letter. In fact:
If you hit the Backspace key, you get the text Ax91xA8Z, so the first two bytes of the encoding, xF0x9F, before the offset, are deleted
If you hit the Delete key, you get the text AxF0x9FxA8Z, so the next byte, x91, after the offset, is deleted
In addition, as you can see, the actions of the two keys Backspace and Delete are not symmetrical, as the former deletes two bytes (the beginning of the multi-byte sequence) whereas the latter deletes just one byte (x91)
Expected Behavior
The offset values that correspond to the individual bytes of a multi-byte sequence after the first byte, in a Unicode encoded file, should not be allowed! For instance, in the example above, the allowed values should be, exclusively, 0, 1 and 5
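One possible way to enforce this, sketched in Python rather than in Notepad++'s own C++ code: in valid UTF-8, every byte after the first of a multi-byte sequence is a continuation byte of the form 10xxxxxx, so an offset can be rejected, or moved forward, whenever it points at such a byte. The names `is_char_boundary` and `snap_forward` are hypothetical, and the sketch assumes the buffer holds well-formed UTF-8.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if `offset` does not point at a UTF-8 continuation byte (10xxxxxx)."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # end of buffer is always a valid caret position
    return (data[offset] & 0xC0) != 0x80


def snap_forward(data: bytes, offset: int) -> int:
    """Clamp `offset` to the buffer, then move it forward to the next character start."""
    offset = max(0, min(offset, len(data)))
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


data = "A\U0001F468Z".encode("utf-8")  # 41 f0 9f 91 a8 5a

allowed = [o for o in range(len(data)) if is_char_boundary(data, o)]
print(allowed)                # [0, 1, 5]
print(snap_forward(data, 3))  # 5
```

Rejecting the value matches the behavior requested here; snapping forward is the "after" variant mentioned in the comments. The end-of-buffer offset (6 in this example) is also accepted by `is_char_boundary`, since the caret can legitimately sit after the last character.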
Then, the Backspace and Delete keys would each act on exactly one character, as expected!
Best Regards,
guy038