Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The "Search > Go to..." feature should not allow moving inside a multi-byte encoding of a character ! #9101

Closed
guy038 opened this issue Nov 4, 2020 · 4 comments

Comments

@guy038
Copy link

guy038 commented Nov 4, 2020

Remark : This issue has been first noticed by Alan Kilborn, revisited by Peter Jones and discussed in that topic :

https://community.notepad-plus-plus.org/post/59397

Description of the Issue

When using the Search > Go to... feature, with the Offset option ticked, the different offsets corresponding to each byte of a multi-bytes encoding, after the first one, should be inaccessible !

Steps to Reproduce the Issue

  • Open a new tab in N++

  • if, necessary, use the Encoding > Convert to UTF-8 to get an empty UTF-8 encoded file

  • Just type in the text A👨Z on the first line

Note that, as the emoji MAN 👨 is the Unicode character of code-point U+1F468, we can describe this line, in an UTF-8 encoded file, as :

Characters :   A         👨        Z
Bytes      :   41  F0  9F  91  A8  5A
Offset     :   0   1   2   3   4   5
  • If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0

  • If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1

  • If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 5

All these offsets are correct. But these values should be the only possible offsets to type in in the You want to go to zone !

Actual Behavior

Now, let's force a move to offset 3, exactly in the middle of the multi-bytes sequence of the emoji char ( byte 91 ) and then click on the Go button

  • Seemingly, the caret seems right before the Z letter. In fact :

    • If you hit the Backspace key, you get the text Ax91xA8Z, so the first two bytes of the encoding xF0x9F, before the offset, are deleted

    • If you hit the Delete key, you get the text AxF0x9FxA8Z, so the next x91 byte, after the offset, is deleted

  • In addition, as you can see, the action of the two keys Backspace and Delete are not symmetrical as the former deletes two bytes ( the beginning of the multi-bytes sequence ) whereas the latter just deletes one byte ( x91 )

Expected Behavior

The offsets values, relative to the individual bytes of a multi-bytes sequence, after the 1 byte, in a Unicode encoded file, should not be allowed ! For instance, in the example above, the allowed values should be, exclusively, 0, 1 and 5

Then, the Backspace and Delete would just act on one character, only, as expected !

Best Regards,

guy038

@sasumner
Copy link
Contributor

sasumner commented Nov 4, 2020

@guy038

All these offsets are correct

But...are they??
As a user, I would expect:

  • If you move the caret right before the A char, the Search > Go to... feature says you're at offset 0

  • If you move the caret right before the 👨 char, the Search > Go to... feature says you're at offset 1

  • If you move the caret right before the Z char, the Search > Go to... feature says you're at offset 2 (DIFFERENT)

Or maybe I would expect "position" as the label (not "offset"), and I might then expect 1, 2 and 3, respectively.

As a user, that doesn't know or doesn't care about UTF-8 encoding details, I might only care about characters in my document, not the bytes that encode them.

I filed a related issue recently, see #9095

BUT of course, your core issue, that you shouldn't be allowed to jump right into the middle of a multi-byte encoded UTF-8 character, is absolutely valid.

@Yaron10
Copy link

Yaron10 commented Nov 9, 2020

I've just fully read this thread. Very interesting.

Continued from #9125 (comment).

In developing a solution for multibyte UTF-8 characters (for 9101), it made sense that if a user tries to offset into the middle of a multibyte character, the caret would be placed after that multibyte character.
The same solution works for this issue, but it places the caret after the LF when offsetting into the middle of a CRLF pair. And that means the caret will be placed at the start of the next line. I hope that isn't a critical problem for you, but I think I will move ahead with that solution.

And here:

Or maybe I would expect "position" as the label (not "offset"), and I might then expect 1, 2 and 3, respectively.

If "Pos" and "Offset" are implemented as character-based, the question whether to place the caret before or after can apparently be irrelevant in the "Go to" dialog: accept only numbers in the 1-3 range.
It wouldn't work with "-p" from the command-line.

I'm not sure why it would make more sense to place the caret after: it's in the middle and you can move either way.
So why not implement before as in manually placing the caret between CR and LF?

@sasumner
Copy link
Contributor

sasumner commented Nov 9, 2020

I think it is worth closing the deficiency of this issue immediately, because data corruption can occur if one jumps into the middle of a multibyte character and then inserts/deletes.

The "should everything be character(position)-based or should everything be byte(offset)-based" is a larger issue, and can wait (and be debated perhaps under 9095 as time goes on).
In the purest sense of how things are currently, probably a offset that is specified that is inside a character maybe should result in an error popup, but with the just-mentioned debate still open I don't want to go too far with it, just close the data-corruption possibility.

I think the before-versus-after question could go either way, but it is slightly easier to code in an "after" sense.

@Yaron10
Copy link

Yaron10 commented Nov 9, 2020

As you understand.
Thank you for working on it. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants