Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode/emoji offset on remove and move ops #2680

Closed
wants to merge 20 commits into from

Conversation

adjourn
Copy link
Contributor

@adjourn adjourn commented Apr 7, 2019

TL;DR

Side-by-side comparison with default: https://jsfiddle.net/swjq5h8k/
All different cases that were tested: https://jsfiddle.net/gxnhcbo6/

Is this adding or improving a feature or fixing a bug?

Both.

What's the new behavior?

Plain text emojis are broken because:

  • it's possible to move selection "into" the emoji (between surrogate pairs) and break it into � (broken) characters if new characters are inserted or Backspace/Delete is pressed while "in" emoji
  • have to press arrow keys multiple times to move selection over the emoji "for real" (it can range from 2 to >10 depending on emoji) without any visual feedback (caret is not accurate, terrible UX)
  • have to press Backspace and Delete multiple times when emoji is other than simple surrogate pair

I think this is something that Slate should be able to handle.

If you currently want to fix emojis, you'd have to cover all cases emojis could be inserted and convert them into inlines. I think that it makes our editors overly complex and decreases performance.

Proposed function calculates the offset required to move selection over whole (emoji) sequence. It covers arrow keys, Delete and Backspace, other cases like trying to put caret in the middle of sequence where it shouldn't be, is already handled by browser.

Compare with current Slate: https://jsfiddle.net/swjq5h8k/ (PR implemented as plugin).

Performance

It's quite performant, Im seeing max 0.5ms with cold compiler (usually ~0.2ms) and <0.05ms otherwise. Even 6x CPU slowdown is less than 1ms with cold compiler and my CPU is at least 5 years old.

Handled cases

Unicode rabbit hole is deep, there are zero width joiners, variation selectors, surrogates, etc I didn't even know existed few days ago. You can see all my test cases here: https://jsfiddle.net/gxnhcbo6/.

Graceful fallback

If you manage to break it, please let me know and you'll get a virtual cookie. It should never throw an error, edge cases that might slip through, have the same experience as with all emojis at current Slate state.

How does this change work?

It aims to mimic browser behavior as much as possible without dictionaries or 10kB regex.

I changed the way getCharOffset works which is used by Delete and Backspace already and made forward and backward events (arrow keys) in slate-react use it aswell.

getCharOffset always receives a string from (about to be) selection until the end of block. while loop iterates over string's code points until it has determined that current code point doesn't form a sequence with previous (and next in some cases). E.g complex emojis take 3-7 iterations, single characters 1.

Each iteration generally checks the following things for current code point:

Order matters, if any IF is true, we'll either break the loop if it determines that previous and current code points don't form a sequence or continue to next code point after we've updated offset and other meta.

  • IF surrogate
  • IF ZWJ (zero width joiner)
  • IF variation selector
  • BREAK (most "regular" characters fall under here, 1 iteration)

Few examples (numbers are .length):

😀😀 [emoji 2][emoji 2]
Selection should move over whole emoji (not end up in the middle of surrogate pair) and be able to move between the two, offset should be 2.

😀‍😀 [emoji 2][zero width joiner 1][emoji 2]
ZWJ (zero width joiner) combines the two, selection should now move over whole thing, offset should be 5. You should also notice that you can't copy these emojis one-by-one anymore, it's a single unit.

👮🏻 [emoji 2][modifier 2]
Modifier "fuses" with previous unicode/surrogate pair, offset should be 4. The trick here is to know if it's modifier because it has same characteristics as any other surrogate pair and we've established in previous cases that selection should be able to move between the two without ZWJ. This is exception.

These even aren't all the cases and you can already see that rules contradict and complexity grows. It gets a lot worse because it has to work in backward direction too, in which case many forward rules would fail.

Have you checked that...?

  • The new code matches the existing patterns and styles.
  • The tests pass with yarn test.
  • The linter passes with yarn lint. (Fix errors with yarn prettier.)
  • The relevant examples still work. (Run examples with yarn watch.)

Does this fix any issues or need any specific reviewers?

Fixes #2635, fixes #2632

@SmilinBrian
Copy link
Contributor

Looks good and seems very helpful. Side-by-side comparison is compelling.

Is there any chance similar work needs to be done to support kanji or other multi with modifiers? Sorry, no example at hand, but maybe someone more familiar with asian or arabic languages knows of a case.

@adjourn
Copy link
Contributor Author

adjourn commented Apr 11, 2019

@SmilinBrian I have exactly zero experience with these languages. I'm open to feedback and situations where it might fail, in order to cover all cases. That's why it's labeled WIP.

I've explored alternatives because lets be honest, it's a brute force solution, but none worked out.

@ianstormtaylor What's your take on this? I see no other realistic alternatives right now.

I did experiment with letting browser handle it (only prevent default if browser doesn't know what to do, e.g void related) and syncing Slate value. After all, browsers do the right thing with unicodes.
Selection should be easy but Delete and Backspace require DOM diffing and rethinking core concepts which is not desirable, especially when Slate is already in the middle of big change.

@adjourn adjourn changed the title [RFC][WIP] Unicode/emoji offset on remove and move ops [RFC] Unicode/emoji offset on remove and move ops Apr 18, 2019
Copy link
Owner

@ianstormtaylor ianstormtaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for researching this @adjourn and taking the time to write this up, it looks great.

Just one comment in there. I think we can change the default to be more useful, and avoid having to inline that logic in more places.

packages/slate-react/src/plugins/after.js Outdated Show resolved Hide resolved
@rikkit
Copy link

rikkit commented Apr 25, 2019

(This more of an observation than a opinion, great work on the PR @adjourn !)

ZWJ (zero width joiner) combines the two, selection should now move over whole thing, offset should be 5. You should also notice that you can't copy these emojis one-by-one anymore, it's a single unit.

Should ZWJ glue together all emojis, or just those that are presented together? Only pointing this out cos you noted

It aims to mimic browser behavior without dictionaries or 10kB regex.

Firefox and Edge (pre Chromium version) both select a single emoji

image
image

Whereas Chrome selects both (as per the logic in the PR)
image

You could argue that either implementation is correct, I have absolutely no idea how Firefox and Edge have done this though. Like you say it'd be ideal to just let the browser handle it but there's the issues with delete and backspace you mentioned

* first stab at removing leaves with tests passing

* fixes

* add iterables to the element interface

* use iterables in more places

* update examples to use iterables

* update naming

* fix tests

* convert more key-based logic to paths

* add range support to iterables

* refactor many methods to use iterables, deprecate cruft

* clean up existing iterables

* more cleanup

* more cleaning

* fix word count example

* work

* split decoration and annotations

* update examples for `renderNode` useage

* deprecate old DOM-based helpers, update examples

* make formats first class, refactor leaf rendering

* fix examples, fix isAtomic checking

* deprecate leaf model

* convert Text and Leaf to functional components

* fix lint and tests
 - slate-base64-serializer@0.2.103
 - slate-html-serializer@0.8.2
 - slate-hyperscript@0.13.0
 - slate-plain-serializer@0.7.2
 - slate-prop-types@0.5.33
 - slate-react-placeholder@0.2.0
 - slate-react@0.22.0
 - slate@0.47.0
 - slate-base64-serializer@0.2.104
 - slate-html-serializer@0.8.3
 - slate-hyperscript@0.13.1
 - slate-plain-serializer@0.7.3
 - slate-prop-types@0.5.34
 - slate-react-placeholder@0.2.1
 - slate-react@0.22.1
 - slate@0.47.1
@danieltorrer
Copy link

This feature is really helpful! Just as a heads up: I used your jsfiddle code as plugin and noticed that with this plugin activated you can no longer select text with shift + arrow key or cmd + shift + arrow key.

@adjourn
Copy link
Contributor Author

adjourn commented May 9, 2019

@danieltorrer Whoa, was worried for a second! But fortunately this is due to the fact that shift logic is not handled in plugin onKeyDown handler, everything works as it should when it's part of Slate.

@adjourn
Copy link
Contributor Author

adjourn commented May 9, 2019

@rikkit Thank you for pointing that out.

I think that current implementation is the way to go because:

  • conceptually ZWJ is meant to connect two characters
  • it would add so many edge cases
  • nobody (don't quote me on that) adds ZWJ-s manually
  • so many users use Chromium based browsers

But if someone has objections, I'm all ears.

@adjourn adjourn changed the title [RFC] Unicode/emoji offset on remove and move ops Unicode/emoji offset on remove and move ops May 9, 2019
@adjourn
Copy link
Contributor Author

adjourn commented May 9, 2019

That was my worst conflict resolution to date, glad it's over.

@ianstormtaylor It's ready for another review.

Notable changes (see clean view of all my commits):

  • moved offset logic from slate-react plugin into move* commands (moveForward/moveBackward)
  • changed the concept/move* argument from length to character which means that e.g if there's 3 emojis each with length 5, moveForward(2) doesn't move selection in the middle of first one, it moves selection over 2 emojis, i.e "characters".

See it in action here (read the paragraph starting with PS).

@isubasti
Copy link
Contributor

@adjourn just found this package that is actually used by eslint-plugin-jsx-a11y as its dependency for parsing emoji: https://www.npmjs.com/package/emoji-regex, maybe slate can rely on that package instead?

@slotterbackW
Copy link

slotterbackW commented Jul 22, 2019

Is there an update on the progress of this PR? Will it be merged soon?

@ianstormtaylor
Copy link
Owner

Hey @adjourn I'm super interested in this. For some reason the diff shows a bunch of changed stuff that is from earlier commits to master. Is there any way to fix that? It's hard to see what exactly his changing here. Thanks!

@slotterbackW
Copy link

@ianstormtaylor from adjourn's comment above here's an easy to read view of commits:

Notable changes (see clean view of all my commits):

Excited to see development on this! Thanks both of you!

@kwokhuen
Copy link

@ianstormtaylor I know how to fix the commits. I also found some bugs while deleting and moving forward and backward. I'm wondering if I could continue to work on this PR. If yes, how would I go about it? Should I just checkout @adjourn branch, do my thing and force push?

@davidhoeller davidhoeller mentioned this pull request Sep 30, 2019
4 tasks
@ianstormtaylor ianstormtaylor mentioned this pull request Nov 14, 2019
@ianstormtaylor
Copy link
Owner

Fixed by #3093.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
10 participants