Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store token position + fix tokenizer #2

Closed
wants to merge 11 commits into from

Conversation

milahu
Copy link

@milahu milahu commented May 6, 2021

this is useful for a "live diff" editor, where i must reset the cursor position

WIP demo: live html diff editor with htmldiff.js

edit: this use case is deprecated, since every diff-algo will produce "false diffs"
and the only way to get a "real live diff" this is to track the input and selectionchange events

WIP demo 2: live html diff editor with inputevent (also on github)

other small edits ...

6ea5b33 word, should be tokenized to |word|,| not to |word,|

3fefdab     "word should be tokenized to |    |"|word| not to |    |"word|
|word|deletion|"| -> |word|"| should produce a delete, no replace

a77f632 wörd should be tokenized to |wörd| not to |wö|rd|
for (const char of str) will loop unicode chars, const char = str[i] will loop bytes (src)
xregexp is a fast unicode matcher (src)

@milahu milahu changed the title store token position store token position + fix tokenizer May 6, 2021
@jleider
Copy link
Member

jleider commented Dec 22, 2021

Hi @milahu, apologies for the delayed response. We are taking a look at your PR today. Do you think you could add some tests specifically for the tokenizer changes you made?

Copy link
Collaborator

@SuttonKyle SuttonKyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, for the most part. Changes are not particularly relevant to our use cases -- @jleider do you think it's worth including this added functionality in the main repo or should this PR really be another fork?

let currentAtomicTag = '';
const words = [];

for (const char of html) {
const unicodeChars = Array.from(html);
for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
unicodeChars.forEach((char, charIdx) => {

}
currentWord = char;
currentWord = '';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing char here?

for (const char of html) {
const unicodeChars = Array.from(html);
for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
const char = unicodeChars[charIdx] as string;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const char = unicodeChars[charIdx] as string;

@@ -277,7 +312,7 @@ export function htmlToTokens(html: string): Token[] {
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to account for the style_tag case as well now.

@milahu
Copy link
Author

milahu commented Dec 22, 2021

sorry, i have to abandon this patch

notes:

  • for (var i = 0; ...) is fastest loop in all environments, forEach can have function call overhead
  • today i would try to avoid converting string to array

i cannot reproduce why i added Array.from(html) to loop unicode chars

for (var c of '\u00e4\u00f6\u00fc') console.log(c)
for (var c of Array.from('\u00e4\u00f6\u00fc')) console.log(c)
// nodejs:
/*
ä
ö
ü
*/

more micro optimization: maybe the string parsing could be optimized by looping bytes (Uint8Array typed array), using a jump table to handle values <= 127, and treating all values >= 128 as word chars (all unicode bytes have value >= 128)

@jleider
Copy link
Member

jleider commented Jan 19, 2022

Closing this in favor of #5 which incorporates some of these changes.

@jleider jleider closed this Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants