store token position + fix tokenizer #2
Conversation
Force-pushed from d9abe61 to a77f632
Hi @milahu, apologies for the delayed response. We are taking a look at your PR today. Do you think you could add some tests specifically for the tokenizer changes you made?
Makes sense, for the most part. Changes are not particularly relevant to our use cases -- @jleider do you think it's worth including this added functionality in the main repo or should this PR really be another fork?
```diff
 let currentAtomicTag = '';
 const words = [];

-for (const char of html) {
+const unicodeChars = Array.from(html);
+for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
```
Suggested change:

```diff
-for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
+unicodeChars.forEach((char, charIdx) => {
```
```diff
 }
-currentWord = char;
+currentWord = '';
```
why are we removing `char` here?
```diff
-for (const char of html) {
+const unicodeChars = Array.from(html);
+for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
+  const char = unicodeChars[charIdx] as string;
```
```diff
@@ -277,7 +312,7 @@ export function htmlToTokens(html: string): Token[] {
 }
```
You'll have to account for the `style_tag` case as well now.
sorry, i have to abandon this patch

notes:

i cannot reproduce why i added `Array.from` -- both

```js
for (var c of '\u00e4\u00f6\u00fc') console.log(c)
for (var c of Array.from('\u00e4\u00f6\u00fc')) console.log(c)
```

print the same in nodejs:

```
ä
ö
ü
```

more micro optimization: maybe the string parsing could be optimized by looping bytes (a Uint8Array typed array), using a jump table to handle values <= 127, and treating all values >= 128 as word chars (all bytes of multi-byte unicode sequences have values >= 128)
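for illustration, the byte-level idea could be sketched like this (a hypothetical sketch, not code from this PR; `tokenizeBytes` and the token classes are invented names):

```js
// sketch: encode to UTF-8 bytes, classify ASCII bytes via a 128-entry
// jump table, and treat every byte >= 128 as a word char, since all
// bytes of a multi-byte UTF-8 sequence have their high bit set
const WORD = 0, SPACE = 1, OTHER = 2;

// build the jump table for the 7-bit ASCII range
const table = new Uint8Array(128).fill(OTHER);
for (let b = 0x30; b <= 0x39; b++) table[b] = WORD; // 0-9
for (let b = 0x41; b <= 0x5a; b++) table[b] = WORD; // A-Z
for (let b = 0x61; b <= 0x7a; b++) table[b] = WORD; // a-z
table[0x5f] = WORD;                                 // _
for (const b of [0x20, 0x09, 0x0a, 0x0d]) table[b] = SPACE;

function classify(byte) {
  // any byte with the high bit set belongs to a multi-byte
  // UTF-8 sequence, so it can only be part of a word
  return byte >= 128 ? WORD : table[byte];
}

function tokenizeBytes(html) {
  const bytes = new TextEncoder().encode(html);
  const decoder = new TextDecoder();
  const words = [];
  let start = 0;
  for (let i = 1; i <= bytes.length; i++) {
    // flush when the byte class changes, or at end of input;
    // OTHER bytes are always emitted as single-byte tokens
    if (i === bytes.length ||
        classify(bytes[i]) !== classify(bytes[i - 1]) ||
        classify(bytes[i - 1]) === OTHER) {
      words.push(decoder.decode(bytes.subarray(start, i)));
      start = i;
    }
  }
  return words;
}
```

a real jump table would dispatch to per-class handler functions; the shared `classify` helper here just keeps the sketch short.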
Closing this in favor of #5 which incorporates some of these changes.
this is useful for a "live diff" editor, where i must reset the cursor position

WIP demo: live html diff editor with htmldiff.js

edit: this use case is deprecated, since every diff-algo will produce "false diffs", and the only way to get a "real live diff" is to track the `input` and `selectionchange` events

WIP demo 2: live html diff editor with inputevent (also on github)
other small edits ...
6ea5b33

`word,` should be tokenized to `|word|,|` not to `|word,|`

3fefdab

`"word` should be tokenized to `| |"|word|` not to `| |"word|`

`|word|deletion|"|` -> `|word|"|` should produce a delete, not a replace

a77f632

`wörd` should be tokenized to `|wörd|` not to `|wö|rd|`
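for illustration, all three tokenization expectations above can be met with a unicode-aware regex (a minimal sketch, not the tokenizer in this repo; `tokenize` is an invented name):

```js
// runs of unicode letters/digits are words, runs of whitespace stay
// together, and every other character (",", '"', ...) becomes its own
// single-char token
function tokenize(text) {
  return text.match(/[\p{L}\p{N}]+|\s+|./gu) || [];
}

// tokenize('word,')  -> ['word', ',']
// tokenize(' "word') -> [' ', '"', 'word']
// tokenize('wörd')   -> ['wörd']
```

the `u` flag is required for `\p{L}` property escapes (supported in node >= 10).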
`for (const char of str)` will loop unicode chars; `const char = str[i]` will loop UTF-16 code units, not bytes (src)

xregexp is a fast unicode matcher (src)
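note: for `wörd` both loops agree, since `ö` is a single UTF-16 code unit; the difference only shows with astral code points like emoji, e.g.:

```js
const s = 'a\u{1F600}b'; // 'a😀b': the emoji is one code point, two UTF-16 code units

// for...of (and Array.from / spread) walk full code points
const byCodePoint = [...s]; // ['a', '😀', 'b']

// indexing with s[i] walks UTF-16 code units, splitting the
// emoji into two unpaired surrogate halves
const byCodeUnit = [];
for (let i = 0; i < s.length; i++) byCodeUnit.push(s[i]);

console.log(byCodePoint.length); // 3
console.log(byCodeUnit.length);  // 4
```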