store token position + fix tokenizer #2
Conversation
Force-pushed from d9abe61 to a77f632
Hi @milahu, apologies for the delayed response. We are taking a look at your PR today. Do you think you could add some tests specifically for the tokenizer changes you made?
Makes sense, for the most part. Changes are not particularly relevant to our use cases -- @jleider do you think it's worth including this added functionality in the main repo or should this PR really be another fork?
```diff
 let currentAtomicTag = '';
 const words = [];

-for (const char of html) {
+const unicodeChars = Array.from(html);
+for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
```
Suggested change:

```diff
-for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
+unicodeChars.forEach((char, charIdx) => {
```
```diff
 }
-currentWord = char;
+currentWord = '';
```
why are we removing `char` here?
```diff
-for (const char of html) {
+const unicodeChars = Array.from(html);
+for (let charIdx = 0; charIdx < unicodeChars.length; charIdx++) {
+  const char = unicodeChars[charIdx] as string;
```
```diff
@@ -277,7 +312,7 @@ export function htmlToTokens(html: string): Token[] {
 }
```
You'll have to account for the `style_tag` case as well now.
sorry, i have to abandon this patch

notes:

i cannot reproduce why i added `Array.from` -- both

```js
for (var c of '\u00e4\u00f6\u00fc') console.log(c)
for (var c of Array.from('\u00e4\u00f6\u00fc')) console.log(c)
```

print the same in nodejs:

```
ä
ö
ü
```

more micro optimization: maybe the string parsing could be optimized by looping bytes (a Uint8Array typed array), using a jump table to handle values <= 127, and treating all values >= 128 as word chars (all bytes of multi-byte unicode sequences have values >= 128)
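for illustration, the byte-level idea could be sketched like this (a hypothetical sketch, not code from this PR; `tokenizeBytes` and the token classes are invented names):

```js
// sketch: encode to UTF-8 bytes, classify ASCII bytes via a 128-entry
// jump table, and treat every byte >= 128 as a word char, since all
// bytes of a multi-byte UTF-8 sequence have their high bit set
const WORD = 0, SPACE = 1, OTHER = 2;

// build the jump table for the 7-bit ASCII range
const table = new Uint8Array(128).fill(OTHER);
for (let b = 0x30; b <= 0x39; b++) table[b] = WORD; // 0-9
for (let b = 0x41; b <= 0x5a; b++) table[b] = WORD; // A-Z
for (let b = 0x61; b <= 0x7a; b++) table[b] = WORD; // a-z
table[0x5f] = WORD;                                 // _
for (const b of [0x20, 0x09, 0x0a, 0x0d]) table[b] = SPACE;

function classify(byte) {
  // any byte with the high bit set belongs to a multi-byte
  // UTF-8 sequence, so it can only be part of a word
  return byte >= 128 ? WORD : table[byte];
}

function tokenizeBytes(html) {
  const bytes = new TextEncoder().encode(html);
  const decoder = new TextDecoder();
  const words = [];
  let start = 0;
  for (let i = 1; i <= bytes.length; i++) {
    // flush when the byte class changes, or at end of input;
    // OTHER bytes are always emitted as single-byte tokens
    if (i === bytes.length ||
        classify(bytes[i]) !== classify(bytes[i - 1]) ||
        classify(bytes[i - 1]) === OTHER) {
      words.push(decoder.decode(bytes.subarray(start, i)));
      start = i;
    }
  }
  return words;
}
```

a real jump table would dispatch to per-class handler functions; the shared `classify` helper here just keeps the sketch short.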
Closing this in favor of #5 which incorporates some of these changes.
this is useful for a "live diff" editor, where i must reset the cursor position

WIP demo: live html diff editor with htmldiff.js

edit: this use case is deprecated, since every diff-algo will produce "false diffs", and the only way to get a "real live diff" is to track the `input` and `selectionchange` events

WIP demo 2: live html diff editor with inputevent (also on github)
other small edits ...
6ea5b33

`word,` should be tokenized to `|word|,|` not to `|word,|`

3fefdab

`"word` should be tokenized to `| |"|word|` not to `| |"word|`

`|word|deletion|"|` -> `|word|"|` should produce a delete, not a replace

a77f632

`wörd` should be tokenized to `|wörd|` not to `|wö|rd|`
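for illustration, all three tokenization expectations above can be met with a unicode-aware regex (a minimal sketch, not the tokenizer in this repo; `tokenize` is an invented name):

```js
// runs of unicode letters/digits are words, runs of whitespace stay
// together, and every other character (",", '"', ...) becomes its own
// single-char token
function tokenize(text) {
  return text.match(/[\p{L}\p{N}]+|\s+|./gu) || [];
}

// tokenize('word,')  -> ['word', ',']
// tokenize(' "word') -> [' ', '"', 'word']
// tokenize('wörd')   -> ['wörd']
```

the `u` flag is required for `\p{L}` property escapes (supported in node >= 10).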
`for (const char of str)` will loop unicode chars; `const char = str[i]` will loop UTF-16 code units, not bytes (src)

xregexp is a fast unicode matcher (src)
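note: for `wörd` both loops agree, since `ö` is a single UTF-16 code unit; the difference only shows with astral code points like emoji, e.g.:

```js
const s = 'a\u{1F600}b'; // 'a😀b': the emoji is one code point, two UTF-16 code units

// for...of (and Array.from / spread) walk full code points
const byCodePoint = [...s]; // ['a', '😀', 'b']

// indexing with s[i] walks UTF-16 code units, splitting the
// emoji into two unpaired surrogate halves
const byCodeUnit = [];
for (let i = 0; i < s.length; i++) byCodeUnit.push(s[i]);

console.log(byCodePoint.length); // 3
console.log(byCodeUnit.length);  // 4
```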