Closed
Description
the word tokenizer for WordDiff and WordWithSpaceDiff uses \b in its regular expression. \b considers only [a-zA-Z0-9_] to be word characters, so it fails on anything beyond 7-bit ASCII.
for example, the German phrase "wir üben" splits to:
'wir üben'.split(/\b/);
-> ["wir", " ü", "ben"]
replacing the tokenizer with value.split(/(\s+)/) is sufficient in my use case, but i don't have newlines in my text, so some further testing is needed, i think.
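a minimal sketch of the whitespace-based workaround, for comparison with the \b split. tokenizeByWhitespace is a hypothetical helper name, not a function in the library, and the filter step is an assumption added to drop the empty strings that split produces when the input starts or ends with whitespace.

```javascript
// Hypothetical helper (not part of the library): split on runs of whitespace.
// The capture group keeps each whitespace run as its own token.
function tokenizeByWhitespace(value) {
  return value.split(/(\s+)/).filter(function (token) {
    // split yields '' at the edges when the input starts or ends with whitespace
    return token !== '';
  });
}

// \b sees "ü" as a non-word character and breaks inside the word
console.log('wir üben'.split(/\b/));              // [ 'wir', ' ü', 'ben' ]

// whitespace splitting keeps the word intact
console.log(tokenizeByWhitespace('wir üben'));    // [ 'wir', ' ', 'üben' ]

// \s also matches newlines, so a line break becomes a whitespace token
console.log(tokenizeByWhitespace('wir\nüben'));   // [ 'wir', '\n', 'üben' ]

// trade-off: punctuation stays attached to the neighbouring word
console.log(tokenizeByWhitespace('wir üben, oft')); // [ 'wir', ' ', 'üben,', ' ', 'oft' ]
```

joining the tokens with an empty string gives back the original input, so no text is lost. the last case shows the cost of this approach: "üben" and "üben," count as different tokens, which may or may not be acceptable for a word diff.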
further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620