Word tokenizer works only for 7-bit ASCII #29

@plasmagunman

The word tokenizer for WordDiff and WordWithSpaceDiff uses \b in its regular expression. In JavaScript, \b treats only [a-zA-Z0-9_] as word characters, so tokenization breaks on any text beyond 7-bit ASCII.

For example, the German phrase "wir üben" splits into:

'wir üben'.split(/\b/);
-> ["wir", " ü", "ben"]

Replacing the tokenizer with value.split(/(\s+)/) is sufficient for my use case, but my text contains no newlines, so I think this needs further testing.
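To make the proposed replacement concrete, here is a minimal sketch of two possible tokenizers. The function names are hypothetical and are not part of the library. The second variant is an assumption on my side: it relies on Unicode property escapes (\p{...} with the u flag), which need an ES2018-capable engine, and it is only one possible way to split punctuation off words.

```javascript
// Hypothetical helper names for illustration; not the library's API.

// Whitespace-only tokenizer, as proposed above. The capture group keeps
// each whitespace run (including newlines, since \s matches them) in the
// result, so joining the tokens reproduces the original string.
function tokenizeOnWhitespace(value) {
  return value.split(/(\s+)/);
}

// Unicode-aware alternative (assumes ES2018+ support for \p{...}).
// Tokens are runs of letters/marks/digits/underscore, runs of whitespace,
// or runs of anything else (punctuation, symbols).
function tokenizeUnicodeWords(value) {
  const pattern = /[\p{L}\p{M}\p{N}_]+|\s+|[^\p{L}\p{M}\p{N}_\s]+/gu;
  return value.match(pattern) || [];
}

console.log(tokenizeOnWhitespace('wir üben'));
// -> ["wir", " ", "üben"]
console.log(tokenizeUnicodeWords('wir üben, jetzt'));
// -> ["wir", " ", "üben", ",", " ", "jetzt"]
```

The whitespace-only version leaves punctuation attached to the neighbouring word ("üben," stays one token), which may or may not be what a word diff wants. The second version separates it, at the cost of requiring a newer JavaScript engine.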

Further reading:
http://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters/10590620#10590620
