
Tokenizers do not handle accented characters correctly #10

@nikclayton

I'm using diffx to diff XML, and the tokenizers are splitting words when they contain accented characters.

This was reported in tuskyapp/Tusky#3314. In the screenshot attached there, notice how, in the last-but-one item in the list, the word "Käumen" has been split at the "ä".

@sl1txdvd, who reported it to us, has a fix for the regex in master...sl1txdvd:diffx:unicode.
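I haven't dug into the exact change on that branch, but the underlying problem is ASCII-only versus Unicode-aware word matching. A minimal standalone demonstration (plain java.util.regex, not diffx code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeDemo {
    public static void main(String[] args) {
        String text = "Käumen";

        // ASCII-only word matching: "ä" is not in [A-Za-z], so it acts
        // as a boundary and the word splits into "K" and "umen".
        Matcher ascii = Pattern.compile("[A-Za-z]+").matcher(text);
        while (ascii.find()) {
            System.out.println("ASCII-only token: " + ascii.group());
        }

        // Unicode-aware matching: \p{L} matches any Unicode letter,
        // so "Käumen" stays a single token.
        Matcher unicode = Pattern.compile("\\p{L}+").matcher(text);
        while (unicode.find()) {
            System.out.println("Unicode token: " + unicode.group());
        }
    }
}
```

An equivalent fix is to keep `\w+` but compile it with `Pattern.UNICODE_CHARACTER_CLASS`, which makes `\w` match Unicode word characters.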


Separately, I tried to fix this by writing a new tokenizer based on TokenizerBySpaceWord, but quickly discovered that:

  1. All the classes are final, so I couldn't simply extend an existing tokenizer; I would have to write a new one from scratch.
  2. There is no mechanism for plugging a custom tokenizer into the processing pipeline, and since those classes are also final, actually doing this would require forking the project.

Please consider making it easier to override this functionality. Thanks!
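To illustrate the kind of hook I mean, here is a rough sketch; every name in it is hypothetical, none of this is diffx's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension point: let callers supply their own
// word-splitting strategy instead of hard-coding it in final classes.
interface WordTokenizer {
    List<String> tokenize(String text);
}

// A caller-supplied implementation that treats any Unicode letter as
// part of a word, so "Käumen" stays a single token.
class UnicodeWordTokenizer implements WordTokenizer {
    // Runs of letters, runs of whitespace, or runs of anything else.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{L}+|\\s+|[^\\p{L}\\s]+");

    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Accepting something like this via the diff configuration, or simply un-finalizing the existing tokenizer classes, would let downstream projects handle cases like this without forking.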
