Closed
Labels
kind/enhancement: Something could be better.
Description
The current CJK tokenizer in v1.0.10 is the one included in Bleve. It has limited support and can yield extra tokens that aren't needed. We need to use a package/library designed specifically for CJK support that can handle these languages better.
For example, the term "first name" or "名字" is tokenized as "名", "字". But in this form, "名" is "name" and "字" is "word", so we have lost "first name" as a single token. As a result, a fulltext/term lookup for "字" won't return the expected results. The expected token is "名字".
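To make the difference concrete, here is a minimal Go sketch that contrasts a per-character split with a dictionary-based split. It is an illustration only: `unigramTokens` and `dictTokens` are hypothetical helpers, the one-entry dictionary is made up, and the forward maximum matching shown here is not the algorithm used by Bleve, gojieba, or sego.

```go
package main

import "fmt"

// unigramTokens emits one token per rune, which reproduces the
// per-character split described in this issue ("名字" -> "名", "字").
func unigramTokens(text string) []string {
	tokens := []string{}
	for _, r := range text {
		tokens = append(tokens, string(r))
	}
	return tokens
}

// dictTokens is a toy forward-maximum-matching segmenter (hypothetical,
// for illustration). At each position it takes the longest dictionary
// word of at most maxLen runes, falling back to a single rune.
func dictTokens(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	tokens := []string{}
	for i := 0; i < len(runes); {
		limit := maxLen
		if remaining := len(runes) - i; remaining < limit {
			limit = remaining
		}
		matched := 1 // fallback: a single rune
		for n := limit; n > 1; n-- {
			if dict[string(runes[i:i+n])] {
				matched = n
				break
			}
		}
		tokens = append(tokens, string(runes[i:i+matched]))
		i += matched
	}
	return tokens
}

func main() {
	// Assumed toy dictionary; a real segmenter ships a large word list.
	dict := map[string]bool{"名字": true}

	fmt.Println(unigramTokens("名字"))       // prints [名 字]
	fmt.Println(dictTokens("名字", dict, 4)) // prints [名字]
}
```

With the dictionary-based split, the index holds the single token "名字", so a term lookup for "名字" can match it directly, while a lookup for "字" alone no longer matches this value.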
Some CJK packages considered are:
- https://github.com/yanyiwu/gojieba (has Bleve support)
- https://github.com/huichen/sego
Refers to #1421