Improve CJK tokenizer support #2801

@srfrog

Description

The current CJK tokenizer in v1.0.10 is the one included in Bleve. It has limited support and can yield extra tokens that aren't needed. We need to use a package/library specifically designed for CJK support that can handle these languages better.

For example, the term "first name", or "名字", is tokenized as "名", "字". But in this form, "名" is "name" and "字" is "word", so we have lost "first" as a token. As a result, a fulltext/term lookup for "字" won't return the expected results. The expected token is "名字".
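The behavior described above can be reproduced with a minimal sketch. This is not Bleve's or Dgraph's actual tokenizer code; `splitRunes` and `contains` are hypothetical helpers that assume simple per-character splitting, only to show why the compound term is lost from the index.

```go
package main

import "fmt"

// splitRunes is a hypothetical helper that emits one token per character,
// mimicking the per-character splitting described in this issue.
// It is NOT the Bleve implementation.
func splitRunes(s string) []string {
	tokens := make([]string, 0, len(s))
	for _, r := range s {
		tokens = append(tokens, string(r))
	}
	return tokens
}

// contains reports whether term is one of the indexed tokens.
func contains(tokens []string, term string) bool {
	for _, t := range tokens {
		if t == term {
			return true
		}
	}
	return false
}

func main() {
	tokens := splitRunes("名字")
	fmt.Println(tokens)                 // [名 字]
	fmt.Println(contains(tokens, "名字")) // false: the compound term is not indexed
	fmt.Println(contains(tokens, "字"))  // true: the single character matches on its own
}
```

With this kind of splitting, "名字" never exists as a token, while "字" matches any value that merely contains that character. A dictionary-aware or segmentation-based tokenizer would be expected to keep "名字" as a single token.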

Some CJK packages considered are:

Refers #1421
