Jieba integration #931

oabu · 2022-11-08T02:45:38Z

ICU is not a good choice for Chinese text. In addition, a customizable dictionary is very important for Chinese word segmentation, because vocabulary differs completely from one industry to another.
Take jieba as an example: it has a mode called search mode, which is designed specifically for full-text retrieval.

To illustrate this, I made an example; please take a look and you will see the difference.
http://lx.host.dabai.com/
The FULL result is the correct one.

Take "清华大学" (Tsinghua University) as an example: few people search for the full "清华大学"; most use "清华" as the search keyword, so we need both "清华大学" and "清华" as tokens. @sanikolaev @dzcpy
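The requirement above can be sketched in a few lines of Python. This is only a toy illustration, not jieba's actual algorithm: it segments with forward maximum matching against a small dictionary and then also emits the shorter dictionary words found inside each long word, which is roughly the behaviour jieba exposes through `jieba.cut_for_search`. The names `precise_cut`, `search_cut`, and `DICTIONARY` are hypothetical and exist only for this example.

```python
def precise_cut(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def search_cut(text, dictionary, max_len=4):
    """Emit every precise token plus the shorter dictionary words inside it."""
    tokens = []
    for word in precise_cut(text, dictionary, max_len):
        for size in range(2, len(word)):
            for start in range(len(word) - size + 1):
                sub = word[start:start + size]
                if sub in dictionary:
                    tokens.append(sub)
        tokens.append(word)
    return tokens


# Hypothetical toy dictionary; a real one would hold many thousands of entries.
DICTIONARY = {"清华", "大学", "清华大学", "毕业"}

precise_tokens = precise_cut("我毕业于清华大学", DICTIONARY)
search_tokens = search_cut("我毕业于清华大学", DICTIONARY)
print(precise_tokens)  # ['我', '毕业', '于', '清华大学']
print(search_tokens)   # ['我', '毕业', '于', '清华', '大学', '清华大学']
```

With only the precise tokens in the index, a query for "清华" would not match this sentence; with the search-style tokens it would.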

sanikolaev · 2022-11-15T13:47:57Z

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

malacca · 2022-11-28T20:15:34Z

@sanikolaev
Yes, I hope Manticore Search can integrate with jieba. Because it does not support Chinese word segmentation, I have temporarily chosen Meilisearch.

If you decide to integrate with jieba, here is a nice discussion to refer to

sanikolaev · 2023-05-22T03:32:52Z

@fxtxkktv in #1137 expressed his interest in adding Jieba support into Manticore.

axhiao · 2023-05-24T13:26:02Z

There is another repo related to Chinese word segmentation, and it is written in C++.

https://github.com/fastcws/fastcws

sanikolaev · 2023-05-24T13:57:48Z

> There is another repo related to Chinese word segmentation, and it is written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

fxtxkktv · 2023-05-25T02:45:11Z

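> There is another repository related to Chinese word segmentation. It is written in C++.

> Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

Jieba: custom Chinese word segmentation is useful.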

jacentsao · 2023-07-18T07:28:07Z

@sanikolaev Hi, is there any plan to use jieba for Chinese text segmentation? The most popular Chinese text segmentation library is https://github.com/fxsjy/jieba and its C++ version is https://github.com/yanyiwu/cppjieba.

sanikolaev · 2023-07-18T08:11:28Z

This issue won't make it to the upcoming release. Hopefully we'll address this issue in the next release, i.e. in a few months.

JonGates · 2023-08-15T13:44:43Z

I think jieba is currently the best open-source Chinese word segmenter. It supports both Simplified and Traditional Chinese segmentation, as well as custom dictionaries.

jieba supports three segmentation modes: precise mode, full mode, and search engine mode. It is very well suited to full-text search; I also use jieba with Elasticsearch. @sanikolaev
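To make the full-mode idea and the effect of a custom dictionary concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how jieba is implemented: `full_cut`, `BASE_DICTIONARY`, and `CUSTOM_WORDS` are hypothetical names, and the scan simply lists every dictionary word found at every position. In jieba itself, full mode is `jieba.cut(text, cut_all=True)` and custom words can be added with `jieba.add_word` or `jieba.load_userdict`.

```python
def full_cut(text, dictionary, max_len=4):
    """Full-mode-like scan: list every dictionary word found at every position."""
    found = []
    for start in range(len(text)):
        for size in range(2, max_len + 1):
            piece = text[start:start + size]
            # Skip slices that ran off the end of the text.
            if len(piece) == size and piece in dictionary:
                found.append(piece)
    return found


# Hypothetical general-purpose dictionary that lacks the industry term "石墨烯" (graphene).
BASE_DICTIONARY = {"石墨", "电池"}
# Hypothetical custom dictionary supplied by a materials-industry user.
CUSTOM_WORDS = {"石墨烯"}

text = "石墨烯电池"
base_hits = full_cut(text, BASE_DICTIONARY)
custom_hits = full_cut(text, BASE_DICTIONARY | CUSTOM_WORDS)
print(base_hits)    # ['石墨', '电池']
print(custom_hits)  # ['石墨', '石墨烯', '电池']
```

Without the custom entry the domain term never becomes a token, so a search for "石墨烯" cannot match it as a word; this is why a per-industry dictionary matters.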

oabu · 2023-08-16T01:44:07Z

> @oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

https://github.com/fxsjy/jieba
https://github.com/yanyiwu/cppjieba

jaric · 2024-01-10T04:12:08Z

Hi @sanikolaev,

Do you have any plan or timeline regarding the full integration of Jieba?

Thanks.

sanikolaev · 2024-01-10T07:23:11Z

Hi @jaric

Unfortunately, it's not in our near-term plans, but we are still interested in it. Ideally, we'd like someone to make a pull request or sponsor the development :)

thegenius · 2024-02-01T23:15:11Z

This is very important for Chinese developers when choosing Manticore.
For now, small companies may choose PostgreSQL, and big companies stick with Elasticsearch.
I think Meilisearch and Manticore will be the next stars.
Many friends of mine at startups recommend Meilisearch for its ease of use and Chinese support.
I personally prefer Manticore for its SQL-first approach, but I am disappointed by the absence of Jieba support.
This is not so hard, but it is absolutely important!

sanikolaev · 2024-02-02T16:33:29Z

@thegenius thanks for the comment. I've added this task to the roadmap - https://roadmap.manticoresearch.com/

xzxiaoshan · 2024-03-28T04:19:28Z

Jieba is very important for Chinese. I hope it becomes available soon.

sanikolaev added the "waiting" label Nov 15, 2022

sanikolaev removed the "waiting" label Nov 29, 2022

sanikolaev mentioned this issue May 22, 2023

Hope Jieba support, ICU is wrong. #1137

Closed

sanikolaev changed the title ~~ICU is not a good choice for chinese~~ Jieba integration Feb 2, 2024
