Jieba integration #931

Open
oabu opened this issue Nov 8, 2022 · 15 comments

Comments

@oabu

oabu commented Nov 8, 2022

ICU is not a good choice for Chinese text. In addition, a customizable dictionary is very important for Chinese word segmentation, because vocabulary differs completely from one industry to another.
Take jieba as an example: it has a mode called search mode, which is designed specifically for full-text retrieval.

To show this, I made an example; please take a look and you will see the difference.
http://lx.host.dabai.com/
The FULL result is the correct one.

Take "清华大学" (Tsinghua University) as an example: few people search for "清华大学", and most use "清华" as the keyword, so we need both "清华大学" and "清华" as tokens. @sanikolaev @dzcpy
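
To make the point above concrete, here is a minimal Python sketch of what a search-oriented mode adds on top of a plain segmentation. It is a toy under stated assumptions, not jieba's actual code: the dictionary `TOY_DICT` and the helpers `segment` and `search_tokens` are hypothetical names made up for illustration, and forward maximum matching stands in for a real segmenter.

```python
# Toy dictionary (assumption); a real segmenter ships a large frequency dictionary.
TOY_DICT = {"清华", "大学", "清华大学", "我", "在", "读书"}


def segment(text, dictionary=TOY_DICT):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def search_tokens(text, dictionary=TOY_DICT):
    """Emit shorter dictionary words found inside each long token, then the token itself."""
    result = []
    for token in segment(text, dictionary):
        for n in (2, 3):
            if len(token) > n:
                for start in range(len(token) - n + 1):
                    gram = token[start:start + n]
                    if gram in dictionary:
                        result.append(gram)
        result.append(token)
    return result


print(segment("我在清华大学读书"))        # ['我', '在', '清华大学', '读书']
print(search_tokens("我在清华大学读书"))  # ['我', '在', '清华', '大学', '清华大学', '读书']
```

With only the first output indexed, a query for "清华" finds nothing; with the second, both "清华" and "清华大学" match the document.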

@sanikolaev
Collaborator

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

sanikolaev added the "waiting" label on Nov 15, 2022
@malacca

malacca commented Nov 28, 2022

@sanikolaev
Yes, I hope Manticore Search can integrate with jieba. Since it does not support Chinese word segmentation, I have chosen Meilisearch for now.

If you decide to integrate with jieba, here is a nice discussion to refer to

sanikolaev removed the "waiting" label on Nov 29, 2022
@sanikolaev
Collaborator

@fxtxkktv expressed interest in adding Jieba support to Manticore in #1137.

@axhiao

axhiao commented May 24, 2023

There is another repo related to Chinese word segmentation, and it is written in C++.

https://github.com/fastcws/fastcws

@sanikolaev
Collaborator

there is another repo that is related to Chinese word segmentation. And it was written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

@fxtxkktv

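There is another repository related to Chinese word segmentation. It is written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

[jieba] Custom Chinese word segmentation is useful.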

@jacentsao

@sanikolaev Hi, is there any plan to use jieba for Chinese text segmentation? The most popular Chinese text segmentation library is https://github.com/fxsjy/jieba and its C++ version is https://github.com/yanyiwu/cppjieba.

@sanikolaev
Collaborator

This issue won't make it to the upcoming release. Hopefully we'll address this issue in the next release, i.e. in a few months.

@JonGates

I think jieba is currently the best open-source Chinese word segmenter. It supports both Simplified and Traditional Chinese segmentation, as well as custom dictionaries.

jieba supports three segmentation modes: precise mode, full mode, and search engine mode. It is very well suited to full-text search; what I use with Elasticsearch is also jieba. @sanikolaev
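
As a rough illustration of why a custom dictionary and a full mode matter, here is a small Python sketch. It is a toy under stated assumptions, not jieba's implementation: `precise` and `full` are hypothetical helpers, forward maximum matching stands in for precise mode, and "边缘计算" (edge computing) plays the role of an industry term that a user adds to the dictionary.

```python
def precise(text, dictionary):
    """Forward maximum matching, a simple stand-in for a precise mode."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Unknown single characters are emitted as-is.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def full(text, dictionary):
    """Every dictionary word occurring in the text, ordered by start position."""
    max_len = max(len(word) for word in dictionary)
    found = []
    for start in range(len(text)):
        for size in range(1, min(max_len, len(text) - start) + 1):
            candidate = text[start:start + size]
            if candidate in dictionary:
                found.append(candidate)
    return found


base_dict = {"边缘", "计算", "云计算"}
custom_dict = base_dict | {"边缘计算"}  # hypothetical user-supplied industry term

text = "边缘计算"
print(precise(text, base_dict))    # ['边缘', '计算']
print(precise(text, custom_dict))  # ['边缘计算']
print(full(text, custom_dict))     # ['边缘', '边缘计算', '计算']
```

Without the custom entry the term is split into two generic words; with it, the term survives as one token, and the full-style output keeps the shorter words searchable as well.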

@oabu
Author

oabu commented Aug 16, 2023

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

https://github.com/fxsjy/jieba
https://github.com/yanyiwu/cppjieba

@jaric

jaric commented Jan 10, 2024

Hi @sanikolaev,

Do you have a plan or timeline for the full integration of Jieba?

Thanks.

@sanikolaev
Collaborator

Hi @jaric

Unfortunately, it's not in our near-term plans, but we are still interested in it. Ideally, we'd like someone to make a pull request or sponsor the development :)

@thegenius

This is very important for Chinese developers when choosing Manticore.
For now, small companies may choose PostgreSQL, and big companies stick with Elasticsearch.
I think Meilisearch and Manticore will be the next stars.
Many friends of mine at startups recommend Meilisearch for its ease of use and Chinese support.
I personally prefer Manticore for being SQL-first, but I am disappointed by the absence of Jieba support.
This is not so hard, but absolutely important!

sanikolaev changed the title from "ICU is not a good choice for chinese" to "Jieba integration" on Feb 2, 2024
@sanikolaev
Collaborator

@thegenius Thanks for the comment. I've added this task to the roadmap: https://roadmap.manticoresearch.com/

@xzxiaoshan

jieba is very important for Chinese; I hope it becomes available soon.

10 participants