Jieba integration #931
Comments
@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba?
@sanikolaev If you decide to integrate with jieba, here is a nice discussion to refer to.
There is another repo related to Chinese word segmentation, and it is written in C++.
Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?
【jieba】 Custom Chinese word segmentation is useful.
@sanikolaev Hi, is there any plan to use jieba for Chinese text segmentation? The most popular Chinese text segmentation library is https://github.com/fxsjy/jieba and its C++ version is https://github.com/yanyiwu/cppjieba.
This issue won't make it to the upcoming release. Hopefully we'll address this issue in the next release, i.e. in a few months.
I think jieba is currently the best open-source Chinese word segmenter. It supports both Simplified and Traditional Chinese, and it supports custom dictionaries. jieba offers three segmentation modes: precise mode, full mode, and search engine mode. It is very suitable for full-text search; jieba is also what I use in ES. @sanikolaev
Hi @sanikolaev, do you have any plan or timeline for the full integration of Jieba? Thanks.
Hi @jaric Unfortunately, it's not in our near-term plans, but we are still interested in it. Ideally, we'd like someone to make a pull request or sponsor the development :)
This is very important for Chinese developers when choosing Manticore.
@thegenius Thanks for the comment. I've added this task to the roadmap - https://roadmap.manticoresearch.com/
jieba is very important for Chinese; I hope it becomes available soon.
ICU is not a good choice in China. In addition, customizing the dictionary is very important for Chinese word segmentation, because word usage differs completely from one industry to another.
Take jieba as an example: it has a mode called search mode, which is designed specifically for full-text retrieval.
To show this, I made an example; please take a look and you will understand the difference.
http://lx.host.dabai.com/
The FULL result is the correct one.
Take "清华大学" as an example: few people will search for the full "清华大学", while most will search with "清华" as the keyword, so we need both "清华大学" and "清华" as tokens. @sanikolaev @dzcpy
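To make the "清华大学" point concrete, here is a minimal sketch in plain Python. It is a toy and not jieba's actual implementation: the dictionary `TOY_DICT` and the function names `cut_precise` and `cut_for_search` are hypothetical, and the segmentation uses simple forward maximum matching rather than jieba's real algorithm. It only illustrates why a search-oriented mode emits the shorter dictionary words inside a long token, and how a custom dictionary entry changes the result.

```python
# Toy illustration only: a tiny hand-made dictionary, not jieba's data or algorithm.
TOY_DICT = {"清华", "大学", "清华大学", "我", "来到", "北京"}


def cut_precise(text, dictionary=TOY_DICT):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    max_len = max(len(w) for w in dictionary)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def cut_for_search(text, dictionary=TOY_DICT):
    """Emit 2- and 3-character dictionary words inside each long token, then the token."""
    tokens = []
    for word in cut_precise(text, dictionary):
        for size in (2, 3):
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    sub = word[start:start + size]
                    if sub in dictionary:
                        tokens.append(sub)
        tokens.append(word)
    return tokens


sentence = "我来到北京清华大学"
print(cut_precise(sentence))     # only the long token "清华大学" is indexed
print(cut_for_search(sentence))  # "清华" and "大学" are indexed as well

# A custom (industry-specific) entry changes how the same text is split.
print(cut_precise("云计算"))                        # unknown term falls apart into characters
print(cut_precise("云计算", TOY_DICT | {"云计算"}))  # kept whole once it is in the dictionary
```

With only the precise tokens indexed, a query for "清华" would not match a document containing "清华大学"; with the search-style tokens, both queries match.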