Doubts about cohesion (凝固度) #14

Closed
WangQi1024 opened this issue Nov 8, 2019 · 3 comments

Comments

@WangQi1024

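Hello, I'd appreciate an explanation: in "巧克力" (chocolate), "巧客" and "力" have a very high degree of cohesion, so one would lean toward treating "巧克力" as a single word. Why, then, does the program, going by cohesion, pick out a half-word fragment like "巧客" (that is how the blog post describes it)? Thanks.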

@jtyoui
Owner

jtyoui commented Nov 11, 2019

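I think by 巧客 you mean 巧克. The algorithm gathers its statistics in a bag-of-words fashion: to decide whether the three characters 巧克力 form a word, it first evaluates whether 巧克 (two characters) forms a word. When 巧克力 does not occur very often and 巧克 occurs about as many times as 巧克力, the statistics for 巧克 and 巧克力 come out nearly the same, so truncated fragments such as 巧克 naturally show up. The root cause is insufficient data. You can tune the parameters by hand and increase the amount of data to avoid such fragments. Some fragments are actually meaningful, though: 中华人民共和国, 中华, 中华人民 and 共和国 are all valid words. If you only want the coarsest-grained words, just filter the rest out. The filtering algorithm is in https://github.com/jtyoui/Jtyoui/blob/master/jtyoui/data/methods.py (the remove_subset function at line 110):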

import jtyoui

print(jtyoui.remove_subset(['aa', 'a', 'ab']))  
# ['aa', 'ab']
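
The counting argument above can be made concrete with a small sketch. It assumes the commonly used definition of cohesion, the minimum over all two-part splits of p(word) / (p(left) * p(right)), with the corpus length as a rough normalizer for every n-gram length; this is not necessarily what jtyoui computes internally. The helper names `ngram_counts`, `cohesion`, and `drop_substrings` are hypothetical, and `drop_substrings` only mimics the behavior shown for `remove_subset`, not its actual implementation.

```python
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total):
    """Assumed definition: min over splits of p(word) / (p(left) * p(right)).

    With p(x) = count(x) / total this simplifies to
    count(word) * total / (count(left) * count(right)).
    """
    return min(
        counts[word] * total / (counts[word[:k]] * counts[word[k:]])
        for k in range(1, len(word))
    )


def drop_substrings(words):
    """Hypothetical filter: keep only words not contained in another word."""
    return [w for w in words if not any(w != o and w in o for o in words)]


# Tiny corpus in which 巧克 never occurs outside 巧克力.
corpus = "我爱吃巧克力他也爱吃巧克力巧克力很甜"
counts = ngram_counts(corpus, max_len=3)
total = len(corpus)

print(counts["巧克"], counts["巧克力"])        # 3 3
print(cohesion("巧克", counts, total))         # 6.0
print(cohesion("巧克力", counts, total))       # 6.0
print(drop_substrings(["巧克", "巧克力", "中华"]))  # ['巧克力', '中华']
```

Because 巧克 and 巧克力 have identical counts here, their cohesion scores are identical as well, so no threshold on cohesion alone can keep one and reject the other. A substring filter applied afterwards is one way to keep only the longest candidates.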

@jtyoui
Owner

jtyoui commented Nov 11, 2019

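Hello, I'd appreciate an explanation: in "巧克力" (chocolate), "巧客" and "力" have a very high degree of cohesion, so one would lean toward treating "巧克力" as a single word. Why, then, does the program, going by cohesion, pick out a half-word fragment like "巧客" (that is how the blog post describes it)? Thanks.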

https://github.com/jtyoui/Jtyoui/issues/14#issue-520022018

@jtyoui jtyoui closed this as completed Nov 25, 2019
@WangQi1024
Author

WangQi1024 commented Jan 10, 2020 via email
