Was sego's dictionary generated by some tool? I'd like to expand the current dictionary. #14

Open
insionng opened this issue Jun 2, 2015 · 10 comments

Comments


insionng commented Jun 2, 2015

No description provided.


suntong commented Jul 4, 2016

Same question. Being able to expand the dictionary is important.

huichen (Owner) commented Jul 20, 2016

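It was copied directly from jieba's dictionary. You can simply add new words and their frequencies to the dictionary; the frequencies can be obtained by counting over your own corpus.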

insionng (Author) commented

@huichen Could you explain what each column of the dictionary means? I know the first one is the word, but the ones after it are not clear to me.

huichen (Owner) commented Jul 29, 2016

The three columns are: the word, its frequency in the training corpus, and its part of speech.
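For readers who want to script against this layout, here is a minimal Python sketch of reading and writing one dictionary line. It assumes the columns are whitespace-separated, as in jieba's `dict.txt`, and that the part-of-speech column may be missing; `DictEntry`, `parse_dict_line`, and `format_dict_line` are hypothetical helper names, not part of sego or jieba.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    """One dictionary line: word, corpus frequency, optional part of speech."""
    word: str
    freq: int
    pos: Optional[str]


def parse_dict_line(line: str) -> DictEntry:
    # Assumption: columns are separated by whitespace, e.g. "中国 12345 ns".
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least 'word freq', got: {line!r}")
    word, freq = fields[0], int(fields[1])
    # The part-of-speech column is treated as optional here.
    pos = fields[2] if len(fields) > 2 else None
    return DictEntry(word, freq, pos)


def format_dict_line(entry: DictEntry) -> str:
    # Write the entry back in the same space-separated layout.
    parts = [entry.word, str(entry.freq)]
    if entry.pos:
        parts.append(entry.pos)
    return " ".join(parts)
```

A new word can then be added by appending `format_dict_line(DictEntry("新词", 42, "n"))` to the dictionary file, where the word, count, and tag are placeholders.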


phproot commented Aug 8, 2016

Is there a formula for computing the word frequency? How do I obtain it?

huichen (Owner) commented Aug 8, 2016

@phproot It is simply a count of how many times the word occurs in the corpus.


phproot commented Aug 8, 2016

@huichen Where is the corpus? Can I build a corpus of my own, based on big data? And does sego have anything like jieba's feature for adding new words?

huichen (Owner) commented Aug 8, 2016

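You can take documents similar to the ones you index and use them as your corpus, then merge the dictionary generated from them with the one provided here.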
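The thread does not say how the two dictionaries should be merged, so the following Python sketch shows one possible policy only: words already in the provided dictionary keep their entry, and words found only in the generated dictionary are added. `merge_dictionaries` is a hypothetical helper, and each dictionary is assumed to be held in memory as a mapping from word to a `(frequency, part_of_speech)` pair.

```python
from typing import Dict, Optional, Tuple

Entry = Tuple[int, Optional[str]]  # (frequency, part of speech or None)


def merge_dictionaries(base: Dict[str, Entry],
                       generated: Dict[str, Entry]) -> Dict[str, Entry]:
    """Return a new dictionary: base entries win, new words are added.

    Assumption: counts from the two corpora are not on the same scale,
    so they are not summed; only words missing from `base` are taken
    from `generated`.
    """
    merged = dict(base)  # copy so the inputs are left untouched
    for word, entry in generated.items():
        if word not in merged:
            merged[word] = entry
    return merged
```

If your corpus is much smaller or larger than the one behind the provided dictionary, you may want to rescale the new counts before merging; that choice is left open here.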


phproot commented Aug 8, 2016

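So, say my article database contains 100,000 entries. I treat those articles as the corpus and generate the dictionary from them, right? Do I generate it with the mlf you developed?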

huichen (Owner) commented Aug 24, 2016

@phproot Not with mlf. Just do text matching against your corpus and take simple counts.
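As a rough illustration of "text matching and simple counts", the Python sketch below counts how often each candidate word appears as a substring across a set of documents. It assumes you already have a list of candidate words and the corpus as a list of strings; `count_word_frequencies` is a hypothetical helper, not something sego provides.

```python
from typing import Dict, Iterable, List


def count_word_frequencies(documents: Iterable[str],
                           candidate_words: List[str]) -> Dict[str, int]:
    """Count substring occurrences of each candidate word in the corpus."""
    # Skip empty strings: str.count("") would match at every position.
    counts = {word: 0 for word in candidate_words if word}
    for text in documents:
        for word in counts:
            # str.count returns the number of non-overlapping matches.
            counts[word] += text.count(word)
    return counts


if __name__ == "__main__":
    corpus = ["搜索引擎使用分词", "分词和搜索引擎,分词"]
    print(count_word_frequencies(corpus, ["分词", "搜索引擎", "词库"]))
```

Plain substring matching also counts a word when it appears inside a longer word, so the numbers are an approximation. For a large corpus, looping over every candidate for every document gets slow, and a multi-pattern matcher would be a natural replacement.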
