The default stopWordDict has no effect #55

Closed
xuwenbin opened this issue May 6, 2016 · 9 comments


xuwenbin commented May 6, 2016

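Environment: Ubuntu 14.04, with the latest version of nodejieba downloaded and installed. nodejieba has been required and loaded.

Steps to reproduce: pass "7天或者15天" to the cut function. The returned array still contains the word "或者". I checked the default stopWordDict and it does include "或者", so why does the cut result still contain it?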

Owner

yanyiwu commented May 6, 2016

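I think there is a misunderstanding here, so let me explain:

Segmentation itself does not use the stop-word dictionary. The stop-word dictionary is only used by extract, that is, during keyword extraction.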
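
Because cut does not consult the stop-word dictionary, a caller who wants stop words removed from segmentation output can filter the tokens afterwards. What follows is a minimal sketch, assuming the tokens are plain strings. `removeStopWords`, the sample token array, and the one-word stop-word set are hypothetical illustrations and not part of nodejieba; the token array is made up and is not actual cut output.

```javascript
// Hypothetical helper: drop every token that appears in the stop-word set.
function removeStopWords(tokens, stopWords) {
  return tokens.filter(function (token) {
    return !stopWords.has(token);
  });
}

// Assumed example data: tokens a segmenter might return for "7天或者15天".
var sampleTokens = ["7", "天", "或者", "15", "天"];

// Assumed stop-word set; a real one would be loaded from a dictionary file.
var sampleStopWords = new Set(["或者"]);

var filtered = removeStopWords(sampleTokens, sampleStopWords);
console.log(filtered); // "或者" is no longer in the list
```

A Set is used so that each lookup stays cheap even when the stop-word list is long.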

Author

xuwenbin commented May 6, 2016

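Understood. After switching to the extract function, it does behave that way. Is there detailed documentation for the second parameter of extract? I looked through the project page but could not find a link to any specific documentation.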

Owner

yanyiwu commented May 6, 2016

Thanks for the reminder; the README did not make this clear. The second parameter is topN: extract returns the topN keywords with the highest weights.
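
To make the meaning of topN concrete, here is a small sketch of selecting the n highest-weighted entries from a list. It is not nodejieba's implementation. `topNByWeight`, the `{ word, weight }` entry shape, and the weights are assumptions made for illustration.

```javascript
// Hypothetical helper: return the n entries with the highest weight.
function topNByWeight(entries, n) {
  return entries
    .slice() // copy so the caller's array is not reordered
    .sort(function (a, b) { return b.weight - a.weight; }) // descending by weight
    .slice(0, n); // keep at most n entries
}

// Made-up candidates and weights, for illustration only.
var candidates = [
  { word: "alpha", weight: 2.5 },
  { word: "beta", weight: 9.1 },
  { word: "gamma", weight: 4.0 }
];

var top2 = topNByWeight(candidates, 2);
console.log(top2.map(function (entry) { return entry.word; }));
```

When n is larger than the number of candidates, the helper returns all of them, which is the behaviour a "give me everything" caller relies on.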

Author

xuwenbin commented May 6, 2016

If the caller wants to get all of the keywords, what value should be passed for that parameter?

Owner

yanyiwu commented May 6, 2016

Just pass in a very large value, for example Number.MAX_VALUE

Author

xuwenbin commented May 6, 2016

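From my testing, both cut and extract keep Arabic numerals, and each number comes out as a separate token after segmentation. For this kind of token, do people usually treat it as a stop word or keep it?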
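
The thread does not settle whether numerals should be kept, and the answer likely depends on the application. If a caller does decide to drop them, one possible approach is a post-processing filter like the sketch below. `isDigitToken`, `dropDigitTokens`, and the sample tokens are hypothetical, and the pattern only matches tokens made up entirely of ASCII digits.

```javascript
// Hypothetical predicate: true when the token consists only of ASCII digits.
function isDigitToken(token) {
  return /^[0-9]+$/.test(token);
}

// Hypothetical helper: remove purely numeric tokens from a token list.
function dropDigitTokens(tokens) {
  return tokens.filter(function (token) {
    return !isDigitToken(token);
  });
}

// Assumed example data, not actual segmenter output.
var tokensWithDigits = ["7", "天", "或者", "15", "天"];

var withoutDigits = dropDigitTokens(tokensWithDigits);
console.log(withoutDigits); // "7" and "15" are gone, the other tokens remain
```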

Owner

yanyiwu commented May 6, 2016

A correction: Number.MAX_VALUE is too large and does not work.
A value this large is enough.

var topN = 1<<30;
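
The thread does not say why Number.MAX_VALUE fails. One plausible reading, stated here purely as an assumption, is that topN ends up in a 32-bit integer on the native side: 1<<30 fits in that range, while Number.MAX_VALUE is far outside it. The sketch below only compares magnitudes in plain JavaScript and does not call nodejieba. `INT32_MAX` and `fitsInInt32` are hypothetical names.

```javascript
// Largest value of a signed 32-bit integer (assumed limit, see lead-in).
var INT32_MAX = Math.pow(2, 31) - 1; // 2147483647

// Hypothetical check: is the value an integer within the signed 32-bit range?
function fitsInInt32(value) {
  return Number.isInteger(value) &&
    value >= -INT32_MAX - 1 &&
    value <= INT32_MAX;
}

var largeTopN = 1 << 30; // 1073741824

console.log(fitsInInt32(largeTopN));        // true
console.log(fitsInInt32(Number.MAX_VALUE)); // false
```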

xuwenbin closed this as completed May 7, 2016

skyblue commented Dec 6, 2016

Does stopWordDict also have no effect when calling tag?

Owner

yanyiwu commented Dec 6, 2016

@skyblue
