hugemake core dumps #4

Closed
samson-wang opened this issue Dec 23, 2014 · 4 comments

Comments

@samson-wang

Hi,

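When I use hugemake to process a document of about 400 MB, it fails at line 808 of src/hugemaker.cpp, while reading the last seq file:
assert(-1 != open_status);
Processing a 20 MB document does not crash, but the output is garbled, and converting it from GBK to UTF-8 does not help either. The input document's encoding is:
file ~/work_corpus_head
/home/ubuntu/work_corpus_head: UTF-8 Unicode text, with very long lines
The output document's encoding is:
file work_word_new
work_word_new: Non-ISO extended-ASCII text, with LF, NEL line terminators
Are there any requirements on the input document format?
Thanks!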

@samson-wang
Author

I had not noticed that only GBK is supported. After converting the document to GBK, it no longer core dumps.
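For anyone hitting the same problem, here is a minimal sketch of the UTF-8 to GBK conversion step, assuming the corpus is valid UTF-8. The function name `convert_utf8_to_gbk` and its `errors` parameter are illustrative and not part of hugemake. It reads in fixed-size chunks because the corpus has very long lines.

```python
def convert_utf8_to_gbk(src, dst, errors="strict", chunk_chars=1 << 20):
    """Re-encode a UTF-8 text file as GBK (hypothetical helper, not part of hugemake).

    errors="strict" raises UnicodeEncodeError on characters GBK cannot represent;
    errors="replace" writes "?" for them instead.
    """
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="gbk", errors=errors, newline="") as fout:
        # Text-mode read(n) counts characters, so a multi-byte character
        # is never split across two chunks.
        for chunk in iter(lambda: fin.read(chunk_chars), ""):
            fout.write(chunk)
```

Calling `convert_utf8_to_gbk("work_corpus_head", "work_corpus_head.gbk")` would produce a GBK copy to feed to hugemake; the output file name here is only an example. The output of hugemake would then also be GBK and would need converting back to UTF-8 for viewing.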

@jannson
Owner

jannson commented Dec 24, 2014

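:) I was being lazy back then and only supported GBK to keep things simple. In GBK, two chars are enough to store one character, which makes the implementation much easier.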
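To illustrate the point about two chars per character, here is a small sketch that splits GBK bytes into characters. It assumes only the general GBK layout (bytes below 0x80 are ASCII, a lead byte in 0x81..0xFE starts a two-byte character) and is not taken from hugemake's source; `split_gbk` is a hypothetical name.

```python
def split_gbk(data):
    """Split GBK-encoded bytes into one bytes object per character (illustrative only)."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte in 0x81..0xFE starts a two-byte character;
        # anything else is treated as a single byte (ASCII).
        # The length check guards against truncated input.
        width = 2 if 0x81 <= data[i] <= 0xFE and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


if __name__ == "__main__":
    for ch in "保湿隔离":
        # Each of these characters takes 2 bytes in GBK and 3 bytes in UTF-8.
        print(ch, len(ch.encode("gbk")), len(ch.encode("utf-8")))
```

With UTF-8 the same characters take three bytes each and other characters take one to four, so a fixed two-byte step no longer works. That is consistent with UTF-8 input producing garbled output.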

@irwenqiang

| term | freq | left entropy | right entropy |
|---|---|---|---|
| 湿隔离 | 257 | 12.850000 | 2.502627 |
| 保湿隔离 | 255 | 12.750000 | 2.498990 |

These two could actually be merged into one. It would be good to add support for that in the code.

@jannson
Owner

jannson commented May 21, 2015

There is actually support for this, but it is far from perfect. You can see that many of the other words come out fine.
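One possible post-processing step for the merging discussed above is sketched below. It is not hugemake's actual logic, which this thread does not show. It drops a term when a longer term contains it and occurs almost as often; `drop_nested_terms` and the `min_ratio` threshold are assumptions chosen for illustration.

```python
def drop_nested_terms(freqs, min_ratio=0.95):
    """Drop terms absorbed by a longer term with nearly the same frequency.

    freqs maps term -> frequency. A term is dropped when some other term
    contains it as a substring and that term's frequency is at least
    min_ratio times the shorter term's frequency (hypothetical heuristic).
    """
    kept = {}
    for term, freq in freqs.items():
        absorbed = any(
            term != longer and term in longer and longer_freq >= min_ratio * freq
            for longer, longer_freq in freqs.items()
        )
        if not absorbed:
            kept[term] = freq
    return kept


if __name__ == "__main__":
    print(drop_nested_terms({"湿隔离": 257, "保湿隔离": 255}))
```

With the two terms from the example, 湿隔离 (257) is dropped because 保湿隔离 (255) contains it and accounts for nearly all of its occurrences. The pairwise comparison is quadratic in the number of terms, so a large vocabulary would need an index over substrings instead.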


3 participants