hugemake core dumps #4
Comments
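I hadn't noticed that only GBK is supported. After converting the document to GBK, it no longer core dumps.
:) I was being lazy at the time and supported only GBK to keep things simple. In GBK, two chars are enough to record one character, which makes the implementation much easier.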
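To make the "two chars per character" point concrete, here is a minimal sketch of splitting a byte string under the GBK assumption. It is not taken from hugemake; the names `is_gbk_lead`, `is_gbk_trail` and `split_gbk` are hypothetical. It relies only on the GBK byte ranges (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; not hugemake's own code.

// True when `b` can start a two-byte GBK character (0x81-0xFE).
bool is_gbk_lead(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// True when `b` can be the second byte of a GBK character
// (0x40-0xFE, excluding 0x7F).
bool is_gbk_trail(unsigned char b) {
    return b >= 0x40 && b <= 0xFE && b != 0x7F;
}

// Split `bytes` into characters, assuming the input is GBK.
// ASCII bytes become one-byte units, everything else two-byte units.
// Returns false as soon as a byte sequence is not valid GBK.
bool split_gbk(const std::string& bytes, std::vector<std::string>& out) {
    out.clear();
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b < 0x80) {  // plain ASCII
            out.push_back(bytes.substr(i, 1));
            i += 1;
            continue;
        }
        if (!is_gbk_lead(b) || i + 1 >= bytes.size() ||
            !is_gbk_trail(static_cast<unsigned char>(bytes[i + 1]))) {
            return false;  // dangling or invalid lead/trail byte
        }
        assert(i + 2 <= bytes.size());
        out.push_back(bytes.substr(i, 2));
        i += 2;
    }
    return true;
}
```

Note that UTF-8 input does not necessarily fail this check. The six UTF-8 bytes of 中文 (E4 B8 AD E6 96 87) pass as three two-byte units, none of which is the intended character, which is consistent with garbled output when UTF-8 text is fed to a GBK-only tool.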
term        freq   left entropy   right entropy
湿隔离       257    12.850000      2.502627
保湿隔离     255    12.750000      2.498990

These two could really be merged into one; support for that could be added to the code.
There actually is support for this, but it is far from perfect. You can see that many of the other terms come out OK.
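One way the merge discussed above could be approached is sketched below. This is an assumption for illustration, not how hugemake does it: a term is dropped when it is a proper substring of a longer term that occurs almost as often (257 vs. 255 in the example from the thread). The `Term` struct, `drop_subsumed` and the `ratio` parameter are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for illustration; not hugemake's own type.
struct Term {
    std::string text;  // term bytes, compared bytewise
    long freq;         // number of occurrences
};

// Keep only terms that are not "subsumed": a term is subsumed when it
// is a proper substring of a longer term whose frequency is at least
// `ratio` times its own (ratio in (0, 1]).
std::vector<Term> drop_subsumed(const std::vector<Term>& terms, double ratio) {
    assert(ratio > 0.0 && ratio <= 1.0);
    std::vector<Term> kept;
    for (std::size_t i = 0; i < terms.size(); ++i) {
        const Term& a = terms[i];
        bool subsumed = false;
        for (std::size_t j = 0; j < terms.size() && !subsumed; ++j) {
            if (i == j) continue;
            const Term& b = terms[j];
            subsumed = b.text.size() > a.text.size() &&
                       b.text.find(a.text) != std::string::npos &&
                       static_cast<double>(b.freq) >= ratio * static_cast<double>(a.freq);
        }
        if (!subsumed) kept.push_back(a);
    }
    return kept;
}
```

With `ratio = 0.95`, 湿隔离 (257) is dropped in favour of 保湿隔离 (255), since 255 is at least 0.95 × 257. The substring search here is bytewise, so on GBK data it can match across character boundaries; a real implementation would need to compare whole characters.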
Hi,
Using hugemake on a document of about 400M fails at line 808 of src/hugemaker.cpp. It dies while reading the last seq file:
assert(-1 != open_status);
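A 20M document is processed without problems, but the result is garbled, and trying to convert it from GBK to UTF-8 does not work either. The input document's encoding is: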
file ~/work_corpus_head
/home/ubuntu/work_corpus_head: UTF-8 Unicode text, with very long lines
The output document's encoding is:
file work_word_new
work_word_new: Non-ISO extended-ASCII text, with LF, NEL line terminators
Are there any requirements on the input document format?
Thanks!
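The assertion quoted above only shows that an open call returned -1, not why. As a hedged sketch (POSIX only, with a hypothetical helper name `open_readonly`, not hugemake's code), the failure reason can be captured from `errno` instead of aborting. That would distinguish, for instance, a missing file (`ENOENT`) from a per-process file descriptor limit (`EMFILE`).

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>
#include <fcntl.h>  // POSIX open(), O_RDONLY

// Hypothetical helper for illustration; not hugemake's own code.
// Opens `path` read-only. On failure returns -1 and stores a readable
// reason in `err`; on success returns the descriptor and leaves `err`
// untouched. The caller is responsible for close()-ing the descriptor.
int open_readonly(const char* path, std::string& err) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd == -1) {
        const int saved = errno;  // copy before any call can overwrite it
        err = std::string("open(") + path + ") failed: " + std::strerror(saved);
    }
    return fd;
}
```

A caller would print `err` and exit with a non-zero status. Unlike `assert`, this check is still active in builds compiled with `NDEBUG`.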