HanLP.segment: segmentation errors involving “一” #1421

Closed
1 task done
tiandiweizun opened this issue Feb 13, 2020 · 5 comments
@tiandiweizun
Contributor

tiandiweizun commented Feb 13, 2020

Describe the bug
A clear and concise description of what the bug is.
Using the Java version (1.7.6), calling HanLP.segment on “十一介绍” returns [十/m, 一介/nz, 绍/nz], and “十一中国放假吗” returns [十/m, 一中/j, 国/n, 放假/vi, 吗/y]. Segmentation around “一” goes wrong in cases like these.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

HanLP.segment("十一中国放假吗")

Describe the current behavior
A clear and concise description of what happened.
“十一介绍” is segmented as [十/m, 一介/nz, 绍/nz].
Expected behavior
A clear and concise description of what you expected to happen.
[十一, 介绍]

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): windows
  • Python version: java
  • HanLP version: 1.7.6

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

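So there are many patterns of the form "一A@*", where A stands for a single character and * for any string. I pulled out all the words that begin with A and sorted them by frequency, and found many problems of this kind. The cause is probably that the transitions from 一中, 一发, 一通 and similar entries are not appropriate. I've uploaded the file, in case you're interested in improving this.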
hanLP关于一的分词错误文件.txt

中国=39573 Example: 十一中国放假吗
发展=20444 Example: 十一发展计划
时间=17679
通过=16921 Example: 十一通过山海关吗
发现=16201
经济=14634
会议=13344
同时=13137
生活=12721
对于=10684
发生=10588
中央=10431
介绍=9492
价格=9121
一直=8928
行为=8549
中心=8440
一起=8280
环境=7994
大家=7830
期间=7734
世界=7554
代表=7412
大学=7294
发布=7279
时候=7149
领导=7055
手机=6656
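
The filtering described above (take the words whose first character A also forms a dictionary entry “一A”, then sort by frequency) could be sketched roughly as follows. This is only an illustration, not the script that produced the attached file: the class name `YiPrefixCandidates`, the method `candidates`, and the small in-memory sets are all hypothetical, and the sample frequencies are copied from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class YiPrefixCandidates {
    /**
     * Returns the entries of wordFreq whose first character A also forms
     * an entry "一A" in yiWords, sorted by descending frequency.
     * Both arguments are hypothetical stand-ins for real dictionary data.
     */
    public static List<Map.Entry<String, Integer>> candidates(
            Set<String> yiWords, Map<String, Integer> wordFreq) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : wordFreq.entrySet()) {
            String word = e.getKey();
            if (word.isEmpty()) {
                continue;
            }
            // charAt(0) is enough for the BMP characters used here.
            String yiA = "一" + word.charAt(0);
            if (yiWords.contains(yiA)) {
                result.add(e);
            }
        }
        // Highest frequency first.
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Sample "一A" entries named in the issue text.
        Set<String> yiWords = new HashSet<>(Arrays.asList("一中", "一发", "一通"));
        // Frequencies taken from the list in the issue.
        Map<String, Integer> wordFreq = new LinkedHashMap<>();
        wordFreq.put("发展", 20444);
        wordFreq.put("价格", 9121); // filtered out: "一价" is not in the sample set
        wordFreq.put("中国", 39573);
        wordFreq.put("通过", 16921);
        for (Map.Entry<String, Integer> e : candidates(yiWords, wordFreq)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

In a real run the two inputs would presumably come from the dictionary files rather than from hard-coded literals.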

  • I've completed this form and searched the web for solutions.
@hankcs
Owner

hankcs commented Feb 13, 2020

[auto-reply] Thanks for your comment. However, the essential information is required. Please carefully fill out the form.

@hankcs hankcs added auto-replied and removed bug labels Feb 13, 2020
hankcs added a commit that referenced this issue Feb 14, 2020
@hankcs
Owner

hankcs commented Feb 14, 2020

This problem is related to atomic segmentation (原子分词). I've committed a patch; you're welcome to test it.

@tiandiweizun
Contributor Author

tiandiweizun commented Feb 15, 2020

The pure-Chinese tests are basically OK, but input mixed with English letters still fails.

    System.out.println(HanLP.segment("android十一中国版本"));
    System.out.println(HanLP.segment("appple十一介绍"));
    System.out.println(HanLP.segment("MIUI十一通过测试了吗"));

Actual output:
[android/nx, 十/m, 一中/j, 国/n, 版本/n]
[appple/nx, 十/m, 一介/nz, 绍/nz]
[MIUI/nx, 十/m, 一通/v, 过/uguo, 测试/vn, 了/ule, 吗/y]
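
For regression checks on cases like these, the printed results can be inspected without calling HanLP at all. The sketch below parses a printed result in the `[word/tag, ...]` form shown above and flags the symptom reported in this issue, namely “十” immediately followed by a word starting with “一”. The class `SegmentOutputCheck` and both method names are hypothetical, and the parser assumes that no word itself contains the separator ", ".

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentOutputCheck {
    /** Parses a printed result such as "[十/m, 一介/nz, 绍/nz]" into its words, dropping the tags. */
    public static List<String> words(String printed) {
        String body = printed.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<String> out = new ArrayList<>();
        if (body.isEmpty()) {
            return out;
        }
        // Assumption: terms are separated by ", " and no word contains that sequence.
        for (String term : body.split(", ")) {
            int slash = term.lastIndexOf('/');
            out.add(slash >= 0 ? term.substring(0, slash) : term);
        }
        return out;
    }

    /** True if "十" is immediately followed by a word that starts with "一". */
    public static boolean splitsShiYi(List<String> words) {
        for (int i = 0; i + 1 < words.size(); i++) {
            if (words.get(i).equals("十") && words.get(i + 1).startsWith("一")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The three outputs reported in this comment.
        String[] reported = {
            "[android/nx, 十/m, 一中/j, 国/n, 版本/n]",
            "[appple/nx, 十/m, 一介/nz, 绍/nz]",
            "[MIUI/nx, 十/m, 一通/v, 过/uguo, 测试/vn, 了/ule, 吗/y]"
        };
        for (String line : reported) {
            System.out.println(words(line) + " splits 十一: " + splitsShiYi(words(line)));
        }
    }
}
```

The check is deliberately narrow: it only detects this one symptom and says nothing about whether the rest of the segmentation is correct.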

@hankcs
Owner

hankcs commented Feb 15, 2020

I've now made an improvement targeting Chinese numerals.
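
As a rough picture of what treating Chinese numerals at the atom level can mean, the sketch below groups each run of numeral characters into a single atom and leaves every other character as its own atom. It is not the code from the referenced commit, which is not shown in this thread; the class `NumeralAtoms`, the method `atoms`, and the numeral character set are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NumeralAtoms {
    // Assumed set of Chinese numeral characters, for illustration only.
    private static final String NUMERALS = "零一二三四五六七八九十百千万亿两";

    /** Splits text into atoms: each run of numeral characters is one atom, every other char is its own atom. */
    public static List<String> atoms(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int j = i;
            // Extend j over a maximal run of numeral characters.
            while (j < text.length() && NUMERALS.indexOf(text.charAt(j)) >= 0) {
                j++;
            }
            if (j > i) {
                out.add(text.substring(i, j));
                i = j;
            } else {
                out.add(String.valueOf(text.charAt(i)));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(atoms("十一介绍"));
        System.out.println(atoms("十一中国放假吗"));
    }
}
```

With “十一” kept together as one atom, later stages no longer see a standalone “一” that could pair with the following character. A greedy grouping like this would also cut words such as 一直 or 一起 after their first character, so a real segmenter has to reconcile numeral atoms with the dictionary.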

@tiandiweizun
Contributor Author

Thank you very much.
