HanLP.segment: segmentation errors involving "一" #1421

tiandiweizun · 2020-02-13T02:44:58Z

Describe the bug
With the Java release 1.7.6, HanLP.segment splits "十一介绍" into [十/m, 一介/nz, 绍/nz] and "十一中国放假吗" into [十/m, 一中/j, 国/n, 放假/vi, 吗/y]. Inputs like these, where "一" is followed by another character, are segmented incorrectly.
Code to reproduce the issue

HanLP.segment("十一中国放假吗")

Describe the current behavior
"十一介绍" is segmented as [十/m, 一介/nz, 绍/nz]
Expected behavior
[十一, 介绍]

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): windows
Python version: java
HanLP version: 1.7.6

Other info / logs

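So many cases follow the pattern "一A@*", where A stands for a single character and * for any string. I collected every word that begins with such an A and sorted them by frequency, which turned up many problems of this kind. The likely cause is that transitions through 一中, 一发, 一通 and similar entries are not appropriate here. I have uploaded the file in case you are interested in improving this.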
hanLP关于一的分词错误文件.txt

中国=39573 Example: 十一中国放假吗
发展=20444 Example: 十一发展计划
时间=17679
通过=16921 Example: 十一通过山海关吗
发现=16201
经济=14634
会议=13344
同时=13137
生活=12721
对于=10684
发生=10588
中央=10431
介绍=9492
价格=9121
一直=8928
行为=8549
中心=8440
一起=8280
环境=7994
大家=7830
期间=7734
世界=7554
代表=7412
大学=7294
发布=7279
时候=7149
领导=7055
手机=6656
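
The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the attached file. It assumes a word-frequency map is already loaded in memory. The class name `YiPrefixCandidates`, its method names, and the placeholder frequencies of 1 are hypothetical; only the frequencies for 中国, 发展, 通过 and 介绍 come from the list above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds words whose first character A also forms
// the dictionary word "一" + A, and lists them from most to least frequent.
final class YiPrefixCandidates {

    // Toy dictionary. Frequencies of 1 are placeholders, not real counts.
    static Map<String, Integer> sample() {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("介绍", 9492);
        freq.put("中国", 39573);
        freq.put("通过", 16921);
        freq.put("发展", 20444);
        freq.put("放假", 1);   // placeholder; "一放" is not in the toy dictionary
        freq.put("一中", 1);   // placeholder
        freq.put("一发", 1);   // placeholder
        freq.put("一通", 1);   // placeholder
        freq.put("一介", 1);   // placeholder
        return freq;
    }

    // Returns "word=frequency" strings, highest frequency first.
    static List<String> candidates(Map<String, Integer> wordFreq) {
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : wordFreq.entrySet()) {
            String word = e.getKey();
            if (word.length() < 2) {
                continue; // a single character cannot be split as A + rest
            }
            String yiWord = "一" + word.charAt(0);
            if (wordFreq.containsKey(yiWord)) {
                hits.add(e);
            }
        }
        hits.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hits) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : candidates(sample())) {
            System.out.println(line);
        }
    }
}
```

Each word it reports is one where a preceding "十一" could be split as 十 | 一A | rest, which matches the errors shown in the examples.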

I've completed this form and searched the web for solutions.

hankcs · 2020-02-13T02:45:04Z

[auto-reply] Thanks for your comment. However, the essential information is required. Please carefully fill out the form.

hankcs · 2020-02-14T20:54:08Z

This problem is related to atomic segmentation. I have committed a patch; you are welcome to test it.

tiandiweizun · 2020-02-15T01:49:48Z

Pure-Chinese input is basically fine in my tests, but input mixed with Latin letters still fails:

    System.out.println(HanLP.segment("android十一中国版本"));
    System.out.println(HanLP.segment("appple十一介绍"));
    System.out.println(HanLP.segment("MIUI十一通过测试了吗"));

Actual output:
[android/nx, 十/m, 一中/j, 国/n, 版本/n]
[appple/nx, 十/m, 一介/nz, 绍/nz]
[MIUI/nx, 十/m, 一通/v, 过/uguo, 测试/vn, 了/ule, 吗/y]

hankcs · 2020-02-15T02:07:37Z

Chinese numerals have now been improved as well.
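
To illustrate the general idea of grouping Chinese numerals during atomic segmentation, here is a minimal sketch. It is not the code from the HanLP patch: the class `AtomSketch`, the character classes, and the numeral list are assumptions made for this example. It splits text into atoms by character class, so a run such as "十一" stays together even when it follows Latin letters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of atom splitting by character class.
final class AtomSketch {

    // Assumed set of Chinese numeral characters; a real system may use a different set.
    private static final String CHINESE_NUMERALS = "零〇一二三四五六七八九十百千万亿两";

    private enum Kind { LATIN, DIGIT, CN_NUMERAL, OTHER }

    private static Kind kindOf(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            return Kind.LATIN;
        }
        if (c >= '0' && c <= '9') {
            return Kind.DIGIT;
        }
        if (CHINESE_NUMERALS.indexOf(c) >= 0) {
            return Kind.CN_NUMERAL;
        }
        return Kind.OTHER;
    }

    // Runs of letters, digits, or Chinese numerals become one atom each;
    // every other character is an atom on its own.
    static List<String> atoms(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Kind kind = kindOf(text.charAt(start));
            int end = start + 1;
            if (kind != Kind.OTHER) {
                while (end < text.length() && kindOf(text.charAt(end)) == kind) {
                    end++;
                }
            }
            out.add(text.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(atoms("android十一中国版本"));
        System.out.println(atoms("MIUI十一通过测试了吗"));
    }
}
```

A rule this blunt also separates "一" from the rest of ordinary words such as 一起 or 一直, so it only illustrates the grouping step and is not a drop-in fix.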

tiandiweizun · 2020-02-15T02:31:31Z

Thank you very much.

tiandiweizun added the bug label Feb 13, 2020

tiandiweizun assigned hankcs Feb 13, 2020

hankcs added auto-replied and removed bug labels Feb 13, 2020

hankcs added a commit that referenced this issue Feb 14, 2020

Improve atomic segmentation, fix #1421

e8a920c

hankcs closed this as completed Feb 14, 2020

hankcs added improvement and removed auto-replied labels Feb 14, 2020

hankcs added a commit that referenced this issue Feb 15, 2020

Further improve atomic segmentation, fix #1421 (comment)

b62db0d
