
Added recognition of quantifier phrases! #9

Closed
a198720 opened this issue May 6, 2015 · 8 comments

@a198720

a198720 commented May 6, 2015

Hi, author! I've added recognition of quantifier phrases! The main code is as follows:
package com.hankcs.hanlp.recognition.mq;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.corpus.tag.Nature;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.seg.common.Vertex;
import com.hankcs.hanlp.seg.common.WordNet;
import com.hankcs.hanlp.utility.Predefine;

import java.util.List;
import java.util.ListIterator;

import static com.hankcs.hanlp.dictionary.nr.NRConstant.WORD_ID;

/**
 * Quantifier-phrase recognition
 * @author hankcs
 */
public class TranslatedQuantifierRecognition
{
    /**
     * Perform recognition
     * @param segResult      coarse segmentation result
     * @param wordNetOptimum word graph corresponding to the coarse result
     * @param wordNetAll     full word graph
     */
    public static void Recognition(List<Vertex> segResult, WordNet wordNetOptimum, WordNet wordNetAll)
    {
        StringBuilder sbQuantifier = new StringBuilder();
        int appendTimes = 0;
        ListIterator<Vertex> listIterator = segResult.listIterator();
        listIterator.next();  // skip the beginning-of-sentence vertex
        int line = 1;
        int activeLine = 1;
        while (listIterator.hasNext())
        {
            Vertex vertex = listIterator.next();
            if (appendTimes > 0)
            {
                if (vertex.guessNature() == Nature.q
                        || vertex.guessNature() == Nature.qt
                        || vertex.guessNature() == Nature.qv
                        || vertex.guessNature() == Nature.nx)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                }
                else
                {
                    // recognition of this span has ended
                    if (appendTimes > 1)
                    {
                        if (HanLP.Config.DEBUG)
                        {
                            System.out.println("quantifier recognized: " + sbQuantifier.toString());
                        }
                        wordNetOptimum.insert(activeLine, new Vertex(Predefine.TAG_QUANTIFIER, sbQuantifier.toString(), new CoreDictionary.Attribute(Nature.mq), WORD_ID), wordNetAll);
                    }
                    sbQuantifier.setLength(0);
                    appendTimes = 0;
                }
            }
            else
            {
                // a numeral (m) triggers recognition
                if (vertex.guessNature() == Nature.m)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                    activeLine = line;
                }
            }

            line += vertex.realWord.length();
        }
    }
}
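The core idea above — a numeral (m) vertex triggers collection, subsequent quantifier vertices are appended, and a multi-token span is merged into a single mq vertex — can be sketched independently of HanLP with plain tagged tokens. This is only an illustrative sketch: the `Token` record and the tag strings `"m"`, `"q"`, `"mq"` are stand-ins, not HanLP's `Vertex`/`Nature` API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal, HanLP-independent sketch of numeral + quantifier merging.
public class QuantifierMergeSketch {
    public record Token(String word, String tag) {
        @Override public String toString() { return word + "/" + tag; }
    }

    // A numeral ("m") opens a span; quantifiers ("q") extend it; anything
    // else closes it. A span of more than one token is merged into "mq".
    public static List<Token> merge(List<Token> input) {
        List<Token> out = new ArrayList<>();
        List<Token> span = new ArrayList<>();
        for (Token t : input) {
            if (span.isEmpty()) {
                if (t.tag().equals("m")) span.add(t);
                else out.add(t);
            } else if (t.tag().equals("q")) {
                span.add(t);
            } else {
                flush(out, span);
                if (t.tag().equals("m")) span.add(t);  // may start a new span
                else out.add(t);
            }
        }
        flush(out, span);  // a span may run to the end of the sentence
        return out;
    }

    // Emit a merged "mq" token only for multi-token spans, mirroring the
    // appendTimes > 1 guard above; a lone numeral passes through unchanged.
    private static void flush(List<Token> out, List<Token> span) {
        if (span.size() > 1) {
            StringBuilder sb = new StringBuilder();
            for (Token t : span) sb.append(t.word());
            out.add(new Token(sb.toString(), "mq"));
        } else {
            out.addAll(span);
        }
        span.clear();
    }
}
```

For the input [十九/m, 元/q, 套餐/n] this yields [十九元/mq, 套餐/n], while a lone numeral stays tagged m, matching the `appendTimes > 1` guard in the recognizer.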

@hankcs
Owner

hankcs commented May 6, 2015

Thanks for the support. The tokenizer now fully supports numerals and quantifier phrases!

@yuchaozhou

StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
String[] testCase = new String[]
{
"十九元套餐包括什么",
"九千九百九十九朵玫瑰",
"壹佰块都不给我",
"9012345678只蚂蚁",
};
for (String sentence : testCase)
{
System.out.println(StandardTokenizer.segment(sentence));
}

======================= Output ========================
[十/m, 九/b, 元/q, 套餐/n, 包括/v, 什么/ry]
[九/b, 千/m, 九/b, 千百/m, 九/b, 十/m, 九/b, 朵/q, 玫瑰/n]
[壹佰块/mq, 都/d, 不/d, 给/p, 我/rr]
[9012345678/m, 只/d, 蚂蚁/n]

Here, segmenting “九千九百九十九朵玫瑰” somehow produces “千百”???

@hankcs
Owner

hankcs commented May 6, 2015

Also, since I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.

@hankcs
Owner

hankcs commented May 6, 2015

data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be bundled into data.zip, and this problem will go away by itself.

@a198720
Author

a198720 commented May 7, 2015

Hi, author! When using the index tokenizer, it seems quantifier phrases are not split down to minimum granularity.

    IndexTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
            {   
                    "中华人民共和国",
                    "十九元套餐包括什么",
                    "九千九百九十九朵玫瑰",
                    "壹佰块都不给我",
                    "9012345678只蚂蚁"
            };
    for (String sentence : testCase)
    {
        System.out.println(IndexTokenizer.segment(sentence));
    }
============== Segmentation results ===================

[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
[十九元/mq, 套餐/n, 包括/v, 什么/r]
[九千九百九十九朵/mq, 玫瑰/n]
[壹佰块/m, 都/d, 不/d, 给/p, 我/r]
[9012345678只/mq, 蚂蚁/n]

@hankcs
Owner

hankcs commented May 7, 2015

What exactly should minimum-granularity splitting of quantifier phrases look like? Splitting into single characters?

@a198720
Author

a198720 commented May 7, 2015

For example, 十九元 should be split into [十九元/mq, 十九/m, 元/q]. It's basically like the IK analyzer's two modes: one is forward maximum matching, which corresponds to HanLP's standard or smart segmentation; the other, minimum-granularity mode, enumerates all possible segmentations.
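The requested index-mode expansion can be illustrated with a tiny helper: given a merged phrase and the boundary between its numeral and measure-word parts, emit the whole token plus both parts. The split index is supplied by hand here purely for illustration; in HanLP it would come from the recognizer's internal spans, and this is not HanLP's actual API.

```java
import java.util.List;

// Sketch of minimum-granularity output for a merged quantifier phrase:
// the whole mq token plus its numeral (m) and measure-word (q) parts.
public class IndexGranularitySketch {
    // split: character index where the numeral ends and the measure word begins
    public static List<String> expand(String mq, int split) {
        return List.of(
                mq + "/mq",                     // whole phrase
                mq.substring(0, split) + "/m",  // numeral part
                mq.substring(split) + "/q"      // measure-word part
        );
    }
}
```

For instance, `expand("十九元", 2)` yields the three tokens [十九元/mq, 十九/m, 元/q] described above.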

@hankcs
Owner

hankcs commented May 7, 2015

It's been improved now; please try again.
