
Added recognition of quantifier phrases! #9

Closed
a198720 opened this issue May 6, 2015 · 8 comments

@a198720

a198720 commented May 6, 2015

Hi, author! I've added recognition of quantifier phrases! The main code is as follows:
package com.hankcs.hanlp.recognition.mq;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.corpus.tag.Nature;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.seg.common.Vertex;
import com.hankcs.hanlp.seg.common.WordNet;
import com.hankcs.hanlp.utility.Predefine;

import java.util.List;
import java.util.ListIterator;

import static com.hankcs.hanlp.dictionary.nr.NRConstant.WORD_ID;

/**
 * Quantifier-phrase recognition
 * @author hankcs
 */
public class TranslatedQuantifierRecognition
{
    /**
     * Perform recognition
     * @param segResult      coarse segmentation result
     * @param wordNetOptimum word graph corresponding to the coarse result
     * @param wordNetAll     full word graph
     */
    public static void Recognition(List<Vertex> segResult, WordNet wordNetOptimum, WordNet wordNetAll)
    {
        StringBuilder sbQuantifier = new StringBuilder();
        int appendTimes = 0;
        ListIterator<Vertex> listIterator = segResult.listIterator();
        listIterator.next();  // skip the beginning-of-sentence vertex
        int line = 1;
        int activeLine = 1;
        while (listIterator.hasNext())
        {
            Vertex vertex = listIterator.next();
            if (appendTimes > 0)
            {
                if (vertex.guessNature() == Nature.q
                        || vertex.guessNature() == Nature.qt
                        || vertex.guessNature() == Nature.qv
                        || vertex.guessNature() == Nature.nx)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                }
                else
                {
                    // recognition of this span has ended
                    if (appendTimes > 1)
                    {
                        if (HanLP.Config.DEBUG)
                        {
                            System.out.println("quantifier recognized: " + sbQuantifier.toString());
                        }
                        wordNetOptimum.insert(activeLine, new Vertex(Predefine.TAG_QUANTIFIER, sbQuantifier.toString(), new CoreDictionary.Attribute(Nature.mq), WORD_ID), wordNetAll);
                    }
                    sbQuantifier.setLength(0);
                    appendTimes = 0;
                }
            }
            else
            {
                // a numeral (m) triggers recognition
                if (vertex.guessNature() == Nature.m)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                    activeLine = line;
                }
            }

            line += vertex.realWord.length();
        }
    }
}
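The core idea above — a numeral (m) vertex triggers collection, subsequent quantifier vertices are appended, and a multi-token span is merged into a single mq vertex — can be sketched independently of HanLP with plain tagged tokens. This is only an illustrative sketch: the `Token` record and the tag strings `"m"`, `"q"`, `"mq"` are stand-ins, not HanLP's `Vertex`/`Nature` API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal, HanLP-independent sketch of numeral + quantifier merging.
public class QuantifierMergeSketch {
    public record Token(String word, String tag) {
        @Override public String toString() { return word + "/" + tag; }
    }

    // A numeral ("m") opens a span; quantifiers ("q") extend it; anything
    // else closes it. A span of more than one token is merged into "mq".
    public static List<Token> merge(List<Token> input) {
        List<Token> out = new ArrayList<>();
        List<Token> span = new ArrayList<>();
        for (Token t : input) {
            if (span.isEmpty()) {
                if (t.tag().equals("m")) span.add(t);
                else out.add(t);
            } else if (t.tag().equals("q")) {
                span.add(t);
            } else {
                flush(out, span);
                if (t.tag().equals("m")) span.add(t);  // may start a new span
                else out.add(t);
            }
        }
        flush(out, span);  // a span may run to the end of the sentence
        return out;
    }

    // Emit a merged "mq" token only for multi-token spans, mirroring the
    // appendTimes > 1 guard above; a lone numeral passes through unchanged.
    private static void flush(List<Token> out, List<Token> span) {
        if (span.size() > 1) {
            StringBuilder sb = new StringBuilder();
            for (Token t : span) sb.append(t.word());
            out.add(new Token(sb.toString(), "mq"));
        } else {
            out.addAll(span);
        }
        span.clear();
    }
}
```

For the input [十九/m, 元/q, 套餐/n] this yields [十九元/mq, 套餐/n], while a lone numeral stays tagged m, matching the `appendTimes > 1` guard in the recognizer.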

@hankcs
Owner

hankcs commented May 6, 2015

Thanks for the support. The tokenizer now fully supports numerals and quantifier phrases!

@yuchaozhou

StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
String[] testCase = new String[]
{
"十九元套餐包括什么",
"九千九百九十九朵玫瑰",
"壹佰块都不给我",
"9012345678只蚂蚁",
};
for (String sentence : testCase)
{
System.out.println(StandardTokenizer.segment(sentence));
}

======================= Output ========================
[十/m, 九/b, 元/q, 套餐/n, 包括/v, 什么/ry]
[九/b, 千/m, 九/b, 千百/m, 九/b, 十/m, 九/b, 朵/q, 玫瑰/n]
[壹佰块/mq, 都/d, 不/d, 给/p, 我/rr]
[9012345678/m, 只/d, 蚂蚁/n]

Here, segmenting “九千九百九十九朵玫瑰” somehow produces “千百”???

@hankcs
Owner

hankcs commented May 6, 2015

Also, since I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.

@hankcs
Owner

hankcs commented May 6, 2015

data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be bundled into data.zip, and this problem will go away by itself.

@a198720
Author

a198720 commented May 7, 2015

Hi, author! When using the index tokenizer, it seems quantifier phrases are not split down to minimum granularity.

    IndexTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
    String[] testCase = new String[]
            {   
                    "中华人民共和国",
                    "十九元套餐包括什么",
                    "九千九百九十九朵玫瑰",
                    "壹佰块都不给我",
                    "9012345678只蚂蚁"
            };
    for (String sentence : testCase)
    {
        System.out.println(IndexTokenizer.segment(sentence));
    }
============== Segmentation results ===================

[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
[十九元/mq, 套餐/n, 包括/v, 什么/r]
[九千九百九十九朵/mq, 玫瑰/n]
[壹佰块/m, 都/d, 不/d, 给/p, 我/r]
[9012345678只/mq, 蚂蚁/n]

@hankcs
Owner

hankcs commented May 7, 2015

What exactly should minimum-granularity splitting of quantifier phrases look like? Splitting into single characters?

@a198720
Author

a198720 commented May 7, 2015

For example, 十九元 should be split into [十九元/mq, 十九/m, 元/q]. It's basically like the IK analyzer's two modes: one is forward maximum matching, which corresponds to HanLP's standard or smart segmentation; the other, minimum-granularity mode, enumerates all possible segmentations.
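The requested index-mode expansion can be illustrated with a tiny helper: given a merged phrase and the boundary between its numeral and measure-word parts, emit the whole token plus both parts. The split index is supplied by hand here purely for illustration; in HanLP it would come from the recognizer's internal spans, and this is not HanLP's actual API.

```java
import java.util.List;

// Sketch of minimum-granularity output for a merged quantifier phrase:
// the whole mq token plus its numeral (m) and measure-word (q) parts.
public class IndexGranularitySketch {
    // split: character index where the numeral ends and the measure word begins
    public static List<String> expand(String mq, int split) {
        return List.of(
                mq + "/mq",                     // whole phrase
                mq.substring(0, split) + "/m",  // numeral part
                mq.substring(split) + "/q"      // measure-word part
        );
    }
}
```

For instance, `expand("十九元", 2)` yields the three tokens [十九元/mq, 十九/m, 元/q] described above.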

@hankcs
Owner

hankcs commented May 7, 2015

It's been improved now; please try again.
