Added recognition of numeral-quantifier compounds! #9
Comments
Thanks for the support. The tokenizer now fully supports numerals and numeral-quantifier compounds!
StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(true);
======================= Output ========================
For "九千九百九十九朵玫瑰" (nine thousand nine hundred and ninety-nine roses), the segmentation result contains "千百"???
Also, because I modified data/dictionary/CoreNatureDictionary.txt, you need to delete the cache file data/dictionary/CoreNatureDictionary.txt.bin for the change to take effect.
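For anyone who prefers to script that step, here is a minimal sketch that removes the cache file with the standard library. The class and method names are made up for illustration, and it assumes the process runs from the directory that contains data/.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ClearDictionaryCache {
    /**
     * Hypothetical helper: deletes the ".bin" cache that sits next to a
     * dictionary text file. Returns true if a cache file was actually removed.
     */
    static boolean deleteCache(Path dictionaryTxt) throws IOException {
        Path cache = Paths.get(dictionaryTxt.toString() + ".bin");
        return Files.deleteIfExists(cache);
    }

    public static void main(String[] args) throws IOException {
        // Path from the comment above; assumed to be relative to the working directory.
        Path dict = Paths.get("data/dictionary/CoreNatureDictionary.txt");
        System.out.println(deleteCache(dict) ? "cache deleted" : "no cache found");
    }
}
```

The cache should then be regenerated the next time the dictionary is loaded, which is what the comment above relies on.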
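data-for-1.1.5.zip still contains the old data. When the next version is released, the new cache will be packed into data.zip as well, so this problem will go away on its own.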
Author! In index segmentation, it seems numeral-quantifier compounds are not split down to the finest granularity.
[中华人民共和国/ns, 中华/nz, 中华人民/nz, 华人/n, 人民/n, 共和/n, 共和国/n]
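What exactly should finest-granularity splitting of a numeral-quantifier compound look like? Splitting it into single characters?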
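For example, 十九元 should be split into something like [十九元/mq, 十九/m, 元/q]. It is much like the two modes in IK: one is forward maximum matching, which corresponds to standard or smart segmentation in HanLP; the other is minimum matching, which outputs every possible segmentation.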
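To make the requested output concrete, here is a minimal sketch of that expansion. It is not HanLP's index segmentation code: `expand` is a hypothetical helper, and it assumes the compound has already been recognized and split into its numeral part and its quantifier part.

```java
import java.util.ArrayList;
import java.util.List;

class QuantifierIndexSketch {
    /**
     * Hypothetical helper: returns the whole compound first, followed by its
     * numeral and quantifier sub-terms, each formatted as "word/tag".
     */
    static List<String> expand(String numeral, String quantifier) {
        List<String> terms = new ArrayList<>();
        terms.add(numeral + quantifier + "/mq"); // the whole compound
        terms.add(numeral + "/m");               // the numeral part
        terms.add(quantifier + "/q");            // the quantifier part
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expand("十九", "元")); // prints [十九元/mq, 十九/m, 元/q]
    }
}
```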
This has been improved; please try again.
Author! I have added recognition of numeral-quantifier compounds! The main code is as follows:
package com.hankcs.hanlp.recognition.mq;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.corpus.tag.Nature;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.seg.common.Vertex;
import com.hankcs.hanlp.seg.common.WordNet;
import com.hankcs.hanlp.utility.Predefine;

import java.util.List;
import java.util.ListIterator;

import static com.hankcs.hanlp.dictionary.nr.NRConstant.WORD_ID;

/**
 * Recognizes numeral-quantifier compounds (nature mq).
 */
public class TranslatedQuantifierRecognition
{
    /**
     * Runs the recognition.
     *
     * @param segResult      coarse segmentation result
     * @param wordNetOptimum word net corresponding to the coarse segmentation result
     * @param wordNetAll     full word net
     */
    public static void Recognition(List<Vertex> segResult, WordNet wordNetOptimum, WordNet wordNetAll)
    {
        StringBuilder sbQuantifier = new StringBuilder();
        int appendTimes = 0;
        ListIterator<Vertex> listIterator = segResult.listIterator();
        listIterator.next(); // skip the first vertex
        int line = 1;        // word-net row of the current vertex
        int activeLine = 1;  // row where the current candidate starts
        while (listIterator.hasNext())
        {
            Vertex vertex = listIterator.next();
            if (appendTimes > 0)
            {
                // A numeral has been seen; keep appending quantifier-like vertices.
                if (vertex.guessNature() == Nature.q
                        || vertex.guessNature() == Nature.qt
                        || vertex.guessNature() == Nature.qv
                        || vertex.guessNature() == Nature.nx)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                }
                else
                {
                    // End of the candidate
                    if (appendTimes > 1)
                    {
                        if (HanLP.Config.DEBUG)
                        {
                            System.out.println("Numeral-quantifier compound recognized: " + sbQuantifier.toString());
                        }
                        wordNetOptimum.insert(activeLine, new Vertex(Predefine.TAG_QUANTIFIER, sbQuantifier.toString(), new CoreDictionary.Attribute(Nature.mq), WORD_ID), wordNetAll);
                    }
                    sbQuantifier.setLength(0);
                    appendTimes = 0;
                }
            }
            else
            {
                // A numeral (m) triggers recognition
                if (vertex.guessNature() == Nature.m)
                {
                    sbQuantifier.append(vertex.realWord);
                    ++appendTimes;
                    activeLine = line;
                }
            }
            // Advance to the word-net row of the next vertex. This update was missing
            // from the posted code, so activeLine never moved past 1.
            line += vertex.realWord.length();
        }
    }
}
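The merging logic above can be tried without HanLP. The following is a simplified sketch, not the code from this issue: tokens are plain {word, tag} pairs instead of `Vertex` objects, only the tags "m" and "q" are considered, and the merged token replaces its parts instead of being inserted into a word net. Unlike the posted loop, it also starts a new candidate when a run is ended by another numeral. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuantifierMergeSketch {
    /** Merges each numeral ("m") with the quantifier tokens ("q") that directly follow it. */
    static List<String> merge(String[][] tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder(); // text of the run being collected
        List<String> parts = new ArrayList<>();      // original tokens of that run
        for (String[] token : tokens) {
            String word = token[0];
            String tag = token[1];
            if (!parts.isEmpty() && tag.equals("q")) {
                // A numeral is pending: extend the run with this quantifier.
                pending.append(word);
                parts.add(word + "/" + tag);
                continue;
            }
            flush(out, pending, parts); // the run (if any) ends here
            if (tag.equals("m")) {
                // A numeral triggers a new run.
                pending.append(word);
                parts.add(word + "/m");
            } else {
                out.add(word + "/" + tag);
            }
        }
        flush(out, pending, parts); // a run may reach the end of the input
        return out;
    }

    /** Emits one mq token if the run has a numeral plus at least one quantifier, else its parts. */
    private static void flush(List<String> out, StringBuilder pending, List<String> parts) {
        if (parts.size() > 1) {
            out.add(pending + "/mq");
        } else {
            out.addAll(parts);
        }
        pending.setLength(0);
        parts.clear();
    }

    public static void main(String[] args) {
        String[][] tokens = {{"买", "v"}, {"十九", "m"}, {"元", "q"}, {"的", "u"}, {"书", "n"}};
        System.out.println(merge(tokens)); // prints [买/v, 十九元/mq, 的/u, 书/n]
    }
}
```

A numeral with no quantifier after it is left unchanged, which mirrors the `appendTimes > 1` check in the posted code.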