Tokenizer's "longer lexeme first" ordering causes misaligned highlighting in Lucene #23

Closed
GoogleCodeExporter opened this issue Apr 7, 2016 · 2 comments

Comments

@GoogleCodeExporter

Symptom: if a Document contains only one Field named "content" at index time, i.e.
********
doc.add(new Field("content", "大家好", Store.YES, Index.ANALYZED,
        TermVector.WITH_POSITIONS_OFFSETS));
********
then highlighting works correctly.
However, if a Document contains two or more Fields named "content", i.e.
********
doc.add(new Field("content", "大家好", Store.YES, Index.ANALYZED,
        TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("content", "这里是随便的一句话", Store.YES,
        Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
********
then searching for "随便" produces the highlight "大家好;这里<b>是随</b>便的一句话", whereas the expected result is "大家好;这里是<b>随便</b>的一句话".
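Note 1: when a Document contains several Fields with the same name, highlighting requires concatenating all of the Field values, separated by a single character. The examples above use a half-width (ASCII) semicolon as the separator.
Note 2: the problem occurs only with the IK analyzer. It does not occur with others such as StandardAnalyzer.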

Cause: IK tokenizes "大家好" as
********
[0-3] 大家好
[0-2] 大家
********
which shifts the offsets and positions computed at index time by one character. If IK emitted the tokens as
********
[0-2] 大家
[0-3] 大家好
********
the problem would not occur.
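
One way to see why the emission order matters is a small stand-alone simulation. It is only a sketch under an assumption that the report does not state: that the indexer takes the end offset of the last token emitted for a field value, adds one for the separator, and uses the result as the base offset for the next value of the same field. `OffsetDriftDemo`, `Token`, `nextBase`, `highlight`, and `run` are hypothetical names, and no Lucene or IK classes are involved.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetDriftDemo {
    /** A token whose offsets are relative to the start of its own field value. */
    public static final class Token {
        final int start;
        final int end;
        final String text;

        public Token(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Assumed indexer behaviour (hypothetical model): the base offset of the
     * next field value is the end offset of the LAST token emitted for the
     * current value, plus one character for the separator.
     */
    public static int nextBase(int base, List<Token> tokens) {
        Token last = tokens.get(tokens.size() - 1);
        return base + last.end + 1;
    }

    /** Wraps the span [start, end) of text in <b>...</b>. */
    public static String highlight(String text, int start, int end) {
        return text.substring(0, start) + "<b>" + text.substring(start, end)
                + "</b>" + text.substring(end);
    }

    /** Highlights "随便" in the joined text, given the tokens of the first value. */
    public static String run(List<Token> firstValueTokens) {
        String joined = "大家好" + ";" + "这里是随便的一句话";
        int base = nextBase(0, firstValueTokens);
        // "随便" occupies [3, 5) inside the second field value.
        return highlight(joined, base + 3, base + 5);
    }

    public static void main(String[] args) {
        List<Token> longestFirst = Arrays.asList(
                new Token(0, 3, "大家好"), new Token(0, 2, "大家"));
        List<Token> shortestFirst = Arrays.asList(
                new Token(0, 2, "大家"), new Token(0, 3, "大家好"));
        System.out.println(run(longestFirst));  // 大家好;这里<b>是随</b>便的一句话
        System.out.println(run(shortestFirst)); // 大家好;这里是<b>随便</b>的一句话
    }
}
```

Under that assumption, the longest-first order reproduces the misplaced highlight from the report, and the shortest-first order yields the expected one.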

Fix: change the following code in the Lexeme.compareTo method
********
if (this.begin == other.getBegin()) {
  // longer lexeme first
  if (this.length > other.getLength()) {
    return -1;
  } else if (this.length == other.getLength()) {
    return 0;
  } else {
    return 1;
  }
}
********
Reverse the "longer lexeme first" ordering, i.e. swap the return values -1 and 1.
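
The effect of the swap can be sketched with a stand-in class. `LexemeOrderDemo` and its nested `Lexeme` are hypothetical simplifications rather than IK's actual class, and ordering by `begin` first is an assumption, since the report shows only the equal-`begin` branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LexemeOrderDemo {
    /** Minimal stand-in for a lexeme: start offset, length, and text. */
    public static final class Lexeme implements Comparable<Lexeme> {
        final int begin;
        final int length;
        final String text;

        public Lexeme(int begin, int length, String text) {
            this.begin = begin;
            this.length = length;
            this.text = text;
        }

        @Override
        public int compareTo(Lexeme other) {
            // Assumed: lexemes with an earlier start sort first.
            if (this.begin != other.begin) {
                return Integer.compare(this.begin, other.begin);
            }
            // Swapped relative to the original snippet: shorter lexeme first.
            return Integer.compare(this.length, other.length);
        }
    }

    /** Sorts a copy of the lexemes and formats each as "[begin-end] text". */
    public static List<String> sortAndFormat(List<Lexeme> lexemes) {
        List<Lexeme> copy = new ArrayList<>(lexemes);
        Collections.sort(copy);
        List<String> out = new ArrayList<>();
        for (Lexeme l : copy) {
            out.add("[" + l.begin + "-" + (l.begin + l.length) + "] " + l.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Lexeme> lexemes = Arrays.asList(
                new Lexeme(0, 3, "大家好"), new Lexeme(0, 2, "大家"));
        for (String line : sortAndFormat(lexemes)) {
            System.out.println(line);
        }
    }
}
```

With the return values swapped, the two lexemes from the report sort as `[0-2] 大家` followed by `[0-3] 大家好`, which is the order the reporter says avoids the problem.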

Original issue reported on code.google.com by liyg1...@gmail.com on 19 Jan 2011 at 5:12

@GoogleCodeExporter
Author

Why has there been no response?.....

Original comment by liyg1...@gmail.com on 27 Mar 2011 at 8:19

@GoogleCodeExporter
Author

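Please download the IK2012_FF release; its smart mode does not cause this problem.

Strictly speaking, the highlighting component's support for tokens with overlapping positions is incomplete.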

Original comment by linliang...@gmail.com on 23 Oct 2012 at 9:36

  • Changed state: Done
