match_phrase problem similar to #195 #204

Closed
smilesfc opened this issue Jun 3, 2016 · 21 comments

Comments

@smilesfc

smilesfc commented Jun 3, 2016

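medcl, I suspect there may be a problem with the positions emitted by the ik analyzer.

First, the problem: the original text contains “前次募集资金” and is indexed with ik_max_word. A match_phrase search for “前次募集资金” works, but a match_phrase search for “前次募集” finds nothing at all.

Testing the ik_max_word analyzer:
_analyze?analyzer=ik_max_word&text=前次募集资金
Response: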
{"tokens":[{"token":"前次","start_offset":0,"end_offset":2,"type":"CN_WORD","position":0},{"token":"募集","start_offset":2,"end_offset":4,"type":"CN_WORD","position":1},{"token":"募","start_offset":2,"end_offset":3,"type":"CN_WORD","position":2},{"token":"集","start_offset":3,"end_offset":4,"type":"CN_CHAR","position":3},{"token":"基金","start_offset":4,"end_offset":6,"type":"CN_WORD","position":4}]}

Relevant mapping:
[ElasticProperty(IncludeInAll = false, IndexAnalyzer = "ik_max_word", SearchAnalyzer = "ik_max_word")] public string Title { get; set; }

First search, using “前次募集资金”:

{
  "_source": "false",
  "highlight": {
    "fields": {
      "title": {}
    }
  },
  "query": {
    "match_phrase": {
      "title": {
        "query": "前次募集资金"
      }
    }
  }
}

This returns results:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":21,"max_score":3.6136353,"hits":[{"_index":"disclosure.main.alpha","_type":"esdisclosurecomp","_id":"73419","_score":3.6136353,"highlight":{"title":["国金证券:<em>前次募集资金</em>使用情况报告"]}}]}}

Second search, with “资金” removed:

{
  "_source": "false",
  "highlight": {
    "fields": {
      "title": {}
    }
  },
  "size": 1,
  "query": {
    "match_phrase": {
      "title": {
        "query": "前次募集"
      }
    }
  }
}

This time nothing matches:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

That puzzled me, so I enabled term vectors, looked up the 国金证券 document, and found:

"前次" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 5,
            "start_offset" : 5,
            "end_offset" : 7
          } ]
        },
"募集" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 7,
            "start_offset" : 7,
            "end_offset" : 9
          } ]
        },
"募集资金" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 6,
            "start_offset" : 7,
            "end_offset" : 11
          } ]
        },

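I think this is where the problem lies. Under ik_max_word, 募集资金 is a complete word in its own right and can also be split into 募集 and 资金, so the position of 募集 ends up one token further away from 前次 than it should be. As a result, match_phrase does not consider 前次募集 to form a phrase. Could you check whether that is what is happening, and is there a way to solve it? Thanks!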
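
The hypothesis above can be checked by hand. The following is a minimal sketch in Python, not Lucene's actual implementation. It uses the positions from the term vector output above and assumes (this is not shown in the thread) that the query “前次募集” is analyzed into 前次 at position 0 and 募集 at position 1. In this model an exact phrase (slop 0) requires every query term to occur in the document at the same relative offset as in the query. `exact_phrase_match` is a hypothetical helper name.

```python
# Positions copied from the term vector output quoted above.
doc_positions = {
    "前次": [5],
    "募集资金": [6],
    "募集": [7],
}


def exact_phrase_match(query_tokens, doc_positions):
    """Return True if all query tokens occur at the same relative offsets.

    query_tokens is a list of (term, query_position) pairs.
    """
    first_term, first_qpos = query_tokens[0]
    for start in doc_positions.get(first_term, []):
        base = start - first_qpos
        if all(base + qpos in doc_positions.get(term, [])
               for term, qpos in query_tokens):
            return True
    return False


# Assumed query analysis: hypothetical, not taken from the thread.
assumed_query = [("前次", 0), ("募集", 1)]
result_short = exact_phrase_match(assumed_query, doc_positions)

# 前次 and 募集资金 sit at adjacent positions (5 and 6) in the index.
result_long = exact_phrase_match([("前次", 0), ("募集资金", 1)], doc_positions)

print(result_short, result_long)
```

Under these assumptions the short query fails because 募集资金 occupies the position between 前次 and 募集, while a query whose second token is 募集资金 lines up with the index.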

@medcl
Member

medcl commented Jun 3, 2016

@smilesfc Please post a complete, reproducible RESTful script. I cannot reproduce the problem you describe in my tests; it is not related to position.
`POST index/type3/_mapping
{
"properties": {
"myname":{
"type": "string"
, "analyzer": "ik_max_word"
}
}
}

PUT index/type3/1
{
"myname":"国金证券:前次募集资金使用情况报告"
}

POST index/type3/_search
{
"_source": "false",
"highlight": {
"fields": {
"title": {}
}
},
"query": {
"match_phrase": {
"myname": {
"query": "募集资金"
}
}
}
}`

{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "index", "_type": "type3", "_id": "1", "_score": 1 } ] } }

@smilesfc
Author

smilesfc commented Jun 3, 2016

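The script you posted is enough to reproduce it, but "募集资金" has been added to the custom dictionary myDict. Without that word there is no problem; it only appears after the word is added.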

@medcl
Member

medcl commented Jun 13, 2016

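@smilesfc I added it to the dictionary and still do not see the problem. Please post the complete reproduction steps, in the same format as mine above, using Sense.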

@wuyadong

@medcl
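I have run into a similar problem.
ES versions: 2.3.1, 2.3.3
Using ik_max_word
I tested several sets of data; here is a summary:

  1. It is unrelated to filters
  2. It only occurs with phrase queries
  3. It only affects the longest word in the dictionary. For example, "北京宝软科技有限公司" and "宝软科技有限公司" are both in the dictionary; searching for "宝软科技有限公司" works, but searching for "北京宝软科技有限公司" finds nothing.
  4. After removing "北京宝软科技有限公司" from the dictionary and restarting, a search for "北京宝软科技有限公司" still finds nothing

Here is the analyzer output for "北京宝软科技有限公司"; looking it over by eye, I see nothing wrong with it: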

{
  "tokens": [
    {
      "token": "北京宝软科技有限公司",
      "start_offset": 0,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "北京",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "宝软科技有限公司",
      "start_offset": 2,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "宝软科技",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "宝软",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "科技",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "有限公司",
      "start_offset": 6,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "有限",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "公司",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}

@medcl
Member

medcl commented Jun 21, 2016

After the dictionary is modified, the index needs to be rebuilt.

@wuyadong

  1. The analyzer output above is from before the dictionary was modified
  2. So the problem remains that the longest token cannot be found?

@medcl
Member

medcl commented Jun 21, 2016

@wuyadong Please give me a reproduction script.

@wuyadong

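@medcl See below.
I reindexed and tested both cases, with and without "北京宝软科技有限公司" in the dictionary: without it the document is found; with it the document is not found.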

put index_test

{
    "mappings": {
          "test": {
              "properties": {
                  "name": {
                      "type": "string",
                      "analyzer": "ik_max_word",
                      "search_analyzer": "ik_max_word",
                      "include_in_all": "true"
                  }
              }
          }
    }

}


put index_test/test/1

{
    "name" : "北京宝软科技有限公司"
}

put index_test/test/2

{
    "name" : "宝软科技有限公司"
}


put index_test/test/3

{
    "name" : "宝软科技"
}


put index_test/test/4

{
    "name" : "网易科技"
}

post index_test/test/_search
{
    "query":{
      "bool" : {
        "must" : {
              "match" : {
                "name" : {
                  "query" : "北京宝软科技有限公司",
                  "type" : "phrase"
                }
              }
            }
        }
    }
}

@smilesfc
Author

@wuyadong
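My guess is that at search time the search analyzer does not split the text as 北京宝软科技有限公司 but in some other way, for example 北京/宝软/科技/有限公司. Because 宝软 is at position 4 and 北京 is at 0, match_phrase considers the two non-adjacent, so nothing is found. Relaxing slop makes the search match.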

@medcl
Member

medcl commented Jun 22, 2016

@wuyadong Strange. I simply cannot reproduce it on my side with your script; every search returns results. I am also on 2.3.

@wuyadong

@medcl Could some other configuration be causing it? I went through my configuration:

mapping

"mappings": {
"test": {
"properties": {
"name": {
"include_in_all": true,
"analyzer": "ik_max_word",
"type": "string"
}
}
}
},

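ik dictionary extension configuration: I kept only the stopword extension and my own dictionary. The dictionary contains 北京宝软科技有限公司, 宝软科技有限公司, 有限公司, and so on; there are quite a lot of entries.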

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
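    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">custom/myself.dic</entry>
     <!-- users can configure their own extension stopword dictionary here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- users can configure a remote extension stopword dictionary here -->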
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

ES configuration: only the IK analyzer is configured; the rest is local IP and similar settings, which should have no effect:

index.analysis.analyzer.default.type : ik

@wuyadong

@medcl
Result of searching for 宝软科技有限公司:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.472951,
    "hits": [
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "2",
        "_score": 2.472951,
        "_source": {
          "name": "宝软科技有限公司"
        }
      },
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "1",
        "_score": 0.67124057,
        "_source": {
          "name": "北京宝软科技有限公司"
        }
      }
    ]
  }
}

Result for 北京宝软科技有限公司:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

@wuyadong

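@medcl Were you able to reproduce it? Note the dictionary configuration.
@smilesfc I do not know the internals, but your guess may be right. Still, 北京宝软科技有限公司 is in the dictionary, so it should be emitted as a single token. Or should I set the search analyzer to ik_smart?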

@wuyadong

@smilesfc @medcl
I changed the mapping to specify "search_analyzer": "ik_smart", and the search now returns the document. With ik_max_word it does not.

{
    "mappings": {
          "test": {
              "properties": {
                  "name": {
                      "type": "string",
                      "analyzer": "ik_max_word",
                      "search_analyzer": "ik_smart",
                      "include_in_all": "true"
                  }
              }
          }
    }                 
}

Search

{"query":
{
  "bool" : {
    "must" : {
          "match" : {
            "name" : {
              "query" : "北京宝软科技有限公司",
              "type" : "phrase"
            }
          }
        }
  }
}
}

Result

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.095891505,
    "hits": [
      {
        "_index": "index_test",
        "_type": "test",
        "_id": "1",
        "_score": 0.095891505,
        "_source": {
          "name": "北京宝软科技有限公司"
        }
      }
    ]
  }
}

@wuyadong

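@medcl I just tested again with the script, using ik_max_word, and it works again. I will rebuild the index to see whether the production data also goes back to normal. Rebuilding is painful.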

@wuyadong

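@medcl @smilesfc It really is fixed. Thinking back over what changed recently, the only thing is that some electrical repair work was done and the server was restarted once. I am leaving a note here in case it helps others who run into this later.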

@smilesfc
Author

I still think there is a landmine hidden in here; I will study the tokenizer source code carefully.

@medcl
Member

medcl commented Jan 5, 2017

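Phrase queries use position. They suit cases where the tokens produced by the analyzer do not overlap in position; if tokens overlap, the slop calculation may run into problems.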
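
One way to see whether a field is exposed to this is to look for tokens whose character ranges overlap in the `_analyze` output. The following is a minimal sketch in Python, assuming the response has already been parsed into a list of dicts with the `token`, `start_offset` and `end_offset` keys shown in the outputs earlier in this thread. `find_overlaps` is a hypothetical helper, not part of Elasticsearch or IK.

```python
def find_overlaps(tokens):
    """Return (token_a, token_b) pairs whose character ranges overlap."""
    overlaps = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            # Two half-open ranges overlap when each starts before the other ends.
            if (a["start_offset"] < b["end_offset"]
                    and b["start_offset"] < a["end_offset"]):
                overlaps.append((a["token"], b["token"]))
    return overlaps


# First three tokens of the "北京宝软科技有限公司" analyzer output quoted earlier.
tokens = [
    {"token": "北京宝软科技有限公司", "start_offset": 0, "end_offset": 10},
    {"token": "北京", "start_offset": 0, "end_offset": 2},
    {"token": "宝软科技有限公司", "start_offset": 2, "end_offset": 10},
]

overlaps = find_overlaps(tokens)
print(overlaps)
```

Here the full company name overlaps both of its parts, while 北京 and 宝软科技有限公司 do not overlap each other. Each of these tokens still receives its own increasing position, which is the situation described above as unsuitable for phrase queries.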

@medcl medcl closed this as completed Jan 5, 2017
@pengqiuyuan

pengqiuyuan commented Oct 19, 2017

Same problem here. @medcl @smilesfc could you help me figure out why?

curl -XPUT http://127.0.0.1:9200/ikindex2


curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/_mapping -d'
{
  "fulltext2": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}'

curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/1 -d'
{
  "content": "国家主席习近平和夫人彭丽媛为金砖国家和对话会受邀国领导人夫妇举行欢迎宴会"
}'


curl -XPOST http://127.0.0.1:9200/ikindex2/fulltext2/_search?pretty  -d'
{
     "query" : { "match_phrase" : { "content" : {"query":"金砖国家","slop":0,"analyzer": "ik_max_word" }} },
      "highlight" : {
          "pre_tags" : ["<tag1>", "<tag2>"],
         "post_tags" : ["</tag1>", "</tag2>"],
          "fields" : {
              "content" : {}
          }
      }
}'


curl 'http://127.0.0.1:9200/ikindex2/_analyze?analyzer=ik_max_word&pretty=true' -d '
{
  "text":"国家主席习近平和夫人彭丽媛为金砖国家和对话会受邀国领导人夫妇举行欢迎宴会"
}'

{
  "tokens" : [
    {
      "token" : "国家主席",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "国家",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "家",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "主席",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "习近平",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "平和",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "夫人",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "彭丽媛",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "彭",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "丽",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "媛",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "为",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "CN_CHAR",
      "position" : 11
    },
    {
      "token" : "金砖",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "国家",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "家和",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "家",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "和",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "CN_CHAR",
      "position" : 16
    },
    {
      "token" : "对话会",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "对话",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "会受",
      "start_offset" : 21,
      "end_offset" : 23,
      "type" : "CN_WORD",
      "position" : 19
    },
    {
      "token" : "受邀",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 20
    },
    {
      "token" : "邀",
      "start_offset" : 23,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 21
    },
    {
      "token" : "国",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "CN_CHAR",
      "position" : 22
    },
    {
      "token" : "领导人",
      "start_offset" : 25,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 23
    },
    {
      "token" : "领导",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "CN_WORD",
      "position" : 24
    },
    {
      "token" : "人夫",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "CN_WORD",
      "position" : 25
    },
    {
      "token" : "夫妇",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 26
    },
    {
      "token" : "妇",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 27
    },
    {
      "token" : "举行",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "CN_WORD",
      "position" : 28
    },
    {
      "token" : "欢迎宴会",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 29
    },
    {
      "token" : "欢迎",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "CN_WORD",
      "position" : 30
    },
    {
      "token" : "宴会",
      "start_offset" : 34,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 31
    },
    {
      "token" : "宴",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 32
    },
    {
      "token" : "会",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "CN_CHAR",
      "position" : 33
    }
  ]
}

@pengqiuyuan

Searching for 金砖国家 returns no results. Setting slop to 1 works. But 金砖 and 国家 are clearly adjacent. @medcl
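
This behavior is consistent with the overlap explanation given earlier in the thread. Below is a rough sketch in Python that uses a simplified model of sloppy phrase matching: subtract each term's query position from its document position, and require the spread of those values to be at most slop. It is an approximation, not Lucene's code. The document positions come from the `_analyze` output above. The query analysis is an assumption: because that output splits 国家 into 国家 and 家, the sketch assumes “金砖国家” is analyzed into 金砖 (0), 国家 (1), 家 (2).

```python
from itertools import product

# Positions of the relevant terms, from the _analyze output above.
doc_positions = {
    "金砖": [12],
    "国家": [1, 13],
    "家": [2, 15],
}

# Assumed query analysis (hypothetical; not shown in the thread).
query_tokens = [("金砖", 0), ("国家", 1), ("家", 2)]


def min_slop(query_tokens, doc_positions):
    """Smallest slop under which the simplified model reports a match."""
    candidates = [
        [pos - qpos for pos in doc_positions.get(term, [])]
        for term, qpos in query_tokens
    ]
    if any(not c for c in candidates):
        return None  # a query term is missing from the document
    return min(max(combo) - min(combo) for combo in product(*candidates))


needed = min_slop(query_tokens, doc_positions)
print(needed)
```

Under these assumptions the sketch yields 1, which agrees with the observation that slop 0 finds nothing and slop 1 does. In the document, 家和 at position 14 pushes 家 to position 15, whereas in the assumed query 家 directly follows 国家.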

@pengqiuyuan

ES and IK are both 5.4.0.
