Highlighting problem in versions after 6.0 #169

Closed
wych42 opened this issue Jun 15, 2018 · 6 comments

wych42 commented Jun 15, 2018

Since 6.0 there is a new "ignore_pinyin_offset" parameter. At index time startOffset and endOffset are both 0, so every highlighted term in the search results comes back as <em></em>xxxx. If I set ignore_pinyin_offset: false instead, indexing fails with:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards

How can this be solved?
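A minimal sketch that makes the zeroed offsets visible, assuming an index named medcl whose pinyin_analyzer uses a pinyin tokenizer left at its default ignore_pinyin_offset: true:

# With ignore_pinyin_offset left at true, every token in the response
# carries start_offset 0 and end_offset 0, which is why the highlighter
# emits empty <em></em> tags.
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}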

medcl (Member) commented Jun 19, 2018

# Recreate the index with ignore_pinyin_offset disabled
DELETE medcl
PUT /medcl/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true,
                    "ignore_pinyin_offset" : false,
                    "keep_first_letter" : true,
                    "keep_separate_first_letter" : true
                }
            }
        }
    }
}
# Check the tokens the analyzer produces
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}

# Map the name field with the pinyin analyzer
POST /medcl/doc/_mapping
{
  "properties": {
    "name":{
      "analyzer": "pinyin_analyzer",
      "type": "text"
    }
  }
}

# Index a test document
POST medcl/doc/1
{
  "name":"刘德华"
}

# Highlighted searches with different pinyin query forms
GET medcl/_search
{
  "query": {"match": {
    "name": "ldh"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

GET medcl/_search
{
  "query": {"match": {
    "name": "lh"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

GET medcl/_search
{
  "query": {"match": {
    "name": "dehua"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}


GET medcl/_search
{
  "query": {"match": {
    "name": "刘dhua"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

medcl (Member) commented Jun 19, 2018

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.222208,
    "hits": [
      {
        "_index": "medcl",
        "_type": "doc",
        "_id": "1",
        "_score": 1.222208,
        "_source": {
          "name": "刘德华"
        },
        "highlight": {
          "name": [
            "<em>刘</em><em>德</em><em>华</em>"
          ]
        }
      }
    ]
  }
}

wych42 (Author) commented Jun 19, 2018

Sorry, I didn't describe it well. The problem occurs when a document mixes Chinese with English or digits. For example, with the mapping above:

GET /medcl/_analyze
{
  "text": ["刘德华Andy"],
  "analyzer": "pinyin_analyzer"
}

The ldhandy token in the result is what triggers the error quoted in this issue: it spans offsets 0-7 but is emitted at position 5, after the token y at offsets 6-7, so its startOffset steps backwards relative to the preceding token, which Lucene rejects.

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "de",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "an",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 3
    },
    {
      "token": "y",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 5
    },
    {
      "token": "ldhandy",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 5
    }
  ]
}

alex-fang commented

I'm running into this problem as well. Is there a solution?

abia321 (Contributor) commented Nov 13, 2018

@medcl The pinyin tokenizer seems to have a bug: with "ignore_pinyin_offset": false, indexing starts throwing the startOffset must be non-negative exception once a certain amount of data has been written. It looks like a problem in the code.

For now the issue can be worked around with an ngram tokenizer plus a pinyin token filter, as follows:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_ngram",
          "filter": [
            "pinyin_filter"
          ]
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_in_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "abia321": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin_analyzer",
          "search_analyzer": "standard",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
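
Assuming the body above is applied with PUT /abia321 (the index name is just an example), a quick check that highlighting works; the sample document and query string below are hypothetical:

# Index a mixed Chinese/English document
POST /abia321/abia321/1
{
  "name": "刘德华Andy"
}

# The standard search_analyzer leaves the query as one term; the ngram
# 刘德华 was indexed as the joined full pinyin "liudehua"
GET /abia321/_search
{
  "query": { "match": { "name": "liudehua" } },
  "highlight": { "fields": { "name": {} } }
}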

wych42 closed this as completed Apr 21, 2020
wansho commented Aug 9, 2021

Configuration for version 7.6.2:

{
    "analyzer": {
      "pinyin_analyzer": {
        "tokenizer": "my_ngram",
        "filter": [
          "pinyin_filter"
        ]
      }
    },
    "tokenizer": {
      "my_ngram": {
        "type": "ngram",
        "min_gram": 3,
        "max_gram": 3,
        "token_chars": [
          "letter",
          "digit",
          "punctuation",
          "symbol"
        ]
      }
    },
    "filter": {
      "pinyin_filter": {
        "type": "pinyin",
        "keep_full_pinyin": false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese_in_joined_full_pinyin": true,
        "none_chinese_pinyin_tokenize": false,
        "remove_duplicated_term": true
      }
    }
  }
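
A sketch of where this block sits in a complete 7.x index creation request; the index name my_index is hypothetical, and the mapping is modeled on abia321's above (7.x mappings no longer take a type name):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_ngram",
          "filter": ["pinyin_filter"]
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_in_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "standard",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}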
