Highlighting problem in versions after 6.0 #169

Closed
wych42 opened this issue Jun 15, 2018 · 6 comments

wych42 commented Jun 15, 2018

Since 6.0 there is a new "ignore_pinyin_offset" parameter. At index time startOffset and endOffset are both 0, so every highlighted term in the search results comes back as <em></em>xxxx. If I set ignore_pinyin_offset: false instead, indexing fails with:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards

How can this be solved?
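A minimal sketch that makes the zeroed offsets visible, assuming an index named medcl whose pinyin_analyzer uses a pinyin tokenizer left at its default ignore_pinyin_offset: true:

# With ignore_pinyin_offset left at true, every token in the response
# carries start_offset 0 and end_offset 0, which is why the highlighter
# emits empty <em></em> tags.
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}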

medcl (Member) commented Jun 19, 2018

# Recreate the index with ignore_pinyin_offset disabled
DELETE medcl
PUT /medcl/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true,
                    "ignore_pinyin_offset" : false,
                    "keep_first_letter" : true,
                    "keep_separate_first_letter" : true
                }
            }
        }
    }
}
# Check the tokens the analyzer produces
GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}

# Map the name field with the pinyin analyzer
POST /medcl/doc/_mapping
{
  "properties": {
    "name":{
      "analyzer": "pinyin_analyzer",
      "type": "text"
    }
  }
}

# Index a test document
POST medcl/doc/1
{
  "name":"刘德华"
}

# Highlighted searches with different pinyin query forms
GET medcl/_search
{
  "query": {"match": {
    "name": "ldh"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

GET medcl/_search
{
  "query": {"match": {
    "name": "lh"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

GET medcl/_search
{
  "query": {"match": {
    "name": "dehua"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}


GET medcl/_search
{
  "query": {"match": {
    "name": "刘dhua"
  }},
  "highlight": {
    "fields": {
      "name":{
      }
    }
  }
}

medcl (Member) commented Jun 19, 2018

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.222208,
    "hits": [
      {
        "_index": "medcl",
        "_type": "doc",
        "_id": "1",
        "_score": 1.222208,
        "_source": {
          "name": "刘德华"
        },
        "highlight": {
          "name": [
            "<em>刘</em><em>德</em><em>华</em>"
          ]
        }
      }
    ]
  }
}

wych42 (Author) commented Jun 19, 2018

Sorry, I didn't describe it well. The problem occurs when a document mixes Chinese with English or digits. For example, with the mapping above:

GET /medcl/_analyze
{
  "text": ["刘德华Andy"],
  "analyzer": "pinyin_analyzer"
}

The ldhandy token in the result is what triggers the error quoted in this issue: it spans offsets 0-7 but is emitted at position 5, after the token y at offsets 6-7, so its startOffset steps backwards relative to the preceding token, which Lucene rejects.

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "de",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "an",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 3
    },
    {
      "token": "y",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 5
    },
    {
      "token": "ldhandy",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 5
    }
  ]
}

alex-fang commented

I'm running into this problem as well. Is there a solution?

abia321 (Contributor) commented Nov 13, 2018

@medcl The pinyin tokenizer seems to have a bug: with "ignore_pinyin_offset": false, indexing starts throwing the startOffset must be non-negative exception once a certain amount of data has been written. It looks like a problem in the code.

For now the issue can be worked around with an ngram tokenizer plus a pinyin token filter, as follows:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_ngram",
          "filter": [
            "pinyin_filter"
          ]
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 50,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_in_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "abia321": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin_analyzer",
          "search_analyzer": "standard",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
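
Assuming the body above is applied with PUT /abia321 (the index name is just an example), a quick check that highlighting works; the sample document and query string below are hypothetical:

# Index a mixed Chinese/English document
POST /abia321/abia321/1
{
  "name": "刘德华Andy"
}

# The standard search_analyzer leaves the query as one term; the ngram
# 刘德华 was indexed as the joined full pinyin "liudehua"
GET /abia321/_search
{
  "query": { "match": { "name": "liudehua" } },
  "highlight": { "fields": { "name": {} } }
}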

wych42 closed this as completed Apr 21, 2020
wansho commented Aug 9, 2021

Configuration for version 7.6.2:

{
    "analyzer": {
      "pinyin_analyzer": {
        "tokenizer": "my_ngram",
        "filter": [
          "pinyin_filter"
        ]
      }
    },
    "tokenizer": {
      "my_ngram": {
        "type": "ngram",
        "min_gram": 3,
        "max_gram": 3,
        "token_chars": [
          "letter",
          "digit",
          "punctuation",
          "symbol"
        ]
      }
    },
    "filter": {
      "pinyin_filter": {
        "type": "pinyin",
        "keep_full_pinyin": false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese_in_joined_full_pinyin": true,
        "none_chinese_pinyin_tokenize": false,
        "remove_duplicated_term": true
      }
    }
  }
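
A sketch of where this block sits in a complete 7.x index creation request; the index name my_index is hypothetical, and the mapping is modeled on abia321's above (7.x mappings no longer take a type name):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_ngram",
          "filter": ["pinyin_filter"]
        }
      },
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese_in_joined_full_pinyin": true,
          "none_chinese_pinyin_tokenize": false,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "standard",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}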
