[FEATURE] Hybrid request does not return inner_hits for nested objects. #718

Open
Kovsonq opened this issue Apr 30, 2024 · 11 comments

@Kovsonq

Kovsonq commented Apr 30, 2024

Is your feature request related to a problem?

Yes, I'm experiencing a problem when I use the hybrid search plugin in OpenSearch v2.11.0. Specifically, when I include the "inner_hits" parameter in my query for nested objects, I do not receive any inner hits in the response. This is causing frustration as my system requires this level of detail for optimal operation.

What solution would you like?

I would like the hybrid search plugin to be updated to include the functionality to correctly return inner hits from nested queries. Ideally, this would function seamlessly as it does in standard OpenSearch queries. This improvement would allow me and other users to fully utilize the power of the hybrid search plugin.

@martin-gaievski
Member

Can you please share more details for us to understand your request better: index mapping, query example, expected response?

@Kovsonq
Author

Kovsonq commented May 1, 2024

I removed the vector values. Do you need those as well?

Index mapping:

{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

Document example:

{
    "chunks": [
        {
            "id": 1,
            "length": 173,
            "payload": "Text 1 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 2,
            "length": 173,
            "payload": "Text 2 example",
            "tokens": 256,
            "embedding": [...]
        },
        {
            "id": 3,
            "length": 173,
            "payload": "Text 3 example",
            "tokens": 256,
            "embedding": [...]
        }
    ]
}

Request:

{
    "_source": false,
    "query": {
        "hybrid": {
            "queries": [
                {
                    "nested": {
                        "path": "chunks",
                        "query": {
                            "knn": {
                                "chunks.embedding": {
                                    "vector": [...],
                                    "k": 10
                                }
                            }
                        },
                        "inner_hits": {
                            "size": 10,
                            "_source": {
                                "includes": [
                                    "chunks.payload",
                                    "chunks.id"
                                ]
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "nested": {
                                    "path": "chunks",
                                    "query": {
                                        "simple_query_string": {
                                            "query": "*",
                                            "fields": [
                                                "chunks.payload"
                                            ],
                                            "default_operator": "and"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Response:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1.0
            }
        ]
    }
}

Expected response:

{
    "took": 17,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1,
        "hits": [
            {
                "_index": "index_name",
                "_id": "doc_id_1",
                "_score": 1,
                "inner_hits": {
                    "chunks": {
                        "hits": {
                            "total": {
                                "value": 3,
                                "relation": "eq"
                            },
                            "max_score": 0.7954481,
                            "hits": [
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 0
                                    },
                                    "_score": 0.7954481,
                                    "_source": {
                                        "payload": "Text 1 example",
                                        "id": 1
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 1
                                    },
                                    "_score": 0.7949572,
                                    "_source": {
                                        "payload": "Text 2 example",
                                        "id": 2
                                    }
                                },
                                {
                                    "_index": "index_name",
                                    "_id": "doc_id_1",
                                    "_nested": {
                                        "field": "chunks",
                                        "offset": 2
                                    },
                                    "_score": 0.75225127,
                                    "_source": {
                                        "payload": "Text 3 example",
                                        "id": 3
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

@dswitzer

dswitzer commented May 6, 2024

This issue is also biting me.

We have a nested property that stores attachments on a document. Today we use inner_hits to show when the query matched one of the attachments. However, in trying to implement a hybrid search that combines a simple_query_string with a neural_sparse search, we lose the inner_hits, which means we cannot tell when a match came from our nested query.

@navneet1v
Collaborator

@dswitzer can you try a hybrid search with two text queries and check whether inner hits are returned? I ask because improvements related to nested fields with vectors are being made for vector search in versions 2.12 and 2.13.
Ref: opensearch-project/k-NN#1447
Ref: opensearch-project/k-NN#1065

@heemin32
Collaborator

@navneet1v The issue persists even when the hybrid query contains only queries on non-vector fields.
The problem with hybrid search and inner_hits is that the inner hits result is not generated at all.

@navneet1v
Collaborator

@heemin32 thanks for confirming. Can you please share an example in this issue showing what you tested and how?

@heemin32
Collaborator

heemin32 commented May 20, 2024

Create Index

PUT /my-hybrid
{
  "mappings": {
    "properties": {
      "chunks": {
        "type": "nested",
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 3,
            "method": {
              "name": "hnsw",
              "space_type": "cosinesimil",
              "engine": "nmslib",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          },
          "payload": {
            "index": "true",
            "norms": "false",
            "store": "true",
            "type": "text"
          },
          "length": {
            "type": "integer"
          },
          "id": {
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "knn": true
    }
  }
}

Add doc

PUT /_bulk?refresh=true
{ "index": { "_index": "my-hybrid", "_id": "1" } }
{ "chunks": [{"id": 1, "length": 173, "payload": "Text 1 example", "tokens": 256, "embedding": [1, 1, 1]}, {"id": 2, "length": 173, "payload": "Text 2 example", "tokens": 256, "embedding": [2, 2, 2]},{"id": 3,"length": 173,"payload": "Text 3 example","tokens": 256,"embedding": [3, 3, 3]}]}

Query

GET /my-hybrid/_search
{
  "_source": false,
  "query": {
    "hybrid": {
      "queries": [
        {
          "nested": {
            "path": "chunks",
            "query": {
              "simple_query_string": {
                "query": "*",
                "fields": [
                  "chunks.payload"
                ],
                "default_operator": "and"
              }
            },
            "inner_hits": {
              "size": 10,
              "_source": {
                "includes": [
                  "chunks.payload",
                  "chunks.id"
                ]
              }
            }
          }
        }
      ]
    }
  }
}

Response

Expected: the inner_hits field is included in the result. Actual: no inner_hits appear in the result.

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -4422440400
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": 1
      },
      {
        "_index": "my-hybrid",
        "_id": "1",
        "_score": -9549512000
      }
    ]
  }
}

@martin-gaievski
Member

@Kovsonq @dswitzer what is the main use case for those inner hits returned in the result? How critical is the score information for that use case?

I spent some time checking what can be done for inner hits and our limitations. We can include an inner hits section in the response, similar to what's done for other queries in OpenSearch. The only limitation I'm seeing is with the scores. Inner hits have their own logic for retrieving scores; at a high level, they run a light version of the search again during the Fetch phase. At this point, the score normalization process for the hybrid query has been completed, and scores are updated in the query result section of the response. Scores added for inner hits will not be normalized but will be in raw form and scale. This means that, depending on the query, scores can be unbounded and will not correlate with the main hits in the query results (as those are normalized).
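To make the scale mismatch concrete, here is a minimal Python sketch. It assumes min-max normalization as the normalization technique and uses made-up score values; the function name and all numbers are illustrative only and are not taken from the plugin's code.

```python
def min_max_normalize(scores):
    """Rescale raw scores into [0, 1]; if all scores are equal, map them to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw scores of three parent hits from one sub-query.
raw_parent_scores = [12.4, 7.1, 3.3]

# After normalization the main hits live on a bounded scale.
normalized_parent_scores = min_max_normalize(raw_parent_scores)

# Hypothetical inner-hit scores computed later, during the fetch phase.
# They never pass through normalization, so they stay on the raw scale.
raw_inner_hit_scores = [9.8, 4.2]

print("main hits (normalized):", normalized_parent_scores)
print("inner hits (raw):      ", raw_inner_hit_scores)
```

In this sketch an inner hit can carry a score far above 1.0 while its parent hit is capped at 1.0, so the two sets of numbers cannot be compared directly.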

@dswitzer

@martin-gaievski,

My primary use case is to just be able to highlight the matching terms. The score of the inner hits does not matter much to me, because I'm just using it to highlight keyword matches.

@Kovsonq
Author

Kovsonq commented May 31, 2024

@martin-gaievski,

The primary use case for inner_hits in OpenSearch is to retrieve detailed matching information from nested objects within documents. This is particularly useful in scenarios where documents have complex structures with nested fields, and there is a need to understand which specific parts of these documents match the query criteria.

In the context of nested objects, score information for inner hits is important because it allows users to identify the most relevant chunks or sub-documents within a larger document. When a hybrid search is performed, having access to the scores of inner hits enables users to rank and prioritize these nested sections effectively.

Scenario: we need to return the top 20 most relevant nested documents (not parent documents) for the query.
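As a rough client-side illustration of that scenario, the Python sketch below collects inner hits from a search response shaped like the expected response earlier in this thread and keeps the N best by raw score. The helper name `top_inner_hits` and the sample data are hypothetical, and the sketch only works for responses that actually contain an inner_hits section.

```python
from typing import Any, Dict, List


def top_inner_hits(response: Dict[str, Any], name: str, n: int) -> List[Dict[str, Any]]:
    """Gather the inner hits called `name` from every parent hit, best raw score first."""
    collected = []
    for parent in response.get("hits", {}).get("hits", []):
        inner = parent.get("inner_hits", {}).get(name, {})
        for hit in inner.get("hits", {}).get("hits", []):
            collected.append({
                "parent_id": parent["_id"],
                "offset": hit["_nested"]["offset"],
                "score": hit["_score"],
                "source": hit.get("_source", {}),
            })
    # Raw inner-hit scores are only comparable within the same sub-query.
    collected.sort(key=lambda h: h["score"], reverse=True)
    return collected[:n]


# Sample data modeled on the expected response above (one parent, three chunks).
sample_response = {
    "hits": {"hits": [{
        "_id": "doc_id_1",
        "inner_hits": {"chunks": {"hits": {"hits": [
            {"_nested": {"field": "chunks", "offset": 0}, "_score": 0.7954481,
             "_source": {"payload": "Text 1 example", "id": 1}},
            {"_nested": {"field": "chunks", "offset": 1}, "_score": 0.7949572,
             "_source": {"payload": "Text 2 example", "id": 2}},
            {"_nested": {"field": "chunks", "offset": 2}, "_score": 0.75225127,
             "_source": {"payload": "Text 3 example", "id": 3}},
        ]}}},
    }]}
}

for hit in top_inner_hits(sample_response, "chunks", 2):
    print(hit["parent_id"], hit["offset"], hit["score"])
```

With several parent hits in the response, the same helper would rank chunks across parents, which is the "top 20 nested documents" case described above.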

@martin-gaievski
Member

@Kovsonq
I still don't fully understand why you need normalized scores in the final list of results. If we enable inner_hits without normalized scores, the relative order of child documents will still be present in the final result list. Because inner_hits is specified at the sub-query level, those hits for child documents will be local to that sub-query anyway, not global to the whole hybrid query.
If you need to retrieve information about child documents with normalized scores, then I feel those child documents should be modeled as top-level (parent) documents.
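One possible way to remodel the data along those lines is sketched below in Python: each chunk of a parent document becomes its own top-level document in bulk (NDJSON) form. The target index name `my-hybrid-chunks`, the `parent_id` field, and the `<parent>-<offset>` id scheme are assumptions made for illustration, not something proposed in this thread.

```python
import json


def chunks_to_bulk_lines(index, parent_id, parent_doc):
    """Return NDJSON lines that index every chunk as a separate top-level document."""
    lines = []
    for offset, chunk in enumerate(parent_doc.get("chunks", [])):
        # Hypothetical id scheme: parent id plus the chunk's position.
        action = {"index": {"_index": index, "_id": f"{parent_id}-{offset}"}}
        # Hypothetical back-reference field so results can be grouped by parent.
        body = dict(chunk, parent_id=parent_id)
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))
    return lines


# Parent document taken from the reproduction steps earlier in this thread.
parent = {
    "chunks": [
        {"id": 1, "length": 173, "payload": "Text 1 example", "tokens": 256, "embedding": [1, 1, 1]},
        {"id": 2, "length": 173, "payload": "Text 2 example", "tokens": 256, "embedding": [2, 2, 2]},
        {"id": 3, "length": 173, "payload": "Text 3 example", "tokens": 256, "embedding": [3, 3, 3]},
    ]
}

bulk_lines = chunks_to_bulk_lines("my-hybrid-chunks", "1", parent)
print("\n".join(bulk_lines))
```

The target index would need a flat (non-nested) mapping for the chunk fields. Each chunk would then be scored and normalized as a main hit, at the cost of grouping results by `parent_id` on the client side.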

5 participants