[RFC] Improved Hybrid Search relevancy by Normalization and Score Combination Feature API Design LLD #244

Closed
navneet1v opened this issue Aug 7, 2023 · 4 comments
Assignees
Labels
Features Introduces a new unit of functionality that satisfies a requirement v2.10.0 Issues targeting release v2.10.0

Comments

@navneet1v
Collaborator

navneet1v commented Aug 7, 2023

Introduction

This document proposes the low-level API design for the Normalization and Score Combination feature. It is the first of three LLDs that the Vector Search team will publish for this feature. As a pre-read, please read the high-level design, Score Combination and Normalization for Semantics Search [HLD], and the Search Phase Results Processor proposal. The building blocks for the feature and the high-level decisions are covered in those issues.

Background

As discussed in the HLD, we will add a new query clause to OpenSearch that fetches, per shard, the results of multiple queries whose scores are on different scales (for example, a neural query and a keyword search). Once those results are retrieved at the shard level, the coordinator node normalizes and combines them via a Search Phase Results Processor to build the final list of doc IDs on which the fetch phase will run. Refer to the high-level flow below.
High Level Flow

[Figure: high-level flow diagram]

Search Phase results Processor Flow

[Figure: Search Phase Results Processor flow diagram]
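
The flow described above can be sketched roughly as follows. This is only an illustration, not the plugin's implementation: it assumes each sub-query's hits have already been gathered at the coordinator as a `{doc_id: score}` map, picks min-max normalization and a plain sum as the combination, and the function names and sample scores are made up for the example.

```python
from collections import defaultdict

def min_max_normalize(scores):
    """Rescale one sub-query's {doc_id: score} map into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every hit as equally relevant (assumption).
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def combine_and_rank(per_query_scores, size=10):
    """Normalize each sub-query's scores, sum them per document, and rank."""
    combined = defaultdict(float)
    for scores in per_query_scores:
        for doc_id, s in min_max_normalize(scores).items():
            combined[doc_id] += s
    # Highest combined score first; ties broken by doc ID for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# Hypothetical scores on very different scales.
neural_scores = {"d1": 0.92, "d2": 0.85, "d3": 0.40}
keyword_scores = {"d2": 12.0, "d3": 7.0, "d4": 2.0}

top_docs = combine_and_rank([neural_scores, keyword_scores], size=3)
print([doc_id for doc_id, _ in top_docs])
```

The ranked doc IDs are what the fetch phase would then run on. Documents matched by only one sub-query simply contribute nothing for the other, which is one possible choice among several.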

Scope of the Issue

This issue proposes answers to the following questions, which the HLD does not discuss in detail:

  1. The name of the new query clause that will hold the queries whose results need to be normalized and combined.
  2. The shape of the _search API request, which includes:
    1. How users will define the normalization technique.
    2. How users will define the score combination technique and its parameters.

Out of Scope

The following items are out of scope for this design but will be covered in other low-level designs:

  1. How the data needed for normalization will be fetched from the shards.
  2. Whether the provided queries will run in sequence or in parallel.
  3. How the results of the different queries will be transferred to the coordinator node.
  4. How pagination will be supported on the query clause.
  5. How the explain API will be supported on the query clause.

Solution

Below are some names that we are proposing for the new Query Clause:

  1. Composite Query
  2. Multi Component Query
  3. Hybrid Query (Recommended)
  4. Ensemble Query

The guiding idea for the name is that it should leave room to change how we adjust scores in the future. For example, we may later stop normalizing scores and use some other technique to bring them onto the same scale. All of the names above fit that requirement, but based on my understanding I recommend Hybrid Query as the name of the new query clause.

Below is the proposed API shape for normalizing and combining the scores of different queries. See the Usage section for how customers can use it.

POST <INDEX-NAME>/_search
{
   "query": {
       "hybrid": {
           "queries": [
               {}, // First query
               {}  // Second query
               // ... other queries
           ]
       }
   }
}
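
As a rough client-side illustration, the request body could be assembled like this. The sketch assumes the `queries` key used in the usage examples further down; `hybrid_query` is a hypothetical helper, not part of any OpenSearch client.

```python
import json

def hybrid_query(*sub_queries):
    """Wrap sub-queries in the proposed `hybrid` clause (hypothetical helper)."""
    if not sub_queries:
        raise ValueError("a hybrid query needs at least one sub-query")
    return {"query": {"hybrid": {"queries": list(sub_queries)}}}

body = hybrid_query(
    # First query: neural search on the embedding field.
    {"neural": {"passage_embedding": {
        "query_text": "Girl with Brown Hair",
        "model_id": "ABCBMODELID",
        "k": 20,
    }}},
    # Second query: keyword search on the text field.
    {"match": {"passage_text": "Girl Brown hair"}},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of `POST <INDEX-NAME>/_search` with any HTTP client.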

The search pipeline interface that performs the normalization and score combination:

PUT /_search_processing/pipeline/<PIPELINE-NAME>
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors" : [
    {
      "scoring-processor" : {
        "normalization": { // Optional
           "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm (default)
        },
        "combination": { // Optional
           "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, sum (default), geometric mean
            "parameters": { // Optional
                // Parameters that the chosen combination technique may require
                "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
            }
        }
      }
    }
  ]
}
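
To make the listed techniques concrete, here is a minimal sketch of L2 normalization and of weighted arithmetic, harmonic, and geometric means. It is not the processor's implementation. Dividing by the sum of the weights (so that weights such as `[0.4, 0.7]` need not add up to 1) and returning 0.0 when a score is not positive are assumptions made for this example.

```python
import math

def l2_normalize(scores):
    """Divide each score by the L2 norm of the sub-query's score vector."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm if norm > 0 else 0.0 for s in scores]

def arithmetic_mean(scores, weights):
    """Weighted arithmetic mean of one document's per-query scores."""
    return sum(w * s for s, w in zip(scores, weights)) / sum(weights)

def harmonic_mean(scores, weights):
    """Weighted harmonic mean; only defined for positive scores."""
    if any(s <= 0 for s in scores):
        return 0.0  # assumption: a non-positive score collapses the result
    return sum(weights) / sum(w / s for s, w in zip(scores, weights))

def geometric_mean(scores, weights):
    """Weighted geometric mean, computed in log space."""
    if any(s <= 0 for s in scores):
        return 0.0  # assumption: a non-positive score collapses the result
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / sum(weights))

weights = [0.4, 0.7]       # one weight per sub-query, as in the pipeline above
doc_scores = [0.5, 1.0]    # hypothetical normalized scores for one document

print(l2_normalize([3.0, 4.0]))
print(arithmetic_mean(doc_scores, weights))
print(harmonic_mean(doc_scores, weights))
print(geometric_mean(doc_scores, weights))
```

The harmonic and geometric means penalize a document that scores poorly in any one sub-query more strongly than the arithmetic mean does, which is one reason to expose the technique as a pipeline setting.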

Usage

Below are some examples of how customers can use the normalization and score combination feature in their search requests. These are only a few of the many ways the query can be used.

Prerequisites

Sample Index Mapping

PUT /flicker-index
{
    "settings": {
        "index.knn": true
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 2
            },
            "passage_text": { 
                "type": "text"
            },
            "status": {
                "type": "text"
            }
        }
    }
}

Once the index is created, the customer can index data. How they index it is entirely up to them.

Example Query Usage 1

Create the search pipeline separately, then pass it as a request parameter.

PUT /_search_processing/pipeline/normalizationPipeline
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors" : [
    {
      "scoring-processor" : {
        "normalization": { // Optional
           "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
        },
        "combination": { // Optional
           "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
            "parameters": { // Optional
                // Parameters that the chosen combination technique may require
                "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
            }
        }
      }
    }
  ]
}

Use the search pipeline name in the query parameters.

POST flicker-index/_search?search_pipeline=normalizationPipeline
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "passage_embedding": {
                            "query_text": "Girl with Brown Hair",
                            "model_id": "ABCBMODELID",
                            "k": 20
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "match": {
                                    "passage_text": "Girl Brown hair"
                                }
                            }
                        ],
                        "filter": {
                            "term": {
                                "status": "published"
                            }
                        }
                    }
                }
            ]
        }
    }
}

Example Query Usage 2

Define the search pipeline in the search request itself, as an ad hoc pipeline. This lets customers test queries against different pipelines or use the search comparison tool to compare results.

POST flicker-index/_search
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "passage_embedding": {
                            "query_text": "Girl with Brown Hair",
                            "model_id": "ABCBMODELID",
                            "k": 20,
                            "filter": {
                                "term": {
                                    "status": "published"
                                }
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "match": {
                                    "passage_text": "Girl Brown hair"
                                }
                            }
                        ],
                        "filter": {
                            "term": {
                                "status": "published"
                            }
                        }
                    }
                }
            ]
        }
    },
    "search_pipeline": {
        "description": "A pipeline that adds a Normalization and Combination processor",
        "phase_injector_processors" : [
            {
                "scoring-processor" : {
                    "normalization": { // Optional
                        "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
                    },
                    "combination": { // Optional
                        "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
                        "parameters": { // Optional
                            // Parameters that the chosen combination technique may require
                            "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
                        }
                    }
                }
            }
        ]
    }
}

Example Query Usage 3

Use index settings to set a default search pipeline.

PUT /flicker-index/_settings 
{
  "index.default_search_pipeline" : "normalizationPipeline"
}


PUT /_search_processing/pipeline/normalizationPipeline
{
  "description": "A pipeline that adds a Normalization and Combination processor",
  "phase_injector_processors" : [
    {
      "scoring-processor" : {
        "normalization": { // Optional
           "technique/method": "<NORMALIZATION-TECHNIQUE>" // Possible values: min-max, L2Norm
        },
        "combination": { // Optional
           "technique/method": "<SCORE-COMBINATION-TECHNIQUE>", // Possible values: harmonic mean, arithmetic mean, geometric mean
            "parameters": { // Optional
                // Parameters that the chosen combination technique may require
                "weights": [0.4, 0.7] // A floating-point array that the algorithm can use
            }
        }
      }
    }
  ]
}

POST flicker-index/_search
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "passage_embedding": {
                            "query_text": "Girl with Brown Hair",
                            "model_id": "ABCBMODELID",
                            "k": 20
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "match": {
                                    "passage_text": "Girl Brown hair"
                                }
                            }
                        ],
                        "filter": {
                            "term": {
                                "status": "published"
                            }
                        }
                    }
                }
            ]
        }
    }
}

Example Query Usage 4

PUT /flicker-index
{
    "settings": {
        "index.knn": true,
        "index.hybrid": true
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 2
            },
            "passage_text": { 
                "type": "text"
            },
            "status": {
                "type": "text"
            }
        }
    }
}

The above index request will ultimately be stored as follows:

{
    "settings": {
        "index.knn": true,
        "index.hybrid": true,
        "index.default_search_pipeline" : "defaultNormalizationPipeline"
    },
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 2
            },
            "passage_text": { 
                "type": "text"
            },
            "status": {
                "type": "text"
            }
        }
    }
}
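
The rewrite of the settings shown above could behave roughly like the sketch below. This is a hypothetical illustration: the function name is invented, and letting an explicitly configured `index.default_search_pipeline` take precedence over the injected default is an assumption, not something this proposal specifies.

```python
# Pipeline name taken from the stored-settings example above.
DEFAULT_PIPELINE = "defaultNormalizationPipeline"

def resolve_index_settings(settings):
    """Return the settings as they might be stored for a hybrid-enabled index."""
    stored = dict(settings)  # do not mutate the caller's settings
    hybrid_enabled = stored.get("index.hybrid") is True
    if hybrid_enabled and "index.default_search_pipeline" not in stored:
        # Assumption: only inject the default when the user has not set one.
        stored["index.default_search_pipeline"] = DEFAULT_PIPELINE
    return stored

requested = {"index.knn": True, "index.hybrid": True}
print(resolve_index_settings(requested))
```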

Users can then run a hybrid search without specifying a pipeline:

POST flicker-index/_search
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "passage_embedding": {
                            "query_text": "Girl with Brown Hair",
                            "model_id": "ABCBMODELID",
                            "k": 20
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "match": {
                                    "passage_text": "Girl Brown hair"
                                }
                            }
                        ],
                        "filter": {
                            "term": {
                                "status": "published"
                            }
                        }
                    }
                }
            ]
        }
    }
}

Pros:

  1. From the customer's standpoint, all API calls remain the same; only the request body needs to change.
  2. From a cohesion standpoint, because this is a search operation, it makes sense to include it in the _search API and provide a unified experience for customers who search via OpenSearch.
  3. Less maintenance and a consistent output format, because the new compound query is integrated with the _search API.
  4. Integration with other search capabilities such as Explain Query, pagination, and _msearch becomes possible without reinventing the wheel.

Cons:

  1. From an implementation standpoint, we need to define new concepts in OpenSearch, such as a new query clause, which will require customer education on how to use it.

Alternatives

Alternatives Considered

Alternative-1: Implement a new REST handler instead of creating a new compound query
The idea here is to create a new REST handler that defines the list of queries whose scores need to be normalized and combined.
Pros:

  1. This gives the team flexibility to experiment without touching the core capabilities of OpenSearch.
  2. Easier implementation, as the new REST handler is limited to the Neural Search plugin.

Cons:

  1. Duplicate code and interfaces, as we would be re-implementing the same search API functionality (size, from and to, included source fields, scripting, etc.).
  2. A higher learning curve and more difficult adoption for customers who already use the _search API for other search workloads.

Future Scope

  1. Currently, filters are defined separately for each query in the queries array. Ideally, we would like to provide a single filter key that defines filters for all the queries, in addition to the individual filters. We can borrow something like the filter context from the bool query.
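
Until such a key exists, the intended semantics can be approximated on the client side. The sketch below wraps every sub-query in a bool query that carries the shared filter; `apply_shared_filter` is a hypothetical helper, and wrapping a neural query this way is not necessarily equivalent to the `filter` parameter inside the neural clause shown in Example Query Usage 2.

```python
import copy

def apply_shared_filter(hybrid_body, shared_filter):
    """Return a copy of the request with the filter applied to every sub-query."""
    body = copy.deepcopy(hybrid_body)  # leave the caller's request untouched
    hybrid = body["query"]["hybrid"]
    hybrid["queries"] = [
        # Each sub-query keeps scoring in `must`; the filter does not affect scores.
        {"bool": {"must": [sub_query], "filter": [shared_filter]}}
        for sub_query in hybrid["queries"]
    ]
    return body

request = {"query": {"hybrid": {"queries": [
    {"match": {"passage_text": "Girl Brown hair"}},
    {"match": {"passage_text": "brown hair"}},
]}}}
published_only = {"term": {"status": "published"}}

filtered = apply_shared_filter(request, published_only)
print(filtered["query"]["hybrid"]["queries"][0])
```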

References:

  1. Science blog on Normalization and Score Combination: https://opensearch.org/blog/semantic-science-benchmarks/

Feedback Required:

  1. Because this is a new query clause, we would like feedback on its name. We want the name to be generic so that it can be used with any queries that produce scores on different scales.
@navneet1v navneet1v added v2.10.0 Issues targeting release v2.10.0 Features Introduces a new unit of functionality that satisfies a requirement labels Aug 7, 2023
@navneet1v
Collaborator Author

cc: @nknize, @dblock, @smacrakis please have a look at the proposal. I mainly want to know your thoughts on the name of the new query that we are adding.

@msfroh

msfroh commented Aug 7, 2023

@navneet1v Does the hybrid query essentially behave like a conjunctive bool query, where the SearchPhaseResultProcessor takes care of normalizing scores across the clauses?

Do we need to do any customization at the collector level to collect both the neural scores and the textual scores during the query phase? (Or is that part of the "out of scope" detail to be covered elsewhere?)

@navneet1v
Collaborator Author

@msfroh Yes, we need customization in the Docs Collector and the QueryPhaseSearcher class. Please refer to GitHub issue #193 for details on how we are doing it.

@vamshin vamshin changed the title [RFC] Normalization and Score Combination Feature API Design LLD [RFC] Improved Hybrid Search relevancy by Normalization and Score Combination Feature API Design LLD Sep 5, 2023
@navneet1v
Collaborator Author

Resolving this GitHub issue, as the changes for the 2.10 RC are finalized and merged. Please create a new GitHub issue if there are any further questions.
