
[RFC] Low Level Design for Normalization and Score Combination Query Phase Searcher #193

Closed
martin-gaievski opened this issue Jun 3, 2023 · 2 comments
Labels
Enhancements Increases software capabilities beyond original client specifications Features Introduces a new unit of functionality that satisfies a requirement neural-search RFC v2.10.0 Issues targeting release v2.10.0


@martin-gaievski
Member

Introduction

This document describes the details of the low-level Query Phase Searcher design for the Score Normalization and Combination feature. It is one of several LLDs for the feature. Reading the following documents first is highly recommended: the high-level design [RFC] High Level Approach and Design For Normalization and Score Combination, and the antecedent LLD [RFC] Low Level Design for Normalization and Score Combination Query.

Background

As described in the HLD, the normalization feature requires the ability to collect results of multiple sub-queries and send them, unmerged, to the coordinator node for post-processing. The piece that does not currently exist in OpenSearch is storage for such multiple result sets: current logic assumes that a shard-level search returns a single list of results (doc ids and scores) and that merging happens at the shard level.

OpenSearch provides an extension mechanism for this type of use case. The goal can be achieved through a custom QueryPhaseSearcher that calls a custom DocCollector. Both are executed at the shard level as part of the Query phase of a search request (the caller of the query phase searcher). These abstractions, along with a new DTO that holds results of multiple sub-queries, are the focus of this design.

The new QueryPhaseSearcher and related classes will be added as part of the Neural Search plugin, and the code changes will be made in the plugin repo.

Requirements

  1. Results with doc ids from multiple sub-queries need to be collected and passed to the coordinator node as part of the query execution results. In the later Fetch phase we need to be able to identify which result belongs to which sub-query.

  2. The new query should keep added latency (for functions like query parsing etc.) to a minimum and must not degrade performance, in either latency or resource utilization, compared to a similar query that does combination at the shard level. Exact latency numbers will be added after benchmarking; the initial expectation is added latency within 15%.

  3. Fetch phase of query execution (“reduce” in terms of OpenSearch, executed at coordinator node) should work without changes.

Scope

In this document we propose a solution for the questions below:

  1. How do we collect scores for each sub-query and form the final resulting DTO?

  2. How do we use the existing OpenSearch extension point to implement a custom QueryPhaseSearcher?

Solution Overview

A new custom query phase searcher will be created as an implementation of the core OpenSearch QueryPhaseSearcher interface. A new custom DocCollector will be responsible for collecting search results from a single shard. We’ll use a new DTO object that holds a collection of top-scored docs for all sub-queries.

Risks / Known limitations

In the first phase we’ll add a plugin setting that allows disabling our query phase searcher in case a user needs the concurrent search feature. A feature flag has the drawback of relying on command-line arguments that are set during the distribution build; with a setting, the user can change the value and restart the cluster. By default the hybrid query searcher will be enabled, and the setting will allow disabling it.

  • the new DTO for top docs will implement two views of results: a collection of doc ids for a single query, and a collection of collections of matching doc ids for the new hybrid query. The first view is required to comply with existing core API contracts and to keep the number of changes to a minimum. The actual per-sub-query results will be used in later implementations, in a transformer that is part of the search pipeline; it will be processed during the Query phase, before the Fetch phase.

  • the new doc collector will implement the minimal set of features required to collect results of a hybrid query. This is because the collector will be used for only one query type, and some features supported by the core doc collector (such as pagination or a max score threshold) are not supported in the initial release (see [RFC] High Level Approach and Design For Normalization and Score Combination for details).

Future extensions

  • Resolve the multiple-query-searchers limitation. This could become a core feature; it is currently under discussion (Enabling Multiple QueryPhaseSearcher in OpenSearch OpenSearch#7020). There are two possible approaches here:

    • use a single query searcher, choose one of multiple implementations in runtime
    • create new query phase (or sub-phase), plugins can register multiple custom implementations
  • pagination of query results. The assumption is that this should work if we pass “to” and “from” to the doc collector, but it needs testing (a small POC using the first implementation version or previous POCs).

Solution Details

We’re going to use the existing plugin class NeuralSearch as the entry point to register the new HybridQueryPhaseSearcher. The searcher will call the new doc collector only if the query is of type HybridQuery; this check is required because the query phase searcher is global at the plugin level and will be invoked for all queries (currently NeuralSearch queries).
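A minimal sketch of that dispatch, using Python as a stand-in for the Java plugin code (everything except the HybridQuery and HybridQueryPhaseSearcher names is illustrative, not the actual plugin API):

```python
# Sketch of the query-type dispatch (Python stand-in for the Java plugin code).
# Because the searcher is registered globally at the plugin level, it must
# delegate any non-hybrid query to the default query phase searcher.

class HybridQuery:
    def __init__(self, sub_queries):
        self.sub_queries = sub_queries

class HybridQueryPhaseSearcher:
    def __init__(self, default_searcher):
        self.default_searcher = default_searcher

    def search_with(self, query):
        if isinstance(query, HybridQuery):
            # use the custom hybrid doc collector for this query type only
            return "hybrid collector, %d sub-queries" % len(query.sub_queries)
        # any other query type falls back to the default implementation
        return self.default_searcher(query)

searcher = HybridQueryPhaseSearcher(lambda q: "default collector")
```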


Figure 1: Class diagram for HybridQueryPhaseSearcher implementation

Below is the general data flow for collecting query results using the custom doc collector for a Hybrid query. The QueryPhaseSearcher executes at the shard level, after the coordinator node sends the Query request to each shard. The DocCollector gets the max scores for each sub-query using a doc id iterator and a priority queue, then forms a collection of all results. This is set as the shard query result and sent back to the coordinator node for the Fetch phase.


Figure 2: General sequence diagram for collecting query results from shards
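The per-sub-query collection step described above can be sketched as follows. This is an illustrative Python model, not the plugin code (the actual collector is Java and works on Lucene doc id iterators), assuming each sub-query yields (doc_id, score) pairs:

```python
# Illustrative sketch of per-sub-query top-k collection: for each sub-query,
# keep the k highest-scoring (doc_id, score) pairs in a size-bounded min-heap,
# then emit one best-first result list per sub-query.
import heapq

def collect_top_docs(sub_query_hits, k):
    """sub_query_hits: list (one entry per sub-query) of (doc_id, score) pairs."""
    results = []
    for hits in sub_query_hits:
        heap = []  # min-heap of (score, doc_id), capped at k entries
        for doc_id, score in hits:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                # new hit beats the current k-th best; replace it
                heapq.heapreplace(heap, (score, doc_id))
        # sort best-first, as in TopDocs.scoreDocs
        results.append([(d, s) for s, d in sorted(heap, reverse=True)])
    return results
```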

Final query results will be set on the QueryResult object as a single instance of CompoundTopDocs.

For example, suppose we send a Hybrid Query search request with 3 sub-queries:

POST <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries": [
                { /* standard term query 1 */ },
                { /* standard term query 2 */ },
                { /* neural query */ }
            ]
        }
    }
}

Our DTO with query results will look something like this:

CompoundTopDocs: //new class
    docs:
        [0] TopDocs: //existing class
            totalHits
            scoreDocs:
                [0] ScoreDoc: //existing class
                    docId
                    score
                [1] ScoreDoc:
                    docId
                    score
        [1] TopDocs:
            totalHits
            scoreDocs:
                [0] ScoreDoc:
                    docId
                    score
        [2] TopDocs:
            totalHits
            scoreDocs:
                [0] ScoreDoc:
                    docId
                    score
                [1] ScoreDoc:
                    docId
                    score
                [2] ScoreDoc:
                    docId
                    score

The main difference between the new CompoundTopDocs and the existing core TopDocs is that the new object holds a collection of results, one per sub-query, while the standard core object always holds a single result, for instance:

TopDocs: 
    totalHits
    scoreDocs:
         [0] ScoreDoc:
               docId
               score
         [1] ScoreDoc:
               docId
               score
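Both structures can be modeled roughly as below. This is a simplified Python sketch of the Java classes, with field names mirroring the design (the real ScoreDoc and TopDocs live in Lucene/OpenSearch core; only CompoundTopDocs is new):

```python
# Simplified Python model of the DTO structures above; the actual classes
# are Java, in Lucene/OpenSearch core and the neural-search plugin.
from dataclasses import dataclass
from typing import List

@dataclass
class ScoreDoc:            # existing core class
    doc_id: int
    score: float

@dataclass
class TopDocs:             # existing core class: single result list
    total_hits: int
    score_docs: List[ScoreDoc]

@dataclass
class CompoundTopDocs:     # new class: one TopDocs per sub-query
    docs: List[TopDocs]

# The 3-sub-query request above would produce one CompoundTopDocs
# whose `docs` list has 3 entries, one per sub-query.
result = CompoundTopDocs(docs=[
    TopDocs(2, [ScoreDoc(0, 0.95), ScoreDoc(2, 0.9)]),
    TopDocs(1, [ScoreDoc(0, 12.1)]),
    TopDocs(1, [ScoreDoc(1, 8.9)]),
])
```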

Testability

The new query phase searcher is testable via the existing /search REST API and lower-level direct API calls. Main testing will be done via unit and integration tests. We don’t need backward compatibility tests because neural-search is experimental and there is no commitment to supporting previous versions.

Tests will focus on overall query stability and on the results that are collected. Explicit testing of result correctness is not possible at this stage, because score normalization and combination happen at a later stage, in a future extension of the text processor (or a similar alternative implementation):

  • collect result doc ids for hybrid query that has no sub-queries, has one sub-query and has multiple sub-queries
  • test on cluster with multiple shards/nodes
  • check that query result object is set and accessible from SearchQueryThenFetchAsyncAction part of workflow

The mentioned tests are part of the plugin repo CI and the main OpenSearch build CI, and can also be executed on demand from a development environment.

Tests for metrics such as normalized score correctness, performance, etc. will be added in later implementations, once the end-to-end solution is available.

Reference Links

  1. Meta Issue for Feature: [META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123
  2. [RFC] High Level Approach and Design For Normalization and Score Combination: [RFC] High Level Approach and Design For Normalization and Score Combination #126
  3. [RFC] Low Level Design for Normalization and Score Combination Query: [RFC] Low Level Design for Normalization and Score Combination Query #174
@martin-gaievski
Member Author

martin-gaievski commented Sep 1, 2023

We found during testing that in the multiple-node scenario, which is typical in production, the custom implementation of TopDocs doesn’t work well. We chose to align our format with existing logic and switch to the existing TopDocs with a single list of scores. Details of our findings are explained in a core OpenSearch issue; TL;DR: the coordinator node receives only a single list of scores from data nodes: opensearch-project/OpenSearch#9697

Updated approach for DTO

We're using the TopDocs class as the DTO for sending results from data nodes to the coordinator node. Scores of all sub-queries will be in a single list; each sub-query's section will be preceded by a special delimiter score. The special score will also be the first and last element of the score list; this marks such a TopDocs as related to a Hybrid Query and simplifies parsing of the score list.

High level protocol

    doc_id | magic_number_1    // start/stop
    doc_id | magic_number_2    // delimiter for sub-query 1
    ...
    doc_id | magic_number_2    // delimiter for sub-query 2
    ...
    doc_id | magic_number_2    // delimiter for sub-query 3
    ...
    doc_id | magic_number_1    // start/stop

Example

TopDocs: 
    totalHits
    scoreDocs:
         [0] ScoreDoc:
               0, start/stop_score
         [1] ScoreDoc:
               0, delimiter_score
         [2] ScoreDoc:
               0, 0.95
         [3] ScoreDoc:
               2, 0.9
         [4] ScoreDoc:
               3, 0.75
         [5] ScoreDoc:
               0, delimiter_score
         [6] ScoreDoc:
               0, 12.1
         [7] ScoreDoc:
               1, 8.9
         [8] ScoreDoc:
               0, start/stop_score

We have to utilize only the score field of ScoreDoc, due to a limitation in the implementation of pipelines and processor execution. In the single-shard case it is not guaranteed that the Fetch phase executes before the processor is called, even when the processor is registered between the Query and Fetch phases. In that case, if the docId is not valid, the Fetch phase code fails.
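A decoding sketch for this protocol, assuming the input is the flat list of (doc_id, score) pairs taken from a shard's TopDocs. The magic-number values below are placeholders for illustration only, not the actual constants defined in the plugin:

```python
# Illustrative decoder for the delimiter protocol above. The two magic-number
# constants are placeholders; the plugin defines its own sentinel values.
START_STOP = -1.0e18  # placeholder for magic_number_1 (start/stop marker)
DELIMITER = -2.0e18   # placeholder for magic_number_2 (sub-query delimiter)

def split_sub_query_scores(score_docs):
    """score_docs: flat list of (doc_id, score). Returns one list per sub-query."""
    if (not score_docs
            or score_docs[0][1] != START_STOP
            or score_docs[-1][1] != START_STOP):
        raise ValueError("not a hybrid-query TopDocs")
    sub_queries = []
    current = None
    for doc_id, score in score_docs[1:-1]:  # strip start/stop markers
        if score == DELIMITER:
            current = []                    # start a new sub-query section
            sub_queries.append(current)
        elif current is None:
            raise ValueError("score before first delimiter")
        else:
            current.append((doc_id, score))
    return sub_queries
```

With the example list above, this yields two sub-query sections: three hits for the first sub-query and two for the second.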

The corresponding code in the normalization processor needs to support this new format. The main logic of score processing will remain the same; changes will only be in the logic for parsing TopDocs objects from shards/nodes.

@navneet1v
Copy link
Collaborator

Resolving this GitHub issue as the changes for the 2.10 RC are finalized and merged. Please create a new GitHub issue if there are any further questions.
