[RFC] Enhanced Access to Term-Level Statistics in OpenSearch #8702

noCharger · 2023-07-14T19:21:10Z

Is your feature request related to a problem? Please describe.

In its present state, OpenSearch, a fork of Elasticsearch, offers only constrained access to term-level statistics extracted from Lucene via its scripting functionality. The current process requires setting the similarity model, which can include scripted similarity, at the index level during index creation. This entails defining the settings and mappings for an index, specifying the similarity model for a specific field or for the whole index. Subsequently, during search operations, OpenSearch uses the predefined similarity model to calculate scores for the documents in the index.

This design choice was made for performance: the similarity model is used at index time to precompute certain values required at search time. Because it also influences how the inverted index is stored and queried, altering the similarity settings on a per-query basis is not practical.
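For illustration, here is a minimal sketch of what "setting the similarity at index creation" looks like as a request body. The similarity name `scripted_tfidf` and the field name `body` are hypothetical, and the script source follows the commonly documented scripted-similarity TF-IDF example; check the exact variable names against your OpenSearch version before relying on it.

```python
import json

# Hypothetical names: "scripted_tfidf" (similarity) and "body" (field) are
# placeholders chosen for this sketch, not taken from the RFC.
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "scripted_tfidf": {
                    "type": "scripted",
                    "script": {
                        # The script is fixed here, at index creation time.
                        "source": (
                            "double tf = Math.sqrt(doc.freq); "
                            "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
                            "double norm = 1 / Math.sqrt(doc.length); "
                            "return query.boost * tf * idf * norm;"
                        )
                    },
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # The field refers to the similarity by name.
            "body": {"type": "text", "similarity": "scripted_tfidf"}
        }
    },
}

# Body that would be sent with the index creation request.
request_json = json.dumps(index_body, indent=2)
print(request_json)
```

Note that nothing in this body can be changed per query: the script has no `params`, which is the limitation this RFC is about.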

Describe the solution you'd like

To enhance OpenSearch's capabilities, we suggest broadening the direct access to detailed statistics like term frequency (termfreq), term frequency-inverse document frequency (tf-idf), total term frequency (totaltermfreq), sum of total term frequencies (sumtotaltermfreq), and payload information. This improved access can spur the creation of more refined information retrieval and ranking algorithms.
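To make the statistics named above concrete, here is a toy sketch that computes them over a three-document in-memory corpus. It does not use Lucene or OpenSearch, and the tf-idf variant (`tf * ln(N / df)`) is one textbook form chosen for illustration, not necessarily the formula OpenSearch would expose.

```python
import math

# Toy single-field corpus: document id -> analyzed tokens.
docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog and the fox".split(),
    "d3": "a quick dog".split(),
}

def term_freq(doc_id, term):
    """termfreq: occurrences of the term in one document."""
    return docs[doc_id].count(term)

def doc_freq(term):
    """Number of documents containing the term at least once."""
    return sum(1 for tokens in docs.values() if term in tokens)

def total_term_freq(term):
    """totaltermfreq: occurrences of the term across all documents."""
    return sum(tokens.count(term) for tokens in docs.values())

def sum_total_term_freq():
    """sumtotaltermfreq: total number of tokens in the field."""
    return sum(len(tokens) for tokens in docs.values())

def tf_idf(doc_id, term):
    """Textbook tf-idf (assumed variant): tf * ln(N / df)."""
    df = doc_freq(term)
    if df == 0:
        return 0.0
    return term_freq(doc_id, term) * math.log(len(docs) / df)

print(term_freq("d2", "the"), total_term_freq("the"), sum_total_term_freq())
print(round(tf_idf("d2", "the"), 4))
```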

We propose augmenting OpenSearch's scripting functionality to include more Lucene ValueSource statistics. This would involve extending existing scripting classes and creating new ones as necessary, leveraging Lucene's existing ValueSource and Similarity classes for the underlying statistics. This new functionality needs to be carefully integrated and thoroughly tested for reliability and performance. This would empower script creators with new tools for customizing information retrieval and ranking in OpenSearch.

Describe alternatives you've considered

Implementing this functionality outside OpenSearch: This would involve pulling data out of OpenSearch, calculating the statistics externally, and then pushing the results back into OpenSearch. This approach is likely to be inefficient and would not benefit from the optimizations available within OpenSearch and Lucene.
Relying solely on OpenSearch's existing scripting functionality: While OpenSearch's scripting does provide some access to term-level statistics, it is not flexible enough to allow tuning and customization during the fetch phase.
1. Term vector: As described in Expose term frequency in Painless script score context #7558 (comment), it is not one-pass, since the doc IDs have to be obtained first.
2. Rank feature: The rank feature query scores by adding the weighted feature value to the original score, for example: <BM25> + boost * <value>.
3. Scripted similarity: As described in Expose term frequency in Painless script score context #7558 (comment), scripted similarity doesn't allow parameters to be included in the similarity score on a per-query basis. While the multiplier and default_value can be injected by a function_score query, the target term must be in the query context, which is not configurable through params.

Additional context

Related issue: #7558

The proposed enhancement to OpenSearch's scripting functionality will provide a wider range of statistics for use in complex information retrieval and ranking algorithms. This opens up new possibilities for improving the accuracy and relevance of search results, tailoring the retrieval process to specific use cases, and optimizing performance. These statistics can be particularly useful in domains such as information retrieval research, e-commerce, document classification, and others where fine-grained control over the ranking algorithm is desirable.

noCharger · 2023-07-14T19:24:53Z

cc: @nknize @macohen @msfroh @jainankitk @rishabhmaurya

noCharger · 2023-07-17T18:44:25Z

Some early stage experiments:

1. Extend https://github.com/opensearch-project/OpenSearch/blob/main/plugins/examples/script-expert-scoring/src/main/java/org/opensearch/example/expertscript/ExpertScriptPlugin.java to support several predefined functions that cover the use cases, for example:

def multiplier = params.multiplier;
for (int x = 0; x < params.fields.length; x++) {
 if (_doc(params.fields[x]) != null) {
   return multiplier * _doc(params.fields[x]).term_freq(params.term);
 }
}

return params.default_value;

Pros: Very simple implementation.
Cons: It is not flexible and does not support a scripting language, because it only supports predefined functions.
Additional thought: Implementing another ScoreScript with support for some other scripting language from scratch is more difficult than it appears.
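The logic of the predefined function in approach 1 can be emulated outside OpenSearch to check the intended behavior. This is only a sketch: a document is modeled as a dict from field name to token list, which is an assumption for illustration and not how the plugin reads postings.

```python
def predefined_score(doc, params):
    """Emulates the approach-1 snippet: the first field present in the
    document decides the score; otherwise fall back to default_value.

    doc    -- hypothetical model: field name -> list of tokens
    params -- same keys as the snippet: fields, term, multiplier, default_value
    """
    for field in params["fields"]:
        tokens = doc.get(field)
        if tokens is not None:
            # Returns on the first existing field, even if the term
            # frequency there is zero, mirroring the original snippet.
            return params["multiplier"] * tokens.count(params["term"])
    return params["default_value"]


params = {
    "fields": ["body", "title"],
    "term": "foo",
    "multiplier": 2.0,
    "default_value": 0.5,
}

print(predefined_score({"title": ["foo", "bar", "foo"]}, params))  # title is used
print(predefined_score({"tags": ["foo"]}, params))                 # falls back
```

Because the control flow is hard-coded, any variation (summing over fields, combining with other doc values) needs a new predefined function, which is the inflexibility listed under Cons.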

2. Support functions in script_score

a. Expose simple functions in the Painless language
b. Experiment with using TermVectors / ValueSource / PostingsEnum

Pros: Flexible enough to cover complex scripting use cases.
Cons: The elegance hides a challenging implementation based on the Painless reflection mechanism.

Some errors encountered when binding the Lucene LeafReaderContext, at compile time and at runtime:


{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [],
        "script": "\n            termFreq('field', 'foo');\n          ",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "index1",
        "node": "TRpPyMPvSIW9IDEjvyyZkw",
        "reason": {
          "type": "query_shard_exception",
          "reason": "script_score: the script could not be loaded",
          "index": "index1",
          "index_uuid": "gFl0UNcxQqiPSV1YzSr3yg",
          "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "script_stack": [],
            "script": "\n            termFreq('field', 'foo');\n          ",
            "lang": "painless",
            "caused_by": {
              "type": "illegal_argument_exception",
              "reason": "[getLeafReaderContext] has unknown return type [org.apache.lucene.index.LeafReaderContext]. Painless can only support getters with return types that are allowlisted."
            }
          }
        }
      }
    ],
    "caused_by": {
      "type": "script_exception",
      "reason": "compile error",
      "script_stack": [],
      "script": "\n            termFreq('field', 'foo');\n          ",
      "lang": "painless",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "[getLeafReaderContext] has unknown return type [org.apache.lucene.index.LeafReaderContext]. Painless can only support getters with return types that are allowlisted."
      }
    }
  },
  "status": 400
}

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [
          "\n            termFreq('field', 'foo'); ...",
          "             ^---- HERE"
        ],
        "script": "\n            termFreq('field', 'foo');\n          ",
        "lang": "painless",
        "position": {
          "offset": 13,
          "start": 0,
          "end": 38
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "index1",
        "node": "FfngdfQ7Tn-crFQzbZTE4g",
        "reason": {
          "type": "query_shard_exception",
          "reason": "script_score: the script could not be loaded",
          "index": "index1",
          "index_uuid": "7QCIjAZFTXaq386R6LGtJw",
          "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "script_stack": [
              "\n            termFreq('field', 'foo'); ...",
              "             ^---- HERE"
            ],
            "script": "\n            termFreq('field', 'foo');\n          ",
            "lang": "painless",
            "position": {
              "offset": 13,
              "start": 0,
              "end": 38
            },
            "caused_by": {
              "type": "illegal_argument_exception",
              "reason": "Unknown call [termFreq] with [[org.opensearch.painless.node.EString@22883031, org.opensearch.painless.node.EString@505887d1]] arguments."
            }
          }
        }
      }
    ],
    "caused_by": {
      "type": "script_exception",
      "reason": "compile error",
      "script_stack": [
        "\n            termFreq('field', 'foo'); ...",
        "             ^---- HERE"
      ],
      "script": "\n            termFreq('field', 'foo');\n          ",
      "lang": "painless",
      "position": {
        "offset": 13,
        "start": 0,
        "end": 38
      },
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Unknown call [termFreq] with [[org.opensearch.painless.node.EString@22883031, org.opensearch.painless.node.EString@505887d1]] arguments."
      }
    }
  },
  "status": 400
}

[2023-07-17T12:01:21,654][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] importedMethod: null
[2023-07-17T12:01:21,659][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] classBinding: null
[2023-07-17T12:01:21,660][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] classBinding: org.opensearch.painless.lookup.PainlessClassBinding@ba1b93dc
[2023-07-17T12:01:21,661][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] instanceBinding: null

Next steps:

How is '_doc' / 'ScriptDocValues' made available in the script context for smooth execution? Does it also depend on reflection? If not, how is the async doc value being updated?
How are Lucene's CollectionStatistics and TermStatistics exposed in ScriptedSimilarity for Painless script execution?

noCharger · 2023-07-17T20:19:23Z

Some brainstorming with @nknize @jainankitk @rishabhmaurya on how to expose the feature:

Create a new plugin named 'extra-scripts' or 'customize-script-functions' for approach 1, into which developers can add predefined functions. This plugin may ultimately support user-defined functions.
Consider utilizing the sandbox module, which has the advantage of exposing the functionality without the use of a feature flag. We can eventually include new functionalities into the core.
Other options to feature flag the use of these new functionalities include the allowlist in the painless resource directory, the fielddata flag, exposing it in search pipelines, and so on.

I'm looking for more ideas / opinions on this, because supporting this functionality has been debated for a long time and there are real use cases.

russcam · 2023-07-20T23:59:35Z

In #8702 (comment), Approach 1 looks like it would tend towards Approach 2, in that the usefulness in having term frequency available to scripting is greatly enhanced by the ability to combine it with other scripting functions. With Approach 1, having term frequency exposed in a custom script engine by itself is not as useful as having it available with all other scripting capabilities in Painless, as in Approach 2.

noCharger · 2023-07-28T23:42:30Z

@russcam Here's a prototype for Approach 2 that incorporates these functions:

PoC-TermFreq.mov

Will get it in the repo soon. Thanks @msfroh for inspiring me with the idea of using currying in LeafSearchLookup.

russcam · 2023-08-03T04:59:43Z

This looks promising @noCharger! I pulled the branch down locally and successfully ran some example scripts incorporating termFreq
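For readers who want to try something similar, here is a rough sketch of a `script_score` request body that calls `termFreq`. The call shape `termFreq('field', 'foo')` is copied from the error payloads earlier in this thread; the `multiplier` param, the `match` sub-query, and the helper name are assumptions made for illustration, so treat this as untested against any particular build.

```python
import json

def build_term_freq_query(field, term, multiplier=1.0):
    """Builds a script_score search body (hypothetical helper).

    The field and term are inlined naively into the script source, so they
    must not contain quotes; this is fine for a sketch only.
    """
    source = f"params.multiplier * termFreq('{field}', '{term}')"
    return {
        "query": {
            "script_score": {
                # Sub-query that selects which documents get scored.
                "query": {"match": {field: term}},
                "script": {
                    "source": source,
                    # Unlike scripted similarity, params can vary per query.
                    "params": {"multiplier": multiplier},
                },
            }
        }
    }

body = build_term_freq_query("field", "foo", multiplier=2.0)
print(json.dumps(body, indent=2))
```

The point of the sketch is the combination raised earlier in the thread: term frequency becomes one input among others in a Painless expression, with per-query parameters.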

mingshl · 2023-08-21T17:16:30Z

closing the RFC as the PR is merged.

noCharger added the enhancement and untriaged labels · Jul 14, 2023

noCharger self-assigned this · Jul 14, 2023

noCharger added the Search label and removed the untriaged label · Jul 14, 2023

noCharger mentioned this issue on Aug 2, 2023: [Feature] Expose term frequency in Painless script score context #9081 (merged, 5 tasks)

mingshl closed this as completed · Aug 21, 2023

msfroh mentioned this issue on Aug 23, 2023: [DOC] Document new doc/term frequency functions in Painless score scripts opensearch-project/documentation-website#4858 (closed, 4 tasks)
