
[PROPOSAL] Language Analyzer plugin as an extension #766

Open
owaiskazi19 opened this issue May 19, 2023 · 3 comments
Labels
discuss, RFC (Issues requesting major changes)

Comments

@owaiskazi19
Member

owaiskazi19 commented May 19, 2023

What/Why

What are you proposing?

OpenSearch applies text analysis during indexing or searching of text fields. OpenSearch currently employs the Standard Analyzer by default for text analysis. It also supports a number of other language analyzers, which are part of the module here. There are native language analyzer plugins as well, such as analysis-icu, analysis-nori, analysis-kuromoji, and analysis-ukrainian, as mentioned here.
This proposal aims to enhance OpenSearch by introducing a Language Analyzer extension using the Language Analyzer plugin and the SDK. The important extension points of Language Analyzer plugins are getAnalyzers, getTokenFilters, and getTokenizers, along with others like getPreConfiguredCharFilters, getPreConfiguredTokenFilters, and getCharFilters.
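For reference, the sketch below shows roughly the shape these extension points take in OpenSearch core today via the AnalysisPlugin interface; a Language Analyzer extension would surface the same maps through the SDK instead. The class and the "sample_lowercase" filter name are illustrative only, not part of any existing plugin.

```java
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.opensearch.index.analysis.PreConfiguredTokenFilter;
import org.opensearch.index.analysis.TokenFilterFactory;
import org.opensearch.index.analysis.TokenizerFactory;
import org.opensearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.opensearch.plugins.AnalysisPlugin;
import org.opensearch.plugins.Plugin;

// Rough shape of how an analysis plugin registers components in OpenSearch core today;
// a Language Analyzer extension would expose the same maps through the SDK instead.
public class SampleLanguageAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        // A real plugin (e.g. analysis-nori) registers its language-specific
        // tokenizer factory here, keyed by the name used in index settings.
        return Map.of();
    }

    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        // Language-specific token filters (stemming, stop words, ...) go here.
        return Map.of();
    }

    @Override
    public List<PreConfiguredTokenFilter> getPreConfiguredTokenFilters() {
        // Pre-configured filters take no per-index settings; a lowercase filter
        // is the simplest example ("sample_lowercase" is an illustrative name).
        return List.of(PreConfiguredTokenFilter.singleton("sample_lowercase", true, LowerCaseFilter::new));
    }
}
```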

Which users have asked for this feature?

The Language Analyzer extension as an independent process would provide insights into the framework for developing custom plugins for OpenSearch.

What problems are you trying to solve?

The way ingestion/searching works in Lucene is as follows:

  1. Analyzer selection based on the data being ingested. Analysis converts the data into an indexed representation stored in the inverted index data structure for efficient retrieval.

  2. Tokenizers break down the pre-processed text generated by the analyzer into tokens.
    Example: Sample Text: "OpenSearch is a distributed search engine"
    Tokens: [OpenSearch] [is] [a] [distributed] [search] [engine]

  3. Token Filters perform additional processing on tokens, such as converting them to lowercase or removing stop words.
    After filtering: [opensearch] [distributed] [search] [engine]

(Diagram: Lucene analysis chain)

The overall flow can be summarized as follows:
Actual Text → Basic Token Preparation → Lowercase Filtering → Stop Word Filtering → Filtering by Custom Logic → Final Tokens

The abstract class oal.analysis.TokenStream breaks the incoming text into a sequence of tokens that are retrieved using an iterator-like pattern. TokenStream has two subclasses: oal.analysis.Tokenizer and oal.analysis.TokenFilter. A Tokenizer takes a java.io.Reader as input, whereas a TokenFilter takes another oal.analysis.TokenStream as input. This allows the components to be chained together such that the initial Tokenizer gets its input from a Reader and each TokenFilter operates on tokens from the preceding TokenStream in the chain. [1]
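To make the chaining concrete, below is a minimal, self-contained Lucene sketch (plain Lucene, not OpenSearch-specific) that reproduces the example above: a StandardTokenizer feeding a LowerCaseFilter and a StopFilter, consumed through the iterator-like TokenStream pattern. The stop-word set is trimmed to just the words needed for the example sentence.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisChainDemo {

    public static void main(String[] args) throws IOException {
        // Stop words trimmed to just the ones needed for the example sentence.
        CharArraySet stopWords = new CharArraySet(List.of("is", "a"), true);

        // An Analyzer wires a Tokenizer to a chain of TokenFilters.
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();       // step 2: text -> tokens
                TokenStream chain = new LowerCaseFilter(tokenizer);  // step 3: lowercase
                chain = new StopFilter(chain, stopWords);            // step 3: drop stop words
                return new TokenStreamComponents(tokenizer, chain);
            }
        };

        // Consume the TokenStream with the iterator-like pattern described above.
        try (TokenStream ts = analyzer.tokenStream("body", "OpenSearch is a distributed search engine")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");  // [opensearch] [distributed] [search] [engine]
            }
            ts.end();
        }
        analyzer.close();
    }
}
```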

Possible solutions

Three potential approaches, along with their respective pros and cons, are outlined below. However, it's important to note that the TokenStream class from Lucene lacks serialization capabilities, which makes it challenging to transfer over the transport layer.

  1. [On SDK] Keep the extension points of the Language Analyzer extension in the SDK and let the analysis be done on the extension side. This extension point design would be similar to the current NamedXContent extension point.

    Pros:

  • This avoids transferring the Map<String, AnalysisProvider> to OpenSearch, thus saving a transport call to OpenSearch.

    Cons:

  • Might affect processing time, as requests for the language analyzer would be redirected from OpenSearch to the specific extension.

  2. [On OpenSearch] Create wrapper classes in OpenSearch and invoke the respective Lucene filter factories from the extension, assuming any language analyzer extension uses a filter factory from Lucene. This approach would also require dynamically registering all of the extension points in OpenSearch after bootstrapping is done, similar to dynamic settings.

    Pros:

  • Simplifies the design relative to the way language analysis is currently implemented in OpenSearch.

    Cons:

  • Imposes a strict requirement for Language Analyzer extensions to use filter factories exclusively.

  3. [On OpenSearch] Use Java Serializable: serialize the whole extension class and transport it over to OpenSearch from the extension.

    Pros:

  • No need to worry about serializing every entity separately (for example, TokenStream).

    Cons:

  • JEP 154 (https://openjdk.org/jeps/154) discusses removing Serializable.

What will it take to execute?

  1. Run a language analyzer plugin as an extension on a separate process. (Let's start with Nori.)

  2. Decide among the approaches mentioned above. The best approach to proceed with, IMO, would be the first one, i.e. keeping the analyzers, filters, and tokenizers on the SDK when a language analyzer is running as an extension (a rough sketch follows after the sub-issues below).

    Sub issues for the above:

    1. [FEATURE] Run a Language Analyzer plugin as an extension #772
    2. [FEATURE] getAnalyzer extension point for Language Analyzer extension #773
    3. [FEATURE] getTokenFilters extension point for Language Analyzer extension #774
    4. [FEATURE] getTokenizers extension point for Language Analyzer extension #775
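To illustrate option 1: none of this exists in the SDK today, and the class and method names below (ExtensionAnalyzeHandler, analyze) are purely hypothetical, but the extension side could own the Analyzer instances and return plain token strings over transport, so the non-serializable TokenStream never has to leave the extension process.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical extension-side handler: the extension owns the Analyzer instances
// (registered through a getAnalyzers-style extension point) and only plain token
// strings ever cross the transport, so the non-serializable TokenStream never
// leaves the extension process.
public class ExtensionAnalyzeHandler {

    private final Map<String, Analyzer> analyzers;

    public ExtensionAnalyzeHandler(Map<String, Analyzer> analyzers) {
        this.analyzers = analyzers;
    }

    // Invoked when OpenSearch forwards an analyze request to the extension.
    public List<String> analyze(String analyzerName, String text) throws IOException {
        Analyzer analyzer = analyzers.get(analyzerName);
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;  // easily serializable payload for the response back to OpenSearch
    }
}
```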

Any remaining open questions?

Looking for suggestions on any other approaches to achieve this, or any thoughts on the proposal.

[1] https://www.javacodegeeks.com/2015/09/lucene-analysis-process-guide.html

@dbwiddis
Member

The 3 solutions really boil down to two use cases:

  1. Process tokens on the Extension node(s)
    • This is a great candidate for a serverless (Lambda, Azure Function, etc.) application as there is minimal configuration/setup/extension points. It's just "process this token stream".
    • This would be the ideal choice if the pattern of usage brings benefits that outweigh the extra costs in transport time
      • Can we collect some data on how often (and at what times simultaneously) these token streams are processed?
      • If it's a "bursty" pattern exceeding available CPU, then parallelizing it across many serverless functions would be ideal; the transport time would be incurred in parallel
      • If the processing time exceeds the transport time by a significant ratio, the transport overhead is minimal
  2. Process tokens on the OpenSearch cluster
    • 2a. Create the needed classes via Reflection, just sending across the class names of the Lucene (or other) dependencies to import
      • This is the closest analogue to what we currently have, but it "serializes" classes by sending over their class names
      • This hardly seems like an extension; it's really just configuration and could be done with a configuration file (YAML/JSON/etc.). A rough sketch follows at the end of this comment.
      • It's language-agnostic because it really is just text
    • 2b. Serialize the classes using Java Serializable
      • There have been moves to get rid of Serializable, although the quoted JEP was withdrawn, so it is probably safe to rely on for now
      • This does lock extensions in to being Java-based, or at least a language with a plugin/module that can generate Java-serialized class files, like this one for Python: https://pypi.org/project/javaobj-py3/ and (I don't know Rust, but) maybe this: https://lib.rs/crates/javawithrust
      • I'm less concerned about the Java lock-in (as the interfaces are Lucene-based anyway) because it does give us version decoupling

I'd love to see option 1 work, but we need to measure enough to show that it's worth it.
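For completeness, here is a rough sketch of what 2a could look like on the OpenSearch side: only strings (a factory class name or SPI name plus settings) arrive from the extension, and the Lucene factory is instantiated locally. Lucene 9 package names are assumed, and the fromClassName helper is illustrative, not an existing OpenSearch API.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilterFactory;

// Rough sketch of 2a: OpenSearch receives only strings (a factory class name or
// SPI name plus settings) from the extension and instantiates the Lucene factory
// locally. Lucene 9 package names assumed; fromClassName is illustrative, not an
// existing OpenSearch API.
public class ReflectiveFilterRegistration {

    static TokenFilterFactory fromClassName(String className, Map<String, String> args) throws Exception {
        // Lucene analysis factories follow the convention of a (Map<String,String>) constructor.
        Class<?> clazz = Class.forName(className);
        return (TokenFilterFactory) clazz.getConstructor(Map.class).newInstance(new HashMap<>(args));
    }

    public static void main(String[] args) throws Exception {
        // The kind of payload an extension would send: plain text, no serialized classes.
        TokenFilterFactory lowercase =
            fromClassName("org.apache.lucene.analysis.core.LowerCaseFilterFactory", Map.of());

        // Alternatively, Lucene's SPI lookup resolves factories by their short name.
        TokenFilterFactory stop = TokenFilterFactory.forName("stop", new HashMap<>());

        // Either factory could then be wrapped behind OpenSearch's own analysis interfaces.
        System.out.println(lowercase + " / " + stop);
    }
}
```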

@joshpalis
Member

+1 to option 1. In my opinion, processing tokens on the OpenSearch cluster kind of negates the need for having the language analyzer as an extension, as it would effectively just provide configuration. The onus of language analysis should be on the extension itself.

@dbwiddis
Member

Spent a little bit of time trying to see if anyone had done any Analyzer benchmarking.

The Lucene Nightly Benchmarks give some insight: Analyzer throughput varies per analyzer, but the standard analyzer is about 12M tokens per second.

Lambda function throughput is limited to 16 Mbps (2 MBps); an average word length of 5 bytes plus a space means 6 Lambdas could keep up with that level of throughput.

This post is rather interesting. The author's throughput was limited by the EFS indexing, not by the Lambda processing, but they did indicate (and said not to quote, so take this number with a grain of salt) roughly a 500 ms indexing latency (EFS-limited, probably faster). Compared to OpenSearch 1.x median indexing latency of 2400 ms, it seems that OpenSearch indexing latency is high enough that the processing time of a Lambda isn't much higher.

Since the Analyzer workflow (at a higher level than what @owaiskazi19 cited above) is essentially text → tokens → token filters → index tokens, the filtering reduces the indexing requirements; due to locking, better performance is achieved through caching the indexing.

Seems option 1 looks promising based on back-of-the-envelope math.
