[PROPOSAL] Language Analyzer plugin as an extension #766
Comments
The 3 solutions really boil down to two use cases:
I'd love to see option 1 work, but we need to measure enough to show that it's worth it.
+1 to option 1. In my opinion, processing tokens on the OpenSearch cluster kind of negates the need for having the language analyzer as an extension, as the extension would effectively just provide configuration. The onus of language analysis should be on the extension itself.
Spent a little bit of time trying to see if anyone had done any Analyzer benchmarking. The Lucene Nightly Benchmarks give some insight: analyzer throughput varies per analyzer, but the standard analyzer handles about 12M tokens per second. The Lambda function throughput limit is 16 Mbps (2 MBps); with an average word length of 5 bytes plus a space, 6 lambdas could keep up with that level of throughput. This post is rather interesting. The author's throughput was limited by EFS indexing, not by the Lambda processing, but it did indicate (the author said not to quote it, so take this number with a grain of salt) about a 500ms indexing latency (EFS-limited, probably faster). Compared to the OpenSearch 1.x median indexing latency of 2400ms, OpenSearch indexing latency seems high enough that the processing time of a Lambda wouldn't add much on top. Since the Analyzer workflow (at a higher level than @owaiskazi19 cited above) is essentially text -> tokens -> token filters -> index tokens, the filtering reduces the indexing requirements; due to locking, better performance is achieved through caching the indexing. Option 1 looks promising based on back-of-the-envelope math.
What/Why
What are you proposing?
OpenSearch applies text analysis during indexing or searching of text fields. OpenSearch currently employs the standard analyzer by default for text analysis. It also supports a number of other language analyzers, which are part of the module here. There are native language analyzer plugins as well, such as analysis-icu, analysis-nori, analysis-kuromoji, and analysis-ukrainian, as mentioned here.
This proposal aims to enhance OpenSearch by introducing a Language Analyzer extension using the Language Analyzer plugin and the SDK. The important extension points of Language Analyzer plugins are getAnalyzers, getTokenFilters, and getTokenizers, along with others like getPreConfiguredCharFilters, getPreConfiguredTokenFilters, and getCharFilters.
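For context, here is a minimal sketch of how these extension points are wired on the plugin side today. It assumes the OpenSearch AnalysisPlugin interface; the "demo_lowercase" filter name and the DemoAnalysisPlugin class are illustrative, not an existing plugin:

```java
import java.util.Map;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.opensearch.index.analysis.AbstractTokenFilterFactory;
import org.opensearch.index.analysis.TokenFilterFactory;
import org.opensearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.opensearch.plugins.AnalysisPlugin;
import org.opensearch.plugins.Plugin;

// Illustrative plugin: registers one token filter under the name "demo_lowercase".
// A Language Analyzer extension would expose equivalent extension points via the SDK.
public class DemoAnalysisPlugin extends Plugin implements AnalysisPlugin {
    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        return Map.of("demo_lowercase", (indexSettings, environment, name, settings) ->
            new AbstractTokenFilterFactory(indexSettings, name, settings) {
                @Override
                public TokenStream create(TokenStream tokenStream) {
                    // Wrap the incoming stream with a Lucene filter
                    return new LowerCaseFilter(tokenStream);
                }
            });
    }
}
```

getAnalyzers, getTokenizers, and the pre-configured variants follow the same map-of-providers pattern.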
What users have asked for this feature?
The Language Analyzer extension as an independent process would provide insights into the framework for developing custom plugins for OpenSearch.
What problems are you trying to solve?
The way ingestion/searching works in Lucene is:
An analyzer is selected based on the data being ingested. Analysis converts the data into an indexed representation, stored in the inverted index data structure for efficient retrieval.
Tokenizers break down the pre-processed text generated by the analyzer into tokens.
Example: Sample Text: "OpenSearch is a distributed search engine"
Tokens: [OpenSearch] [is] [a] [distributed] [search] [engine]
Token Filters perform additional processing on tokens, such as converting them to lowercase or removing stop words.
After filtering: [opensearch] [distributed] [search] [engine]
The overall flow can be summarized as follows:
Actual Text → Basic Token Preparation → Lowercase Filtering → Stop-Word Filtering → Filtering by Custom Logic → Final Tokens
The abstract class oal.analysis.TokenStream breaks the incoming text into a sequence of tokens that are retrieved using an iterator-like pattern. TokenStream has two subclasses: oal.analysis.Tokenizer and oal.analysis.TokenFilter. A Tokenizer takes a java.io.Reader as input, whereas a TokenFilter takes another oal.analysis.TokenStream as input. This allows us to chain components together such that the initial Tokenizer gets its input from a Reader and each subsequent TokenFilter operates on tokens from the preceding TokenStream in the chain. [1]
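To make the chaining concrete, here is a minimal sketch of a custom Lucene Analyzer that reproduces the flow above (standard tokenization, then lowercase and stop-word filters) and consumes the resulting TokenStream with the iterator-like pattern; the class and field names are illustrative:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChainDemo {
    // Tokenizer reads raw text; each TokenFilter wraps the preceding TokenStream.
    static Analyzer newChainAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();      // text -> raw tokens
                TokenStream chain = new LowerCaseFilter(source); // lowercase filtering
                chain = new StopFilter(chain, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET); // stop-word filtering
                return new TokenStreamComponents(source, chain);
            }
        };
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = newChainAnalyzer();
             TokenStream ts = analyzer.tokenStream("field", "OpenSearch is a distributed search engine")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {   // iterator-like consumption of tokens
                System.out.print("[" + term + "] ");
            }
            ts.end();                       // prints: [opensearch] [distributed] [search] [engine]
        }
    }
}
```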
Possible solutions
Three potential approaches, along with their respective pros and cons, are outlined below. Note, however, that Lucene's TokenStream class is not serializable, which makes it challenging to transfer over the transport layer.
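As a quick illustration of that obstacle, attempting plain Java serialization of a TokenStream fails at runtime (this sketch assumes lucene-core plus the analysis-common StandardAnalyzer):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// TokenStream does not implement java.io.Serializable, so plain Java
// serialization of an analysis chain fails at runtime.
public class TokenStreamNotSerializable {
    public static void main(String[] args) throws IOException {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("field", "OpenSearch");
             ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(ts); // throws java.io.NotSerializableException
        }
    }
}
```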
[On SDK] Keep the extension points of the Language Analyzer extension in the SDK and let the analysis be done on the extension side. This extension point design will be similar to the current NamedXContent extension point. (A hypothetical sketch follows the pros/cons below.)
Pros:
This avoids transferring the Map<String, AnalysisProvider> to OpenSearch, thus saving a transport call to OpenSearch.
Cons:
Might increase processing time, since each analysis request would be redirected from OpenSearch to the specific extension.
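To make option 1 concrete, a possible request/response shape is sketched below. The names (ExtensionAnalysisService, AnalyzeRequest, AnalyzeResponse) are hypothetical, not actual opensearch-sdk-java API:

```java
import java.util.List;

// Hypothetical shapes for option 1: OpenSearch ships raw text to the extension
// over transport and receives plain serialized tokens back, so the
// non-serializable Lucene TokenStream never crosses process boundaries.
interface ExtensionAnalysisService {
    AnalyzeResponse analyze(AnalyzeRequest request);
}

record AnalyzeRequest(String analyzerName, String text) {}  // hypothetical
record AnalyzeResponse(List<String> tokens) {}              // hypothetical
```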
[On OpenSearch] Create wrapper classes in OpenSearch and invoke the respective Lucene filter factories from the extension, assuming every language analyzer extension uses a filter factory from Lucene. This approach would also require dynamically registering all of the extension points in OpenSearch after bootstrapping is done, similar to dynamic settings. (See the sketch after the cons below.)
Pros:
Simplifies the design, given the way the language analyzer is currently implemented in OpenSearch.
Cons:
Imposes a strict requirement that Language Analyzer extensions use Lucene filter factories exclusively.
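A rough sketch of the option 2 wrapper, assuming Lucene's SPI-based TokenFilterFactory.forName lookup (the wrapper class itself is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical option 2 wrapper: the extension transmits only a Lucene factory
// name plus its arguments; OpenSearch resolves the factory locally through
// Lucene's SPI, so no TokenStream has to cross the wire.
public final class LuceneFactoryWrapper {
    public static TokenStream wrap(String factoryName, Map<String, String> args, TokenStream in) {
        // forName consumes the args map, so pass a mutable copy
        TokenFilterFactory factory = TokenFilterFactory.forName(factoryName, new HashMap<>(args));
        return factory.create(in);
    }
}
```

Usage would look like wrap("lowercase", Map.of(), tokenStream) for Lucene's registered "lowercase" filter.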
[On OpenSearch] Use Java Serializable: serialize the whole extension class and transport it over to OpenSearch from the extension.
Pros:
No need to worry about serializing every entity (such as TokenStream) separately.
Cons:
JEP 154 (https://openjdk.org/jeps/154) discusses removing Serializable from the platform.
What will it take to execute?
Run a language analyzer plugin as an extension in a separate process. (Let's start with Nori.)
Of the approaches mentioned above, the best one to move forward with, IMO, is the first: keep the analyzers, filters, and tokenizers in the SDK when a language analyzer is running as an extension.
Sub-issues for the above:
getAnalyzer extension point for Language Analyzer extension #773
getTokenFilters extension point for Language Analyzer extension #774
getTokenizers extension point for Language Analyzer extension #775
Any remaining open questions?
Looking for suggestions on any other approaches to achieve this, or any thoughts about the proposal.
[1] https://www.javacodegeeks.com/2015/09/lucene-analysis-process-guide.html