[BUG] "# => ALPHANUM" in word_delimiter filter's type_table can cause exception. #6867

Closed
keithmo opened this issue Mar 29, 2023 · 5 comments · Fixed by #10220
Labels: bug (Something isn't working), Search:Query Capabilities, Search (Search query, autocomplete ...etc), v2.11.0 (Issues and PRs related to version 2.11.0)

Comments

keithmo commented Mar 29, 2023

Describe the bug

If # => ALPHANUM is the only entry in a word_delimiter filter's type_table, analysis using that filter causes an exception. If the # is replaced with \u0023 then it works as expected. The same behavior is seen with the word_delimiter_graph filter.

To Reproduce

curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

OpenSearch 2.6.0 (and 2.5.0, where I originally noticed this) produces:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "no_such_element_exception",
        "reason" : null
      }
    ],
    "type" : "no_such_element_exception",
    "reason" : null
  },
  "status" : 500
}

Server log:

opensearch-node  | [2023-03-29T15:20:30,600][WARN ][r.suppressed             ] [opensearch-docker-node] path: /_analyze, params: {pretty=}
opensearch-node  | org.opensearch.transport.RemoteTransportException: [opensearch-docker-node][172.31.0.2:9300][indices:admin/analyze[s]]
opensearch-node  | Caused by: java.util.NoSuchElementException
opensearch-node  | 	at java.util.TreeMap.key(TreeMap.java:1602) ~[?:?]
opensearch-node  | 	at java.util.TreeMap.lastKey(TreeMap.java:298) ~[?:?]
opensearch-node  | 	at org.opensearch.analysis.common.WordDelimiterTokenFilterFactory.parseTypes(WordDelimiterTokenFilterFactory.java:156) ~[?:?]
opensearch-node  | 	at org.opensearch.analysis.common.WordDelimiterTokenFilterFactory.<init>(WordDelimiterTokenFilterFactory.java:89) ~[?:?]
opensearch-node  | 	at org.opensearch.index.analysis.AnalysisRegistry.getComponentFactory(AnalysisRegistry.java:156) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.index.analysis.AnalysisRegistry.buildCustomAnalyzer(AnalysisRegistry.java:285) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.buildCustomAnalyzer(TransportAnalyzeAction.java:238) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:165) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:151) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:88) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:73) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
opensearch-node  | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
opensearch-node  | 	at java.lang.Thread.run(Thread.java:833) [?:?]
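
The trace shows the failure comes from TreeMap.lastKey(), which throws NoSuchElementException when the map is empty. This suggests parseTypes discards the "# => ALPHANUM" line as a comment, leaving the type map with no entries by the time lastKey() is called. A minimal sketch of that failure mode (illustrative only, not the OpenSearch code):

import java.util.TreeMap;

public class EmptyLastKeyDemo {
    public static void main(String[] args) {
        // The type map the rule parser builds; '#' never gets added because
        // the "# => ALPHANUM" line is discarded as a comment.
        TreeMap<Character, Integer> typeMap = new TreeMap<>();
        typeMap.lastKey(); // throws java.util.NoSuchElementException
    }
}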

Notes:

  1. This only happens if # => ALPHANUM is the only entry in the type_table. Adding a second entry causes the analysis to succeed, but it behaves as if the # => ALPHANUM entry were not present. For example:
curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM".
                "@ => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

Produces (incorrect):

{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "text2",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

Note that the text2 token should be #text2. The ordering of the two type_table entries does not matter.

  2. Changing # => ALPHANUM to \\u0023 => ALPHANUM (0x23 is of course the character code for #) causes the analysis to work as expected. This:
curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "\\u0023 => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

Produces (correct):

{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#text2",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}
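
The likely reason the escaped form works: the raw rule line begins with a backslash rather than a literal #, so a comment check along the lines of line.startsWith("#") never fires, and the escape is only decoded into # when the rule itself is parsed. A tiny sketch of that assumption (the ordering is inferred, not verified against the source):

public class EscapeCheckDemo {
    public static void main(String[] args) {
        // As read from the type_table entry: backslash, 'u', '0', '0', '2', '3', ...
        String raw = "\\u0023 => ALPHANUM";
        System.out.println(raw.startsWith("#")); // false: survives comment stripping
        // The escape is decoded into '#' later, when the rule is actually parsed.
    }
}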

Expected behavior

# => ALPHANUM in a word_delimiter or word_delimiter_graph filter's type_table should treat # as an alphanumeric character.

Plugins

None.

Screenshots

N/A

Host/Environment (please complete the following information):

Official OpenSearch 2.6.0 Docker image run under Docker Desktop 4.17.0 (99724) on macOS 13.2.1 (Intel).

Additional context

None

keithmo added the bug and untriaged labels on Mar 29, 2023
tlfeng added the Search label on Apr 4, 2023
macohen removed the untriaged label on Apr 17, 2023

noCharger (Contributor) commented Sep 19, 2023

In light of recent developments, particularly this commit, OpenSearch seems to be leaning towards supporting comments in user-supplied analyzer files (ref). However, the hashtag (#) remains an uncovered case from the previous commit. I'd like to reopen this discussion and present a couple of approaches and their implications:

Discontinue support for comments in WordDelimiterFilter and WordDelimiterGraphFilter:

Pros:

  • Aligns with the decision taken for MappingCharFilter as noted here. This would offer a uniform stance across token filters.
  • Simplifies parsing and removes potential edge cases.

Cons:

  • Represents a breaking change that might upset existing users who rely on this feature.

Special Case Handling for "# =>" in WordDelimiterFilter and WordDelimiterGraphFilter:

Pros:

  • Ensures backward compatibility, minimizing disruption for current users.
  • Delivers a targeted solution that directly addresses the hashtag (#) use case.

Cons:

  • Adds complexity and potential edge cases, especially if different formats are addressed using RegEx.
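
The ambiguity such a special case has to resolve is easy to see in a hypothetical rules file (contents illustrative):

# this whole line is a comment
@ => ALPHANUM
# => ALPHANUM

By its first character alone, the last line is indistinguishable from a comment, even though it is intended as a rule mapping # itself; the "# =>" prefix is what disambiguates it.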

Note: The following is a preliminary solution I tested on my local setup, and it works seamlessly:

// Treat "# =>" as a mapping rule for '#', not a comment, even when comment removal is enabled.
if (removeComments == false || word.startsWith("#") == false || word.startsWith("# =>") == true) {

Running the original reproduction against this change then succeeds:

curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'
{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#text2",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

@nathanmyles, if possible, I'd appreciate your insights on this issue, especially in light of your earlier comments here.

cc @dblock @reta @macohen @msfroh

msfroh (Collaborator) commented Sep 20, 2023

I personally like the suggestion to add special-case handling for # => for mapping rules, but I would do it for all mapping rules. That way, we could bring back comments on the mapping char_filter.
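
A hedged sketch of what that generalized check might look like (method name and placement are illustrative, not the actual OpenSearch change):

// Decide whether a line in a mapping-rules file is a comment. A line
// starting with "#" is a comment unless it is itself a rule whose
// left-hand side is the '#' character.
static boolean isComment(String line) {
    String trimmed = line.trim();
    return trimmed.startsWith("#") && !trimmed.startsWith("# =>");
}

Applied uniformly, the same check would let the word delimiter filters and the mapping char_filter keep comment support while still accepting # on the left-hand side of a rule.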

nathanmyles (Contributor) commented

Yeah, I can get behind having a special case for the hashtag mappings.

I’m not familiar with the use case where we’d want to support comments in these mappings, but my understanding is that these configurations can be loaded from a file; if so, we likely do want to maintain comment support.

The special case is probably best to maintain backwards compatibility. The complexity tradeoff doesn’t seem too bad to me.

reta (Collaborator) commented Sep 20, 2023

I personally like the suggestion to add special-case handling for # => for mapping rules, but I would do it for all mapping rules. That way, we could bring back comments on the mapping char_filter.

+1 to that, I think it should be possible to support comments and hashtags nicely

macohen (Contributor) commented Sep 20, 2023

+1 this helps move us forward with hashtags as a thing in modern society while still providing backwards compatibility

noCharger added the v2.11.0 label on Oct 4, 2023