[BUG] "# => ALPHANUM" in word_delimiter filter's type_table can cause exception. #6867

Closed
keithmo opened this issue Mar 29, 2023 · 5 comments · Fixed by #10220
Labels: bug (Something isn't working), Search:Query Capabilities, Search (Search query, autocomplete ...etc), v2.11.0 (Issues and PRs related to version 2.11.0)

Comments

keithmo commented Mar 29, 2023

Describe the bug

If # => ALPHANUM is the only entry in a word_delimiter filter's type_table, analysis using that filter causes an exception. If the # is replaced with \u0023 then it works as expected. The same behavior is seen with the word_delimiter_graph filter.

To Reproduce

curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

OpenSearch 2.6.0 (and 2.5.0, where I originally noticed this) produces:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "no_such_element_exception",
        "reason" : null
      }
    ],
    "type" : "no_such_element_exception",
    "reason" : null
  },
  "status" : 500
}

Server log:

opensearch-node  | [2023-03-29T15:20:30,600][WARN ][r.suppressed             ] [opensearch-docker-node] path: /_analyze, params: {pretty=}
opensearch-node  | org.opensearch.transport.RemoteTransportException: [opensearch-docker-node][172.31.0.2:9300][indices:admin/analyze[s]]
opensearch-node  | Caused by: java.util.NoSuchElementException
opensearch-node  | 	at java.util.TreeMap.key(TreeMap.java:1602) ~[?:?]
opensearch-node  | 	at java.util.TreeMap.lastKey(TreeMap.java:298) ~[?:?]
opensearch-node  | 	at org.opensearch.analysis.common.WordDelimiterTokenFilterFactory.parseTypes(WordDelimiterTokenFilterFactory.java:156) ~[?:?]
opensearch-node  | 	at org.opensearch.analysis.common.WordDelimiterTokenFilterFactory.<init>(WordDelimiterTokenFilterFactory.java:89) ~[?:?]
opensearch-node  | 	at org.opensearch.index.analysis.AnalysisRegistry.getComponentFactory(AnalysisRegistry.java:156) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.index.analysis.AnalysisRegistry.buildCustomAnalyzer(AnalysisRegistry.java:285) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.buildCustomAnalyzer(TransportAnalyzeAction.java:238) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:165) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:151) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:88) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:73) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
opensearch-node  | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
opensearch-node  | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
opensearch-node  | 	at java.lang.Thread.run(Thread.java:833) [?:?]
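
The trace shows the failure comes from TreeMap.lastKey(), which throws NoSuchElementException when the map is empty. This suggests parseTypes discards the "# => ALPHANUM" line as a comment, leaving the type map with no entries by the time lastKey() is called. A minimal sketch of that failure mode (illustrative only, not the OpenSearch code):

import java.util.TreeMap;

public class EmptyLastKeyDemo {
    public static void main(String[] args) {
        // The type map the rule parser builds; '#' never gets added because
        // the "# => ALPHANUM" line is discarded as a comment.
        TreeMap<Character, Integer> typeMap = new TreeMap<>();
        typeMap.lastKey(); // throws java.util.NoSuchElementException
    }
}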

Notes:

  1. This only happens if # => ALPHANUM is the only entry in the type_table. Adding a second entry causes the analysis to succeed, but it behaves as if the # => ALPHANUM entry were not present. For example:
curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM".
                "@ => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

Produces (incorrect):

{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "text2",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

Note that the text2 token should be #text2. The ordering of the two type_table entries does not matter.

  2. Changing # => ALPHANUM to \\u0023 => ALPHANUM (0x23 is of course the character code for #) causes the analysis to work as expected. This:
curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "\\u0023 => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'

Produces (correct):

{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#text2",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}
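
The likely reason the escaped form works: the raw rule line begins with a backslash rather than a literal #, so a comment check along the lines of line.startsWith("#") never fires, and the escape is only decoded into # when the rule itself is parsed. A tiny sketch of that assumption (the ordering is inferred, not verified against the source):

public class EscapeCheckDemo {
    public static void main(String[] args) {
        // As read from the type_table entry: backslash, 'u', '0', '0', '2', '3', ...
        String raw = "\\u0023 => ALPHANUM";
        System.out.println(raw.startsWith("#")); // false: survives comment stripping
        // The escape is decoded into '#' later, when the rule is actually parsed.
    }
}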

Expected behavior

# => ALPHANUM in a word_delimiter or word_delimiter_graph filter's type_table should treat # as an alphanumeric character.

Plugins

None.

Screenshots

N/A

Host/Environment (please complete the following information):

Official OpenSearch 2.6.0 Docker image run under Docker Desktop 4.17.0 (99724) on macOS 13.2.1 (Intel).

Additional context

None

keithmo added the bug and untriaged labels on Mar 29, 2023
tlfeng added the Search label on Apr 4, 2023
macohen removed the untriaged label on Apr 17, 2023

noCharger (Contributor) commented Sep 19, 2023

In light of recent developments, particularly this commit, OpenSearch seems to be leaning towards supporting comments in user-supplied analyzer files (ref). However, the hashtag (#) remains an uncovered case from the previous commit. I'd like to reopen this discussion and present a couple of approaches and their implications:

Discontinue support for comments in WordDelimiterFilter and WordDelimiterGraphFilter:

Pros:

  • Aligns with the decision taken for MappingCharFilter as noted here. This would offer a uniform stance across token filters.
  • Simplifies parsing and removes potential edge cases.

Cons:

  • Represents a breaking change that might upset existing users who rely on this feature.

Special Case Handling for "# =>" in WordDelimiterFilter and WordDelimiterGraphFilter:

Pros:

  • Ensures backward compatibility, minimizing disruption for current users.
  • Delivers a targeted solution that directly addresses the hashtag (#) use case.

Cons:

  • Adds complexity and potential edge cases, especially if different formats are addressed using RegEx.
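
The ambiguity such a special case has to resolve is easy to see in a hypothetical rules file (contents illustrative):

# this whole line is a comment
@ => ALPHANUM
# => ALPHANUM

By its first character alone, the last line is indistinguishable from a comment, even though it is intended as a rule mapping # itself; the "# =>" prefix is what disambiguates it.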

Note: The following is a preliminary solution I tested on my local setup, and it works seamlessly:

// Treat "# =>" as a mapping rule for '#', not a comment, even when comment removal is enabled.
if (removeComments == false || word.startsWith("#") == false || word.startsWith("# =>") == true) {

Running the original reproduction against this change then succeeds:

curl "http://localhost:9200/_analyze?pretty" \
    -s \
    -X POST \
    -H 'Content-Type: application/json' \
    -d'
{
    "tokenizer": "whitespace",
    "filter": [
        {
            "type": "word_delimiter",
            "split_on_numerics": false,
            "type_table": [
                "# => ALPHANUM"
            ]
        }
    ],
    "text": "text1 #text2"
}
'
{
  "tokens" : [
    {
      "token" : "text1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#text2",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    }
  ]
}

@nathanmyles, if possible, I'd appreciate your insights on this issue, especially in light of your earlier comments here.

cc @dblock @reta @macohen @msfroh

msfroh (Collaborator) commented Sep 20, 2023

I personally like the suggestion to add special-case handling for # => for mapping rules, but I would do it for all mapping rules. That way, we could bring back comments on the mapping char_filter.
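
A hedged sketch of what that generalized check might look like (method name and placement are illustrative, not the actual OpenSearch change):

// Decide whether a line in a mapping-rules file is a comment. A line
// starting with "#" is a comment unless it is itself a rule whose
// left-hand side is the '#' character.
static boolean isComment(String line) {
    String trimmed = line.trim();
    return trimmed.startsWith("#") && !trimmed.startsWith("# =>");
}

Applied uniformly, the same check would let the word delimiter filters and the mapping char_filter keep comment support while still accepting # on the left-hand side of a rule.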

nathanmyles (Contributor) commented

Yeah, I can get behind having a special case for the hashtag mappings.

I’m not familiar with the use case where we’d want to support comments in these mappings, but my understanding is that these configurations can be loaded from a file; if so, we likely do want to maintain comment support.

The special case is probably best to maintain backwards compatibility. The complexity tradeoff doesn’t seem too bad to me.

reta (Collaborator) commented Sep 20, 2023

I personally like the suggestion to add special-case handling for # => for mapping rules, but I would do it for all mapping rules. That way, we could bring back comments on the mapping char_filter.

+1 to that, I think it should be possible to support comments and hashtags nicely

macohen (Contributor) commented Sep 20, 2023

+1 this helps move us forward with hashtags as a thing in modern society while still providing backwards compatibility

noCharger added the v2.11.0 label on Oct 4, 2023