TokenStream error: "Only <= 256 finite strings are supported" #33

Closed
missinglink opened this issue Feb 26, 2015 · 12 comments

@missinglink
Member

These non-fatal errors are recurring and may be fixed with a config/schema/plugin update. @hkrishna may be able to provide more info.

[2015-02-26 16:58:53,415][DEBUG][action.bulk              ] [Demolition Man] [pelias][0] failed to execute bulk item (index) index {[pelias][osmnode][1974379318], source[{"center_point":{"lat":51.7543469,"lon":-0.3363454},"name":{"default":"St Peter's St o/s St Albans Tandoori"},"type":"node","alpha3":"GBR","admin1":"Hertfordshire","locality":"St Albans","neighborhood":"Porters Wood","admin0":"United Kingdom","admin2":"Hertfordshire","suggest":{"input":["st peter's st o/s st albans tandoori"],"output":"osmnode:1974379318","weight":6}}]}
java.lang.IllegalArgumentException: TokenStream expanded to 336 finite strings. Only <= 256 finite strings are supported
    at org.elasticsearch.search.suggest.completion.CompletionTokenStream.incrementToken(CompletionTokenStream.java:66)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:618)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerIndex(InternalEngine.java:594)
    at org.elasticsearch.index.engine.internal.InternalEngine.index(InternalEngine.java:522)
    at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:425)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:439)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:150)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
@missinglink missinglink changed the title TokenStream errors TokenStream error: "Only <= 256 finite strings are supported" Feb 26, 2015

@hkrishna
Contributor

This happens when the plugin uses a synonym token filter to expand tokens like "St" to "Saint".

In this case, "St Peter's St o/s St Albans Tandoori" gets expanded to "saint street peter s street saint o s street saint albans tandoori", and the plugin attempts to retokenize, which is where the recursive problem arises. This is related to pelias/elasticsearch-plugin#5 for sure.

One short-term solution is to increase max_token_length in the schema.
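A minimal sketch of that short-term workaround, assuming the layout of the Pelias schema's settings.js; the tokenizer name and the value 1024 are illustrative, not the project's actual settings:

```javascript
// Illustrative fragment for schema/settings.js: raise max_token_length on the
// tokenizer so long tokens are not split at the 255-character default.
// Note this caps the *length* of a single token; it does not lift the
// completion suggester's 256-finite-strings limit itself.
var settings = {
  analysis: {
    tokenizer: {
      myTokenizer: {                // hypothetical tokenizer name
        type: 'standard',
        max_token_length: '1024'    // default is 255
      }
    }
  }
};

module.exports = settings;
```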

@sevko
Contributor

sevko commented Mar 12, 2015

Can I close as a duplicate of pelias/elasticsearch-plugin#5?

@flotpk

flotpk commented Apr 24, 2015

Just wanted to find out: how can I increase max_token_length?

@hkrishna
Contributor

Hey @flotpk! Are you having problems with token expansion as well?

Every tokenizer used in an elasticsearch schema has a setting called max_token_length, which defaults to 255 but can be increased.

http://www.elastic.co/guide/en/elasticsearch/reference/1.x/analysis-standard-tokenizer.html

@flotpk

flotpk commented Apr 29, 2015

Hi @hkrishna, thanks a lot for the feedback. I tried it and am still getting this error: [dbclient] [500] IllegalArgumentException[TokenStream expanded to 360 finite strings. Only <= 256 finite strings are supported].

My changes to schema/settings.js:

var settings = {
  "analysis": {
    "analyzer": {
      "suggestions": {
        "type": "custom",
        "tokenizer": "myTokenizer1",
        "filter": ["lowercase", "asciifolding"]
      },
      "pelias": {
        "type": "custom",
        "tokenizer": "myTokenizer1",
        "filter": ["lowercase", "asciifolding", "ampersand", "word_delimiter"]
      },
      "plugin": {
        "type": "pelias-analysis"
      }
    },
    "tokenizer": {
      "myTokenizer1": {
        "type": "whitespace",
        "max_token_length": "900"
      }
    },
    "filter": {
      "ampersand": {
        "type": "pattern_replace",
        "pattern": "[&]",
        "replacement": " and "
      }
    }
  },
  "index": {
    "number_of_replicas": "0",
    "number_of_shards": "1",

    // A safe default can be 65% of the number of bounded cores (bounded at 32), with a minimum of 8 (which is the default in Lucene).
    "index_concurrency": "10"
  }
};

@flotpk

flotpk commented Jun 3, 2015

Hi @hkrishna, I'd like to follow up on this issue. While waiting for the fix, do you have any suggestions for a workaround to avoid the recursive problem with the synonym filter?

@flotpk

flotpk commented Jun 3, 2015

My workaround at the moment is modifying the synonym mapping file in the elasticsearch plugin ("eng.json"), then rebuilding and redeploying it to the elasticsearch plugin dir.

Failed cases:
e.g. quite a number of street names in my data contain "st pl", which generates the following synonyms: saint, street, plain, plains, place, plaza.
By changing the mapping file, I reduced the synonyms to saint, street and place; this somehow helps to escape the "token stream greater than 256 finite strings" problem.

As this is only a workaround, a different combination of abbreviations in another dataset might trigger the same problem again. I wonder if anyone knows the root cause of the above problem: if it is a recursive problem, is this considered a bug in elasticsearch?
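The combinatorial blow-up behind that limit can be sketched as follows; the synonym lists here are illustrative, not the plugin's actual eng.json mappings:

```javascript
// Hypothetical sketch of why synonym expansion overflows the completion
// suggester: each abbreviated token multiplies the number of finite strings,
// so the total is the product of the per-token variant counts.
const synonyms = {
  st: ['st', 'saint', 'street'],                   // 3 variants
  pl: ['pl', 'plain', 'plains', 'place', 'plaza']  // 5 variants
};

function finiteStringCount(tokens) {
  // Tokens with no synonym entry contribute a single variant (themselves).
  return tokens.reduce((n, t) => n * (synonyms[t] || [t]).length, 1);
}

console.log(finiteStringCount(['st', 'peters', 'st']));         // 3 * 1 * 3 = 9
console.log(finiteStringCount(['st', 'pl', 'st', 'pl']));       // 225, still under 256
console.log(finiteStringCount(['st', 'pl', 'st', 'pl', 'st'])); // 675, over the limit
```

Trimming a synonym list, as in the workaround above, shrinks one factor of the product, which is why it keeps some (but not all) inputs under the 256 cap.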

@missinglink
Member Author

This issue is resolved, as the plugin has been deprecated in favour of https://github.com/pelias/schema/blob/master/settings.js#L50-L59

@flotpk

flotpk commented Jul 8, 2015

Thanks for the update @missinglink. I had a quick look and it seems like a total revamp of the "schema"; I don't see "suggest" anymore. Is there any documentation of what has changed and how it affects those who use the older version (with the plugin)?

@missinglink
Member Author

Hey @flotpk, I'm in the process of writing a blog post about it, since it's a major architectural change.

@miniquery

Hi @missinglink, where can I find the blog post about the architectural change that you mentioned?
