TokenStream error: "Only <= 256 finite strings are supported" #33

Closed
missinglink opened this issue Feb 26, 2015 · 12 comments

@missinglink
Member

These non-fatal errors are recurring and may be fixed with a config/schema/plugin update. @hkrishna may be able to provide more info.

[2015-02-26 16:58:53,415][DEBUG][action.bulk              ] [Demolition Man] [pelias][0] failed to execute bulk item (index) index {[pelias][osmnode][1974379318], source[{"center_point":{"lat":51.7543469,"lon":-0.3363454},"name":{"default":"St Peter's St o/s St Albans Tandoori"},"type":"node","alpha3":"GBR","admin1":"Hertfordshire","locality":"St Albans","neighborhood":"Porters Wood","admin0":"United Kingdom","admin2":"Hertfordshire","suggest":{"input":["st peter's st o/s st albans tandoori"],"output":"osmnode:1974379318","weight":6}}]}
java.lang.IllegalArgumentException: TokenStream expanded to 336 finite strings. Only <= 256 finite strings are supported
    at org.elasticsearch.search.suggest.completion.CompletionTokenStream.incrementToken(CompletionTokenStream.java:66)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:618)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerIndex(InternalEngine.java:594)
    at org.elasticsearch.index.engine.internal.InternalEngine.index(InternalEngine.java:522)
    at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:425)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:439)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:150)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
@missinglink missinglink changed the title TokenStream errors TokenStream error: "Only <= 256 finite strings are supported" Feb 26, 2015

@hkrishna
Contributor

This happens when the plugin uses a synonym token filter to expand tokens like "St" to "Saint".

In this case, "St Peter's St o/s St Albans Tandoori" gets expanded to "saint street peter s street saint o s street saint albans tandoori", and the plugin attempts to retokenize, which is where the recursive problem arises. This is related to pelias/elasticsearch-plugin#5 for sure.

One short-term solution is to increase max_token_length in the schema.
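A minimal sketch of that short-term workaround, assuming the layout of the Pelias schema's settings.js; the tokenizer name and the value 1024 are illustrative, not the project's actual settings:

```javascript
// Illustrative fragment for schema/settings.js: raise max_token_length on the
// tokenizer so long tokens are not split at the 255-character default.
// Note this caps the *length* of a single token; it does not lift the
// completion suggester's 256-finite-strings limit itself.
var settings = {
  analysis: {
    tokenizer: {
      myTokenizer: {                // hypothetical tokenizer name
        type: 'standard',
        max_token_length: '1024'    // default is 255
      }
    }
  }
};

module.exports = settings;
```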

@sevko
Contributor

sevko commented Mar 12, 2015

Can I close as a duplicate of pelias/elasticsearch-plugin#5?

@flotpk

flotpk commented Apr 24, 2015

Just wanted to find out: how can I increase max_token_length?

@hkrishna
Contributor

Hey @flotpk! Are you having problems with token expansion as well?

Every tokenizer used in an elasticsearch schema has a setting called max_token_length, which defaults to 255 but can be increased.

http://www.elastic.co/guide/en/elasticsearch/reference/1.x/analysis-standard-tokenizer.html

@flotpk

flotpk commented Apr 29, 2015

Hi @hkrishna, thanks a lot for the feedback. I tried it and am still getting this error: [dbclient] [500] IllegalArgumentException[TokenStream expanded to 360 finite strings. Only <= 256 finite strings are supported].

My changes to schema/settings.js:

var settings = {
  "analysis": {
    "analyzer": {
      "suggestions": {
        "type": "custom",
        "tokenizer": "myTokenizer1",
        "filter": ["lowercase", "asciifolding"]
      },
      "pelias": {
        "type": "custom",
        "tokenizer": "myTokenizer1",
        "filter": ["lowercase", "asciifolding", "ampersand", "word_delimiter"]
      },
      "plugin": {
        "type": "pelias-analysis"
      }
    },
    "tokenizer": {
      "myTokenizer1": {
        "type": "whitespace",
        "max_token_length": "900"
      }
    },
    "filter": {
      "ampersand": {
        "type": "pattern_replace",
        "pattern": "[&]",
        "replacement": " and "
      }
    }
  },
  "index": {
    "number_of_replicas": "0",
    "number_of_shards": "1",

    // A safe default can be 65% of the number of bounded cores (bounded at 32), with a minimum of 8 (which is the default in Lucene).
    "index_concurrency": "10"
  }
};

@flotpk

flotpk commented Jun 3, 2015

Hi @hkrishna, I'd like to follow up on this issue. While waiting for the fix, do you have any suggestions for a workaround to avoid the recursive problem with the synonym filter?

@flotpk

flotpk commented Jun 3, 2015

My workaround at the moment is modifying the synonym mapping file in the elasticsearch plugin ("eng.json"), then rebuilding and redeploying it to the elasticsearch plugin dir.

Failed cases:
e.g. quite a number of street names in my data contain "st pl", which generates the following synonyms: saint, street, plain, plains, place, plaza.
By changing the mapping file, I reduced the synonyms to saint, street and place; this somehow helps to escape the "token stream greater than 256 finite strings" problem.

As this is only a workaround, a different combination of abbreviations in another dataset might trigger the same problem again. I wonder if anyone knows the root cause of the above problem: if it is a recursive problem, is this considered a bug in elasticsearch?
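The combinatorial blow-up behind that limit can be sketched as follows; the synonym lists here are illustrative, not the plugin's actual eng.json mappings:

```javascript
// Hypothetical sketch of why synonym expansion overflows the completion
// suggester: each abbreviated token multiplies the number of finite strings,
// so the total is the product of the per-token variant counts.
const synonyms = {
  st: ['st', 'saint', 'street'],                   // 3 variants
  pl: ['pl', 'plain', 'plains', 'place', 'plaza']  // 5 variants
};

function finiteStringCount(tokens) {
  // Tokens with no synonym entry contribute a single variant (themselves).
  return tokens.reduce((n, t) => n * (synonyms[t] || [t]).length, 1);
}

console.log(finiteStringCount(['st', 'peters', 'st']));         // 3 * 1 * 3 = 9
console.log(finiteStringCount(['st', 'pl', 'st', 'pl']));       // 225, still under 256
console.log(finiteStringCount(['st', 'pl', 'st', 'pl', 'st'])); // 675, over the limit
```

Trimming a synonym list, as in the workaround above, shrinks one factor of the product, which is why it keeps some (but not all) inputs under the 256 cap.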

@missinglink
Member Author

This issue is resolved, as the plugin has been deprecated in favour of https://github.com/pelias/schema/blob/master/settings.js#L50-L59

@flotpk

flotpk commented Jul 8, 2015

Thanks for the update @missinglink. I had a quick look and it seems like a total revamp of the "schema"; I don't see "suggest" anymore. Is there any documentation of what has changed and how it affects those who use the older version (with the plugin)?

@missinglink
Member Author

Hey @flotpk, I'm in the process of writing a blog post about it, since it's a major architectural change.

@miniquery

Hi @missinglink, where can I find the blog post about the architectural change that you mentioned?
