
Mismatch in synonym analysis between ngram and phrase analyzers #105

Closed
missinglink opened this issue Mar 4, 2016 · 3 comments · Fixed by #127
@missinglink (Member) commented on Mar 4, 2016:

Dev ticket to fix the problems noted in pelias/pelias#211.

The bug affects two 'classes' of tokens (street suffix synonyms and compass directional synonyms). In both cases it is triggered when the final token of the search text is a synonym, and the result is that 0 results are returned:

/v1/autocomplete?text=world trade center      # last token 'center' has a synonym 'ctr'
/v1/autocomplete?text=hackney road            # last token 'road' has a synonym 'rd'
/v1/autocomplete?text=30 west                 # last token 'west' has a synonym 'w'

... all return 0 results

... however, it is not triggered when adding a comma and then specifying an 'admin' component:

/v1/autocomplete?text=30 west, new york

... returns >0 results

This happens because of a mismatch between how the 'ngrams' analyzer handles synonyms and how the 'phrase' analyzer handles them.

The query is split into 'finished' tokens and 'unfinished' tokens, and these two types of tokens are analyzed in different ways.

E.g. for 'world trade center', we know that 'world' and 'trade' are finished (the user is done typing them), but for the last term, 'center', we cannot yet tell whether it is a partial word or a complete word.

So the first two tokens get sent to the 'phrase' analyzer, which is very efficient, while the last token has some trickier analysis applied to it.
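A minimal sketch of that split (the helper below is hypothetical, for illustration only; it is not the actual pelias code):

```python
# Sketch of the finished/unfinished split (hypothetical helper, not the
# actual pelias implementation): everything before the last
# whitespace-delimited token is 'finished'; the last token may still be
# mid-keystroke.
def split_tokens(text):
    tokens = text.strip().split()
    if not tokens:
        return [], ''
    return tokens[:-1], tokens[-1]

finished, unfinished = split_tokens('world trade center')
print(finished)    # ['world', 'trade'] -> sent to the 'phrase' analyzer
print(unfinished)  # 'center'           -> needs prefix (ngram) handling
```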

Since we don't know whether it is complete yet, we have to check it against the ngrams index. However, we have a performance 'hack' in place which uses the phrase analyzer to produce a single token: instead of using the ngrams analyzer to produce [ 'c', 'ce', 'cen', 'cent', 'center' ] we just produce [ 'center' ]. This gives a bit of a performance boost, since searching the other prefixes adds no value.
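For reference, here is a toy reimplementation of the prefix (edge-ngram) expansion described above; it is an illustration, not the actual peliasTwoEdgeGram filter:

```python
# Toy edge-ngram expansion, mirroring the prefix list above
# (not the actual peliasTwoEdgeGram filter).
def edge_ngrams(token, min_gram=1):
    return [token[:i] for i in range(min_gram, len(token) + 1)]

print(edge_ngrams('center'))
# ['c', 'ce', 'cen', 'cent', 'center']
```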

The issue with this is that using the peliasPhrase analyzer against an index created with peliasTwoEdgeGram analysis does not work properly, because the two handle synonyms differently. In the example above the token produced is [ 'ctr' ], not [ 'center' ] as expected; since no docs contain the ngram 'ctr', no results are returned.
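The mismatch can be observed directly with Elasticsearch's _analyze API. A minimal sketch, assuming a modern Elasticsearch that accepts a JSON body for _analyze (newer than the version this issue originally targeted) with the pelias schema installed; the index name 'pelias' and the host are assumptions:

```python
# Compare analyzer output via the Elasticsearch _analyze API.
# Assumptions: a modern Elasticsearch that accepts a JSON body for
# _analyze, the pelias schema installed, and an index named 'pelias'.
import json
import urllib.request

def analyze(analyzer, text, index='pelias', host='http://localhost:9200'):
    body = json.dumps({'analyzer': analyzer, 'text': text}).encode('utf-8')
    req = urllib.request.Request(
        f'{host}/{index}/_analyze',
        data=body,
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as res:
        return [t['token'] for t in json.loads(res.read())['tokens']]

# peliasPhrase applies the synonym filter, so 'center' comes out as
# 'ctr', while the peliasTwoEdgeGram-built index contains no 'ctr'
# ngram -- hence the query token matches nothing.
print(analyze('peliasPhrase', 'center'))
print(analyze('peliasTwoEdgeGram', 'center'))
```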

in progress, more to come.

Connected to pelias/pelias#211

@missinglink missinglink self-assigned this Mar 4, 2016
@missinglink missinglink added this to the Autocomplete Improvements milestone Mar 4, 2016
@missinglink (Member, Author) commented:

[edit] I also noticed another issue in Germany:

When specifying the 'short' version of a street name, such as Grolmanstr. vs. Grolmanstraße, the phrase analyzer and phrase index fail to return any matches.

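For illustration, hypothetical queries of the same form as the ones above (not taken from the original report):

/v1/autocomplete?text=Grolmanstr.             # short form: 0 results
/v1/autocomplete?text=Grolmanstraße           # full form: expected to match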

missinglink added a commit to pelias/acceptance-tests that referenced this issue Apr 21, 2016
@missinglink (Member, Author) commented:

acceptance tests: pelias/acceptance-tests@a92393c

@dianashk (Contributor) commented on May 2, 2016:

All the acceptance tests pass on production.
