autocomplete milestone #127

Merged
merged 21 commits into from Apr 22, 2016

missinglink commented Apr 22, 2016

This PR refactors the analyzers used by the /v1/autocomplete endpoint, with the goals of:

  • removing all interdependencies with the /v1/search endpoint, which makes subsequent refactoring easier.
  • providing a more robust method of handling synonym substitution:
    • by considering the differences between 'index time' analysis and 'query time' analysis.
    • by handling 'partial tokens' (partially complete words) and 'full tokens' differently.

Currently we use 3 different analyzers in the /v1/autocomplete endpoint:

| analyzer | "trade center" |
| --- | --- |
| peliasOneEdgeGram | "t", "tr", "tra", "trad", "trade", "c", "ce", "cen", "cent", "cente", "center" |
| peliasTwoEdgeGram | "tr", "tra", "trad", "trade", "ce", "cen", "cent", "cente", "center" |
| peliasPhrase | "trade", "ctr" |

The peliasPhrase analyzer was originally intended for use with /v1/search. As the example above shows, it handles synonyms differently from the other 2 analyzers: it contracts the word center to "ctr", while the edge-gram analyzers keep "center". This mismatch is the cause of pelias/pelias#211.
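The edge-gram token lists for "trade center" can be reproduced with a few lines of Python. This is only a minimal sketch of the edge n-gram idea, not the Elasticsearch analyzer configuration used by this repository; the function name `edge_ngrams` and its `min_gram` parameter are hypothetical.

```python
def edge_ngrams(text, min_gram=1):
    """Return the leading-edge n-grams of every whitespace-separated word.

    Hypothetical helper: min_gram=1 mimics the one-edge-gram output,
    min_gram=2 mimics the two-edge-gram output.
    """
    grams = []
    for word in text.lower().split():
        # Emit prefixes of increasing length, starting at min_gram characters.
        for size in range(min_gram, len(word) + 1):
            grams.append(word[:size])
    return grams


print(edge_ngrams("trade center", min_gram=1))
print(edge_ngrams("trade center", min_gram=2))
```

With `min_gram=2` the single-character prefixes "t" and "c" are dropped, which is the only difference between the two edge-gram token lists.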

new analyzers:

The new analyzers proposed in this PR are:

| analyzer | tokenizer | partial safe? | "center" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |

They produce the same tokens when given the abbreviated/contracted form "ctr":

| analyzer | tokenizer | partial safe? | "ctr" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "c", "ce", "cen", "cent", "cente", "center" |
| peliasIndexTwoEdgeGram | 2gram | × | "ce", "cen", "cent", "cente", "center" |
| peliasQueryPartialToken | word | | "center" |
| peliasQueryFullToken | keyword | × | "center" |
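One way to picture why "ctr" and "center" yield identical index tokens is to substitute the synonym before generating edge grams. The sketch below assumes exactly that ordering; the `SYNONYMS` table and the `index_edge_grams` function are hypothetical stand-ins, not the actual analyzer definitions.

```python
# Hypothetical synonym table containing only the example from this PR.
SYNONYMS = {"ctr": "center"}


def index_edge_grams(text, min_gram):
    """Expand known contractions, then emit leading-edge n-grams."""
    grams = []
    for word in text.lower().split():
        # Substitute first, so both spellings share one canonical form.
        word = SYNONYMS.get(word, word)
        grams.extend(word[:size] for size in range(min_gram, len(word) + 1))
    return grams


# Both spellings produce the same token stream.
print(index_edge_grams("ctr", 1))
print(index_edge_grams("center", 1))
```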

directionals:

They also handle directional synonyms in a similar way:

| analyzer | tokenizer | partial safe? | "north" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "north" |
| peliasQueryFullToken | keyword | × | "north" |

Again, they produce the same tokens when given the abbreviated/contracted form "n", with the exception of peliasQueryPartialToken, which leaves "n" unexpanded:

| analyzer | tokenizer | partial safe? | "n" |
| --- | --- | --- | --- |
| peliasIndexOneEdgeGram | 1gram | × | "n", "no", "nor", "nort", "north" |
| peliasIndexTwoEdgeGram | 2gram | × | "no", "nor", "nort", "north", "n" |
| peliasQueryPartialToken | word | | "n" |
| peliasQueryFullToken | keyword | × | "north" |

note: the peliasIndexTwoEdgeGram analysis above includes a bit of a 'hack' that is specific to directionals: it adds the single gram 'n' to a token stream which usually only contains grams of size 2+. This improves address matching and reduces 'jitter'.
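The directional 'hack' can be sketched as a two-edge-gram generator that appends the one-letter abbreviation whenever the word is a directional. This is an illustration under assumptions: the `DIRECTIONALS` table and the function name are hypothetical, and the real analyzer may achieve the same result differently.

```python
# Hypothetical table of one-letter directionals and their expansions.
DIRECTIONALS = {"n": "north", "s": "south", "e": "east", "w": "west"}
ABBREVIATIONS = {full: abbr for abbr, full in DIRECTIONALS.items()}


def two_edge_grams_with_directionals(text):
    """Emit grams of size 2+ per word, plus a single gram for directionals."""
    tokens = []
    for word in text.lower().split():
        # Expand "n" to "north" so both spellings index identically.
        word = DIRECTIONALS.get(word, word)
        tokens.extend(word[:size] for size in range(2, len(word) + 1))
        # The 'hack': keep the one-letter form so a query for "n" still matches.
        if word in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[word])
    return tokens


print(two_edge_grams_with_directionals("north"))
print(two_edge_grams_with_directionals("n"))
```

Without the appended single gram, a lone 'n' in a query would have nothing to match in an index that only stores grams of two or more characters.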

api/query changes:

All usages of existing analyzers in /v1/autocomplete must be updated:

  • peliasOneEdgeGram -> peliasQueryPartialToken
  • peliasPhrase -> peliasQueryFullToken

Additionally, the autocomplete queries should no longer need to use the phrase.* index; all queries can safely be performed against the name.* index (if they are not already doing so).
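As a rough illustration of the analyzer rename, the helper below walks a query body and swaps the old analyzer names for the new ones. Only the two rename pairs come from this PR; the function `migrate_analyzers`, the sample query shape, and the field name `name.default` are assumptions made for the example.

```python
# Rename pairs taken from the list in this PR description.
ANALYZER_RENAMES = {
    "peliasOneEdgeGram": "peliasQueryPartialToken",
    "peliasPhrase": "peliasQueryFullToken",
}


def migrate_analyzers(node):
    """Return a copy of a query body with old analyzer names replaced."""
    if isinstance(node, dict):
        migrated = {}
        for key, value in node.items():
            if key == "analyzer" and isinstance(value, str):
                migrated[key] = ANALYZER_RENAMES.get(value, value)
            else:
                migrated[key] = migrate_analyzers(value)
        return migrated
    if isinstance(node, list):
        return [migrate_analyzers(item) for item in node]
    return node


# Hypothetical query body; "name.default" is an assumed field name.
old_query = {
    "query": {
        "match": {
            "name.default": {"query": "trade ctr", "analyzer": "peliasPhrase"}
        }
    }
}
print(migrate_analyzers(old_query))
```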

note: we can discuss removing the phrase.* index completely! This would greatly reduce the cluster disk/RAM usage, and it might be possible to achieve all the functionality of /v1/search using the prefixGram index. Let's discuss this in another issue.

dataset importer changes:

nil

risks / expected acceptance test changes:

There is not much that can go wrong here. The only differences at index time are that:

  • peliasIndexOneEdgeGram expands directionals whereas peliasOneEdgeGram does not.
  • peliasIndexTwoEdgeGram is the same and includes the 'hack' mentioned above.

The differences at query time are:

  • issue 211 is resolved
  • expect to see better handling of queries containing a single directional gram such as 'w 26 st'.
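The query-time behaviour shown in the token examples earlier can be sketched as two small functions: one for full tokens, which expands both contractions and directionals, and one for partial tokens, which leaves a lone 'n' alone because it may be the start of a longer word. Treating the last token of the input as the partial one is an assumption made for this sketch, and the synonym tables and function names are hypothetical.

```python
# Hypothetical synonym tables, inferred from the examples in this PR.
CONTRACTIONS = {"ctr": "center"}
DIRECTIONALS = {"n": "north", "s": "south", "e": "east", "w": "west"}


def query_partial_token(token):
    """Expand contractions only; a lone 'n' might still be an unfinished word."""
    token = token.lower()
    return CONTRACTIONS.get(token, token)


def query_full_token(token):
    """Expand contractions and single-letter directionals."""
    token = query_partial_token(token)
    return DIRECTIONALS.get(token, token)


def analyze_query(text):
    """Assumption: every token but the last is complete; the last is partial."""
    tokens = text.split()
    if not tokens:
        return []
    *full, last = tokens
    return [query_full_token(t) for t in full] + [query_partial_token(last)]


print(analyze_query("w 26 st"))
```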

I've left some other changes I would like to make for a future PR, in order to reduce the number of changes going in at the same time.

related:

closes #96 (contained in this branch)
closes #109 (contained)
closes #113 (contained)

closes #105
resolves pelias/pelias#211
related pelias/openaddresses#68

@missinglink missinglink merged commit d38ad7d into master Apr 22, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@missinglink missinglink self-assigned this Apr 22, 2016

@missinglink missinglink referenced this pull request in pelias/api Apr 29, 2016

Merged

autocomplete milestone #526


orangejulius Apr 29, 2016

Member

I copied and pasted the PR notes from #109 to here since we're going to link directly to this PR in the release notes!


@orangejulius orangejulius deleted the missinglink branch May 24, 2016

je-l pushed a commit to nlsfi/pelias-schema that referenced this pull request Aug 31, 2017
