-
-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add filter to remove leading zeros to peliasPhrase
analyzer
#475
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Joxit
approved these changes
Feb 16, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
You can add fixes pelias/pelias#898
to your commit
orangejulius
force-pushed
the
postalcodes-removeLeadingZero
branch
from
February 17, 2021 01:08
25bcec2
to
f79eed5
Compare
orangejulius
added a commit
to pelias/acceptance-tests
that referenced
this pull request
Feb 17, 2021
This adds tests for a particular type of postalcode we've had trouble with: those with a leading zero. Tests are added to the existing 'search postalcode' suite, and a new 'autocomplete postalcode' sutie is also created. We have pelias/schema#475 in progress which we hope can fix the current known issues. Connects pelias/pelias#898
This change adds the `removeAllZeroNumericPrefix` filter to the `peliasPhrase` analyzer. The idea is to help ensure postalcodes with a leading zero can show up in autocomplete queries, as reported in pelias/pelias#898. The change is based on two assumptions: - We want the `peliasPhrase`, `peliasQuery`, and `peliasIndexOneEdgeGram` analyzers to all handle leading zeros similarly - We _do_ want to remove leading zeros in our analyzers I recall the original motivation for removing leading zeros is that we sometimes see street names like `05th avenue` in various data sources (or possibly queries), and we want to allow that to match on `5th avenue`. The original code to do this seems to have been written pretty long ago and the [pull request I believe is relevant](#56) doesn't go into a lot of detail, so it's hard to say for sure. Anyway, assuming we do want to handle those cases, it seems like removing leading zeros everywhere will allow us to handle postalcodes that start with zero, of course with some downsides: the leading zeros are ignored completely, so we cant distinguish between `01000` and `1000`, which might be valid postalcodes or housenumbers, for example. This can lead to cases where clearly incorrect results come up, like `1000 main street` matching a request for a hypothetical `01000` postalcode. But I think it's the best we can do without a bunch more work. I tested this code with a global set of postalcodes and it _does_ allow the relevant postalcodes to match. Fixes pelias/pelias#898
orangejulius
force-pushed
the
postalcodes-removeLeadingZero
branch
from
February 18, 2021 02:48
f79eed5
to
8a6ac3a
Compare
Full planet build looks good, our testing shows no regressions introduced, but it does allow finding postalcodes that start with a leading zero! 🥳 Look at us go with all these schema changes lately ⏩ |
orangejulius
added a commit
to pelias/acceptance-tests
that referenced
this pull request
Apr 9, 2021
These were fixed by pelias/schema#475 and reported in pelias/pelias#898.
orangejulius
added a commit
to pelias/acceptance-tests
that referenced
this pull request
Aug 25, 2021
We currently can't handle queries for postalcodes like `03100` perfectly, and they are interpreted identically to a query for `3100`. This means that results for postalcodes like `31000` will come up as well. Since we would prefer the intended postalcode comes up first, these tests have to be marked failing for now. See pelias/pelias#898 and pelias/schema#475 for more discussion.
orangejulius
added a commit
to pelias/acceptance-tests
that referenced
this pull request
Aug 25, 2021
We currently can't handle queries for postalcodes like `03100` perfectly, and they are interpreted identically to a query for `3100`. This means that results for postalcodes like `31000` will come up as well. Since we would prefer the intended postalcode comes up first, these tests have to be marked failing for now. See pelias/pelias#898 and pelias/schema#475 for more discussion.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This change adds the
removeAllZeroNumericPrefix
filter to thepeliasPhrase
analyzer. The idea is to help ensure postalcodes with a leading zero can show up in autocomplete queries, as reported in pelias/pelias#898.The change is based on two assumptions:
peliasPhrase
,peliasQuery
, andpeliasIndexOneEdgeGram
analyzers to all handle leading zeros similarlyI recall the original motivation for removing leading zeros is that we sometimes see street names like
05th avenue
in various data sources (or possibly queries), and we want to allow that to match on5th avenue
.The original code to do this seems to have been written pretty long ago so it's hard to say for sure.
Anyway, assuming we do want to handle those cases, it seems like removing leading zeros everywhere will allow us to handle postalcodes that start with zero, of course with some downsides: the leading zeros are ignored completely, so we cant distinguish between
01000
and1000
, which might be valid postalcodes or housenumbers, for example.This can lead to cases where clearly incorrect results come up, like
1000 main street
matching a request for a hypothetical01000
postalcode. But I think it's the best we can do without a bunch more work.I tested this code with a global set of postalcodes and it does allow the relevant postalcodes to match.
Assuming this is the best idea anyone else has we can move forward with testing this PR on a full planet build and going from there.
Fixes pelias/pelias#898