Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filter to remove leading zeros to peliasPhrase analyzer #475

Merged
merged 1 commit into from
Feb 18, 2021

Conversation

orangejulius
Copy link
Member

@orangejulius orangejulius commented Feb 13, 2021

This change adds the removeAllZeroNumericPrefix filter to the peliasPhrase analyzer. The idea is to help ensure postalcodes with a leading zero can show up in autocomplete queries, as reported in pelias/pelias#898.

The change is based on two assumptions:

  • We want the peliasPhrase, peliasQuery, and peliasIndexOneEdgeGram analyzers to all handle leading zeros similarly
  • We do want to remove leading zeros in our analyzers

I recall the original motivation for removing leading zeros is that we sometimes see street names like 05th avenue in various data sources (or possibly queries), and we want to allow that to match on 5th avenue.

The original code to do this seems to have been written pretty long ago so it's hard to say for sure.

Anyway, assuming we do want to handle those cases, it seems like removing leading zeros everywhere will allow us to handle postalcodes that start with zero, of course with some downsides: the leading zeros are ignored completely, so we cant distinguish between 01000 and 1000, which might be valid postalcodes or housenumbers, for example.
This can lead to cases where clearly incorrect results come up, like 1000 main street matching a request for a hypothetical 01000 postalcode. But I think it's the best we can do without a bunch more work.

I tested this code with a global set of postalcodes and it does allow the relevant postalcodes to match.

Assuming this is the best idea anyone else has we can move forward with testing this PR on a full planet build and going from there.

Fixes pelias/pelias#898

Copy link
Member

@Joxit Joxit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

You can add fixes pelias/pelias#898 to your commit

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 17, 2021
This adds tests for a particular type of postalcode we've had trouble
with: those with a leading zero.

Tests are added to the existing 'search postalcode' suite, and a new
'autocomplete postalcode' sutie is also created.

We have pelias/schema#475 in progress which we
hope can fix the current known issues.

Connects pelias/pelias#898
This change adds the `removeAllZeroNumericPrefix` filter to the
`peliasPhrase` analyzer. The idea is to help ensure postalcodes with a
leading zero can show up in autocomplete queries, as reported in
pelias/pelias#898.

The change is based on two assumptions:
- We want the `peliasPhrase`, `peliasQuery`, and
  `peliasIndexOneEdgeGram` analyzers to all handle leading zeros
  similarly
- We _do_ want to remove leading zeros in our analyzers

I recall the original motivation for removing leading zeros is that we
sometimes see street names like `05th avenue` in various data sources
(or possibly queries), and we want to allow that to match on `5th
avenue`.

The original code to do this seems to have been written pretty long ago
and the [pull request I believe is
relevant](#56) doesn't go into a
lot of detail, so it's hard to say for sure.

Anyway, assuming we do want to handle those cases, it seems like
removing leading zeros everywhere will allow us to handle postalcodes
that start with zero, of course with some downsides: the leading zeros
are ignored completely, so we cant distinguish between `01000` and
`1000`, which might be valid postalcodes or housenumbers, for example.
This can lead to cases where clearly incorrect results come up, like
`1000 main street` matching a request for a hypothetical `01000`
postalcode. But I think it's the best we can do without a bunch more
work.

I tested this code with a global set of postalcodes and it _does_ allow
the relevant postalcodes to match.

Fixes pelias/pelias#898
@orangejulius
Copy link
Member Author

Full planet build looks good, our testing shows no regressions introduced, but it does allow finding postalcodes that start with a leading zero! 🥳

Look at us go with all these schema changes lately ⏩

@orangejulius orangejulius merged commit 7eca249 into master Feb 18, 2021
@orangejulius orangejulius deleted the postalcodes-removeLeadingZero branch February 18, 2021 02:54
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Apr 9, 2021
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Aug 25, 2021
We currently can't handle queries for postalcodes like `03100`
perfectly, and they are interpreted identically to a query for `3100`.
This means that results for postalcodes like `31000` will come up as
well.

Since we would prefer the intended postalcode comes up first, these
tests have to be marked failing for now.

See pelias/pelias#898 and
pelias/schema#475 for more discussion.
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Aug 25, 2021
We currently can't handle queries for postalcodes like `03100`
perfectly, and they are interpreted identically to a query for `3100`.
This means that results for postalcodes like `31000` will come up as
well.

Since we would prefer the intended postalcode comes up first, these
tests have to be marked failing for now.

See pelias/pelias#898 and
pelias/schema#475 for more discussion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

autocomplete does not return postal codes lower than 10000
2 participants