
Enhance postal cities resilience against erroneous data #288

Closed
orangejulius opened this issue Mar 6, 2020 · 8 comments · Fixed by #297

Comments

@orangejulius (Member) commented Mar 6, 2020

As it stands now, the postal cities dataset can cause records to have invalid admin hierarchy if even a single record in OSM has an incorrect mapping from postal code to locality.

A good example of this is postal code 11215 in Brooklyn, NY, which currently shows up as part of Geneseo, NY, several hundred miles away.

/v1/search?text=111+8th+Avenue%2C+Brooklyn%2C+Geneseo%2C+NY%2C+USA
[screenshot: search results placing the Brooklyn address in Geneseo, NY]

It turns out there is a single incorrect record in OSM with postal code 11215:
[screenshot: the single incorrect OSM record]

This is enough to introduce an incorrect mapping.

Possible solutions

I'm sure there are many things we can do here, and we might end up including several:

  • Require there to be more than one confirmation of any individual postal code <-> locality mapping, to guard against errors
  • Improve our existing code that rejects matches beyond a certain distance
  • Add checks that can determine mapping outliers. In this case, for example, there are 130 OSM records that confirm the mapping from 11215 to Brooklyn, NY.
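The first and third ideas could be combined into a simple filter. A rough sketch (function name and thresholds are mine, not from the Pelias codebase):

```javascript
// Hypothetical sketch: given all candidate localities for one postal code
// with their OSM occurrence counts, drop candidates that look like outliers
// relative to the dominant mapping, while keeping a lone unambiguous mapping.
function filterOutliers(candidates, minCount = 2, outlierRatio = 0.05) {
  const maxCount = Math.max(...candidates.map(c => c.count));
  return candidates.filter(c => {
    // a single interpretation has nothing to be an outlier against; keep it
    if (candidates.length === 1) return true;
    // otherwise require a minimum number of confirmations...
    if (c.count < minCount) return false;
    // ...and drop anything dwarfed by the dominant mapping
    return c.count / maxCount >= outlierRatio;
  });
}

const candidates = [
  { name: 'Brooklyn', count: 130 },
  { name: 'Geneseo', count: 1 },
  { name: 'New York', count: 1 },
];
console.log(filterOutliers(candidates).map(c => c.name)); // [ 'Brooklyn' ]
```

With these example thresholds, the 11215 case keeps only Brooklyn, while a postal code with a single one-occurrence mapping would still be loaded.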
@missinglink (Member)

IIRC there is a total number of occurrences in OSM preserved in our data for this purpose.

@missinglink (Member)

Do we know how many people mapped this as 11215 = Geneseo?

@orangejulius (Member, Author)

Right, there is only one occurrence. Here's the relevant lines from the USA.tsv data file:

11215	421205765	Brooklyn		borough	130
11215	85978297	Geneseo		locality	1
11215	85977539	New York	NYC	locality	1

@missinglink (Member)

Yeah, a lone wolf, we should probably only load data for occurrences > x.

Where x is 10? Or 5?

@orangejulius (Member, Author)

Yeah, that seems like a good approach. Out of the 39,585-line USA.tsv file, here's the breakdown of occurrence frequency:

awk -F '\t' '{ print $6 }' USA.tsv | sort -n | uniq -c | head -n 20
  11895 1
   4413 2
   2740 3
   2178 4
   1773 5
   1519 6
   1189 7
   1144 8
    782 9
    741 10
    596 11
    603 12
    459 13
    430 14
    363 15
    340 16
    273 17
    275 18
    212 19
    212 20

So for example, there are 11895 mappings with only one occurrence.
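To make the tradeoff concrete (my arithmetic from the histogram above, not a claim about which threshold is right), a quick calculation of how many rows various cutoffs would discard:

```javascript
// How many of the 39585 USA.tsv rows would a "drop rows with fewer than x
// occurrences" rule remove? Counts copied from the histogram above.
const histogram = { 1: 11895, 2: 4413, 3: 2740, 4: 2178, 5: 1773 };
const total = 39585;

function droppedBelow(x) {
  let dropped = 0;
  for (let i = 1; i < x; i++) dropped += histogram[i] || 0;
  return dropped;
}

console.log(droppedBelow(2), (droppedBelow(2) / total * 100).toFixed(1) + '%');
// → 11895 30.0%
console.log(droppedBelow(5), (droppedBelow(5) / total * 100).toFixed(1) + '%');
// → 21226 53.6%
```

So even the gentlest cutoff (x = 2) discards about 30% of the file, and x = 5 discards over half, which is why a flat threshold alone is a blunt instrument.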

@missinglink (Member) commented Mar 7, 2020

Maybe we should make the lastline data an npm module so we don't need to copy the files here on every rebuild?

Just mentioning that because right now we're only using a small fraction of the lastline dataset.

@orangejulius (Member, Author) commented Apr 30, 2020

Just wanted to write down some thoughts on different cases we might want to handle when dealing with errors in OSM data we use to derive postal cities data.

Multiple frequently seen values

When there are a reasonably high number of confirmations for two different mappings of a postal code to a city, we want to keep them both. The more popular one, which is hopefully correct, should be used for display; either way, the key is that searches on both succeed.

A real world example of this is seen for Louisville, KY:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 40047 | 85947523 | Louisville | locality | 14 |
| 40047 | 85946765 | Mount Washington | locality | 13 |

Single unambiguous interpretation

If there's just one mapping from a postal code to a city, we probably want to keep it. As mentioned above there are quite a few of these, so we'd be throwing away essentially 1/3 of the mappings if we ignored this data.

Here's a real world example of a correct zip code mapping that only has a single occurrence in OSM:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 48099 | 85951983 | Troy | locality | 1 |

Multiple interpretations with outlier(s)

In the case where there are multiple interpretations and one or more of them are common, but there are outliers that are uncommon, we probably want to ignore the outliers. Another example from above:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 11215 | 421205765 | Brooklyn | borough | 130 |
| 11215 | 85978297 | Geneseo | locality | 1 |
| 11215 | 85977539 | New York | locality | 1 |

In this case, Brooklyn is the correct value. New York is technically incorrect, and Geneseo is completely wrong.

Summary

A strategy for handling all these would be a little more complicated than something as simple as "ignore rows with fewer than X occurrences", but it would be very valuable. Anyone have thoughts on the parameters and strategy we would want to use?
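To seed that discussion, here's one possible parameterization of the three cases above. This is purely illustrative; the function name and the `minShare` threshold are my own, not anything from the Pelias codebase:

```javascript
// Sketch covering the three cases:
// 1. multiple frequently seen values: keep any with a "reasonable" share
// 2. single unambiguous interpretation: keep it regardless of count
// 3. interpretations far below the dominant one are treated as outliers
function selectMappings(rows, { minShare = 0.1 } = {}) {
  if (rows.length === 1) return rows; // case 2
  const max = Math.max(...rows.map(r => r.count));
  return rows.filter(r => r.count / max >= minShare); // cases 1 and 3
}

// case 1: Louisville (14) and Mount Washington (13) both survive
selectMappings([
  { name: 'Louisville', count: 14 },
  { name: 'Mount Washington', count: 13 },
]);

// case 3: Geneseo and New York (1 each, vs Brooklyn's 130) are dropped
selectMappings([
  { name: 'Brooklyn', count: 130 },
  { name: 'Geneseo', count: 1 },
  { name: 'New York', count: 1 },
]);
```

Using a share of the dominant count, rather than an absolute cutoff, is what lets Troy (single occurrence) and Mount Washington (13 vs 14) survive while Geneseo (1 vs 130) does not.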

orangejulius added a commit that referenced this issue May 11, 2020
Because we now manage postal cities replacements for both borough and
locality values, we have to be a bit careful about when we replace each
of them.

After investigating what looked like an error in our logic for
determining the best postal cities match in
#288, it turns out a
significant issue we were seeing in Brooklyn was actually a more subtle
bug.

In short, it looks like if we replace the borough field with a
postal city value, we should _not_ replace the locality value. This PR
implements that change.
@orangejulius (Member, Author)

I looked into this more, and I believe our logic for determining the best postal cities match is correct. We do prefer the most frequent value to use as the display name, and while occasionally erroneous data will make its way in, overall I think our current logic does a good job.

There's one exception, which is when looking at boroughs! I don't think we should replace a locality value after replacing a borough value. Looking at the table of values for US zip code 11215:

11215	421205765	Brooklyn		borough	130
11215	85978297	Geneseo		locality	1
11215	85977539	New York	NYC	locality	1

Both the second and third rows are actually incorrect. The official postal city value for the zip code is the borough of Brooklyn, not the city of New York. That explains why there is only one instance of each value. Our current code is not very resilient against this, because it always looks to replace both the borough and the locality on a record if it can.

#297 implements logic to avoid changing the locality value with postal cities data if the borough value was already changed, ensuring this invalid data is no longer an issue. It definitely resolves the particular problem in Brooklyn with zip code 11215 that caused us to open this issue, and I think once it's merged, our existing logic will be resilient enough that we don't need to change anything right now. :)
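The rule described here can be sketched in a few lines. This is not the actual #297 code; the function and field names are illustrative:

```javascript
// Sketch of the "borough wins" rule: if postal cities data replaces the
// borough, leave the record's locality untouched; only fall back to
// replacing the locality when no borough replacement applies.
function applyPostalCities(record, boroughReplacement, localityReplacement) {
  const updated = { ...record };
  if (boroughReplacement) {
    updated.borough = boroughReplacement; // e.g. Brooklyn for 11215
    return updated; // borough replaced, so do NOT touch the locality
  }
  if (localityReplacement) {
    updated.locality = localityReplacement;
  }
  return updated;
}

applyPostalCities({ borough: null, locality: 'New York' }, 'Brooklyn', 'Geneseo');
// → { borough: 'Brooklyn', locality: 'New York' }  (locality preserved)
```

With this ordering, the erroneous Geneseo locality row for 11215 can never overwrite the record's locality, because the high-count Brooklyn borough row is applied first.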

orangejulius added a commit to pelias/openstreetmap that referenced this issue May 11, 2020
We want to ensure the fix for
pelias/wof-admin-lookup#288, contained in
pelias/wof-admin-lookup#297, is used when
importing.
orangejulius added a commit to pelias/openaddresses that referenced this issue May 11, 2020
We want to ensure the fix for
pelias/wof-admin-lookup#288, contained in
pelias/wof-admin-lookup#297, is used when
importing.