
Enhance postal cities resilience against erroneous data #288

Closed
orangejulius opened this issue Mar 6, 2020 · 8 comments · Fixed by #297

Comments

@orangejulius (Member) commented Mar 6, 2020

As it stands now, the postal cities dataset can cause records to have invalid admin hierarchy if even a single record in OSM has an incorrect mapping from postal code to locality.

A good example of this is postal code 11215 in Brooklyn, NY, which currently shows up as part of Geneseo, NY, several hundred miles away.

/v1/search?text=111+8th+Avenue%2C+Brooklyn%2C+Geneseo%2C+NY%2C+USA
[screenshot: search results placing the Brooklyn address in Geneseo, NY]

It turns out there is a single incorrect record in OSM with postal code 11215:
[screenshot: the single incorrect OSM record]

This is enough to introduce an incorrect mapping.

Possible solutions

I'm sure there are many things we can do here, and we might end up including several:

  • Require there to be more than one confirmation of any individual postal code <-> locality mapping, to guard against errors
  • Improve our existing code that rejects matches beyond a certain distance
  • Add checks that can determine mapping outliers. In this case, for example, there are 130 OSM records that confirm the mapping from 11215 to Brooklyn, NY.
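The first and third ideas could be combined into a simple filter. A rough sketch (function name and thresholds are mine, not from the Pelias codebase):

```javascript
// Hypothetical sketch: given all candidate localities for one postal code
// with their OSM occurrence counts, drop candidates that look like outliers
// relative to the dominant mapping, while keeping a lone unambiguous mapping.
function filterOutliers(candidates, minCount = 2, outlierRatio = 0.05) {
  const maxCount = Math.max(...candidates.map(c => c.count));
  return candidates.filter(c => {
    // a single interpretation has nothing to be an outlier against; keep it
    if (candidates.length === 1) return true;
    // otherwise require a minimum number of confirmations...
    if (c.count < minCount) return false;
    // ...and drop anything dwarfed by the dominant mapping
    return c.count / maxCount >= outlierRatio;
  });
}

const candidates = [
  { name: 'Brooklyn', count: 130 },
  { name: 'Geneseo', count: 1 },
  { name: 'New York', count: 1 },
];
console.log(filterOutliers(candidates).map(c => c.name)); // [ 'Brooklyn' ]
```

With these example thresholds, the 11215 case keeps only Brooklyn, while a postal code with a single one-occurrence mapping would still be loaded.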
@missinglink (Member)

IIRC there is a total number of occurrences in OSM preserved in our data for this purpose.

@missinglink (Member)

Do we know how many people mapped this as 11215 = Geneseo?

@orangejulius (Member, Author)

Right, there is only one occurrence. Here's the relevant lines from the USA.tsv data file:

11215	421205765	Brooklyn		borough	130
11215	85978297	Geneseo		locality	1
11215	85977539	New York	NYC	locality	1

@missinglink (Member)

Yeah, a lone wolf, we should probably only load data for occurrences > x.

Where x is 10? Or 5?

@orangejulius (Member, Author)

Yeah, that seems like a good approach. Out of the 39,585-line USA.tsv file, here's the breakdown of occurrence frequency:

awk -F '\t' '{ print $6 }' USA.tsv | sort -n | uniq -c | head -n 20
  11895 1
   4413 2
   2740 3
   2178 4
   1773 5
   1519 6
   1189 7
   1144 8
    782 9
    741 10
    596 11
    603 12
    459 13
    430 14
    363 15
    340 16
    273 17
    275 18
    212 19
    212 20

So for example, there are 11895 mappings with only one occurrence.
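To make the tradeoff concrete (my arithmetic from the histogram above, not a claim about which threshold is right), a quick calculation of how many rows various cutoffs would discard:

```javascript
// How many of the 39585 USA.tsv rows would a "drop rows with fewer than x
// occurrences" rule remove? Counts copied from the histogram above.
const histogram = { 1: 11895, 2: 4413, 3: 2740, 4: 2178, 5: 1773 };
const total = 39585;

function droppedBelow(x) {
  let dropped = 0;
  for (let i = 1; i < x; i++) dropped += histogram[i] || 0;
  return dropped;
}

console.log(droppedBelow(2), (droppedBelow(2) / total * 100).toFixed(1) + '%');
// → 11895 30.0%
console.log(droppedBelow(5), (droppedBelow(5) / total * 100).toFixed(1) + '%');
// → 21226 53.6%
```

So even the gentlest cutoff (x = 2) discards about 30% of the file, and x = 5 discards over half, which is why a flat threshold alone is a blunt instrument.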

@missinglink (Member) commented Mar 7, 2020

Maybe we should make the lastline data an npm module so we don't need to copy the files here on every rebuild?

Just mentioning that because right now we're only using a small fraction of the lastline dataset.

@orangejulius (Member, Author) commented Apr 30, 2020

Just wanted to write down some thoughts on different cases we might want to handle when dealing with errors in OSM data we use to derive postal cities data.

Multiple frequently seen values

When there are a reasonably high number of confirmations for two different mappings of a postal code to a city, we want to keep them both. The more popular one, which is hopefully correct, should be used for display; either way, the key is that searches on both succeed.

A real world example of this is seen for Louisville, KY:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 40047 | 85947523 | Louisville | locality | 14 |
| 40047 | 85946765 | Mount Washington | locality | 13 |

Single unambiguous interpretation

If there's just one mapping from a postal code to a city, we probably want to keep it. As mentioned above there are quite a few of these, so we'd be throwing away essentially 1/3 of the mappings if we ignored this data.

Here's a real world example of a correct zip code mapping that only has a single occurrence in OSM:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 48099 | 85951983 | Troy | locality | 1 |

Multiple interpretations with outlier(s)

In the case where there are multiple interpretations and one or more of them are common, but there are outliers that are uncommon, we probably want to ignore the outliers. Another example from above:

| postal code | WOF ID | Admin Name | Admin Layer | Count |
|---|---|---|---|---|
| 11215 | 421205765 | Brooklyn | borough | 130 |
| 11215 | 85978297 | Geneseo | locality | 1 |
| 11215 | 85977539 | New York | locality | 1 |

In this case, Brooklyn is the correct value. New York is technically incorrect, and Geneseo is completely wrong.

Summary

A strategy for handling all these would be a little more complicated than something as simple as "ignore rows with fewer than X occurrences", but it would be very valuable. Anyone have thoughts on the parameters and strategy we would want to use?
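To seed that discussion, here's one possible parameterization of the three cases above. This is purely illustrative; the function name and the `minShare` threshold are my own, not anything from the Pelias codebase:

```javascript
// Sketch covering the three cases:
// 1. multiple frequently seen values: keep any with a "reasonable" share
// 2. single unambiguous interpretation: keep it regardless of count
// 3. interpretations far below the dominant one are treated as outliers
function selectMappings(rows, { minShare = 0.1 } = {}) {
  if (rows.length === 1) return rows; // case 2
  const max = Math.max(...rows.map(r => r.count));
  return rows.filter(r => r.count / max >= minShare); // cases 1 and 3
}

// case 1: Louisville (14) and Mount Washington (13) both survive
selectMappings([
  { name: 'Louisville', count: 14 },
  { name: 'Mount Washington', count: 13 },
]);

// case 3: Geneseo and New York (1 each, vs Brooklyn's 130) are dropped
selectMappings([
  { name: 'Brooklyn', count: 130 },
  { name: 'Geneseo', count: 1 },
  { name: 'New York', count: 1 },
]);
```

Using a share of the dominant count, rather than an absolute cutoff, is what lets Troy (single occurrence) and Mount Washington (13 vs 14) survive while Geneseo (1 vs 130) does not.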

orangejulius added a commit that referenced this issue May 11, 2020
Because we now manage postal cities replacements for both borough and
locality values, we have to be a bit careful about when we replace each
of them.

After investigating what looked like an error in our logic for
determining the best postal cities match in
#288, it turns out a
significant issue we were seeing in Brooklyn was actually a more subtle
bug.

In short, it looks like if we replace the borough field with a
postal city value, we should _not_ replace the locality value. This PR
implements that change.
@orangejulius (Member, Author)

I looked into this more, and I believe our logic for determining the best postal cities match is correct. We do prefer the most frequent value to use as the display name, and while occasionally erroneous data will make its way in, overall I think our current logic does a good job.

There's one exception, which is when looking at boroughs! I don't think we should replace a locality value after replacing a borough value. Looking at the table of values for US zip code 11215:

11215	421205765	Brooklyn		borough	130
11215	85978297	Geneseo		locality	1
11215	85977539	New York	NYC	locality	1

Both the second and third rows are actually incorrect. The official postal city value for the zip code is the borough of Brooklyn, not the city of New York. That explains why there is only one instance of each value. Our current code is not very resilient against this, because it always looks to replace both the borough and the locality on a record if it can.

#297 implements logic to avoid changing the locality value with postal cities data if the borough value was already changed, ensuring this invalid data is no longer an issue. It definitely resolves the particular problem in Brooklyn with zip code 11215 that caused us to open this issue, and I think once it's merged, our existing logic will be resilient enough that we don't need to change anything right now. :)
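The rule described here can be sketched in a few lines. This is not the actual #297 code; the function and field names are illustrative:

```javascript
// Sketch of the "borough wins" rule: if postal cities data replaces the
// borough, leave the record's locality untouched; only fall back to
// replacing the locality when no borough replacement applies.
function applyPostalCities(record, boroughReplacement, localityReplacement) {
  const updated = { ...record };
  if (boroughReplacement) {
    updated.borough = boroughReplacement; // e.g. Brooklyn for 11215
    return updated; // borough replaced, so do NOT touch the locality
  }
  if (localityReplacement) {
    updated.locality = localityReplacement;
  }
  return updated;
}

applyPostalCities({ borough: null, locality: 'New York' }, 'Brooklyn', 'Geneseo');
// → { borough: 'Brooklyn', locality: 'New York' }  (locality preserved)
```

With this ordering, the erroneous Geneseo locality row for 11215 can never overwrite the record's locality, because the high-count Brooklyn borough row is applied first.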

orangejulius added a commit to pelias/openstreetmap that referenced this issue May 11, 2020
We want to ensure the fix for
pelias/wof-admin-lookup#288, contained in
pelias/wof-admin-lookup#297, is used when
importing.
orangejulius added a commit to pelias/openaddresses that referenced this issue May 11, 2020
We want to ensure the fix for
pelias/wof-admin-lookup#288, contained in
pelias/wof-admin-lookup#297, is used when
importing.