Enhance postal cities resilience against erroneous data #288
Comments
IIRC there is a total number of occurrences in OSM preserved in our data for this purpose.
Do we know how many people mapped this as 11215 = Geneseo?
Right, there is only one occurrence. Here are the relevant lines from the
Yeah, a lone wolf; we should probably only load data for occurrences > x, where x is 10? Or 5?
Yeah, that seems like a good approach. Out of the 39585 line USA.tsv file, here's the breakdown of occurrence frequency:
So for example, there are 11895 mappings with only one occurrence.
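The "only load data for occurrences > x" idea above can be sketched as a simple filter over the TSV rows. This is a hypothetical sketch: the column layout (postcode, city, count) and the function name are assumptions, not the actual USA.tsv schema or Pelias code.

```javascript
// Sketch: drop postal-city mappings seen fewer than MIN_OCCURRENCES times.
// The (postcode, city, count) column order is an assumption for illustration.
const MIN_OCCURRENCES = 5;

function filterRareMappings(tsv, minCount = MIN_OCCURRENCES) {
  return tsv
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => line.split('\t'))
    .filter(cols => parseInt(cols[2], 10) >= minCount);
}

const sample = [
  '11215\tBrooklyn\t500',
  '11215\tGeneseo\t1'
].join('\n');

console.log(filterRareMappings(sample)); // keeps only the Brooklyn row
```

Note that a flat threshold like this would also discard the ~11895 single-occurrence mappings discussed above, many of which are correct.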
Maybe we should make the lastline data an … Just mentioning that because right now we're only using a small fraction of the lastline dataset.
Just wanted to write down some thoughts on different cases we might want to handle when dealing with errors in OSM data we use to derive postal cities data.

Multiple frequently seen values
When there are a reasonably high number of confirmations for two different mappings of a postal code to a city, we want to keep them both. The more popular should be used for display, which will hopefully be the correct one, but either way allowing searches on both to succeed is key. A real world example of this is seen for
Single unambiguous interpretation
If there's just one mapping from a postal code to a city, we probably want to keep it. As mentioned above there are quite a few of these, so we'd be throwing away essentially 1/3 of the mappings if we ignored this data. Here's a real world example of a correct zip code mapping that only has a single occurrence in OSM
Multiple interpretations with outlier(s)
In the case where there are multiple interpretations and one or more of them are common, but there are outliers that are uncommon, we probably want to ignore the outliers. Another example from above:
In this case, Brooklyn is the correct value. New York is technically incorrect, and Geneseo is completely wrong.

Summary
A strategy for handling all of these would be a little more complicated than something as simple as "ignore rows with fewer than X occurrences", but would be very valuable. Anyone have thoughts on the parameters and strategy we would want to use?
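The three cases above could be combined into one selection rule: keep a lone mapping as-is, and when there are several, keep only those with a meaningful share of the total occurrences. This is a minimal sketch; the function name, record shape, and 5% threshold are all hypothetical parameters, not anything decided in this issue.

```javascript
// Sketch of the strategy above (names and thresholds are assumptions):
// - single mapping: keep it (single unambiguous interpretation)
// - multiple mappings: keep values with a reasonable share of occurrences,
//   drop the rest as outliers; most popular first for display
function selectCities(mappings, minShare = 0.05) {
  if (mappings.length === 1) {
    return mappings;
  }
  const total = mappings.reduce((sum, m) => sum + m.count, 0);
  return mappings
    .filter(m => m.count / total >= minShare)
    .sort((a, b) => b.count - a.count);
}

const zip11215 = [
  { city: 'Brooklyn', count: 500 },
  { city: 'New York', count: 40 },
  { city: 'Geneseo', count: 1 }
];
console.log(selectCities(zip11215).map(m => m.city));
// Brooklyn and New York survive; the Geneseo outlier is dropped
```

A share-based cutoff like this handles both the "lone wolf" outlier and the single-occurrence case in one rule, unlike a flat minimum-count threshold.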
Because we now manage postal cities replacements for both borough and locality values, we have to be a bit careful about when we replace each of them. After investigating what looked like an error in our logic for determining the best postal cities match in #288, it turns out a significant issue we were seeing in Brooklyn was actually a more subtle bug. In short, it looks like if we replace the borough field with a postal city value, we should _not_ replace the locality value. This PR implements that change.
I looked into this more, and I believe our logic for determining the best postal cities match is correct. We do prefer the most frequent value to use as the display name, and while occasionally erroneous data will make its way in, overall I think our current logic does a good job. There's one exception, which is when looking at boroughs! I don't think we should replace a locality value after replacing a borough value. Looking at the table of values for US zip code 11215:
Both the second and third columns are actually incorrect. The official postal city value for the zip code is the borough of Brooklyn, not the city of New York. That explains why there is only one instance of each value. Our current code is not very resilient against this, because it always looks to replace both the borough and the locality on a record if it can. #297 implements logic to avoid changing the locality value with postal cities data if the borough value was already changed, and ensures this invalid data is no longer an issue. It definitely resolves the particular problem in Brooklyn with zip code 11215 that caused us to open this issue, and I think once it's merged, it will mean that our existing logic is resilient enough that we don't need to change anything right now. :)
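The guard described above (don't touch the locality once the borough has been replaced) can be sketched as follows. The function name and record fields here are illustrative assumptions, not the actual wof-admin-lookup code from #297.

```javascript
// Minimal sketch of the guard from #297 (names are assumptions):
// once a postal-city value has replaced the borough, leave the
// locality untouched.
function applyPostalCity(record, postalCity) {
  if (postalCity.borough) {
    record.borough = postalCity.borough;
    return record; // borough replaced: do NOT also replace locality
  }
  if (postalCity.locality) {
    record.locality = postalCity.locality;
  }
  return record;
}

const record = { borough: 'Kings County', locality: 'New York' };
applyPostalCity(record, { borough: 'Brooklyn', locality: 'Geneseo' });
console.log(record); // locality survives even though a mapping suggested Geneseo
```

With this ordering, the single erroneous Geneseo mapping for 11215 can no longer overwrite the locality on Brooklyn records.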
We want to ensure the fix for pelias/wof-admin-lookup#288, contained in pelias/wof-admin-lookup#297, is used when importing.
As it stands now, the postal cities dataset can cause records to have an invalid admin hierarchy if even a single record in OSM has an incorrect mapping from postal code to locality.
A good example of this is postal code 11215 in Brooklyn, NY, which currently shows up as part of Geneseo, NY, several hundred miles away:

/v1/search?text=111+8th+Avenue%2C+Brooklyn%2C+Geneseo%2C+NY%2C+USA
It turns out there is a single incorrect record in OSM with postal code 11215. This is enough to introduce an incorrect mapping.
Possible solutions
I'm sure there are many things we can do here, and we might end up including several:
11215 to Brooklyn, NY.