comparing addresses without coordinates #3
Comments
@restoretheday that might be possible in the future, but the accuracy would necessarily be a bit lower than working with (even somewhat badly recorded) lat/lons. I was recently geocoding some voter file addresses, which do not have coordinates, to a subset of OpenAddresses, and have made some (unpublished) changes to support this sort of workflow. The new version has a non-default option for blocking by either postcode or city. However, the thinking there was more around deduping local data sets that are known to have certain boundaries. For instance, in an NYC five-borough data set, city="Brooklyn" always means the same thing, but internationally (and even just within the US), there are dozens of different cities named Brooklyn.

It's also entirely possible to get duplicates of very common addresses, e.g. "1 Main St", so address+city alone is probably not going to be unique. Adding venue name helps, but I'm sure there are edge cases there too, since venue names follow a log-normal frequency distribution, so the most frequent ones, e.g. "Starbucks", are very common. Adding state in addition to city (in countries where that's applicable) should probably be enough to disambiguate in most cases. The caveat there is missing data: if one record had city="Brooklyn", country="US" and the other had city="Brooklyn", state="NY", country="US", they would not match when using city+state+country as a string key. If that's not a problem in your data, it should at least be possible to avoid false positives.

Depending on the data, there are still cases where you would get false negatives because of things like synonymy. In the NYC-only example, blocking by city alone may not always be useful because many Queens addresses use the neighborhood name as the city. An address in Jamaica, Queens might be written "Jamaica, NY", and when blocking by city as a string, there'd be no way of matching Jamaica, NY to Queens, NY.
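To make the missing-data caveat concrete, here is a minimal sketch of string-key blocking in plain Python. It does not use the library's API; `blocking_key` and `block` are hypothetical helpers, and the records are made up for illustration.

```python
from collections import defaultdict

def blocking_key(record, fields=("city", "state", "country")):
    """Join lowercased field values into one string key; missing fields become empty."""
    return "|".join((record.get(f) or "").strip().lower() for f in fields)

def block(records, fields=("city", "state", "country")):
    """Group records that share exactly the same string key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r, fields)].append(r["id"])
    return dict(blocks)

# Made-up records: 1 and 2 are plausibly the same address, 3 is a different Brooklyn.
records = [
    {"id": 1, "street": "1 Main St", "city": "Brooklyn", "country": "US"},
    {"id": 2, "street": "1 Main Street", "city": "Brooklyn", "state": "NY", "country": "US"},
    {"id": 3, "street": "1 Main St", "city": "Brooklyn", "state": "OH", "country": "US"},
]

# city+state+country: record 1 has no state, so it never meets record 2 (false negative).
strict = block(records)

# city+country: records 1 and 2 now meet, but so does the Ohio record (false-positive risk).
loose = block(records, fields=("city", "country"))
```

With the strict key each record lands in its own block; with the loose key all three share one block, so the later pairwise comparison has to tell the two Brooklyns apart.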
If your system already does some form of coarse geocoding and resolution of place names, you can use the place ID (or multiple potential place IDs) obtained from that process as part of the blocking key(s) and not have to worry about these issues. If not, it might be useful to consider coarse geocoding to something like WoF or GeoNames. Coarse geocoding may be implemented as a feature in libpostal at some point, but it's not on the immediate roadmap.
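As a rough sketch of the place-ID idea, assuming a coarse geocoder that returns one or more candidate place IDs per record: the `PLACE_IDS` table below is a hypothetical stand-in for that lookup, and the IDs are invented, not real WoF or GeoNames identifiers.

```python
from collections import defaultdict

# Hypothetical coarse-geocoder output: (city, state, country) -> candidate place IDs.
# An ambiguous or incomplete place name maps to several candidates.
PLACE_IDS = {
    ("jamaica", "ny", "us"): {"place:queens"},
    ("queens", "ny", "us"): {"place:queens"},
    ("brooklyn", "ny", "us"): {"place:brooklyn-ny"},
    ("brooklyn", "", "us"): {"place:brooklyn-ny", "place:brooklyn-oh"},
}

def place_keys(record):
    """Return every candidate place ID for a record (empty set if unresolved)."""
    key = tuple((record.get(f) or "").strip().lower() for f in ("city", "state", "country"))
    return PLACE_IDS.get(key, set())

def candidate_pairs(records):
    """Pair up records that share at least one candidate place ID."""
    blocks = defaultdict(set)
    for r in records:
        for pid in place_keys(r):
            blocks[pid].add(r["id"])
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs

records = [
    {"id": 1, "city": "Jamaica", "state": "NY", "country": "US"},
    {"id": 2, "city": "Queens", "state": "NY", "country": "US"},
    {"id": 3, "city": "Brooklyn", "country": "US"},  # state missing
    {"id": 4, "city": "Brooklyn", "state": "NY", "country": "US"},
]

pairs = candidate_pairs(records)
```

Because each record can contribute several blocking keys, "Jamaica, NY" meets "Queens, NY", and the Brooklyn record with no state still meets the one with state="NY".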
Yes, my intention was to block by whatever I do have each time, either through your API or as a pre-step (coarse geocoding or name matching). Thanks!
This is more a question than an issue.
I need to maintain a database of addresses around the world and detect when the same address/venue already exists, even when it was previously entered by someone else in a different format. However, I generally do not have coordinates, just name/street/city/country etc. Your doc seems to say lat/lon are required for blocking, but can't I block by country, then state, then city, etc.?
Thank you!