comparing addresses without coordinates #3

Open
restoretheday opened this issue Sep 6, 2017 · 2 comments
Comments

@restoretheday

This is more a question than an issue.
I need to maintain a database of addresses around the world and detect when the same address/venue already exists, even if someone else previously entered it in a different format. However, I generally do not have coordinates, just name/street/city/country etc. Your docs seem to say lat/lon are required for blocking, but can't I block by country, then state, then city, etc.?

Thank you!

@albarrentine
Contributor

@restoretheday that might be possible in the future but the accuracy would necessarily be a bit lower than working with (even somewhat badly recorded) lat/lons.

I was recently geocoding some voter file addresses, which do not have coordinates, to a subset of OpenAddresses, and have made some (unpublished) changes to support this sort of workflow. The new version has a non-default option for blocking by either postcode or city. However, the thinking there was more around deduping local data sets that are known to have certain boundaries. For instance, in an NYC 5 borough data set, city="Brooklyn" always means the same thing, but internationally (and even just within the US), there are dozens of different cities named Brooklyn. Very common addresses, e.g. "1 Main St", also occur in many different places, so address+city alone is probably not going to be unique. Adding venue name helps, but I'm sure there are edge cases there too: venue names are distributed roughly log-normally, so a few names, e.g. "Starbucks", occur very frequently.

Adding state in addition to city (in countries where that's applicable) should probably be enough to disambiguate in most cases. The caveat there would be the case of missing data, so if one record had city="Brooklyn", country="US" and the other had city="Brooklyn", state="NY", country="US", they would not match when using city+state+country as a string key. If that's not a problem in your data, it should at least be possible to avoid false positives.
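To make the missing-data caveat concrete, here is a minimal sketch of string-key blocking in plain Python. It is not libpostal code; `blocking_key` and the sample records are hypothetical and only illustrate how a record without a state lands in a different block.

```python
from collections import defaultdict

def blocking_key(record, fields=("city", "state", "country")):
    """Join lowercased field values into one string key; a missing field becomes ''."""
    return "|".join((record.get(f) or "").strip().lower() for f in fields)

# Hypothetical records: 1 and 2 carry a state, 3 does not.
records = [
    {"id": 1, "street": "1 Main St", "city": "Brooklyn", "state": "NY", "country": "US"},
    {"id": 2, "street": "1 Main Street", "city": "brooklyn", "state": "NY", "country": "US"},
    {"id": 3, "street": "1 Main St", "city": "Brooklyn", "country": "US"},  # state missing
]

# Group record IDs by key; only records sharing a key are ever compared.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r["id"])
blocks = dict(blocks)

print(blocks)
```

Records 1 and 2 share the key `brooklyn|ny|us`, while record 3 gets `brooklyn||us` and is never compared against them, which is the false negative described above.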

Depending on the data there are still cases where you would get false negatives though because of things like synonymy. In the NYC-only example, blocking by city alone may not always be useful because many Queens addresses use the neighborhood name as the city, so an address in Jamaica, Queens might be written "Jamaica, NY" and for blocking by city as a string, there'd be no way of matching Jamaica, NY to Queens, NY.

If your system already does some form of coarse geocoding and resolution of place names, you can use the place ID (or multiple potential place IDs) obtained from that process as part of the blocking key(s) and not have to worry about these issues. If not, it might be useful to consider coarse geocoding to something like Who's On First (WoF) or GeoNames. Coarse geocoding may be implemented as a feature in libpostal at some point, but it's not on the immediate roadmap.
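As a rough illustration of blocking on place IDs, the sketch below assumes a coarse geocoder has already produced a lookup from (city, state, country) to one or more candidate place IDs. The `PLACE_IDS` table, its IDs, and the helper names are all invented for the example; they are not real WoF or GeoNames identifiers and not part of libpostal.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical coarse-geocoder output: (city, state, country) -> candidate place IDs.
# The IDs are invented placeholders, not real WoF or GeoNames identifiers.
PLACE_IDS = {
    ("jamaica", "ny", "us"): ["place:queens"],
    ("queens", "ny", "us"): ["place:queens"],
    ("brooklyn", "ny", "us"): ["place:brooklyn"],
    ("brooklyn", "", "us"): ["place:brooklyn", "place:brooklyn-ia"],  # ambiguous without state
}

def place_keys(record):
    """Return every candidate place ID for a record (empty list if unresolved)."""
    key = tuple((record.get(f) or "").strip().lower() for f in ("city", "state", "country"))
    return PLACE_IDS.get(key, [])

def candidate_pairs(records):
    """Put each record in one block per candidate place ID, then pair up block members."""
    blocks = defaultdict(set)
    for r in records:
        for pid in place_keys(r):
            blocks[pid].add(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [
    {"id": 1, "city": "Jamaica", "state": "NY", "country": "US"},
    {"id": 2, "city": "Queens", "state": "NY", "country": "US"},
    {"id": 3, "city": "Brooklyn", "state": "NY", "country": "US"},
    {"id": 4, "city": "Brooklyn", "country": "US"},
]
pairs = candidate_pairs(records)
print(sorted(pairs))
```

With these made-up IDs, "Jamaica, NY" and "Queens, NY" end up in the same block, and the Brooklyn record with no state is still compared against the one with state="NY" because it is placed in every block it could plausibly belong to.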

@restoretheday
Author

Yes, my intention was to block by whatever I have each time, either through your API or as a pre-step (coarse geocoding or name matching).
In my case, I think venue name fuzzy matching already reduces the chance of false positives enough that lat/lon is a safety net I could do without when I don't have it, while still using country > state > ... > city > ..., as those are almost always available.
Thanks for the recommendations, I'll check them out. I thought my use case would be fairly typical, though, as you seem to have encountered it as well :)
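A minimal sketch of that idea, assuming plain dict records and using the standard library's `difflib.SequenceMatcher` for the fuzzy name comparison. The field names, the 0.85 threshold, and the helper functions are assumptions for illustration, not anything from libpostal. A missing level is treated as "unknown" rather than as a mismatch, which avoids the missing-state false negative at the cost of more candidate pairs.

```python
from difflib import SequenceMatcher

LEVELS = ("country", "state", "city")  # coarse to fine

def norm(value):
    return (value or "").strip().lower()

def admin_compatible(a, b):
    """False only if both records fill in some level and the values differ."""
    for level in LEVELS:
        va, vb = norm(a.get(level)), norm(b.get(level))
        if va and vb and va != vb:
            return False
    return True

def name_similarity(a, b):
    """Character-level similarity of the venue names, in [0, 1]."""
    return SequenceMatcher(None, norm(a.get("name")), norm(b.get("name"))).ratio()

def likely_duplicate(a, b, threshold=0.85):
    # threshold is an arbitrary illustrative value; tune it on real data.
    return admin_compatible(a, b) and name_similarity(a, b) >= threshold

a = {"name": "Joe's Pizza", "city": "Brooklyn", "state": "NY", "country": "US"}
b = {"name": "Joes Pizza", "city": "Brooklyn", "country": "US"}            # state missing
c = {"name": "Joe's Pizza", "city": "Brooklyn", "state": "IA", "country": "US"}

print(likely_duplicate(a, b), likely_duplicate(a, c))
```

Here `a` and `b` are flagged as likely duplicates despite the missing state, while `a` and `c` are rejected because their states conflict. This compares records pairwise and ignores the street address, so on a worldwide data set it would still need a cheap first-pass block (for example by country) and an address comparison on top.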

Thanks!
