comparing addresses without coordinates #3

restoretheday · 2017-09-06T17:36:50Z

This is more a question than an issue.
My need is to maintain a database of addresses around the world and detect when the same address/venue already exists, even if someone else previously entered it in a different format. However, I generally do not have coordinates, just name/street/city/country etc. Your doc seems to say lat/lon are required for blocking, but can't I block by country, then state, then city, etc.?

Thank you!

albarrentine · 2017-09-12T04:24:48Z

@restoretheday that might be possible in the future but the accuracy would necessarily be a bit lower than working with (even somewhat badly recorded) lat/lons.

I was recently geocoding some voter file addresses, which do not have coordinates, to a subset of OpenAddresses, and have made some (unpublished) changes to support this sort of workflow. The new version has a non-default option for blocking by either postcode or city. However, the thinking there was more around deduping local data sets that are known to have certain boundaries. For instance, in an NYC 5 borough data set, city="Brooklyn" always means the same thing, but internationally (and even just within the US), there are dozens of different cities named Brooklyn. It's also entirely possible to get duplicates of very common addresses, e.g. "1 Main St", so address+city alone is probably not going to be unique. Adding venue name helps, but I'm sure there are edge cases there as well, since venue name frequencies are distributed LogNormal, so a few names, e.g. "Starbucks", are very common.

Adding state in addition to city (in countries where that's applicable) should probably be enough to disambiguate in most cases. The caveat there would be the case of missing data, so if one record had city="Brooklyn", country="US" and the other had city="Brooklyn", state="NY", country="US", they would not match when using city+state+country as a string key. If that's not a problem in your data, it should at least be possible to avoid false positives.
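To make the missing-data caveat concrete, here is a minimal sketch of a string blocking key. The `blocking_key` helper and the dict-based record layout are assumptions for illustration only; they are not part of this library's API.

```python
def blocking_key(record, fields=("country", "state", "city")):
    """Join lowercased field values into one string key.

    Missing or empty fields contribute an empty segment, so a record
    without a state gets a different key than one with a state.
    """
    return "|".join((record.get(f) or "").strip().lower() for f in fields)


# Hypothetical records: the same city, but one is missing the state.
a = {"city": "Brooklyn", "country": "US"}
b = {"city": "Brooklyn", "state": "NY", "country": "US"}

key_a = blocking_key(a)
key_b = blocking_key(b)

# The two records land in different blocks and are never compared.
same_block = key_a == key_b
```

Dropping the state from the key would put both records in the same block again, at the cost of also mixing in every other Brooklyn in the US.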

Depending on the data there are still cases where you would get false negatives though because of things like synonymy. In the NYC-only example, blocking by city alone may not always be useful because many Queens addresses use the neighborhood name as the city, so an address in Jamaica, Queens might be written "Jamaica, NY" and for blocking by city as a string, there'd be no way of matching Jamaica, NY to Queens, NY.

If your system already does some form of coarse geocoding and resolution of place names, you can use the place ID (or multiple potential place IDs) obtained from that process as part of the blocking key(s) and not have to worry about these issues. If not, it might be useful to consider coarse geocoding to something like WoF or GeoNames. Coarse geocoding may be implemented as a feature in libpostal at some point, but it's not on the immediate roadmap.
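A rough sketch of blocking on place IDs, allowing several candidate IDs per record, could look like the following. Everything here is hypothetical: the `PLACE_IDS` table stands in for whatever a coarse geocoder returns, the IDs are invented placeholders rather than real WoF or GeoNames identifiers, and resolving "Jamaica, NY" to the Queens ID is an assumption about how such a geocoder would behave.

```python
from collections import defaultdict
from itertools import combinations

# Stand-in for coarse geocoder output: (city, state, country) -> candidate place IDs.
# The IDs are made-up placeholders, not real gazetteer identifiers.
PLACE_IDS = {
    ("jamaica", "ny", "us"): ["place:queens"],
    ("queens", "ny", "us"): ["place:queens"],
    ("brooklyn", "ny", "us"): ["place:brooklyn-ny"],
    # Without a state the place is ambiguous, so keep every candidate.
    ("brooklyn", "", "us"): ["place:brooklyn-ny", "place:brooklyn-oh"],
}


def place_keys(record):
    """Return the candidate place IDs for a record (empty list if unresolved)."""
    key = tuple(
        (record.get(f) or "").strip().lower() for f in ("city", "state", "country")
    )
    return PLACE_IDS.get(key, [])


def candidate_pairs(records):
    """Pair up record indices that share at least one candidate place ID."""
    blocks = defaultdict(set)
    for idx, rec in enumerate(records):
        for pid in place_keys(rec):
            blocks[pid].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


records = [
    {"city": "Jamaica", "state": "NY", "country": "US"},
    {"city": "Queens", "state": "NY", "country": "US"},
    {"city": "Brooklyn", "country": "US"},
    {"city": "Brooklyn", "state": "NY", "country": "US"},
]

pairs = candidate_pairs(records)
```

With these inputs the Jamaica/Queens records and the two Brooklyn records become candidate pairs, which plain string keys would have missed. Blocking only proposes pairs; the actual address and name comparison still has to decide whether each pair is a duplicate.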

restoretheday · 2017-09-22T18:06:24Z

Yes, my intention was to block by whatever I do have each time, either through your API or as a pre-step (coarse geocoding or name matching).
In my case, I think fuzzy matching on the venue name already reduces the chance of false positives enough that lat/lon is a safety net I could do without when I don't have it, while still using country>state>...>city>... since those are almost always available.
Thanks for the recommendations, I'll check them out; but I thought my use case would be fairly typical, as you seem to have encountered it as well :)

Thanks!
