# Obtain Hex Addresses

We will use reverse geocoding to find the nearest address to each hex centre.

In [1]:
import geopandas as gpd
import numpy as np

from ratelimiter import RateLimiter
from diskcache import Cache
from tqdm.notebook import tqdm

from geopy.distance import distance
from geopy import Nominatim

import plotly.express as px

In [2]:
# Set cache location for Nominatim requests
nom_cache = '../data/cache/nominatim'

# Create Nominatim geocoder object with custom user agent as required by their terms of service
geocoder = Nominatim(user_agent = 'coursera_capstone')

# Create RateLimiter object to ensure we don't exceed Nominatim's limit of 1 request per second
limiter = RateLimiter(max_calls = 1, period = 1)
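
To make the effect of the limiter concrete, here is a minimal standard-library sketch of a context manager that spaces successive entries at least `period` seconds apart. It is not how the `ratelimiter` package is implemented; the class name `MinIntervalLimiter` and the short 0.2 second period are assumptions chosen only for illustration.

```python
import time


class MinIntervalLimiter:
    """Context manager that keeps successive entries at least `period` seconds apart."""

    def __init__(self, period=1.0):
        self.period = period
        self._last = None  # monotonic timestamp of the previous entry

    def __enter__(self):
        if self._last is not None:
            wait = self.period - (time.monotonic() - self._last)
            if wait > 0:
                time.sleep(wait)  # block until the minimum interval has passed
        self._last = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions raised inside the block


# Hypothetical usage: three "requests" with a 0.2 s minimum spacing
limiter_sketch = MinIntervalLimiter(period=0.2)
start = time.monotonic()
for _ in range(3):
    with limiter_sketch:
        pass  # a geocoder call would go here
elapsed = time.monotonic() - start
print('Three calls took {:.2f} s'.format(elapsed))
```

The first call passes straight through and each later call waits for the remainder of the interval, so three calls take about two periods in total.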

## Load Data

Data is loaded from the GeoJSON file created earlier and displayed.

In [3]:
df_hex = gpd.read_file('../data/bangalore_hex_grid.geojson')

In [4]:
print(df_hex.shape)
df_hex.head()

(2904, 5)


,id,centre_lat,centre_lng,resolution,geometry
0,886014432dfffff,12.964012,77.43864,8,"POLYGON ((77.44280 12.96151, 77.44289 12.96649..."
1,88618936cbfffff,12.708381,77.729075,8,"POLYGON ((77.73325 12.70586, 77.73333 12.71086..."
2,8861892d67fffff,13.112819,77.710739,8,"POLYGON ((77.71492 13.11031, 77.71500 13.11530..."
3,886014c993fffff,12.776673,77.540713,8,"POLYGON ((77.54488 12.77416, 77.54496 12.77915..."
4,8860145837fffff,13.016192,77.494183,8,"POLYGON ((77.49835 13.01369, 77.49843 13.01867..."


We obtain the address closest to each hex centre using Nominatim reverse geocoding. Because Nominatim is rate-limited and therefore slow to query, we cache each response on disk so that reruns do not repeat requests.

In [5]:
# Empty lists to store responses
address = []
addr_coord = []

points = df_hex[['centre_lat', 'centre_lng']].to_records()

queries = 0 # Counts new requests sent to Nominatim (i.e. cache misses)

with Cache(nom_cache) as cache:
    for p in tqdm(points):
        query = (p.centre_lat, p.centre_lng)
        key = str(query) #! key must be a unique string
        if key in cache:
            response = cache[key] # Read cached value
        else:
            with limiter:
                response = geocoder.reverse(query, timeout = 30, addressdetails=True)
                cache[key] = response # Set cache value
                queries += 1
            
        address.append(response.address)
        addr_coord.append((response.latitude, response.longitude))
print('{} new queries made.'.format(queries))

0 new queries made.
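
The check-the-cache-first pattern used in the loop can be sketched without any network access. The version below is only an illustration: a plain `dict` stands in for the `diskcache` cache, and `fake_lookup` is a hypothetical stand-in for `geocoder.reverse`. It shows why a second run over the same points reports no new queries.

```python
def reverse_with_cache(points, cache, lookup):
    """Look up each (lat, lng) point, calling `lookup` only on cache misses.

    Returns the list of results and the number of new lookups made.
    """
    results, n_new = [], 0
    for lat, lng in points:
        key = str((lat, lng))  # same string-key idea as in the loop above
        if key not in cache:
            cache[key] = lookup(lat, lng)
            n_new += 1
        results.append(cache[key])
    return results, n_new


def fake_lookup(lat, lng):
    # Hypothetical stand-in for geocoder.reverse; makes no network request
    return 'address near ({:.3f}, {:.3f})'.format(lat, lng)


cache_sketch = {}  # a dict stands in for the on-disk cache
pts = [(12.964012, 77.43864), (12.708381, 77.729075)]

_, first_run = reverse_with_cache(pts, cache_sketch, fake_lookup)
_, second_run = reverse_with_cache(pts, cache_sketch, fake_lookup)
print('{} new queries on the first run, {} on the second.'.format(first_run, second_run))
```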


## Calculating Distances

We use geopy to calculate the geodesic distance between each hexagon centre and the address returned for it, to check that the addresses are close to the hex centres. Some offset is expected, so we set a threshold above which we flag the distance as an issue.

In [6]:
centre_coord = df_hex[['centre_lat', 'centre_lng']].to_records(index=False)

address_error = [] # Distances (in meters) that exceed the threshold
error_threshold = 300 # Any distance (in meters) above this will be flagged

for centre, addr in zip(centre_coord, addr_coord):
    dist = distance(centre, addr).meters # Geodesic distance via geopy
    if dist > error_threshold:
        address_error.append(dist)
        
percent_below = ((len(centre_coord) - len(address_error))/len(centre_coord))*100
print('{:.2f} percent of the errors are below {} meters'.format(percent_below, error_threshold))

# Plot histogram
address_err_fig = px.histogram(
    x = address_error,
    #// nbins = 200,
    histnorm = 'percent',
    #// cumulative = True,
    template = 'plotly',
)

address_err_fig.update_layout(
    title = "Error in address locations",
    title_x = 0.5,
    xaxis_title = 'Distance between Hex centre & Address (meters)',
    yaxis_title = 'Percentage of flagged hexes',
    bargap = 0.01,
)

88.12 percent of the errors are below 300 meters
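
As a rough cross-check of the threshold logic, the distance can be approximated with the haversine formula using only the standard library. This is a sketch, not what geopy does: geopy's `distance` computes a geodesic on an ellipsoid, whereas haversine assumes a sphere, so the two can differ by a fraction of a percent. The helper name `haversine_m` is an assumption; the coordinates are copied from rows 0 and 2 of the final table in this notebook.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lng) pairs in degrees."""
    lat1, lng1 = map(math.radians, p1)
    lat2, lng2 = map(math.radians, p2)
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# (hex centre, returned address) for rows 0 and 2 of the final table
row0 = haversine_m((12.964012, 77.43864), (12.963196, 77.43862))
row2 = haversine_m((13.112819, 77.710739), (13.099492, 77.714854))

threshold_m = 300  # same threshold as error_threshold above
flags = [d > threshold_m for d in (row0, row2)]
print(round(row0), round(row2), flags)
```

Under this approximation row 0 comes out at roughly 90 m, well inside the threshold, while row 2 comes out at roughly 1.5 km and would be flagged.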


We see that the reverse geocoding has mostly been accurate enough: 88.12 percent of the addresses lie within 300 meters of their hex centre. Our hex radius is around 500m, so most of the addresses fall within their hex, with a few outliers. This isn't hugely important though, since the addresses will only be used as a reference; all other functions will use the coordinates.

If we wanted to, we could try to improve on these results using another API, but for now we will not spend our limited API quotas on this. Let's add the addresses and their coordinates to the dataframe and persist it to a file.

In [7]:
# Assign to new columns
df_hex['address'] = address
df_hex['addr_lat'], df_hex['addr_lng'] = zip(*addr_coord)

df_hex.to_feather('../data/bangalore_hex_data.feather') # Save file

df_hex.head() # Display final table


this is an initial implementation of Parquet/Feather file support and associated metadata.  This is tracking version 0.1.0 of the metadata specification at https://github.com/geopandas/geo-arrow-spec

This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.




,id,centre_lat,centre_lng,resolution,geometry,address,addr_lat,addr_lng
0,886014432dfffff,12.964012,77.43864,8,"POLYGON ((77.44280 12.96151, 77.44289 12.96649...","Kenchanapura, Bangalore South, Bangalore Urban...",12.963196,77.43862
1,88618936cbfffff,12.708381,77.729075,8,"POLYGON ((77.73325 12.70586, 77.73333 12.71086...","Anekal, Bangalore Urban, Karnataka, 562106, India",12.710459,77.72959
2,8861892d67fffff,13.112819,77.710739,8,"POLYGON ((77.71492 13.11031, 77.71500 13.11530...","Marasandura, Bangalore North, Bangalore Urban,...",13.099492,77.714854
3,886014c993fffff,12.776673,77.540713,8,"POLYGON ((77.54488 12.77416, 77.54496 12.77915...","Tattaguppe, Bangalore South, Bangalore Urban, ...",12.777207,77.541815
4,8860145837fffff,13.016192,77.494183,8,"POLYGON ((77.49835 13.01369, 77.49843 13.01867...","Dodda Bidarakallu, Rajarajeshwari Nagar Zone, ...",13.016104,77.493968
