# Reverse-geocode Google location history

See [this blog post](http://geoffboeing.com/2016/06/mapping-google-location-history-python/) for my full write-up of this project, or [this one](http://geoffboeing.com/2014/08/reverse-geocode-a-set-of-lat-long-coordinates-to-city-country/) for more about reverse-geocoding with Google.

In the previous notebook, we clustered the location history data to reduce the size of the data set and saved the reduced set as 'location-history-clustered.csv'. Now we'll reverse-geocode it from lat/long to neighborhood, city, state, and country. First, this script copies that CSV file to 'google-history-to-geocode.csv' and uses the copy as our working file for the reverse-geocoding. Results are cached locally to prevent duplicate API calls across multiple runs. Because Google limits your IP address to 2,500 requests per day, we might need to process the entire data set in multiple passes, hence the working file.

Sample API request: https://maps.googleapis.com/maps/api/geocode/json?latlng=39.9058153,-86.054788
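
As a rough sketch of what comes back from a request like this: the parsed JSON is a dict with a `status` string and a `results` list, and each result holds an `address_components` list. The helper name `first_result_or_none` and both sample responses below are hypothetical, hand-written stand-ins (not real API output), used only to show how an empty or over-limit response can be told apart from a usable one.

```python
def first_result_or_none(data):
    """Return the first result of a parsed response, or None if there is none."""
    if not isinstance(data, dict):
        return None
    if data.get('status') != 'OK':
        # e.g. 'ZERO_RESULTS' or 'OVER_QUERY_LIMIT'
        return None
    results = data.get('results', [])
    return results[0] if results else None

# hand-written, heavily abridged sample responses (illustrative values only)
ok_response = {
    'status': 'OK',
    'results': [
        {'address_components': [
            {'long_name': 'Exampleland', 'short_name': 'EX', 'types': ['country', 'political']},
        ]},
    ],
}
over_limit_response = {'status': 'OVER_QUERY_LIMIT', 'results': []}

usable = first_result_or_none(ok_response)
rejected = first_result_or_none(over_limit_response)
print(usable is not None, rejected)
```

Checking `status` before reading `results` is one way to notice that the daily quota has run out, rather than silently treating the location as having no match.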

In [1]:
import pandas as pd, time, requests, json, os.path, logging as lg, datetime as dt

In [2]:
pause = 0.1 #google limits you to 10 requests per second
use_second_geocoder = False #only set True on your last pass, if multiple
max_google_requests = 2500 #how many requests to make of google
google_requests_count = 0
final_output_filename = 'data/google-location-history.csv'

In [3]:
# configure local caching
geocode_cache_filename = 'data/reverse_geocode_cache.js'
cache_save_frequency = 100
requests_count = 0
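# load the cache from disk if it exists, closing the file handle afterwards
if os.path.isfile(geocode_cache_filename):
    with open(geocode_cache_filename, encoding='utf-8') as cache_file:
        geocode_cache = json.load(cache_file)
else:
    geocode_cache = {}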

In [4]:
# create a logger to capture progress
log = lg.getLogger('reverse_geocoder')
if not getattr(log, 'handler_set', None):
    todays_date = dt.datetime.today().strftime('%Y_%m_%d_%H_%M_%S')
    log_filename = 'logs/reverse_geocoder_{}.log'.format(todays_date)
    handler = lg.FileHandler(log_filename, encoding='utf-8')
    formatter = lg.Formatter('%(asctime)s %(levelname)s %(name)s %(message)s')
    handler.setFormatter(formatter)
    log.addHandler(handler)
    log.setLevel(lg.INFO)
    log.handler_set = True

In [5]:
# set up the working file (see note at top of notebook)
working_filename = 'data/google-history-to-geocode.csv'
if not os.path.isfile(working_filename):
    df_temp = pd.read_csv('data/location-history-clustered.csv', encoding='utf-8')
    df_temp.to_csv(working_filename, index=False, encoding='utf-8')

## Define functions

In [6]:
# saves the dict cache to disk as json
def save_cache_to_disk(cache, filename):
    with open(filename, 'w', encoding='utf-8') as cache_file:
        cache_file.write(json.dumps(cache))
    log.info('saved {:,} cached items to {}'.format(len(cache.keys()), filename))

In [7]:
# make an HTTP GET request and return the parsed JSON response
def make_request(url):
    log.info('requesting {}'.format(url))
    return requests.get(url).json()

In [8]:
# parse neighborhood data from a google reverse-geocode result
def get_neighborhood_google(result):
    if pd.notnull(result):
        if 'address_components' in result:
            for component in result['address_components']:
                if 'neighborhood' in component['types']:
                    return component['long_name']
                elif 'sublocality_level_1' in component['types']:
                    return component['long_name']
                elif 'sublocality_level_2' in component['types']:
                    return component['long_name']                

# parse city data from a google reverse-geocode result
# to find city, return the finest-grain address component 
# google returns these components in order from finest to coarsest grained
def get_city_google(result):
    if pd.notnull(result):
        if 'address_components' in result:
            for component in result['address_components']:
                if 'locality' in component['types']:
                    return component['long_name']
                elif 'postal_town' in component['types']:
                    return component['long_name']              
                elif 'administrative_area_level_5' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_4' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_3' in component['types']:
                    return component['long_name']
                elif 'administrative_area_level_2' in component['types']:
                    return component['long_name']

# parse state data from a google reverse-geocode result
# to find state, we want the coarsest-grained admin area available (ideally level 1)
# but google lists components from finest- to coarsest-grained
# so we can't just return as soon as we find the first match:
# instead we keep overwriting and let the last (coarsest) match win
# this is the opposite of the previous function, which wants the finest-grained match
# otherwise we end up with counties and so forth instead of states
def get_state_google(result):
    if pd.notnull(result):
        state = None
        if 'address_components' in result:
            for component in result['address_components']:
                if 'administrative_area_level_1' in component['types']:
                    state = component['long_name']
                elif 'administrative_area_level_2' in component['types']:
                    state = component['long_name']
                elif 'administrative_area_level_3' in component['types']:
                    state = component['long_name']
                elif 'locality' in component['types']:
                    state = component['long_name']
        return state

# parse country data from a google reverse-geocode result
def get_country_google(result):
    if pd.notnull(result):
        if 'address_components' in result:
            for component in result['address_components']:
                if 'country' in component['types']:
                    return component['long_name']
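
The difference between the city and state parsers above comes down to whether the first or the last matching component wins. Here is a minimal sketch of that idea, assuming components arrive ordered from finest to coarsest. The helper `match_component` and the sample components are hypothetical and hand-written for illustration; they are not part of the notebook's pipeline.

```python
CITY_TYPES = ['locality', 'postal_town', 'administrative_area_level_5',
              'administrative_area_level_4', 'administrative_area_level_3',
              'administrative_area_level_2']
STATE_TYPES = ['administrative_area_level_1', 'administrative_area_level_2',
               'administrative_area_level_3', 'locality']

def match_component(components, wanted_types, keep='first'):
    """Return the long_name of the first or last component with a wanted type."""
    found = None
    for component in components:
        if any(t in component['types'] for t in wanted_types):
            if keep == 'first':
                return component['long_name']
            found = component['long_name']  # keep overwriting: last match wins
    return found

# hand-written sample, ordered finest to coarsest (illustrative values only)
components = [
    {'long_name': 'Example City', 'types': ['locality', 'political']},
    {'long_name': 'Example County', 'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'Example State', 'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'Exampleland', 'types': ['country', 'political']},
]

city = match_component(components, CITY_TYPES, keep='first')
state = match_component(components, STATE_TYPES, keep='last')
print(city, '|', state)
```

With this ordering, taking the first match yields the city and taking the last match yields the state rather than the county.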

In [9]:
# parse city, state, country data from a nominatim reverse-geocode result
def parse_nominatim_data(data):
    country = None
    state = None
    city = None
    if isinstance(data, dict):
        if 'address' in data:
            if 'country' in data['address']:
                country = data['address']['country']

            #state
            if 'region' in data['address']:
                state = data['address']['region']
            if 'state' in data['address']:
                state = data['address']['state']

            #city: fall back to county, then village, then city (later matches override)
            if 'county' in data['address']:
                city = data['address']['county']
            if 'village' in data['address']:
                city = data['address']['village']
            if 'city' in data['address']:
                city = data['address']['city']
    return city, state, country
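
The same precedence logic can be written more compactly as an ordered list of keys to try. This is only a sketch: the helper `first_present` and the sample record are hypothetical, and treating `county` as a last-resort stand-in for the city is an assumption about the intended behavior.

```python
def first_present(address, keys):
    """Return the value of the first key in keys that has a non-empty value."""
    for key in keys:
        value = address.get(key)
        if value:
            return value
    return None

# hand-written sample record (illustrative values only)
sample = {
    'address': {
        'village': 'Example Village',
        'county': 'Example County',
        'state': 'Example State',
        'country': 'Exampleland',
    }
}

address = sample.get('address', {})
city = first_present(address, ['city', 'village', 'county'])  # county fallback is an assumption
state = first_present(address, ['state', 'region'])
country = first_present(address, ['country'])
print(city, '|', state, '|', country)
```

Listing the keys in priority order makes the fallback chain visible at a glance, which is harder to see in a sequence of overriding `if` statements.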

In [10]:
# pass latlng data to osm nominatim to reverse geocode it
def reverse_geocode_nominatim(latlng):

    time.sleep(pause)
    url = 'https://nominatim.openstreetmap.org/reverse?format=json&lat={}&lon={}&zoom=18&addressdetails=1'
    data = make_request(url.format(latlng.split(',')[0], latlng.split(',')[1]))

    place = {}
    place['neighborhood'] = None
    place['city'], place['state'], place['country'] = parse_nominatim_data(data)
    return place

In [11]:
# pass the Google API latlng data to reverse geocode it
def reverse_geocode_google(latlng):
    
    global google_requests_count
    
    if google_requests_count < max_google_requests:
        # we have not yet made the max # of requests
        time.sleep(pause)
        google_requests_count += 1
        url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
        data = make_request(url.format(latlng))
        if len(data['results']) > 0:
            result = data['results'][0]
            
            place = {}
            place['neighborhood'] = get_neighborhood_google(result)
            place['city'] = get_city_google(result)
            place['state'] = get_state_google(result)
            place['country'] = get_country_google(result)
            return place

In [12]:
def reverse_geocode(latlng, reverse_geocode_function=reverse_geocode_google, use_cache=True):
    
    global geocode_cache, requests_count
    
    if use_cache and latlng in geocode_cache and pd.notnull(geocode_cache[latlng]):
        log.info('retrieving results from cache for lat-long "{}"'.format(latlng))
        return geocode_cache[latlng]
    else:
        place = reverse_geocode_function(latlng)
        geocode_cache[latlng] = place
        log.info('stored place details in cache for lat-long "{}"'.format(latlng))
        
        requests_count += 1
        if requests_count % cache_save_frequency == 0: 
            save_cache_to_disk(geocode_cache, geocode_cache_filename)
            
        return place
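
To see the caching behavior in isolation, here is a small sketch with stand-in geocoders that make no network calls. The names `cached_lookup`, `fake_geocoder`, and `failing_geocoder` are hypothetical. It illustrates two properties of the approach above: a successful lookup is only requested once, while a failed lookup (stored as `None`) is requested again the next time.

```python
cache = {}
calls = []  # records every lat-long that actually reaches a geocoder

def fake_geocoder(latlng):
    calls.append(latlng)
    return {'city': 'Example City', 'country': 'Exampleland'}

def failing_geocoder(latlng):
    calls.append(latlng)
    return None  # e.g. no results, or the request budget was exhausted

def cached_lookup(latlng, geocode_function):
    cached = cache.get(latlng)
    if cached is not None:
        return cached  # cache hit: no request made
    place = geocode_function(latlng)
    cache[latlng] = place
    return place

first = cached_lookup('37.8631510,-122.2737740', fake_geocoder)
second = cached_lookup('37.8631510,-122.2737740', fake_geocoder)
calls_after_success = len(calls)

cached_lookup('0.0000000,0.0000000', failing_geocoder)
cached_lookup('0.0000000,0.0000000', failing_geocoder)
calls_after_failures = len(calls)

print(calls_after_success, calls_after_failures)
```

Retrying `None` entries is what lets a later pass fill in rows that an earlier pass could not geocode.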

In [13]:
def get_place_by_latlng(latlng, component):
    try:
        return place_dict[latlng][component]
    except (KeyError, TypeError):
        # latlng was never geocoded, or its cached place is None
        return None

## Prep the data for geocoding

If there are more than 2,500 rows in the dataset, you need to run this notebook multiple times because Google limits you to 2,500 requests per day. Alternatively, fall back on the Nominatim API with `use_second_geocoder=True`.
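
As a back-of-the-envelope check, the number of daily passes is the count of lat-longs not yet in the cache divided by the daily limit, rounded up. The function `passes_needed` is a hypothetical helper for illustration, and it assumes every uncached lat-long costs exactly one request.

```python
import math

def passes_needed(uncached_count, daily_limit=2500):
    """Number of daily runs needed to geocode uncached_count locations."""
    return math.ceil(uncached_count / daily_limit)

print(passes_needed(4149))
```

For a data set of 4,149 rows with an empty cache, that works out to two passes.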

In [14]:
df = pd.read_csv(working_filename, encoding='utf-8')
print('{:,} rows in dataset'.format(len(df)))

4,149 rows in dataset


In [15]:
# create city, state, country, neighborhood columns only if they don't already exist
new_cols = ['city', 'state', 'country', 'neighborhood']
for col in new_cols:
    if col not in df.columns:
        df[col] = None
        
# drop the locations and timestamp_ms columns if they are still here
cols_to_remove = ['locations', 'timestamp_ms']
for col in cols_to_remove:
    if col in df.columns:
        df.drop(col, axis=1, inplace=True)
        
df.head()

Unnamed: 0,lat,lon,timestamp_s,datetime,city,state,country,neighborhood
0,37.863151,-122.273774,1381094629,2013-10-06 21:23:49,,,,
1,26.878138,100.242087,1432343754,2015-05-23 01:15:54,,,,
2,48.522827,9.054906,1402128471,2014-06-07 08:07:51,,,,
3,33.309692,-111.902135,1419581216,2014-12-26 08:06:56,,,,
4,-33.893208,151.215002,1464539101,2016-05-29 16:25:01,,,,


In [16]:
# put latlng in the format google likes so it's easy to call their api
# and round to 7 decimal places so our cache's keys are consistent
# (so you don't get weird float precision artifacts like 114.1702368000000001)
df['latlng'] = df.apply(lambda row: '{:.7f},{:.7f}'.format(row['lat'], row['lon']), axis=1)
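
A quick sketch of why the fixed-precision formatting matters for cache keys: two floats that differ only by floating-point noise format to the same string, so they hit the same cache entry. The helper name `make_key` is hypothetical.

```python
def make_key(lat, lon):
    """Format a lat-long pair as a fixed-precision string key."""
    return '{:.7f},{:.7f}'.format(lat, lon)

key = make_key(37.863151, -122.273774)
noisy = make_key(0.1 + 0.2, 114.1702368)  # 0.1 + 0.2 is 0.30000000000000004
clean = make_key(0.3, 114.1702368)

print(key)
print(noisy == clean)
```

Without the rounding, converting the raw floats with `str()` could produce different keys for what is effectively the same location.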

In [17]:
mask = pd.isnull(df['country']) & pd.isnull(df['state']) & pd.isnull(df['city']) & pd.isnull(df['neighborhood'])
ungeocoded_rows = df[mask]
print('{:,} out of {:,} rows lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))
print('We will attempt to reverse-geocode up to {:,} rows with Google'.format(max_google_requests))

4,149 out of 4,149 rows lack reverse-geocode results
We will attempt to reverse-geocode up to 2,500 rows with Google


## Now reverse-geocode the location history with the Google API

In [18]:
unique_latlngs = df['latlng'].dropna().sort_values().unique()
place_dict = {}

In [19]:
for latlng in unique_latlngs:
    place_dict[latlng] = reverse_geocode(latlng, reverse_geocode_google)

In [20]:
for component in ['country', 'state', 'city', 'neighborhood']:
    df[component] = df['latlng'].apply(get_place_by_latlng, args=(component,))

In [21]:
mask = pd.isnull(df['country']) & pd.isnull(df['state']) & pd.isnull(df['city']) & pd.isnull(df['neighborhood'])
ungeocoded_rows = df[mask]
print('{:,} out of {:,} rows still lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))

0 out of 4,149 rows still lack reverse-geocode results


## Next reverse-geocode missing rows with the Nominatim API

If `use_second_geocoder` is True, use the OSM Nominatim API to reverse-geocode any remaining missing rows. Only do this on the final pass. This is useful for places like Kosovo that Google does not return results for.
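
The selection step can be pictured without pandas: only lat-longs whose place is missing, or whose components are all empty, get handed to the second geocoder. This is a sketch under that assumption; `is_ungeocoded` and the sample `places` dict are hypothetical and hand-written.

```python
COMPONENTS = ['country', 'state', 'city', 'neighborhood']

def is_ungeocoded(place):
    """True if a place is missing or every one of its components is empty."""
    if place is None:
        return True
    return all(place.get(component) is None for component in COMPONENTS)

# hand-written sample results from a first pass (illustrative values only)
places = {
    '10.0000000,10.0000000': {'country': 'Exampleland', 'state': None,
                              'city': None, 'neighborhood': None},
    '20.0000000,20.0000000': None,
    '30.0000000,30.0000000': {'country': None, 'state': None,
                              'city': None, 'neighborhood': None},
}

to_retry = sorted(latlng for latlng, place in places.items() if is_ungeocoded(place))
print(to_retry)
```

A place with even one populated component is left alone, which mirrors the all-columns-null mask used in this notebook.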

In [22]:
if use_second_geocoder:
    unique_latlngs = ungeocoded_rows['latlng'].dropna().sort_values().unique()
    for latlng in unique_latlngs:
        place_dict[latlng] = reverse_geocode(latlng, reverse_geocode_nominatim)
    for component in ['country', 'state', 'city', 'neighborhood']:
        df[component] = df['latlng'].apply(get_place_by_latlng, args=(component,))

In [23]:
mask = pd.isnull(df['country']) & pd.isnull(df['state']) & pd.isnull(df['city']) & pd.isnull(df['neighborhood'])
ungeocoded_rows = df[mask]
print('{:,} out of {:,} rows still lack reverse-geocode results'.format(len(ungeocoded_rows), len(df)))

0 out of 4,149 rows still lack reverse-geocode results


## Done: Save to CSV

In [24]:
df.tail()

Unnamed: 0,lat,lon,timestamp_s,datetime,city,state,country,neighborhood,latlng
4144,33.999645,-118.429866,1356999532,2013-01-01 00:18:52,Los Angeles,California,United States,Culver - West,"33.9996450,-118.4298660"
4145,-25.94435,131.9484,1465096999,2016-06-05 03:23:19,Petermann,Northern Territory,Australia,,"-25.9443505,131.9484000"
4146,34.016627,-118.822084,1451757098,2016-01-02 17:51:38,Malibu,California,United States,Western Malibu,"34.0166268,-118.8220838"
4147,34.135661,-117.589982,1451521198,2015-12-31 00:19:58,Rancho Cucamonga,California,United States,,"34.1356612,-117.5899818"
4148,38.452652,-121.849864,1429388120,2015-04-18 20:15:20,Dixon,California,United States,,"38.4526520,-121.8498639"


In [25]:
# save cache to disk
save_cache_to_disk(geocode_cache, geocode_cache_filename)

# save the entire data set to the working file
df.to_csv(working_filename, encoding='utf-8', index=False)

# save the useful columns to a final output file
cols_to_retain = ['datetime', 'neighborhood', 'city', 'state', 'country', 'lat', 'lon']
df[cols_to_retain].to_csv(final_output_filename, encoding='utf-8', index=False)