# Reverse geocoding

This is a prototype for a simple tool that bins entities into a geography based on their latitude and longitude.

As well as prototyping the binner and testing it with the places data from CrunchBase, we use this notebook to download the NUTS2 and LEP shapefiles that we will be using in the project.


## Preamble

In [1]:
%run ../notebook_preamble.ipy

In [2]:
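#pandas and requests are used by the functions below, so we import them explicitly
import pandas as pd
import requests
import geopandas as gp
from data_getters.core import get_engine
from zipfile import ZipFile
from io import BytesIO
from shapely.geometry import Point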

In [3]:
def get_daps_data(table,connection,chunksize=1000):
    '''
    Utility function to get data from DAPS with less faff
    
    Args:
        table (str): the SQL table in DAPS that we are extracting
        connection: the database connection we are using
        chunksize (int): number of rows to download in each chunk
    
    Returns:
        A dataframe with the data we have collected
    
    '''
    #Get chunks (an iterator of dataframes with chunksize rows each)
    chunks = pd.read_sql_table(table, connection, chunksize=chunksize)
    
    #Create df by concatenating the chunks
    df = pd.concat(chunks)
    
    #Return data
    return df
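
As a rough illustration of the chunked read that `get_daps_data` relies on, the sketch below swaps the DAPS MySQL engine for an in-memory SQLite database. That swap is an assumption made purely so the example runs anywhere, and it is why the sketch uses `pd.read_sql_query` instead of `pd.read_sql_table`, which needs a SQLAlchemy connection. The `read_in_chunks` helper and the toy rows are hypothetical.

```python
import sqlite3

import pandas as pd


def read_in_chunks(query, connection, chunksize=2):
    """Read a query result in chunks and stitch the chunks into one dataframe."""
    # With chunksize set, pandas returns an iterator of dataframes
    chunks = pd.read_sql_query(query, connection, chunksize=chunksize)
    # ignore_index gives the combined frame a single 0..n-1 index
    return pd.concat(chunks, ignore_index=True)


# Hypothetical stand-in for the DAPS table: a tiny in-memory SQLite table
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE geographic_data (id TEXT, latitude REAL, longitude REAL)"
)
con_demo.executemany(
    "INSERT INTO geographic_data VALUES (?, ?, ?)",
    [
        ("place_a", 1.0, 10.0),
        ("place_b", 2.0, 20.0),
        ("place_c", 3.0, 30.0),
        ("place_d", 4.0, 40.0),
        ("place_e", 5.0, 50.0),
    ],
)
con_demo.commit()

# Five rows read two at a time, then combined
demo_df = read_in_chunks("SELECT * FROM geographic_data", con_demo)
con_demo.close()
print(demo_df)
```

Reading in chunks keeps each round trip to the database small; the full table still ends up in memory once the chunks are concatenated.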

In [4]:
def get_shapes(url,file_name,path='../../data/shapefiles/'):
    '''
    Utility function to download, extract and save a zipped set of shapefiles from the ONS Open Geography Portal
    
    Args:
        url (str): url for the shapefile zip
        file_name (str): name of the folder where we want to extract the data
        path (str): directory where the folder is created
    
    '''
    #Get the data
    print(f'getting {file_name}...')
    req = requests.get(url)
    
    #Stop here if the download failed
    req.raise_for_status()
    
    #Parse the content (the zip is read from memory, not from disk)
    z = ZipFile(BytesIO(req.content))
    
    #Save
    print(f'saving {file_name}...')
    z.extractall(f'{path}{file_name}')
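
The in-memory unzip step in `get_shapes` can be tried without a network connection. The sketch below is a minimal, hypothetical stand-in: it builds a small zip archive in memory to play the role of `req.content`, then extracts it with the same `ZipFile(BytesIO(...))` pattern. The file names and contents are made up and are not real shapefiles.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

# Build a small zip archive in memory, standing in for `req.content`
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo.shp", b"placeholder geometry bytes")
    archive.writestr("demo.dbf", b"placeholder attribute bytes")
payload = buffer.getvalue()

# Same pattern as get_shapes: wrap the bytes, open them as a zip, extract to a folder
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "demo_shapes"
    with ZipFile(BytesIO(payload)) as z:
        z.extractall(target)
    # List what landed on disk before the temporary folder is removed
    extracted = sorted(p.name for p in target.iterdir())

print(extracted)
```

Wrapping the downloaded bytes in `BytesIO` avoids writing the zip itself to disk; only the extracted files are saved.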

## Load data

### Setup

In [5]:
# Download CrunchBase data using DAPS

my_config = '../../mysqldb_team.config'

#Create connection with SQL
con = get_engine(my_config)

### Places

In [6]:
places_df = get_daps_data('geographic_data',con)

In [7]:
places_df.head()

| | id | city | country | country_alpha_2 | country_alpha_3 | country_numeric | continent | latitude | longitude | done |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 's-graveland_netherlands | 's-graveland | Netherlands | NL | NLD | 528 | EU | 52.246038 | 5.130486 | 1 |
| 1 | 's-gravendeel_netherlands | 's-gravendeel | Netherlands | NL | NLD | 528 | EU | 51.767282 | 4.598556 | 1 |
| 2 | 's-gravenhage_netherlands | 's-gravenhage | Netherlands | NL | NLD | 528 | EU | 52.074946 | 4.26968 | 1 |
| 3 | 's-gravenzande_netherlands | 's-gravenzande | Netherlands | NL | NLD | 528 | EU | 51.996606 | 4.158805 | 1 |
| 4 | 's-heerenberg_netherlands | 's-heerenberg | Netherlands | NL | NLD | 528 | EU | 51.881981 | 6.250319 | 1 |


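### Shapes

We download the NUTS2 and LEP shapefiles from the [Open Geography Portal](https://geoportal.statistics.gov.uk/). The URLs request the boundaries in British National Grid (EPSG:27700), so they need reprojecting before they can be matched with latitudes and longitudes.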

In [8]:
nuts_url = 'https://opendata.arcgis.com/datasets/48b6b85bb7ea43699ee85f4ecd12fd36_1.zip?outSR=%7B%22latestWkid%22%3A27700%2C%22wkid%22%3A27700%7D'

leps_url = 'https://opendata.arcgis.com/datasets/d4d519d1d1a1455a9b82331228f77489_1.zip?outSR=%7B%22latestWkid%22%3A27700%2C%22wkid%22%3A27700%7D'

In [9]:
for url,name in zip([nuts_url,leps_url],['nuts_2_2018','leps_2017']):
    get_shapes(url,name)

getting nuts_2_2018...
saving nuts_2_2018...
getting leps_2017...
saving leps_2017...


## Write the reverse geocoder

The reverse geocoder does a point-in-polygon join between a df of places and a boundary file: each place is assigned to the polygon that contains its coordinates.
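
To make the point-in-polygon idea concrete, here is a minimal pure-Python sketch using the ray casting test. This is not how geopandas does the join internally; it is only an illustration of the question being asked of each place. The `point_in_polygon` helper, the square polygon and the place names are all hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) is inside a polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point; an odd count means inside
            if lon < x_cross:
                inside = not inside
    return inside


# A toy "region": a square with corners at (0, 0) and (2, 2)
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

# Toy places as (lon, lat) pairs, matching the x=lon, y=lat order used by Point
places = {"inside_place": (1.0, 1.0), "outside_place": (3.0, 1.0)}

matches = {
    name: point_in_polygon(lon, lat, square) for name, (lon, lat) in places.items()
}
print(matches)
```

A spatial join repeats this kind of test for every place against the boundary polygons and keeps the attributes of the polygon that matches.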

In [10]:
def reverse_geocoder(place_df,shape_path,place_id,
                     coord_names=['longitude','latitude']):
    '''
    The reverse geocoder takes a df with geographical coordinates and does a spatial join with a shapefile.
    
    Args:
        place_df (df): A dataframe where every row is an entity we want to reverse geocode
        shape_path (str): The path for the shapefile (it is reprojected to WGS84 inside the function)
        place_id (str): The name of the variable with the place id in place_df
        coord_names (list): Names of the lon and lat variables in place_df, in that order
        
    Returns:
        A spatially joined geodataframe indexed by place id, with the attributes of the polygon containing each place.
        Places that fall outside every polygon are dropped.
    
    '''
    
    #Read the shapefile    
    print('Reading shapefile...')
    
    shape = gp.read_file(shape_path)
    
    #Reproject to WGS84 (EPSG:4326) so it can deal with lats and lons
    shape = shape.to_crs('EPSG:4326')
    
    #Create the geometries for the spatial join using Point (x=lon, y=lat)
    points = [Point(lon,lat) for lon,lat in zip(place_df[coord_names[0]],place_df[coord_names[1]])]
    
    #Create a place_holder df (ho ho) where the index is the place id
    place_holder = gp.GeoDataFrame(geometry=points,index=place_df[place_id],crs='EPSG:4326')
    
    print('Joining...')
    
    #Spatial join: looks for points inside the polygons
    #(geopandas>=0.10 uses predicate=; older releases called this argument op=)
    joined = gp.sjoin(place_holder,shape,predicate='within')
    
    #Return the joined df
    return joined

In [11]:
#We loop this function over different shapefiles
nuts_path = '../../data/shapefiles/nuts_2_2018/NUTS_Level_2_January_2018_Full_Extent_Boundaries_in_the_United_Kingdom.shp'
leps_path = '../../data/shapefiles/leps_2017/Local_Enterprise_Partnerships_April_2017_Full_Extent_Boundaries_in_England.shp'

nuts_geo,leps_geo = [reverse_geocoder(places_df,path,'id') for path in [nuts_path,leps_path]]

Reading shapefile...
Joining...
Reading shapefile...
Joining...


### Some quick observations

In [12]:
print(len(places_df))

30469


In [13]:
print(len(nuts_geo))

3151


In [14]:
print(len(leps_geo))

2918


Most of the 30,469 places are not reverse geocoded because they are outside the UK. More places match a NUTS2 region (3,151) than a LEP (2,918) because LEPs only cover England.

## Outputs

We merge `nuts_geo` and `leps_geo` on the place id and save the result.

In [15]:
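#Keep the place id (as 'index') plus the area codes and names, then outer merge
nuts_codes = nuts_geo.rename_axis('index').reset_index()[['index','nuts218cd','nuts218nm']]
leps_codes = leps_geo.rename_axis('index').reset_index()[['index','lep17cd','lep17nm']]
rev_geo = pd.merge(nuts_codes,leps_codes,on='index',how='outer')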
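
The choice of `how='outer'` matters here because some places have a NUTS2 region but no LEP. The toy sketch below, with made-up place ids and codes, shows what the outer merge does in that situation: the unmatched place is kept and its LEP column is left empty.

```python
import pandas as pd

# Hypothetical lookups: three places have a NUTS code, only two have a LEP code
nuts_demo = pd.DataFrame(
    {"index": ["place_a", "place_b", "place_c"], "nuts_code": ["N1", "N1", "N2"]}
)
leps_demo = pd.DataFrame(
    {"index": ["place_a", "place_b"], "lep_code": ["L1", "L2"]}
)

# The outer merge keeps every place id that appears in either table
merged_demo = pd.merge(nuts_demo, leps_demo, on="index", how="outer")
print(merged_demo)
```

With `how='inner'` the place without a LEP would be dropped, which in the real data would remove UK places outside England.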

In [16]:
rev_geo.head()

| | index | nuts218cd | nuts218nm | lep17cd | lep17nm |
|---|---|---|---|---|---|
| 0 | abbots-langley_united-kingdom | UKH2 | Bedfordshire and Hertfordshire | E37000017 | Hertfordshire |
| 1 | ampthill_united-kingdom | UKH2 | Bedfordshire and Hertfordshire | E37000041 | South East Midlands |
| 2 | ardeley_united-kingdom | UKH2 | Bedfordshire and Hertfordshire | E37000017 | Hertfordshire |
| 3 | arlesey_united-kingdom | UKH2 | Bedfordshire and Hertfordshire | E37000041 | South East Midlands |
| 4 | ashwell_united-kingdom | UKH2 | Bedfordshire and Hertfordshire | E37000017 | Hertfordshire |


In [17]:
rev_geo.rename(columns={'index':'location_id'},inplace=True)

In [19]:
rev_geo.to_csv(f'../../data/processed/crunchbase/{today_str}_rev_geocoded_places',index=False)