# Geopandas spatial join

*This is not part of the learning objectives for this session and will not be assessed. This is included for those interested in an example of how location information (latitude and longitude) can be handled when joining between databases.*

### Running this code

**TO BE ABLE TO RUN THIS CODE YOU WOULD NEED TO CREATE A NEW ENVIRONMENT WITH THE FOLLOWING PACKAGES**

Packages installed explicitly using Anaconda:
- geopandas
- sqlalchemy
- jupyterlab (optional)

Installing through Anaconda should handle the dependencies for these modules, but if you create your environment in a different way you may need to install additional packages.

**You would also need to download additional data - see below**

## Find county information for solar panel database

For our UKPVGeo database, we have details about the location of each solar panel/farm, but we don't explicitly know which county each one is in. One way to find that information is to take the latitude and longitude coordinates and match them against the areas of the counties.

One way this can be achieved in Python is by using a module called `geopandas`, which is based on the `pandas` module but also allows spatial points and shapes to be defined. With this module we can perform a spatial join, which checks which solar panel locations fall within each county area.
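
To make the idea of "which points fall within which area" concrete, here is a minimal pure-Python sketch of a point-in-polygon check using ray casting. This is only an illustration of the concept, not how `geopandas` implements its spatial join, and the rectangular "counties" and panel positions are invented for the example.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside polygon, given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x position where this edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


# Hypothetical rectangular "counties" as (longitude, latitude) vertices
toy_counties = {
    "West": [(-5.0, 50.0), (-3.0, 50.0), (-3.0, 52.0), (-5.0, 52.0)],
    "East": [(-3.0, 50.0), (-1.0, 50.0), (-1.0, 52.0), (-3.0, 52.0)],
}
# Hypothetical panel positions as (longitude, latitude)
toy_panels = {"panel_a": (-4.8, 50.3), "panel_b": (-1.5, 51.0), "panel_c": (2.0, 51.0)}


def find_county(lon, lat):
    """Return the name of the first toy county containing the point, or None."""
    for name, outline in toy_counties.items():
        if point_in_polygon(lon, lat, outline):
            return name
    return None


matches = {panel: find_county(lon, lat) for panel, (lon, lat) in toy_panels.items()}
print(matches)
```

Here `panel_c` lies outside both rectangles, so it has no match. A spatial join performs this kind of test for every point against real, far more detailed boundaries.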

### Extracting the county details

Details of the spatial outlines of the counties are made available at "data.gov.uk" and were downloaded from here:

https://data.gov.uk/dataset/11302ddc-65bc-4a8f-96a9-af5c456e442c/counties-and-unitary-authorities-december-2016-full-clipped-boundaries-in-england-and-wales

This data is available in multiple formats. I used GeoJSON, a modern format that works well with the [`geopandas` library](https://geopandas.org/index.html) ([User Guide](https://geopandas.org/docs/user_guide.html)).

*To run this code, download the "Counties_and_Unitary_Authorities_(December_2016)_Boundaries.geojson" file and save it in the "data" sub-folder*

In [1]:
import geopandas

# Read the county outlines from a GeoJSON file into a geopandas GeoDataFrame
filename = "data/Counties_and_Unitary_Authorities_(December_2016)_Boundaries.geojson"
counties = geopandas.read_file(filename)

In [2]:
counties

Unnamed: 0,objectid,ctyua16cd,ctyua16nm,ctyua16nmw,bng_e,bng_n,long,lat,st_areashape,st_lengthshape,geometry
0,1,E06000001,Hartlepool,,447157,531476,-1.27023,54.676159,9.355951e+07,71707.330231,"MULTIPOLYGON (((-1.26846 54.72612, -1.26858 54..."
1,2,E06000002,Middlesbrough,,451141,516887,-1.21099,54.544670,5.388858e+07,43840.846371,"MULTIPOLYGON (((-1.24390 54.58936, -1.24426 54..."
2,3,E06000003,Redcar and Cleveland,,464359,519597,-1.00611,54.567520,2.448203e+08,97993.287164,"MULTIPOLYGON (((-1.13758 54.64581, -1.13743 54..."
3,4,E06000004,Stockton-on-Tees,,444937,518183,-1.30669,54.556911,2.049622e+08,119581.507757,"MULTIPOLYGON (((-1.31729 54.64480, -1.31756 54..."
4,5,E06000005,Darlington,,428029,515649,-1.56835,54.535351,1.974757e+08,107206.282926,"POLYGON ((-1.63768 54.61714, -1.63800 54.61720..."
...,...,...,...,...,...,...,...,...,...,...,...
169,170,W06000020,Torfaen,Torfaen,327459,200480,-3.05101,51.698360,1.262399e+08,82544.788702,"POLYGON ((-3.10490 51.79504, -3.10498 51.79508..."
170,171,W06000021,Monmouthshire,Sir Fynwy,337812,209231,-2.90280,51.778271,8.502717e+08,223993.032044,"MULTIPOLYGON (((-3.05202 51.97282, -3.05215 51..."
171,172,W06000022,Newport,Casnewydd,337897,187433,-2.89769,51.582321,1.905248e+08,153277.172069,"MULTIPOLYGON (((-2.83665 51.64942, -2.83682 51..."
172,173,W06000023,Powys,Powys,302328,273254,-3.43533,52.348629,5.195321e+09,610110.322350,"POLYGON ((-3.15512 52.89799, -3.15502 52.89805..."


### Preparing the solar panel database

We can read the whole UKPVGeo database by passing the table name "pv" explicitly (rather than a full SQL query string). We then need to clean the data (remove entries with unknown locations) and make sure it is in the right format so we can join it with the county data defined above.

In [3]:
import pandas as pd

# Read all data from the pv table for our database
ukpvgeo = pd.read_sql("pv", "sqlite:///data/ukpvgeo.db")
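
For comparison, the alternative mentioned above (passing a full SQL query string) can be sketched as follows. This is a toy example: it uses an in-memory SQLite database with invented rows standing in for the real "pv" table, and a plain `sqlite3` connection. Note that passing a bare table name to `pd.read_sql` relies on SQLAlchemy, whereas a query string also works with a plain `sqlite3` connection.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory stand-in for the "pv" table (rows are invented)
conn = sqlite3.connect(":memory:")
toy_pv = pd.DataFrame(
    {
        "repd_site_name": ["Site A", "Site B", "Site C"],
        "latitude": [50.3, None, 54.6],
        "longitude": [-4.8, -1.2, None],
    }
)
toy_pv.to_sql("pv", conn, index=False)

# A full query lets the database do the filtering before pandas sees the data
query = """
    SELECT repd_site_name, latitude, longitude
    FROM pv
    WHERE latitude IS NOT NULL AND longitude IS NOT NULL
"""
located = pd.read_sql(query, conn)
conn.close()

print(located)
```

Only the row with both coordinates present is returned, so this query also does the job of the cleaning step that follows.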

In [4]:
ukpvgeo

Unnamed: 0,index,osm_objtype,osm_id,repd_id,repd_site_name,capacity_repd_MWp,capacity_osm_MWp,latitude,longitude,area_sqm,...,osm_tag_start_date,num_modules,repd_status,repd_operational_date,old_repd_id,osm_cluster_id,repd_cluster_id,source_capacity,source_obj,match_rule
0,0,way,179653409.0,1019.0,Manor Farm,,,50.335310,-4.828931,29896.5,...,NaT,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1
1,1,way,179653410.0,1019.0,Manor Farm,,,50.337199,-4.831966,33670.4,...,NaT,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1
2,2,way,293012396.0,1019.0,Manor Farm,,,50.337513,-4.829606,35991.1,...,NaT,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1
3,3,relation,9458140.0,1021.0,Howton Farm,,,50.446881,-4.290731,80822.7,...,NaT,,Operational,2011-07-22,A0506,550725977.0,1021.0,,,1
4,4,way,681567916.0,1039.0,Langage Solar Farm,,,50.384755,-4.007153,11375.7,...,NaT,,Operational,2011-07-15,AA255,7885568.0,1039.0,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265412,265412,way,745490213.0,5848.0,St Helen's Lane,5.0,5.000,54.673844,-3.527576,153339.7,...,NaT,,Operational,2017-03-28,,745490213.0,5848.0,repd,,1
265413,265413,way,103721031.0,5942.0,Barvills Solar Farm (resubmission),5.0,5.000,51.472969,0.420086,65162.7,...,NaT,,Operational,2017-03-29,,103721031.0,5942.0,repd,,25
265414,265414,relation,10844205.0,6759.0,Burneside Mill,0.5,0.459,54.356079,-2.760666,1811.5,...,NaT,,Operational,2019-11-01,,10844205.0,6759.0,,,25
265415,265415,way,682004205.0,7497.0,Brook Farm,25.0,4.000,51.854426,-1.131482,125345.6,...,NaT,,Application Submitted,NaT,,682004205.0,7497.0,repd,,1


In [5]:
# Drop any NaN values within the latitude and longitude columns
ukpvgeo = ukpvgeo.dropna(subset=["latitude", "longitude"], how="any")
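
As a small illustration of what `how="any"` does here, the sketch below applies the same call to an invented four-row table and compares it with `how="all"`. The column names mirror the real data but the values are made up.

```python
import pandas as pd

# Invented rows: A is complete, B and C each miss one coordinate, D misses both
toy = pd.DataFrame(
    {
        "site": ["A", "B", "C", "D"],
        "latitude": [50.3, None, 54.6, None],
        "longitude": [-4.8, -1.2, None, None],
    }
)

# how="any": drop a row if either coordinate is missing
dropped_any = toy.dropna(subset=["latitude", "longitude"], how="any")
# how="all": drop a row only if both coordinates are missing
dropped_all = toy.dropna(subset=["latitude", "longitude"], how="all")

print(len(toy), len(dropped_any), len(dropped_all))
```

For building point geometries we need both coordinates, which is why `how="any"` is the appropriate choice.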

To be able to join the data we need to create a common spatial column. In our UKPVGeo dataset we have longitude and latitude positions for every entry, so we can use these columns to turn our pandas DataFrame into a geopandas GeoDataFrame (see [Example - Creating a GeoDataFrame from a DataFrame with coordinates](https://geopandas.org/gallery/create_geopandas_from_pandas.html)).

In [6]:
# Turn our ukpvgeo into a GeoDataFrame by making our "latitude", "longitude" values into a geometry column
ukpvgeo_spatial = geopandas.GeoDataFrame(
    ukpvgeo, geometry=geopandas.points_from_xy(ukpvgeo.longitude, ukpvgeo.latitude, crs="EPSG:4326"))
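
The same conversion on a tiny invented table makes the argument order easier to see: `points_from_xy` takes x first and y second, so longitude comes before latitude. The rows below are made up, and the CRS is set on the GeoDataFrame here rather than inside `points_from_xy`, which is an equivalent choice for this sketch.

```python
import geopandas
import pandas as pd

# Invented rows with the same coordinate column names as the real data
toy = pd.DataFrame(
    {"site": ["A", "B"], "latitude": [50.3, 54.6], "longitude": [-4.8, -1.2]}
)

toy_spatial = geopandas.GeoDataFrame(
    toy,
    # x = longitude, y = latitude
    geometry=geopandas.points_from_xy(toy["longitude"], toy["latitude"]),
    crs="EPSG:4326",  # WGS 84, i.e. plain latitude/longitude in degrees
)

first = toy_spatial.geometry.iloc[0]
print(first.x, first.y)
print(toy_spatial.crs)
```

Swapping the two arguments would not raise an error; it would silently place the points in the wrong part of the world, so the order is worth checking.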

In [7]:
ukpvgeo_spatial

Unnamed: 0,index,osm_objtype,osm_id,repd_id,repd_site_name,capacity_repd_MWp,capacity_osm_MWp,latitude,longitude,area_sqm,...,num_modules,repd_status,repd_operational_date,old_repd_id,osm_cluster_id,repd_cluster_id,source_capacity,source_obj,match_rule,geometry
0,0,way,179653409.0,1019.0,Manor Farm,,,50.335310,-4.828931,29896.5,...,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1,POINT (-4.82893 50.33531)
1,1,way,179653410.0,1019.0,Manor Farm,,,50.337199,-4.831966,33670.4,...,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1,POINT (-4.83197 50.33720)
2,2,way,293012396.0,1019.0,Manor Farm,,,50.337513,-4.829606,35991.1,...,,Operational,2011-07-25,A0503,9361639.0,1019.0,,,1,POINT (-4.82961 50.33751)
3,3,relation,9458140.0,1021.0,Howton Farm,,,50.446881,-4.290731,80822.7,...,,Operational,2011-07-22,A0506,550725977.0,1021.0,,,1,POINT (-4.29073 50.44688)
4,4,way,681567916.0,1039.0,Langage Solar Farm,,,50.384755,-4.007153,11375.7,...,,Operational,2011-07-15,AA255,7885568.0,1039.0,,,1,POINT (-4.00715 50.38475)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265412,265412,way,745490213.0,5848.0,St Helen's Lane,5.0,5.000,54.673844,-3.527576,153339.7,...,,Operational,2017-03-28,,745490213.0,5848.0,repd,,1,POINT (-3.52758 54.67384)
265413,265413,way,103721031.0,5942.0,Barvills Solar Farm (resubmission),5.0,5.000,51.472969,0.420086,65162.7,...,,Operational,2017-03-29,,103721031.0,5942.0,repd,,25,POINT (0.42009 51.47297)
265414,265414,relation,10844205.0,6759.0,Burneside Mill,0.5,0.459,54.356079,-2.760666,1811.5,...,,Operational,2019-11-01,,10844205.0,6759.0,,,25,POINT (-2.76067 54.35608)
265415,265415,way,682004205.0,7497.0,Brook Farm,25.0,4.000,51.854426,-1.131482,125345.6,...,,Application Submitted,NaT,,682004205.0,7497.0,repd,,1,POINT (-1.13148 51.85443)


### Spatial join

Now that we have two GeoDataFrames, we can perform a spatial join: geopandas takes the point positions from our UKPVGeo dataset and matches them to the county outlines we downloaded. Because we use `how="inner"`, only the entries that fall inside one of the county outlines are kept.

In [8]:
# Perform a spatial join - this can take a few minutes
# Note: geopandas 0.10+ names this keyword `predicate`; older versions called it `op`
ukpvgeo_and_county = geopandas.sjoin(ukpvgeo_spatial, counties, how="inner", predicate="intersects")
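
To see what the join returns without waiting several minutes, here is a minimal sketch on invented data: two rectangular "counties" and three points, one of which lies outside both. It assumes geopandas 0.10 or later, where the keyword is `predicate` (older versions used `op`). It also contrasts `how="inner"` with `how="left"`.

```python
import geopandas
from shapely.geometry import Point, box

# Hypothetical rectangular "counties": box(minx, miny, maxx, maxy)
toy_counties = geopandas.GeoDataFrame(
    {"county": ["West", "East"]},
    geometry=[box(-5, 50, -3, 52), box(-3, 50, -1, 52)],
    crs="EPSG:4326",
)
# Hypothetical panel positions; site C is outside both rectangles
toy_panels = geopandas.GeoDataFrame(
    {"site": ["A", "B", "C"]},
    geometry=[Point(-4.8, 50.3), Point(-1.5, 51.0), Point(2.0, 51.0)],
    crs="EPSG:4326",
)

# inner: keep only points that matched a county
inner = geopandas.sjoin(toy_panels, toy_counties, how="inner", predicate="intersects")
# left: keep every point, with missing county values where nothing matched
left = geopandas.sjoin(toy_panels, toy_counties, how="left", predicate="intersects")

print(inner[["site", "county"]])
print(left[["site", "county"]])
```

With `how="inner"`, site C disappears from the result; with `how="left"` it is kept with a missing county. Both GeoDataFrames should share the same CRS before joining. A point lying exactly on a shared boundary would intersect both polygons and so appear twice in the output.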

In [9]:
ukpvgeo_and_county

Unnamed: 0,index,osm_objtype,osm_id,repd_id,repd_site_name,capacity_repd_MWp,capacity_osm_MWp,latitude,longitude,area_sqm,...,objectid,ctyua16cd,ctyua16nm,ctyua16nmw,bng_e,bng_n,long,lat,st_areashape,st_lengthshape
0,0,way,1.796534e+08,1019.0,Manor Farm,,,50.335310,-4.828931,29896.5,...,51,E06000052,Cornwall,,212501,64494,-4.64249,50.450230,3.549501e+09,1.198960e+06
1,1,way,1.796534e+08,1019.0,Manor Farm,,,50.337199,-4.831966,33670.4,...,51,E06000052,Cornwall,,212501,64494,-4.64249,50.450230,3.549501e+09,1.198960e+06
2,2,way,2.930124e+08,1019.0,Manor Farm,,,50.337513,-4.829606,35991.1,...,51,E06000052,Cornwall,,212501,64494,-4.64249,50.450230,3.549501e+09,1.198960e+06
3,3,relation,9.458140e+06,1021.0,Howton Farm,,,50.446881,-4.290731,80822.7,...,51,E06000052,Cornwall,,212501,64494,-4.64249,50.450230,3.549501e+09,1.198960e+06
6,6,way,5.512869e+08,1022.0,East Langford Farm,5.0,,50.866482,-4.441850,105138.9,...,51,E06000052,Cornwall,,212501,64494,-4.64249,50.450230,3.549501e+09,1.198960e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146494,146494,node,7.275038e+09,,,,,51.596686,-0.350482,0.0,...,107,E09000015,Harrow,,515356,189736,-0.33603,51.594669,5.046331e+07,3.373546e+04
146495,146495,node,7.275038e+09,,,,,51.603306,-0.341302,0.0,...,107,E09000015,Harrow,,515356,189736,-0.33603,51.594669,5.046331e+07,3.373546e+04
146496,146496,node,7.275038e+09,,,,,51.598759,-0.321518,0.0,...,107,E09000015,Harrow,,515356,189736,-0.33603,51.594669,5.046331e+07,3.373546e+04
265177,265177,node,9.912487e+08,,,,0.00903,51.603661,-0.297684,0.0,...,107,E09000015,Harrow,,515356,189736,-0.33603,51.594669,5.046331e+07,3.373546e+04


The result combines the columns from both tables for every entry that was matched to a county. We can choose which columns to keep and rename them appropriately.

In [10]:
ukpvgeo_and_county.columns

Index(['index', 'osm_objtype', 'osm_id', 'repd_id', 'repd_site_name',
       'capacity_repd_MWp', 'capacity_osm_MWp', 'latitude', 'longitude',
       'area_sqm', 'located', 'orientation', 'osm_power_type',
       'osm_tag_start_date', 'num_modules', 'repd_status',
       'repd_operational_date', 'old_repd_id', 'osm_cluster_id',
       'repd_cluster_id', 'source_capacity', 'source_obj', 'match_rule',
       'geometry', 'index_right', 'objectid', 'ctyua16cd', 'ctyua16nm',
       'ctyua16nmw', 'bng_e', 'bng_n', 'long', 'lat', 'st_areashape',
       'st_lengthshape'],
      dtype='object')

In [11]:
# Drop additional columns from the county table that we don't need
drop_columns = ["index_right", 'objectid','ctyua16nmw', 'bng_e', 'bng_n', 'long', 'lat', 'st_areashape', 'st_lengthshape']
ukpvgeo_new = ukpvgeo_and_county.drop(columns=drop_columns)

In [12]:
# Rename the county columns to be something more understandable
ukpvgeo_new = ukpvgeo_new.rename(columns={"ctyua16cd": "county_id", "ctyua16nm": "county"})

In [13]:
ukpvgeo_new

Unnamed: 0,index,osm_objtype,osm_id,repd_id,repd_site_name,capacity_repd_MWp,capacity_osm_MWp,latitude,longitude,area_sqm,...,repd_operational_date,old_repd_id,osm_cluster_id,repd_cluster_id,source_capacity,source_obj,match_rule,geometry,county_id,county
0,0,way,1.796534e+08,1019.0,Manor Farm,,,50.335310,-4.828931,29896.5,...,2011-07-25,A0503,9.361639e+06,1019.0,,,1,POINT (-4.82893 50.33531),E06000052,Cornwall
1,1,way,1.796534e+08,1019.0,Manor Farm,,,50.337199,-4.831966,33670.4,...,2011-07-25,A0503,9.361639e+06,1019.0,,,1,POINT (-4.83197 50.33720),E06000052,Cornwall
2,2,way,2.930124e+08,1019.0,Manor Farm,,,50.337513,-4.829606,35991.1,...,2011-07-25,A0503,9.361639e+06,1019.0,,,1,POINT (-4.82961 50.33751),E06000052,Cornwall
3,3,relation,9.458140e+06,1021.0,Howton Farm,,,50.446881,-4.290731,80822.7,...,2011-07-22,A0506,5.507260e+08,1021.0,,,1,POINT (-4.29073 50.44688),E06000052,Cornwall
6,6,way,5.512869e+08,1022.0,East Langford Farm,5.0,,50.866482,-4.441850,105138.9,...,2011-07-25,A0507,5.512869e+08,1022.0,,,1,POINT (-4.44185 50.86648),E06000052,Cornwall
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146494,146494,node,7.275038e+09,,,,,51.596686,-0.350482,0.0,...,NaT,,7.275038e+09,,,,,POINT (-0.35048 51.59669),E09000015,Harrow
146495,146495,node,7.275038e+09,,,,,51.603306,-0.341302,0.0,...,NaT,,7.275038e+09,,,,,POINT (-0.34130 51.60331),E09000015,Harrow
146496,146496,node,7.275038e+09,,,,,51.598759,-0.321518,0.0,...,NaT,,7.275038e+09,,,,,POINT (-0.32152 51.59876),E09000015,Harrow
265177,265177,node,9.912487e+08,,,,0.00903,51.603661,-0.297684,0.0,...,NaT,,9.912487e+08,,,London DataStore,,POINT (-0.29768 51.60366),E09000015,Harrow


In [14]:
# Looking at the unique entries within the new county column
ukpvgeo_new.county.unique()

array(['Cornwall', 'Devon', 'Somerset', 'Buckinghamshire', 'Lincolnshire',
       'Derbyshire', 'Gloucestershire', 'Cambridgeshire', 'Oxfordshire',
       'Pembrokeshire', 'Monmouthshire', 'Northamptonshire', 'Dorset',
       'Wiltshire', 'Suffolk', 'Isle of Wight', 'Surrey', 'Hampshire',
       'Norfolk', 'Carmarthenshire', 'Essex', 'North Somerset', 'Kent',
       'South Gloucestershire', 'Darlington', 'Leicestershire',
       'North Yorkshire', 'East Sussex', 'West Berkshire', 'Swansea',
       'Shropshire', 'Rutland', 'Cheshire West and Chester',
       'Nottinghamshire', 'Barnsley', 'Doncaster', 'Staffordshire',
       'Cheshire East', 'Cumbria', 'West Sussex', 'Newport', 'Swindon',
       'Wrexham', 'Herefordshire, County of', 'Telford and Wrekin',
       'Thurrock', 'Central Bedfordshire', 'Bedford', 'Bridgend',
       'Vale of Glamorgan', 'Ceredigion', 'Milton Keynes',
       'Neath Port Talbot', 'Warrington', 'Worcestershire',
       'Rhondda Cynon Taf', 'Windsor and Maidenhea

### Outputting to a database

We can output our new joined DataFrame to an SQLite database if we wish.

In [15]:
# Turn our geopandas GeoDataFrame back into a pandas DataFrame and drop the geometry column we created
ukpvgeo_df = pd.DataFrame(ukpvgeo_new).drop(columns=["geometry"])

In [16]:
ukpvgeo_df

Unnamed: 0,index,osm_objtype,osm_id,repd_id,repd_site_name,capacity_repd_MWp,capacity_osm_MWp,latitude,longitude,area_sqm,...,repd_status,repd_operational_date,old_repd_id,osm_cluster_id,repd_cluster_id,source_capacity,source_obj,match_rule,county_id,county
0,0,way,1.796534e+08,1019.0,Manor Farm,,,50.335310,-4.828931,29896.5,...,Operational,2011-07-25,A0503,9.361639e+06,1019.0,,,1,E06000052,Cornwall
1,1,way,1.796534e+08,1019.0,Manor Farm,,,50.337199,-4.831966,33670.4,...,Operational,2011-07-25,A0503,9.361639e+06,1019.0,,,1,E06000052,Cornwall
2,2,way,2.930124e+08,1019.0,Manor Farm,,,50.337513,-4.829606,35991.1,...,Operational,2011-07-25,A0503,9.361639e+06,1019.0,,,1,E06000052,Cornwall
3,3,relation,9.458140e+06,1021.0,Howton Farm,,,50.446881,-4.290731,80822.7,...,Operational,2011-07-22,A0506,5.507260e+08,1021.0,,,1,E06000052,Cornwall
6,6,way,5.512869e+08,1022.0,East Langford Farm,5.0,,50.866482,-4.441850,105138.9,...,Operational,2011-07-25,A0507,5.512869e+08,1022.0,,,1,E06000052,Cornwall
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146494,146494,node,7.275038e+09,,,,,51.596686,-0.350482,0.0,...,,NaT,,7.275038e+09,,,,,E09000015,Harrow
146495,146495,node,7.275038e+09,,,,,51.603306,-0.341302,0.0,...,,NaT,,7.275038e+09,,,,,E09000015,Harrow
146496,146496,node,7.275038e+09,,,,,51.598759,-0.321518,0.0,...,,NaT,,7.275038e+09,,,,,E09000015,Harrow
265177,265177,node,9.912487e+08,,,,0.00903,51.603661,-0.297684,0.0,...,,NaT,,9.912487e+08,,,London DataStore,,E09000015,Harrow


In [17]:
import pandas as pd

# Uncomment the line below to write the joined table to a new SQLite database file
#ukpvgeo_df.to_sql("pv", 'sqlite:///data/ukpvgeo_county.db')
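
The write step can be tried safely against an in-memory database first. The sketch below uses an invented stand-in for the joined table and a plain `sqlite3` connection, writes it to a table called "pv", and reads it back to check the round trip. The `index=False` and `if_exists="replace"` settings are choices made for this example, not settings taken from the original notebook.

```python
import sqlite3

import pandas as pd

# Invented stand-in for the joined table
toy_joined = pd.DataFrame(
    {
        "repd_site_name": ["Site A", "Site B"],
        "county_id": ["X01", "X02"],
        "county": ["West", "East"],
    }
)

conn = sqlite3.connect(":memory:")
# index=False avoids writing the DataFrame index as an extra column;
# if_exists="replace" overwrites an existing table instead of raising an error
toy_joined.to_sql("pv", conn, index=False, if_exists="replace")

round_trip = pd.read_sql("SELECT * FROM pv", conn)
conn.close()

print(round_trip)
```

By default `to_sql` uses `if_exists="fail"`, so re-running the write against a database file that already contains a "pv" table raises an error unless `if_exists` is set to `"replace"` or `"append"`.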

---