# Manipulating Geospatial Data

## Introduction

In this tutorial, you'll learn about two common manipulations for geospatial data: geocoding and table joins.

## Geocoding

Geocoding is the process of converting the name of a place or an address to a location on a map. If you have ever looked up a geographic location based on a landmark description with Google Maps, Bing Maps, or Baidu Maps, for instance, then you have used a geocoder!

<center>
<img src="great_pyramid.jpg" width="1000" align="center"><br/>
</center>

We'll use **GeoPandas** to do all of our geocoding.


In [12]:
import numpy as np
import pandas as pd
import geopandas as gpd
from geopandas.tools import geocode

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

To use the **geocoder**, we need only provide:
- the name or address as a Python string, and
- the name of the provider. To avoid having to provide an API key, we'll use the **OpenStreetMap Nominatim** geocoder.

If the geocoding is successful, it returns a **GeoDataFrame** with two columns:
- the "**geometry**" column contains the location as a **Point** with (longitude, latitude) coordinates, and
- the "**address**" column contains the full address.

In [2]:
result = geocode("The Great Pyramid of Giza", provider="nominatim",
                 user_agent='jupyter', timeout=4)
result

Unnamed: 0,geometry,address
0,POINT (31.13422 29.97915),"هرم خوفو, Cause way, كوم الأخضر, الجيزة, محافظ..."


The entry in the "**geometry**" column is a **Point** object, and we can get the latitude and longitude from the **y** and **x** attributes, respectively.

In [3]:
point = result.geometry[0]
print("Latitude:", point.y)
print("Longitude:", point.x)

Latitude: 29.9791509
Longitude: 31.134219302763587
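
The axis order is easy to mix up, so here is a minimal sketch of it. It builds a **Point** by hand from the rounded coordinates printed above (rather than calling the geocoder) and rearranges them into the `[latitude, longitude]` order that folium expects for marker locations.

```python
from shapely.geometry import Point

# Rounded coordinates copied from the geocoding output above:
# x is the longitude, y is the latitude
pyramid = Point(31.13422, 29.97915)

# folium markers and maps take locations as [latitude, longitude]
lat_lon = [pyramid.y, pyramid.x]
print(lat_lon)
```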


### Geocoding of multiple addresses

It's often the case that we'll need to geocode many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.


In [15]:
universities = pd.read_csv("../input/top_universities.csv")
universities = universities.head(20)
universities.head()

Unnamed: 0,Name
0,University of Oxford
1,University of Cambridge
2,Imperial College London
3,ETH Zurich
4,UCL


We can then use a **lambda** function to apply the geocoder to every row in the DataFrame. A **try/except** statement handles the case where geocoding fails.

In [16]:
def my_geocoder(name):
    try:
        result = geocode(name, provider="nominatim",
                         user_agent='jupyter', timeout=2).iloc[0]
        point = result.geometry
        address = result.address
        return pd.Series({'address': address, 'geometry': point,
                          'latitude': point.y, 'longitude':point.x})
    except Exception:
        # Geocoding failed (for example, no match or a timeout): leave this row empty
        return None
    
universities[['address', 'geometry', 
              'latitude', 'longitude']] = universities.apply(lambda row: my_geocoder(row.Name), 
                                                             axis=1)
universities.head(5)

Unnamed: 0,Name,address,geometry,latitude,longitude
0,University of Oxford,"Oxford University Museum of Natural History, P...",POINT (-1.255668482609204 51.75870755),51.758708,-1.255668
1,University of Cambridge,"Fitzwilliam Museum, Trumpington Street, Newnha...",POINT (0.1197386574107438 52.1998523),52.199852,0.119739
2,Imperial College London,"Imperial College London, Exhibition Road, Brom...",POINT (-0.1746287460939027 51.4989834),51.498983,-0.174629
3,ETH Zurich,"ETH Merchandise Store, 3, Sonneggstrasse, Ober...",POINT (8.5475089 47.3773269),47.377327,8.547509
4,UCL,"Petrie Museum of Egyptian Archeology, Malet Pl...",POINT (-0.1329564742107762 51.5235641),51.523564,-0.132956
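
When geocoding many names against a public service, it can help to pause between requests; the public Nominatim server, for instance, asks for at most one request per second. The sketch below is one possible way to do that, not part of GeoPandas: `geocode_politely`, `geocode_fn`, and `fake_geocode` are hypothetical names introduced here, and the fake geocoder stands in for a real one so that the example runs offline.

```python
import time


def geocode_politely(names, geocode_fn, min_interval=1.0, sleep=time.sleep):
    """Call geocode_fn(name) for each name, pausing between requests.

    geocode_fn is a hypothetical callable that returns a result or raises.
    Returns a dict mapping each name to its result, or to None on failure.
    """
    results = {}
    for i, name in enumerate(names):
        if i > 0:
            sleep(min_interval)  # wait before every request except the first
        try:
            results[name] = geocode_fn(name)
        except Exception:
            results[name] = None  # record the failure and keep going
    return results


def fake_geocode(name):
    """Made-up stand-in for a real geocoder (no network access)."""
    if name == "Nowhere University":
        raise ValueError("no match")
    return (0.0, 0.0)


out = geocode_politely(["UCL", "Nowhere University"], fake_geocode,
                       min_interval=0.0)
print(out)
```

With a real geocoder, `geocode_fn` could wrap the `geocode(...)` call used in `my_geocoder`, and `min_interval` would stay at one second or more.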


In [17]:
# Share of rows with a non-missing address, as a percentage
print("{}% of addresses were geocoded!".format(
    universities.address.notna().mean() * 100))

90.0% of addresses were geocoded!


In [18]:
# Drop universities that were not successfully geocoded
universities = universities.loc[~np.isnan(universities.latitude)]
# Build a GeoDataFrame in EPSG:4326, the CRS used for longitude/latitude coordinates
universities = gpd.GeoDataFrame(universities, geometry=universities.geometry,
                                crs="EPSG:4326")
universities.head()

Unnamed: 0,Name,address,geometry,latitude,longitude
0,University of Oxford,"Oxford University Museum of Natural History, P...",POINT (-1.25567 51.75871),51.758708,-1.255668
1,University of Cambridge,"Fitzwilliam Museum, Trumpington Street, Newnha...",POINT (0.11974 52.19985),52.199852,0.119739
2,Imperial College London,"Imperial College London, Exhibition Road, Brom...",POINT (-0.17463 51.49898),51.498983,-0.174629
3,ETH Zurich,"ETH Merchandise Store, 3, Sonneggstrasse, Ober...",POINT (8.54751 47.37733),47.377327,8.547509
4,UCL,"Petrie Museum of Egyptian Archeology, Malet Pl...",POINT (-0.13296 51.52356),51.523564,-0.132956


Next, we visualize all of the locations that were returned by the geocoder. Notice that a few of the locations are certainly inaccurate, as they're not in Europe!

In [19]:
import folium

# Create a map
m = folium.Map(location=[54, 15], tiles='openstreetmap', zoom_start=3)

# Add points to the map
for idx, row in universities.iterrows():
    folium.Marker([row.geometry.y, row.geometry.x], popup=row['Name']).add_to(m)
m

## Table joins

Now, we'll switch topics and think about how to combine data from different sources.

In [31]:
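# This dataset is bundled with GeoPandas releases before 1.0
# (the `gpd.datasets` module was removed in GeoPandas 1.0)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
europe = world.loc[world.continent.isin(['Europe'])]
# Make the CRS explicit: EPSG:4326 (longitude/latitude)
europe = europe.set_crs("EPSG:4326", allow_override=True)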
europe.head()

Unnamed: 0,pop_est,continent,name,iso_a3,gdp_md_est,geometry
18,142257519,Europe,Russia,RUS,3745000.0,"MULTIPOLYGON (((178.725 71.099, 180.000 71.516..."
21,5320045,Europe,Norway,-99,364700.0,"MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ..."
43,67106161,Europe,France,-99,2699000.0,"MULTIPOLYGON (((-51.658 4.156, -52.249 3.241, ..."
110,9960487,Europe,Sweden,SWE,498100.0,"POLYGON ((11.027 58.856, 11.468 59.432, 12.300..."
111,9549747,Europe,Belarus,BLR,165400.0,"POLYGON ((28.177 56.169, 29.230 55.918, 29.372..."


### Attribute join

You already know how to use `pd.DataFrame.join()` to combine information from multiple DataFrames with a shared index. We refer to this way of joining data (by simply matching values in the index) as an **attribute join**.

When performing an attribute join with a **GeoDataFrame**, it's best to use the `gpd.GeoDataFrame.merge()` method.

To illustrate this, we'll work with the **GeoDataFrame** `europe_boundaries` containing the boundaries for every country in Europe. The first five rows of this **GeoDataFrame** are printed below.

In [32]:
europe_boundaries = europe.loc[:, ['name', 'geometry']]
europe_boundaries.head()

Unnamed: 0,name,geometry
18,Russia,"MULTIPOLYGON (((178.725 71.099, 180.000 71.516..."
21,Norway,"MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ..."
43,France,"MULTIPOLYGON (((-51.658 4.156, -52.249 3.241, ..."
110,Sweden,"POLYGON ((11.027 58.856, 11.468 59.432, 12.300..."
111,Belarus,"POLYGON ((28.177 56.169, 29.230 55.918, 29.372..."


We'll join it with a **DataFrame** `europe_stats` containing the estimated population and gross domestic product (GDP) for each country.

In [33]:
europe_stats = europe.loc[:, ['name', 'pop_est', 'gdp_md_est']]
europe_stats.head()

Unnamed: 0,name,pop_est,gdp_md_est
18,Russia,142257519,3745000.0
21,Norway,5320045,364700.0
43,France,67106161,2699000.0
110,Sweden,9960487,498100.0
111,Belarus,9549747,165400.0


We do the attribute join in the code cell below. 

The **on** argument is set to the column name that is used to match rows in **europe_boundaries** to rows in **europe_stats**.

In [34]:
# Use an attribute join to merge data about countries in Europe
europe_join = europe_boundaries.merge(europe_stats, on="name")
europe_join.head()

Unnamed: 0,name,geometry,pop_est,gdp_md_est
0,Russia,"MULTIPOLYGON (((178.725 71.099, 180.000 71.516...",142257519,3745000.0
1,Norway,"MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ...",5320045,364700.0
2,France,"MULTIPOLYGON (((-51.658 4.156, -52.249 3.241, ...",67106161,2699000.0
3,Sweden,"POLYGON ((11.027 58.856, 11.468 59.432, 12.300...",9960487,498100.0
4,Belarus,"POLYGON ((28.177 56.169, 29.230 55.918, 29.372...",9549747,165400.0
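
By default, `merge()` performs an inner join, so rows whose key appears in only one table are dropped silently. The sketch below shows one way to spot such rows using the `how` and `indicator` arguments of `merge()`. It uses small made-up pandas DataFrames (including the fictional country "Atlantis") rather than the real data, so the tables and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up miniature tables; "Atlantis" has no entry in the statistics table
boundaries = pd.DataFrame({"name": ["Norway", "Sweden", "Atlantis"]})
stats = pd.DataFrame({"name": ["Norway", "Sweden"],
                      "pop_est": [5320045, 9960487]})

# how="left" keeps every row of `boundaries`; indicator=True adds a "_merge"
# column recording whether each row was found in both tables or only the left one
joined = boundaries.merge(stats, on="name", how="left", indicator=True)
print(joined)

# Names that had no match in `stats`
unmatched = joined.loc[joined["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```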


### Spatial join

Another type of join is a **spatial join**. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in the "**geometry**" columns.

For instance, we already have a **GeoDataFrame** `universities` containing geocoded addresses of European universities. We can use a spatial join to match each university to its corresponding country.

We do this with `gpd.sjoin()`.

In [35]:
# Use spatial join to match universities to countries in Europe
european_universities = gpd.sjoin(universities, europe)

# Investigate the result
print("We located {} universities.".format(len(universities)))
print("Only {} of the universities were located in Europe (in {} different countries).".format(
    len(european_universities), len(european_universities.name.unique())))

european_universities.head()

We located 18 universities.
Only 17 of the universities were located in Europe (in 5 different countries).


Unnamed: 0,Name,address,geometry,latitude,longitude,index_right,pop_est,continent,name,iso_a3,gdp_md_est
0,University of Oxford,"Oxford University Museum of Natural History, P...",POINT (-1.25567 51.75871),51.758708,-1.255668,143,64769452,Europe,United Kingdom,GBR,2788000.0
1,University of Cambridge,"Fitzwilliam Museum, Trumpington Street, Newnha...",POINT (0.11974 52.19985),52.199852,0.119739,143,64769452,Europe,United Kingdom,GBR,2788000.0
2,Imperial College London,"Imperial College London, Exhibition Road, Brom...",POINT (-0.17463 51.49898),51.498983,-0.174629,143,64769452,Europe,United Kingdom,GBR,2788000.0
4,UCL,"Petrie Museum of Egyptian Archeology, Malet Pl...",POINT (-0.13296 51.52356),51.523564,-0.132956,143,64769452,Europe,United Kingdom,GBR,2788000.0
5,London School of Economics and Political Science,London School of Economics and Political Scien...,POINT (-0.11643 51.51459),51.514591,-0.116431,143,64769452,Europe,United Kingdom,GBR,2788000.0


The spatial join above looks at the "geometry" columns in both GeoDataFrames:
- If a **Point** object from the `universities` GeoDataFrame intersects a **Polygon** object from the `europe` GeoDataFrame, the corresponding rows are combined and added as a single row of the `european_universities` GeoDataFrame.
- Otherwise, countries without a matching university (and universities without a matching country) are omitted from the results.

The `gpd.sjoin()` function is customizable for different types of joins, through the **how** and **op** arguments (in GeoPandas 0.10 and later, **op** is named **predicate**). For instance, you can do the equivalent of a SQL left (or right) join by setting `how='left'` (or `how='right'`).
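
To make the difference between the default inner join and `how='left'` concrete, here is a minimal sketch on hand-made data. The two square "countries" and the three points are invented for illustration and do not correspond to real borders or universities.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two made-up square "countries" sitting side by side
countries = gpd.GeoDataFrame(
    {"name": ["Westland", "Eastland"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

# Three made-up places; "C" lies outside both squares
places = gpd.GeoDataFrame(
    {"Name": ["A", "B", "C"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Default (inner) join: only places that fall inside a country are kept
inner = gpd.sjoin(places, countries)

# Left join: every place is kept; unmatched ones get missing country columns
left = gpd.sjoin(places, countries, how="left")

print(len(inner), len(left))
print(left[["Name", "name"]])
```

In the left join, the row for "C" is kept with a missing value in the `name` column, which makes it easy to see which points did not fall inside any polygon.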

We won't go into the details in this micro-course, but you can learn more in the [GeoPandas documentation](https://geopandas.org/reference/geopandas.sjoin.html).