# Joins and merging data

DATE: 11 June 2020, 18:00 - 21:00 UTC

AUDIENCE: Intermediate

INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)

Often we want to combine two datasets, but only for the area that is covered by one of them. Spatial joins are a relatively complex topic, so this notebook gives only a brief overview that will hopefully be useful.

In [None]:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
provinces = gpd.read_file('zip://../data/ne_10m_admin_1_states_provinces.zip')
za_provinces = provinces[provinces['sov_a3'] == 'ZAF']
employment = pd.read_csv('../data/za_province_employment.csv')

GeoPandas differentiates between spatial joins and attribute joins. An **attribute** join works more or less the same as in standard pandas, and uses values that are common to both dataframes. A **spatial** join uses the geometry of each dataframe.

## Attribute Joins

This works by finding a common value in two dataframes and using it to add the columns of one dataframe to the matching rows of the other. In pandas, one uses the `merge` method to do this. Attribute joins are very common when you have existing data that needs to be combined with existing geometry. A common example would be adding demographic data to administrative districts.

In this case, we can see that the employment data has a `Province` attribute. We can link that to the `name` attribute in `za_provinces`. Further, there is no geometry associated with the employment data, so we have no way of seeing if there are any spatial trends to the data, unless we have a very good mental image of South Africa's provinces.

In [None]:
employment

In [None]:
ax = employment.plot(kind='bar')

In [None]:
za_provinces

We can now merge on these, which will give us a dataframe that adds the value associated with a given province in the employment data to each province.

In [None]:
merged_provinces = za_provinces.merge(employment, left_on='name', right_on='Province')
merged_provinces

As we can see, we have added the columns from `employment` to our `za_provinces` geodataframe, which we can treat normally. Also worth noting is that we have lost the row with the totals of each class, because there is no province named 'Total'. Since this is now a standard geodataframe, we can easily make a plot based on the employment data for each province.
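
One way to check which rows a merge would drop is to run it as an outer join with `indicator=True`. The sketch below uses two small, hypothetical tables (the province names are real, but the `Employed` numbers are made up for illustration) rather than the workshop data:

```python
import pandas as pd

# Hypothetical stand-ins for za_provinces and employment (values are made up).
provinces = pd.DataFrame({'name': ['Gauteng', 'Limpopo', 'Free State']})
employment = pd.DataFrame({
    'Province': ['Gauteng', 'Limpopo', 'Total'],
    'Employed': [10, 5, 15],
})

# An outer merge keeps every row from both sides; the `_merge` column
# records whether each key was found on the left, the right, or both.
check = provinces.merge(
    employment, left_on='name', right_on='Province',
    how='outer', indicator=True,
)

# Rows that an inner merge (the default) would silently drop.
unmatched = check[check['_merge'] != 'both']
print(unmatched)
```

Here `Free State` shows up as `left_only` and `Total` as `right_only`, which is the same mechanism that removes the totals row in the merge above.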

In [None]:
ax = merged_provinces.plot('Total', scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')
ax.set_ylim(-35.2, -21.8)  # These limits are to ignore the Gough Islands
ax.set_xlim(16, 33.2)  # These limits are to ignore the Gough Islands
the_legend = ax.get_legend()
the_legend.set_bbox_to_anchor((1.7,1))
plt.title('Population in South Africa per Province')
ax2 = merged_provinces.plot(merged_provinces['Unemployed']/merged_provinces['Total']*100, scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')
ax2.set_ylim(-35.2, -21.8)  # These limits are to ignore the Gough Islands
ax2.set_xlim(16, 33.2)  # These limits are to ignore the Gough Islands
the_legend = ax2.get_legend()
the_legend.set_bbox_to_anchor((1.45,1))
plt.title('Percentage unemployed in each province')

## Spatial Joins

These work by looking at the geometry of two different geodataframes and relating them to each other.

For example, this river dataset has no information on which country or province a river is in, but that may be of interest for some reason.

In [None]:
rivers = gpd.read_file('zip://../data/ne_10m_rivers_lake_centerlines_trimmed.zip')
rivers

GeoPandas offers us the `sjoin` function to spatially join two different geodataframes.

This takes two geodataframes: the first is `left` and the second is `right`.

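The `op` parameter controls how the geometries are related to each other, using the `shapely` library's [binary predicates](https://shapely.readthedocs.io/en/latest/manual.html#binary-predicates). (This parameter was renamed to `predicate` in GeoPandas 0.10, and recent releases no longer accept `op`.) The options include: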
* `intersects` - True if the objects have any boundary or interior point in common.
* `contains` - True if no points of other lie in the exterior of the object and at least one point of the interior of other lies in the interior of object.
* `within` - True if the object’s boundary and interior intersect only with the interior of the other (not its boundary or exterior).
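
The predicates above can be tried directly on `shapely` geometries. This is a minimal sketch with a hypothetical square "province" and two hypothetical "rivers" in arbitrary coordinates, not the Natural Earth data:

```python
from shapely.geometry import LineString, Polygon

# A hypothetical square "province" covering (0, 0) to (10, 10).
province = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

# Two hypothetical "rivers": one entirely inside, one crossing the border.
rivers = {
    'inside': LineString([(2, 2), (8, 8)]),
    'crossing': LineString([(5, 5), (15, 5)]),
}

# Evaluate each predicate for each river against the province.
results = {
    name: {
        'intersects': province.intersects(river),
        'contains': province.contains(river),
        'within': river.within(province),
    }
    for name, river in rivers.items()
}

for name, outcome in results.items():
    print(name, outcome)
```

Both rivers intersect the province, but only the one that stays inside the square satisfies `contains` and `within`. Note that `a.contains(b)` and `b.within(a)` are the same test seen from opposite sides.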

The `how` parameter controls which geometry is kept:
* `'left'` uses keys from left; retains only the left geometry column

In [None]:
za_rivers_left = gpd.sjoin(za_provinces, rivers, how="left", op='intersects')
base = za_provinces.plot(color='lightgrey', edgecolor='black')
za_rivers_left[za_rivers_left['sov_a3'] == 'ZAF'].plot(ax=base)

base.set_ylim(-35.2, -21.8)  # These limits are to ignore the Gough Islands
base.set_xlim(16, 33.2)  # These limits are to ignore the Gough Islands
za_rivers_left

* `'right'` uses keys from right; retains only the right geometry column. Note that this means that all the rivers will still be present, but only those which can be matched to a province in `za_provinces` will have values from `za_provinces`.

Note that we have rivers that extend beyond the border, because we are only looking for intersecting geometries. Try a different `op` (`'contains'` or `'within'`) to see what effect that has. Note also that some rivers are now present twice, because they intersect multiple provinces and so get selected more than once.

In [None]:
za_rivers_right = gpd.sjoin(za_provinces, rivers, how="right", op='intersects')
base = za_provinces.plot(color='lightgrey', edgecolor='black')
za_rivers_right[za_rivers_right['sov_a3'] == 'ZAF'].plot(ax=base)

base.set_ylim(-35.2, -21.8)  # These limits are to ignore the Gough Islands
base.set_xlim(16, 33.2)  # These limits are to ignore the Gough Islands
#za_rivers[za_rivers['sov_a3'] == 'ZAF']
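
To see why a river can appear more than once, here is a small sketch with two hypothetical adjacent square provinces and two hypothetical rivers in arbitrary planar coordinates. It assumes GeoPandas 0.10 or later, where the keyword is `predicate`; on older versions use `op` instead.

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Two hypothetical adjacent provinces (not real data).
provinces = gpd.GeoDataFrame(
    {'province': ['West', 'East']},
    geometry=[
        Polygon([(0, 0), (5, 0), (5, 5), (0, 5)]),
        Polygon([(5, 0), (10, 0), (10, 5), (5, 5)]),
    ],
)

# Two hypothetical rivers.
rivers = gpd.GeoDataFrame(
    {'river': ['Long', 'Short']},
    geometry=[
        LineString([(1, 1), (9, 1)]),  # crosses both provinces
        LineString([(6, 3), (8, 3)]),  # lies inside 'East' only
    ],
)

# A right join yields one row per (river, matching province) pair.
joined = gpd.sjoin(provinces, rivers, how='right', predicate='intersects')

# Count how many provinces each river was matched to.
matches_per_river = joined.groupby('river')['province'].count()
print(matches_per_river)
```

The `Long` river produces two rows, one for each province it crosses, while `Short` produces one.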

* `'inner'` uses the intersection of keys from both geodataframes; retains only the left geometry column

In [None]:
za_rivers_inner = gpd.sjoin(za_provinces, rivers, how="inner", op='intersects')

za_rivers_inner

Comparing all three then:

In [None]:
print(f'Left: {za_rivers_left.shape}\nRight: {za_rivers_right.shape}\nInner: {za_rivers_inner.shape}')

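Note that for these datasets, we expect Left and Inner to be the same, which is the case when every province intersects at least one river, so the left join has no unmatched provinces to add. The main difference is in whether or not we keep records that are only in right, that is, rivers that do not intersect any province in `za_provinces`.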
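
The effect of `how` is easier to see on a toy case where the two sides do not match up completely. The sketch below uses hypothetical geometries in arbitrary planar coordinates: province `B` has no river, and rivers `R2` and `R3` lie outside both provinces. It assumes GeoPandas 0.10 or later (`predicate` keyword); on older versions use `op` instead.

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Hypothetical provinces: only 'A' contains a river.
provinces = gpd.GeoDataFrame(
    {'province': ['A', 'B']},
    geometry=[
        Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
        Polygon([(4, 0), (8, 0), (8, 4), (4, 4)]),
    ],
)

# Hypothetical rivers: only 'R1' touches a province.
rivers = gpd.GeoDataFrame(
    {'river': ['R1', 'R2', 'R3']},
    geometry=[
        LineString([(1, 1), (3, 3)]),      # inside province A
        LineString([(20, 20), (22, 22)]),  # outside both provinces
        LineString([(30, 30), (32, 31)]),  # outside both provinces
    ],
)

# Run the same join with each value of `how`.
results = {
    how: gpd.sjoin(provinces, rivers, how=how, predicate='intersects')
    for how in ('left', 'right', 'inner')
}

for how, joined in results.items():
    print(how, len(joined), list(joined.geom_type.unique()))
```

In this toy case the left join keeps both provinces (two rows, polygon geometry), the right join keeps all three rivers (three rows, line geometry), and the inner join keeps only the single matching pair. Left and inner differ here because province `B` has no river.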
<hr />
<img src="https://avatars1.githubusercontent.com/u/1692321?v=3&s=200" style="float:center" width="40px" />
<p><center>© 2020 <a href="http://www.agilegeoscience.com/">Agile Geoscience</a> — <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY</a></center></p>