# Intro to Geopandas: plotting vector data

DATE: 12 June 2020, 08:00 - 11:00 UTC

AUDIENCE: Intermediate

INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)

Not all the data that we want to deal with is simply numeric. Much of it will need to be located in space as well. Luckily, there are numerous tools to handle this sort of data. For this notebook, we will focus on vector data. This is data consisting of points, lines and polygons, not gridded data. The tutorials by Leo Uieda and Joe Kington deal more with raster data and should be a good complement to this tutorial. 

There are a number of common spatial tasks that are often done in GIS software, such as adding buffers to data, or manipulating and creating geometries. This notebook focuses on the basics of loading existing data and plotting it, rather than on those spatial manipulations.

#### Prerequisites

You should be reasonably comfortable with `pandas` and `matplotlib.pyplot`.

Beyond that, this is aimed at relative beginners to working with spatial data.

#### A Note on Shapefiles

Shapefiles are a common file format used when sharing and storing georeferenced data. A single shapefile is made up of several component files, some of which are required for it to work correctly.
These are mandatory:
- `<name>.shp` is the feature geometry.
- `<name>.shx` is the shape index.
- `<name>.dbf` contains the attributes in columns, for each feature.

There are a number of additional files that may also be present, of which these are the most common (in the author's experience).
- `<name>.prj` is the projection of the data.
- `<name>.sbx` and `<name>.sbn` are a spatial index.
- `<name>.shp.xml` is a metadata file.

While shapefiles are very common on desktop systems, they tend not to be used to present data on the web, although they are often offered as a download option.
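Because a shapefile is really a bundle of files, it can be worth checking that the mandatory parts are all present before trying to load one. The following is a minimal sketch using only the standard library; the helper name `missing_shapefile_parts` is made up for this illustration and is not part of geopandas.

```python
from pathlib import Path

# The three mandatory components: geometry, shape index, attribute table.
MANDATORY_SUFFIXES = ('.shp', '.shx', '.dbf')


def missing_shapefile_parts(shp_path):
    """Return the mandatory suffixes that have no file next to `shp_path`."""
    base = Path(shp_path)
    return [suffix for suffix in MANDATORY_SUFFIXES
            if not base.with_suffix(suffix).exists()]
```

An empty list means that all three mandatory files were found alongside the given path.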

### Pandas and Geopandas

Pandas gives us access to a data structure called a DataFrame, which is very well suited to the sort of data that usually lives in spreadsheets, with rows and columns. Geopandas extends this so that the data can be geographically located in a sensible way. It does this by adding a `geometry` column and some methods for spatially useful tests, while still allowing the usual `DataFrame` methods from pandas.
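As a small illustration of this idea, here is a sketch of a geodataframe built by hand. The names, depths and coordinates are invented for the example and have nothing to do with the datasets used later.

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy data: the names, depths and coordinates are invented for illustration.
gdf = gpd.GeoDataFrame(
    {'name': ['A', 'B', 'C'], 'depth_m': [1200, 3400, 2100]},
    geometry=[Point(0, 0), Point(3, 4), Point(10, 0)],
)

# Ordinary pandas operations, such as boolean filtering, still work ...
deep = gdf[gdf['depth_m'] > 2000]

# ... and the geometry column adds spatial methods, such as distance.
dist_from_origin = gdf.distance(Point(0, 0))
```

Here `deep` is still a geodataframe, and `dist_from_origin` is a plain pandas `Series` with one distance per row.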

In addition, we will use `cartopy` to handle projections. `mapclassify` is optional, but allows easy binning of our data.

In [None]:
#import cartopy.crs as ccrs
import geopandas as gpd
import mapclassify as mc
import numpy as np
import pandas as pd

## Creating a geodataframe

Loading a shapefile (or a number of other formats) is as simple as calling `read_file` with the right location.

Geopandas uses `fiona` in the background, so anything that can be handled by `fiona` can be handled with geopandas. Note that some formats can be read, but not written.

In [None]:
fname = '../data/cleaned/offshore_wells_2011_Geographic_NAD27.shp'

well_locations = gpd.read_file(fname)
well_locations.head()

We can also load data as a standard DataFrame and convert it by using any existing geometry that we know about.

We will load some data on issues identified at artisanal mines in Zimbabwe by the International Peace Information Service ([IPIS](http://ipisresearch.be/)).

In [None]:
fname = '../data/zwe_mines_curated_all_opendata_p_ipis.csv'

artisinal_mines = pd.read_csv(fname)
artisinal_mines.head()

We can see that there is a `geom` column in this CSV, where every point is a Well-Known Text ([WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry)) string describing the geometry. To make the geodataframe aware of this, we will use the `shapely` library (that geopandas uses under the hood).
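To get a feel for what a WKT string looks like and what `shapely` makes of it, here is a minimal sketch. The coordinates are invented for the example and are not taken from the IPIS data.

```python
from shapely import wkt

# A WKT string for a single point; the coordinates are invented.
point = wkt.loads('POINT (30.5 -19.25)')

print(point.geom_type)   # the kind of geometry that was parsed
print(point.x, point.y)  # for lon/lat data, x is longitude and y is latitude
```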

In [None]:
from shapely import wkt
artisinal_mines['geom'] = artisinal_mines['geom'].apply(wkt.loads)
mines = gpd.GeoDataFrame(artisinal_mines, geometry='geom')
mines.head()

This does not look very different, but we have now created a geodataframe from our existing dataframe. We could do something very similar with a CSV with separate columns of latitude and longitude.
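For the latitude and longitude case, one possible approach is `gpd.points_from_xy`. This is only a sketch on a toy table: the column names `longitude` and `latitude` and all of the values are assumptions made for the example.

```python
import geopandas as gpd
import pandas as pd

# Toy table; the column names and values are invented for illustration.
df = pd.DataFrame({
    'site': ['A', 'B'],
    'longitude': [29.8, 31.0],
    'latitude': [-19.5, -17.8],
})

# points_from_xy takes x (longitude) first, then y (latitude).
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['longitude'], df['latitude']),
)
```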

When creating a new geodataframe like this, we should also set the Coordinate Reference System (CRS) of the data, since geopandas does not know where the coordinates actually are on the Earth's surface. Some operations will still work, but relating one geodataframe to another is not possible. We are working with straight decimal degrees of longitude and latitude, so the WGS84 datum is a good option.

In [None]:
mines.crs = "EPSG:4326"
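Recent geopandas releases also offer a `set_crs` method, which does the same job and returns a new geodataframe. A minimal sketch on an invented point, assuming a geopandas version that provides `set_crs`:

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy geodataframe created without a CRS; the point is invented.
gdf = gpd.GeoDataFrame({'site': ['A']}, geometry=[Point(30.0, -19.0)])
print(gdf.crs)  # no CRS has been assigned yet

# set_crs only labels the coordinates; it does not move them.
gdf = gdf.set_crs('EPSG:4326')
print(gdf.crs.to_epsg())
```

Note that assigning a CRS is different from reprojecting: the coordinate values stay exactly as they were.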

One of the simplest ways to see how a geodataframe differs from a standard dataframe is by calling the `plot` method.

In [None]:
artisinal_mines.plot()

In [None]:
mines.plot()

As we can see, the geodataframe plots our coordinates, while the standard dataframe plots the numerical values according to their index.

### Exercise 1

The following should be easily possible with a working knowledge of `pandas`. Using the well dataset:
1. Which well is the deepest? (`df.sort_values('column')` may be useful.)
1. How many wells are operated by Canadian Superior?

In [None]:
# The deepest well is:


In [None]:
# The deepest well is:
well_locations.sort_values('Dpth_m').tail(1)

In [None]:
# How many wells were operated by Canadian Superior?


In [None]:
# How many wells were operated by Canadian Superior?
len(well_locations[well_locations['Owner'] == 'Canadian Superior'])

### Geographic plots

We can take a quick look at where these wells are in relation to each other.

In [None]:
well_locations.plot()

The data that we imported uses latitude and longitude. We can also easily import projected data, if we have it.

In [None]:
fname = '../data/cleaned/offshore_wells_2011_UTM20_NAD83.shp'
well_locations_utm = gpd.read_file(fname)
# We are going to use the 'Spud_Date' and 'Well_End' columns later, so we will turn them into proper datetime columns
well_locations_utm['Spud_Date'] = pd.to_datetime(well_locations_utm['Spud_Date'])
well_locations_utm['Well_End'] = pd.to_datetime(well_locations_utm['Well_End'])
# Treat the string 'None' as missing data (np.nan works across NumPy versions)
well_locations_utm.replace('None', np.nan, inplace=True)
well_locations_utm.plot()
well_locations_utm.head(5)

Notice that the axes are completely different between the two datasets. We therefore cannot plot these two datasets on the same plot unless they use the same coordinate reference system (CRS). Geopandas can reproject one dataset to match the other.

First, let us see what CRS the different datasets have.

In [None]:
print(f'Wells: {well_locations.crs}\nWells (UTM): {well_locations_utm.crs}')

If we want to plot the two datasets on the same plot, then they need to use the same CRS. One of the easiest ways is by using EPSG codes and the `to_crs` method. [epsg.io](https://epsg.io) and [spatialreference.org](https://spatialreference.org) are good places to find a suitable EPSG code for your data if you are not sure how the CRS relates to it.

In [None]:
well_locations_utm_reproj = well_locations_utm.to_crs(epsg=4326)
ax = well_locations_utm_reproj.plot(markersize=15)
well_locations.plot(ax=ax, color='red', markersize=5, alpha=0.4)

We can see that these datasets now plot on top of each other, as they should.
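The effect of `to_crs` on the coordinates themselves can be seen on a toy example. This sketch uses one invented point and assumes EPSG:32620 (WGS 84 / UTM zone 20N) as the projected CRS, which is not the CRS of the wells file above.

```python
import geopandas as gpd
from shapely.geometry import Point

# One invented point in longitude/latitude (WGS84).
geographic = gpd.GeoDataFrame(geometry=[Point(-63.0, 44.0)], crs='EPSG:4326')

# Reproject to UTM zone 20N (EPSG:32620); coordinates are now in metres.
projected = geographic.to_crs(epsg=32620)

# Reprojecting back should recover the original coordinates.
roundtrip = projected.to_crs(geographic.crs)

print(projected.geometry.iloc[0])
print(roundtrip.geometry.iloc[0])
```

Unlike assigning a CRS, `to_crs` changes the coordinate values, which is why the projected point is printed in metres rather than degrees.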

## Styling

Just plotting these points on their own does not tell us very much, so we should style the data to show us what is happening. We will classify the data by the total depth of each well, breaking the column into 6 bins bounded by percentiles of the depth values.

We do this by using the `scheme` parameter, which `mapclassify` uses in the background to bin the values of a column. A number of binning options are available, such as `NaturalBreaks`, `Quantiles`, `StdMean` and `Percentiles`.

In [None]:
well_locations_utm.plot(column='Dpth_m',
                        scheme='Percentiles', k=6,
                        legend=True,
                        markersize=10, cmap='cividis_r', figsize=(10,10))#.legend(bbox_to_anchor=(2,1))

The `scheme` keyword passes through to `mapclassify`, and only makes sense for some data. In other cases, we can just rely on the raw data.

In [None]:
well_locations_utm.plot(column='Well_Type', legend=True,
                        markersize=10, cmap='Set1',
                        figsize=(10,10))#.legend(bbox_to_anchor=(1,1))

We may also be interested in only a section of the data within certain extents, such as the dense cluster south-east of centre. Geopandas offers a `cx` coordinate indexer, which can be used to slice a geodataframe based on coordinate values.

In [None]:
main_field = well_locations_utm.cx[650000:800000, 4825000:4925000]
print(main_field.shape)
main_field.plot(column='Owner', legend=True, markersize=15, cmap='tab20', figsize=(10,10))
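The behaviour of `cx` is easier to verify on a toy example. This sketch uses four invented points and keeps those that fall within a small bounding box.

```python
import geopandas as gpd
from shapely.geometry import Point

# Four invented points on a small grid.
pts = gpd.GeoDataFrame(
    {'label': ['a', 'b', 'c', 'd']},
    geometry=[Point(1, 1), Point(2, 8), Point(6, 3), Point(4, 4)],
)

# Keep only features that intersect the box 0 <= x <= 5, 0 <= y <= 5.
subset = pts.cx[0:5, 0:5]
print(subset['label'].tolist())
```

The first slice is the x range and the second is the y range, in the units of the geodataframe's CRS.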

### Exercise 2

The data contains columns for the start and end of when a well was active.

1. Which well was operating for the longest time and how long was this? (Hint: use the `datetime` columns from earlier ('Spud_Date' and 'Well_End'). A useful pattern is `df.loc[df['column'] == value]`.)
2. Plot a histogram of the days of operation for the wells in the dataset. You may need to drop invalid data (where some columns are NaN or NaT).
3. Using the above histogram to determine a suitable cut-off, is there an area of the field that has wells that were in operation for longer than others? (Hint: you might want to extract a useful time interval from a `Series` of `timedelta`s to plot.)
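As a reminder of how `timedelta` values behave in pandas, here is a minimal sketch on invented dates. The column names `start` and `end` are hypothetical and are not the columns of the wells dataset.

```python
import pandas as pd

# Invented start and end dates for three hypothetical wells.
df = pd.DataFrame({
    'start': pd.to_datetime(['2000-01-01', '2001-06-15', None]),
    'end': pd.to_datetime(['2000-01-31', '2001-06-20', '2002-01-01']),
})

# Subtracting datetime columns gives a Series of timedeltas (NaT if missing).
df['duration'] = df['end'] - df['start']

# .dt.days turns each timedelta into a number of whole days.
days = df['duration'].dt.days.dropna()
print(days.tolist())
```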

In [None]:
# Which well was operating for the longest time and how long was this?



In [None]:
# Which well was operating for the longest time and how long was this?
well_locations_utm['Operating'] = well_locations_utm['Well_End'] - well_locations_utm['Spud_Date']
well_locations_utm[well_locations_utm['Operating'] == well_locations_utm['Operating'].max()]

In [None]:
# Plot a histogram of the days of operation for the wells in the dataset.


In [None]:
# Plot a histogram of the days of operation for the wells in the dataset.
well_locations_utm['Operating'].dt.days.plot(kind='hist', bins=30)

In [None]:
# Using the above histogram to determine a suitable cut-off, is there an area of the field that has
# wells that were in operation for longer than others?







In [None]:
# Using the above histogram to determine a suitable cut-off, is there an area of the field that has
# wells that were in operation for longer than others?

#well_locations_utm['Operating_Days'] = well_locations_utm[well_locations_utm['Operating'].dt.days.notna() == True]
well_locations_utm['Operating_Days'] = well_locations_utm['Operating'].dt.days
long_wells = well_locations_utm[well_locations_utm['Operating'].dt.days > 150]
base = well_locations_utm.plot(color='grey', figsize=(10,10), markersize=8)
long_wells.plot(column='Operating_Days', scheme='Quantiles',
                cmap='viridis', alpha=0.9,
                ax=base, legend=True)

## Saving geodataframes

While we can create maps and similar things in geopandas, sometimes we want to use the data in other software. Geopandas uses `fiona` in the background to read and write files. If we want this geodataframe as a GeoJSON file, for example, this is easily done by passing the correct argument to the `driver` parameter. (Note that GeoJSON only accepts the WGS84 datum, so I am reprojecting the geodataframe first.)

By default, without an explicit driver, `to_file` will create a Shapefile.

In [None]:
fname = '../data/geojson-offshore_wells_Geographic_NAD27.geojson'
well_locations.to_crs(epsg=4326).to_file(fname, driver='GeoJSON')

Changing this to a GML file (a flavour of XML) is as simple as changing the driver parameter appropriately:

In [None]:
fname = '../data/gml-offshore_wells_Geographic_NAD27.gml'
well_locations.to_file(fname, driver='GML')

<hr />
<img src="https://avatars1.githubusercontent.com/u/1692321?v=3&s=200" style="float:center" width="40px" />
<p><center>© 2020 <a href="http://www.agilegeoscience.com/">Agile Geoscience</a> — <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY</a></center></p>