<center><h1>7SSG2059 Geocomputation 2018/19</h1></center>

<center><h1>Practical 8: Making Maps</h1></center>


# Working with GeoPandas and PySAL

We've worked -- and will continue to work -- a lot with pandas, but by default pandas doesn't help us much when we want to start working with explicitly geographical data. Ways of working _computationally_ with things like distance and location are increasingly important not only to geographers, but also to data scientists, and this is where we start to move away from purely aspatial statistical analysis into the foundations of a more _geographic_ data science.

There are a huge number of modules (a.k.a. packages, a.k.a. libraries) in Python designed to help you work with geodata, but we are going to focus on the two most important higher-level libraries (since they also provide 'wrappers' around some of the lower-level libraries):

1. [GeoPandas](http://geopandas.org/) -- which offers a pandas-like interface to working with geodata. Think of this as your tool for basic data manipulation and transformation, much like pandas. You will almost certainly want to [bookmark the documentation](http://geopandas.org/data_structures.html#geodataframe).
2. [PySAL](http://pysal.readthedocs.io/en/latest/) -- the Python Spatial Analysis Library provides the spatial analytic functions that we'll need for everything from classification, clustering and point-pattern analysis to autocorrelation-based tools.

There is some overlap between the two libraries. Both can do plotting but, for reasons we'll see later, we'll normally do this in PySAL. Both can also do classification (remember, we did _quantiles_ with pandas!) but, again for reasons we'll see later, we'll often do this in PySAL from here on out.

PySAL is complicated enough that the best way to understand how it fits together is to use an image:

![PySAL Logo](http://darribas.org/gds_scipy16/content/figs/pysal.png)

We're going to concentrate primarily on the ESDA (Exploratory _Spatial_ Data Analysis) and Weights components of PySAL in this module, but you should know about the other bits!

## A (Semi-Brief) Discourse on Families & Inheritance

GeoPandas objects are deliberately designed to resemble Pandas objects. There are two good reasons for this: 

1. Since Pandas is well-known, this makes it easier to learn how to use GeoPandas.
2. GeoPandas _inherits_ functionality from Pandas. 

The concept of **inheritance** is something we've held off from mentioning until now, but it's definitely worth understanding if you are serious about learning how to code. In effect, geopandas '_imports_' pandas and then _extends_ it so that the more basic class (pandas in this case) learns how to work with geodata... pandas doesn't know how to read shapefiles or make maps, but geopandas does. Same for GeoJSON.

### The 'Tree of Life'

Here's a simple way to think about inheritance: think of the 'evolutionary trees' you might have seen charting the evolution of organisms over time. At the bottom of the tree is the single-celled animal, and at the other end are humans, whales, wildebeest, etc. We all _inherit_ some basic functionality from that original, simple cell. In between us and that primitive, however, are a whole series of branches: different bits of the tree evolved in different directions and developed different 'functionality'. Some of us have bones. Some have cartilage. Some are vegetarian, and some are carnivorous. And so on. When you get to the primates we all share certain common 'features' (binocular vision, grasping hands, etc.), but we are _still_ more similar to gorillas than we are to macaques. So gorillas and humans _extend_ the primitive 'primate functionality' with some bonus features (bigger brains, greater strength, etc.) that are useful, while macaques extend it with a slightly different set of features (tails, etc.).

![Tree of Life](http://palaeos.com/systematics/tree/images/treeolif.jpg)

### The 'Tree of Classes'

Inheritance in code works in a similar way: *all* Python classes (lists, pandas, plots, etc.) inherit their most basic functionality from a single primitive 'object' class that itself does very little except to provide a template for what an object should look like. As you move along the inheritance tree you will find more and more complex objects with increasingly advanced features: a GeoPandas `GeoDataFrame` inherits from a Pandas `DataFrame`, for example. (Libraries can also build on one another without strict class inheritance: Seaborn, for instance, is built on top of matplotlib rather than inheriting from it.)

I can't find an image of Python base class inheritance, but I've found an equally useful example of how _anything_ can be modelled using this 'family tree' approach... consider the following:

![Vehicle Inheritance](http://www.mkonar.org/dogus/wiki/lib/exe/fetch.php/python/vehicle.png?w=750&tok=4c7ed7)

If we were trying to implement a vehicle registration scheme in Python, we would want to start with the most basic category of all: _vehicle_. The vehicle class itself might not do much, but it gives us a template for _all_ vehicles (e.g. it must be registered, it must have a unique license number, etc.). We then _extend_ the functionality of this 'base class' with three intermediate classes: two-wheeled vehicles, cars, and trucks. These, in turn, lead to eight actual vehicle types. These might have _additional_ functionality: a bus might need to have a passenger capacity associated with it, while a convertible might need to be hard- or soft-top. All of this could be expressed in Python as:

```python
class vehicle(object): # Inherit from base class
    def __init__(self):
        pass # ... do something ...

class car(vehicle): # Inherit from vehicle
    def __init__(self):
        super().__init__() # run vehicle's set-up first
        # ... do other stuff ...

class sedan(car): # Inherit from car
    def __init__(self):
        super().__init__() # run car's (and so vehicle's) set-up first
        # ... do more stuff ...
```

This way, when we create a new `sedan`, it automatically 'knows' about vehicles and cars, and can make use of functions like `set_unique_id(<identification>)` even if that function is _only_ specified in the base vehicle class! The thing to remember is that programmers are _lazy_: if they can avoid reinventing the wheel, they will. Object-Oriented Programming using inheritance is a good example of _constructive_ laziness: it saves us having to constantly copy and paste code (for registering a new vehicle or reading in a CSV file) from one class to the next since we can just import it and _extend_ it! 
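
To make this concrete, here is a minimal runnable sketch of the same three classes. The `set_unique_id` method comes from the text above; the attributes `unique_id`, `doors` and `body_style`, and the registration string, are made up purely for illustration.

```python
class vehicle(object):
    def __init__(self):
        self.unique_id = None          # every vehicle can hold an identifier

    def set_unique_id(self, identification):
        # Defined ONLY here, in the base class
        self.unique_id = identification


class car(vehicle):
    def __init__(self, doors=4):
        super().__init__()             # run the vehicle set-up first
        self.doors = doors             # hypothetical car-specific attribute


class sedan(car):
    def __init__(self):
        super().__init__(doors=4)      # run the car (and vehicle) set-up
        self.body_style = 'sedan'      # hypothetical sedan-specific attribute


my_car = sedan()
my_car.set_unique_id('AB12 CDE')       # inherited all the way from vehicle

print(my_car.unique_id)                # AB12 CDE
print(isinstance(my_car, car))         # True
print(isinstance(my_car, vehicle))     # True
```

Notice that `sedan` never mentions `set_unique_id`, yet the call works because Python looks the method up in `car` and then in `vehicle`.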

### Advantages of Inheritance \#1

This also means that we are less likely to make mistakes: if we want to update our vehicle registration scheme then we don't need to update lots of functions all over the place, we just update the base class and _all_ inheriting classes automatically gain the update because they are making use of the base class' function. 

So if pandas is updated with a new 'load a zip file' feature then geopandas automatically benefits from it! The _only_ thing that doesn't gain that benefit immediately is our ability to make use of specifically geographical data because pandas doesn't know about that type of data, only 'normal' tabular data.

### Advantages of Inheritance \#2

Inheritance also means that you can always use an instance of a 'more evolved' class in place of one of its ancestors: simplifying things a _bit_, a sedan can automatically do anything that a car can do and, by extension, anything that a vehicle can do. 

Similarly, since geopandas inherits from pandas, if you need to use a geopandas object _as if_ it's a pandas object then that will work! So everything you learned last term for pandas can still be used in geopandas. Kind of cool, right?

### Designing for Inheritance

Finally, looking back at our example above: what about unicycles? Or tracked vehicles like a tank? This is where _design_ comes into the picture: when we're planning out a family tree for our work we need to be careful about what goes where. And there isn't always a single right answer: perhaps we should distinguish between pedal-powered and motor-powered (in which case unicycles, bicycles and tricycles all belong in the same family)? Or perhaps we need to distinguish between wheeled and tracked (in which case we're missing a pair of classes [wheeled, tracked] between 'vehicle' and 'two-wheel, car, truck')? These choices are tremendously important but often very hard to get right.

OK, that's enough programming theory, let's see this in action...

## Required Preamble

It makes life a lot easier if you make all of the library import commands and configuration information (here having to do with `matplotlib`) the first executable code in a notebook or script. That way it's easy to see what you need to have installed before you get started!

In [None]:
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.cm as cm

In [None]:
import warnings 
warnings.simplefilter('ignore')

The two new packages we're using this week are PySAL and GeoPandas:

In [None]:
import pysal as ps
import geopandas as gpd

## Creating a GeoPandas DataFrame

There are two primary ways in which we can create a GeoPandas DataFrame:
1. Use an existing Pandas DataFrame that contains a `geometry` Series
2. Load a [shapefile](https://en.wikipedia.org/wiki/Shapefile) (containing geometry) and `merge` with a Pandas DataFrame

As the LSOA data we have been using already has a `geometry` series (column), we'll use approach 1. first, then look at 2. later.

So, first let's read our LSOA Data as a Pandas DataFrame as usual: 

In [None]:
pdf = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz?raw=true',
    compression='gzip', low_memory=False) # low_memory=False makes pandas read the whole file before inferring column types

print(pdf.columns)

Notice in our Pandas DataFrame that one of our columns (Series) is named _geometry_. This contains information on the shape of each LSOA; let's take a look at it:

In [None]:
print(pdf.geometry.head(10))

So each element of this Series has text indicating the type of shape the geometry applies to (e.g. _POLYGON_) followed by a bunch of numbers. These numbers are truncated in the view above, so let's look in a little more detail at a single entry in the _geometry_ column:    

In [None]:
print(pdf.geometry.iloc[0])

Now we see that what we have is pairs of numbers, each separated by a comma. Each pair of values is the set of coordinates for one [vertex](https://en.wikipedia.org/wiki/Vertex_(geometry)) of the [polygon](https://en.wikipedia.org/wiki/Polygon). GeoPandas can use this information about the type of shape and the coordinates to plot the data as a map (and perform other spatial analysis functions).
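
If the structure of that text is still unclear, the following sketch pulls the shape type and the vertex pairs out of a small, made-up polygon string using nothing but string handling. The helper name `wkt_polygon_vertices` is our own invention, and it only copes with a simple single-ring `POLYGON`; real code should rely on a proper parser rather than this kind of string splitting.

```python
def wkt_polygon_vertices(text):
    """Split a simple single-ring POLYGON string into its type and vertices."""
    shape_type, _, rest = text.partition(' ')            # 'POLYGON' and the rest
    inner = rest.strip().lstrip('(').rstrip(')')         # drop the brackets
    pairs = [pair.split() for pair in inner.split(',')]  # one 'x y' pair per vertex
    return shape_type, [(float(x), float(y)) for x, y in pairs]


# A made-up 4 x 3 rectangle (the first and last vertices match to close the ring)
example = "POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))"

shape_type, vertices = wkt_polygon_vertices(example)
print(shape_type)   # POLYGON
print(vertices)     # [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]
```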

So now we can create a GeoPandas DataFrame directly from our Pandas DataFrame:

In [None]:
gdf = gpd.GeoDataFrame(pdf)

Next, we need to tell GeoPandas which Series contains the spatial information: 

In [None]:
gdf = gdf.set_geometry('geometry')

Did you get an error? Can you work out why? [This thread](https://gis.stackexchange.com/questions/267801/csv-to-geodataframe-how-to-have-valid-geometry-objects) might help.

The problem here is that the _geometry_ column is a `string` type, but it needs to be a `shapely.geometry` type (details [here](http://toblerity.org/shapely/manual.html) if you want to find out more). To make this conversion we use the following code (you don't need to understand how this works for now, just remember that you'll need to do it again in future):

In [None]:
from shapely.wkt import loads
gdf['geometry'] = gdf['geometry'].apply(lambda x: loads(x))

We can check what has changed by comparing the `type` of the _geometry_ columns in `gdf` vs `pdf`: 

In [None]:
print('\tgdf geometry type: ' + str(type(gdf.geometry)))
print('\n')
print('\tpdf geometry type: ' + str(type(pdf.geometry)))

Because the _geometry_ column is now of type `GeoSeries` we should now be able to set the geometry of the GeoPandas DataFrame without problems:

In [None]:
gdf = gdf.set_geometry('geometry')

Hopefully there was no error!

### GeoPandas Inheritance

We'll make our first map soon, but let's check we understand how GeoPandas _inherits_ functionality from Pandas. First, let's check what data type `gdf` is using the [`isinstance`](https://www.w3schools.com/python/ref_func_isinstance.asp) function:

In [None]:
if isinstance(gdf, gpd.GeoDataFrame): # Is gdf a GeoDataFrame object?
    print("\tI'm a geopandas data frame!")

if isinstance(gdf, pd.DataFrame): # Is gdf *also* a DataFrame object?
    print("\tI'm also a pandas data frame!")

So this is interesting: `gdf` is _both_ a Pandas DataFrame _and_ a GeoPandas DataFrame. This means we can use both Pandas and GeoPandas methods on `gdf`. Here are some Pandas methods:

In [None]:
print("What is the lsoa11nm column type: ")
print('\tNAME type: ' + str(type(gdf.LSOA11NM)))

In [None]:
print(gdf.LSOA11NM.describe())

In [None]:
print(gdf.sample(3))

And now to show that GeoPandas methods will work only on our GeoPandas DataFrame and _not_ our original Pandas DataFrame, let's try making a map!

With GeoPandas we do this simply with `plot()`:

In [None]:
gdf.plot()

Your first map! Nice. 

If we tried the same with our **Pandas** DataFrame:

In [None]:
pdf.plot()

Not quite the same...

## More maps!

Have a look at that first map you made above. What do you think it is showing? 

Not much really. Every LSOA polygon is coloured the same shade of blue (with a white outline, making larger LSOAs look 'more blue'). We could make something much more interesting by telling GeoPandas to colour the polygons according to some value, like one of our variables in the data: 

In [None]:
gdf.plot(column='HHOLDS')

Cool! But what do the colours mean? We can add a legend quite simply:

In [None]:
gdf.plot(column='HHOLDS', legend=True)

Adding things like a plot title or legend labels isn't quite so easy. We'll take a look at how to do this below, but if you take this simple approach for plots in your reports you will need to make sure you are very clear in your figure captions what the figure shows (e.g. including what the units for the numbers are).   

We can also change the shade of colours used using the `cmap` (colormap) option: 

In [None]:
gdf.plot(column='HHOLDS', legend=True, cmap='OrRd')

For a full list of colormaps possible, see the [matplotlib website](http://matplotlib.org/users/colormaps.html). There are [numerous resources online](https://tilemill-project.github.io/tilemill/docs/guides/tips-for-color/) to help you think about the most appropriate map colouring for the data you want to visualise. The [colorbrewer](http://colorbrewer2.org) tool is particularly useful and [this overview](https://www.e-education.psu.edu/geog486/node/1867) is also helpful. The viridis colormap is [used by default](http://bids.github.io/colormap/) and is one of the best all-round colour schemes for [numerous](https://stats.stackexchange.com/a/223324) [reasons](https://www.youtube.com/watch?v=xAoljeRJ3lU). 

The four main matplotlib colormaps are:

[![colors](https://matplotlib.org/users/plotting/colormaps/lightness_00.png)](https://matplotlib.org/users/colormaps.html)

But there are many others:

[![colors](http://www.scipy-lectures.org/_images/plot_colormaps_1.png)](https://matplotlib.org/users/colormaps.html)

As an alternative to the continuous colouring approach used above, we can also classify the way colour maps are scaled using the `scheme` option. For example:

In [None]:
gdf.plot(column='HHOLDS', legend=True, scheme='quantiles', k=3)

The `scheme` option uses functionality from the [PySAL](https://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html) package and can be one of the following three types:
1. `equal_interval`
2. `quantiles`
3. `fisher_jenks`

In each case, we can provide an additional parameter `k` to the plot method to specify how many classes should be created. For example, above we used `k=3`.
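
To get a feel for how the choice of scheme changes the classes, here is a rough sketch of how equal-interval and quantile break points could be worked out by hand for `k=3`. The data values and the helper names are made up for illustration, and PySAL may compute its quantile breaks in a slightly different way, so treat this as an explanation of the idea and not as a copy of the library's code.

```python
from statistics import quantiles


def equal_interval_breaks(values, k):
    """Upper bounds of k classes that all have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper bounds of k classes that hold roughly equal numbers of values."""
    return quantiles(values, n=k, method='inclusive') + [max(values)]


def class_counts(values, breaks):
    """Count how many values fall at or below each upper bound."""
    counts = [0] * len(breaks)
    for v in values:
        for i, upper in enumerate(breaks):
            if v <= upper:
                counts[i] += 1
                break
    return counts


# Made-up, strongly skewed data: nine small values and one very large one
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

ei = equal_interval_breaks(values, 3)
qu = quantile_breaks(values, 3)

print(ei, class_counts(values, ei))   # [34.0, 67.0, 100.0] [9, 0, 1]
print(qu, class_counts(values, qu))   # [4.0, 7.0, 100] [4, 3, 3]
```

With skewed data the equal-interval classes are badly unbalanced (one class is empty), while the quantile classes each hold about a third of the values.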

You'll note that the legend is overlapping the map, which is not ideal, but we'll deal with that later. 

### Exercises

1. Create an equal interval map with 6 classes for the _SocialRented_ variable using the _magma_ colormap. Do not include a legend.


2. Create a map of the _Owned_ variable, classified into quintiles. Use the default colormap but include a legend.

## Revisiting Classification

Classification matters. So before we continue with learning about the various mapping options we need to think more about this. 

I realise that this topic can seem a little... _dry_, but perhaps this video will give you some sense of how the choices you make about representing your data can significantly alter your understanding of the data.

[![Do maps lie?](http://img.youtube.com/vi/G0_MBrJnRq0/0.jpg)](http://www.youtube.com/watch?v=G0_MBrJnRq0)

As we saw above, PySAL has a classification 'engine' that we can use to bin data based on attribute values. Although it is possible _both_ to do some classification _without_ PySAL and to do some mapping _without_ GeoPandas, the combination of the two is simplest: PySAL has additional classification methods that are _specific_ to geographic analysis problems, and GeoPandas is just plain easier to work with. 

Let's see how (spatial) classification in PySAL is equivalent to the (a-spatial) classification we've done previously in Pandas by calculating and mapping quintiles. 

First, calculate the quintiles using the Pandas `quantile` method:

In [None]:
Quintiles = pdf.Owned.quantile([0,0.2,0.4,0.6,0.8,1])
print(Quintiles)

Remember that quintiles split the data into five parts containing roughly equal proportions of the data. 

So from the output above you should be able to see, for example, that in the lowest 20% of LSOAs fewer than `199` households are owned by their occupier.

We could also [calculate these values using PySAL](https://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html#pysal.esda.mapclassify.Quantiles) (as `GeoPandas` does; `PySAL` expects `numpy` arrays instead of a Series, but fortunately it's easy to convert one to the other since `Pandas`/`GeoPandas` is built on top of `numpy`...)

In [None]:
OwnedA = np.array(pdf.Owned) # Convert the Series to an array
Ownedq5 = ps.esda.mapclassify.Quantiles(OwnedA, k=5) # Classify into 5 quantiles
print(Ownedq5) # Print summary metrics

There's more information here and we can see how many LSOAs have fallen into each class - it's roughly the same number in each class, which is good because that's the point of quantiles. 

To visualise this distribution numerically, let's plot a histogram of these quantiles using a Seaborn `distplot` with the `Pandas` DataFrame:

In [None]:
sns.distplot(pdf.Owned, bins=Quintiles, norm_hist=True, kde=False)

In this histogram we forced the y-axis to be the 'frequency density' so that we can see the relative frequency of each of the quintile classes (known as 'bins' in a histogram and shown as the height of the bars). For example, we can see that the bin with highest frequency density is the third quintile, which contains all the LSOAs with values of 'Owned' between `291` and `366` (where did I get those values from?). 

You should also be able to see where the first bin stops at around `199` (because remember we saw above that the first quintile is all values up to `199`). 

What if we had plotted the absolute frequency (i.e. the count) of LSOAs in each bin on the y-axis instead of the frequency density: 

In [None]:
sns.distplot(pdf.Owned, bins=Quintiles, kde=False)

Think about why all the bars are roughly the same height... it's because by definition each quintile should contain roughly the same number of LSOAs (i.e. 20% in each bin). 
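
The contrast between the two histograms comes down to one piece of arithmetic: frequency density is a bin's share of the data divided by the bin's width. The sketch below uses the quintile edges quoted in this practical and assumes, for simplicity, that every quintile holds exactly 20% of the LSOAs (in the real data the shares are only roughly equal).

```python
# Quintile edges for 'Owned' as quoted in this practical
edges = [5, 199, 291, 366, 452, 710]
share = 1 / 5   # ASSUMPTION: each quintile holds exactly 20% of the LSOAs

bins = list(zip(edges[:-1], edges[1:]))      # (lower, upper) for each bin
widths = [hi - lo for lo, hi in bins]        # how wide each bin is
densities = [share / w for w in widths]      # bar height when norm_hist=True

for (lo, hi), w, d in zip(bins, widths, densities):
    print(f"{lo:>3} to {hi:>3}: width {w:>3}, density {d:.5f}")

tallest = densities.index(max(densities))
print("Tallest bar is bin number", tallest + 1)   # Tallest bar is bin number 3
```

Equal shares give equal bar heights when we plot counts, but once each share is divided by its width the narrowest bin (`291` to `366`) becomes the tallest bar.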

It's difficult to see where one bar stops and another begins, so let's add some lines to help:

In [None]:
sns.distplot(pdf.Owned, bins=Quintiles, kde=False)

for b in Quintiles:
    plt.vlines(b, 0, 1000, color='red', linestyle='--') 

Okay so, we've now thoroughly examined the _numerical_ distribution of data for the _Owned_ variable, but **where** are the LSOAs in each of those five bins? What is the **spatial** distribution? Where are the LSOAs in the lowest quintile (i.e. LSOAs with _Owned_ values of 5 to 199) vs those in the highest quintile (i.e. LSOAs with _Owned_ values of 452 to 710)?

Well, if you completed Exercise 2 above, you've already answered these questions using the following code:

In [None]:
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True)

Remember that `GeoPandas` is using the `PySAL` functionality to do its classification. 

See how the values in the legend match those in the quintiles we calculated above! And so now we can talk about how the numeric distribution we saw in the histograms above has fallen out spatially. 

### Task

a) Use prose in the box below to describe the spatial distribution of 'Owned' households across London using the quantiles maps 

Your answer here:


But is this the best way to create our classes for visualisation?

One of the best-known geographical classifications is Fisher-Jenks (also sometimes known as 'Natural Breaks'), which groups data into bins so as to minimise the sum of squared deviations _within_ each class (and so maximise the differences _between_ classes): in other words, the algorithm iteratively looks for a way to group the data into a specified number of bins such that moving a data point from one group to another would increase the total within-class deviation observed in the data.
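
The idea of 'minimising within-class deviation' can be illustrated with a brute-force search on a tiny, made-up dataset: try every possible way of cutting the sorted values into `k` groups and keep the one with the smallest total of squared deviations from each group's mean. This is emphatically not how PySAL implements Fisher-Jenks (trying every combination would be far too slow on real data); it is only a sketch of the quantity being minimised, and the function names are our own.

```python
from itertools import combinations


def within_class_ss(groups):
    """Total squared deviation of every value from its own group's mean."""
    total = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        total += sum((x - mean) ** 2 for x in g)
    return total


def brute_force_breaks(values, k):
    """Try every way of cutting the sorted values into k non-empty groups."""
    data = sorted(values)
    best_groups, best_ss = None, float('inf')
    for cuts in combinations(range(1, len(data)), k - 1):
        bounds = (0,) + cuts + (len(data),)
        groups = [data[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
        ss = within_class_ss(groups)
        if ss < best_ss:
            best_groups, best_ss = groups, ss
    return best_groups, best_ss


# Made-up values with three obvious clumps
toy = [1, 2, 3, 10, 11, 12, 30, 31, 32]
groups, ss = brute_force_breaks(toy, 3)

print(groups)   # [[1, 2, 3], [10, 11, 12], [30, 31, 32]]
print(ss)       # 6.0
```

Moving any value into a neighbouring group (say, putting `10` in with `1, 2, 3`) would raise the total above `6.0`, which is exactly the property described above.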

Let's try the same as above but using (five) Fisher Jenks breaks. Pandas cannot do this so we can only [use PySAL here](https://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html#pysal.esda.mapclassify.Fisher_Jenks) (_**this may take some time depending on your laptop but probably at least one minute**_):

In [None]:
#OwnedA = np.array(pdf.Owned) #we don't need to create this again
OwnedFJ = ps.esda.mapclassify.Fisher_Jenks(OwnedA, k=5) # Classify into Natural Breaks
print(OwnedFJ) # Print summary metrics

Hopefully you can see how the five classes created using this method are different from those created by the quantiles method we used above. Note how they are unequal in size, for example.

Let's create a histogram to visualise the numeric distribution of the Fisher Jenks classification. First we need to get the values to create the bins in the histogram. Looking at the example in [the documentation for the PySAL `mapclassify.Fisher_Jenks` function](https://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html#pysal.esda.mapclassify.Fisher_Jenks) we can see that we can access the bins directly from the object:

In [None]:
print(OwnedFJ.bins)

But note that these are the upper limits of the bin, so we need to add the very lowest value. [We can do this](https://stackoverflow.com/a/36998277/10219907) using the `numpy` `insert` function:

In [None]:
print(type(OwnedFJ.bins))  #just to prove this is a numpy array
OwnedFJbins = np.insert(OwnedFJ.bins, 0, pdf.Owned.min())  #insert the minimum value into the zeroth position of the array 
print(OwnedFJbins)   #print to check what it looks like

In [None]:
sns.distplot(pdf.Owned, bins=OwnedFJbins, norm_hist=True, kde=False)

Now let's create a map using the Fisher Jenks classification (remember the map will take a while to plot because GeoPandas is using the `fisher_jenks` function from PySAL):

In [None]:
gdf.plot(column='Owned', scheme='fisher_jenks', k=5)

Compare the map you just made to the one we made using quintiles. Do you think their representation of the spatial distribution is very different? 

There are differences, but I'd say they look pretty similar... this is also reflected in the fact that the shape of their histograms is also very similar.  

Let's see if a map created using an `equal_interval` classification looks any different. We'll run the code all in one cell to output the classification, histogram and map:

In [None]:
OwnedEQ = ps.esda.mapclassify.Equal_Interval(OwnedA, k=5) #classify into 5 equal-width intervals
print(OwnedEQ) # Print summary metrics

OwnedEQbins = np.insert(OwnedEQ.bins, 0, pdf.Owned.min())  #insert the minimum value into the zeroth position of the array 
sns.distplot(pdf.Owned, bins=OwnedEQbins, kde=False)  #plot histogram

gdf.plot(column='Owned', scheme='equal_interval', k=5) #plot map

This visualisation of the spatial distribution of _Owned_ households using the Equal Interval classification looks a little more different (compared to using the Fisher Jenks or quantiles classifications). This is mainly because there are fewer LSOAs classified in the highest (and lowest) classes, meaning that the extreme values are more obvious on the map. You should also be able to see the difference between the shape of the histogram for an Equal Interval classification versus the other classifications.  

### Tasks

For the next two tasks, you have already produced all the information you need (so no more code needed). Hopefully, answering these tasks will reinforce your understanding about how the three classifications we have tried differ.

b) In the next cell, report the number of LSOAs in the highest class for each of the three classifications:

Your answer here:
    

c) In the next cell, report the interval of values that composes the highest class for each of the three classifications:

Your answer here:
    

### Exercises

3. Create three more maps for the _Owned_ variable using the three classifications, but this time use **seven** classes in each classification.  

4. Create three maps for _GreenspaceArea_ using each of the three classifications (use five groups for each classification). For each of these three maps, also output histograms. 

Why are the differences between the three maps for _GreenspaceArea_ more obvious than for the _Owned_ variable we looked at above? 

## Map Modifications

GeoPandas uses much of its plotting functionality from matplotlib, and we can use this to modify and hopefully improve how well our maps communicate. Seaborn also uses matplotlib so hopefully some of the code patterns below will look familiar. 

First, in the classified choropleth maps above when we included a legend we found that it overlapped with the actual map. Legends are still pretty rudimentary in GeoPandas and moving them is not straight-forward (if you can find an easy way, please share!). So currently the easiest way to ensure the legend does not overlap the map itself is to make the size of our plot larger using the `figsize` option when creating a subplot:

In [None]:
fig, ax1 = plt.subplots(1, figsize=(15, 12))   # create figure and axes for Matplotlib 
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1)  #include ax argument!
plt.show()

Alternatively, we could keep the plot smaller but change the axis limits of the plot so that there is more white space:

In [None]:
fig, ax1 = plt.subplots(1, figsize=(8, 6))   # create figure and axes for Matplotlib
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1) #include ax argument!
ax1.set_xlim(500000,586000)   #play with these values
plt.show()

In the code above, I just used trial-and-error until I found the right values for `set_xlim` given the size of the plot I wanted. These workarounds aren't the most elegant or efficient but they work for making single maps. 
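
If you would rather not guess, one option is to start from the extent of the data and add a fixed proportion of extra space on the right. The sketch below shows only the arithmetic; the helper name `padded_xlim` and the example coordinates are made up. With real data the extent could be taken from the GeoDataFrame's `total_bounds` attribute, which gives `(minx, miny, maxx, maxy)`.

```python
def padded_xlim(minx, maxx, right_pad=0.25):
    """Return x-axis limits with extra white space added on the right.

    right_pad is the extra space as a fraction of the data's width.
    """
    width = maxx - minx
    return minx, maxx + right_pad * width


# Made-up extent, roughly on the scale of the coordinates used above
left, right = padded_xlim(500000, 560000, right_pad=0.5)
print(left, right)   # 500000 590000.0

# In the notebook you could then (hypothetically) write:
#   minx, miny, maxx, maxy = gdf.total_bounds
#   ax1.set_xlim(*padded_xlim(minx, maxx, right_pad=0.5))
```

You will probably still need to adjust `right_pad` by eye, but the result then scales with the data instead of being hard-coded.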

Using more functionality from matplotlib we can make our map look much cleaner: 

In [None]:
fig, ax1 = plt.subplots(1, figsize=(15, 12))   
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1, edgecolor='grey', linewidth=0.2)  #change line style
ax1.axis('off')  #don't plot the axes (bounding box)
ax1.set_title('Ownership', fontdict={'fontsize': '20', 'fontweight' : '3'})  #provide a title
ax1.annotate('Source: London Datastore (2011)',xy=(0.1, 0.1),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')  #add source info on the image itself
ax1.get_legend().set_title("Households")  #set the legend title
plt.show()

In the code above check you can see where we:
1. change the colour and width of LSOA outlines so they are easier to see
2. set the figure title to communicate the variable being mapped
3. set the legend title to communicate the units of the data being mapped
4. add an annotation showing the source of the data for the map.

Here's another trick to modify the legend labels:

In [None]:
fig, ax1 = plt.subplots(1, figsize=(15, 12))   
ax1 = gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1, edgecolor='grey', linewidth=0.2)  #change line style
ax1.axis('off')  #don't plot the axes (bounding box)
ax1.set_title('Ownership', fontdict={'fontsize': '20', 'fontweight' : '3'})  #provide a title
ax1.annotate('Source: London Datastore (2011)',xy=(0.1, 0.1),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')  #add source info on the image itself

#edit legend labels
leg = ax1.get_legend()
leg.get_texts()[0].set_text('5 - 199')
leg.get_texts()[1].set_text('199 - 291')
leg.get_texts()[2].set_text('291 - 366')
leg.get_texts()[3].set_text('366 - 452')
leg.get_texts()[4].set_text('452 - 710')
leg.set_title("Households")

plt.show()
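
Typing each label by hand is fine for five classes but gets tedious (and error-prone) for more. As a sketch of an alternative, the labels can be built from a list of class edges and applied in a loop. The edges below are the quintile values used above; the helper `apply_labels` is our own and simply wraps the same `get_texts()` and `set_text()` calls.

```python
# Class edges: the minimum followed by the upper bound of each class
edges = [5, 199, 291, 366, 452, 710]

# Pair each edge with the next one to build 'lower - upper' labels
labels = [f"{lo} - {hi}" for lo, hi in zip(edges[:-1], edges[1:])]
print(labels)   # ['5 - 199', '199 - 291', '291 - 366', '366 - 452', '452 - 710']


def apply_labels(leg, labels):
    """Replace the text of each legend entry with the matching label."""
    for text, label in zip(leg.get_texts(), labels):
        text.set_text(label)

# In the cell above you could then (before plt.show()) write:
#   apply_labels(ax1.get_legend(), labels)
```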

Above we saw how we can change the axis limits to zoom out to make room for the legend. But we can also use this to zoom in on our map to a specific location we want to show in more detail:

In [None]:
fig, ax1 = plt.subplots(1, figsize=(8, 6))   
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1, edgecolor='grey', linewidth=0.2) 
ax1.set_xlim(535000,550000)   #play with these values
ax1.set_ylim(177000,190000)   #play with these values
plt.show()

## Saving maps!

There are a few ways you can get your maps out of a notebook and into a Word document (e.g. to use in your reports). One might be to right-click (or Cmd-click on Mac), select 'copy image' and then paste directly into Word. However, this doesn't always work very well, so it is better to save your map to a file and then insert it. 

You could save your map to file by right-clicking again (or Cmd-click on Mac) and selecting 'Save Image As'. Then you could insert the saved file into your Word document (by going to Insert -> Pictures). 

However, probably the best way to save a map to file is to use some code (see the final line of code below):

In [None]:
fig, ax1 = plt.subplots(1, figsize=(8, 6))   
gdf.plot(column='Owned', scheme='quantiles', k=5, legend=True, ax=ax1, edgecolor='grey', linewidth=0.2) 
ax1.set_xlim(535000,550000)   
ax1.set_ylim(177000,190000)   

#and here is the key line of code
fig.savefig('map_export.png', dpi=300)

Go and check your working directory (usually where you have saved this notebook) to check the image was created!

You can [read more about the `savefig` method](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html) to see more options available, but `dpi` is the argument that will change the size of your output image. Remember, files will be saved to whatever your current working directory is, but you can include a path if you want to save elsewhere. 
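
As a rule of thumb, the size of the saved image in pixels is the figure size in inches multiplied by the `dpi`. The sketch below just does that arithmetic (the helper name `pixel_size` is ours, not part of matplotlib); the actual file can differ a little, for example if you ask `savefig` to trim white space.

```python
def pixel_size(figsize, dpi):
    """Approximate (width, height) in pixels of a saved figure."""
    width_in, height_in = figsize
    return round(width_in * dpi), round(height_in * dpi)


print(pixel_size((8, 6), 300))   # (2400, 1800)
print(pixel_size((8, 6), 100))   # (800, 600)
```

So the `figsize=(8, 6)` map above saved with `dpi=300` should come out at roughly 2400 by 1800 pixels, which is plenty for a printed report.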

### Map Layers 

So far we have looked at only a single variable on a map, but it is possible to present data for multiple variables by layering them on top of one another. To examine how to do this we will download a shapefile containing data on the location of parks and greenspace areas in London. 

Here's a function to download the data, write it to a directory (creating it if it does not exist) and unzip the data:

In [None]:
import requests
import zipfile
import re
import os

#function to download data and extract to a directory
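def download_gsa_geodata(src_dir, dst_dir, ddir):
    # src_dir: URL of the zip file; dst_dir: local path for the zip; ddir: folder to extract into

    if not os.path.exists(dst_dir):   # only download if we don't already have the zip
        zip_folder = os.path.dirname(dst_dir)
        if zip_folder:                # empty if saving into the working directory
            os.makedirs(zip_folder, exist_ok=True)
        r = requests.get(src_dir)
        r.raise_for_status()          # stop here if the download failed
        with open(dst_dir,'wb') as f:
            f.write(r.content)

    os.makedirs(ddir, exist_ok=True)  # create the extraction folder if needed

    with zipfile.ZipFile(dst_dir, 'r') as zp:   # closes the zip file for us
        zp.extractall(ddir)

    print("Done.")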

See how this function allows us to download data to a particular directory:

In [None]:
src = 'https://github.com/kingsgeocomp/geocomputation/blob/master/data/Greenspace.zip?raw=true'
dst = 'shapes/Greenspace.zip'
zpd = 'shapes/'
download_gsa_geodata(src, dst, zpd)

If this worked you should now have a folder named _shapes_ in your working directory (i.e. wherever you have saved this notebook) and, within that, another folder named _Greenspace_ containing some shapefile data.
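
If the folder does not look as you expected, it can help to list what a zip archive contains before (or after) extracting it. The sketch below uses only the standard library and, so that it runs without a download, builds a tiny stand-in archive in a temporary directory; the archive and the file names inside it are made up for illustration.

```python
import os
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "Example.zip")

# Build a stand-in archive (hypothetical file names, empty placeholder content)
with zipfile.ZipFile(zip_path, "w") as zp:
    zp.writestr("Example/layer.shp", b"")
    zp.writestr("Example/layer.dbf", b"")

# List the contents, then extract them to a directory that is created if needed
extract_dir = os.path.join(work_dir, "shapes")
os.makedirs(extract_dir, exist_ok=True)
with zipfile.ZipFile(zip_path, "r") as zp:
    members = zp.namelist()
    zp.extractall(extract_dir)

print(members)
print(os.path.exists(os.path.join(extract_dir, "Example", "layer.shp")))
```

Pointing `zipfile.ZipFile` at your own downloaded zip file and calling `namelist()` in the same way shows the folder and file names you should expect to find after extraction.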

So now we should be able to load this greenspace shapefile as a GeoPandas DataFrame:

In [None]:
shp_path = os.path.join('shapes','Greenspace','GLA Greenspace.shp')  #this uses the os package to ensure paths work on any operating system
print("Loading data from: " + shp_path)
green_df = gpd.read_file(shp_path)
green_df.head(3)

If the data read successfully, you should be able to see the top three lines of the data, with a _geometry_ column on the far right. 

Let's see if we can plot this:

In [None]:
green_df.plot()

Hopefully you got something that looked a bit like a map (well, maybe just a bunch of blue blobs...). If so, then we have loaded our data properly as a GeoPandas DataFrame. 

So, let's try to plot our greenspace (parks) data on top of the Owned data (we'll also zoom in so we can see the result in a little more detail):

In [None]:
fig, ax1 = plt.subplots(1, figsize=(12, 10))  #setup the figure

base = gdf.plot(ax=ax1, column='Owned', scheme='quantiles', k=5, cmap='OrRd')       #plot the Owned data layer on ax1
parks = green_df.plot(ax=ax1, color=(0.1, 0.5, 0.2), alpha=0.75, edgecolor='green') #plot the parks data layer on ax1

ax1.set_xlim(535000,550000)   #zoom
ax1.set_ylim(177000,190000)   #zoom
plt.show()

The key here is that both plots use `ax=ax1` ensuring that they are plotted on the same set of axes. The first plot is on the bottom, then additional plots are layered on top. 

Try removing `ax=ax1` from one of the plot calls to see what happens when we don't plot on the same axes:

In [None]:
fig, ax1 = plt.subplots(1, figsize=(12, 10))  #setup the figure

base = gdf.plot(ax=ax1, column='Owned', scheme='quantiles', k=5, cmap='OrRd')       #plot the Owned data layer on ax1
parks = green_df.plot(color=(0.1, 0.5, 0.2), alpha=0.75, edgecolor='green')         #plot the parks data layer but NOT on ax1

ax1.set_xlim(535000,550000)   #zoom
ax1.set_ylim(177000,190000)   #zoom
plt.show()
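
One way to confirm which axes a layer ended up on is to look at what `plot` returns: GeoPandas hands back the matplotlib axes it drew on. The sketch below uses made-up square polygons in place of `gdf` and `green_df`, and leaves out `scheme` to keep it small.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Made-up stand-ins for the two layers used in this practical
areas = gpd.GeoDataFrame(
    {"Owned": [10, 20, 30, 40]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2), box(1, 1, 2, 2)],
)
parks = gpd.GeoDataFrame({"name": ["park"]}, geometry=[box(0.5, 0.5, 1.5, 1.5)])

fig, ax1 = plt.subplots(1, figsize=(6, 6))
base_ax = areas.plot(ax=ax1, column="Owned", cmap="OrRd")
same_ax = parks.plot(ax=ax1, color=(0.1, 0.5, 0.2), alpha=0.75, edgecolor="green")
other_ax = parks.plot(color=(0.1, 0.5, 0.2))   # no ax given, so a new figure is created

print(base_ax is ax1, same_ax is ax1)   # are both layers on ax1?
print(other_ax is ax1)                  # is the third layer on ax1?
plt.close("all")
```

When `ax` is left out, the layer is drawn on a fresh set of axes in its own figure, which is why two separate plots appear instead of one layered map.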

## Merging Spatial and Non-Spatial Data

Finally, let's consider the situation where we don't have an existing (single) dataset that contains geometry, but rather we have two files:
1. a shapefile containing geometry for spatial elements
2. a datafile containing data for spatial elements but no geometry

In this case, as long as we have a common identifier for each row in the two files, we can use the Pandas `merge` function to join them together. 

For example, let's assume the data we have are:
1. A shapefile delineating LSOAs in and around London 
2. A csv file containing air quality data for each of our London LSOAs

First, let's download the shapefile data (using the `download_gsa_geodata` function we defined above):

In [None]:
# re-use the function from above but with new variables
src = 'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOAs.zip?raw=true'
dst = 'shapes/LSOAs.zip'
zpd = 'shapes/'
download_gsa_geodata(src, dst, zpd)

And now read the downloaded data as a GeoPandas DataFrame:

In [None]:
#next line uses the os package to ensure paths work on any operating system
shp_path = os.path.join('shapes','lsoas','Lower_Layer_Super_Output_Areas_December_2011_Generalised_Clipped__Boundaries_in_England_and_Wales.shp')

print("Loading data from: " + shp_path)
lsoa_gdf = gpd.read_file(shp_path)

Let's print the first few lines of the shapefile data and plot it to see what it looks like:

In [None]:
print(lsoa_gdf.head(3))
lsoa_gdf.plot()

Looks okay, but note the map - it looks like there might be more LSOAs than in our previous maps? Let's check the dimensions of this DataFrame:

In [None]:
print(lsoa_gdf.shape)

Hopefully you can see that there are more rows in this DataFrame than the one we have been using before. 

Next, let's load the csv data file as a Pandas DataFrame and look at the top few lines:  

In [None]:
AQ_pdf = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA_AirQuality.csv.gz?raw=true',
    compression='gzip', low_memory=False) # low_memory=False reads the file in one go rather than in chunks, so each column gets one consistent data type

print(AQ_pdf.head(3))
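
An alternative to relying on type inference is to state the column types yourself with the `dtype` argument. The sketch below does this on a tiny gzip-compressed CSV built in memory; the two rows of values are invented, and only the column names are taken from this practical.

```python
import gzip
import io

import pandas as pd

# Invented two-row file, compressed in memory so no download is needed
raw = "LSOA11CD,NOxmean\nE01000001,40.1\nE01000002,35.7\n".encode("utf-8")
buf = io.BytesIO(gzip.compress(raw))

# Read the codes as text and the measurements as floating point numbers
toy_aq = pd.read_csv(buf, compression="gzip",
                     dtype={"LSOA11CD": str, "NOxmean": float})

print(toy_aq.dtypes)
print(toy_aq.shape)
```

Stating the types is most useful for identifier columns, where you want codes to be kept as text whatever they happen to look like.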

And check the dimensions of this DataFrame:

In [None]:
print(AQ_pdf.shape)

There are certainly more rows in one DataFrame than the other. 

But look at the column names of our two data files; can you see a common column between them?

The _LSOA11CD_ and _lsoa11cd_ columns contain the same codes, so we can use these to merge the data into a single DataFrame. We'll do an _inner join_ (see week 7) with the Air Quality data on the left. There are more rows in the shapefile than in the Air Quality data, and an inner join retains only those rows from the shapefile for which we have air quality data: 

In [None]:
LSOA_AQ = pd.merge(AQ_pdf, lsoa_gdf, how="inner", left_on='LSOA11CD', right_on='lsoa11cd')

Let's check what the top of our new DataFrame looks like:

In [None]:
print(LSOA_AQ.head())

And let's check the dimensions of our new DataFrame:

In [None]:
print(LSOA_AQ.shape)

So we have the same number of rows as `AQ_pdf`, but the number of columns is the total of the two original DataFrames (check that you can see from the outputs above how we know this). 
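
The same row and column arithmetic can be checked on a toy example. The two small tables below are invented stand-ins for `AQ_pdf` and `lsoa_gdf` (without any geometry); the `indicator=True` option on an outer join is a convenient way to see which codes failed to match.

```python
import pandas as pd

# Invented stand-ins: 3 rows of air quality data, 5 rows of 'shapefile' attributes
aq = pd.DataFrame({"LSOA11CD": ["E01", "E02", "E03"],
                   "NOxmean": [40.1, 35.7, 52.3]})
shapes = pd.DataFrame({"lsoa11cd": ["E01", "E02", "E03", "E04", "E05"],
                       "name": ["a", "b", "c", "d", "e"]})

# Inner join: only codes present in both tables survive
merged = pd.merge(aq, shapes, how="inner", left_on="LSOA11CD", right_on="lsoa11cd")
print(aq.shape, shapes.shape, merged.shape)

# Outer join with an indicator column: shows where each row came from
check = pd.merge(aq, shapes, how="outer",
                 left_on="LSOA11CD", right_on="lsoa11cd", indicator=True)
print(check["_merge"].value_counts())
```

Because the two key columns have different names, both are kept in the result, which is why the column count is the sum of the two inputs.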

And finally, let's check what _type_ of DataFrame we've created:

In [None]:
print(type(LSOA_AQ))

We've created a Pandas DataFrame, because the table on the left of the merge (`AQ_pdf`) was a Pandas DataFrame. So if we want to make some maps we need to convert it to a GeoPandas DataFrame; to do this we need to specify which column contains the geometry:

In [None]:
#in this case we don't need the following lines, we can just set the geometry from the raw data
#from shapely.wkt import loads
#LSOA_AQ['geometry'] = LSOA_AQ['geometry'].apply(lambda x: loads(x))

LSOA_AQ = LSOA_AQ.set_geometry('geometry')
print(type(LSOA_AQ))
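
An alternative is to put the GeoPandas DataFrame on the left of the merge, in which case the result is already a GeoPandas DataFrame. The sketch below contrasts the two orders using made-up points and values; only the column names are borrowed from this practical.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Made-up attribute table and geometry table
attrs = pd.DataFrame({"LSOA11CD": ["E01", "E02"], "NOxmean": [40.1, 35.7]})
geoms = gpd.GeoDataFrame({"lsoa11cd": ["E01", "E02", "E03"]},
                         geometry=[Point(0, 0), Point(1, 1), Point(2, 2)])

# Non-spatial table on the left: the result is a plain pandas DataFrame
left_plain = pd.merge(attrs, geoms, how="inner",
                      left_on="LSOA11CD", right_on="lsoa11cd")

# GeoPandas DataFrame on the left: the result stays a GeoPandas DataFrame
left_geo = geoms.merge(attrs, how="inner",
                       left_on="lsoa11cd", right_on="LSOA11CD")

# The GeoDataFrame constructor is another route from the plain result
as_geo = gpd.GeoDataFrame(left_plain, geometry="geometry")

print(type(left_plain).__name__, type(left_geo).__name__, type(as_geo).__name__)
```

With an inner join the same rows are kept whichever table is on the left, so the choice mainly affects the type of the result and the order of the columns.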

And now let's try to plot it spatially:

In [None]:
LSOA_AQ.plot()

Hopefully that was successful! Does this map have the same shape as our previous maps?

Now that we have created this new GeoPandas DataFrame, we can start to examine the Air Quality data spatially. For example: 

In [None]:
fig, ax1 = plt.subplots(1, figsize=(15, 12))   
ax1 = LSOA_AQ.plot(column='NOxmean', 
                   scheme='quantiles', 
                   k=7, 
                   legend=True, 
                   ax=ax1, 
                   edgecolor='grey',    
                   linewidth=0.2)       
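
To get a feel for what a quantile classification does to the data behind a map like this, here is a pandas-only sketch on made-up, skewed values. `pd.qcut` and `pd.cut` are used as stand-ins for quantile and equal-interval schemes; they are not necessarily what GeoPandas calls internally, so class breaks on a real map may differ slightly.

```python
import pandas as pd

values = pd.Series([x ** 2 for x in range(1, 71)])   # 70 made-up, right-skewed values

quantile_classes = pd.qcut(values, q=7, labels=False)    # 7 classes with equal counts
interval_classes = pd.cut(values, bins=7, labels=False)  # 7 classes of equal width

# Number of observations that fall in each class, from lowest to highest
quantile_counts = quantile_classes.value_counts().sort_index().tolist()
interval_counts = interval_classes.value_counts().sort_index().tolist()

print(quantile_counts)
print(interval_counts)
```

Quantiles put the same number of areas in every colour class, whereas equal intervals can leave most areas in one or two classes when the data are skewed.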

### Final Exercises

5. Create a map of Ownership using five equal intervals, but only for LSOAs with a total number of households greater than 800 (you'll need to do some data selection for this). Map these LSOAs on a grey background showing the area of London (you'll need to do some layering for this). The final map should look something like that below. 

![ownership](https://kingsgeocomputation.files.wordpress.com/2018/11/ownership800.png)

6. Create a map of _Mean PM10_ for central London with roads layered on top. Classify the air quality data into five quantiles and zoom in towards central London. A roads data shapefile is available at: https://github.com/kingsgeocomp/geocomputation/blob/master/data/Roads.zip?raw=true (use the `download_gsa_geodata` function above to download and extract the data). Your final map should look something like the map below.

_How well would you say the (incomplete) road map aligns with air quality?_

In [None]:
#code here


Comment here:


![meanPM10](https://kingsgeocomputation.files.wordpress.com/2018/11/pm10central.png)

7. Create two maps of GreenspaceArea and two maps of log-transformed GreenspaceArea (you will need to create this transformed data - see week 6). For each variable create maps using quantiles and equal interval classifications (both with 5 classes). If you can, plot these four maps in a single figure composed of four sub-plots with appropriate sub-titles. The final map should look something like that below. 

_Looking at the four maps answer the following questions_:
- Why do the two quantile maps look the same visually?
- Why do the two equal interval maps look different? 

Comment here:


In [None]:
#code here


![gsaclass](https://kingsgeocomputation.files.wordpress.com/2018/11/gsaclassmaps.png)
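
For the four-panel layout in exercise 7, a minimal matplotlib sketch is given below. It only sets up the grid and the sub-titles; the panel titles are suggestions, and the commented `plot` call indicates where each map would go (the column name there is hypothetical).

```python
import matplotlib.pyplot as plt

titles = ["Quantiles", "Equal interval", "Quantiles (log)", "Equal interval (log)"]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))   # axes is a 2 x 2 array of axes
for ax, title in zip(axes.flat, titles):
    # gdf.plot(column='SomeColumn', ax=ax, ...) would draw a map on this panel
    ax.set_title(title)
    ax.set_axis_off()   # hide the coordinate ticks, which rarely help on a map

fig.tight_layout()
print(axes.shape)
plt.close(fig)
```

Each element of `axes` can be passed to a plot call through `ax=`, in the same way as `ax1` in the layering examples above.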

## Borough-level Mapping

Finally, if you need a borough-level shapefile to make borough maps in your final report, one is available here: https://github.com/kingsgeocomp/geocomputation/blob/master/data/Boroughs.zip

In [None]:
src = 'https://github.com/kingsgeocomp/geocomputation/blob/master/data/Boroughs.zip?raw=true'
dst = 'shapes/Boroughs.zip'
zpd = 'shapes/'
download_gsa_geodata(src, dst, zpd) #re-use function from above

b_path = os.path.join('shapes','Boroughs','London_Borough_Excluding_MHW.shp')
b_gdf = gpd.read_file(b_path)

print(b_gdf.head())
b_gdf.plot(column="GSS_CODE")

#### More Fun!

You can find some more nice examples and applications from [GeoHackWeek](https://geohackweek.github.io/vector/04-geopandas-intro/)!

### Getting More Help/Applications

A great resource for more help and more examples is Dani Arribas-Bel's _Geographic Data Science_ module: he has put all of his [module practicals online](http://darribas.org/gds17/) (as we have too), and you might find that something that he does makes more sense to you than what we've done... check it out!

## Credits!

#### Contributors:
The following individuals have contributed to these teaching materials: Jon Reades (jonathan.reades@kcl.ac.uk), James Millington (james.millington@kcl.ac.uk)

#### License
These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

#### Acknowledgements:
Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

#### Potential Dependencies:
This notebook may depend on the following libraries: pandas, geopandas, matplotlib, requests, seaborn