# Python for Spatial Analysis
## Second part of the module of GG3209 Spatial Analysis with GIS.
### Notebook to learn and practice Data Visualisation Libraries.

---
Dr Fernando Benitez -  University of St Andrews - School of Geography and Sustainable Development - First Iteration 2023 v.1.0 

### Introduction 

After practicing **Pandas**, this notebook introduces different libraries for graphing and visualizing data using **Python**. Specifically, we will investigate **matplotlib**. We will then explore **Seaborn** and the plotting tools in **Pandas**, both of which are built on the matplotlib library.

### Content

* Prepare data for use in a graph.
* Make basic graphs using matplotlib, Seaborn, and Pandas.
* Plot images using matplotlib.
* Understand the differences between and how to use figures and axes or subplots.
* Refine graphs using matplotlib.
* Save graphics to vector or raster graphic files.





## matplotlib


As the name suggests, **matplotlib** is modelled on the graphing functions available in MATLAB.

It allows for the generation of a wide variety of graph types and data visualizations. Further, graphs can be edited, customized, and saved using Python code. 

Here is the link for the [documentation](https://matplotlib.org/stable/index.html) in case you need more information.

This library has been already included in our **py4sa.yml** environment file, so you have already installed it into your MiniConda environment. 

To make graphs, you will work with the **pyplot module** specifically, so it is common to import that module and assign it an alias. To have the graphs display inside a Jupyter Notebook, you will need to include "%matplotlib inline" in your code. You can also change default parameters, such as the figure size.

## Datasets

To provide examples of a wide variety of graph types, we will use the datasets available in the data folder. If you need details about the source of the data, take a look at the **metadata.md** file included in that folder.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # alias is plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 8] #Change the default plot size.

In [None]:
# The Scottish Index of Multiple Deprivation (SIMD) 2020 
simd = pd.read_csv("data/SG_SIMD_2020.csv", header=0)
simd.columns =[column.replace(" ", "_") for column in simd.columns]  # Removing any spaces in the column names using list comprehension.
simd.tail(5)

In [None]:
simd.columns

In [None]:
# House sales prices in Scotland 
houses_prices = pd.read_csv("data/house-sales-prices.csv", header=0)
houses_prices.columns =[column.replace(" ", "_") for column in houses_prices.columns] # Removing any spaces in the column names using list comprehension.
houses_prices.head(5)

In [None]:
# Runoff data per month
runoff = pd.read_csv("data/runoff_data_by_month.csv", header=0)
runoff.columns =[column.replace(" ", "_") for column in runoff.columns] # Removing any spaces in the column names using list comprehension.
runoff.head(10)

In [None]:
# Latest Earthquakes in Feb 2023
earthquakes = pd.read_csv("data/Latest_earthquake_world.csv", header=0)
earthquakes.columns =[column.replace(" ", "_") for column in earthquakes.columns] # Removing any spaces in the column names using list comprehension.
earthquakes.head(5)

## Basic Graphs


Before you can create complex and well-refined graphs, we need to know the basics. Here we can see how to generate simple graphs using the basic matplotlib syntax.

In the first example below I have produced a basic scatter plot to visualize the relationship between earthquake magnitude and depth as an example of a bivariate graph. The first argument, magnitude, is plotted on the x-axis and the second argument, depth, is plotted on the y-axis. Here, I used dot notation; however, bracket notation is also acceptable. The graph is saved to a variable, then the show() function from the pyplot module is used to display it. show() does not need the graph as an argument, because it displays every open figure by default.

Although this graph is adequate to simply visualize the data and the relationship between the two variables, it is still a bit rough to include in a presentation or report. We will cover refining the output later in this module.

In [None]:
sp1 = plt.scatter(earthquakes.mag, earthquakes.depth)
# Equivalent call using bracket notation:
# plt.scatter(earthquakes["mag"], earthquakes["depth"])
plt.show()

Histograms are another useful chart for exploratory analysis of the data we have loaded. In the following example you are creating a histogram of the magnitude attribute of the earthquakes dataset. Since a histogram is a univariate graph, only one variable has to be provided.

In [None]:
hist1 = plt.hist(earthquakes['mag'])
plt.show()  # show() displays the open figure; it does not take the plot as an argument

Histograms accept an additional, optional bins parameter to specify the number of data bins to include.
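
As a minimal sketch of the bins parameter, the example below draws the same values twice: once with an integer number of bins and once with explicit bin edges. The data are synthetic and purely illustrative (they stand in for a column such as the earthquake magnitudes), and the variable names are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, illustrative values (not the course data)
rng = np.random.default_rng(42)
values = rng.normal(loc=4.5, scale=1.0, size=500)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# An integer asks for that many equal-width bins across the data range
counts_a, edges_a, _ = ax_left.hist(values, bins=12)
ax_left.set_title("bins=12")

# A sequence sets the bin edges explicitly (here every 0.5 units from 0 to 10)
custom_edges = np.arange(0, 10.5, 0.5)
counts_b, edges_b, _ = ax_right.hist(values, bins=custom_edges)
ax_right.set_title("explicit edges every 0.5")

# hist() returns the counts per bin and the bin edges, which is handy for checking
print(len(counts_a), len(edges_a))
```

Note that a histogram with N bins always has N + 1 edges, and values outside explicit edges are simply not counted.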

In the following example we can learn about the subplots or axes in matplotlib. 

A figure represents the entire space in which graphs are generated. Figures are further divided into **subplots or axes**, which allow you to place multiple plots in the same graph space.

There are multiple ways to implement this. In the following example, I am creating a figure called fig1 that contains two rows and two columns of subplots. Each subplot is referenced by its row and column index in axs, the array of axes returned by plt.subplots().

So, in this example, you have placed a separate graph, each using a different number of bins, in each of the four available row/column combinations.

If you do not define multiple subplots, a plot will take up the entire graph or figure space by default.

In [None]:
fig1, axs = plt.subplots(2,2)
axs[0,0].hist(earthquakes.mag, bins=5)
axs[1,0].hist(earthquakes.mag, bins=10)
axs[0, 1].hist(earthquakes.mag, bins=15)
axs[1, 1].hist(earthquakes.mag, bins=20)
plt.show()  # show() displays the open figure; it does not take the figure as an argument

The example below shows how to generate a **boxplot** to visualize the distribution of a continuous variable, in this case house prices in 2015.

In [None]:
bp1 = plt.boxplot(x=houses_prices['2015'])
plt.show()  # show() displays the open figure; it does not take the plot as an argument

Now, let's combine a set of different graphs as a single figure using multiple **subplots**. 

Make sure you understand how indexes are used to reference each subplot within the figure. The plots here are just examples, so you can play with them, bring in other variables, or try increasing the number of plots in the figure.

In [None]:
plt.rcParams['figure.figsize'] = [20, 12] #Change the default plot size.
fig1, axs = plt.subplots(3,3)
axs[0,0].boxplot(simd.Quintilev2)
axs[0,1].boxplot(simd.EmpNumDep)
axs[0,2].boxplot(simd.IncNumDep)
axs[1,0].boxplot(simd.EduRank)
axs[1,1].scatter(simd.GAccPTGP, simd.HlthRank)
axs[1,2].scatter(simd.GAccPTGP, simd.IncNumDep)
axs[2,0].hist(simd.CrimeCount)
axs[2,1].hist(simd.HouseNumOC)
axs[2,2].hist(simd.HouseNumNC)
plt.show()  # show() displays the open figure; it does not take the figure as an argument
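
When a figure holds many subplots, indexing each one by hand becomes repetitive. One possible alternative, sketched below under the assumption that every panel uses the same plot type, is to loop over `axs.flat`, which walks the array of axes row by row. The column names and values are hypothetical and synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical variables standing in for columns of a DataFrame
columns = {
    "var_a": rng.normal(size=200),
    "var_b": rng.exponential(size=200),
    "var_c": rng.uniform(size=200),
    "var_d": rng.normal(loc=5, size=200),
}

fig, axs = plt.subplots(2, 2, figsize=(8, 6))

# axs is a 2-D NumPy array of Axes; axs.flat visits [0,0], [0,1], [1,0], [1,1]
for ax, (name, data) in zip(axs.flat, columns.items()):
    ax.hist(data, bins=15)
    ax.set_title(name)

fig.tight_layout()  # reduce overlap between titles and tick labels
```

After the loop, `axs[1, 0]` still refers to the third panel, so individual subplots can be refined afterwards using the usual row and column indexes.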

You should now have a basic understanding of how to generate simple plots using matplotlib and how to include subplots within a figure. 

There are far too many customization options to cover in detail, and frankly it does not make sense to learn every single command. What matters initially is getting familiar with the library and learning the basics needed to plot a dataset. Once you understand that, you can consult the [matplotlib documentation](https://matplotlib.org/) whenever you need a more customized plot or want to investigate a specific option.

There are some rules and best practices to keep in mind when designing a figure:

1. Know your audience.
2. Identify the message your plot needs to convey.
3. Adapt the figure; do not trust the defaults.
4. Captions are not optional.
5. Use color appropriately.
6. Do not mislead the reader; describe what the data says.
7. Avoid "chartjunk"; keep only the elements that support your story.
8. Remember that creating a good plot is not always easy.
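
A few of the practices above (not relying on the defaults, labelling what is shown) might look like the following in code. This is only a sketch: the data are synthetic, and the labels, colors, and sizes are illustrative choices rather than recommended settings.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic values; replace them with columns from your own DataFrame
magnitude = rng.uniform(2.5, 7.0, size=150)
depth_km = rng.gamma(shape=2.0, scale=20.0, size=150)

fig, ax = plt.subplots(figsize=(7, 5))

# Semi-transparent markers make overlapping points easier to read
ax.scatter(magnitude, depth_km, s=20, color="tab:blue", alpha=0.6)

# Axis labels with units and a descriptive title
ax.set_xlabel("Magnitude")
ax.set_ylabel("Depth (km)")
ax.set_title("Earthquake depth against magnitude (synthetic data)")

# A light grid helps reading values without dominating the plot
ax.grid(True, linewidth=0.5, alpha=0.5)
```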



## Some useful cheat-sheets 

You do not need to learn all the syntax of the library. Once you understand how the library works, you can use the following cheat sheets as guidance to create any type of plot you need for your project or code.

Source: https://matplotlib.org/cheatsheets/

![image.png](attachment:image.png)

## Seaborn

Seaborn is a Python library built on **matplotlib** that simplifies the generation of a variety of graph types and data visualizations.

It is also an easier way to interact with matplotlib. Here is a link to the Seaborn [documentation](https://seaborn.pydata.org/). This library has been already integrated in our python environment, so you just need to import it.

In [None]:
import seaborn as sns

Now run the following cell, but also take some time to change the attributes/parameters and create different plots. In the following example, we are plotting depth against magnitude.

In [None]:
sns.relplot(x="depth", y="mag", data=earthquakes)


It is possible to map variables to graphical parameters other than the position along the x-axis and y-axis. In this example, we map magSource to the point color as a qualitative or categorical variable using the hue parameter. Seaborn automatically chooses unordered colors, or a qualitative color scheme. If you want, you can customize that with additional parameters.

In [None]:
sns.relplot(x="depth", y="mag", hue="magSource", data=earthquakes)


Additionally, you could also define a point symbol or style as opposed to the color or hue. Mapping to the point symbol would not make sense for a continuous variable, since there is no implied order for symbols. However, a continuous variable could be mapped to the symbol color or size.

In [None]:
sns.relplot(x="depth", y="mag", style="status", data=earthquakes)
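
The last point above, mapping a continuous variable to color or size, can be sketched as follows. The DataFrame is synthetic and only mimics the depth and mag columns; the palette name and the size range are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
# Synthetic stand-in for the earthquakes DataFrame
df = pd.DataFrame({
    "depth": rng.gamma(2.0, 20.0, size=100),
    "mag": rng.uniform(2.5, 7.0, size=100),
})

fig, ax = plt.subplots(figsize=(7, 5))

# A numeric hue gets a sequential palette; size scales the markers between 20 and 200
sns.scatterplot(
    data=df, x="depth", y="mag",
    hue="mag", size="mag",
    palette="viridis", sizes=(20, 200),
    ax=ax,
)
ax.set_title("Magnitude mapped to both color and size")
```

Because magnitude has a natural order, a sequential palette and increasing marker size both communicate that order, which a set of unordered symbols could not.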

To refine or improve Seaborn plots you can use methods made available by Seaborn and/or methods made available by matplotlib, since Seaborn is built on top of matplotlib. However, there are some differences that are outside the scope of this course, which we will cover in a future module.

**Just for your curiosity**:  One complexity is that some Seaborn graphs are produced as a FacetGrid object and cannot be placed into a subplot within a figure. **relplot()** is an example. Other plot methods generate the data at an axes or subplot level, such as **scatterplot()**.

In [None]:
fig, axs = plt.subplots(1, 1)
sns.scatterplot(ax=axs, x="mag", y="depth", hue="magSource", data=earthquakes)
axs.set_title("Magnitude vs. Depth", fontsize=20, color="#000000")
handles, labels = axs.get_legend_handles_labels()
# Pass every handle: slicing with [1:] would drop the first category in seaborn >= 0.11
axs.legend(handles=handles, labels=labels, title="Mag Source", title_fontsize=12) # Optional
axs.grid(True)
fig.patch.set_facecolor('#adb7c7')
plt.show()

Next, I will describe how you can create the most popular plots using Seaborn:
- Line Plot
- Histograms
- Bar Plot
- Box Plot
- Heatmap
- Pair Plot
- Violin Plot
- Swarm Plot
- Joint Plot

Using datasets bundled with the library, we can see how to create each of these types of graph.

In [None]:
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# Load example data from the seaborn library
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

The iris dataset contains five columns: sepal_length, sepal_width, petal_length, petal_width, and species. Iris is a flowering plant; researchers measured these features on flowers of different iris species and recorded them digitally.

In [None]:
tips
# iris  # You can also try displaying the iris dataset

A line plot is a graph that displays data points connected by straight lines.
It is useful for showing trends over time or comparing changes in different groups.
Seaborn's lineplot() function can be used to create line plots, with optional customization of colors, markers, and line styles.

In [None]:
plt.figure(figsize=(4, 3))
sns.lineplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Line Plot')
plt.show()

As we saw in the previous section, a histogram is a graphical representation of the distribution of a dataset.
It divides the range of values into a set of intervals, and then counts the number of values that fall into each interval.

In [None]:
plt.figure(figsize=(4, 3))
sns.histplot(data=tips, x='total_bill', bins=20)
plt.title('Histogram')
plt.show()

A bar plot is a graph that displays categorical data with rectangular bars. It is useful for comparing the values of different categories.

In [None]:
plt.figure(figsize=(4, 3))
sns.barplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Bar Plot')
plt.show()

A boxplot is a graph that displays the distribution of a dataset using five summary statistics:

- minimum
- first quartile
- median
- third quartile
- maximum

It is useful for identifying outliers and comparing the distributions of different groups.

In [None]:
plt.figure(figsize=(4, 3))
sns.boxplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Box Plot')
plt.show()
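
The five summary statistics listed above can be computed directly with NumPy, which helps when checking what a boxplot is showing. The values below are made up for illustration. One caveat: by default matplotlib (and Seaborn) draw the whiskers to the most extreme points within 1.5 times the interquartile range of the box, and plot anything beyond that as individual outlier points, so the whiskers do not always reach the minimum and maximum.

```python
import numpy as np

# Illustrative values with one unusually large observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

minimum = values.min()
q1, median, q3 = np.percentile(values, [25, 50, 75])
maximum = values.max()

# Interquartile range and the default 1.5 * IQR limits used for the whiskers
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are the ones drawn as individual outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```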

A heatmap is a graphical representation of a matrix of values, where the values are represented as
colors. It is useful for visualizing patterns in large datasets.

In [None]:
plt.figure(figsize=(4, 3))
# Drop the text column first: corr() only works on numeric columns
sns.heatmap(data=iris.drop(columns='species').corr(), annot=True)
plt.title('Heatmap')
plt.show()

A pair plot is a graph that displays pairwise relationships between variables in a dataset. It is
useful for exploring the correlations between different variables.

In [None]:
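# pairplot() is a figure-level function: it builds its own figure and returns a PairGrid
g = sns.pairplot(data=iris, hue='species', markers=['o', 's', 'D'])
g.fig.suptitle('Pair Plot', y=1.02)  # title for the whole grid, placed just above it
plt.show()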

A violin plot combines a box plot with a kernel density plot. Alongside the box-plot summary, it
shows the estimated probability density function of the data, which reveals the shape of the
distribution.

In [None]:
plt.figure(figsize=(4, 3))
sns.violinplot(x="day", y="total_bill", data=tips)
plt.title('Violin Plot')
plt.show()

A swarm plot displays every observation as a point along a categorical axis, shifting the points sideways so that they do not overlap. It is useful for showing the density of data points and highlighting potential outliers.

In [None]:
plt.figure(figsize=(4, 3))
sns.swarmplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Swarm Plot')
plt.show()

A joint plot is a graph that displays the relationship between two variables using a scatter
plot together with a histogram of each variable along the margins. It is useful for exploring the
correlation between two variables and identifying patterns in the data.

In [None]:
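# jointplot() is a figure-level function: it builds its own figure and returns a JointGrid
g = sns.jointplot(data=tips, x='total_bill', y='tip', kind='reg')
g.fig.suptitle('Joint Plot', y=1.02)  # title for the whole figure, placed just above it
plt.show()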

## Pandas 

Another option for generating graphs is to use the graphing functionality built into Pandas. Like Seaborn, it is based on matplotlib. We will not discuss graphing with Pandas in detail in this course, but you are expected to learn the basics.

The documentation for the [data visualization components of Pandas can be found here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

For example you can easily plot a histogram based on the attributes of your pandas dataframe.

In [None]:
earthquakes.mag.plot.hist(alpha=0.5,bins=25)

Pandas and matplotlib include some default styles. In the example below, I am using the "ggplot" style based on the "ggplot2" R package. Note that the default style can be restored using "default".

In [None]:
plt.style.use("ggplot")
earthquakes.mag.plot.kde()
plt.style.use('default')
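
If you only want a style for a single plot, one option is `plt.style.context()`, which applies the style inside a `with` block and restores the previous settings afterwards. The sketch below uses synthetic values and checks one default setting before and after the block.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
values = rng.normal(size=300)

# Remember one default setting so we can check it is restored afterwards
facecolor_before = plt.rcParams["axes.facecolor"]

# The ggplot style is active only inside the with-block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(values, bins=20)
    ax.set_title("Drawn with the ggplot style")

facecolor_after = plt.rcParams["axes.facecolor"]
print(facecolor_before == facecolor_after)
```

The names of the styles shipped with your matplotlib installation are listed in `plt.style.available`.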

This is an example of a scatterplot. Instead of calling a method relative to a specific variable, the x and y values are defined relative to variables stored in a DataFrame.

In [None]:
sample_simd = simd[(simd["LAName"]=="West Lothian")] #subsetting a small portion of our previously created dataframe
sample_simd.plot.scatter(x='IncNumDep', y='EmpNumDep')


This is an example of a boxplot generated using Pandas. The by argument defines the grouping variable (the SIMD quintile, Quintilev2, in this case).

In [None]:
sample_simd.boxplot(column='CrimeRate', by='Quintilev2')

Similar to Seaborn, Pandas plots can be customized and edited using matplotlib.
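
As a small sketch of that idea: the Pandas plotting methods return a matplotlib Axes object, so the usual matplotlib methods can be applied to the result. The DataFrame below is synthetic, and its column names are hypothetical stand-ins for the SIMD variables.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
# Synthetic stand-in for a subset of the SIMD data
df = pd.DataFrame({
    "income_deprived": rng.integers(0, 300, size=80),
    "employment_deprived": rng.integers(0, 150, size=80),
})

# The Pandas call returns the Axes it drew on
ax = df.plot.scatter(x="income_deprived", y="employment_deprived", figsize=(6, 4))

# From here on, this is ordinary matplotlib
ax.set_title("Employment against income deprivation (synthetic data)")
ax.set_xlabel("People income deprived")
ax.set_ylabel("People employment deprived")
ax.grid(True, alpha=0.3)
```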

## Export Graphics


Once a graph is produced using matplotlib, Seaborn, and/or Pandas, it can be exported using the savefig() method from matplotlib. You may find that you need to do additional editing to further refine a graph. This can best be accomplished using a vector graphics editing software, such as Adobe Illustrator or the free and open-source Inkscape software. PDF files can also be imported into vector graphics software.

There are a variety of export options. I recommend reading through the documentation for [savefig()](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html).




In [None]:
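fig, ax = plt.subplots()  # savefig() writes the current figure, so build it in the same cell
ax.hist(earthquakes['mag'], bins=25)
fig.savefig("data/image_svg.svg", dpi=300, format="svg")
fig.savefig("data/image_png.png", dpi=300, format="png")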
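
The sketch below shows one way to export the same figure to both a vector file and a raster file. It draws a synthetic histogram and writes into a temporary folder so that it can run anywhere; the folder and file names are assumptions for this example, and in the notebook you would normally use a path such as the data folder.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(rng.normal(size=200), bins=20)
ax.set_title("Figure to export")

# Hypothetical output location: a temporary folder created for this example
out_dir = tempfile.mkdtemp()
svg_path = os.path.join(out_dir, "figure.svg")
png_path = os.path.join(out_dir, "figure.png")

# Vector output: the format is inferred from the file extension
fig.savefig(svg_path, bbox_inches="tight")

# Raster output: dpi controls the resolution of the image
fig.savefig(png_path, dpi=300, bbox_inches="tight")

plt.close(fig)  # free the figure once it has been written
print(os.path.exists(svg_path), os.path.exists(png_path))
```

Calling savefig() on the figure object, in the same cell that builds the figure, avoids accidentally saving an empty figure after the notebook has already displayed and closed the plot. The `bbox_inches="tight"` option trims surplus white space around the figure.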