# Mapping and Exploring Spatial Data
**NCG613: Data Analytics Project - Practical 6**

This notebook focuses on the concepts of data exploration and geovisualization, mostly in the context of choropleth maps for areal data, which we'll explore in ```geopandas``` and ```PySAL``` (using ```mapclassify```) using the London boroughs dataset on social/demographic data, and a sample of the London house price dataset on housing prices and characteristics. We'll also use a few other packages - like ```seaborn``` and ```pandas``` - to help us parse and classify the data.  We'll focus on two primary aspects of geovisualization: __quantitative data classification__ and __colour theory__ for choropleth maps, and then we'll use these techniques to explore some important __considerations for areal data__.

*Note: Setup and installation instructions are in the repository README.*

First, start by installing any new packages by typing ```conda install [package name]``` in the command line, or uncommenting and running the cell below.

In [None]:
# !conda install pysal
# !conda install contextily
# !conda install geopandas

Next, we load the required packages, setting ```matplotlib``` to plot figures inline.

In [None]:
# Import required packages
import matplotlib as mpl
from matplotlib import colors

%matplotlib inline
mpl.rcParams['figure.figsize'] = (15, 10) #this sets the default inline figure size to 15 wide x 10 tall (in inches)

import seaborn
import pandas as pd
import geopandas as gpd
import pysal
import numpy as np
import mapclassify
import matplotlib.pyplot as plt
import contextily as cx
import plotly.express as px

import warnings
warnings.filterwarnings('ignore') # Change settings so that warnings are not displayed

Now we load the __boroughs__ and their social/demographic data __profiles__, as well as the sample of 2023 __house price__ points (make sure you have these saved in the ```../data/``` folder relative to your notebook, or adjust the paths below). We merge the profiles and boroughs into one GeoDataFrame object.

In [None]:
# Load profiles and boroughs geojson and merge
profiles = pd.read_csv('../data/lb_profiles.csv')
lb = gpd.read_file('../data/london_boroughs.geojson')
lb_map = lb.merge(profiles)

# Load housing price data
hp = pd.read_csv('../data/hpdemo.csv')
hp2 = gpd.GeoDataFrame(
    hp, geometry=gpd.points_from_xy(x=hp.xcoord, y=hp.ycoord)
)
hp2.crs = lb.crs

np.random.seed(12)
hp2 = hp2.sample(n=2000) # Take a random sub-sample of the housing points

Take a glance at the columns in both files:

In [None]:
list(hp2.columns)

In [None]:
list(lb_map.columns)

To provide a slightly better understanding of property price behaviour, we'll add some general information from the [2017 English Housing Survey](https://maynoothuniversity.sharepoint.com/:b:/s/ncg613adataanalyticsproject202425semester2moodle/EZtXK2lgVfBCiYkD6Q5V8zABKwzxSyQe9dpAFg_FqbmloA?e=3WCkHh) which provides some basic data on the average floor area for different types of housing (Figure 3.2). 

In [None]:
hp2['fl_area'] = np.where(hp2['type']== 'T', ((81+86)/2), 
                          np.where(hp2['type']== 'D', 152,
                                  np.where(hp2['type']== 'S', 93,
                                           np.where(hp2['type']== 'F', ((65+55)/2), 
                                                    ((86+81+93+152+77+65+55)/7)))))

In [None]:
hp2
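
The nested ```np.where``` calls above amount to a lookup from property-type code to an assumed floor area, with a fallback value for any other code. As a minimal sketch of that logic in plain Python: the names ```FLOOR_AREA_BY_TYPE```, ```DEFAULT_FLOOR_AREA``` and ```floor_area``` are illustrative only (they are not part of the practical), the code 'B' is a made-up example of an unlisted type, and the figures are simply those used in the cell above.

```python
# Floor areas (square metres) copied from the np.where cell above.
FLOOR_AREA_BY_TYPE = {
    'T': (81 + 86) / 2,  # mean of the two figures used for type 'T'
    'D': 152,
    'S': 93,
    'F': (65 + 55) / 2,  # mean of the two figures used for type 'F'
}

# Fallback used for any other type code: the mean of all seven figures.
DEFAULT_FLOOR_AREA = (86 + 81 + 93 + 152 + 77 + 65 + 55) / 7


def floor_area(type_code):
    """Return the assumed floor area for a property-type code."""
    return FLOOR_AREA_BY_TYPE.get(type_code, DEFAULT_FLOOR_AREA)


# 'B' is a hypothetical code that is not in the lookup, so it gets the fallback.
sample_types = ['T', 'D', 'S', 'F', 'B']
print([floor_area(t) for t in sample_types])
```

With ```pandas```, the same idea could probably be written as ```hp2['type'].map(FLOOR_AREA_BY_TYPE).fillna(DEFAULT_FLOOR_AREA)```, which is easier to extend than nested conditions.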

Next, we'll employ a spatial join to aggregate the house points to the boroughs in a number of ways - mean price, median price, summed price, summed floor area (to create an average price per unit of floor area), and count of house points in each borough. We will analyse the distribution of these variables - and their relationship to some of the other borough characteristics - later on in the practical.

In [None]:
hp3 = gpd.sjoin(hp2,lb_map)
hp4 = hp3.groupby('NAME').agg({'price':['mean','median','sum'],'fl_area':'sum','ID':'count'})
hp4.columns = ['Mean_Price','Median_Price','Sum_Price','Sum_Floor_Area','House_Count']
hp4 = hp4.reset_index()
hp4['Ave_Area_Price'] = hp4['Sum_Price']/hp4['Sum_Floor_Area']

Now we merge the spatially-joined file (hp4) to the boroughs dataset and calculate a housing density based on the counts of house price points in the sample. We multiply the value of hectares by .01 to obtain a final measure in terms of units per km$^{2}$ (this has the added advantage of producing a value in a more comprehensible numerical range). We will start the practical by analysing the spatial distribution of house point density, and also use it as a test case for exploring standard issues with using aggregated areal data.

In [None]:
lb2 = lb_map.merge(hp4)
lb2['House_Dens'] = lb2['House_Count']/(lb2['HECTARES']*.01)

## Quantitative Data Classification
Choropleth maps rely on changing colour *shade* (brightness/darkness) and *hue* (colour) in order to differentiate quantitative data values - each colour shade or hue is assigned to a bin of values, much like a histogram, except that in this case the histogram is mapped and coloured. Before comparing choropleth map classification schemes, it is useful to look at the underlying distribution of the variable of interest using a histogram. In this case, let's look at the 'House_Dens' variable that we constructed from aggregating the count of the house points by borough:

In [None]:
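h = seaborn.histplot(lb2['House_Dens'], bins=5, kde=True, stat='density') # histplot replaces distplot, which is deprecated in recent seaborn releases
seaborn.rugplot(lb2['House_Dens'], ax=h) # Add the rug of individual observations along the x-axis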

As you can see, we have a somewhat *right-skewed* distribution, which means that the mean is greater than the median (i.e., a relatively small number of observations have very high values of house density, pulling the mean to the right). This will influence the way that different classification schemes reflect the underlying distribution of the data. Let's go through four of the most common classification methods: __equal interval__, __quantile__, __mean-standard deviation__, and __Fisher-Jenks__.
### Equal Interval
The basic concept of the equal interval scheme is that each bin contains an equal width ($w$) of the attribute value for a specified number of bins ($k$). In ```mapclassify``` we can specify our preferred $k$ value, and the ```EqualInterval``` function automatically divides the variable of interest (in this case, 'House_Dens') into $k$ bins.

In [None]:
ei5 = mapclassify.EqualInterval(lb2.House_Dens, k=5)
ei5

As you can see, each of the bins has the same width of $w$ = 0.8. This value of $k$ = 5 also corresponds directly to the default histogram displayed above. Let's take a look at what this looks like mapped using a simple single-hue 'Blues' colour palette (we'll discuss colour palettes in more detail below).

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='Blues', scheme='EqualInterval', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()
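
To make the equal interval rule concrete, here is a minimal sketch that computes the breaks by hand for a small made-up list of values. It uses only the standard library, treats each break as an inclusive upper bound (an assumption about the binning convention), and the helper names ```equal_interval_breaks``` and ```bin_counts``` are hypothetical rather than part of ```mapclassify```.

```python
from bisect import bisect_left


def equal_interval_breaks(values, k):
    """Upper bounds of k equal-width bins spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    breaks = [lo + width * i for i in range(1, k + 1)]
    breaks[-1] = hi  # guard against floating-point drift on the final break
    return breaks


def bin_counts(values, breaks):
    """Count values per bin, treating each break as an inclusive upper bound."""
    counts = [0] * len(breaks)
    for v in values:
        counts[bisect_left(breaks, v)] += 1
    return counts


# Made-up, right-skewed toy data: three small values and one large one.
toy = [0, 1, 2, 10]
toy_breaks = equal_interval_breaks(toy, k=5)
toy_counts = bin_counts(toy, toy_breaks)
print(toy_breaks)
print(toy_counts)
```

Every bin is 2 units wide, but three of the five bins are empty because the single large value stretches the range, which is the sparse-class behaviour that skewed data tends to produce under this scheme.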

### Quantiles
While equal interval schemes are straightforward to interpret, they can suffer from the problem of sparse classes, as we see in the fourth bin above, which only contains two observations. An alternative approach is to create bins of equal *numbers of observations* ($n$), rather than width, by dividing $n$ by $k$ and placing the breaks sequentially from the minimum to maximum value. Thus, for $k$ = 5, the first bin contains the smallest 20% of data values, while the last bin contains the largest 20% of data values, etc.

In [None]:
q5 = mapclassify.Quantiles(lb2.House_Dens, k=5)
q5

As you can see, the widths of the bins now vary to accommodate equal slices of data, which can pose its own problems with interpretation (in this case, the last bin encompasses a range that's almost 6x as large as the first bin).

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='Blues', scheme='Quantiles', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

Another issue to be aware of is that, when there are a large number of duplicate data values, classes can become hard to define and thus solely dependent on how the underlying function breaks ties. This means that some observations with the *same value* could be split between different bins in order to maintain the equal $n$ in each bin, which produces a misleading choropleth map. In the case of the default ```mapclassify```, ill-defined quantiles are produced without equal values of $n$ in each bin.

In [None]:
np.random.seed(12345)
x = np.random.randint(1,10,33)
x[0:10] = x.min()
x

In [None]:
a = pd.Series(x)
a = a.rename('Quantile')
lb3 = lb2.merge(a, left_index=True, right_index=True)

In [None]:
q5_2 = mapclassify.Quantiles(lb3.Quantile, k=5)
q5_2
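
The effect of ties can also be reproduced on a tiny made-up example. The sketch below uses ```statistics.quantiles``` from the standard library with the 'inclusive' interpolation rule, which is an assumption and need not match how ```mapclassify``` places or de-duplicates its breaks; the variable names are illustrative.

```python
from bisect import bisect_left
from statistics import quantiles

# Made-up data: six tied values at the minimum, then four distinct values.
tied = [1] * 6 + [2, 3, 4, 5]

# Four interior cut points for k = 5 classes, plus the maximum as the last break.
tie_cuts = quantiles(tied, n=5, method='inclusive')
tie_breaks = tie_cuts + [max(tied)]

# Count observations per class, treating each break as an inclusive upper bound.
tie_counts = [0] * len(tie_breaks)
for v in tied:
    tie_counts[bisect_left(tie_breaks, v)] += 1

print(tie_breaks)
print(tie_counts)
```

Because the first two cut points coincide, one class swallows all six tied observations while two classes end up empty, so the "equal $n$ per bin" property of quantiles is lost.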

### Mean-Standard Deviation
Quantile schemes produce visualizations that suggest an even distribution of values (since each bin contains an equal number of observations, $n$). Often in spatial analysis, however, we are interested in better understanding the location and context of *outliers*, which can be done using the mean-standard deviation classifier. This scheme defines class boundaries in terms of multiples of the attribute's standard deviation above and below the attribute mean. For the default $k$ = 5, the common definition places the breaks at one and two standard deviations either side of the mean: the middle bin straddles the mean and holds values within one standard deviation of it, the second and fourth bins hold values between one and two standard deviations below and above the mean (respectively), and the lower and upper bins hold any values more than two standard deviations below or above the mean (respectively).

In [None]:
msd = mapclassify.StdMean(lb2.House_Dens)
msd

As you can see, this classifier is best used when the data are relatively normally-distributed or when the attribute mean is a meaningful value of interest in its own right. The right skew in the house point density data lumps the vast majority of the observations into the middle bin. Given that, the scheme is still somewhat useful for picking out the upper and lower outliers (even compared to other observations that would have been grouped together using the quantile scheme).

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='Blues', scheme='StdMean', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

### Fisher-Jenks
The final classification scheme we'll look at is Fisher-Jenks, which is representative of a number of similar schemes that use a heuristic approach to optimize the breaks between bins by attempting to minimize the sum of absolute deviations around class medians (ADCM), i.e., to increase within-group similarity as much as possible. From [Rey et al. (2020)](https://geographicdata.science/book/notebooks/05_choropleth.html#principles):
> The approach begins with a prespecified number of classes and an arbitrary initial set of class breaks - for example using quintiles. The algorithm attempts to improve the objective function by considering the movement of observations between adjacent classes. For example, the largest value in the lowest quintile would be considered for movement into the second quintile, while the lowest value in the second quintile would be considered for a possible move into the first quintile. The candidate move resulting in the largest reduction in the objective function would be made, and the process continues until no other improving moves are possible.

Fisher-Jenks uses a dynamic programming approach that is guaranteed to produce an optimal classification for a prespecified number of classes ($k$) and is generally a good middle-ground approach (as it is data-driven) when you don't have any prior assumptions or inclinations about the pattern of attribute values in the dataset.

In [None]:
np.random.seed(12345) #Seed kept for reproducibility; FisherJenks itself is a deterministic dynamic programming routine, so only sampled variants depend on random draws
fj5 = mapclassify.FisherJenks(lb2.House_Dens, k=5)
fj5

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='Blues', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

### Comparing Classification Schemes
We can explicitly compare the performance of different classification schemes in terms of ADCM for a specified value of $k$ (in this case, 5).

In [None]:
class5 = ei5, q5, msd, fj5
fits = np.array([ c.adcm for c in class5])
data = pd.DataFrame(fits)
data['classifier'] = [c.name for c in class5]
data.columns = ['ADCM', 'Classifier']
ax = seaborn.barplot(y='Classifier', x='ADCM', data=data) #Creates bar graph

In this case we can see that Fisher-Jenks outperforms other methods in terms of minimising ADCM (which is encouraging, since it was designed explicitly to minimize this quantity). However, as discussed above, the objective of using a classification scheme may not always be to produce the minimal ADCM plot - as in the case of the mean-standard deviation scheme, we may explicitly be interested in identifying outliers in the data.  
We can also compare classification results by looking at the number of observations assigned to each bin (with $k$ = 5) for each of the different schemes side-by-side.

In [None]:
pd.DataFrame({c.name: c.counts for c in class5},
                 index=['Class-{}'.format(i) for i in range(5)])
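
To see what the ADCM fit measure is actually summing, here is a minimal hand-rolled version applied to two alternative groupings of the same made-up values. It takes ADCM to be the sum of absolute deviations around each class median, which is how ```mapclassify``` describes its ```adcm``` attribute as far as I am aware; the function below is a sketch, not the library's implementation.

```python
from statistics import median


def adcm(values, labels):
    """Sum of absolute deviations of each value from the median of its class."""
    classes = {}
    for v, lab in zip(values, labels):
        classes.setdefault(lab, []).append(v)
    total = 0
    for members in classes.values():
        med = median(members)
        total += sum(abs(v - med) for v in members)
    return total


# Made-up values with an obvious gap between 3 and 10.
toy = [1, 2, 3, 10, 11, 12]
natural = [0, 0, 0, 1, 1, 1]  # break placed in the gap
awkward = [0, 0, 1, 1, 1, 1]  # break placed inside the low cluster

print(adcm(toy, natural))
print(adcm(toy, awkward))
```

The grouping that respects the gap in the data has the smaller ADCM, which is the sense in which a lower value indicates more internally homogeneous classes.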

## Colour Theory
As mentioned above, choropleth maps rely on changing colour *shade* (brightness/darkness) and *hue* (colour) in order to differentiate quantitative data values. They make an intuitive connection between changes in shade/hue and changes in data values, which makes them extremely useful as data visualizations. However, before choosing a colour palette to display your data, it's important to understand the limitations to this connection. First, as we discussed above, different classification schemes will produce (sometimes vastly) different patterns of colour for the same underlying data values. Classification schemes often compress the apparent distribution of values, so that the largest and smallest outliers appear smaller and larger (respectively) than they actually are, because they are grouped in the same colour as neighbouring, but potentially much less extreme, values. This is a particular problem for right-skewed attributes common in urban and economic analysis. Second, the size, shape, and arrangement of areal units is rarely uniform, which influences the implicit connection in the viewer's mind between value and colour - a very large areal unit dominates the visual field and likely skews the perception of the distribution of data values by presenting a mass of one colour (even though it only counts as one observation). Third, extremely small changes in shade are difficult for the human brain to perceive, which can pose difficulties for single-hue colour palettes with a large number of bins. Fourth, colours on a map have implicit subjective associations (e.g., red = "bad" or "negative", while blue = "good" or "positive"); these implicit associations may differ by discipline or use case.

Overall, the most important point to remember is that the purpose of a choropleth map is to match the *numeric* distance between classes using the *visual* "distance" between colour hue/shade attached to each class *in the mind of the viewer*. This is ultimately a somewhat subjective process that involves iteration and an intuitive feel for the distribution of colours on a map. The default palette in ```geopandas``` (and ```matplotlib```) is 'viridis', a multi-hue sequential scheme that moves from dark blue to light yellow. Viridis was designed scientifically to cross the largest range of perceptible colour values in order to maximize visual clarity between class breaks (while still retaining some intuitive sense of "direction").

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

### Sequential Colour Schemes
Data attributes that have a sequential nature - e.g., starting at 0 and progressing numerically - should generally be displayed using a sequential colour scheme. Viridis is an example of a multi-hue sequential colour scheme, because the colours move through hues from a "darker" hue/value (blue) to a "lighter" hue/value (yellow). Sometimes, we may prefer a single-hue sequential scheme that moves through *shades* of one colour from a "lighter" shade/value (white) to a "darker" shade/value of a single hue (e.g., red). The [documentation](https://matplotlib.org/2.0.2/users/colormaps.html) for ```matplotlib``` contains a very instructive description and display of relevant sequential colour palettes that can be used. ![image.png](attachment:image.png)

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='Reds', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

### Diverging Colour Schemes
Sometimes your attribute value has a natural mid-point around which it is important to highlight positive and negative movement, e.g., 0, a national average, or the value in a previous year. In these cases, it is appropriate to use a diverging colour scheme that uses two different sequential palettes for values above and below the attribute's mid-point. Often a red-blue diverging palette is used to indicate negative values (reds) and positive values (blues), with values near zero appearing in white, but there are a variety of choices available. Diverging schemes are generally well-paired with mean-standard deviation classification schemes, because they are inherently designed to split values in terms of standard deviations above or below the attribute mean.

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='RdYlBu', scheme='StdMean', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

### Qualitative Colour Schemes
Qualitative or categorical variables (such as names, regions, or other types of classifications) can also be displayed on maps. In this case, the objective is to explicitly __not__ make any association between colour and data value movement, because categorical attributes have no inherent numeric properties. At the same time, of course, you want to choose a colour palette that is visually appealing. In the London boroughs dataset, the 'ONS_INNER' variable is a categorical marker delineating boroughs that are considered a part of "Inner London" vs. "Outer London":

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='ONS_INNER', legend=False, cmap='Paired', edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('Inner London') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

## Considerations for Areal Data
Now that we have the basics of choropleth map-making down, we can turn to examining some of the most important underlying considerations for areal data that influence the veracity of analyses and visualizations created using aggregated areal data. Since we are using a variable that we created ourselves by aggregating house points to boroughs, we can play with the underlying properties in order to see how different aggregation choices - many of which are typically invisible to you as an end data user - change the resulting visual representation (and thus insights gained about the data and underlying process it represents).
### Standardization
Let's start with a fairly common situation - you've downloaded some data which is presented in the form of aggregated counts by areal unit - in this case, we'll use our aggregated 'House_Count' by London borough. If you display this variable in its raw form, without accounting for the underlying size of the borough, your result looks like this:

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Count', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Count (units per borough)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

We see significant concentration on the edges of the region, particularly in the south. This doesn't really square with our knowledge of the urban geography of the city, though, considering that central London shows some of the lowest values here, despite containing some of the city's densest neighbourhoods. In fact, you can see that this map isn't reflecting a representative depiction of house point concentration, but simply a map where the largest boroughs - which tend to be suburban due to lower population densities - are marked with the highest values.

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Count', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
hp2.plot(ax=ax, color='#AFFF7F', markersize=3, aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Count (units per borough)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

If we instead map house point density - controlling for the underlying size of each borough - we get a much more accurate picture of the underlying concentration of house points in the region.

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()
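
A two-borough toy example shows why the ranking can flip once area is taken into account. The borough names, counts and areas below are entirely made up, and ```density_per_km2``` is a hypothetical helper that applies the same hectares-to-km$^{2}$ conversion used for 'House_Dens' (100 hectares = 1 km$^{2}$).

```python
# Entirely made-up figures for two hypothetical boroughs.
boroughs = {
    'Outer (hypothetical)': {'house_count': 90, 'hectares': 9000},
    'Inner (hypothetical)': {'house_count': 40, 'hectares': 2000},
}


def density_per_km2(house_count, hectares):
    """House points per square kilometre (100 hectares = 1 square kilometre)."""
    area_km2 = hectares / 100
    return house_count / area_km2


densities = {name: density_per_km2(**row) for name, row in boroughs.items()}

# Which borough ranks highest under each measure?
top_by_count = max(boroughs, key=lambda name: boroughs[name]['house_count'])
top_by_density = max(densities, key=densities.get)

print(densities)
print(top_by_count, '|', top_by_density)
```

The larger borough has more than twice as many points but only half the density, so a map of raw counts and a map of densities would highlight different places.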

### Spatial Concentration
However, if you look at the underlying pattern of points (above), you may notice that there are some distinct groupings _within_ boroughs that may be obscured even by mapping densities at the borough scale. One way to get at that explicitly is to create a hexagonal grid at a smaller scale and aggregate points to it.

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) # Setup figure and axis
hb = ax.hexbin(hp2.xcoord, hp2.ycoord, gridsize=50, alpha=0.5, cmap='BuPu') # Add hexagon layer that displays count of points in each polygon
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.colorbar(hb) # Add a colorbar (optional)
plt.show()

If we overlay that grid on the densities by borough, you start to see some really interesting patterns that are mismatched between scales - notice large concentrations in single hexagons in several outlying boroughs in the north, northwest, and east (despite low overall densities in those boroughs). This overlay also helps to illustrate the issues described by the __Modifiable Areal Unit Problem__ (MAUP) - the same underlying data (points), when aggregated according to a different zonal scheme (hexagons) demonstrates a very different spatial pattern - the concentrations no longer appear to be primarily in the south-central boroughs, but rather in specific hexagons somewhat evenly distributed across the entire urban area.  

Which zonal scheme provides a more "accurate" picture of the process? That depends on the characteristics of the process itself (e.g., are houses sold based on characteristics of a small hexagonal radius, or the larger neighbourhood? what is the scale of urban interaction?), complementary data, and subjective interpretations. Generally smaller spatial units provide for a more fine-grained understanding of locational characteristics, but some urban economic processes inherently play out at larger scales (economic markets and commuting, for example, are generally characteristics of large urban regions). The important point is, even if you do not have access to the individual data units, make sure you understand *how* spatial concentration might be influencing the aggregate patterns being mapped.

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) # Setup figure and axis
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, aspect=1)
hb = ax.hexbin(hp2.xcoord, hp2.ycoord, gridsize=50, alpha=0.5, cmap='BuPu') # Add hexagon layer that displays count of points in each polygon
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.colorbar(hb) # Add a colorbar (optional)
plt.show()
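
The MAUP can be illustrated without any spatial data at all. The sketch below aggregates the same made-up one-dimensional point locations into two different zoning schemes; the helper ```zone_counts``` is hypothetical and assumes every point lies at or above the first zone edge and strictly below the last.

```python
from bisect import bisect_right


def zone_counts(xs, edges):
    """Count points per zone, where zone i covers edges[i] <= x < edges[i + 1]."""
    counts = [0] * (len(edges) - 1)
    for x in xs:
        counts[bisect_right(edges, x) - 1] += 1
    return counts


# Made-up point locations along a 10-unit transect, clustered around the middle.
points = [1, 4.5, 4.8, 5.2, 5.5, 9]

coarse = zone_counts(points, [0, 5, 10])          # two large zones
fine = zone_counts(points, [0, 2.5, 5, 7.5, 10])  # four smaller zones

print(coarse)
print(fine)
```

The two large zones suggest a perfectly even spread, while the four smaller zones reveal the central cluster: same points, different zoning, different apparent pattern.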

### Spatial Representativeness
Of course, since we are using a sample of house points, it's not possible for us to know whether or not the pattern displayed above is actually truly representative of the total pattern of house points (we do have that data, and will use it in future practicals, but for now, let's imagine that we don't). That means that any other aggregated features that we derive from these house points - such as average area cost (summed price / summed floor area) by borough - may be subject to error.

Again, typically when using aggregated data, we don't have access to the underlying points, so we would download this house count or density data without any ability to interrogate its underlying representativeness. However, since we do have the individual aggregated counts in this case, we can assess representativeness using a population-based weighting scheme typical in the social sciences. The basic logic is to find each borough's proportion of total population (the "target" value) and then divide that (for each borough) by the borough's proportion of house points (the "sample" value). A good discussion of this methodology, often used in sample surveys, can be found here: [Weighting of responses in the Consumer Survey: Alternative approaches – Effects on variance and tracking performance of the Consumer Confidence Indicator](https://web-archive.oecd.org/2013-11-18/256240-iobe%20m.vassileiadis_weighting%20of%20responses%20in%20the%20consumer%20survey-%20effects%20on%20the%20consumer%20confidence%20indicator_paper.pdf). The resulting 'House_Weight' is a measure that represents the under- or over-representation of house points in the sample in each borough compared to the underlying distribution of population in each borough. Multiplying by 'House_Count' or 'House_Dens' yields a population-adjusted count/density value for the house points.

In [None]:
lb2['Pop'] = lb2['PopDens']*lb2['HECTARES'] #To get raw population from this dataset we first need to obtain it from the density measure, which is all we are given
lb2['Pop_Prop'] = lb2['Pop']/(lb2['Pop'].sum())
lb2['House_Prop'] = lb2['House_Count']/(lb2['House_Count'].sum())
lb2['House_Weight'] = lb2['Pop_Prop']/lb2['House_Prop']
lb2
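
As a toy illustration of the weighting logic, consider two hypothetical boroughs, 'A' and 'B', with made-up populations and sampled house counts. The variable names mirror the columns created above but the numbers are invented purely for the example.

```python
# Made-up figures: borough A holds 75% of the population but only 50% of the sample.
population = {'A': 300, 'B': 100}
house_count = {'A': 50, 'B': 50}

total_pop = sum(population.values())
total_houses = sum(house_count.values())

weights = {}
for name in population:
    pop_prop = population[name] / total_pop          # "target" share
    house_prop = house_count[name] / total_houses    # "sample" share
    weights[name] = pop_prop / house_prop

# Applying the weight rescales the counts to follow the population shares.
weighted_count = {name: house_count[name] * weights[name] for name in house_count}

print(weights)
print(weighted_count)
```

A weight above 1 marks a borough that is under-represented in the sample relative to its population, and a weight below 1 marks one that is over-represented; after weighting, the counts split 75/25, exactly like the population.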

Let's see how this works by applying the weight to our house point density measure. As a reminder, here is what the raw house point density distribution looks like:

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

Now, we'll create a "weighted" measure of housing density ('House_Dens_W') by multiplying this value by the 'House_Weight' we just calculated. Here is the resulting pattern:

In [None]:
lb2['House_Dens_W'] = lb2['House_Dens']*lb2['House_Weight']
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens_W', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('Population-Weighted House Point Density (units per sq. km.)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

Somewhat different! In fact, we have coerced the house point count (and thus density) measure to follow the same distribution as the underlying population (and thus population density):

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='PopDens', legend=True, cmap='BuPu', scheme='FisherJenks', k=5, edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('Population Density (per hectare)') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

In fact, we can quantify this error (i.e., lack of representativeness) by subtracting the new weighted house point density from the original value; if we divide by the original value, we can turn this into a percent error expressed as a proportion (i.e., the share of over- or under-representation in the sampled data; multiply by 100 for a percentage):

In [None]:
lb2['House_Dens_Error'] = lb2['House_Dens']-lb2['House_Dens_W']
lb2['House_Dens_Per_Error'] = lb2['House_Dens_Error']/lb2['House_Dens']
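
It may help to note what this calculation reduces to algebraically. Writing $D$ for 'House_Dens', $w$ for 'House_Weight' and $D_w = wD$ for 'House_Dens_W' (symbols introduced here only for this short derivation):

$$\frac{D - D_w}{D} = \frac{D - wD}{D} = 1 - w$$

So, under these definitions, the error for each borough depends only on its weight: boroughs with $w > 1$ (under-represented in the sample relative to population) get a negative value, and boroughs with $w < 1$ (over-represented) get a positive one.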

Mapping this quantity can shed some interesting light on the underlying representativeness of the sampled house point data we were given. 

In [None]:
norm = colors.TwoSlopeNorm(vmin=-1, vcenter=0, vmax=1) #Create a normalization for the diverging colour map with its midpoint set at 0
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.plot(ax=ax, column='House_Dens_Per_Error', legend=True, norm=norm, cmap='RdYlBu', edgecolor='white', aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('Percent Error for House Point Density') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
plt.show()

## Multivariate Data Exploration
Up to this point in the practical, we've been visualising and analysing the distribution of house point density as a single (univariate) measure. Now we will start to explore the relationship between multiple variables. First, let's map a couple of additional variables with a basemap underneath for added spatial context. We start by changing the Coordinate Reference System (CRS) from the European Petroleum Survey Group (EPSG) code 27700 'British National Grid -- UK Ordnance Survey' to EPSG 3857 'WGS 84 / Pseudo-Mercator', a global-scale projected CRS used by web map tiles (like the British National Grid, it is drawn in units of metres rather than decimal degrees):

In [None]:
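lb2 = lb2.set_crs(epsg=27700, allow_override=True) #Declare the CRS the coordinates are already stored in (no reprojection happens here)
lb2.to_crs(epsg=3857) #Returns a reprojected copy for inspection; lb2 itself is unchanged, so the plotting cells below call to_crs again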

This is necessary because the basemap generator ```contextily``` by default expects spatial data to be drawn in WGS 84 / Pseudo-Mercator (EPSG 3857) in order to line its tiles up with them. [Contextily](https://contextily.readthedocs.io/en/latest/) is a very powerful tool that draws basemaps directly into the plot window (feel free to explore the documentation to find a variety of options for different basemaps). With our projections sorted, we can now plot a choropleth of median income by borough with a map of London underneath for context:

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10)) #Subplots allows you to draw multiple plots in one figure
lb2.to_crs('EPSG:3857').plot(ax=ax, column='MedInc', legend=True, scheme='FisherJenks', k=5, edgecolor='white', alpha=.38, aspect=1)
ax.set_axis_off() #Remove axes from plot 
ax.set_title('Median Income') #Plot title text
plt.axis('equal') #Set x and y axes to be equal size
cx.add_basemap(ax, source=cx.providers.CartoDB.Voyager)
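
One detail worth noting about the cells above: ```to_crs``` returns a reprojected copy and leaves the original GeoDataFrame untouched, which is why the plotting call reprojects on the fly. A minimal sketch with a single made-up point (the coordinates are hypothetical British National Grid values, not taken from the borough data):

```python
import geopandas as gpd
from shapely.geometry import Point

# One hypothetical point in British National Grid coordinates (metres)
pts = gpd.GeoDataFrame({'name': ['demo']},
                       geometry=[Point(530000, 180000)],
                       crs='EPSG:27700')

pts_3857 = pts.to_crs(epsg=3857)  # reprojected copy

print(pts.crs.to_epsg())       # the original keeps its CRS
print(pts_3857.crs.to_epsg())  # the copy carries the new CRS
```

If you would rather reproject once and reuse the result, assign the copy to a new variable and plot that instead.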

Let's look at 'GreenSpace' and 'Ave_Area_Price' as well:

In [None]:
f, axs = plt.subplots(1, 2, figsize=(25, 20))
ax1, ax2 = axs

lb2.to_crs('EPSG:3857').plot(ax=ax1, column='GreenSpace', legend=True, scheme='FisherJenks', k=5, edgecolor='white', alpha=.38, aspect=1)
ax1.set_axis_off() #Remove axes from plot 
ax1.set_title('Green Space') #Plot title text
cx.add_basemap(ax1, source=cx.providers.CartoDB.Voyager)

lb2.to_crs('EPSG:3857').plot(ax=ax2, column='Ave_Area_Price', legend=True, scheme='FisherJenks', k=5, edgecolor='white', alpha=.38, aspect=1)
ax2.set_axis_off() #Remove axes from plot 
ax2.set_title('2023 Average Price per Square Metre') #Plot title text
cx.add_basemap(ax2, source=cx.providers.CartoDB.Voyager)

We can start again with univariate statistics to understand each of these variables' distributions. While we could use a standard histogram, this time let's try another useful plot from ```seaborn```, the '[violin](https://en.wikipedia.org/wiki/Violin_plot#:~:text=A%20violin%20plot%20is%20a,density%20plot%20on%20each%20side.&text=While%20a%20box%20plot%20only,full%20distribution%20of%20the%20data)' plot. This is a useful combination of the box plot and the histogram/probability density graph. Along its central axis it displays the inter-quartile range, median, and whiskers like a box plot: the thick black line spans the "box" of the traditional box plot, while the thin black line extends to the ends of the "whiskers"; the full extent of the violin body encompasses all of the data (i.e., including outliers). At the same time, the width of the violin body displays the probability density of the data, so you can quickly get a sense of the skew, range, and modality of a variable (more options [here](https://stackabuse.com/seaborn-violin-plot-tutorial-and-examples/)). In this case, we can clearly see that the distribution of median income is right-skewed, with a large number of positive outliers. Average price per square metre behaves similarly, while green space is closer to normally distributed.

In [None]:
seaborn.set_theme(style="whitegrid")
bx = seaborn.violinplot(x=lb2["MedInc"])

In [None]:
seaborn.set_theme(style="whitegrid")
bx = seaborn.violinplot(x=lb2["GreenSpace"])

In [None]:
seaborn.set_theme(style="whitegrid")
bx = seaborn.violinplot(x=lb2["Ave_Area_Price"])
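
The quantities a violin plot draws can also be checked numerically. The sketch below uses a small hypothetical right-skewed sample (not the borough data) and the common 1.5 x IQR box-plot rule for flagging outliers; the sample values and the choice of rule are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical right-skewed sample (not the borough data)
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                  # height of the "box"
upper_fence = q3 + 1.5 * iqr   # common rule for flagging high outliers
lower_fence = q1 - 1.5 * iqr   # and low outliers

outliers = s[(s > upper_fence) | (s < lower_fence)].tolist()
skew_demo = s.skew()           # positive values indicate a right skew

print(f"median={median}, mean={s.mean()}, IQR={iqr}")
print(f"outliers={outliers}, skew={skew_demo:.2f}")
```

A mean well above the median, together with a positive skew statistic, is the numerical counterpart of the long upper tail you see in the violin.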

What is the relationship of these variables to one another? Scatterplots can help shed some light on the context for the distribution of a variable by comparing it to another.

In [None]:
seaborn.regplot(data=lb2, x="MedInc", y="Ave_Area_Price")

This suggests a relatively strong positive correlation between median income and average price per square metre: as median income increases, so does average price. This makes sense conceptually - so much sense, in fact, that it likely isn't a very interesting hypothesis. A more interesting one might be that green space increases property values, which we can begin to explore in another scatterplot:

In [None]:
seaborn.regplot(data=lb2, x="GreenSpace", y="Ave_Area_Price")

Interestingly, there is a negative correlation between the two. This actually makes sense if you refer back to the maps above - the boroughs with larger amounts of green space tend to be in the outlying areas of the city, which is the inverse of where the highest price per square metre is. However, perhaps we've made an error in how we're measuring green space - it would make more sense to standardise it by the underlying size of each borough to create a measure of green space per km$^{2}$ (one hectare is 0.01 km$^{2}$):

In [None]:
lb2['GreenSpace_Dens'] = lb2['GreenSpace']/(lb2['HECTARES']*.01)

In [None]:
seaborn.regplot(data=lb2, x="GreenSpace_Dens", y="Ave_Area_Price")

Now we can see that there is a very clear positive relationship between the two! 
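
To see how standardising by area can flip the sign of a correlation, consider the sketch below. All values are hypothetical, constructed for five made-up boroughs ordered from inner to outer so that total green space grows with borough size while green space per km$^{2}$ shrinks.

```python
import numpy as np

# Hypothetical values for five made-up boroughs, ordered from inner to outer
hectares = np.array([1000, 2000, 4000, 8000, 16000])
green_space = np.array([300, 500, 800, 1200, 1600])  # raw total, grows with borough size
price = np.array([12000, 10000, 8000, 6000, 5000])   # price per square metre

# Standardise by area: 1 hectare = 0.01 km^2
green_dens = green_space / (hectares * 0.01)

r_raw = np.corrcoef(green_space, price)[0, 1]   # raw total vs price
r_dens = np.corrcoef(green_dens, price)[0, 1]   # density vs price

print(f"raw total vs price: r = {r_raw:.2f}")
print(f"density vs price:   r = {r_dens:.2f}")
```

In this toy example the raw total is really a proxy for borough size, so it moves against price, while the area-standardised measure moves with it.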

Another highly useful exploratory tool is the [parallel coordinate plot](https://python-graph-gallery.com/parallel-plot/#:~:text=A%20parallel%20plot%20plot%20allows,to%20plot%20interactive%20versions%20though.), which allows you to trace observations (or groups of observations) across a number of different variables. The basic logic is that each observation is drawn as a line that crosses one vertical axis per variable, at a height set by its value within that variable's range. In the example below, we categorize observations into "very high" and "other" in terms of average house price per square metre, just to illustrate how the parallel coordinate plot can be used to trace the relationships across variables for a subset of observations.

In [None]:
pd.DataFrame.iteritems = pd.DataFrame.items #workaround: pandas 2.0 removed DataFrame.iteritems, which older plotly versions still call
lb2['V_High'] = (lb2['Ave_Area_Price'] > 11000).astype(int)
fig = px.parallel_coordinates(lb2, color="V_High",
                             dimensions=['Ave_Area_Price', 'MedInc', 'GreenSpace_Dens', 'House_Dens'])
fig.show()

In this case, we see that the "very high" boroughs have the highest average price per square metre (which of course makes sense, as that was how we defined the group), as well as very high median incomes, green space density, and house point density. We can also remove the classification and display the colour scheme continuously across all values of average price per square metre:

In [None]:
fig = px.parallel_coordinates(lb2, color="Ave_Area_Price",
                             dimensions=['Ave_Area_Price', 'MedInc', 'GreenSpace_Dens', 'House_Dens'])
fig.show()

These plots not only confirm the bivariate relationships between average house price per square metre, median income, and green space density that we ascertained through the scatterplots, they also allow us to trace correlations across all of the input dimensions. Thus we can pick out the patterns for individual observations (or groups of similar observations) across all of the variables of interest, rather than relying on a series of bivariate scatterplots.
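
Conceptually, each vertical axis in a parallel coordinate plot spans the observed range of its own variable, so an observation's line crosses each axis at its relative position within that range. The sketch below illustrates the idea with min-max rescaling on a hypothetical three-row table; it is only an illustration of the logic under that assumption, not a description of what ```plotly``` does internally.

```python
import pandas as pd

# Hypothetical values for three made-up areas
pc_toy = pd.DataFrame({
    'Ave_Area_Price': [5000, 8000, 12000],
    'MedInc': [30000, 40000, 60000],
})

# Binary group flag, built the same way as V_High above
pc_toy['V_High'] = (pc_toy['Ave_Area_Price'] > 11000).astype(int)

# Rescale each variable to 0-1 within its own range (one axis per variable)
dims = ['Ave_Area_Price', 'MedInc']
scaled = (pc_toy[dims] - pc_toy[dims].min()) / (pc_toy[dims].max() - pc_toy[dims].min())

print(pc_toy['V_High'].tolist())
print(scaled.round(2))
```

A row that sits near 1 on every rescaled variable corresponds to a line running along the top of every axis in the plot.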

Another very useful option for assessing relationships across multiple variables is to compute a correlation table, which shows the individual bivariate correlations between each combination of variables in a DataFrame:

In [None]:
lb2df = lb2.loc[:,'PopDens':'GreenSpace_Dens'] 
lb2df.corr()
# plt.matshow(lb2df.corr()) # Code to create a heatmap of the correlation table
# plt.show()
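
A full correlation table repeats every pair twice and has 1s along the diagonal, so it can help to list each pair once, ranked by strength. The sketch below does this on a small hypothetical table (the column names echo the ones used above, but the values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical values for four made-up areas
corr_toy = pd.DataFrame({
    'MedInc': [30, 40, 50, 60],
    'Ave_Area_Price': [5, 7, 8, 12],
    'GreenSpace_Dens': [10, 15, 14, 25],
})

corr = corr_toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (variable, variable) pairs and rank by absolute correlation
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```

The same two steps should work on ```lb2df.corr()```, provided that the selected columns are all numeric.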

## Self-Test Exercises
1. Map the variables 'Unemp', 'Disabled', 'NonEng', and 'CrimeRate' using each of the four classification schemes shown here. Change $k$, explore new colour palettes, and evaluate which classification scheme/colour palette combination performs best.  
2. Once you have a good classification scheme for each, compare the distribution of these four variables and look at relationships to one another and to median income, average price per square metre, and green space density - what insights can you draw about the relationship between these variables? Do these relationships confirm or contradict your personal, anecdotal hypotheses?