# Data Series & Indexing

*Remember*: DataFrames are composed of one or more Series (columns) that are indexed to enable finding and sorting. 

Ordinarily, the data constituting the Series are read directly from a file and the index is automatically set to the first available 'index-like' column in the file. But you are not bound by what pandas thinks is the 'right' index: you can set any column as an index, or even create one of your own!

For instance, if you wanted a Series containing only the latitudes of British cities, you could create one with a custom index as follows:
```python
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620], 
    index = ['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)
```
In this case, the index is a list of cities, and looking up the latitude of any city by its index label is generally quick. You are never limited to _only_ looking up values by index, but doing so is usually the fastest option.

In [2]:
import pandas as pd
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620], 
    index = ['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)
print "Type of myLatitudes: "      + str(type(myLatitudes))
print "Access like a dictionary: " + str(myLatitudes['Liverpool'])
print "Access like a method: "     + str(myLatitudes.Liverpool)

# Assign a number (not a string) so that the Series keeps its integer dtype
myLatitudes.Bristol = 555000

print "Updated latitude: " + str(myLatitudes.Bristol)

Type of myLatitudes: <class 'pandas.core.series.Series'>
Access like a dictionary: 7063197
Access like a method: 7063197
Updated latitude: 555000


Let's also look at how you access data in a more 'data set-like' fashion – how do you select, say, the first two rows (positions 0 and 1, i.e. < 2)? Note that there is a [lot more](http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#selection-by-position) that you can do with this.

A simple mnemonic for `loc` and `iloc` is that `loc` is helping you to find the *location* of something in the data frame (like working with a dictionary using keys), while `iloc` is using the *integer location* (like working with a list using integers). It's obviously not *quite* that simple because the `loc` example below actually produces a range, but this at least helps you to remember right?

In [3]:
# Access like a list
print myLatitudes.iloc[0:2]

print "\n"

# Access a range
print myLatitudes.loc['Reading':]

Liverpool    7063197
Bristol       555000
dtype: int64


Reading    6703134
Glasgow    7538620
dtype: int64
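
One difference that the mnemonic hides is worth a quick sketch: a position slice with `iloc` leaves out the end position, while a label slice with `loc` includes the end label. The example below reuses the values from the first cell; the variable names (`lat`, `by_position`, `by_label`) are just for illustration, and it is written with `print(...)` so that it also runs under Python 3.

```python
import pandas as pd

lat = pd.Series(
    [7063197, 6708480, 6703134, 7538620],
    index=['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)

# Position slice: the end position (2) is excluded, so we get 2 rows
by_position = lat.iloc[0:2]

# Label slice: the end label ('Reading') is included, so we get 3 rows
by_label = lat.loc['Liverpool':'Reading']

print(list(by_position.index))
print(list(by_label.index))
```

So `iloc[0:2]` behaves like slicing a list, whereas `loc` treats both ends of the range as part of the selection.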


# Adding a New Series

To add a new series to an existing data frame we use the dictionary-like syntax that we could _also_ use with the Data Series itself:
```python
df['NewSeriesName'] = pd.Series(...Series definition...)
``` 
See how familiar that syntax is? `df['NewSeriesName']` is _exactly_ like creating and assigning a new key/value pair to a dictionary! The only difference here is that the 'value' we store in the dictionary is a Series object, and not a simple variable (String, int, float).
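
As a minimal sketch of what happens on assignment (the `Population` and `Rank` columns and their values are made up for illustration): pandas lines the new Series up with the data frame using the _index_, not the order in which the values were typed.

```python
import pandas as pd

# Hypothetical data: two cities and a made-up 'Population' column
df = pd.DataFrame({'Population': [500, 450]}, index=['Liverpool', 'Bristol'])

# The new Series lists the cities in a different order
# and includes one that isn't in the data frame at all
df['Rank'] = pd.Series([2, 1, 3], index=['Bristol', 'Liverpool', 'Reading'])

print(df)
```

Liverpool ends up with rank 1 and Bristol with rank 2, and the value for Reading is simply dropped because the data frame has no such row.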

# Transformation

One of the most frequent applications of adding a new series (i.e. column) to a data set is when we want to transform the original data in some way. Transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more *tractable* for subsequent analysis.

Here's an example: let's say that we want to understand how student heights are distributed within a class. It's not at all easy if all you have to go on is a list of raw heights: 160cm, 158cm, 150cm, 185cm, 172cm, 175cm, 166cm...

An obvious first step to understanding this student data would be to calculate a mean ($\mu$) from the data since that tells us the _average_ height of all students in the class. But wouldn't it be handy to be able to examine in more detail how tall students are _relative_ to the mean? Is there someone from the basketball team taking this course? Or maybe the cox from a crew? How could we make it easy to compare the difference between each student and the overall class average in order to spot these 'special cases'?

Let's try writing this out as Python code, in other words:
```python
# Create an empty data frame to 
# hold our height data
df = pd.DataFrame() 

# Create and add a series
df['Heights'] = pd.Series(
    [160, 158, 150, 185, 172, 175, 166],
    index = ['Judy','Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan']
)

# Look at the results
df.describe()
```
To recap: looking at the heights of the students (whether in code or in the notebook generally) it's hard to tell how far each student is from average, and who might be especially (*significantly*) tall or short. When this happens we can _transform_ the raw data in order to make it easier to see and interpret this variation.

## Subtracting the Mean

In many cases, the best way to make this comparison is to _subtract the mean_. Why is that? What have we achieved?

Let's think it out:
1. If a student is shorter _than average_ then their transformed height is less than 0
2. If a student is taller _than average_ then their transformed height is more than 0
3. The distance from 0 (e.g. -20 vs -3) gives us _some_ sense of how short or how tall someone is

In a mathematical form we'd write this transformation as:
$$
x - \mu
$$

In pandas we could write this transformation as:
```python
df['TransformedHeights'] = pd.Series(df.Heights-df.Heights.mean())
df.describe()
```

We can break this apart as:

* `df.Heights` – this is the _entire_ data series of student heights
* `df.Heights.mean()` – this calculates the mean ($\mu$) of student heights
* We perform this calculation and then use the results to create a new data series
* We assign this series to a new column in the data frame called `TransformedHeights`.

Pandas is smart enough to know that it needs to take _each_ student height and then subtract the mean height of all students from that value. So even though it looks like we're performing a single calculation, we're actually performing as many calculations as there are rows in the data frame but without needing to write any tricky code!

*Remember*: subtracting the mean is a linear transformation (unlike the log-transform).
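
One way to convince yourself of this is the small check below, which uses the same heights as above (the variable names are just for illustration): the transformed values average out to zero, and the gap between any two students is exactly what it was before.

```python
import pandas as pd

heights = pd.Series(
    [160, 158, 150, 185, 172, 175, 166],
    index=['Judy', 'Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan']
)
centred = heights - heights.mean()

# The centred values average out to zero (give or take rounding error)
print(abs(centred.mean()) < 1e-9)

# The gap between any two students is unchanged by the transformation
print(heights['Eve'] - heights['Alice'])
print(centred['Eve'] - centred['Alice'])
```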

# Looking at the Effect of a Transformation

Now let’s see what effect this transformation has on real-world data. This is data on Socioeconomic Class from the UK, and the code below is designed in a way that is _parameterised_, meaning that we can quickly change the series from one column to another:

In [4]:
import pandas as pd

# Create our own column names to replace
# the ones provided in the data
colnames = ['CDU','GeoCode','GeoLabel','GeoType','GeoType2','Total']
for i in range(1,8):
    colnames.append('Group' + str(i))
colnames.append('NC')
df = pd.read_csv('./Data/Data_NSSHRP_UNIT_URESPOP.csv', header=0, skiprows=[1], names=colnames)

# Show first few rows of the df
print "Summarising df..."
print df.head(1)
print "\n"

# Print summary for group
series = df.Group1
print "Summarising " + series.name + "..."
print series.describe()
print "\n"

# Or we can do pretty numbers!
print "Prettily formatted metrics from " + series.name + "..."
print "\tMedian:  {0:> 7.2f}".format(series.median())
print "\tLQ:      {0:> 7.2f}".format(series.quantile(0.25))
print "\tUQ:      {0:> 7.2f}".format(series.quantile(0.75))
print "\tRange:   {0:> 7.2f}".format(series.max()-series.min())

Summarising df...
                               CDU  \
9937 E02000001  City of London 001   

                                                         GeoCode GeoLabel  \
9937 E02000001  Middle Super Output Areas and Intermediate Zones   MSOAIZ   

                GeoType  GeoType2  Total  Group1  Group2  Group3  Group4  \
9937 E02000001     7187      2730   2246     543     497     224     308   

                Group5  Group6  Group7  NC  
9937 E02000001     212     178   249.0 NaN  


Summarising Group1...
count    8480.000000
mean      744.774292
std       255.340839
min       108.000000
25%       564.000000
50%       729.000000
75%       911.000000
max      2020.000000
Name: Group1, dtype: float64


Prettily formatted metrics from Group1...
	Median:   729.00
	LQ:       564.00
	UQ:       911.00
	Range:    1912.00


In the coding cell below, why don't you use what we've just seen above to calculate a transformed value for Group1, assign it to a new series called `Group1LessMean`, and then print out the pretty-printed summary?

In [5]:
# Calculate and assign a transformed variable
# to the data frame...
df['Group1LessMean'] = ???

# Which series are we working with?
???

# Or we can do pretty numbers!
print "Prettily formatted metrics from " + series.name + "..."
print "\tMedian:  {0:> 7.2f}".format(series.median())
print "\tLQ:      {0:> 7.2f}".format(series.quantile(0.25))
print "\tUQ:      {0:> 7.2f}".format(series.quantile(0.75))
print "\tRange:   {0:> 7.2f}".format(series.max()-series.min())

SyntaxError: invalid syntax (<ipython-input-5-dd3d77cbd830>, line 3)

If you compare the results how important are the changes?

### String Formatting for Pretty-Printed Numbers 

Notice also the `<string>.format()` command I’ve used here: `{0:> 7.2f}`. In order to understand how this works for formatting the results in a nice, systematic way you will need to read [the documentation](http://www.python.org/dev/peps/pep-3101/).

The 'pep' tells you that:
* `{0}` tells Python to grab the first value inside the parentheses (`format(... values ...)`) and to stick it into the string at this point, while `:...` tells Python that we also want to format that value in the way specified by the `...`.
* `>` tells Python that the value should be right-aligned.
* The space (' ') after the `>` is the 'sign' option: it leaves a blank where a minus sign would go so that positive and negative numbers line up. (To change the 'fill' used for padding you put a character _before_ the `>`, e.g. `0>`.)
* `7.2f` tells Python to treat anything it gets as a float (even if the variable is an int) and to format it with 2 digits after the full-stop and a minimum width of 7 characters in all, counting the full-stop and the sign (which ties us back to the right-alignment up above). If you give it a number that needs more than 7 characters then it will still print all of them; if it needs fewer then the rest is padding.

Here are some suggestions to better-understand what’s going on:
* Try changing the `>` to a `<`
* Try changing the `.2` to a `.0`

Do this in the coding area below.

In [17]:
print "\tMedian:  {0:<9.0f}".format(series.median())
print "\tLQ:      {0:8.2f}".format(series.quantile(0.25))
print "\tUQ:      {0:>7.4f}".format(series.quantile(0.75))
print "\tRange:   {0:<6.6f}".format(series.max()-series.min())

	Median:  729      
	LQ:        564.00
	UQ:      911.0000
	Range:   1912.000000
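
To see the pieces of the format specification one at a time, here is a small sketch using a single made-up value; the square brackets are only there to make the padding visible.

```python
value = 729.0

# '>' right-aligns within a minimum width of 7 characters
right = "{0:>7.2f}".format(value)
# '<' left-aligns, so the padding goes on the right instead
left = "{0:<7.2f}".format(value)
# A character placed *before* the alignment symbol is the fill character
zero_filled = "{0:0>7.2f}".format(value)
# '.0f' rounds to no decimal places
no_decimals = "{0:>7.0f}".format(value)

for s in (right, left, zero_filled, no_decimals):
    print("[" + s + "]")
```

This prints `[ 729.00]`, `[729.00 ]`, `[0729.00]` and `[    729]` in turn.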


## Logarithmic Transformation

Let’s do one more simple transformation: taking the natural logarithm of the Group1 values. If you don’t remember what a logarithm is try these:
* https://www.youtube.com/watch?v=zzu2POfYv0Y 
* https://www.youtube.com/watch?v=akXXXx2ahW0 
* https://www.youtube.com/watch?v=0fKBhvDjuy0 (made by Ray & Charles Eames, two of the 20th Century’s most famous designers).

Note that logarithms are non-linear transformations -- can you think why this is?

### Logarithmic Transforms in Pandas

To create a new series in the data frame containing the natural log of the original value it’s a similar process to what we've done before; since pandas doesn't provide a log-transform method (i.e. you can’t call `df.Group1.log()`, and there's no particular reason why you should be able to) we need to use the `numpy` package again:
```python
import numpy as np
df['Group1Log'] = pd.Series(np.log(df.Group1))
```
Try printing out the same summary measures as above in the coding area below. Is it more clear to you now why a log-transform is a non-linear transformation?
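
If it helps, here is a minimal sketch with three made-up values (not taken from the data set) that contrasts the two transformations by looking at the gaps between neighbouring values.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 100.0, 1000.0])

# Subtracting the mean keeps the gaps between values as they were...
centred = values - values.mean()
print(centred.diff().tolist())

# ...whereas the log squeezes large gaps far more than small ones
logged = np.log(values)
print(logged.diff().tolist())
```

After subtracting the mean the gaps are still 90 and 900; after taking logs the two gaps are equal (about 2.3 each), because each value is ten times the one before it. The spacing between points has changed, which is what makes the transformation non-linear.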

# Standardisation

## Proportional Standardisation

Clearly, a proportion (e.g. a percentage) is one way of standardising data since, unless you're measuring change, it limits the range to between 0% and 100%. Programmers and statisticians almost always write a proportion in a decimal format so the range is between 0.0 and 1.0. Mathematically, however, the notation is a little more forbidding:
$$
p_{i} = \frac{x_{i}}{\sum_{j=1}^{n} x_{j}}
$$
Here $x_{i}$ is the count for the group that we're interested in, and the sum on the bottom runs over all $n$ groups to give the total.

### Proportional Standardisation in Pandas

We can calculate the proportion of people in each area who come from Group 1 using a similar format to what we've seen before in the Transformation section:
```python
df['Group1Pct'] = pd.Series(df.Group1/df.Total)
```
You might recognise that this is that dictionary style of key/value pairs again, so `'Group1Pct'` is the key, and the new data series is the value.

Try printing out various summary metrics for the new column and comparing them to the raw values. What do you notice about the first few rows (e.g. for the City of London) when you use `df.head()`?
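
To make the point of the proportion concrete, here is a small sketch with two hypothetical areas (the names and counts are invented, not taken from the real data): the bigger area has far more Group 1 people, but a smaller _share_ of them.

```python
import pandas as pd

# Hypothetical counts for two areas (not taken from the real data)
areas = pd.DataFrame(
    {'Group1': [50, 400], 'Total': [200, 4000]},
    index=['SmallArea', 'BigArea']
)

# Dividing one Series by another works row-by-row
areas['Group1Pct'] = areas.Group1 / areas.Total
print(areas)
```

`SmallArea` comes out at 0.25 and `BigArea` at 0.10, so the ranking by proportion is the reverse of the ranking by raw count.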

## Z-Score Standardisation

The z-score is a common type of standardisation, but it's a little more complex than a simple proportion; however, both are designed to enable us to compare data across different groups of observations. We can easily compare two percentages to know which one is more, and which less (e.g. I got 80% on one exam and 70% on the other).

But let's think about that 'which exam did I do better on?' question a little bit more: what if you got 80% on an exam where everyone else got 85%? Suddenly that doesn't look quite so good right? And what if your 70% was on an exam where the average score was 50%? The z-score is designed to help you perform this comparison in a numerical way.

As a reminder, the z-score looks like this:
$$
z=(x-\mu)/\sigma
$$
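
As a sketch of the exam comparison above: the text only gives us the scores and the averages, so the standard deviations used here (5 and 10 percentage points) are assumptions made purely for illustration, as is the helper name `z_score`.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / float(sigma)

# 80% on an exam where the average was 85% (assumed std. dev. of 5)
exam_a = z_score(80, 85, 5)

# 70% on an exam where the average was 50% (assumed std. dev. of 10)
exam_b = z_score(70, 50, 10)

print(exam_a)
print(exam_b)
```

Under these assumptions the 80% is one standard deviation _below_ average (z = -1.0) while the 70% is two standard deviations _above_ it (z = 2.0), so the lower raw mark is the better performance.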


### Z-Score Standardisation in Pandas

Let’s figure out how to translate this into a new column called `Group1ZStd`!

We’ve already done the first part of this calculation up above in `Group1LessMean`. That series value is the same as $x-\mu$, so all we need to do is divide by the standard deviation.

So one way to do this is:
```python
df['Group1ZStd'] = pd.Series(df.Group1LessMean / df.Group1.std())
```
That works exactly the same way as what we did when we subtracted the mean in the first place: we’re just taking the results from the previous equation and passing them on to this one. 

We could also do it all in one go as:
```python
df['Group1ZStd'] = pd.Series(
    (df.Group1 - df.Group1.mean()) / df.Group1.std()
)
```
Do you see how we can begin to build increasingly complicated equations into the process of creating a new data series?

### Using the Z-Score

In the previous notebook, there was a long digression about 'data generating processes'; here, we can start to bring this to life. The first thing to do with the z-score is to look at what it implies:

1. Subtracting the mean implies that the mean _is a useful measure of centrality_: in other words, the only reason to subtract the mean is if the mean is _meaningful_. If you are dealing with highly skewed data then the mean is not going to be very useful and, by implication, neither is the z-score.
2. Dividing by the Standard Deviation implies that this _is a useful measure of distribution_: again, if the data is skewed or follows some exponential distribution then the standard deviation is going to be near-useless as an analytical tool.

So the z-score is _most_ relevant when we are dealing with something that looks vaguely like a normal distribution (the z-scored version of which has mean=0 and standard deviation=1). In those cases, anything with a z-score of more than 1.96 or less than -1.96 (i.e. more than 1.96 standard deviations from the mean) falls in the most extreme 5% of values, which is the conventional 5% significance zone.
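
Here is a rough sketch of that 5% figure using simulated data rather than anything real: the sample below is drawn from a normal distribution whose mean (170) and standard deviation (8) are arbitrary choices, and the seed is fixed only so that the result is repeatable.

```python
import numpy as np
import pandas as pd

# Simulated, normally distributed 'heights' (arbitrary mean and spread)
rng = np.random.RandomState(42)
sample = pd.Series(rng.normal(loc=170, scale=8, size=100000))

# Standardise: subtract the mean, divide by the standard deviation
z = (sample - sample.mean()) / sample.std()

# What share of observations lie more than 1.96 std. devs. from the mean?
share_extreme = (z.abs() > 1.96).mean()
print(round(share_extreme, 3))
```

The share should come out very close to 0.05. Try swapping `rng.normal(...)` for something skewed, such as `rng.exponential(...)`, to see how the figure shifts when the data are not normal.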

But note: we can't really say _why_ one particular area has a high concentration of employment or why one individual is over 2m tall. All we are saying is that this standardised value is a pretty unlikely outcome _when compared to our expectation that employment is randomly distributed across the region_. We _know_ that employment isn't randomly distributed in the same way that we know that height isn't genuinely random because of the influence of genetics, nutrition, etc. But we need a way to pick out what counts as _**significant**_ over- or under-concentration (or height) from the wide range of 'a bit more' or 'a bit less' than 'normal'.

And here we get to the crux of the issue: most frequentist statistics boils down to subtracting **what you expected** from **what you got**, and then dividing by **some measure of spread** to control for the range of the data. We then look at what's _left_ to see if we think the gap between expectations and observations was _meaningful_ or whether it falls within the range of 'noise'.

It should be obvious that it's the _**expected**_ part of that equation that is absolutely crucial: we come up with a process that generates data that _resembles_ the important dimensions of reality without thinking that the process has explained them. It's the first step, not the last.

## Location Quotient Standardisation (LQ)

This 'LQ' is used in geography to measure the concentration of something in a sub-region relative to its overall concentration in the wider region of which that sub-region is a part. One of the most common applications is in economic geography where we are trying to find concentrations (or absences) of employment that seem significant.

For example, let's pretend that we have a (very small) country composed of three regions:
- Region 1: 400 employees; 200 in steel
- Region 2: 200 employees; 150 in steel
- Region 3: 300 employees; 80 in steel
  
*Question*: which region has the greatest concentration of steelworkers?

*Answer*: depends what you mean by concentration. 

There are _more_ steelworkers in Region 1 overall, but their _density_ in Region 2 is higher (because `150/200 > 200/400`). We use the density (which is a simple proportion) to ensure that we are comparing like-for-like; the proportion controls for the fact that each of the regions has a different number of total employees.

Let's put it another way: cities like London and New York are gigantic. If you want to know where there are the _most_ bankers or _most_ bagel factories then your answer will almost always be... London and New York. But what if there's a small town where 95% of people are bankers, and another town where 80% of people are bakers? Surely that's pretty interesting too, no? What's going on that a place can support way more of these professions than we'd expect?

_Expect_, there's that word again. How do we define what we _expect_ so that we can compare it to what we _got_? Well, the mean is one way of defining our expectations -- if you had to guess the height of a new student in your class, your _expectation_ would be based on the heights of the existing students, and the best guess that you could make would be the _average_ height of those students because the majority of students are about that tall.

In the case of our little steel-producing country, we would need to define our expectation a little differently. We're already using a proportion (steelworkers in a region/all workers in a region) to control for the fact that our regions have different sizes. Why not use a _second_ proportion (all steelworkers in the country/all workers in the country) to set our expectations about how concentrated steel employment will be in each area?

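That's the LQ.
$$
LQ = \frac{e_{s,r}}{e_{a,r}} \Big/ \frac{e_{s,A}}{e_{a,A}}
$$
Here $e$ is a count of employees, $s$ is the sector of interest (steel), $a$ is all sectors, $r$ is the sub-region, and $A$ is the wider region (the whole country).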

What we do by using the proportion across the entire country is to _expect_ that employment should be distributed evenly everywhere. If steelworkers don't have any particular needs then they should be distributed in line with employment as a whole. So if we find areas that are way above or below this then that might be something worth digging into.

### Calculating the LQ in Pandas

In [13]:
# Notice how we can create a data frame 'by hand'
# Why do we set these up as floats, not ints?
d = {
  'allemp': pd.Series([400,200,300], index=['Region1','Region2','Region3']),
  'steel' : pd.Series([200,150,80], index=['Region1','Region2','Region3']),
}
df = pd.DataFrame(d)

# Look at what we've done
print df

print "\n"

print "Proportion in each region: "
print df.steel / df.allemp

print "\n"

print "Proportion in entire country: "
print df.steel.sum() / df.allemp.sum()

print "\n"

print "LQs for each region: "
print (df.steel / df.allemp) / (df.steel.sum() / df.allemp.sum())


         allemp  steel
Region1     400    200
Region2     200    150
Region3     300     80


Proportion in each region: 
Region1    0.500000
Region2    0.750000
Region3    0.266667
dtype: float64


Proportion in entire country: 
0


LQs for each region: 
Region1    inf
Region2    inf
Region3    inf
dtype: float64


#### Debugging

Huh, what's going on here? Why is the LQ `inf` (i.e. infinite)? That's not very helpful! Have a look at the results preceding the nonsensical LQ: can you see where this 'bug' arises?

*Hint*: what do you get if you divide an integer by another integer in Python? What do you get if you divide an integer by a float (or a float by an integer)?
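
If you want to experiment with the hint, the sketch below uses two arbitrary numbers. The `//` operator always throws away the fractional part; in Python 2 (which this notebook uses) a plain `/` between two integers behaves the same way, whereas in Python 3 it does not.

```python
# Floor division: the fractional part is thrown away
floored = 7 // 2
print(floored)

# As soon as one side is a float, the fraction survives
exact = 7 / 2.0
print(exact)
print(float(7) / 2)
```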

Fix the code block above so that you get three LQs in the range between 0.558 and 1.57.

It can be really useful to manually double-check the results of a calculation using a calculator or just by typing the numbers into a separate line in Python's interpreter -- if you were to get the LQ wrong here then all of the analysis that you do later would be wrong too. And while the replicability of code helps (i.e. you can fix the mistake and then re-run the entire analysis very quickly), you should never assume that you got it exactly right until you’ve double-checked.

### Interpreting Your Results

How do we interpret the LQ results? Let's start with the two simplest results: 0 and 1. If the LQ is 0 then that means that the numerator (top half of the LQ formula) was 0; there is no employment in the sector of interest in that region. If the LQ is 1 then that means that the top and bottom of the LQ equation were the same; the density of employment in the region is _the same_ as the density in the country as a whole. Anything more than one means a greater density than in the country as a whole, and anything less than one means a lower density.[1]

[1] _Note_: you don't have to do this analysis at the country level, you could have your 'country' be a city and your regions be neighbourhoods or districts... basically, any time you have smaller zones nested within a larger one.

From there, the most straightforward way to think about it more deeply is, perhaps surprisingly, using numbers. What do we get if the LQ is: $\frac{0.5}{0.25}$? In this case, we'd be saying that 50% of employment in the region of interest is in our sector of interest, while in the country as a whole we would expect 25% of employment to be in that sector. That gives us a LQ of 2, and we can read that directly as being _twice_ as concentrated! If we had the reverse: $\frac{0.25}{0.5}$ then we'd get 0.5 as the result and we'd know that employment was _half_ as concentrated...
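
The reading described above can be sketched as a tiny helper; the function name `describe_lq` and its wording are invented for illustration, and it takes the two proportions (regional and national) as inputs.

```python
def describe_lq(regional_share, national_share):
    """Return the LQ and a plain-English reading of it."""
    lq = regional_share / float(national_share)
    if lq > 1:
        reading = "more concentrated than the country as a whole"
    elif lq < 1:
        reading = "less concentrated than the country as a whole"
    else:
        reading = "in line with the country as a whole"
    return lq, reading

print(describe_lq(0.5, 0.25))
print(describe_lq(0.25, 0.5))
```

The first call gives an LQ of 2.0 (twice as concentrated) and the second gives 0.5 (half as concentrated), matching the two worked fractions above.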

# Exploratory Data Analysis & VDQI

If we weren't learning how to program at the same time as we learn to do data analysis then my recommendation would have been this: **start with a chart**. There is _no_ better tool for understanding what is going on in your data than to visualise it, but we couldn't show you how to make a plot without first teaching you how to load data and perform some basic operations on a data frame! Now we can get to grips with VDQI (the Visual Display of Quantitative Information) and how this supports our exploration of the data.

## Why Seaborn

For data visualisation we are going to use the [Seaborn package](http://stanford.edu/~mwaskom/software/seaborn/) because it provides a lot of quite complex functionality (and very pretty pictures) at quite low cost. There are, however, other options out there; the two that you are most likely to hear mentioned are: [Bokeh](http://bokeh.pydata.org/en/latest/) and matplotlib. Bokeh is, like Seaborn, designed to make it easy for you to create good-looking plots with minimal effort.

Matplotlib is a different beast: it is actually the _underlying_ package that supports the majority of plotting (drawing graphs) in Python. Seaborn makes use of the matplotlib library to create its plots (Bokeh, by contrast, draws its plots in the web browser using its own JavaScript library), so if you want to customise a Seaborn figure you will eventually need to get to grips with matplotlib. The reason we don't teach matplotlib directly is that it's much harder to make a good plot and the syntax is much more complex.

A more recent entrant is ŷhat's ggplot library (http://ggplot.yhathq.com), which deliberately mimics R’s ggplot2 -- ggplot2 has become the dominant way of creating plots in the R programming language and it uses a 'visualisation' grammar that many people find incredibly powerful and highly customisable. Unfortunately, ggplot on Python does not currently support mapping (which R does in ggplot2).

## Loading Seaborn 

As with other libraries that we’ve used, we’ll import Seaborn using an alias:
```python
import seaborn as sns

```
So to access Seaborn's functions we will now always just write `sns.<function name>()` (where `<function name>` would be something like `distplot`).

## Making a Distribution Plot

One of the most useful ways to get a sense of a data series is simply to look at its overall distribution. Something like this:
```python
%pylab inline
import matplotlib.pyplot as plt
series = df.Group1Lq
# Use the 'sns' alias that we set up when importing Seaborn
sns.distplot(series)
plt.show()
```
The `%pylab inline` command only needs to be run _once_ in a Jupyter notebook; it tells Jupyter to show the plots as part of the web page, rather than trying to show them in a separate window. 

If all went well you should see a nice distribution plot for the Group 1 Location Quotient! 

OK, I want you to take a second here: although there was a lot of setup work that needed to be done, we just created a distribution plot in one line of code. One line. This is a more sophisticated plot than you could ever create in Excel and you just created it in one line of code. Try producing similar plots for some of the other groups. Do you understand more about the data and its distribution now?

## Saving a Plot

Saving a plot isn’t quite as easy as creating one... but wouldn’t it be a lot easier if we could save our plot automatically and not even have to touch a button to do so? This is where we need to use matplotlib syntax (and where you'll see why we opted not to spend too much time on it):
```python
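# Simple save
# Plotting library needed by seaborn
import matplotlib.pyplot as plt

# Create a figure (named after the series) for seaborn to draw into
fig = plt.figure(series.name)
sns.distplot(series)

# Write the figure to a PDF in the working directory, then tidy up
plt.savefig("{0}-Test.pdf".format(series.name), bbox_inches="tight")
plt.close()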

```
To explain what's happening here: 
1. We import a package that gives us access to all kinds of useful plotting functions.
2. We create a 'figure' object into which Seaborn can 'print' its graph.
3. We call Seaborn and ask it to print (it doesn't really need to know that it's printing to something, it just needs to know what 'device' it should use for output).
4. We save the figure to a PDF file named after the data series, and then close the figure.

 
The plot should have been saved to your working directory with the name of the data series that you were working with (so that you can try doing the same for other data series without overwriting your results!). If you think back a little bit to where we started 6 weeks ago, you’ll see just how far we’ve come, and just how far you can go now: the ability to automatically load, clean, process (e.g. standardise and filter), and print out data in a variety of forms is incredibly powerful. You could repeat an entire analysis simply by changing the input file (assuming the format doesn’t change too much). It’s the automation component that sits at the heart of geocomputational techniques -- you still need individuals and their judgement to figure out what to do and what to produce, but once you’ve done that you can make the computer do the boring stuff while you focus on the interpretation and the meaning of the results!

# Automation with Loops & Functions

When we're undertaking an analysis of a data set, we often have to perform the same (or at least similar) tasks for each weather station or socio-economic class or ethnic group. We *could* copy and paste the code, and then just change the variable names to update the analysis... but that would be a definite instance of what Larry Wall would have called 'false laziness': it seems like a time-saving device in the short run, but in the long run you've made your code less readily maintainable (what if you want to _add_ to your analysis or find a bug?) and less easy to understand.

There are nearly always two things that you should look at if you find yourself repeating the same code: 
1. write a for loop; 
2. consider writing a function.

Why these two strategies?

## Automating Summaries

Let’s start with a for loop to generate summary statistics and a distribution plot for several of our data columns – you’ve already typed all of this code more than once so all we need to do is take the right bits of it and put them in a for loop like this:
```python
for series in ...:
    print 'Summarising ' + series.name
    
    print series.describe()
    
    print "\tMean:    {0:> 7.2f}".format(series.mean())
    print "\tMedian:  {0:> 7.2f}".format(series.median())
```
 
All you need to do is figure out what goes into the for loop in the first place! You can get quite a lot from just looking at the code itself: where do we get a Series object from? Try typing this: `type(df.Group1)`.

## Automating Analysis

It is actually much, much harder to automate the analysis, but there are certain things that we can do to make life easier – transformation and standardisation are among the easiest things to do as part of automation!

Here’s some code to generate the additional columns for our analysis on the pattern that we created for Group 1:
 
Do you see how that works – we use the group names as keys to access the data in the data frames (using the df[colName] syntax), but we also use them as strings to create the new column names by doing: groupName + typeOfTransformation.
Make sure that you understand how this works before moving on to the next bit, which is harder.
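
The pattern described above can be sketched as follows. This is a reconstruction under assumptions rather than the original code: the small data frame is a hypothetical stand-in for the real data, and the suffixes simply follow the column names used earlier for Group 1 (`LessMean`, `Log`, `Pct`, `ZStd`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data frame
df = pd.DataFrame({
    'Total':  [1000.0, 2000.0, 1500.0],
    'Group1': [100.0, 400.0, 150.0],
    'Group2': [300.0, 500.0, 450.0],
})

for group in ['Group1', 'Group2']:
    # The group name is the key used to fetch the column...
    col = df[group]
    # ...and also the stem of each new column name
    df[group + 'LessMean'] = col - col.mean()
    df[group + 'Log'] = np.log(col)
    df[group + 'Pct'] = col / df.Total
    df[group + 'ZStd'] = (col - col.mean()) / col.std()

print(sorted(df.columns))
```

Adding another group to the analysis is then just a matter of adding its name to the list.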

## Improving our Automated Summaries

Seaborn is a handy plotting library because it makes a lot of difficult charts quite easy, but to help us make sense of the data it will be useful to add some additional information to our distribution plots: lines to show the location of the mean, median, and outlier thresholds.

To do this, we need to get at the library that Seaborn itself uses: matplotlib.

Find the code that you wrote to create and save a distribution plot (5.1.3), and update it so that it looks like this:
 
Remember that you can type `help(plt.vlines)` to discover what parameters that function takes.

Try changing the style of the median to a solid green line to make it easier to distinguish from the line marking the mean.

Why do you think I put an if condition on the outlier vertical lines? Would this always be appropriate? Why? Why not?

## Bonus (Part 1)

Finally, let’s combine all of the code from 5.3.3 and 5.1.3 into a function – this will allow you to tidy up your code a lot because instead of having lots of copies of the same code all over your script, you can just call the function instead and it will generate the outputs you need.

Since this is bonus material I won’t give you a lot of pointers, but recall that you will want to put all of the code for 5.4.1 inside a function definition:
```python
def generateSummary(series):
     … code here …
```
You could do the same thing for generating the transformed and standardised data columns for each of the original NS-SeC groups.

## Bonus (Part 2)

### Taking a Cut of the Data

For some types of plots we can be overwhelmed by the data – trying to show the distribution of Group 1 people as grouped by borough would be tricky.

So let’s take a ‘cut’ of the data by selecting only some boroughs for further analysis – we’re going to arbitrarily select K&C, Hackney and Barking because I know they’re quite different boroughs, but you could also use the data to make this selection, right? For instance, I could ask pandas to help me pick the three boroughs that are the furthest apart in terms of their Group 1 means…

Anyway, here’s the code to select a cut:
 
You’ll notice two unusual bits of code in there that need some explanation:
* To ‘select multiple’ we need to write `dataFrame.Series.isin([…])`; unfortunately we can’t just write `dataFrame.Series==[…]`.
* We also have this line: `sdf.Borough = sdf.Borough.cat.remove_unused_categories()`. It should be fairly self-explanatory, but it’s there because by default pandas doesn’t update the list of valid categories (i.e. boroughs) just because we filtered out boroughs that weren’t of interest. We therefore need to update the Series so that Seaborn doesn’t include a bunch of empty categories when we make our plots.
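
Both points can be seen in a small sketch. The data frame below is a made-up stand-in (the borough names and counts are illustrative only), but the two calls, `isin()` and `cat.remove_unused_categories()`, are the ones discussed above.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one row per area
df = pd.DataFrame({
    'Borough': pd.Categorical(
        ['Hackney', 'Camden', 'Barking and Dagenham', 'Hackney', 'Bexley']
    ),
    'Group1': [120, 300, 80, 150, 200],
})

# Keep only the rows whose borough is in our list
keep = ['Hackney', 'Barking and Dagenham']
sdf = df[df.Borough.isin(keep)].copy()

# The filtered-out boroughs still linger as (empty) categories...
n_before = len(sdf.Borough.cat.categories)
print(n_before)

# ...until we explicitly drop them
sdf['Borough'] = sdf.Borough.cat.remove_unused_categories()
print(list(sdf.Borough.cat.categories))
```

Before the clean-up the Series still 'knows about' four boroughs even though only two have any rows; afterwards only those two remain.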

## Other Types of Plots

Let’s have a look at some other types of plots…

## 3D Plots
Finally, I wanted to show you how to create a 3D scatter plot – in this case the plot doesn’t add a lot to our understanding of the data, but there are cases where it might and it does illustrate how pandas, seaborn, and matplotlib work together to produce some pretty incredible outputs.
Here’s what you will get:
 
Here’s the code:
  
I would encourage you to look into the options in more detail:
* Can you change the colour map used to indicate which Borough each LSOA is drawn from?
* Can you change the icons used to mark each Borough so that they are different?
* Can you add a legend to indicate which marker is for which Borough?

# Additional Resources

* Geocomputation
