# Adding a Series

*Remember*: DataFrames are composed of one or more Series (columns) that are indexed to enable finding and sorting. 

Ordinarily, the data constituting the Series are read directly from a file and the index is automatically set to the first available 'index-like' column in the file. But you are not bound by what pandas thinks is the 'right' index: you can set any column as an index, or even create one of your own!

For instance, let's say that you wanted a df containing only latitudes for British cities. You could start by creating a new Series like this, and then assign it to a df:
```python
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620], 
    index = ['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)
```
In this case, the index is a list of cities and it would, generally, be quite quick to look up the latitude of any of the cities listed. You are never limited to _only_ looking up values by index, but index lookups are usually faster. There are similarities here to how databases work, and where indexes become _really_ useful is when your index is a compound of several other columns: e.g. `<city>+<some other identifier>`.
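To see index lookup in action, here is a minimal sketch that reuses the Series above. The column name `Latitude` and the variable `cities` are my own choices for illustration; nothing in pandas requires them.

```python
import pandas as pd

# The same Series as above (values copied from the example)
myLatitudes = pd.Series(
    [7063197, 6708480, 6703134, 7538620],
    index=['Liverpool', 'Bristol', 'Reading', 'Glasgow']
)

# Look up a single value using its index label
bristol = myLatitudes['Bristol']
print(bristol)

# Wrap the Series in a one-column DataFrame: the Series index
# (the city names) becomes the index of the DataFrame
cities = pd.DataFrame({'Latitude': myLatitudes})
print(cities.loc['Glasgow', 'Latitude'])
```

Notice that the city names carry over as the row labels of the new data frame, so the same label-based lookup keeps working.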

However, to add a new series to an existing data frame we need to change things around a little bit, and this introduces the _second_ way of interacting with Series in a DataFrame (we saw the first way last week). In this case the format is much more explicitly like using a dictionary:
```python
df['NewSeriesName'] = pd.Series(...Series definition...)
``` 
See how familiar that syntax is? `df['NewSeriesName']` is _exactly_ like assigning a key/value pair to a dictionary! The only difference here is that the value is a Series object, and not a simple variable (String, int, float) or more complex (but still simple) variable such as a list or dictionary.
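
If you want to convince yourself of the similarity, here is a small sketch that puts the two side by side. The data frame and its values are made up purely for illustration.

```python
import pandas as pd

# A tiny, made-up data frame (illustrative values only)
df = pd.DataFrame({'Total': [100, 200, 400]})

# Dictionary-style assignment adds a new column to the data frame
df['Half'] = pd.Series([50, 100, 200])

# Exactly the same syntax adds a new key to an ordinary dictionary
d = {}
d['Half'] = [50, 100, 200]

print(df.columns.tolist())
print(list(d.keys()))
```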

# Transformation

## Removing the mean

Let’s start with a really simple new series: remember I said that just about the simplest possible transformation is x-μ? Let’s do that! In pandas we can write this transformation as:
```python
df['Group1LessMean'] = pd.Series(df.Group1-df.Group1.mean())
```
We can break this apart as:
* df.Group1 – this is the _entire_ data series for Group1
* df.Group1.mean() – this calculates the mean (μ) of Group1
* We perform this calculation inside the pd.Series (pd is our alias for pandas) to create a new data series that can be assigned to a column in a data frame.
* And then we assign this series to a new column called `Group1LessMean`.

Pandas is smart enough to know that it needs to take _each_ value of `df.Group1` and subtract from that the Group 1 mean. So even though it looks like we're performing a single calculation, we're actually performing as many calculations as there are rows in the data frame!

*Remember*: subtracting the mean is a linear transformation (unlike the log-transform).
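
Here is a minimal sketch of that row-by-row behaviour on four made-up values (not the real Group1 data). Note that the subtraction already returns a Series, so the `pd.Series(...)` wrapper is optional and I have left it out here.

```python
import pandas as pd

# Four made-up values standing in for a real column
df = pd.DataFrame({'Group1': [10.0, 20.0, 30.0, 40.0]})

# One expression, but the mean (25.0) is subtracted from every row
df['Group1LessMean'] = df.Group1 - df.Group1.mean()
print(df)

# The centre of the data has moved to zero, but the spread has not changed
print(df.Group1LessMean.mean())
print(df.Group1.std(), df.Group1LessMean.std())
```

The values have shifted, but the distances between them are untouched, which is what you would expect from a linear transformation.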

## Looking at the Effect of a Transformation

Now let’s see what effect this transformation had on our data series by comparing Group1 and Group1LessMean. We _could_ hard-code all of this, but it will be easier in the long run to do it in a way that is _parameterised_ (meaning that we can quickly change it from one series to another):

```python
import pandas as pd

# Column names: six descriptive columns, then Group1 to Group7, then NC
colnames = ['CDU','GeoCode','GeoLabel','GeoType','GeoType2','Total']
for i in range(1,8):
    colnames.append('Group' + str(i))
colnames.append('NC')
df = pd.read_csv('./Data/Data_NSSHRP_UNIT_URESPOP.csv', header=0, skiprows=[1], names=colnames)

# Print summary for group
series = df.Group1
print("Summarising " + series.name + "...")

print(series.describe())

print("\n")

# Or we can do pretty numbers!
print("\tMedian:  {0:> 7.2f}".format(series.median()))
print("\tLQ:      {0:> 7.2f}".format(series.quantile(0.25)))
print("\tUQ:      {0:> 7.2f}".format(series.quantile(0.75)))
print("\tRange:   {0:> 7.2f}".format(series.max()-series.min()))
```

```
Summarising Group1...
count    8480.000000
mean      744.774292
std       255.340839
min       108.000000
25%       564.000000
50%       729.000000
75%       911.000000
max      2020.000000
Name: Group1, dtype: float64


	Median:      729
	LQ:          564
	UQ:          911
	Range:    1912.0000
```


In the next coding cell, why don't you print out the same for `Group1LessMean`? It should require you to change exactly _one_ line of code.

If you compare the results how important are the changes?

Notice also the `str.format()` command I’ve used here: `{0:> 7.2f}`. In order to understand how this works for formatting the results in a nice, systematic way you will need to read [the documentation](http://www.python.org/dev/peps/pep-3101/).

The ‘pep’ tells you that:
* `{0}` would do what I’ve been doing in earlier practicals – just grab the variable and stick it into the string at this point, but `{0:...}` tells Python that we want to give it some information about how to format the string.
* `>` tells Python that the value should be right-aligned.
* The space (' ') after the `>` is the sign option: it tells Python to leave a blank in front of positive numbers so that they line up with negative ones. Any 'fill' is done with whitespace by default; to fill with something else (a 0 – zero – or any other character!) you put that character _before_ the `>`, as in `{0:0>7.2f}`.
* `7.2f` tells Python to treat anything it gets as a float (even if the variable is an int) and to format it with 2 digits after the decimal point and a minimum width of 7 characters in all, counting the decimal point and any sign (which ties us back to the right-alignment up above). If the number needs more than 7 characters then Python will still print all of them; if it needs fewer then the fill makes up the difference.

Here are some suggestions to better understand what’s going on:
* Try changing the > to a <
* Try changing the .2 to a .0
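
If you would rather experiment outside the data frame first, here is a small sketch using a fixed number. The square brackets are only there so that you can see exactly where the padding falls, and the variable names are my own.

```python
value = 729.0

# Alignment inside a minimum width of 9 characters
right = "[{0:>9.2f}]".format(value)
left = "[{0:<9.2f}]".format(value)

# A fill character goes BEFORE the alignment symbol
starred = "[{0:*>9.2f}]".format(value)

# A space AFTER the alignment symbol reserves room for a sign
positive = "[{0:> 7.2f}]".format(value)
negative = "[{0:> 7.2f}]".format(-value)

# The width is a minimum: longer numbers are never cut short
wide = "[{0:>7.0f}]".format(1234567890.0)

for text in [right, left, starred, positive, negative, wide]:
    print(text)
```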

```python
print("\tMedian:  {0:<9.0f}".format(series.median()))
print("\tLQ:      {0:8.2f}".format(series.quantile(0.25)))
print("\tUQ:      {0:>7.4f}".format(series.quantile(0.75)))
print("\tRange:   {0:<6.6f}".format(series.max()-series.min()))
```

```
	Median:  729      
	LQ:        564.00
	UQ:      911.0000
	Range:   1912.000000
```


## Another Transformation

Let’s do one more simple transformation: taking the natural logarithm of the Group1 values. If you don’t remember what a logarithm is try these:
* https://www.youtube.com/watch?v=zzu2POfYv0Y 
* https://www.youtube.com/watch?v=akXXXx2ahW0 
* https://www.youtube.com/watch?v=0fKBhvDjuy0 (made by Ray & Charles Eames, two of the 20th Century’s most famous designers).

Note that logarithms are non-linear transformations – can you think why this is?

To create a new series in the data frame containing the natural log of the original value, the process is similar to what we've done before. Since pandas doesn't provide a log-transform method of its own (i.e. you can’t call `df.Group1.log()`, and there is no real reason why you should be able to), we need to use the `numpy` package again:
```python
import numpy as np
df['Group1Log'] = pd.Series(np.log(df.Group1))
```
Try printing out the same summary measures as above in the coding area below. Is it clearer to you now why a log-transform is a non-linear transformation?
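
One way to see the difference between the two transformations is to look at the gaps between neighbouring values before and after. This is a minimal sketch on four made-up values, not the real Group1 data.

```python
import numpy as np
import pandas as pd

# Made-up values that grow by a factor of 10 each time
s = pd.Series([1.0, 10.0, 100.0, 1000.0])

shifted = s - s.mean()   # linear: subtract the mean
logged = np.log(s)       # non-linear: natural logarithm

# diff() gives the gap between each value and the one before it
print(s.diff().tolist())        # original gaps: 9, 90, 900
print(shifted.diff().tolist())  # unchanged by subtracting the mean
print(logged.diff().tolist())   # every gap is now the same size
```

Subtracting the mean leaves the gaps alone, whereas the log squeezes the large gaps far more than the small ones.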

# Standardisation

## Proportional Standardisation

Clearly, a proportion such as a percentage is one way of standardising data since, unless you're measuring change, it limits the range to between 0% and 100% (in decimal: 0.0 and 1.0). So we can calculate the proportion of people in each area who come from Group 1 using:
```python
df['Group1Pct'] = pd.Series(df.Group1/df.Total)
```
If you remember how we accessed columns before (`df.Group1`) then this is obviously a different way to do this. Here, we are _creating_ a new series so the syntax has to look a little different and, if you think back, you'll recognise that this is that dictionary style of key/value pairs again! 

There are some other cases where we'll need to use this style of DataFrame access, but we'll get to those later in the notebook and sessions.

## Z-Score Standardisation

The z-score (as seen in lecture today) is a really common type of standardisation, meaning that it enables us to compare data across different groups of observations. As a reminder, the z-score looks like this:

z=(x-μ)/σ

Let’s figure out how to translate this into a new column called `Group1ZStd`!

We’ve already done the first part of this calculation up above in `Group1LessMean`. That series value is the same as x-μ, so all we need to do is divide by the standard deviation.

So one way to do this is:
```python
df['Group1ZStd'] = pd.Series(df.Group1LessMean / df.Group1.std())
```
That works exactly the same way as what we did when we subtracted the mean in the first place: we’re just taking the results from the previous equation and passing them on to this one. 

We could also do it all in one go as:
```python
# The line break sits inside the brackets so that Python reads this as one statement
df['Group1ZStd'] = pd.Series(
    (df.Group1 - df.Group1.mean()) / df.Group1.std()
)
```
Do you see how we can begin to build increasingly complicated equations into the process of creating a new data series?
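
A quick sanity check on any z-score is that the new series should have a mean of 0 and a standard deviation of 1. Here is a minimal sketch on made-up values; note that pandas' `std()` uses the sample standard deviation (dividing by n-1) by default.

```python
import pandas as pd

# Eight made-up values standing in for a real column
df = pd.DataFrame({'Group1': [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# z = (x - mean) / standard deviation, applied to every row at once
df['Group1ZStd'] = (df.Group1 - df.Group1.mean()) / df.Group1.std()

z_mean = df.Group1ZStd.mean()
z_std = df.Group1ZStd.std()
print("Mean: {0:.4f}".format(abs(z_mean)))
print("Std:  {0:.4f}".format(z_std))
```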

### Using the Z-Score

In the previous notebook, there was a long digression about 'data generating processes'; here, we can start to bring this to life. The first thing to do with the z-score is to look at what it implies:

1. Subtracting the mean implies that the mean _is a useful measure of centrality_: in other words, the only reason to subtract the mean is if the mean is _meaningful_. If you are dealing with highly skewed data then the mean is not going to be very useful and, by implication, neither is the z-score.
2. Dividing by the Standard Deviation implies that this _is a useful measure of distribution_: again, if the data is skewed or follows some exponential distribution then the standard deviation is going to be near-useless as an analytical tool.

So the z-score is _most_ relevant when we are dealing with something that looks vaguely like a normal distribution (the z-scored version of which has mean=0 and standard deviation=1). In those cases, any observation more than 1.96 standard deviations from the mean (i.e. with a z-score above 1.96 or below -1.96) falls in the most extreme 5% of the distribution: the 5% significance zone. 

But note: we can't really say _why_ one particular area has a high concentration of employment or why one individual is over 2m tall. All we are saying is that this standardised value is a pretty unlikely outcome _when compared to our expectation that employment is randomly distributed across the region_. We _know_ that employment isn't randomly distributed in the same way that we know that height isn't genuinely random because of the influence of genetics, nutrition, etc. But we need a way to pick out what counts as _**significant**_ over- or under-concentration (or height) from the wide range of 'a bit more' or 'a bit less' than 'normal'.

And here we get to the crux of the issue: most frequentist statistics boils down to subtracting **what you expected** from **what you got**, and then dividing by **some measure of spread** to control for the range of the data. We then look at what's _left_ to see if we think the gap between expectations and observations was _meaningful_ or whether it falls within the range of 'noise'.

It should be obvious that it's the _**expected**_ part of that equation that is absolutely crucial: we come up with a process that generates data that _resembles_ the important dimensions of reality without thinking that the process has explained them. It's the first step, not the last.

## Location Quotient Standardisation (LQ)

This metric is commonly used in economic geography to measure the concentration of some activity (e.g. an industry) in a particular area relative to a wider region. We use proportions to control for the fact that small areas inside a larger region can be of very different sizes; so, to be able to compare them, we normally want to move away from using absolute numbers.

For example:
* If we have:
  * Zone 1: 400 employees; 200 in steel
  * Zone 2: 200 employees; 150 in steel

Q: Which zone has the greatest concentration of steelworkers?

A: Depends what you ask: yes, there are _more_ steelworkers in Zone 1 overall, but their _density_ in Zone 2 is higher... The z-score would give us a higher value for the 150/200 than the 200/400.

What the LQ does that is so clever is that it standardises by the density of employment in the region. This is the same density calculation as used above, but now it is calculated for the region as a whole.

The LQ is made up of two proportions: the proportion of Group 1 people in any LSOA divided by the proportion of Group 1 people in all of London.

So you’ve already got the numerator: it is the `Group1Pct` series that we calculated under Proportional Standardisation.

Now we just need to figure out the denominator: it will be the total number of Group 1 people in the data frame divided by the total number of all people in the data frame (which is given in the ‘Total’ column). Here’s what you need to add to your series calculation:
```python
(float(df.Group1.sum())/df.Total.sum())
```
We need to add a `float()` because, in Python 2, dividing one integer by another only gives us whole numbers, and the `sum()` calculation can give us an integer.

We can always check our results for any arbitrary row by changing the value of check in the code block below:

Why don’t you see if you can work out the Location Quotient on your own?

It can be really useful to manually double-check the results of a calculation – if you were to get the LQ wrong here then all of the analysis that you do later would be wrong too. And while the replicability of code helps (i.e. you can fix the mistake and then re-run the entire analysis very quickly), you should never assume that you got it exactly right until you’ve double-checked.
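
To make the idea concrete without giving away the Group 1 answer, here is a minimal sketch using the two made-up zones from the steel example above. It assumes that the two zones together make up the whole region, and the names `zones`, `SteelPct`, `SteelLq` and `regional_share` are purely illustrative.

```python
import pandas as pd

# The two zones from the steel example, treated here as the whole region
zones = pd.DataFrame(
    {'Total': [400, 200], 'Steel': [200, 150]},
    index=['Zone 1', 'Zone 2']
)

# Numerator: share of steelworkers within each zone
zones['SteelPct'] = zones.Steel / zones.Total

# Denominator: share of steelworkers across the region as a whole
regional_share = float(zones.Steel.sum()) / zones.Total.sum()

# Location Quotient: above 1 means more concentrated than the region overall
zones['SteelLq'] = zones.SteelPct / regional_share
print(zones)
```

It is worth working out one of these by hand: Zone 2 is 0.75 divided by 350/600, which is the kind of manual double-check recommended above.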

# Exploratory Data Analysis & VDQI

## Using Seaborn

As I said above, we are going to be using the seaborn package (http://stanford.edu/~mwaskom/software/seaborn/) as the principal data visualisation library for the remainder of the term because it provides a lot of quite complex functionality (and very pretty pictures) at quite low cost. As with other libraries that we’ve used, we’ll import using an alias:
```python
import seaborn as sb
```
In a number of the examples you’ll find online they use a different alias, but I thought sb was an easier mnemonic.

However, as you’ll see below in some cases we need to make use of other functions that are provided by libraries used by seaborn itself. The base library for most Python plotting is matplotlib which was designed to mimic the plotting functionality provided by MATLAB.

There are other options available to you: Bokeh (http://bokeh.pydata.org/en/latest/) looks quite interesting and ŷhat provides a library that mimics R’s ggplot2 (http://ggplot.yhathq.com) – the latter is fast becoming the default implementation ‘style’ for plotting, but does not currently support mapping (which R does in ggplot2).

### Making a distribution plot

One of the most useful ways to get a sense of the data is simply to look at its overall distribution. Let’s try this:
```python
series = df.Group1Lq
sb.distplot(series)
sb.plt.show() # Necessary on GeoCUP, not Mac
```
You may need to minimise the Canopy window in order to see what has just happened… but if all went well you should see a nice distribution plot for the Group 1 Location Quotient! 

OK, I want you to take a second here: although there was a lot of setup work that needed to be done, we just created a distribution plot in 1 line of code. 1 line. This is a more sophisticated plot than you could ever create in Excel and you just created it in one line of code.

Try producing similar plots for Group 4, Group 8, and the Not Classified group. Do you understand more about the NS-SeC data and its distribution now?

### Saving a Plot

Unfortunately, saving a plot isn’t quite as easy as creating one… 

It’s worth remembering that you can always click on the save button on the figure pop-out (below, next to the check-mark on my version of Canopy) after resizing the chart to the layout that you like.

But wouldn’t it be a lot easier if we could save our plot automatically and not have to even touch a button to do so? Try this:
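
As a rough guide, here is one way that saving automatically could look. This is only a sketch under my own assumptions: it uses a plain matplotlib histogram in place of the seaborn plot, made-up values in place of the real `Group1Lq` series, and a file name that I have invented for illustration.

```python
import os

import matplotlib
matplotlib.use('Agg')  # draw off-screen, so no pop-out window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the real Location Quotient series
series = pd.Series([0.8, 0.9, 1.0, 1.0, 1.1, 1.3, 1.7], name='Group1Lq')

fig, ax = plt.subplots()
ax.hist(series, bins=5)
ax.set_title(series.name)

# Hypothetical file name built from the series name; a relative
# path like this one is saved into the current working directory
filename = series.name + '-distribution.png'
fig.savefig(filename)
plt.close(fig)

print(os.path.abspath(filename))
```

Building the file name from `series.name` means that the same lines keep working when you swap in a different series.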
 
The plot should have been saved to your working directory – the one listed in Canopy between the Python ‘Starter’ file and the interpreter where you can type your expressions. If you wanted to save it somewhere else then you’d have to give it a full path (this is why we spend so much time in the Terminal – you will need to learn how to use paths quite naturally).

If you think back a little bit to where we started 6 weeks ago, you’ll see just how far we’ve come, and just how far you can go now: the ability to automatically load, clean, process (e.g. standardise and filter), and print out data in a variety of forms is incredibly powerful. And you could repeat this entire analysis with the 2001 or 2011 Census data simply by changing the input file (assuming the format doesn’t change too much). It’s the automation component that sits at the heart of geocomputational techniques – you still need individuals and their judgement to figure out what to do and what to produce, but once you’ve done that you can make the computer do the boring stuff while you focus on the interpretation and the meaning of the results!

# Automation

## Automating Summaries

Having said that, let’s see about simplifying what we’ve done so far so that we don’t have to copy-paste the code repeatedly in order to produce charts for each of the columns in the data frame. 

Recall that I said two things about what you should do if you find yourself duplicating the same code over and over again: 1) write a for loop; 2) consider writing a function.

Let’s start with a for loop to generate summary statistics and a distribution plot for Group1, Group1Pct, Group1Lq and Group1ZStd – you’ve already typed all of this code more than once so all we need to do is take the right bits of it and put them in a for loop like this:
 
All you need to do is figure out what goes into the for loop in the first place! You can get quite a lot from just looking at the code itself: where do we get a Series object from? Try typing this: `type(df.Group1)`.

## Automating Analysis

It is actually much, much harder to automate the analysis, but there are certain things that we can do to make life easier – transformation and standardisation are among the easiest things to do as part of automation!

Here’s some code to generate the additional columns for our analysis on the pattern that we created for Group 1:
 
Do you see how that works – we use the group names as keys to access the data in the data frames (using the `df[colName]` syntax), but we also use them as strings to create the new column names by doing: `groupName + typeOfTransformation`.

Make sure that you understand how this works before moving on to the next bit, which is harder.
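
If the naming pattern is hard to picture, here is a minimal sketch of the idea on a tiny made-up data frame. It is not the code from the practical: the values are invented, and I have only used the suffixes that appear earlier in this notebook (`LessMean`, `Log`, `Pct`, `ZStd`).

```python
import numpy as np
import pandas as pd

# A tiny, made-up stand-in for the real data frame
df = pd.DataFrame({
    'Total':  [100.0, 200.0, 400.0],
    'Group1': [10.0, 50.0, 40.0],
    'Group2': [20.0, 30.0, 160.0],
})

for groupName in ['Group1', 'Group2']:
    col = df[groupName]  # the name used as a key...
    # ...and as a string for building the new column names
    df[groupName + 'LessMean'] = col - col.mean()
    df[groupName + 'Log'] = np.log(col)
    df[groupName + 'Pct'] = col / df.Total
    df[groupName + 'ZStd'] = (col - col.mean()) / col.std()

print(df.columns.tolist())
```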

## Improving our Automated Summaries

Seaborn is a handy imaging library because it makes a lot of difficult charts quite easy, but to help us make sense of the data it will be useful to add some additional information to our distribution plots: lines to show the location of the mean, median, and outlier thresholds.

To do this, we need to get at the library that Seaborn itself uses: matplotlib.

Find the code that you wrote to create and save a distribution plot (see Saving a Plot), and update it so that it looks like this:
 
Remember that you can type `help(plt.vlines)` to discover what parameters that function takes. 

Try changing the style of the median to a solid green line to make it easier to distinguish from the line marking the mean.

Why do you think I put an if condition on the outlier vertical lines? Would this always be appropriate? Why? Why not?
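
For reference, here is a rough sketch of how vertical marker lines can be added with matplotlib. It rests on several assumptions of my own: a plain histogram stands in for the seaborn plot, the values are made up, the outlier thresholds use the common 1.5 × IQR rule, and the `if` conditions shown are just one plausible choice (only draw a threshold that falls inside the range of the data).

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen, so no pop-out window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values with one unusually large observation
series = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 12.0],
                   name='Example')

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(series, bins=10)
ymax = counts.max()  # make the lines as tall as the tallest bar

# Mark the mean and the median
ax.vlines(series.mean(), 0, ymax, colors='red', linestyles='dashed', label='Mean')
ax.vlines(series.median(), 0, ymax, colors='blue', linestyles='dotted', label='Median')

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles
lq, uq = series.quantile(0.25), series.quantile(0.75)
iqr = uq - lq
lower, upper = lq - 1.5 * iqr, uq + 1.5 * iqr

drawn = []
if lower > series.min():  # skip a threshold that lies outside the data
    ax.vlines(lower, 0, ymax, colors='grey', label='Lower threshold')
    drawn.append('lower')
if upper < series.max():
    ax.vlines(upper, 0, ymax, colors='grey', label='Upper threshold')
    drawn.append('upper')

ax.legend()
plt.close(fig)
print(drawn)
```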

## Bonus (Part 1)

Finally, let’s combine all of the code from Improving our Automated Summaries and Saving a Plot into a function – this will allow you to tidy up your code a lot because instead of having lots of copies of the same code all over your script, you can just call the function instead and it will generate the outputs you need.

Since this is bonus material I won’t give you a lot of pointers, but recall that you will want to put all of the code for this section inside a function definition:
```python
def generateSummary(series):
    # … code here …
    pass
```
You could do the same thing for generating the transformed and standardised data columns for each of the original NS-SeC groups.

## Bonus (Part 2)

### Taking a Cut of the Data

For some types of plots we can be overwhelmed by the data – trying to show the distribution of Group 1 people as grouped by borough would be tricky.

So let’s take a ‘cut’ of the data by selecting only some boroughs for further analysis – we’re going to arbitrarily select K&C, Hackney and Barking because I know they’re quite different boroughs, but you could also use the data to make this selection, right? For instance, I could ask pandas to help me pick the three boroughs that are the furthest apart in terms of their Group 1 means…

Anyway, here’s the code to select a cut:
 
You’ll notice two unusual bits of code in there that need some explanation:
* To ‘select multiple’ we need to write: `dataFrame.Series.isin([…])`, so we can’t just write `dataFrame.Series==[…]` unfortunately.
* We also have this line: `sdf.Borough = sdf.Borough.cat.remove_unused_categories()`. This should be fairly self-explanatory, but it’s there because by default pandas doesn’t update the list of valid categories (i.e. boroughs) just because we filtered out boroughs that weren’t of interest. We therefore need to update the Series so that Seaborn doesn’t include a bunch of empty categories when we make our plots.
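
To see why that second step matters, here is a minimal sketch on a made-up data frame. The borough labels and values are invented for illustration, and I have added a `.copy()` so that pandas treats the cut as a data frame in its own right.

```python
import pandas as pd

# A made-up data frame with a categorical Borough column
df = pd.DataFrame({
    'Borough': pd.Categorical(['Hackney', 'Camden', 'Barking', 'Camden', 'Hackney']),
    'Group1': [10, 20, 30, 40, 50],
})

# 'Select multiple' with isin(), then copy so the cut stands on its own
sdf = df[df.Borough.isin(['Hackney', 'Barking'])].copy()

# The filtered-out borough is still listed as a valid category...
before = sdf.Borough.cat.categories.tolist()
print(before)

# ...until we explicitly drop the categories that are no longer used
sdf['Borough'] = sdf.Borough.cat.remove_unused_categories()
after = sdf.Borough.cat.categories.tolist()
print(after)
```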

## Other Types of Plots

Let’s have a look at some other types of plots…

## 3D Plots

Finally, I wanted to show you how to create a 3D scatter plot – in this case the plot doesn’t add a lot to our understanding of the data, but there are cases where it might and it does illustrate how pandas, seaborn, and matplotlib work together to produce some pretty incredible outputs.

Here’s what you will get:
 
Here’s the code:
  
I would encourage you to look into the options in more detail:
* Can you change the colour map used to indicate which Borough each LSOA is drawn from?
* Can you change the icons used to mark each Borough so that they are different?
* Can you add a legend to indicate which marker is for which Borough?

# Additional Resources

* Geocomputation
