# Making Sense of Data

Last week we did some basic EDA (Exploratory Data Analysis) with the MetOffice weather data, but it was limited to mainly looking at a few graphs right at the end of the practical. Thanks to that, depending on the data you got you may have spotted a 'problem' or two (_e.g._ temperature readings at the start or end of the time series that were basically 0).

This week we want to tackle this in a more systematic way. We are going to switch data sets because the Socio-Economic data is easier to manipulate than the weather data and it has a few features that are particularly useful for demonstrating the value of transformation and standardisation.

## Statistics as Judgement

We hope that you'll go on to make lots of use of what you're learning here, but the single most important thing that you can take away from the remaining 6 weeks of class is the idea that statistics is not _truth_, it is _judgement_. And the first step towards making sense of your data so that you can form a judgement of its utility is to make a graph of your data...

[![Are you above average?](http://img.youtube.com/vi/hQLCWHww9OQ/0.jpg)](http://www.youtube.com/watch?v=hQLCWHww9OQ)

### Data as Representation of Reality

The statistician George Box [once said](https://en.wikipedia.org/wiki/All_models_are_wrong) "all models are wrong but some are useful". Now you might think that 'wrong' _necessarily implies_ uselessness, but this aphorism should tell you that things are a lot more interesting than that: let's review the idea of statistics as the study of a '[data-generating process](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9639.2012.00524.x/full)'. 

The data that we work with is a _representation_ of reality, it is not reality itself. Just because I can say that the height of human beings is normally distributed with a mean of, say, 170cm and standard deviation of 8cm doesn't mean that I've _explained_ that process. What I have done is to say that reality can be reasonably well approximated by a data-generating process that uses the normal distribution. 

Given that understanding of height distributions, I know that someone with a height of 2m is very, very improbable. Not impossible, just highly unlikely. So if I meet someone that tall then that's quite exciting! But my use of a normal distribution to represent the empirical reality of human height doesn't mean that I think it is _actually_ distributed randomly amongst all human beings in a gigantic lottery system: some parts of the world are typically shorter, other parts typically taller; some families are typically shorter, while others are typically taller...

So the _real_ reason for someone's height is to do with, I assume, a mix of genetics and nutrition. However, across large groups of people it's possible to _represent_ the cumulative impact of those individual realities with a simple normal distribution. And using that simplified data-generating process allows me to do things like estimate the likelihood of meeting someone more than 2m tall (which is why I'd be excited to do so, though not as excited as the guy in the next video...).
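To put a rough number on 'highly unlikely', here is a minimal sketch that treats the figures quoted above (mean 170cm, standard deviation 8cm) as the parameters of a normal data-generating process. Those parameters are just the illustrative values from the text, not measured ones, and the sketch assumes `scipy` is installed.

```python
from scipy.stats import norm

# Assumed parameters: the illustrative values used in the text
mean_height = 170.0  # cm
sd_height = 8.0      # cm

# How many standard deviations above the mean is 2m (200cm)?
z = (200.0 - mean_height) / sd_height

# Survival function: the probability of drawing a value *greater* than 200cm
p_taller = norm.sf(200.0, loc=mean_height, scale=sd_height)

print("2m is {0:.2f} standard deviations above the mean".format(z))
print("P(height > 2m) under this model: {0:.6f}".format(p_taller))
```

The answer is only as good as the assumed process: change the mean or standard deviation and the probability changes with it.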

Here's a (genuinely terrifying) video that tries to explain this whole idea in a different way:

[![From reality to make-believe](http://img.youtube.com/vi/HAfI0g_S9oo/0.jpg)](https://www.youtube.com/watch?v=HAfI0g_S9oo)

In the same way, real individuals earn different incomes for all sorts of reasons: skills, education, negotiation ability... and, of course, systematic discrimination or bias. Because of wide variations in individual lived experience, it's quite hard to _prove_ that any _one_ person has been discriminated against unless you have the 'smoking gun' of an email or other direct evidence that this has happened. 

But if I have data on _many_ men and women (from a company, industry, or society) to work with, then I can take a look at what data-generating processes best describe what I've observed. And I can also create a data-generating process that would describe what I'd _expect_ to see if no systematic discrimination were taking place at all. I make that sound simple but, of course, it's really hard to do this properly: do you account for the fact that discrimination has probably happened throughout someone's lifetime, not just in their most recent job? And so on.

### What is _Significant_?

But, when we have created a data-generating process that captures this at what we feel is an appropriate level, then the analytical process becomes about testing to see if there is a _significant_ difference between what we expected and what we observed. Once we've done that, then we can start to rule out claims that 'there are no good candidates' and the other defences of the indefensible. It is always theoretically _possible_ that a company had trouble finding qualified candidates, but as you put together the evidence from your model it may well become increasingly _improbable_.

So, always remember that, while the data is not reality, it _is_ a very useful abstraction of reality that allows us to make certain claims about what we expected to see. Linking our observations to what we know about the characteristics of the data generating process then allows us to look at the _significance_ of our results or to search for outliers in the data that seem highly improbable and, consequently, worth further investigation.

----

# Transformation

Now let's look at _why_ we just talked some more about the data series. It is _very_ rare that you want to overwrite the raw data columns that you loaded from a file or read from an API. Why? Because I can pretty much guarantee you that at some point you'll wonder if you made a mistake and will need to check your results. Or someone _else_ will wonder if you made a mistake and want to check your results. Or you'll decide that your first approach wasn't the right one and will want to recalculate a derived variable... unless you've overwritten the original! So you will _usually_ want to write a transformed set of data values into a new column (last week it was `df3.Time` to hold the transformed data from `df3.ts`).

Anyway, transformations are useful when a data series has features that make comparisons or analysis difficult, or that affect our ability to intuit meaningful difference. By manipulating the data using one or more mathematical operations we can sometimes make it more *tractable* for subsequent analysis.

[![How tall is tall?](http://img.youtube.com/vi/-VjcCr4uM6w/0.jpg)](http://www.youtube.com/watch?v=-VjcCr4uM6w)

Here's an example: let's say that we want to understand how student heights are distributed within a class. If all you have to go on is a list of raw heights (160cm, 158cm, 150cm, 185cm, 172cm, 175cm, 166cm... and, of course, Mr. Bill Gates himself) then it's not at all easy to know how big a range you've got on your hands and where people fall relative to the mean.

Let's try writing this out as Python code:
```python
# Create an empty data frame to 
# hold our height data
df2 = pd.DataFrame() 
```
Here we've created an empty data frame – as yet it contains no data.
```python
# Create and add a series
df2['Height'] = pd.Series(
    [160, 158, 150, 185, 172, 175, 166, 168],
    index = ['Judy','Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan', 'Bill G.']
)
```
Then we create a new data series: the data for the series is the heights, the index is the name of the student. We assign this new data series to 'Height' in the data frame.
```python
# Look at the results
df2.describe()
```
And, as always, a good next step is to check that you got what you expected.

In [None]:
df2 = pd.DataFrame() 

# Create and add a series
df2['Height'] = pd.Series(
    [160, 158, 150, 185, 172, 175, 166, 168],
    index = ['Judy','Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan', 'Bill G.']
)

# Look at the results
df2.describe()

Now let's add their wealth...

In [None]:
df2['Wealth'] = pd.Series([28300, 21258, 37234, 32748, 18536, 75093, 124382, 5124398742348], index=df2.index)

# Check the results
df2.describe()

## Subtracting the Mean

An obvious first step to understanding this student data would be to use the mean ($\mu$) from the data since that tells us the _average_ height and wealth of all students in the class. But wouldn't it be even _more_ interesting to be able to look at how tall students are _relative_ to the mean? Is there someone from the basketball team taking this course? Or maybe the cox from a crew? How could we make it easy to compare the difference between each student and the overall class average in order to spot these 'special cases'?

In many cases, the best way to make this comparison is to _subtract the mean_. Why is that? What does it achieve?

Let's think it out:
1. If a student is shorter _than average_ then their transformed height is less than 0
2. If a student is taller _than average_ then their transformed height is more than 0
3. The distance from 0 (e.g. -20 vs -3) gives us _some_ sense of how short or how tall someone is relative to that mean

In a mathematical form we'd write this transformation as:
$$
x - \mu
$$

In pandas we can express this transformation as:
```python
df['<new column name>'] = df.<column> - df.<column>.mean()
df.describe()
```

We can break this apart as:

* `df.<column>` – this is the _entire_ data series
* `df.<column>.mean()` – this calculates the mean ($\mu$) of the data series
* We perform this calculation and then use the results to create a new data series
* We assign this series to a new column in the data frame called `<new column name>`.

Pandas is smart enough to know that it needs to take _each_ student height and then subtract the mean height of all students from that value. So even though it looks like we're performing a single calculation, we're actually performing as many calculations as there are rows in the data frame but without needing to write any tricky code!
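Here is a small sketch of that row-by-row behaviour. The exam marks and student names are hypothetical (they are not part of the practical's data) so that you can still try the heights and wealth yourself.

```python
import pandas as pd

# Hypothetical exam marks, invented for this illustration
marks = pd.DataFrame(
    {'Mark': [52.0, 61.0, 70.0, 85.0]},
    index=['Ann', 'Ben', 'Cat', 'Dev']
)

# One line of code, but pandas subtracts the mean (67.0)
# from *every* row of the Mark column
marks['MarkLessMean'] = marks.Mark - marks.Mark.mean()

print(marks)
```

Negative values are students below the class average, positive values are students above it.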

To recap: looking at the heights of the students (whether in code or in the notebook generally) it's hard to tell how far each student is from average, and who might be especially (*significantly*) tall or short. When this happens we can _transform_ the raw data in order to make it easier to see and interpret this variation.

*Remember*: subtracting the mean is a linear transformation (unlike the log-transform).

Try the transformation in the coding area below:

In [None]:
df2['TransformedHeights'] = df2.???-df2.???
df2['TransformedWealth']  = df2.???-df2.???
df2.describe()

What do the Transformed Heights mean? Does the notation make any sense to you? It may help to remember that Python, like nearly all programming languages, can rack up very, very minute errors when working with floating point numbers.

So the output is in something called scientific notation, where numbers are represented using an exponent for accuracy and consistency in the formatting. To change this, try a Google search on `"python data frame force non-scientific output"`. You should find a solution in the first few results. 

The concept of `lambda` is something we haven't seen before: previously, when we wanted to define a function we _had_ to use `def <function name>():`, but sometimes we just need a tiny snippet of code to do something useful in a function-like way and don't want to have to write a full definition. That's where `lambda` comes in: it creates something called an *anonymous* function, meaning that we can define a function without giving it a name. Why is this useful? [Read more about lambda](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/).
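As a sketch of how the two ideas fit together: one commonly suggested way to avoid scientific notation is to hand pandas a small formatting function through its `display.float_format` option, and a `lambda` is a convenient way to write that function in place. The wealth values and the names `two_decimals` and `two_decimals_lambda` below are made up for the illustration.

```python
import pandas as pd

# A named function and an anonymous (lambda) function that do the same job:
# turn a float into a string with two decimal places
def two_decimals(x):
    return '{0:.2f}'.format(x)

two_decimals_lambda = lambda x: '{0:.2f}'.format(x)

# Hypothetical values, large enough to trigger scientific notation
wealth = pd.Series([28300.0, 75093.0, 5124398742348.0])

# Ask pandas to use our formatting function whenever it displays a float
pd.set_option('display.float_format', lambda x: '{0:.2f}'.format(x))
summary_text = str(wealth.describe())
print(summary_text)

# Put the display settings back to their defaults
pd.reset_option('display.float_format')
```

Note that this only changes how numbers are _displayed_; the values stored in the data frame are untouched.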

# Looking at the Effect of a Transformation

Now let’s see what effect this transformation has on real-world data. Let's go back to the 'real world' of the NS-SeC data about socioeconomic 'class' from the UK and print out a few summary metrics:

In [None]:
# Print pandas summary for group
series = df.Group1
print("Summarising " + series.name + "...")
print(series.describe())
print("\n")

# Or we can print just what we want!
print("Prettily formatted metrics from " + series.name + "...")
print("\tMedian:  {0:> 7.1f}".format(series.median()))
print("\tLQ:      {0:> 7.1f}".format(series.quantile(0.25)))
print("\tUQ:      {0:> 7.1f}".format(series.quantile(0.75)))
print("\tRange:   {0:> 7.1f}".format(series.max() - series.min()))

In the coding block below, why don't you use what we've just learned about transformations to subtract the mean from each Group1 cell and assign the result to a new series called `Group1LessMean`, for which we then print out the pretty-printed summary?

Why do you think I have asked you to copy the new column (`df.Group1LessMean`) to a new variable called `series`? What if I asked you to do the same for Group 2? Constructively lazy!

In [None]:
# Calculate and assign a transformed variable
# to the data frame...
df['Group1LessMean'] = ???

# Copy the new column to a temporary variable 
# called 'series'.
???

# *Why* would we use this 'series' variable instead
# of 'hard-coding' Group1LessMean in all of the 
# steps below?

# Now we can do pretty numbers!
print("Prettily formatted metrics from " + series.name + "...")
print("\tMedian:  {0:> 8.1f}".format(series.median()))
print("\tLQ:      {0:> 8.1f}".format(series.quantile(0.25)))
print("\tUQ:      {0:> 8.1f}".format(series.quantile(0.75)))
print("\tRange:   {0:> 8.1f}".format(series.max() - series.min()))

If you compare the results how important are the changes? Which metrics have changed, and which remain unaffected?

### String Formatting for Pretty-Printed Numbers 

Notice also the `<string>.format()` command I’ve used here: `{0:> 8.1f}`. In order to understand how this works for formatting the results in a nice, systematic way you will need to read [the documentation](http://www.python.org/dev/peps/pep-3101/).

The 'pep' tells you that:
* `{0}` tells Python to grab the first value inside the parentheses (`format(... values ...)`) and to stick it into the string at this point, while the `:...` tells Python that we also want to format the value in the particular way specified by the `...`.
* `>` tells Python that the value should be right-aligned.
* The space (' ') next to the `>` is the sign option: positive numbers get a leading space where negative numbers get their minus sign, so that the two line up. Any padding needed to reach the full width uses whitespace by default (a different fill character, such as `0`, would go _before_ the `>`).
* `8.1f` tells Python to treat anything it gets as a float (even if the variable is an int) and to format it with 1 digit after the decimal point and a minimum total width of 8 characters, counting the sign and the decimal point (which ties us back to the right-alignment up above). If you give it a number that needs more than 8 characters then it will still print all of them.
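A quick sketch of the format string in action, using made-up numbers rather than the NS-SeC data. The square brackets are only there so that you can see where the whitespace falls.

```python
value = 3.14159

right = "{0:> 8.1f}".format(value)       # right-aligned in 8 characters
left = "{0:< 8.1f}".format(value)        # left-aligned in 8 characters
negative = "{0:> 8.1f}".format(-value)   # the minus sign sits just before the digits
big = "{0:> 8.1f}".format(1234567.891)   # needs more than 8 characters: nothing is cut off

for s in [right, left, negative, big]:
    print("[" + s + "]")
```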

Here are some suggestions to better-understand what’s going on:
* Try changing the > to a < to see what happens to the alignment (then change it back)
* Try changing the .1 to a .0 to see what happens to the alignment and precision
* Try changing the .0 to a .6 to see what happens to the alignment and precision
Do this in the coding area below.

In [None]:
print("\tMedian:  {0:???f}".format(series.median()))

## Logarithmic Transformation

Let’s do one more simple transformation: taking the natural logarithm of the Group1 values. If you don’t remember what a logarithm is try these:

[![From reality to make-believe](http://img.youtube.com/vi/zzu2POfYv0Y/0.jpg)](https://www.youtube.com/watch?v=zzu2POfYv0Y)

[![From reality to make-believe](http://img.youtube.com/vi/akXXXx2ahW0/0.jpg)](https://www.youtube.com/watch?v=akXXXx2ahW0)

[![From reality to make-believe](http://img.youtube.com/vi/0fKBhvDjuy0/0.jpg)](https://www.youtube.com/watch?v=0fKBhvDjuy0)

The last video was made by Ray & Charles Eames, two of the 20th Century’s most famous designers.

Note also that logarithms are non-linear transformations -- can you think why this is?

### Logarithmic Transforms in Pandas

To create a new series in the data frame containing the natural log of the original value we follow a similar process to what we've done before; since pandas doesn't provide a log-transform method (i.e. you can’t call `df.Group1.log()`, and there's no obvious reason why it should) we need to use the `numpy` package again:
```python
import numpy as np
df['Group1Log'] = pd.Series(np.log(df.Group1+1))
```
Try performing the transformation and then printing out the same summary measures as above in the coding area below. Is it more clear to you now why a log-transform is a non-linear transformation?

**Also**: can you think why we added 1 to every Group1 value in the data set _before_ taking the log?
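If it helps, here is a minimal sketch using a handful of hypothetical counts (not the NS-SeC data) that compares the gaps between neighbouring values before and after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical counts, invented for this illustration
toy = pd.DataFrame({'Count': [9, 99, 999, 9999]})
toy['CountLog'] = np.log(toy.Count + 1)

# Gaps between neighbouring rows in the raw data: 90, 900, 9000
print(toy.Count.diff())

# Gaps between neighbouring rows after the log transform: all the same size
print(toy.CountLog.diff())
```

Subtracting the mean would have left the raw gaps exactly as they were; the log changes the _shape_ of the data, not just its position.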

Perhaps it would be a little easier to see it visually?

In [None]:
%matplotlib inline
df.Group1.plot.hist()

In [None]:
df.Group1Log.plot.hist()

# Standardisation

## Proportional Standardisation

Clearly, a proportion (e.g. a percentage) is one way of standardising data since, unless you're measuring change, it limits the range to between 0% and 100%. Programmers and statisticians almost always write a proportion in a decimal format so the range is between 0.0 and 1.0. 

Mathematically, however, the notation is a little more forbidding:
$$
p_{i} = \frac{x_{i}}{\sum_{j=1}^{n} x_{j}}
$$
However, it's important that you begin to familiarise yourself with the mathematical notation since many papers on computational geography will make use of this form. Let's break it down:

1. It's a fraction: the observation $x_{i}$ divided by, errrr, something involving _x_
2. The numerator is easy
3. The denominator is hard
4. The key is in the $j=1$ and $n$, which tell us that the sum is over 'all j' (i.e. summing every _x_-value in the data set, from the first to the _n_-th)

In other words, we take the `sum()` of the column of `x` as the divisor for *each* observation of `x`. 

To put it in terms of our heights data:

1. Take each person's height (which we call '_x_')
2. Add up (sum: $\sum$) every _x_ (which runs from 1.._n_) in the data set
3. And divide \#1 by \#2

But the point is that that notation applies for every data set, not just our heights data. That's where the mathematical formula is more useful – it's not linked to the specifics of this particular data set.
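As a sketch, the three steps for the heights data might look like this in pandas (the heights are the ones from the Transformation section; the variable names `heights` and `share` are just for this illustration):

```python
import pandas as pd

heights = pd.Series(
    [160, 158, 150, 185, 172, 175, 166, 168],
    index=['Judy', 'Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan', 'Bill G.']
)

# Steps 1-3 in one go: each height divided by the sum of all heights
share = heights / heights.sum()

print(share)
print("Sum of all shares: " + str(share.sum()))
```

A useful sanity check for any proportion calculated this way is that the shares add up to 1.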

### Proportional Standardisation in Pandas

We can calculate the proportion of people in each area who come from Group 1 using a similar format to what we've seen before in the Transformation section:
```
df['Group1Pct'] = <Group 1>/<Sum of Group 1>
```
You might recognise that this is that dictionary style of key/value pairs again, so `'Group1Pct'` is the key, and the new data series is the value. You can see the link here between the mathematical notation and the computational operation, right?

Try printing out various summary metrics for the new column and comparing them to the raw values. 

In [None]:
df['Group1Pct'] = ???
df.Group1Pct.describe()

### Think

Is this form of proportional standardisation very useful? Can you think why we might not be _that_ interested in the share of _all_ Group1 people in each LSOA? 

So if that's _not_ what we want, then what proportion _do_ we want? I'd suggest having a look at the other columns that are available to us... 

Try updating the code block above so that it gives you a more useful answer. *Hint*: the right answer in _this particular case_ will have the following output:
```
count   42619.000
mean        0.132
std         0.085
min         0.000
25%         0.065
50%         0.114
75%         0.183
max         0.550
Name: Group1Pct, dtype: float64
```

### Why Use Proportions?

Right, we've done something, but _why_ did we do it?

Let's have a look at a chart:

In [None]:
# This should have already run once above
# %matplotlib inline

df.Group1Pct.plot.hist()

Right, now we know that _nowhere_ has a share of Group 1 residents above about 50% (though that is _high_!), and that most LSOAs in England & Wales have a share that is substantially lower: in the range 0–15%. This already tells us quite a bit about the overall distribution of Group 1 households, but we can go a lot further...

## Z-Score Standardisation

The z-score is a common type of standardisation, but it's a little more complex than a simple proportion; however, both are designed to enable us to compare data across different groups of observations. We can easily compare two percentages to know which one is more, and which less (e.g. I got 80% on one exam and 70% on the other).

But let's think about a slightly different question: 'which exam did I do _better_ on?' What if you got 80% on an exam where everyone else got 85%? Suddenly that doesn't look quite so good right? And what if your 70% was on an exam where the average mark was 50%? Suddenly that looks a lot better, right?

The z-score is designed to help you perform this comparison directly.

As a reminder, the z-score looks like this:
$$
z=(x-\mu)/\sigma
$$

That's: `(<data> - <mean>)/<standard deviation>`

### Z-Score Standardisation in Pandas

Let’s figure out how to translate this into a new column called `Group1ZStd`!

We’ve already done the first part of this calculation up above in `Group1LessMean`. That series is the same as $x-\mu$, so all we need to do is divide by the standard deviation.

So one way to do this is:
```python
df['Group1ZStd'] = df.Group1LessMean / df.Group1.std()
```
That works exactly the same way as what we did when we subtracted the mean in the first place: we’re just taking the results from the previous equation and passing them on to this one. But now that you've seen how we can 'chain' together method calls, then you should know that we could _also_ do it this way:
```python
df['Group1ZStd'] = (df.Group1 - df.Group1.mean()) / df.Group1.std()
```
Do you see how we can begin to build increasingly complicated equations into the process of creating a new data series?
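To see what a z-score column looks like on something small, here is a sketch using the class heights from the Transformation section instead of the NS-SeC data. Note that pandas' `std()` calculates the _sample_ standard deviation by default.

```python
import pandas as pd

heights = pd.Series(
    [160, 158, 150, 185, 172, 175, 166, 168],
    index=['Judy', 'Frank', 'Alice', 'Eve', 'Bob', 'Carlos', 'Dan', 'Bill G.']
)

# (<data> - <mean>) / <standard deviation>
z = (heights - heights.mean()) / heights.std()

print(z.round(2))

# After standardisation the mean is (effectively) 0 and the
# standard deviation is 1, whatever the original units were
print("Mean: {0:.3f}; Std: {1:.3f}".format(z.mean(), z.std()))
```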

### Using the Z-Score

Let's start to bring the idea of the 'data generating processes' to life. The first thing to do with the z-score is to look at what it implies:

1. Subtracting the mean implies that the mean _is a useful measure of centrality_: in other words, the only reason to subtract the mean is if the mean is _meaningful_. If you are dealing with highly skewed data then the mean is not going to be very useful and, by implication, neither is the z-score.
2. Dividing by the Standard Deviation implies that this _is a useful measure of distribution_: again, if the data is heavily skewed or follows some exponential distribution then the standard deviation is not going to be very useful as an analytical tool.

So the z-score is _most_ relevant when we are dealing with something that looks vaguely like a normal distribution (once standardised it will have mean=0 and standard deviation=1, like the standard normal distribution). In those cases, any value more than 1.96 standard deviations from the mean (a z-score above 1.96 or below -1.96) is in the 5% significance zone.

But remember: we can't really say _why_ one particular area has a high concentration of well-off individuals or why one individual is over 2m tall. All we are saying is that this standardised value is a pretty unlikely outcome _when compared to our expectation that people are randomly distributed across the region_ or _that people have randomly-distributed heights of mean 'x' and standard deviation 'y'_. 

Of course, we _know_ that people aren't randomly distributed around the country in the same way that we know that height isn't genuinely random because of the influence of genetics, nutrition, etc. But we need a way to pick out what counts as _**significant**_ over- or under-concentration (or height) from the wide range of 'a bit more' or 'a bit less' than 'normal'. **_If_** a normal distribution does a good job of representing the overall distribution of heights (_whatever_ the reason) then someone of 2m is highly unlikely but we can't say _how_ unlikely until we've placed them on the distribution.

Let's put it another way:
* Is 10% of wealthy individuals in a small area a high concentration?
* How about 20%?
* 30%?

The only way to answer that question is to use something like the z-score since it standardises all of the values _against the average_. If wealthy people were distributed at random then we would _expect_ that most areas would have about the average concentration. Some areas will have a few more. Some areas a few less. But according to the way that the standard normal distribution works, _nowhere_ should have a z-score of 10. Or 20, since that is 20 standard deviations from the mean and just shouldn't exist in the lifetime of the universe. So if we see that kind of result then we know two things:

1. That our assumption that a normal distribution is a reasonable representation of reality breaks down at some point.
2. That there are _some_ areas with _highly significant_ over- or under-representation by wealthy residents.

But that's ok, because we're trying to set an expectation of what we think we'll see so that we can pick out the significant outliers.

### Setting Expectations

And here we get to the crux of the issue: most frequentist statistics boils down to subtracting **what you expected** from **what you got**, and then dividing by **some measure of spread** to control for the range of the data. We then look at what's _left_ to see if we think the gap between expectations and observations is _meaningful_ or whether it falls within the range of 'random noise'.

It should be obvious that it's the _**expected**_ part of that equation that is absolutely crucial: we come up with a process that generates data that _represents_ the most important aspects of reality, but we shouldn't imagine that our process has explained them. It's the first step, not the last.

### Illustrating Z-Score Selection

So let's see how this works... let's take two of the NS-SeC groups: Group 1 and Group 4. And let's investigate their distributions and whether they contain obvious outliers (extremes that shouldn't exist if they were to follow a normal – which is to say: random – distribution). But we _can't_ do this with the raw counts because the raw counts are truncated (you can't have -1 Group 1 households, only positive counts). We'll use the z-score:

In [None]:
df['Group1ZStd'] = (df.Group1 - df.Group1.mean()) / df.Group1.std()
df['Group4ZStd'] = (df.Group4 - df.Group4.mean()) / df.Group4.std()

In [None]:
df.Group1ZStd.plot.hist(bins=20)

In [None]:
df.Group4ZStd.plot.hist(bins=20)

Just from _looking_ at those you can tell that the mean ($\mu$) is going to be more relevant to Group 4 than to Group 1: it will be closer to the 'middle' of the data. Here we have a choice to make: we know that the mean wasn't the best measure of the middle of the data for Group 1 but does that mean we can't use the z-score for our analysis?

Yes and no. More on this later, but first let's use the z-score _as if_ there were no 'problems' and see what we get when we select for 99.9% significance... but already we have to make a choice! Are you interested in a 1-way difference, or a 2-way difference? 
* If you are only interested in areas that are _significantly higher_ (your theory is that Group1 clusters, you're not interested in anything else) then you would want the 1-tailed z-score (a.k.a. cumulative probability distribution) and you can find a calculator to convert 99.9% significant to the z-score [here](https://www.fourmilab.ch/rpkp/experiments/analysis/zCalc.html).
* If you are interested in areas at *both* ends of the distribution (significant _over_ and _under_ representation) then you would want the 2-tailed z-score and you can find some useful numbers [here](http://www.stat.umn.edu/geyer/3011/examp/conf.html).

This is closely related to the idea of the Confidence Interval, of which [more here](https://www.mathsisfun.com/data/confidence-interval-calculator.html). It is _well_ worth familiarising yourself with the CI (and the concept of effect size) over the coming weeks.
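If you'd rather not rely on an online calculator, here is a sketch of how the same thresholds could be recovered with `scipy.stats` (assuming `scipy` is installed; the variable names are just for this illustration). The `ppf` function takes a cumulative probability and returns the matching z-score on the standard normal distribution.

```python
from scipy.stats import norm

# 1-tailed: all of the 'significant' area sits in the upper tail
one_tailed_999 = norm.ppf(0.999)
one_tailed_95 = norm.ppf(0.95)

# 2-tailed: the 'significant' area is split equally between both tails
two_tailed_999 = norm.ppf(1 - 0.001 / 2)
two_tailed_95 = norm.ppf(1 - 0.05 / 2)

print("99.9% 1-tailed: {0:.2f}".format(one_tailed_999))
print("99.9% 2-tailed: {0:.2f}".format(two_tailed_999))
print("95%   1-tailed: {0:.2f}".format(one_tailed_95))
print("95%   2-tailed: {0:.2f}".format(two_tailed_95))
```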

Let's see what we get with an _actual_ random normal distribution of the same size as our LSOA data. We can do that with [`numpy`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html) (you might want to add some comments as you make sense of this code!):

In [None]:
import numpy as np

# Generate a pandas series containing a normal distribution
# of the same size as the LSOA data frame
nDist = pd.Series(np.random.normal(size=df.shape[0]))

# How many of them fall outside the 99.9% CI? (1-tailed?)
outliers = nDist[ nDist > 3.09 ]

print("We selected " + str(outliers.shape[0]) + " of " + str(nDist.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * outliers.shape[0] / nDist.shape[0]))  # x100 for a percentage

In [None]:
# And plot, just to show that it _was_ a normal distribution
nDist.plot.hist(bins=20)

Now let's compare that to what we get with our NS-SeC data:

In [None]:
dfG1 = df[df.Group1ZStd >= 3.09]
dfG4 = df[df.Group4ZStd >= 3.09]

# Multiply by 100 so that the value printed really is a percentage
print("We selected " + str(dfG1.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG1.shape[0] / df.shape[0]))

print("We selected " + str(dfG4.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG4.shape[0] / df.shape[0]))

So in both cases we got _quite_ a few more 'significantly' over-represented LSOAs than we would have expected if the real distribution had closely matched a random normal one. Without even looking at the plots, we could tell that both of these distributions are positively skewed but with the plots it's super-obvious. So can we use the z-score to make these selections... given that we can tell the mean is not _necessarily_ the best measure of central tendency?

### Z-Scores for  Log-Transformed Data Series

In [None]:
df.Group1Log.plot.hist(bins=20)

That's still not perfect but it's a lot better than the hugely skewed data we had before! If you _really_ care, then there's a [Box-Cox transform](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html) in `scipy.stats` to get something that is almost _exactly_ normal.

Now, if we calculate a z-score from the log-transformed data to select areas that fall above the 95% threshold (1-tailed, so a z-score of 1.65 or more) then we get:

In [None]:
df['Group1LogZStd'] = (df.Group1Log - df.Group1Log.mean()) / df.Group1Log.std()
dfG1LogZ = df[df.Group1LogZStd >= 1.65]

print("We selected " + str(dfG1LogZ.shape[0]) + " of " + str(df.shape[0]))
print("\tThat's {0: .4f}%".format(100.0 * dfG1LogZ.shape[0] / df.shape[0]))  # x100 for a percentage

## Location Quotient Standardisation (LQ)

This 'LQ' is used in geography to compare the concentration of 'something' in a sub-region with the overall concentration of that 'something' in the larger region of which that sub-region is a part. That's a bit of a mouthful, but one of the most common applications of the LQ is in economic geography where we are trying to find concentrations (or absences) of employment in a sector that seem significant.

For example, let's pretend that we have a (very small) country composed of three regions:
- Region 1: 400 employees; 200 in steel
- Region 2: 200 employees; 150 in steel
- Region 3: 300 employees; 80 in steel
  
**Question**: which region has the greatest concentration of steelworkers?

**Answer**: depends what you mean by concentration. 

There are _more_ steelworkers in Region 1 overall, but their _density_ is higher in Region 2 (because `150/200 > 200/400`). We use the density (which is a simple proportion) to ensure that we are comparing like-for-like: larger areas can be compared to smaller areas. The proportion therefore controls for the fact that each of the regions has a different number of total employees.

Let's put it another way: cities like London and New York are gigantic. If you want to know where there are the _most_ bankers or _most_ bagel factories then your answer will almost always be... London and New York. But what if there's a small town where 95% of people are bankers, and another town where 80% of people are bakers? Surely that's pretty interesting too, no? What's going on that a place can support way more people in these professions than we'd expect...

_Expect_, there's that word again. How do we define what we _expect_ so that we can compare it to what we _got_? Well, the mean is one way of defining our expectations -- if you had to guess the height of a new student in your class, your _expectation_ would be based on the heights of the existing students, and the best possible guess that you could make would be the _average_ height of all students because the majority of students are about that tall.

In the case of our little steel-producing country, we would need to define our expectation a little differently. We're already using a proportion (steelworkers in a region/all workers in a region) to control for the fact that our regions have different sizes. Why not use a _second_ proportion (all steelworkers in the country/all workers in the country) to set our expectations about how concentrated steel employment will be in each area?

That's the LQ.

$$
LQ = \frac{{Employment}_{sR}/{Employment}_{eR}}{{Employment}_{sA}/{Employment}_{eA}}
$$

In this formula $sR$ is the count of steelworkers in a Region, and $eR$ is the count of employees in all industries in that same Region. Similarly, $sA$ is the count of steelworkers in All Areas (i.e. the country), while $eA$ is total employment in the country.

What we do by using the proportion across the entire country is to set our _expectation_ that employment in a sector should be spread across regions in line with each region's total employment. If steelworkers don't have any specialist needs then each region would be expected to have a proportion of steelworkers in line with the proportion of steelworkers in the country as a whole. So if we find areas that are way above or below this then that might be something worth digging into.
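
To make the formula concrete, here is the arithmetic worked through by hand for the three regions listed earlier (the subscripts 1, 2 and 3 are just shorthand for Region 1, Region 2 and Region 3). First the expectation, which is the share of steelworkers in the country as a whole:

$$
\frac{{Employment}_{sA}}{{Employment}_{eA}} = \frac{200 + 150 + 80}{400 + 200 + 300} = \frac{430}{900} \approx 0.478
$$

Then each region's own share divided by that national share:

$$
LQ_{1} = \frac{200/400}{430/900} \approx 1.047 \qquad
LQ_{2} = \frac{150/200}{430/900} \approx 1.570 \qquad
LQ_{3} = \frac{80/300}{430/900} \approx 0.558
$$

So Region 2 has rather more steel employment than we'd expect, Region 1 is close to what we'd expect, and Region 3 has noticeably less.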

### Calculating the LQ in Pandas

To illustrate this, I've set up our three-region steel-producing country in a pandas data frame. Try running the code below, but you should know _now_ that I've made a deliberate mistake...

In [None]:
# Notice how we can create a data frame 'by hand'
d = {
  'allemp': pd.Series([400,200,300], index=['Region1','Region2','Region3']),
  'steel' : pd.Series([200,150,80], index=['Region1','Region2','Region3']),
}
df3 = pd.DataFrame(d)

# Look at what we've done
print df3

print "\n"

print "Proportion in each region: "
print df3.steel / df3.allemp

print "\n"

print "Proportion in entire country: "
print df3.steel.sum() / df3.allemp.sum()

print "\n"

print "LQs for each region (prop. region/prop. country): "
print (???) / (???)


#### Debugging

Huh, what's going on here? Why is the LQ `inf` (i.e. infinite)? That's not very helpful! This requires a little bit of debugging that revisits something you did in the first couple of weeks! You'll need to step through the code to see where things are 'going wrong' before fixing the code block so that you get three LQs in a range between 0.558 and 1.57.

Sometimes it can be really useful to manually double-check the results of a calculation using a calculator or just by typing the numbers into a separate line in Python's interpreter -- if you were to get the LQ wrong here then all of the analysis that you do later would be wrong too. And while the replicability of code helps (i.e. you can fix the mistake and then re-run the entire analysis very quickly), you should never assume that you got it exactly right until you’ve double-checked.
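
As one possible way of doing that kind of independent check, the sketch below recomputes the three LQs without pandas at all, using exact fractions from Python's standard library. This is not the fix for the notebook code above, just a second route to the same numbers; the names `regions`, `national` and `manual_lq` are made up for this illustration.

```python
from fractions import Fraction

# Counts copied from the three-region example: (all employees, steelworkers)
regions = {
    'Region1': (400, 200),
    'Region2': (200, 150),
    'Region3': (300, 80),
}

# Share of steelworkers in the country as a whole, kept as an exact fraction
total_emp = sum(emp for emp, steel in regions.values())
total_steel = sum(steel for emp, steel in regions.values())
national = Fraction(total_steel, total_emp)

# LQ = (regional share) / (national share), converted to a float only at the end
manual_lq = {}
for name in sorted(regions):
    emp, steel = regions[name]
    manual_lq[name] = float(Fraction(steel, emp) / national)
    print("%s: %.3f" % (name, manual_lq[name]))
```

If the values printed here don't match what your data frame gives you, then one of the two calculations has gone wrong and it's worth finding out which.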

### Interpreting Your Results

How do we interpret the LQ results? Let's start with the two simplest results: 0 and 1. 

* If the LQ is 0 then that means that the numerator (top half of the LQ formula) was 0; there is no employment in the sector of interest in that region. 
* If the LQ is 1 then that means that the top and bottom of the LQ equation were the same; the density of employment in the region is _the same_ as the density in the country as a whole. Anything more than one means a greater density than in the country as a whole.[1]

[1] _Note_: you don't have to do this analysis at the country level, your 'country' could be a city and your regions be neighbourhoods or districts... basically, you can use the LQ any time you have smaller zones nested within a larger one and want to see how the smaller zones vary against the overall regional average.

From there, the most straightforward way to think about the LQ more deeply is, perhaps surprisingly, using numbers. What do we get if the LQ is: $\frac{0.5}{0.25}$? In this case, we'd be saying that 50% of employment in the zone is in our sector of interest, while in the region as a whole we find 25% of employment to be in that sector. That gives us a LQ of 2, and we can read that directly as being _twice_ as concentrated in the zone as the region as a whole! If we had the reverse: $\frac{0.25}{0.5}$ then we'd get 0.5 as the result and we'd know that employment was _half_ as concentrated...
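
The same reading can be sketched in a few lines of Python. The function names `location_quotient` and `describe_lq` are hypothetical, and the `tolerance` band around 1 is an arbitrary choice for illustration rather than any kind of statistical threshold.

```python
def location_quotient(zone_share, region_share):
    """Divide the zone's share of employment by the wider region's share."""
    if region_share == 0:
        raise ValueError("the sector has no employment in the wider region")
    return zone_share / region_share


def describe_lq(lq, tolerance=0.05):
    """Turn an LQ into a plain-English reading (tolerance is illustrative only)."""
    if abs(lq - 1) <= tolerance:
        return "about as concentrated as expected"
    if lq > 1:
        return "%.2f times as concentrated as the region as a whole" % lq
    return "only %.2f times as concentrated as the region as a whole" % lq


# The two cases from the text: 50% vs 25%, and then the reverse
print(describe_lq(location_quotient(0.5, 0.25)))
print(describe_lq(location_quotient(0.25, 0.5)))
```

Note that the shares are written as decimals (0.5, not 50), so the function is dividing one proportion by another exactly as in the formula.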

There's (honestly) not much of a difference between the LQ plots and the z-score plots, so why is this useful? Well, there's one _very_ useful feature of the LQ: the value '1' is special in the LQ, because it means 'about what we expected'. So if the LQ were 'as expected' then we'd see variation around a LQ of 1. If there's positive skew then it means that, relatively speaking, there are a small number of areas with very high concentrations of that group. But it's a lot easier to interpret the LQ than it is the z-score even if, functionally, there's not a huge difference between them.

But using the normal distribution as our reference distribution allows us to move beyond saying "It's more than 1, so there's a concentration" to saying "Beyond X we are seeing some unusually high concentrations that wouldn't exist _if_ the groups were distributed at random."

OK? 